Original article
Small molecule annotation for the Protein Data
Bank
Sanchayita Sen1, Jasmine Young2, John M. Berrisford1, Minyu Chen3,
Matthew J. Conroy1, Shuchismita Dutta2, Luigi Di Costanzo2,
Guanghua Gao2, Sutapa Ghosh2, Brian P. Hudson2, Reiko Igarashi3,
Yumiko Kengaku3, Yuhe Liang2, Ezra Peisach2, Irina Persikova2,
Abhik Mukhopadhyay1, Buvaneswari Coimbatore Narayanan2,
Gaurav Sahni1, Junko Sato3, Monica Sekharan2, Chenghua Shao2,
Lihua Tan2 and Marina A. Zhuravleva2
1Protein Data Bank in Europe (PDBe), EMBL-EBI, Wellcome Trust Genome Campus, Hinxton CB10 1SD,
UK, 2RCSB Protein Data Bank (RCSB PDB), Department of Chemistry and Chemical Biology, Rutgers
University, Piscataway, NJ 08854-8087, USA and 3Protein Data Bank Japan (PDBj), Osaka University,
Osaka, Japan
Corresponding author: Tel: +44 1223 494431; Fax: +44 1223 494468; Email: ssen@ebi.ac.uk
Citation details: Sen,S., Young,J., Berrisford,J.M. et al. Small molecule annotation for the Protein Data Bank.
Database (2014) Vol. 2014: article ID bau116; doi:10.1093/database/bau116
Received 28 September 2014; Revised 29 October 2014; Accepted 5 November 2014
Abstract
The Protein Data Bank (PDB) is the single global repository for three-dimensional
structures of biological macromolecules and their complexes, and its more than 100 000
structures contain more than 20 000 distinct ligands or small molecules bound to proteins and
nucleic acids. Information about these small molecules and their interactions with
proteins and nucleic acids is crucial for our understanding of biochemical processes and
vital for structure-based drug design. Small molecules present in a deposited structure
may be attached to a polymer or may occur as a separate, non-covalently linked ligand.
During curation of a newly deposited structure by wwPDB annotation staff, each
molecule is cross-referenced to the PDB Chemical Component Dictionary (CCD). If the
molecule is new to the PDB, a dictionary description is created for it. The information about
all small molecule components found in the PDB is distributed via the ftp archive as an
external reference file. Small molecule annotation in the PDB also includes information
about ligand-binding sites and about covalent and other linkages between ligands and
macromolecules. During the remediation of the peptide-like antibiotics and inhibitors
present in the PDB archive in 2011, it became clear that additional annotation was
required for consistent representation of these molecules, which are quite often
composed of several sequential subcomponents, including modified amino acids and other
chemical groups. The connectivity information of the modified amino acids is necessary
for correct representation of these biologically interesting molecules. The combined
information is made available via a new resource called the Biologically Interesting
molecules Reference Dictionary, which is complementary to the CCD and is now
routinely used for annotation of peptide-like antibiotics and inhibitors.
© The Author(s) 2014. Published by Oxford University Press. This is an Open Access article
distributed under the terms of the Creative Commons Attribution License
(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution
and reproduction in any medium, provided the original work is properly cited.
Introduction
The Protein Data Bank (PDB) is the single global repository
for three-dimensional (3D) structures of biological macromolecules
and their complexes (1). The four partners of the
Worldwide PDB organization (wwPDB; http://wwpdb.org)
are the Research Collaboratory for Structural
Bioinformatics (RCSB PDB; http://rcsb.org) (2), the PDB in
Europe (PDBe; http://pdbe.org) (3), the PDB Japan (PDBj;
http://pdbj.org) (4) and the Biological Magnetic Resonance
Bank (BMRB; http://bmrb.wisc.edu) (5). They act as deposition,
curation and distribution centres for PDB data.
Although the PDB archive is focussed on macromolecules, a
wide variety of small molecules are encountered bound to
proteins and nucleic acids. Currently, there are >20 000 distinct
kinds of small molecule present in the archive, and
they are described in the wwPDB Chemical Component
Dictionary (CCD). These compounds include metals,
ions, cofactors, fatty acids, carbohydrates, proteinogenic
(standard) and modified amino acids and nucleotides,
chromophores, antibiotics, inhibitors and various other
compounds that may be naturally bound to a macromolecule
or acquired during purification or crystallization.
The first step in ligand annotation by wwPDB curators
is to identify all the distinct chemical entities that are present
in a newly deposited structure, including all polymers
and small molecules (6). PDB annotation is a complex
scientific process that requires understanding of the
interactions between small molecules and macromolecules.
Aspects of small molecule annotation include:
1. identifying small molecules in a newly deposited PDB
entry that are already present in the CCD;
2. creating definitions for any small molecules that are
new to the PDB;
3. geometry and stereochemistry validation;
4. evaluating the fit of the model coordinates to the
experimental data;
5. identifying any covalent links with other residues or
components;
6. annotation of ligand-binding sites and
7. extending or updating the annotation in the Biologically
Interesting molecules Reference Dictionary (BIRD) for
peptide-like inhibitor and antibiotic molecules.
The wwPDB CCD
The number of structures in the PDB archive has grown
from 7 in 1971 to >100 000 in 2014 (7, 8). All these structures
are experimentally derived atomistic models of biologically
important proteins and nucleic acids from a huge
variety of organisms. Many proteins in the PDB have substrates,
co-factors, reaction products or analogues of such
compounds bound to them. In addition, many proteins and
nucleic acids contain modified amino acid or nucleotide
residues. Hence, identifying monomeric components incorporated
in the polymers and ligands is an important first
step of PDB annotation (6).
The chemical-component annotation of a PDB entry involves
identification of every small molecule that is present
in the structure, either as part of a polymer or as a
non-covalently bound ligand. With the increasing number of
structures in the PDB, the number of unique chemical entities
associated with them is increasing as well (Figure 1).
For annotation purposes, it is important to identify and describe
the chemical entities that are deposited to the PDB
in a systematic and consistent manner. The wwPDB partners
achieved this through the creation of a chemical reference
dictionary. This contains the description of every
unique chemical entity, which can then be reused in subsequent
depositions that contain the same entity. This dictionary
is known as the wwPDB CCD and currently
contains chemical definitions of more than 20 000 distinct
chemical entities.
The wwPDB annotation software checks, for every small
molecule in a new deposition, whether it already occurs in the
CCD. When a small molecule in the deposited entry
matches an existing chemical definition in the CCD, it is
assigned the same three-character identifier as in the CCD.
The atom nomenclature of the small molecule in the entry
is updated to reflect the dictionary definition. If a molecule
is new to the PDB, then a dictionary description is created
for it. The atomic coordinates from the deposited structure
are used for the initial deduction of the connectivity, bond
orders and stereochemistry of the molecule. In cases where
the small molecule has poor geometry, extra annotation is
required to define the correct connectivity and bond
orders. The dictionary definition for every molecule contains
both coordinates generated with idealized geometry
and example experimental coordinates, along with
machine-readable chemical descriptors such as SMILES
strings (9), InChI (10) and InChIKeys. The CCD definition
also contains the systematic name, synonyms, chemical
formula, formula weight and formal charge of the
molecule.
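The match-or-create step described above can be sketched as follows. This is a minimal illustration, not the wwPDB annotation software: it assumes that a deposited molecule is matched to the dictionary through a single descriptor string (here an InChIKey-like key), and the class names, the function `annotate_molecule` and the placeholder keys are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Component:
    comp_id: str      # three-character identifier, e.g. "GLC"
    name: str
    descriptor: str   # assumed matching key (an InChIKey-like string)


class ComponentDictionary:
    """Toy stand-in for the CCD, indexed by the assumed descriptor."""

    def __init__(self) -> None:
        self._by_descriptor: Dict[str, Component] = {}

    def add(self, comp: Component) -> None:
        self._by_descriptor[comp.descriptor] = comp

    def match(self, descriptor: str) -> Optional[Component]:
        return self._by_descriptor.get(descriptor)


def annotate_molecule(ccd: ComponentDictionary, descriptor: str,
                      proposed_id: str, name: str) -> Tuple[str, bool]:
    """Return (comp_id, created). Reuse the existing identifier when the
    molecule is already in the dictionary, otherwise create a definition."""
    existing = ccd.match(descriptor)
    if existing is not None:
        return existing.comp_id, False
    ccd.add(Component(proposed_id, name, descriptor))
    return proposed_id, True


ccd = ComponentDictionary()
ccd.add(Component("GLC", "alpha-D-glucose", "PLACEHOLDER-KEY-1"))  # placeholder key

# A deposited glucose matches the existing entry and inherits its identifier.
print(annotate_molecule(ccd, "PLACEHOLDER-KEY-1", "XXX", "glucose"))
# A molecule that is new to the dictionary gets a new definition.
print(annotate_molecule(ccd, "PLACEHOLDER-KEY-2", "NEW", "novel ligand"))
```

In practice the match also drives renaming of the deposited atoms to the dictionary nomenclature, which this sketch leaves out.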
The PDB uses the PDB exchange/macromolecular
Crystallographic Information File format (PDBx/mmCIF)
(11) to describe the contents of a PDB entry. The advantages
of using mmCIF as the data model include its flexibility and
extensibility, which allow addition of new data items to address
the ever-increasing size and complexity of deposited
structures. The CCD is also maintained in CIF format (11),
and every chemical entity in the dictionary is assigned a
unique three-character identifier. Each CCD entry contains
five CIF categories that provide the machine-readable chemical
description of the small molecule (Figure 2). The unique
three-character code assigned to a chemical component is
used as the primary key to connect the different categories
in the dictionary. The _chem_comp category contains the
name, synonyms, chemical formula, formula weight and
formal charge of the molecule, along with some information
about the parent PDB entry, which was used to construct
the dictionary entry. For entities such as chromophores and
modified amino acids and nucleic acids, the data
item _chem_comp.mon_nstd_parent_comp_id provides
information about the ‘parent’ of the compound. For example,
the amino acid phosphotyrosine is derived from tyrosine,
which is therefore its parent. The chromophore CRO
present in green fluorescent protein is derived from the
tripeptide Ser-Tyr-Gly. Hence, the dictionary definition of
CRO has Ser-Tyr-Gly listed as the parent of this chromophore.
The _chem_comp category also provides structural classification
information about the compound. The data items
_chem_comp.type and _chem_comp.pdbx_type are used to
annotate the class and type of the molecule, respectively. A
standard or a sidechain-modified L-amino acid is a member
of the class ‘L-peptide linking’. Amino acids and their modified
forms occur in proteins and are designated as type
‘ATOMP’. Sugars can be classified as ‘D-saccharide’ or
‘L-saccharide’, depending on their configuration, and they are
annotated as type ‘ATOMS’.
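To make the role of the three-character code as a joining key concrete, the sketch below models three of the categories as plain Python records and gathers the rows that share one identifier. It is an illustrative data model only: the field selection is a small assumed subset of the real categories, and the helper `assemble_definition` is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ChemComp:                      # one row of _chem_comp
    id: str                          # primary key (three-character code)
    name: str
    type: str                        # class, e.g. "L-peptide linking"
    pdbx_type: str                   # type, e.g. "ATOMP"
    mon_nstd_parent_comp_id: Optional[str] = None


@dataclass
class ChemCompAtom:                  # one row of _chem_comp_atom
    comp_id: str                     # points back to ChemComp.id
    atom_id: str
    element: str


@dataclass
class ChemCompBond:                  # one row of _chem_comp_bond
    comp_id: str                     # points back to ChemComp.id
    atom_id_1: str
    atom_id_2: str
    order: str


def assemble_definition(comp: ChemComp, atoms: List[ChemCompAtom],
                        bonds: List[ChemCompBond]) -> Dict[str, object]:
    """Collect the atom and bond rows whose comp_id equals the component id."""
    return {
        "comp": comp,
        "atoms": [a for a in atoms if a.comp_id == comp.id],
        "bonds": [b for b in bonds if b.comp_id == comp.id],
    }


# Phosphotyrosine (PTR) with tyrosine as its parent, as in the text above.
ptr = ChemComp("PTR", "O-phosphotyrosine", "L-peptide linking", "ATOMP", "TYR")
atoms = [ChemCompAtom("PTR", "N", "N"), ChemCompAtom("PTR", "CA", "C"),
         ChemCompAtom("TYR", "N", "N")]          # the TYR row must be ignored
bonds = [ChemCompBond("PTR", "N", "CA", "SING")]

definition = assemble_definition(ptr, atoms, bonds)
print(len(definition["atoms"]), len(definition["bonds"]))
```

Only a backbone fragment is listed here; a real definition carries every atom and bond of the component.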
The _chem_comp_atom category contains atom-level
details about the compound. This category includes not only
the atom name, element type, aromaticity flag, chirality
and charge for every atom but also both the generated
and example experimental 3D coordinates for the entire
chemical entity. The chemistry software CORINA (12) is used
to produce the programmatically generated conformation.
The bond-order information is stored in the
_chem_comp_bond category, whereas the SMILES and InChI
strings are in the _pdbx_chem_comp_descriptor category.
Standard chemistry software, such as that from ACDLabs
(http://www.acdlabs.com), CACTVS (13) and OpenEye (14), is
used to generate the systematic name, SMILES and InChI
descriptors for the compound. As part of a wwPDB collaboration
with the Cambridge Crystallographic Data Centre,
Cambridge Structural Database (CSD) coordinates for
those small molecules in the PDB that also exist in the CSD
will be included in the future as well.
Figure 1. Number of new PDB chemical entity definitions created annually between 2000 and 2013.
The wwPDB annotation guidelines require the CCD
definition to pertain to the neutral, unbound form of every
compound. In cases where a new chemical entity is covalently
linked to another small molecule or to a polymer, a
leaving atom (usually an oxygen atom, labelled as OXT) is
introduced in the chemical definition of the compound. In
this way, the newly built chemical definition can be reused
in future depositions where the same compound may occur
either in covalently bound form or as a ‘free’ ligand. The
use of leaving atoms to capture multiple chemical forms of
the same compound is especially common for amino acids,
nucleotides and carbohydrates (Figures 3 and 4).
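The effect of a leaving atom can be illustrated with a small sketch. It assumes a simplified representation in which a component is a list of heavy-atom names plus a set of atoms flagged as leaving; the function `atoms_present` is hypothetical. The glucose example follows Figures 3 and 4, where O1 of α-D-glucose (GLC) is the leaving atom.

```python
from typing import Iterable, List, Set

# Heavy atoms of alpha-D-glucose (GLC); O1 is flagged as the leaving atom.
GLC_ATOMS: List[str] = ["C1", "C2", "C3", "C4", "C5", "C6",
                        "O1", "O2", "O3", "O4", "O5", "O6"]
GLC_LEAVING: Set[str] = {"O1"}


def atoms_present(atom_names: Iterable[str], leaving: Set[str],
                  lost_in_links: Set[str]) -> List[str]:
    """Return the atoms expected in the model for one copy of a component.

    lost_in_links names the leaving atoms that were eliminated because the
    component formed a covalent link at that position. Atoms that are not
    flagged as leaving are never dropped.
    """
    removed = leaving & lost_in_links
    return [name for name in atom_names if name not in removed]


free_glucose = atoms_present(GLC_ATOMS, GLC_LEAVING, set())
linked_glucose = atoms_present(GLC_ATOMS, GLC_LEAVING, {"O1"})

print(len(free_glucose))         # free ligand keeps all 12 heavy atoms
print(len(linked_glucose))       # a glycosidically linked unit has 11
print(7 * len(linked_glucose))   # heavy atoms in a ring of seven linked units
```

One definition therefore serves both the free ligand and the linked residue, which is the reuse the guidelines aim for.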
Annotation of ligand-binding sites
Covalently linked ligands are quite common in the PDB,
including co-factors such as pyridoxal phosphate and
carbohydrates such as N-acetylglucosamine. Non-covalently
linked ligands are usually bound to one or more polymers
through H-bonding, ionic and van der Waals interactions.
Some ligands are instrumental to the biological function of a
biomacromolecule or complex. The binding environment
of a ligand is important for stabilizing a particular
conformational state that facilitates its biological function.
The site-delineation software used in PDB annotation is
derived from the CCP4 (15) program CONTACT
(http://www.ccp4.ac.uk/html/contact.html). It determines which
polymeric residues make up the binding site for each
non-polymeric ligand present in the structure (Figure 5), using a
cut-off distance of 3.7 Å from any atom of the ligand and
taking crystallographic symmetry into account. For oligosaccharides
and peptide-like inhibitors and antibiotics, the
software generates binding-site information for the entire
molecule instead of for the individual moieties (Figure 6). In
addition to the software-generated site information, any
author-provided details about catalytic site residues in a
protein are also included in the annotated entry. In the
mmCIF file for a given entry, the CIF categories _struct_site
Figure 2. Abbreviated category relationship diagram for the key CIF categories that are used in the CCD. Three major categories, _chem_comp,
_chem_comp_bond and _chem_comp_atom, are joined together to generate the machine-readable dictionary description of the chemical entity. The
unique three-character code assigned to every new chemical entity acts as the primary key of the _chem_comp category (_chem_comp.id, coloured in purple) and is
used to connect the other categories (_chem_comp_bond.comp_id and _chem_comp_atom.comp_id).
and _struct_site_gen contain the site information for all the
ligands present in the structure. The _struct_site category
contains residue-level information. Each binding site is
identified using a unique alphanumeric identifier (AC1,
AC2, …, ZZ9), which serves as the primary key for the
_struct_site category. The _struct_site_gen category contains
information about all the neighbouring residues within
3.7 Å of any atom of the ligand. The data item
_struct_site_gen.site_id is directly inherited from the _struct_site.id of the
_struct_site category (Figure 7). In addition to the binding-site
annotation, all covalent bonds between ligands and
polymeric residues or other ligands are annotated as covalent
linkages in the _struct_conn category of the PDBx/mmCIF
file.
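The distance criterion for site delineation can be sketched in a few lines. This is not the CONTACT-derived software itself: it is a simplified illustration that ignores crystallographic symmetry, and the input layout, the function names and the row keys other than `id` and `site_id` are assumptions made for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple

Coord = Tuple[float, float, float]
ResidueId = Tuple[str, str, int]          # (chain, residue name, number)

CUTOFF = 3.7  # cut-off distance in angstroms, as quoted in the text


def binding_site(ligand_atoms: Sequence[Coord],
                 polymer_atoms: Sequence[Tuple[ResidueId, Coord]],
                 cutoff: float = CUTOFF) -> List[ResidueId]:
    """Residues with at least one atom within `cutoff` of any ligand atom.
    Symmetry-related copies are NOT considered in this sketch."""
    site = set()
    for residue, position in polymer_atoms:
        if any(math.dist(position, lig) <= cutoff for lig in ligand_atoms):
            site.add(residue)
    return sorted(site)


def site_rows(site_id: str, ligand: str, residues: Sequence[ResidueId]
              ) -> Tuple[Dict[str, str], List[Dict[str, object]]]:
    """Build one _struct_site-like row and its _struct_site_gen-like rows,
    joined through the site identifier."""
    struct_site = {"id": site_id,
                   "details": f"binding site for ligand {ligand}"}
    struct_site_gen = [{"site_id": site_id, "chain": c, "residue": r,
                        "number": n} for c, r, n in residues]
    return struct_site, struct_site_gen


ligand = [(0.0, 0.0, 0.0)]
polymer = [(("A", "HIS", 10), (3.0, 0.0, 0.0)),    # within the cut-off
           (("A", "HIS", 10), (10.0, 0.0, 0.0)),   # same residue, far atom
           (("A", "GLY", 11), (5.0, 0.0, 0.0))]    # outside the cut-off

residues = binding_site(ligand, polymer)
site, site_gen = site_rows("AC1", "LIG", residues)
print(residues, site["id"], len(site_gen))
```

A residue is reported once even when several of its atoms lie inside the cut-off, which mirrors the residue-level listing described above.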
In protein and DNA structures, metal ions are often
found in coordination complexes with other molecules or
ions. The metals or metal clusters present in metalloproteins
such as haemoglobins, transferrins, cytochromes,
nitrogenases and hydrogenases are bound to nitrogen, sulphur
or oxygen atoms belonging to protein residues. The
details of the bond angles for each of these coordinated
metal centres are listed in REMARK 620 of the PDB file
(Figure 8) or the _pdbx_struct_conn_angle category of the
PDBx/mmCIF file.
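The bond angles listed for a coordinated metal centre are ordinary ligand-metal-ligand angles, one per pair of coordinating atoms. The sketch below shows the underlying geometry; the function names and the toy coordinates are assumptions, and it is not the wwPDB software that produces the REMARK 620 records.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]


def angle_deg(a: Coord, metal: Coord, b: Coord) -> float:
    """Angle a-metal-b in degrees, with the metal at the vertex."""
    v1 = [a[i] - metal[i] for i in range(3)]
    v2 = [b[i] - metal[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm1 = math.sqrt(sum(x * x for x in v1))
    norm2 = math.sqrt(sum(x * x for x in v2))
    cosine = max(-1.0, min(1.0, dot / (norm1 * norm2)))  # guard rounding
    return math.degrees(math.acos(cosine))


def coordination_angles(metal: Coord, donors: Dict[str, Coord]
                        ) -> List[Tuple[str, str, float]]:
    """One angle for every unordered pair of coordinating atoms."""
    return [(n1, n2, angle_deg(donors[n1], metal, donors[n2]))
            for n1, n2 in combinations(sorted(donors), 2)]


# Toy example: four donors at the corners of an ideal tetrahedron.
zn = (0.0, 0.0, 0.0)
donors = {"D1": (1.0, 1.0, 1.0), "D2": (1.0, -1.0, -1.0),
          "D3": (-1.0, 1.0, -1.0), "D4": (-1.0, -1.0, 1.0)}

for name1, name2, value in coordination_angles(zn, donors):
    print(f"{name1}-Zn-{name2}: {value:.1f}")
```

Four donors give six angles; deviations from the ideal tetrahedral value are what make such a listing informative for real sites.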
Annotation of peptide-like inhibitors and antibiotics
In 2011, a major remediation was carried out by wwPDB
curators to improve the representation of peptide-like
inhibitors and antibiotics so that they can be more easily
identified and studied (16). Currently, there are nearly
1300 such structures present in the PDB archive.
The presence or absence of consecutive peptide bonds
determines how these molecules are represented in the
PDB. If a peptide-like molecule contains two or more
consecutive peptide bonds, it is annotated as a standard
polymer, with information about recommended name,
source organism and linkage information between the
amino acids and any non-standard residues present in the
molecule. Cross-referencing to UniProt (17) for gene
products or to Norine (18) for non-ribosomal peptides is
attempted for such compounds.
If the ligand has fewer than two consecutive peptide
bonds (examples of this include several PPACK
inhibitors and other peptide-like molecules), it is annotated
as a non-polymer, like any other small molecule in
the CCD, but with an identifiable subcomponent sequence
running from amino (N) terminal to carboxy (C)
terminal.
Antibiotics such as teicoplanin and vancomycin have
sugars or fatty acids attached to their peptide core. Such
compounds are represented as a ‘group’ (16), which
includes the polymeric and non-polymeric constituents of
the antibiotic, along with details about all the linkages
between them (Figure 9).
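The rule stated above, two or more consecutive peptide bonds for a polymer and fewer for a non-polymer, can be written as a short check. This sketch assumes the molecule has already been reduced to an ordered list of linkages between neighbouring subcomponents (N to C terminal), each flagged as a peptide bond or not; that input format and both function names are hypothetical.

```python
from typing import Sequence


def longest_peptide_run(linkages: Sequence[bool]) -> int:
    """Length of the longest stretch of consecutive peptide bonds.
    linkages[i] is True when subcomponents i and i+1 are joined by a
    peptide bond."""
    best = run = 0
    for is_peptide in linkages:
        run = run + 1 if is_peptide else 0
        best = max(best, run)
    return best


def representation(linkages: Sequence[bool]) -> str:
    """'polymer' for two or more consecutive peptide bonds, else 'non-polymer'."""
    return "polymer" if longest_peptide_run(linkages) >= 2 else "non-polymer"


print(representation([True, True, True]))    # tetrapeptide: polymer
print(representation([True, False, True]))   # peptide bonds not consecutive
print(representation([True]))                # a single peptide bond
```

The check only separates the polymer and non-polymer cases; it does not cover the separate ‘group’ representation used when sugars or fatty acids are attached to a peptide core.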
Figure 3. α-D-Glucose can form α(1–4) glycosidic linkages with other
carbohydrate molecules. During the oligomerization process, the O1
oxygen (highlighted in the figure) of the glucose is eliminated by the O4
oxygen of the other carbohydrate. To account for this condensation
reaction, the O1 oxygen of α-D-glucose (GLC) is annotated in the CCD as a
leaving atom. The two-dimensional diagram in this figure is a copy of
the image from the RCSB PDB website
(http://rcsb.org/pdb/ligand/ligandsummary.do?hetId=GLC). It was generated using the ChemAxon
software (http://www.chemaxon.com).
Figure 4. Seven α-D-glucose (GLC) molecules undergo a condensation
reaction to form the circular oligosaccharide β-cyclodextrin [from PDB
entry 2v8l (30)].
BIRD
The large amount of chemical and biological information
that emerged during the remediation of the peptide-like
antibiotic and inhibitor molecules in the PDB required
systematic referencing and documenting for curation of future
depositions. A new resource, containing both chemical
and biological information, was created to assist the future
annotation of such compounds. The chemical information is
stored in Peptide Reference Dictionary (PRD) files,
whereas the biological function is documented in a separate
family file. The chemical and biological information
are separately annotated so that chemically similar
Figure 5. Binding site for the Plk-2 inhibitor (7R)-8-cyclopentyl-7-ethyl-5-methyl-7,8-dihydropteridin-6(5H)-one (three-letter code 11G) in PDB entry 4i6b
(31). The figure depicts the neighbouring residues that are within 3.7 Å of the ligand 11G.
Figure 6. The environment for the oligosaccharide poly-N-acetylglucosamine (PNAG) is annotated instead of listing the environment of individual
sugars. This avoids repeating the same sugar molecule in multiple binding sites.
molecules, either with a conserved core polymer sequence or
signature sequence motifs, can be grouped under the same
family of antibiotics or peptide-like molecules. For example,
antibiotics such as chlorionectin A, vancomycin,
teicoplanin and balhimycin are part of the glycopeptide
antibiotic family. For each chemically distinct peptide-like
molecule, a new PRD description is created with a unique
identifier and information about its chemical composition,
connectivity, structural description (e.g. glycopeptide,
anthracycline, lipopeptide, etc.) and function (e.g.
immunosuppressant, enzyme inhibitor, thrombin inhibitor,
etc.). Information about a family of molecules is stored in
a FAM file, which mainly includes biological annotation
(such as function, mechanism of action and
Figure 7. Diagram showing the relationship between the _struct_site and _struct_site_gen categories used for annotation of ligand-binding sites. The
_struct_site category holds information about the ligands that are present in the PDB entry, and every ligand in this category is assigned an alphanumeric
binding-site identifier. The _struct_site_gen category contains information about the residues that are present within the vicinity of the ligands
described in the _struct_site category. The two categories are joined by the binding-site identifier.
Figure 8. Tetrahedrally coordinated Zn ion in entry 2VW4 (32), along with the annotation of the bond angles. The REMARK 620 annotation indicates
the software-calculated bond angle values between Zn A 503 and its surrounding residues. The surrounding residues, in anticlockwise direction, are
Glu A 195, His B 165 and Asp B 167. The sidechain carboxylate group of the Glu residue exists in two alternate conformations (A and B conformers).
The angle between GluA195B-Zn-HisB165 is 117.8°, GluA195B-Zn-Asp(OD1)B167 is 86.6°, HisB165-Zn-Asp(OD1)B167 is 91.6°,
GluA195B-Zn-Asp(OD2)B167 is 105.2°, HisB165-Zn-Asp(OD2)B167 is 124.9°, Asp(OD1)B167-Zn-Asp(OD2)B167 is 57.1°, Glu(OE1)A195B-Zn-Glu(OE2)A195A is 24.1°,
HisB165-Zn-Glu(OE2)A195A is 122.8°, Asp(OD1)B167-Zn-Glu(OE2)A195A is 109.3° and Asp(OD2)B167-Zn-Glu(OE2)A195A is 110.7°.
pharmacological action) and the list of family members in
the PRD (16). Remediated and newly released PDB entries
containing peptide-like compounds include PRD identifiers
in the PDBx files. The new resource, comprising both
chemical and biological information for these peptide-like
molecules, is called the BIRD. Both the PRD and family
files are distributed via the wwPDB ftp area
(ftp://ftp.wwpdb.org/pub/pdb/data/bird). Currently, there are nearly
200 families and more than 700 PRD entries in BIRD. The
release of any new PRD entries or families is synchronized
with the release of the first PDB entry containing the
corresponding peptide-like molecule.
Figure 9. Annotation of the glycopeptide antibiotic teicoplanin involves ‘chopping up’ the molecule into its component chemical entities, which are
validated against the CCD. The bonds highlighted in yellow demarcate the individual entities.
Validation of ligands in the PDB
The wwPDB structure validation process includes geometry
checks for all standard and non-standard amino acids and
nucleotides, as well as carbohydrates and ligands. Since
2008, wwPDB has convened several method-specific validation
task forces (VTFs), composed of community experts,
to provide recommendations on structure and data validation
(19–21). The recommendations of the X-ray VTF
have been implemented in a software pipeline that is used
during deposition and annotation, as a stand-alone server
and for validation of all legacy crystal structures in the archive
(22). The validation pipeline uses a variety of external
and internal software. To assess bond lengths, bond angles,
torsion angles and chirality for every small molecule present
in an entry, the pipeline uses the program Mogul (23), which
compares the geometry to that encountered in high-quality
small molecule structures in the CSD (24). The validation
pipeline also analyses the fit of ligands to the electron density
using approaches introduced by the Uppsala
Electron-Density Server (25). For every ligand in an entry, the local
ligand density fit (LLDF) is calculated; this is essentially a
Z-score of the ligand’s real space R-value (RSR) (26) relative
to the mean and standard deviation of the RSR values of the
neighbouring polymeric standard residues and nucleotides
(taking crystallographic symmetry into account) and using
a cut-off distance of 5 Å from any ligand atom. LLDF values
>2 are highlighted in the report produced by the validation
pipeline (Figure 10). The validation report generated during
the annotation process is sent back to the depositor so it can
be included when a paper describing the structure is submitted
for publication. Upon release of an entry, the validation
pipeline is run on it again and the resulting report is made
publicly available.
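As described above, the LLDF is essentially a Z-score, which can be sketched directly from that description. The code below assumes the RSR values of the ligand and of its neighbouring residues (already selected with the 5 Å cut-off) are given as plain numbers; the function name, the use of the population standard deviation and the example values are assumptions, not details taken from the validation pipeline.

```python
import statistics
from typing import Sequence

LLDF_THRESHOLD = 2.0  # values above this are highlighted in the report


def lldf(ligand_rsr: float, neighbour_rsrs: Sequence[float]) -> float:
    """Z-score of the ligand RSR relative to the mean and standard
    deviation of the RSR values of the neighbouring residues."""
    if len(neighbour_rsrs) < 2:
        raise ValueError("need at least two neighbouring residues")
    mean = statistics.mean(neighbour_rsrs)
    spread = statistics.pstdev(neighbour_rsrs)  # assumption: population SD
    if spread == 0:
        raise ValueError("neighbour RSR values have zero spread")
    return (ligand_rsr - mean) / spread


# Made-up RSR values, for illustration only.
neighbours = [0.10, 0.20, 0.30]
score = lldf(0.50, neighbours)
print(f"LLDF = {score:.2f}, highlighted = {score > LLDF_THRESHOLD}")
```

Because the score is relative to the local environment, a ligand in a poorly ordered region is judged against neighbours that fit the density comparably poorly.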
Conclusions and future directions
The PDB archive contains a wide variety of chemical entities,
many of which play a crucial role in modulating
biochemical reactions. Correct identification and representation
of these small molecules in the PDB at the annotation
stage is crucial to allow users to identify and compare
these molecules. Every structure deposited to the
PDB is dissected to identify the individual chemical entities
that are subsequently compared and validated against the
CCD. When a chemical entity matches an existing dictionary
entry, it is assigned the same three-character identifier as
Figure 10. σA-weighted 2Fo-Fc map of a carbohydrate-binding protein, shown at a contour level of 0.35 e/Å³. Very little electron density is observed
for the oligosaccharide molecule. This is reflected in the high LLDF values (shown in parentheses) for each of the component carbohydrate moieties.
in the CCD. In addition, the nomenclature of the entity’s
atoms is updated to reflect the dictionary definition. Once
the individual components have been correctly identified,
information is added about all the non-standard covalent,
H-bonding and ionic interactions that occur in the structure.
The annotation software also generates binding-site information
for every bound molecule. For all crystal structures,
symmetry information is taken into account during the
identification of binding sites and non-covalent interactions.
The rapid growth and increased complexity of the structures
deposited to the PDB has prompted continuous review of the
annotation tools and processes used for processing and
handling these data. To support the
scientific advancements in the field of structural biology, the
wwPDB partners are collaborating to develop a new joint
deposition and annotation system (27). The new annotation
system contains four major modules: a chemical
component module, a sequence module, a validation module
and a module for derived data. The chemistry module
can perform simultaneous searches against both the CCD
and BIRD. The new system has an interactive interface and
visualization tools that aid in annotation and facilitate correct
representation of the data (28). The chemistry module
has been tested on previous depositions and has significantly
improved processing efficiency and data quality.
Acknowledgements
The 3D images in the manuscript were generated using PyMOL
(29). The authors are grateful to Prof. Gerard Kleywegt for helpful
comments on this manuscript.
Funding
PDBe is funded by EMBL-EBI, the Wellcome Trust (088944) and BBSRC
(BB/J007471/1, BB/I02576X/1, BB/K016970/1). RCSB is funded by
NSF, NIH and DOE (DBI-1338415). PDBj is funded by JST-NBDC and the
Institute for Protein Research (IPR), Osaka University. The open
access fee for the paper entitled ‘Small molecule annotation for the
protein data bank’ will be paid by the Wellcome Trust (Grant 088944).
Conflict of interest. None declared.
References
1. Berman HM, Henrick K, Kleywegt G, et al. (2012) The Worldwide Protein Data Bank. Int Tables Crystallogr, F, 827–832.
2. Rose PW, Bi C, Bluhm WF, et al. (2013) The RCSB Protein Data Bank: new resources for research and education. Nucleic Acids Res, 41, D475–D482.
3. Gutmanas A, Alhroub Y, Battle GM, et al. (2014) PDBe: Protein Data Bank in Europe. Nucleic Acids Res, 42, D285–D291.
4. Kinjo AR, Suzuki H, Yamashita R, et al. (2012) Protein Data Bank Japan (PDBj): maintaining a structural data archive and resource description framework format. Nucleic Acids Res, 40, D453–D460.
5. Ulrich EL, Akutsu H, Doreleijers JF, et al. (2008) BioMagResBank. Nucleic Acids Res, 36, D402–D408.
6. Dutta S, Burkhardt K, Young J, et al. (2009) Data deposition and annotation at the Worldwide Protein Data Bank. Mol Biotechnol, 42, 1–13.
7. Berman HM, Kleywegt GJ, Nakamura H, et al. (2013) How community has shaped the Protein Data Bank. Structure, 21, 1485–1491.
8. Editorial (2014) Hard data - it has been no small feat for the Protein Data Bank to stay relevant for 100,000 structures. Nature, 509, 260.
9. Weininger D (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Model, 28, 31–36.
10. Heller S, McNaught A, Stein S, et al. (2013) InChI - the worldwide chemical structure identifier standard. J Cheminform, 5, 7.
11. Bourne PE, Berman HM, McMahon B, et al. (1997) Macromolecular crystallographic information file. Methods Enzymol, 277, 571–590.
12. Gasteiger J, Rudolph C and Sadowski J (1990) Automatic generation of 3D-atomic coordinates for organic molecules. Tetrahedron Comp Method, 3, 537–547.
13. Ihlenfeldt WD, Takahasi Y, Abe H, et al. (1992) CACTVS: a chemistry algorithm development environment. In: Machida K, Nishioka T (eds). Daijuukagakutouronkai Dainijuukai Kouzoukasseisoukan Shinpojiumu Kouenyoushishuu. Kyoto University Press, Kyoto, pp. 102–105.
14. Stahl M and Mauser H (2005) Database clustering with a combination of fingerprint and maximum common substructure methods. J Chem Inf Model, 45, 542–548.
15. Winn MD, Ballard CC, Cowtan KD, et al. (2011) Overview of the CCP4 suite and current developments. Acta Crystallogr D, D67, 235–242.
16. Dutta S, Dimitropoulos D, Feng Z, et al. (2014) Improving the representation of peptide-like inhibitor and antibiotic molecules in the Protein Data Bank. Biopolymers, 101, 659–668.
17. The UniProt Consortium (2011) Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res, 40, D71–D75.
18. Caboche S, Pupin M, Leclere V, et al. (2008) NORINE: a database of nonribosomal peptides. Nucleic Acids Res, 36, D326–D331.
19. Read RJ, Adams PD, Arendall WB, et al. (2011) A new generation of crystallographic validation tools for the Protein Data Bank. Structure, 19, 1395–1412.
20. Montelione GT, Nilges M, Bax A, et al. (2013) Recommendations of the wwPDB NMR Structure Validation Task Force. Structure, 21, 1563–1570.
21. Henderson R, Sali A, Baker ML, et al. (2012) Outcome of the first electron microscopy validation task force meeting. Structure, 20, 205–214.
22. Gore S, Velankar S and Kleywegt GJ (2012) Implementing an X-ray validation pipeline for the Protein Data Bank. Acta Crystallogr, D68, 478–483.
23. Bruno IJ, Cole JC, Kessler M, et al. (2004) Retrieval of crystallographically-derived molecular geometry information. J Chem Inf Model, 44, 2133–2144.
24. Allen FH, Bellard S, Brice MD, et al. (1979) The Cambridge Crystallographic Data Centre: computer-based search, retrieval, analysis and display of information. Acta Crystallogr B, 35, 2331–2339.
25. Kleywegt GJ, Harris MR, Zou JY, et al. (2004) The Uppsala Electron-Density Server. Acta Crystallogr D, 60, 2240–2249.
26. Jones TA, Zou JY, Cowan SW, et al. (1991) Improved methods for building protein models in electron density maps and the location of errors in these models. Acta Crystallogr A, 47, 110–119.
27. Quesada M, Westbrook J, Oldfield T, et al. (2011) The wwPDB common tool for deposition and annotation. Acta Crystallogr, A67, C403–C404.
28. Young J, Feng Z, Dimitropoulos D, et al. (2013) Chemical annotation of small and peptide-like molecules at the Protein Data Bank. Database, 2013, 1.
29. The PyMOL Molecular Graphics System, Version 1.2r3pre, Schrödinger, LLC. [Online].
30. Tung JY, Chang MD, Chou WI, et al. (2008) Crystal structures of the starch-binding domain from Rhizopus oryzae glucoamylase reveal a polysaccharide-binding path. Biochem J, 416, 27–36.
31. Aubele DL, Hom RK, Adler M, et al. (2013) Selective and brain-permeable polo-like kinase-2 (Plk-2) inhibitors that reduce alpha-synuclein phosphorylation in rat brain. ChemMedChem, 8, 1295–1313.
32. Ellis MJ, Buffey SG, Hough MA, et al. (2008) On-line optical and X-ray spectroscopies with crystallography: an integrated approach for determining metalloprotein structures in functionally well defined states. J Synchrotron Radiat, 15, 433–439.
Database Vol 2014 Article ID bau116 Page 11 of 11
by guest on Novem
ber 22 2016httpdatabaseoxfordjournalsorg
Dow
nloaded from
chemical groups. The connectivity information of the modified amino acids is necessary
for correct representation of these biologically interesting molecules. The combined
information is made available via a new resource called the Biologically Interesting
molecules Reference Dictionary, which is complementary to the CCD and is now
routinely used for annotation of peptide-like antibiotics and inhibitors.
Introduction
The Protein Data Bank (PDB) is the single global repository for three-dimensional (3D)
structures of biological macromolecules and their complexes (1). The four partners of the
Worldwide PDB organization (wwPDB; http://wwpdb.org) are the Research Collaboratory for
Structural Bioinformatics (RCSB PDB; http://rcsb.org) (2), the PDB in Europe (PDBe;
http://pdbe.org) (3), the PDB Japan (PDBj; http://pdbj.org) (4) and the Biological Magnetic
Resonance Bank (BMRB; http://bmrb.wisc.edu) (5). They act as deposition, curation and
distribution centres for PDB data. Although the PDB archive is focussed on macromolecules,
a wide variety of small molecules are encountered bound to proteins and nucleic acids.
Currently, there are >20 000 distinct kinds of small molecule present in the archive, and
they are described in the wwPDB Chemical Component Dictionary (CCD). These compounds
include metals, ions, cofactors, fatty acids, carbohydrates, proteinogenic (standard) and
modified amino acids and nucleotides, chromophores, antibiotics, inhibitors and various
other compounds that may be naturally bound to a macromolecule or acquired during
purification or crystallization.
The first step in ligand annotation by wwPDB curators is to identify all the distinct
chemical entities that are present in a newly deposited structure, including all polymers
and small molecules (6). PDB annotation is a complex scientific process that requires
understanding of the interactions between small molecules and macromolecules.
Aspects of small molecule annotation include:
1. identifying small molecules in a newly deposited PDB entry that are already present in
   the CCD;
2. creating definitions for any small molecules that are new to the PDB;
3. geometry and stereochemistry validation;
4. evaluating the fit of the model coordinates to the experimental data;
5. identifying any covalent links with other residues or components;
6. annotation of ligand binding sites; and
7. extending or updating the annotation in the Biologically Interesting molecules Reference
   Dictionary (BIRD) for peptide-like inhibitor and antibiotic molecules.
The wwPDB CCD
The number of structures in the PDB archive has grown from 7 in 1971 to >100 000 in 2014
(7, 8). All these structures are experimentally derived atomistic models of biologically
important proteins and nucleic acids from a huge variety of organisms. Many proteins in the
PDB have substrates, co-factors, reaction products or analogues of such compounds bound to
them. In addition, many proteins and nucleic acids contain modified amino acid or
nucleotide residues. Hence, identifying monomeric components incorporated in the polymers
and ligands is an important first step of PDB annotation (6).
The chemical-component annotation of a PDB entry involves identification of every small
molecule that is present in the structure, either as part of a polymer or as a
non-covalently bound ligand. With the increasing number of structures in the PDB, the
number of unique chemical entities associated with them is increasing as well (Figure 1).
For annotation purposes, it is important to identify and describe the chemical entities
that are deposited to the PDB in a systematic and consistent manner. The wwPDB partners
achieved this through the creation of a chemical reference dictionary. This contains the
description of every unique chemical entity, which can then be reused in subsequent
depositions that contain the same entity. This dictionary is known as the wwPDB CCD and
currently contains chemical definitions of more than 20 000 distinct chemical entities.
The wwPDB annotation software checks whether each small molecule in a new deposition
already occurs in the CCD. When a small molecule in the deposited entry matches an existing
chemical definition in the CCD, it is assigned the same three-character identifier as in
the CCD. The atom nomenclature of the small molecule in the entry is updated to reflect the
dictionary definition. If a molecule is new to the PDB, then a dictionary description is
created for it. The atomic coordinates from the deposited structure are used for the
initial deduction of the connectivity, bond orders and stereochemistry of the molecule. In
cases where the small molecule has poor geometry, extra annotation is required to define
the correct connectivity and bond orders. The dictionary definition for every molecule
contains both coordinates generated with idealized geometry and example experimental
coordinates, along with machine-readable chemical descriptors such as SMILES strings (9),
InChI (10) and InChI keys. The CCD definition also contains the systematic name, synonyms,
chemical formula, formula weight and formal charge of the molecule.
The PDB uses the PDB exchange/macromolecular Crystallographic Information File format
(PDBx/mmCIF) (11) to describe the contents of a PDB entry. The advantages of using mmCIF as
the data model include its flexibility and extensibility, which allow addition of new data
items to address the ever-increasing size and complexity of deposited structures. The CCD
is also maintained in CIF format (11), and every chemical entity in the dictionary is
assigned a unique three-character identifier. Each CCD entry contains five CIF categories
that provide the machine-readable chemical description of the small molecule (Figure 2).
The unique three-character code assigned to a chemical component is used as the primary key
to connect the different categories in the dictionary. The _chem_comp category contains the
name, synonyms, chemical formula, formula weight and formal charge of the molecule, along
with some information about the parent PDB entry that was used to construct the dictionary
entry. For entities such as chromophores and modified amino acids and nucleic acids, the
data item _chem_comp.mon_nstd_parent_comp_id provides information about the ‘parent’ of the
compound. For example, the amino acid phosphotyrosine is derived from tyrosine, which is
therefore its parent. The chromophore CRO present in green fluorescent protein is derived
from the tripeptide Ser-Tyr-Gly. Hence, the dictionary definition of CRO has Ser-Tyr-Gly
listed as the parent of this chromophore. The _chem_comp category also provides structural
classification information about the compound. The data items _chem_comp.type and
_chem_comp.pdbx_type are used to annotate the class and type of the molecule, respectively.
A standard or a sidechain-modified L-amino acid is a member of the class ‘L-peptide
linking’. Amino acids and their modified forms occur in proteins and are designated as type
‘ATOMP’. Sugars can be classified as ‘D-saccharide’ or ‘L-saccharide’ depending on their
configuration, and they are annotated as type ‘ATOMS’.
The _chem_comp_atom category contains atom-level details about the compound. This category
includes not only the atom name, element type, aromaticity flag, chirality and charge for
every atom, but also both the generated and example experimental 3D coordinates for the
entire chemical entity. The chemistry software CORINA (12) is used to produce the
programmatically generated conformation. The bond-order information is stored in the
_chem_comp_bond category, whereas the SMILES and InChI strings are in the
_pdbx_chem_comp_descriptor category. Standard chemistry software, such as that from
ACD/Labs (http://www.acdlabs.com), CACTVS (13) and OpenEye (14), is used to generate the
systematic name, SMILES and InChI descriptors for the compound. As part of a wwPDB
collaboration with the Cambridge Crystallographic Data Centre, Cambridge Structural
Database (CSD) coordinates for those small molecules in the PDB that also exist in the CSD
will be included in the future as well.
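As a rough illustration of how the component identifier ties the CCD categories together, the following sketch parses a small hand-written fragment in the style of a CCD entry and collects the atoms and bonds that belong to one component. The fragment is abbreviated (heavy atoms of glycine only) and is not copied from the dictionary. The parser handles only simple key-value items and loop_ tables; the function names are assumptions made for this example and do not represent the wwPDB software.

```python
import shlex
from collections import defaultdict

# Hand-written, abbreviated fragment in CCD style (illustrative only).
CIF_FRAGMENT = """\
data_GLY
_chem_comp.id GLY
_chem_comp.name GLYCINE
_chem_comp.type "PEPTIDE LINKING"
_chem_comp.pdbx_type ATOMP
loop_
_chem_comp_atom.comp_id
_chem_comp_atom.atom_id
_chem_comp_atom.type_symbol
GLY N N
GLY CA C
GLY C C
GLY O O
GLY OXT O
loop_
_chem_comp_bond.comp_id
_chem_comp_bond.atom_id_1
_chem_comp_bond.atom_id_2
_chem_comp_bond.value_order
GLY N CA SING
GLY CA C SING
GLY C O DOUB
GLY C OXT SING
"""


def parse_simple_cif(text):
    """Parse key-value items and loop_ tables into {category: [row dicts]}."""
    categories = defaultdict(list)
    lines = [ln.strip() for ln in text.splitlines()]
    lines = [ln for ln in lines if ln and not ln.startswith("#")]
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith("data_"):
            i += 1
        elif line == "loop_":
            i += 1
            names = []
            while i < len(lines) and lines[i].startswith("_"):
                names.append(lines[i])
                i += 1
            category = names[0].split(".", 1)[0]
            items = [n.split(".", 1)[1] for n in names]
            while i < len(lines) and not lines[i].startswith(("_", "loop_", "data_")):
                categories[category].append(dict(zip(items, shlex.split(lines[i]))))
                i += 1
        else:
            # Single "_category.item value" pair; all pairs share one row.
            name, value = shlex.split(line)
            category, item = name.split(".", 1)
            if not categories[category]:
                categories[category].append({})
            categories[category][0][item] = value
            i += 1
    return dict(categories)


def describe(categories, comp_id):
    """Join the three categories on the component identifier."""
    comp = next(r for r in categories["_chem_comp"] if r["id"] == comp_id)
    atoms = [r["atom_id"] for r in categories["_chem_comp_atom"]
             if r["comp_id"] == comp_id]
    bonds = [(r["atom_id_1"], r["atom_id_2"], r["value_order"])
             for r in categories["_chem_comp_bond"] if r["comp_id"] == comp_id]
    return comp, atoms, bonds


comp, atoms, bonds = describe(parse_simple_cif(CIF_FRAGMENT), "GLY")
print(comp["name"], comp["type"], atoms, bonds)
```

The same join on comp_id would apply to the descriptor category; a production reader would normally rely on a full mmCIF parser, since real entries use quoting and multi-line values that this sketch does not handle.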
Figure 1. Number of new PDB chemical entity definitions created annually between 2000 and 2013.
The wwPDB annotation guidelines require the CCD definition to pertain to the neutral,
unbound form of every compound. In cases where a new chemical entity is covalently linked
to another small molecule or to a polymer, a leaving atom (usually an oxygen atom labelled
as OXT) is introduced in the chemical definition of the compound. In this way, the newly
built chemical definition can be reused in future depositions where the same compound may
occur either in covalently bound form or as a ‘free’ ligand. The use of leaving atoms to
capture multiple chemical forms of the same compound is especially common for amino acids,
nucleotides and carbohydrates (Figures 3 and 4).
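The idea of one definition serving both the free and the bound form can be sketched as follows. The atom list covers only the heavy atoms of α-D-glucose (GLC), with O1 flagged as the leaving atom as described for Figure 3. The Y/N flag mirrors the leaving-atom flag stored per atom in the dictionary; the function name and data layout are assumptions made for this example.

```python
# Heavy atoms of alpha-D-glucose (GLC) with a leaving-atom flag per atom.
# "Y" marks an atom that is lost when the component forms a covalent link.
GLC_ATOMS = [
    ("C1", "N"), ("C2", "N"), ("C3", "N"), ("C4", "N"), ("C5", "N"), ("C6", "N"),
    ("O1", "Y"), ("O2", "N"), ("O3", "N"), ("O4", "N"), ("O5", "N"), ("O6", "N"),
]


def atoms_present(atoms, linked):
    """Return the atom names expected for the free or the linked form."""
    if not linked:
        return [name for name, _ in atoms]
    return [name for name, flag in atoms if flag != "Y"]


print(atoms_present(GLC_ATOMS, linked=False))  # free ligand: all 12 heavy atoms
print(atoms_present(GLC_ATOMS, linked=True))   # glycosidically linked: O1 absent
```

In a real entry the hydrogen attached to a leaving oxygen would be flagged as well; it is omitted here to keep the example short.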
Annotation of ligand-binding sites
Covalently linked ligands are quite common in the PDB, including co-factors such as
pyridoxal phosphate and carbohydrates such as N-acetylglucosamine. Non-covalently linked
ligands are usually bound to one or more polymers through H-bonding, ionic and van der
Waals interactions. Some ligands are instrumental to the biological function of a
biomacromolecule or complex. The binding environment of a ligand is important for
stabilizing a particular conformational state that facilitates its biological function.
Figure 2. Abbreviated category relationship diagram for the key CIF categories that are
used in the CCD. Three major categories, _chem_comp, _chem_comp_bond and _chem_comp_atom,
are joined together to generate the machine-readable dictionary description of the chemical
entity. The unique three-character code assigned to every new chemical entity acts as the
primary key in the _chem_comp category (_chem_comp.id, coloured in purple) and is used to
connect the other categories (_chem_comp_bond.comp_id and _chem_comp_atom.comp_id).

The site-delineation software used in PDB annotation is derived from the CCP4 (15) program
CONTACT (http://www.ccp4.ac.uk/html/contact.html). It determines which polymeric residues
make up the binding site for each non-polymeric ligand present in the structure (Figure 5),
using a cut-off distance of 3.7 Å from any atom of the ligand and taking crystallographic
symmetry into account. For oligosaccharides and peptide-like inhibitors and antibiotics,
the software generates binding-site information for the entire molecule instead of for the
individual moieties (Figure 6). In addition to the software-generated site information, any
author-provided details about catalytic site residues in a protein are also included in the
annotated entry. In the mmCIF file for a given entry, the CIF categories _struct_site and
_struct_site_gen contain the site information for all the ligands present in the structure.
The _struct_site category contains residue-level information. Each binding site is
identified using a unique alphanumeric identifier (AC1, AC2, ... ZZ9), which serves as the
primary key for the _struct_site category. The _struct_site_gen category contains
information about all the neighbouring residues within 3.7 Å of any atom of the ligand. The
data item _struct_site_gen.site_id is directly inherited from _struct_site.id of the
_struct_site category (Figure 7). In addition to the binding-site annotation, all covalent
bonds between ligands and polymeric residues or other ligands are annotated as covalent
linkages in the _struct_conn category of the PDBx/mmCIF file.
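The distance criterion can be illustrated with a minimal sketch that lists the polymer residues having at least one atom within 3.7 Å of any ligand atom. It is not the CONTACT-derived annotation software: it ignores crystallographic symmetry, uses a brute-force search, and the coordinates, residue labels and function name are made up for the example.

```python
import math

CUTOFF = 3.7  # cut-off distance in Å, as stated in the text


def binding_site(ligand_atoms, polymer_atoms, cutoff=CUTOFF):
    """Return sorted (chain, residue number, residue name) keys of residues
    with any atom within `cutoff` of any ligand atom."""
    site = set()
    for chain, resname, resnum, xyz in polymer_atoms:
        if any(math.dist(xyz, lig_xyz) <= cutoff for lig_xyz in ligand_atoms):
            site.add((chain, resnum, resname))
    return sorted(site)


# Hypothetical coordinates (Å): one ligand atom and four polymer atoms.
ligand = [(0.0, 0.0, 0.0)]
polymer = [
    ("A", "HIS", 10, (3.0, 0.0, 0.0)),   # 3.0 Å away: inside the cut-off
    ("A", "GLY", 11, (3.6, 0.0, 0.0)),   # 3.6 Å away: inside the cut-off
    ("A", "ASP", 12, (4.0, 0.0, 0.0)),   # 4.0 Å away: outside the cut-off
    ("B", "SER", 5, (2.0, 2.0, 2.0)),    # about 3.46 Å away: inside
]

for residue in binding_site(ligand, polymer):
    print(residue)
```

Each residue found this way would correspond to one row in _struct_site_gen that carries the site identifier of its ligand; for larger structures a spatial index would replace the double loop.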
In protein and DNA structures, metal ions are often found in coordination complexes with
other molecules or ions. The metals or metal clusters present in metalloproteins such as
haemoglobins, transferrins, cytochromes, nitrogenases and hydrogenases are bound to
nitrogen, sulphur or oxygen atoms belonging to protein residues. The details of the bond
angles for each of these coordinated metal centres are listed in REMARK 620 of the PDB file
(Figure 8) or the _pdbx_struct_conn_angle category of the PDBx/mmCIF file.
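The angles reported for a coordinated metal centre are ordinary ligand-metal-ligand angles, which can be computed from coordinates as in the sketch below. The geometry is an idealized tetrahedron placed around the origin, not data from a PDB entry, and the atom labels and function name are assumptions for illustration.

```python
import math
from itertools import combinations


def angle_deg(atom_a, metal, atom_b):
    """Angle atom_a-metal-atom_b in degrees."""
    v1 = [a - m for a, m in zip(atom_a, metal)]
    v2 = [b - m for b, m in zip(atom_b, metal)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.hypot(*v1) * math.hypot(*v2)
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))


# Idealized tetrahedral coordination around a metal at the origin.
metal = (0.0, 0.0, 0.0)
donors = {
    "L1": (1.0, 1.0, 1.0),
    "L2": (1.0, -1.0, -1.0),
    "L3": (-1.0, 1.0, -1.0),
    "L4": (-1.0, -1.0, 1.0),
}

for (name_a, xyz_a), (name_b, xyz_b) in combinations(donors.items(), 2):
    print(f"{name_a}-M-{name_b}: {angle_deg(xyz_a, metal, xyz_b):.1f}")
```

Four donor atoms give six angles, all equal to the ideal tetrahedral value here; observed sites such as the one in Figure 8 deviate from it.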
Annotation of peptide-like inhibitors and antibiotics
In 2011, a major remediation was carried out by wwPDB curators to improve the
representation of peptide-like inhibitors and antibiotics so that they can be more easily
identified and studied (16). Currently, there are nearly 1300 such structures present in
the PDB archive.
The presence or absence of consecutive peptide bonds determines how these molecules are
represented in the PDB. If a peptide-like molecule contains two or more consecutive peptide
bonds, it is annotated as a standard polymer with information about recommended name,
source organism and linkage information between the amino acids and any non-standard
residues present in the molecule. Cross-referencing to UniProt (17) for gene products or to
Norine (18) for non-ribosomal peptides is attempted for such compounds.
If the ligand has fewer than two consecutive peptide bonds (examples of this include
several PPACK inhibitors and other peptide-like molecules), it is annotated as a
non-polymer like any other small molecule in the CCD, but with an identifiable subcomponent
sequence running from the amino (N) terminal to the carboxy (C) terminal.
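The rule described above can be written as a small decision function. The sketch assumes that the linkages between successive subcomponents have already been classified as peptide bonds or not; how that classification is made, and the function names used here, are outside what the text specifies and are assumptions for illustration.

```python
def longest_peptide_run(linkages):
    """Length of the longest run of consecutive peptide bonds.

    `linkages` lists, in N-to-C order, whether each link between
    neighbouring subcomponents is a peptide bond (True) or not (False).
    """
    best = run = 0
    for is_peptide in linkages:
        run = run + 1 if is_peptide else 0
        best = max(best, run)
    return best


def representation(linkages):
    """'polymer' for two or more consecutive peptide bonds, else 'non-polymer'."""
    return "polymer" if longest_peptide_run(linkages) >= 2 else "non-polymer"


print(representation([True, True, True]))    # polymer
print(representation([True, False, True]))   # non-polymer
```

The second call shows that two peptide bonds separated by a different linkage do not count as consecutive.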
Antibiotics such as teicoplanin and vancomycin have sugars or fatty acids attached to their
peptide core. Such compounds are represented as a ‘group’ (16), which includes the
polymeric and non-polymeric constituents of the antibiotic along with details about all the
linkages between them (Figure 9).

Figure 3. α-D-Glucose can form α(1–4) glycosidic linkages with other carbohydrate
molecules. During the oligomerization process, the O1 oxygen (highlighted in the figure) of
the glucose is eliminated by the O4 oxygen of the other carbohydrate. To account for this
condensation reaction, the O1 oxygen of α-D-glucose (GLC) is annotated in the CCD as a
leaving atom. The two-dimensional diagram in this figure is a copy of the image from the
RCSB PDB website (http://rcsb.org/pdb/ligand/ligandsummary.do?hetId=GLC). It was generated
using the ChemAxon software (http://www.chemaxon.com).

Figure 4. Seven α-D-glucose (GLC) molecules undergo a condensation reaction to form the
circular oligosaccharide β-cyclodextrin [from PDB entry 2v8l (30)].
BIRD
The large amount of chemical and biological information that emerged during the remediation
of the peptide-like antibiotic and inhibitor molecules in the PDB required systematic
referencing and documenting for curation of future depositions. A new resource, containing
both chemical and biological information, was created to assist the future annotation of
such compounds. The chemical information is stored in Peptide Reference Dictionary (PRD)
files, whereas the biological function is documented in a separate family file. The
chemical and biological information are annotated separately so that chemically similar
molecules, either with a conserved core polymer sequence or with signature sequence motifs,
can be grouped under the same family of antibiotics or peptide-like molecules. For example,
antibiotics such as chlorionectin A, vancomycin, teicoplanin and balhimycin are part of the
glycopeptide antibiotic family. For each chemically distinct peptide-like molecule, a new
PRD description is created with a unique identifier and information about its chemical
composition, connectivity, structural description (e.g. glycopeptide, anthracycline,
lipopeptide, etc.) and function (e.g. immunosuppressant, enzyme inhibitor, thrombin
inhibitor, etc.). Information about a family of molecules is stored in a FAM file, which
mainly includes biological annotation (such as function, mechanism of action and
pharmacological action) and the list of family members in the PRD (16). Remediated and
newly released PDB entries containing peptide-like compounds include PRD identifiers in the
PDBx files. The new resource, comprising both chemical and biological information for these
peptide-like molecules, is called BIRD. Both the PRD and family files are distributed via
the wwPDB ftp area (ftp://ftp.wwpdb.org/pub/pdb/data/bird). Currently, there are nearly 200
families and more than 700 PRD entries in BIRD. The release of any new PRD entries or
families is synchronized with the release of the first PDB entry containing the
corresponding peptide-like molecule.

Figure 5. Binding site for the Plk-2 inhibitor
(7R)-8-cyclopentyl-7-ethyl-5-methyl-7,8-dihydropteridin-6(5H)-one (three-letter code 11G)
in PDB entry 4i6b (31). The figure depicts the neighbouring residues that are within 3.7 Å
of the ligand 11G.

Figure 6. The environment for the oligosaccharide poly-N-acetylglucosamine (PNAG) is
annotated instead of listing the environment of individual sugars. This avoids repeating
the same sugar molecule in multiple binding sites.

Figure 7. Diagram showing the relationship between the _struct_site and _struct_site_gen
categories used for annotation of ligand-binding sites. The _struct_site category holds
information about the ligands that are present in the PDB entry, and every ligand in this
category is assigned an alphanumeric binding-site identifier. The _struct_site_gen category
contains information on the residues that are present within the vicinity of the ligands
described in the _struct_site category. The two categories are joined by the binding-site
identifier.

Figure 8. Tetrahedrally coordinated Zn ion in entry 2VW4 (32), along with the annotation of
the bond angles. The REMARK 620 annotation indicates the software-calculated bond angle
values between Zn A 503 and its surrounding residues. The surrounding residues in
anticlockwise direction are Glu A 195, His B 165 and Asp B 167. The sidechain carboxylate
group of the Glu residue exists in two alternate conformations (A and B conformers). The
angle GluA195B-Zn-HisB165 is 117.8°, GluA195B-Zn-Asp(OD1)B167 is 86.6°,
HisB165-Zn-Asp(OD1)B167 is 91.6°, GluA195B-Zn-Asp(OD2)B167 is 105.2°,
HisB165-Zn-Asp(OD2)B167 is 124.9°, Asp(OD1)B167-Zn-Asp(OD2)B167 is 57.1°,
Glu(OE1)A195B-Zn-Glu(OE2)A195A is 24.1°, HisB165-Zn-Glu(OE2)A195A is 122.8°,
Asp(OD1)B167-Zn-Glu(OE2)A195A is 109.3° and Asp(OD2)B167-Zn-Glu(OE2)A195A is 110.7°.
Figure 9. Annotation of the glycopeptide antibiotic teicoplanin involves ‘chopping up’ the
molecule into its component chemical entities, which are validated against the CCD. The
bonds highlighted in yellow demarcate the individual entities.
Validation of ligands in the PDB
The wwPDB structure validation process includes geometry checks for all standard and
non-standard amino acids and nucleotides, as well as carbohydrates and ligands. Since 2008,
wwPDB has convened several method-specific validation task forces (VTFs), composed of
community experts, to provide recommendations on structure and data validation (19–21). The
recommendations of the X-ray VTF have been implemented in a software pipeline that is used
during deposition and annotation, as a stand-alone server and for validation of all legacy
crystal structures in the archive (22). The validation pipeline uses a variety of external
and internal software. To assess bond lengths, bond angles, torsion angles and chirality
for every small molecule present in an entry, the pipeline uses the program Mogul (23),
which compares the geometry to that encountered in high-quality small molecule structures
in the CSD (24). The validation pipeline also analyses the fit of ligands to the electron
density using approaches introduced by the Uppsala Electron-Density Server (25). For every
ligand in an entry, the local ligand density fit (LLDF) is calculated; this is essentially
a Z-score of the ligand’s real-space R-value (RSR) (26) relative to the mean and standard
deviation of the RSR values of the neighbouring polymeric standard residues and nucleotides
(taking crystallographic symmetry into account and using a cut-off distance of 5 Å from any
ligand atom). LLDF values >2 are highlighted in the report produced by the validation
pipeline (Figure 10). The validation report generated during the annotation process is sent
back to the depositor so that it can be included when a paper describing the structure is
submitted for publication. Upon release of an entry, the validation pipeline is run on it
again and the resulting report is made publicly available.
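Read as a Z-score, the LLDF of a ligand is its RSR minus the mean RSR of the neighbouring residues, divided by the standard deviation of those neighbouring values. The sketch below implements that reading. It assumes the population standard deviation and takes the neighbour RSR values as given (selecting neighbours within 5 Å and computing RSR from the map are not shown); the numbers and function names are invented for the example and the pipeline's exact conventions may differ.

```python
import statistics

LLDF_THRESHOLD = 2.0  # values above this are highlighted, per the text


def lldf(ligand_rsr, neighbour_rsrs):
    """Z-score of the ligand RSR against the RSR values of its neighbours.

    Assumes the population standard deviation; the choice between population
    and sample standard deviation is an assumption of this sketch.
    """
    mean = statistics.mean(neighbour_rsrs)
    spread = statistics.pstdev(neighbour_rsrs)
    if spread == 0:
        raise ValueError("neighbour RSR values have zero spread")
    return (ligand_rsr - mean) / spread


def is_highlighted(ligand_rsr, neighbour_rsrs):
    return lldf(ligand_rsr, neighbour_rsrs) > LLDF_THRESHOLD


# Hypothetical RSR values for residues near two ligands.
neighbours = [0.10, 0.12, 0.14, 0.12, 0.12]
print(round(lldf(0.30, neighbours), 2), is_highlighted(0.30, neighbours))
print(round(lldf(0.13, neighbours), 2), is_highlighted(0.13, neighbours))
```

A ligand whose RSR is much worse than that of its surroundings, as for the oligosaccharide in Figure 10, yields a large positive value, whereas a ligand that fits about as well as its neighbours stays below the threshold.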
Conclusions and future directions
The PDB archive contains a wide variety of chemical entities, many of which play a crucial
role in modulating biochemical reactions. Correct identification and representation of
these small molecules in the PDB at the annotation stage is crucial to allow users to
identify and compare them. Every structure deposited to the PDB is dissected to identify
the individual chemical entities, which are subsequently compared and validated against the
CCD. When a chemical entity matches an existing dictionary entry, it is assigned the same
three-character identifier as in the CCD. In addition, the nomenclature of the entity’s
atoms is updated to reflect the dictionary definition. Once the individual components have
been correctly identified, information is added about all the non-standard covalent,
H-bonding and ionic interactions that occur in the structure. The annotation software also
generates binding-site information for every bound molecule. For all crystal structures,
symmetry information is taken into account during the identification of binding sites and
non-covalent interactions.

Figure 10. σA-weighted 2Fo-Fc map of a carbohydrate-binding protein shown at a contour
level of 0.35 e/Å^3. Very little electron density is observed for the oligosaccharide
molecule. This is reflected in the high LLDF values (shown in parentheses) for each of the
component carbohydrate moieties.

The rapid growth and increased complexity of the structures deposited to the PDB have
prompted continuous review of the annotation tools and processes used during processing and
handling of the data deposited to the PDB. To support the scientific advancements in the
field of structural biology, the wwPDB partners are collaborating to develop a new joint
deposition and annotation system (27). The new annotation system contains four major
modules: a chemical component module, a sequence module, a validation module and a module
for derived data. The chemistry module can perform simultaneous searches against both the
CCD and BIRD. The new system has an interactive interface and visualization tools that aid
in annotation and facilitate correct representation of the data (28). The chemistry module
has been tested on previous depositions and has significantly improved processing
efficiency and data quality.
Acknowledgements
The 3D images in the manuscript were generated using PyMOL (29). The authors are grateful
to Prof. Gerard Kleywegt for helpful comments on this manuscript.
Funding
PDBe is funded by EMBL-EBI, Wellcome Trust (088944), BBSRC (BB/J007471/1, BB/I02576X/1,
BB/K016970/1). RCSB is funded by NSF, NIH, DOE (DBI-1338415). PDBj is funded by JST-NBDC,
Institute for Protein Research (IPR), Osaka University. The open access fee for the paper
entitled ‘Small molecule annotation for the protein data bank’ will be paid by the Wellcome
Trust Grant 088944.
Conflict of interest. None declared.
References
1. Berman HM, Henrick K, Kleywegt G, et al. (2012) The Worldwide Protein Data Bank. Int. Tables Crystallogr. F, 827–832.
2. Rose PW, Bi C, Bluhm WF, et al. (2013) The RCSB Protein Data Bank: new resources for research and education. Nucleic Acids Res., 41, D475–D482.
3. Gutmanas A, Alhroub Y, Battle GM, et al. (2014) PDBe: Protein Data Bank in Europe. Nucleic Acids Res., 42, D285–D291.
4. Kinjo AR, Suzuki H, Yamashita R, et al. (2012) Protein Data Bank Japan (PDBj): maintaining a structural data archive and resource description framework format. Nucleic Acids Res., 40, D453–D460.
5. Ulrich EL, Akutsu H, Doreleijers JF, et al. (2008) BioMagResBank. Nucleic Acids Res., 36, D402–D408.
6. Dutta S, Burkhardt K, Young J, et al. (2009) Data deposition and annotation at the Worldwide Protein Data Bank. Mol. Biotechnol., 42, 1–13.
7. Berman HM, Kleywegt GJ, Nakamura H, et al. (2013) How community has shaped the Protein Data Bank. Structure, 21, 1485–1491.
8. Editorial (2014) Hard data: it has been no small feat for the Protein Data Bank to stay relevant for 100 000 structures. Nature, 509, 260.
9. Weininger D (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Model., 28, 31–36.
10. Heller S, McNaught A, Stein S, et al. (2013) InChI - the worldwide chemical structure identifier standard. J. Cheminform., 5, 7.
11. Bourne PE, Berman HM, McMahon B, et al. (1997) Macromolecular crystallographic information file. Methods Enzymol., 277, 571–590.
12. Gasteiger J, Rudolph C, Sadowski J (1990) Automatic generation of 3D-atomic coordinates for organic molecules. Tetrahedron Comp. Method., 3, 537–547.
13. Ihlenfeldt WD, Takahasi Y, Abe H, et al. (1992) CACTVS: a chemistry algorithm development environment. In: Machida K, Nishioka T (eds), Daijuukagakutouronkai Dainijuukai Kouzoukasseisoukan Shinpojiumu Kouenyoushishuu. Kyoto University Press, Kyoto, pp. 102–105.
14. Stahl M, Mauser H (2005) Database clustering with a combination of fingerprint and maximum common substructure methods. J. Chem. Inf. Model., 45, 542–548.
15. Winn MD, Ballard CC, Cowtan KD, et al. (2011) Overview of the CCP4 suite and current developments. Acta Crystallogr. D, 67, 235–242.
16. Dutta S, Dimitropoulos D, Feng Z, et al. (2014) Improving the representation of peptide-like inhibitor and antibiotic molecules in the Protein Data Bank. Biopolymers, 101, 659–668.
17. The UniProt Consortium (2011) Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res., 40, D71–D75.
18. Caboche S, Pupin M, Leclere V, et al. (2008) NORINE: a database of nonribosomal peptides. Nucleic Acids Res., 36, D326–D331.
19. Read RJ, Adams PD, Arendall WB, et al. (2011) A new generation of crystallographic validation tools for the Protein Data Bank. Structure, 19, 1395–1412.
20. Montelione GT, Nilges M, Bax A, et al. (2013) Recommendations of the wwPDB NMR Structure Validation Task Force. Structure, 21, 1563–1570.
21. Henderson R, Sali A, Baker ML, et al. (2012) Outcome of the first electron microscopy validation task force meeting. Structure, 20, 205–214.
22. Gore S, Velankar S, Kleywegt GJ (2012) Implementing an X-ray validation pipeline for the Protein Data Bank. Acta Crystallogr. D, 68, 478–483.
23. Bruno IJ, Cole JC, Kessler M, et al. (2004) Retrieval of crystallographically-derived molecular geometry information. J. Chem. Inf. Model., 44, 2133–2144.
24. Allen FH, Bellard S, Brice MD, et al. (1979) The Cambridge Crystallographic Data Centre: computer-based search, retrieval, analysis and display of information. Acta Crystallogr. B, 35, 2331–2339.
25. Kleywegt GJ, Harris MR, Zou JY, et al. (2004) The Uppsala Electron-Density Server. Acta Crystallogr. D, 60, 2240–2249.
26. Jones TA, Zou JY, Cowan SW, et al. (1991) Improved methods for building protein models in electron density maps and the location of errors in these models. Acta Crystallogr. A, 47, 110–119.
27. Quesada M, Westbrook J, Oldfield T, et al. (2011) The wwPDB common tool for deposition and annotation. Acta Crystallogr. A, 67, C403–C404.
28. Young J, Feng Z, Dimitropoulos D, et al. (2013) Chemical annotation of small and peptide-like molecules at the Protein Data Bank. Database, 2013, 1.
29. The PyMOL Molecular Graphics System, Version 1.2r3pre, Schrödinger, LLC. [Online].
30. Tung JY, Chang MD, Chou WI, et al. (2008) Crystal structures of the starch-binding domain from Rhizopus oryzae glucoamylase reveal a polysaccharide-binding path. Biochem. J., 416, 27–36.
31. Aubele DL, Hom RK, Adler M, et al. (2013) Selective and brain-permeable polo-like kinase-2 (Plk-2) inhibitors that reduce alpha-synuclein phosphorylation in rat brain. ChemMedChem, 8, 1295–1313.
32. Ellis MJ, Buffey SG, Hough MA, et al. (2008) On-line optical and X-ray spectroscopies with crystallography: an integrated approach for determining metalloprotein structures in functionally well defined states. J. Synchrotron Radiat., 15, 433–439.
Database Vol 2014 Article ID bau116 Page 11 of 11
by guest on Novem
ber 22 2016httpdatabaseoxfordjournalsorg
Dow
nloaded from
and example experimental coordinates along with ma-
chine-readable chemical descriptors such as SMILES
strings (9) InChi (10) and InChi keys The CCD definition
also contains the systematic name synonyms chemical
formula formula weight and formal charge of the
molecule
The PDB uses PDB exchangemacromolecular
Crystallographic Information File format (PDBxmmCIF)
(11) to describe the contents of a PDB entry The advantages
of using mmCIF as the data model include its flexibility and
extensibility which allow addition of new data items to ad-
dress the ever-increasing size and complexity of deposited
structures The CCD is also maintained in CIF format (11)
and every chemical entity in the dictionary is assigned a
unique three-character identifier Each CCD entry contains
five CIF categories that provide the machine-readable chem-
ical description of the small molecule (Figure 2) The unique
three-character code assigned to a chemical component is
used as the primary key to connect the different categories
in the dictionary The _chem_comp category contains the
name synonyms chemical formula formula weight and
formal charge of the molecule along with some information
about the parent PDB entry which was used to construct
the dictionary entry For entities such as chromophores and
modified amino acids and nucleic acids the data
item_chem_compmon_nstd_parent_comp_id provides
information about the lsquoparentrsquo of the compound For ex-
ample the amino acid phosphotyrosine is derived from tyro-
sine which is therefore its parent The chromophore CRO
present in green fluorescent protein is derived from the
tripeptide Ser-Tyr-Gly Hence the dictionary definition of
CRO has Ser-Tyr-Gly listed as the parent of this chromo-
phore The _chem_comp category provides structural classi-
fication information about the compound The data items
_chem_comptype and _chem_comppdbx_type are used to
annotate the type and class of the molecule respectively A
standard or a sidechain modified L-amino acid is a member
of the class lsquoL-peptide linkingrsquo Amino acids and their modi-
fied forms occur in proteins and are designated as type
lsquoATOMPrsquo Sugars can be classified as lsquoD-saccharidersquo or lsquoL-
saccharidersquo depending on their configuration and they are
annotated as type lsquoATOMSrsquo
The _chem_comp_atom category contains atom-level details about the compound. This category includes not only the atom name, element type, aromaticity flag, chirality and charge for every atom, but also both the generated and the example experimental 3D coordinates for the entire chemical entity. The chemistry software CORINA (12) is used to produce the programmatically generated conformation. The bond-order information is stored in the _chem_comp_bond category, whereas the SMILES and InChI strings are in the _pdbx_chem_comp_descriptor category. Standard chemistry software, such as that from ACDlabs (http://www.acdlabs.com), CACTVS (13) and OpenEye (14), is used to generate the systematic name and the SMILES and InChI descriptors for the compound. As part of a wwPDB collaboration with the Cambridge Crystallographic Data Centre, Cambridge Structural Database (CSD) coordinates for those small molecules in the PDB that also exist in the CSD will be included in the future as well.
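The way the three-character code ties the dictionary categories together can be sketched in a few lines of Python. This is only an illustrative model and not the wwPDB software: the class names, the `assemble` helper and the abbreviated phosphotyrosine rows (including the atom names) are assumptions chosen for the example, and a real CCD record carries many more data items.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChemComp:
    """One row of the _chem_comp category."""
    id: str                        # three-character code (primary key)
    name: str
    parent_comp_id: Optional[str]  # mirrors mon_nstd_parent_comp_id


@dataclass(frozen=True)
class ChemCompAtom:
    """One row of the _chem_comp_atom category."""
    comp_id: str                   # foreign key to ChemComp.id
    atom_id: str
    type_symbol: str


@dataclass(frozen=True)
class ChemCompBond:
    """One row of the _chem_comp_bond category."""
    comp_id: str                   # foreign key to ChemComp.id
    atom_id_1: str
    atom_id_2: str
    value_order: str               # e.g. "SING" or "DOUB"


def assemble(comp_id, comps, atoms, bonds):
    """Join the three categories on the three-character code."""
    comp = next(c for c in comps if c.id == comp_id)
    return {
        "comp": comp,
        "atoms": [a for a in atoms if a.comp_id == comp_id],
        "bonds": [b for b in bonds if b.comp_id == comp_id],
    }


# Abbreviated, illustrative rows (not a complete dictionary entry).
COMPS = [
    ChemComp("TYR", "TYROSINE", None),
    ChemComp("PTR", "O-PHOSPHOTYROSINE", "TYR"),
]
ATOMS = [
    ChemCompAtom("TYR", "OH", "O"),
    ChemCompAtom("PTR", "OH", "O"),
    ChemCompAtom("PTR", "P", "P"),
]
BONDS = [ChemCompBond("PTR", "OH", "P", "SING")]

entry = assemble("PTR", COMPS, ATOMS, BONDS)
print(entry["comp"].parent_comp_id, len(entry["atoms"]), len(entry["bonds"]))
```

Keeping the code as the only join key means that atoms, bonds and descriptors can be stored and updated independently while still resolving to a single component.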
Figure 1. Number of new PDB chemical entity definitions created annually between 2000 and 2013.
Database Vol 2014 Article ID bau116 Page 3 of 11
The wwPDB annotation guidelines require the CCD definition to pertain to the neutral, unbound form of every compound. In cases where a new chemical entity is covalently linked to another small molecule or to a polymer, a leaving atom (usually an oxygen atom labelled as OXT) is introduced in the chemical definition of the compound. In this way, the newly built chemical definition can be reused in future depositions where the same compound may occur either in covalently bound form or as a ‘free’ ligand. The use of leaving atoms to capture multiple chemical forms of the same compound is especially common for amino acids, nucleotides and carbohydrates (Figures 3 and 4).
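A minimal sketch of how a single definition with leaving atoms can describe both the free and the bound form is given below. It assumes a toy atom list for α-D-glucose in which only O1 is flagged as a leaving atom (as in the Figure 3 caption), and it makes the simplification that every flagged atom is absent whenever the component is covalently linked; the function name and data layout are hypothetical.

```python
# (atom name, leaving-atom flag) pairs for the heavy atoms of alpha-D-glucose.
# Only O1 is flagged here, following the Figure 3 caption; this toy list is
# an assumption for illustration, not a copy of the CCD record.
GLC_ATOMS = [
    ("C1", False), ("C2", False), ("C3", False), ("C4", False),
    ("C5", False), ("C6", False), ("O1", True), ("O2", False),
    ("O3", False), ("O4", False), ("O5", False), ("O6", False),
]


def expected_atoms(atoms, covalently_linked):
    """Return the atom names expected in a deposited instance.

    Simplification: when the component is covalently linked, every atom
    flagged as leaving is dropped; a free ligand keeps all atoms.
    """
    return [name for name, leaving in atoms
            if not (covalently_linked and leaving)]


print(len(expected_atoms(GLC_ATOMS, covalently_linked=False)))  # 12
print(len(expected_atoms(GLC_ATOMS, covalently_linked=True)))   # 11
```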
Annotation of ligand-binding sites
Covalently linked ligands are quite common in the PDB, including co-factors such as pyridoxal phosphate and carbohydrates such as N-acetylglucosamine. Non-covalently linked ligands are usually bound to one or more polymers through H-bonding, ionic and van der Waals interactions. Some ligands are instrumental to the biological function of a biomacromolecule or complex. The binding environment of a ligand helps to stabilize a particular conformational state that facilitates its biological function.
Figure 2. Abbreviated category relationship diagram for the key CIF categories that are used in the CCD. Three major categories, _chem_comp, _chem_comp_bond and _chem_comp_atom, are joined together to generate the machine-readable dictionary description of the chemical entity. The unique three-character code assigned to every new chemical entity acts as the primary key of the _chem_comp category (_chem_comp.id, coloured in purple) and is used to connect the other categories (_chem_comp_bond.comp_id and _chem_comp_atom.comp_id).
The site-delineation software used in PDB annotation is derived from the CCP4 (15) program CONTACT (http://www.ccp4.ac.uk/html/contact.html). It determines which polymeric residues make up the binding site for each non-polymeric ligand present in the structure (Figure 5), using a cut-off distance of 3.7 Å from any atom of the ligand and taking crystallographic symmetry into account. For oligosaccharides and peptide-like inhibitors and antibiotics, the software generates binding-site information for the entire molecule instead of for the individual moieties (Figure 6). In addition to the software-generated site information, any author-provided details about catalytic site residues in a protein are also included in the annotated entry. In the mmCIF file for a given entry, the CIF categories _struct_site and _struct_site_gen contain the site information for all the ligands present in the structure. The _struct_site category contains residue-level information. Each binding site is identified using a unique alphanumeric identifier (AC1, AC2, ... ZZ9), which serves as the primary key for the _struct_site category. The _struct_site_gen category contains information about all the neighbouring residues within 3.7 Å of any atom of the ligand. The data item _struct_site_gen.site_id is directly inherited from _struct_site.id of the _struct_site category (Figure 7). In addition to the binding-site annotation, all covalent bonds between ligands and polymeric residues or other ligands are annotated as covalent linkages in the _struct_conn category of the PDBx/mmCIF file.
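The distance criterion described above can be illustrated with a short, self-contained sketch. It is not the CONTACT-derived program itself: the function names and coordinates are made up for the example, and symmetry-related copies of the polymer, which the annotation software takes into account, are not generated here.

```python
import math

CUTOFF = 3.7  # Å, the cut-off quoted in the text


def site_residues(ligand_coords, polymer_atoms, cutoff=CUTOFF):
    """Residues with at least one atom within `cutoff` of any ligand atom.

    polymer_atoms is an iterable of ((chain, name, number), (x, y, z)).
    Symmetry mates are NOT generated in this sketch.
    """
    found = set()
    for residue, coord in polymer_atoms:
        if any(math.dist(coord, lig) <= cutoff for lig in ligand_coords):
            found.add(residue)
    return sorted(found)


def struct_site_gen_rows(site_id, residues):
    """Rows shaped like _struct_site_gen, keyed by the site identifier."""
    return [
        {"site_id": site_id, "auth_asym_id": chain,
         "auth_comp_id": name, "auth_seq_id": number}
        for chain, name, number in residues
    ]


# Made-up coordinates for illustration only.
ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
polymer = [
    (("A", "HIS", 57), (3.0, 0.0, 0.0)),   # 1.5 Å from the second ligand atom
    (("A", "SER", 195), (0.0, 3.6, 0.0)),  # 3.6 Å from the first ligand atom
    (("A", "ASP", 102), (0.0, 0.0, 3.8)),  # just outside the cut-off
    (("A", "GLY", 10), (10.0, 0.0, 0.0)),  # far away
]

rows = struct_site_gen_rows("AC1", site_residues(ligand, polymer))
for row in rows:
    print(row)
```

Each generated row carries the site identifier, so the residue list can be joined back to the corresponding _struct_site record in the same way as described for the two categories.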
In protein and DNA structures, metal ions are often found in coordination complexes with other molecules or ions. The metals or metal clusters present in metalloproteins such as haemoglobins, transferrins, cytochromes, nitrogenases and hydrogenases are bound to nitrogen, sulphur or oxygen atoms belonging to protein residues. The details of the bond angles for each of these coordinated metal centres are listed in REMARK 620 of the PDB file (Figure 8) or in the _pdbx_struct_conn_angle category of the PDBx/mmCIF file.
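As an illustration of the kind of quantity listed for a coordinated metal centre, the sketch below computes every ligand-metal-ligand angle from Cartesian coordinates. The function names, the ligand labels L1 to L4 and the idealized tetrahedral coordinates are assumptions for the example; the annotation software's own implementation is not described in the text.

```python
import math
from itertools import combinations


def angle_deg(a, metal, b):
    """Angle a-metal-b in degrees, from Cartesian coordinates."""
    v1 = [p - m for p, m in zip(a, metal)]
    v2 = [p - m for p, m in zip(b, metal)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.dist(a, metal) * math.dist(b, metal)
    cos_theta = max(-1.0, min(1.0, dot / norm))  # guard against rounding
    return math.degrees(math.acos(cos_theta))


def coordination_angles(metal, ligands):
    """All pairwise ligand-metal-ligand angles, rounded to 0.1 degree."""
    return {
        (n1, n2): round(angle_deg(ligands[n1], metal, ligands[n2]), 1)
        for n1, n2 in combinations(sorted(ligands), 2)
    }


# Idealized tetrahedral site with hypothetical ligand labels.
metal = (0.0, 0.0, 0.0)
ligands = {
    "L1": (1.0, 1.0, 1.0),
    "L2": (1.0, -1.0, -1.0),
    "L3": (-1.0, 1.0, -1.0),
    "L4": (-1.0, -1.0, 1.0),
}

for pair, value in coordination_angles(metal, ligands).items():
    print(pair, value)
```

Four coordinating atoms give six angles, which is why the listing for even a simple site quickly becomes long when alternate conformers are present.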
Annotation of peptide-like inhibitors and antibiotics
In 2011, a major remediation was carried out by wwPDB curators to improve the representation of peptide-like inhibitors and antibiotics so that they can be more easily identified and studied (16). Currently, there are nearly 1300 such structures present in the PDB archive.
The presence or absence of consecutive peptide bonds determines how these molecules are represented in the PDB. If a peptide-like molecule contains two or more consecutive peptide bonds, it is annotated as a standard polymer, with information about recommended name, source organism and linkage information between the amino acids and any non-standard residues present in the molecule. Cross-referencing to UniProt (17) for gene products or to Norine (18) for non-ribosomal peptides is attempted for such compounds.
If the ligand has fewer than two consecutive peptide bonds (examples include several PPACK inhibitors and other peptide-like molecules), it is annotated as a non-polymer like any other small molecule in the CCD, but with an identifiable subcomponent sequence running from the amino (N) terminus to the carboxy (C) terminus.
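The rule based on consecutive peptide bonds can be written down as a small decision function. The sketch below assumes a hypothetical input format, a list with one Boolean per linkage between successive subcomponents (True for a peptide bond), and it covers only the polymer versus non-polymer decision described above.

```python
def longest_peptide_run(linkages):
    """Length of the longest run of consecutive peptide bonds."""
    longest = current = 0
    for is_peptide_bond in linkages:
        current = current + 1 if is_peptide_bond else 0
        longest = max(longest, current)
    return longest


def representation(linkages):
    """Return 'polymer' for two or more consecutive peptide bonds,
    otherwise 'non-polymer'."""
    return "polymer" if longest_peptide_run(linkages) >= 2 else "non-polymer"


print(representation([True, True, True]))   # polymer
print(representation([True, False, True]))  # non-polymer
```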
Antibiotics such as teicoplanin and vancomycin have sugars or fatty acids attached to their peptide core. Such compounds are represented as a ‘group’ (16), which includes the polymeric and non-polymeric constituents of the antibiotic along with details about all the linkages between them (Figure 9).
Figure 3. α-D-Glucose can form α(1–4) glycosidic linkages with other carbohydrate molecules. During the oligomerization process, the O1 oxygen (highlighted in the figure) of the glucose is eliminated by the O4 oxygen of the other carbohydrate. To account for this condensation reaction, the O1 oxygen of α-D-glucose (GLC) is annotated in the CCD as a leaving atom. The two-dimensional diagram in this figure is a copy of the image from the RCSB PDB website (http://rcsb.org/pdb/ligand/ligandsummary.do?hetId=GLC). It was generated using the ChemAxon software (http://www.chemaxon.com).
Figure 4. Seven α-D-glucose (GLC) molecules undergo a condensation reaction to form the circular oligosaccharide β-cyclodextrin [from PDB entry 2v8l (30)].
Figure 5. Binding site for the Plk-2 inhibitor (7R)-8-cyclopentyl-7-ethyl-5-methyl-7,8-dihydropteridin-6(5H)-one (three-letter code 11G) in PDB entry 4i6b (31). The figure depicts the neighbouring residues that are within 3.7 Å of the ligand 11G.
Figure 6. The environment for the oligosaccharide poly-N-acetylglucosamine (PNAG) is annotated as a whole, instead of listing the environment of the individual sugars. This avoids repeating the same sugar molecule in multiple binding sites.
Figure 7. Diagram showing the relationship between the _struct_site and _struct_site_gen categories used for annotation of ligand-binding sites. The _struct_site category holds information about the ligands that are present in the PDB entry, and every ligand in this category is assigned an alphanumeric binding-site identifier. The _struct_site_gen category contains information about the residues that are present within the vicinity of the ligands described in the _struct_site category. The two categories are joined by the binding-site identifier.
Figure 8. Tetrahedrally coordinated Zn ion in entry 2VW4 (32), along with the annotation of the bond angles. The REMARK 620 annotation indicates the software-calculated bond angle values between Zn A 503 and its surrounding residues. The surrounding residues, in anticlockwise direction, are Glu A 195, His B 165 and Asp B 167. The sidechain carboxylate group of the Glu residue exists in two alternate conformations (A and B conformers). The angles are: GluA195B-Zn-HisB165, 117.8°; GluA195B-Zn-Asp(OD1)B167, 86.6°; HisB165-Zn-Asp(OD1)B167, 91.6°; GluA195B-Zn-Asp(OD2)B167, 105.2°; HisB165-Zn-Asp(OD2)B167, 124.9°; Asp(OD1)B167-Zn-Asp(OD2)B167, 57.1°; Glu(OE1)A195B-Zn-Glu(OE2)A195A, 24.1°; HisB165-Zn-Glu(OE2)A195A, 122.8°; Asp(OD1)B167-Zn-Glu(OE2)A195A, 109.3°; and Asp(OD2)B167-Zn-Glu(OE2)A195A, 110.7°.
BIRD
The large amount of chemical and biological information that emerged during the remediation of the peptide-like antibiotic and inhibitor molecules in the PDB required systematic referencing and documenting for curation of future depositions. A new resource, containing both chemical and biological information, was created to assist the future annotation of such compounds. The chemical information is stored in Peptide Reference Dictionary (PRD) files, whereas the biological function is documented in a separate family file. The chemical and biological information are annotated separately so that chemically similar molecules, either with a conserved core polymer sequence or with signature sequence motifs, can be grouped under the same family of antibiotics or peptide-like molecules. For example, antibiotics such as chlorionectin A, vancomycin, teicoplanin and balhimycin are part of the glycopeptide antibiotic family. For each chemically distinct peptide-like molecule, a new PRD description is created with a unique identifier and information about its chemical composition, connectivity, structural description (e.g. glycopeptide, anthracycline, lipopeptide) and function (e.g. immunosuppressant, enzyme inhibitor, thrombin inhibitor). Information about a family of molecules is stored in a FAM file, which mainly includes biological annotation (such as function, mechanism of action and pharmacological action) and the list of family members in the PRD (16). Remediated and newly released PDB entries containing peptide-like compounds include PRD identifiers in the PDBx files. The new resource, comprising both chemical and biological information for these peptide-like molecules, is called BIRD. Both the PRD and family files are distributed via the wwPDB ftp area (ftp://ftp.wwpdb.org/pub/pdb/data/bird). Currently, there are nearly 200 families and more than 700 PRD entries in BIRD. The release of any new PRD entries or families is synchronized with the release of the first PDB entry containing the corresponding peptide-like molecule.
Figure 9. Annotation of the glycopeptide antibiotic teicoplanin involves ‘chopping up’ the molecule into its component chemical entities, which are validated against the CCD. The bonds highlighted in yellow demarcate the individual entities.
Validation of ligands in the PDB
The wwPDB structure validation process includes geometry checks for all standard and non-standard amino acids and nucleotides, as well as carbohydrates and ligands. Since 2008, wwPDB has convened several method-specific validation task forces (VTFs), composed of community experts, to provide recommendations on structure and data validation (19–21). The recommendations of the X-ray VTF have been implemented in a software pipeline that is used during deposition and annotation, as a stand-alone server, and for validation of all legacy crystal structures in the archive (22). The validation pipeline uses a variety of external and internal software. To assess bond lengths, bond angles, torsion angles and chirality for every small molecule present in an entry, the pipeline uses the program Mogul (23), which compares the geometry to that encountered in high-quality small-molecule structures in the CSD (24). The validation pipeline also analyses the fit of ligands to the electron density using approaches introduced by the Uppsala Electron-Density Server (25). For every ligand in an entry, the local ligand density fit (LLDF) is calculated; this is essentially a Z-score of the ligand’s real-space R-value (RSR) (26) relative to the mean and standard deviation of the RSR values of the neighbouring polymeric standard residues and nucleotides (taking crystallographic symmetry into account and using a cut-off distance of 5 Å from any ligand atom). LLDF values >2 are highlighted in the report produced by the validation pipeline (Figure 10). The validation report generated during the annotation process is sent back to the depositor so that it can be included when a paper describing the structure is submitted for publication. Upon release of an entry, the validation pipeline is run on it again and the resulting report is made publicly available.
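Since the LLDF is described as a Z-score, its arithmetic can be sketched directly. The snippet below is an illustration under stated assumptions rather than the validation pipeline's code: it uses the population standard deviation (the text does not say which estimator is used), it takes the RSR values of the neighbouring residues as already selected, and the numbers are invented.

```python
from statistics import mean, pstdev

LLDF_THRESHOLD = 2.0  # values above this are highlighted in the report


def lldf(ligand_rsr, neighbour_rsrs):
    """Z-score of the ligand RSR relative to its neighbours' RSR values.

    Assumes the population standard deviation; raises ZeroDivisionError
    if all neighbour values are identical.
    """
    mu = mean(neighbour_rsrs)
    sigma = pstdev(neighbour_rsrs)
    return (ligand_rsr - mu) / sigma


def is_flagged(ligand_rsr, neighbour_rsrs, threshold=LLDF_THRESHOLD):
    """True when the ligand fits the density markedly worse than its
    surroundings."""
    return lldf(ligand_rsr, neighbour_rsrs) > threshold


# Invented RSR values for residues near the ligand.
neighbours = [0.10, 0.12, 0.14, 0.12]

print(round(lldf(0.30, neighbours), 2), is_flagged(0.30, neighbours))
print(round(lldf(0.13, neighbours), 2), is_flagged(0.13, neighbours))
```

Normalizing against the local environment rather than a global average means that a ligand is judged relative to how well its immediate surroundings fit the density in the same map.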
Figure 10. σA-weighted 2Fo-Fc map of a carbohydrate-binding protein shown at a contour level of 0.35 e/Å³. Very little electron density is observed for the oligosaccharide molecule. This is reflected in the high LLDF values (shown in parentheses) for each of the component carbohydrate moieties.
Conclusions and future directions
The PDB archive contains a wide variety of chemical entities, many of which play a crucial role in modulating biochemical reactions. Correct identification and representation of these small molecules in the PDB at the annotation stage is crucial so that users can identify and compare them. Every structure deposited to the PDB is dissected to identify the individual chemical entities, which are subsequently compared and validated against the CCD. When a chemical entity matches an existing dictionary entry, it is assigned the same three-character identifier as in the CCD. In addition, the nomenclature of the entity’s atoms is updated to reflect the dictionary definition. Once the individual components have been correctly identified, information is added about all the non-standard covalent, H-bonding and ionic interactions that occur in the structure. The annotation software also generates binding-site information for every bound molecule. For all crystal structures, symmetry information is taken into account during the identification of binding sites and non-covalent interactions.
The rapid growth and increased complexity of the structures deposited to the PDB have prompted continuous review of the annotation tools and processes used to process and handle the deposited data. To support scientific advances in the field of structural biology, the wwPDB partners are collaborating to develop a new joint deposition and annotation system (27). The new annotation system contains four major modules: a chemical component module, a sequence module, a validation module and a module for derived data. The chemistry module can perform simultaneous searches against both the CCD and BIRD. The new system has an interactive interface and visualization tools that aid in annotation and facilitate correct representation of the data (28). The chemistry module has been tested on previous depositions and has significantly improved processing efficiency and data quality.
Acknowledgements
The 3D images in the manuscript were generated using PyMOL (29). The authors are grateful to Prof. Gerard Kleywegt for helpful comments on this manuscript.
Funding
PDBe is funded by EMBL-EBI, Wellcome Trust (088944) and BBSRC (BB/J007471/1, BB/I02576X/1, BB/K016970/1). RCSB is funded by NSF, NIH and DOE (DBI-1338415). PDBj is funded by JST-NBDC and the Institute for Protein Research (IPR), Osaka University. The open access fee for the paper entitled ‘Small molecule annotation for the protein data bank’ will be paid by the Wellcome Trust Grant 088944.
Conflict of interest. None declared.
References
1. Berman, H.M., Henrick, K., Kleywegt, G., et al. (2012) The Worldwide Protein Data Bank. Int. Tables Crystallogr., F, 827–832.
2. Rose, P.W., Bi, C., Bluhm, W.F., et al. (2013) The RCSB Protein Data Bank: new resources for research and education. Nucleic Acids Res., 41, D475–D482.
3. Gutmanas, A., Alhroub, Y., Battle, G.M., et al. (2014) PDBe: Protein Data Bank in Europe. Nucleic Acids Res., 42, D285–D291.
4. Kinjo, A.R., Suzuki, H., Yamashita, R., et al. (2012) Protein Data Bank Japan (PDBj): maintaining a structural data archive and resource description framework format. Nucleic Acids Res., 40, D453–D460.
5. Ulrich, E.L., Akutsu, H., Doreleijers, J.F., et al. (2008) BioMagResBank. Nucleic Acids Res., 36, D402–D408.
6. Dutta, S., Burkhardt, K., Young, J., et al. (2009) Data deposition and annotation at the Worldwide Protein Data Bank. Mol. Biotechnol., 42, 1–13.
7. Berman, H.M., Kleywegt, G.J., Nakamura, H., et al. (2013) How community has shaped the Protein Data Bank. Structure, 21, 1485–1491.
8. Editorial (2014) Hard data: it has been no small feat for the Protein Data Bank to stay relevant for 100,000 structures. Nature, 509, 260.
9. Weininger, D. (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Model., 28, 31–36.
10. Heller, S., McNaught, A., Stein, S., et al. (2013) InChI - the worldwide chemical structure identifier standard. J. Cheminform., 5, 7.
11. Bourne, P.E., Berman, H.M., McMahon, B., et al. (1997) Macromolecular crystallographic information file. Methods Enzymol., 277, 571–590.
12. Gasteiger, J., Rudolph, C. and Sadowski, J. (1990) Automatic generation of 3D-atomic coordinates for organic molecules. Tetrahedron Comp. Method., 3, 537–547.
13. Ihlenfeldt, W.D., Takahasi, Y., Abe, H., et al. (1992) CACTVS: a chemistry algorithm development environment. In: Machida, K., Nishioka, T. (eds), Daijuukagakutouronkai Dainijuukai Kouzoukasseisoukan Shinpojiumu Kouenyoushishuu. Kyoto University Press, Kyoto, pp. 102–105.
14. Stahl, M. and Mauser, H. (2005) Database clustering with a combination of fingerprint and maximum common substructure methods. J. Chem. Inf. Model., 45, 542–548.
15. Winn, M.D., Ballard, C.C., Cowtan, K.D., et al. (2011) Overview of the CCP4 suite and current developments. Acta Crystallogr. D, 67, 235–242.
16. Dutta, S., Dimitropoulos, D., Feng, Z., et al. (2014) Improving the representation of peptide-like inhibitor and antibiotic molecules in the Protein Data Bank. Biopolymers, 101, 659–668.
17. The UniProt Consortium (2011) Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res., 40, D71–D75.
18. Caboche, S., Pupin, M., Leclere, V., et al. (2008) NORINE: a database of nonribosomal peptides. Nucleic Acids Res., 36, D326–D331.
19. Read, R.J., Adams, P.D., Arendall, W.B., et al. (2011) A new generation of crystallographic validation tools for the Protein Data Bank. Structure, 19, 1395–1412.
20. Montelione, G.T., Nilges, M., Bax, A., et al. (2013) Recommendations of the wwPDB NMR Structure Validation Task Force. Structure, 21, 1563–1570.
21. Henderson, R., Sali, A., Baker, M.L., et al. (2012) Outcome of the first electron microscopy validation task force meeting. Structure, 20, 205–214.
22. Gore, S., Velankar, S. and Kleywegt, G.J. (2012) Implementing an X-ray validation pipeline for the Protein Data Bank. Acta Crystallogr. D, 68, 478–483.
23. Bruno, I.J., Cole, J.C., Kessler, M., et al. (2004) Retrieval of crystallographically-derived molecular geometry information. J. Chem. Inf. Model., 44, 2133–2144.
24. Allen, F.H., Bellard, S., Brice, M.D., et al. (1979) The Cambridge Crystallographic Data Centre: computer-based search, retrieval, analysis and display of information. Acta Crystallogr. B, 35, 2331–2339.
25. Kleywegt, G.J., Harris, M.R., Zou, J.Y., et al. (2004) The Uppsala Electron-Density Server. Acta Crystallogr. D, 60, 2240–2249.
26. Jones, T.A., Zou, J.Y., Cowan, S.W., et al. (1991) Improved methods for building protein models in electron density maps and the location of errors in these models. Acta Crystallogr. A, 47, 110–119.
27. Quesada, M., Westbrook, J., Oldfield, T., et al. (2011) The wwPDB common tool for deposition and annotation. Acta Crystallogr. A, 67, C403–C404.
28. Young, J., Feng, Z., Dimitropoulos, D., et al. (2013) Chemical annotation of small and peptide-like molecules at the Protein Data Bank. Database, 2013, 1.
29. The PyMOL Molecular Graphics System, Version 1.2r3pre, Schrödinger, LLC. [Online].
30. Tung, J.Y., Chang, M.D., Chou, W.I., et al. (2008) Crystal structures of the starch-binding domain from Rhizopus oryzae glucoamylase reveal a polysaccharide-binding path. Biochem. J., 416, 27–36.
31. Aubele, D.L., Hom, R.K., Adler, M., et al. (2013) Selective and brain-permeable polo-like kinase-2 (Plk-2) inhibitors that reduce alpha-synuclein phosphorylation in rat brain. ChemMedChem, 8, 1295–1313.
32. Ellis, M.J., Buffey, S.G., Hough, M.A., et al. (2008) On-line optical and X-ray spectroscopies with crystallography: an integrated approach for determining metalloprotein structures in functionally well defined states. J. Synchrotron Radiat., 15, 433–439.
Database Vol 2014 Article ID bau116 Page 11 of 11
by guest on Novem
ber 22 2016httpdatabaseoxfordjournalsorg
Dow
nloaded from
The wwPDB annotation guidelines require the CCD
definition to pertain to the neutral unbound form of every
compound In cases where a new chemical entity is cova-
lently linked to another small molecule or to a polymer a
leaving atom (usually an oxygen atom labelled as OXT) is
introduced in the chemical definition of the compound In
this way the newly built chemical definition can be reused
in future depositions where the same compound may occur
either in covalently bound form or as a lsquofreersquo ligand The
use of leaving atoms to capture multiple chemical forms of
the same compound is especially common for amino acids
nucleotides and carbohydrates (Figures 3 and 4)
Annotation of ligand-binding sites
Covalently linked ligands are quite common in the PDB
including co-factors such as pyridoxal phosphate and
carbohydrates such as N-acetylglucosamine Non-covalently
linked ligands are usually bound to one or more polymers
through H-bonding ionic and van der Waals interactions
Some ligands are instrumental to the biological function of a
biomacromolecule or complex Obviously the binding en-
vironment of a ligand is important to stabilize a particular
conformational state that facilitates its biological function
The site-delineation software used in PDB annotation is
derived from the CCP4 (15) program CONTACT (http
wwwccp4acukhtmlcontacthtml) It determines which
polymeric residues make up the binding site for each non-
polymeric ligand present in the structure (Figure 5) using a
cut-off distance of 37 A from any atom of the ligand and
taking crystallographic symmetry into account For oligo-
saccharides and peptide-like inhibitors and antibiotics the
software generates binding-site information for the entire
molecule instead of for the individual moieties (Figure 6) In
addition to the software-generated site information any
author-provided details about catalytic site residues in a
protein are also included in the annotated entry In the
mmCIF file for a given entry the CIF categories _struct_site
Figure 2 Abbreviated category relationship diagram for the key CIF categories that are used in the CCD Three major categories _chem_comp
_chem_comp_bond and _chem_comp_atom are joined together to generate the machine readable dictionary description of the chemical entity The
unique three character code assigned to every new chemical entity acts as the primary key in the _chem_comp categoryid (coloured in purple) and is
used to connect the other categories (_chem_comp_bondcomp_id and _chem_comp_atomcomp_id)
Page 4 of 11 Database Vol 2014 Article ID bau116
by guest on Novem
ber 22 2016httpdatabaseoxfordjournalsorg
Dow
nloaded from
and _struct_site_gen contain the site information for all the
ligands present in the structure The struct_site category
contains residue-level information Each binding site is
identified using a unique alphanumeric identifier (AC1
AC2 ZZ9) which serves as the primary key for the
_struct_site category The _struct_site_gen category contains
information about all the neighbouring residues within
37 A of any atom of the ligand The data item _struct_site_
gensite_id is directly inherited from the struct_siteid of the
_struct_site category (Figure 7) In addition to the binding-
site annotation all covalent bonds between ligands and
polymeric residues or other ligands are annotated as cova-
lent linkages in the _struct_conn category of the PDBx
mmCIF file
In protein and DNA structures metal ions are often
found in coordination complexes with other molecules or
ions The metals or metal clusters present in metallopro-
teins such as haemoglobins transferrins cytochromes
nitrogenases and hydrogenases are bound to nitrogen sul-
phur or oxygen atoms belonging to protein residues The
details of the bond angles for each of these coordinated
metal centres are listed in REMARK 620 of the PDB file
(Figure 8) or the _pdbx_struct_conn_angle category of the
PDBxmmCIF file
Annotation of peptide-like inhibitors andantibiotics
In 2011 a major remediation was carried out by wwPDB
curators to improve the representation of peptide-like
inhibitors and antibiotics so that they can be more easily
identified and studied (16) Currently there are nearly
1300 such structures present in the PDB archive
The presence or absence of consecutive peptide bonds
determines how these molecules are represented in the
PDB If a peptide-like molecule contains two or more
consecutive peptide bonds it is annotated as a standard
polymer with information about recommended name
source organism and linkage information between the
amino acids and any non-standard residues present in the
molecule Cross-referencing to UniProt (17) for gene
products or to Norine (18) for non-ribosomal peptides is
attempted for such compounds
If the ligand has fewer than two consecutive peptide
bonds (examples of this include several PPACK
inhibitors and other peptide-like molecules) it is annotated
as a non-polymer like any other small molecule in
the CCD but with an identifiable subcomponent se-
quence running from amino terminal (N) to carboxy (C)
terminal
Antibiotics such as teicoplanin and vancomycin have
sugars or fatty acids attached to their peptide core Such
compounds are represented as a lsquogrouprsquo (16) which
includes the polymeric and non-polymeric constituents of
the antibiotic along with details about all the linkages
between them (Figure 9)
Figure 3 a-D-Glucose can form a(1ndash4) glycosidic linkages with other
carbohydrate molecules During the oligomerization process the O1
oxygen (highlighted in the figure) of the glucose is eliminated by the O4
oxygen of the other carbohydrate To account for this condensation re-
action the O1 oxygen of a-D-glucose (GLC) is annotated in the CCD as a
leaving atom The two-dimensional diagram in this figure is a copy of
the image from the RCSB PDB website (httprcsborgpdbligand
ligandsummarydohetIdfrac14GLC) It was generated using the ChemAxon
software (httpwwwchemaxoncom)
Figure 4 Seven a-D-glucose (GLC) molecules undergo condensation re-
action to form the circular oligosaccharide b-cyclodextrin [from PDB
entry 2v8l (30)]
Database Vol 2014 Article ID bau116 Page 5 of 11
by guest on Novem
ber 22 2016httpdatabaseoxfordjournalsorg
Dow
nloaded from
BIRD
The large amount of chemical and biological information
that emerged during the remediation of the peptide-like
antibiotic and inhibitor molecules in the PDB required sys-
tematic referencing and documenting for curation of future
depositions A new resource was created to assist the future
annotation of such compounds containing both chemical
and biological information The chemical information is
stored in Peptide Reference Dictionary (PRD) files
whereas the biological function is documented in a separ-
ate family file The chemical and biological information
are separately annotated so that chemically similar
Figure 5 Binding site for the Plk-2 inhibitor (7R)-8-cyclopentyl-7-ethyl-5-methyl-78-dihydropteridin-6(5H)-one (3 letter code 11 G) in PDB entry 4i6b
(31) The figure depicts the neighbouring residues that are within 37 A of the ligand 11 G
Figure 6 The environment for the oligosaccharide poly-N-acetylglucosamine (PNAG) is annotated instead of listing the environment of individual
sugars This avoids repeating the same sugar molecule in multiple binding sites
Page 6 of 11 Database Vol 2014 Article ID bau116
by guest on Novem
ber 22 2016httpdatabaseoxfordjournalsorg
Dow
nloaded from
molecules either with conserved core polymer sequence or
signature sequence motifs can be grouped under the same
family of antibiotics or peptide-like molecules For ex-
ample antibiotics such as chlorionectin A vancomycin
teicoplanin and balhimycin are part of the glycopeptide
antibiotic family For each chemically distinct peptide-like
molecule a new PRD description is created with a unique
identifier and information about its chemical composition
connectivity structural description (eg glycopeptide
anthracycline lipopeptide etc) and function (eg
immunosuppressant enzyme inhibitor thrombin inhibitor
etc) Information about a family of molecules is stored in
a FAM file which mainly includes biological annotation
(such as function mechanism of action and
Figure 7 Diagram showing the relationship between the_struct_site and _struct_site_gen categories used for annotation of ligand-binding sites The
_struct_site category holds information about the ligands that are present in the PDB entry and every ligand in this category is assigned a alphanu-
meric binding site identifier The _struct_site_gen category contains information of the residues that are present within the vicinity of the ligands
described in the struct_site category Both the categories are joined by the binding site identifier
Figure 8 Tetrahedrally coordinated Zn ion in entry 2VW4 (32) along with the annotation of the bond angles The REMARK 620 annotation indicates
the software calculated bond angle values between Zn A 503 and its surrounding residues The surrounding residues in anticlockwise direction are
Glu A 195 HisB 165 and Asp B 167 The sidechain carboxylate group of the Glu residue exists in two alternate conformation (A and B conformers)
The angle between GluA195B-Zn-HisB165 is 1178 GluA195B-Zn-Asp(OD1)B167 is 866 HisB165-Zn-Asp(OD1)B167 is 916 Glu195B-Zn-
ASP(OD2)B167 is 1052 HisB165-Zn-Asp(OD2)B167 is 1249 Asp(OD1)B167-Zn-Asp(OD2)B167 is 571 Glu(OE1)A195B-Zn-Glu(OE2)A195A is 241
HisB165-Zn- Glu(OE2)A195A is 1228 Asp(OD1)B167-Zn-Glu(OE2)A195A is 1093 and Asp(OD2)B167-Zn-Glu(OE2)A195A is 1107
Database Vol 2014 Article ID bau116 Page 7 of 11
by guest on Novem
ber 22 2016httpdatabaseoxfordjournalsorg
Dow
nloaded from
pharmacological action) and the list of family members in
the PRD (16) Remediated and newly released PDB entries
containing peptide-like compounds include PRD identifiers
in the PDBx files The new resource comprising both
chemical and biological information for these peptide-like
molecules is called the BIRD Both the PRD and family
files are distributed via the wwPDB ftp area (ftpftp
wwpdborgpubpdbdatabird) Currently there are nearly
200 families and more than 700 PRD entries in BIRD The
release of any new PRD entries or families is synchronized
with the release of the first PDB entry containing the cor-
responding peptide-like molecule
Figure 9 Annotation of the glycopeptide antibiotic teicoplanin involves lsquochopping uprsquo the molecule into its component chemical entities that are vali-
dated against the CCD The bonds highlighted in yellow demarcate the individual entities
Page 8 of 11 Database Vol 2014 Article ID bau116
by guest on Novem
ber 22 2016httpdatabaseoxfordjournalsorg
Dow
nloaded from
Validation of ligands in the PDB
The wwPDB structure validation process includes geometry checks for all standard and non-standard amino acids and nucleotides, as well as carbohydrates and ligands. Since 2008, wwPDB has convened several method-specific validation task forces (VTFs), composed of community experts, to provide recommendations on structure and data validation (19–21). The recommendations of the X-ray VTF have been implemented in a software pipeline that is used during deposition and annotation, as a stand-alone server, and for validation of all legacy crystal structures in the archive (22). The validation pipeline uses a variety of external and internal software. To assess bond lengths, bond angles, torsion angles and chirality for every small molecule present in an entry, the pipeline uses the program Mogul (23), which compares the geometry to that encountered in high-quality small-molecule structures in the CSD (24). The validation pipeline also analyses the fit of ligands to the electron density using approaches introduced by the Uppsala Electron-Density Server (25). For every ligand in an entry, the local ligand density fit (LLDF) is calculated; this is essentially a Z-score of the ligand's real-space R-value (RSR) (26) relative to the mean and standard deviation of the RSR values of the neighbouring polymeric standard residues and nucleotides (taking crystallographic symmetry into account), using a cut-off distance of 5 Å from any ligand atom. LLDF values >2 are highlighted in the report produced by the validation pipeline (Figure 10). The validation report generated during the annotation process is sent back to the depositor so that it can be included when a paper describing the structure is submitted for publication. Upon release of an entry, the validation pipeline is run on it again and the resulting report is made publicly available.
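The LLDF calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the RSR values of the ligand and of its neighbouring residues are taken as already computed, the function name `lldf` is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator the validation pipeline uses.

```python
from statistics import mean, pstdev


def lldf(ligand_rsr, neighbour_rsrs):
    """Z-score of a ligand's RSR relative to the RSRs of its neighbours."""
    if len(neighbour_rsrs) < 2:
        raise ValueError("need at least two neighbouring residues")
    mu = mean(neighbour_rsrs)
    # Assumption: population standard deviation; the source does not specify.
    sigma = pstdev(neighbour_rsrs)
    if sigma == 0:
        raise ValueError("neighbour RSR values have zero spread")
    return (ligand_rsr - mu) / sigma


# Hypothetical RSR values for polymer residues within 5 Å of the ligand.
neighbours = [0.10, 0.12, 0.08, 0.10]
score = lldf(0.30, neighbours)
flagged = score > 2  # the report highlights LLDF values above 2
```

A ligand whose RSR is much worse than that of its surroundings, as in this made-up example, receives a large positive score and would be highlighted.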
Figure 10. σA-weighted 2Fo-Fc map of a carbohydrate-binding protein shown at a contour level of 0.35 e/Å^3. Very little electron density is observed for the oligosaccharide molecule. This is reflected in the high LLDF values (shown in parentheses) for each of the component carbohydrate moieties.
Conclusions and future directions
The PDB archive contains a wide variety of chemical entities, many of which play a crucial role in modulating biochemical reactions. Correct identification and representation of these small molecules in the PDB at the annotation stage is crucial to allow users to identify and compare them. Every structure deposited to the PDB is dissected to identify the individual chemical entities, which are subsequently compared and validated against the CCD. When a chemical entity matches an existing dictionary entry, it is assigned the same three-character identifier as in the CCD. In addition, the nomenclature of the entity's atoms is updated to reflect the dictionary definition. Once the individual components have been correctly identified, information is added about all the non-standard covalent, H-bonding and ionic interactions that occur in the structure. The annotation software also generates binding-site information for every bound molecule. For all crystal structures, symmetry information is taken into account during the identification of binding sites and non-covalent interactions.
The rapid growth and increased complexity of the structures deposited to the PDB have prompted continuous review of the annotation tools and processes used to process and handle the deposited data. To support the scientific advancements in the field of structural biology, the wwPDB partners are collaborating to develop a new joint deposition and annotation system (27). The new annotation system contains four major modules: a chemical component module, a sequence module, a validation module and a module for derived data. The chemistry module can perform simultaneous searches against both the CCD and BIRD. The new system has an interactive interface and visualization tools that aid in annotation and facilitate correct representation of the data (28). The chemistry module has been tested on previous depositions and has significantly improved processing efficiency and data quality.
Acknowledgements
The 3D images in the manuscript were generated using PyMOL (29). The authors are grateful to Prof. Gerard Kleywegt for helpful comments on this manuscript.
Funding
PDBe is funded by EMBL-EBI, Wellcome Trust (088944), BBSRC (BB/J007471/1, BB/I02576X/1, BB/K016970/1). RCSB is funded by NSF, NIH, DOE (DBI-1338415). PDBj is funded by JST-NBDC, Institute for Protein Research (IPR), Osaka University. The open access fee for the paper entitled 'Small molecule annotation for the protein data bank' will be paid by the Wellcome Trust Grant 088944.
Conflict of interest. None declared.
References
1. Berman, H.M., Henrick, K., Kleywegt, G. et al. (2012) The Worldwide Protein Data Bank. Int. Tables Crystallogr., F, 827–832.
2. Rose, P.W., Bi, C., Bluhm, W.F. et al. (2013) The RCSB Protein Data Bank: new resources for research and education. Nucleic Acids Res., 41, D475–D482.
3. Gutmanas, A., Alhroub, Y., Battle, G.M. et al. (2014) PDBe: Protein Data Bank in Europe. Nucleic Acids Res., 42, D285–D291.
4. Kinjo, A.R., Suzuki, H., Yamashita, R. et al. (2012) Protein Data Bank Japan (PDBj): maintaining a structural data archive and resource description framework format. Nucleic Acids Res., 40, D453–D460.
5. Ulrich, E.L., Akutsu, H., Doreleijers, J.F. et al. (2008) BioMagResBank. Nucleic Acids Res., 36, D402–D408.
6. Dutta, S., Burkhardt, K., Young, J. et al. (2009) Data deposition and annotation at the Worldwide Protein Data Bank. Mol. Biotechnol., 42, 1–13.
7. Berman, H.M., Kleywegt, G.J., Nakamura, H. et al. (2013) How community has shaped the Protein Data Bank. Structure, 21, 1485–1491.
8. Editorial (2014) Hard data: it has been no small feat for the Protein Data Bank to stay relevant for 100,000 structures. Nature, 509, 260.
9. Weininger, D. (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Model., 28, 31–36.
10. Heller, S., McNaught, A., Stein, S. et al. (2013) InChI: the worldwide chemical structure identifier standard. J. Cheminform., 5, 7.
11. Bourne, P.E., Berman, H.M., McMahon, B. et al. (1997) Macromolecular crystallographic information file. Methods Enzymol., 277, 571–590.
12. Gasteiger, J., Rudolph, C. and Sadowski, J. (1990) Automatic generation of 3D-atomic coordinates for organic molecules. Tetrahedron Comp. Method., 3, 537–547.
13. Ihlenfeldt, W.D., Takahasi, Y., Abe, H. et al. (1992) CACTVS: a chemistry algorithm development environment. In: Machida, K., Nishioka, T. (eds), Daijuukagakutouronkai Dainijuukai Kouzoukasseisoukan Shinpojiumu Kouenyoushishuu. Kyoto University Press, Kyoto, pp. 102–105.
14. Stahl, M. and Mauser, H. (2005) Database clustering with a combination of fingerprint and maximum common substructure methods. J. Chem. Inf. Model., 45, 542–548.
15. Winn, M.D., Ballard, C.C., Cowtan, K.D. et al. (2011) Overview of the CCP4 suite and current developments. Acta Crystallogr. D, 67, 235–242.
16. Dutta, S., Dimitropoulos, D., Feng, Z. et al. (2014) Improving the representation of peptide-like inhibitor and antibiotic molecules in the Protein Data Bank. Biopolymers, 101, 659–668.
17. The UniProt Consortium (2011) Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res., 40, D71–D75.
18. Caboche, S., Pupin, M., Leclere, V. et al. (2008) NORINE: a database of nonribosomal peptides. Nucleic Acids Res., 36, D326–D331.
19. Read, R.J., Adams, P.D., Arendall, W.B. et al. (2011) A new generation of crystallographic validation tools for the Protein Data Bank. Structure, 19, 1395–1412.
20. Montelione, G.T., Nilges, M., Bax, A. et al. (2013) Recommendations of the wwPDB NMR Structure Validation Task Force. Structure, 21, 1563–1570.
21. Henderson, R., Sali, A., Baker, M.L. et al. (2012) Outcome of the first electron microscopy validation task force meeting. Structure, 20, 205–214.
22. Gore, S., Velankar, S. and Kleywegt, G.J. (2012) Implementing an X-ray validation pipeline for the Protein Data Bank. Acta Crystallogr. D, 68, 478–483.
23. Bruno, I.J., Cole, J.C., Kessler, M. et al. (2004) Retrieval of crystallographically-derived molecular geometry information. J. Chem. Inf. Model., 44, 2133–2144.
24. Allen, F.H., Bellard, S., Brice, M.D. et al. (1979) The Cambridge Crystallographic Data Centre: computer-based search, retrieval, analysis and display of information. Acta Crystallogr. B, 35, 2331–2339.
25. Kleywegt, G.J., Harris, M.R., Zou, J.Y. et al. (2004) The Uppsala Electron-Density Server. Acta Crystallogr. D, 60, 2240–2249.
26. Jones, T.A., Zou, J.Y., Cowan, S.W. et al. (1991) Improved methods for building protein models in electron density maps and the location of errors in these models. Acta Crystallogr. A, 47, 110–119.
27. Quesada, M., Westbrook, J., Oldfield, T. et al. (2011) The wwPDB common tool for deposition and annotation. Acta Crystallogr. A, 67, C403–C404.
28. Young, J., Feng, Z., Dimitropoulos, D. et al. (2013) Chemical annotation of small and peptide-like molecules at the Protein Data Bank. Database, 2013, 1.
29. The PyMOL Molecular Graphics System, Version 1.2r3pre, Schrödinger, LLC. [Online]
30. Tung, J.Y., Chang, M.D., Chou, W.I. et al. (2008) Crystal structures of the starch-binding domain from Rhizopus oryzae glucoamylase reveal a polysaccharide-binding path. Biochem. J., 416, 27–36.
31. Aubele, D.L., Hom, R.K., Adler, M. et al. (2013) Selective and brain-permeable polo-like kinase-2 (Plk-2) inhibitors that reduce alpha-synuclein phosphorylation in rat brain. ChemMedChem, 8, 1295–1313.
32. Ellis, M.J., Buffey, S.G., Hough, M.A. et al. (2008) On-line optical and X-ray spectroscopies with crystallography: an integrated approach for determining metalloprotein structures in functionally well defined states. J. Synchrotron Radiat., 15, 433–439.
and _struct_site_gen contain the site information for all the ligands present in the structure. The _struct_site category contains residue-level information. Each binding site is identified using a unique alphanumeric identifier (AC1, AC2, ..., ZZ9), which serves as the primary key for the _struct_site category. The _struct_site_gen category contains information about all the neighbouring residues within 3.7 Å of any atom of the ligand. The data item _struct_site_gen.site_id is directly inherited from the _struct_site.id of the _struct_site category (Figure 7). In addition to the binding-site annotation, all covalent bonds between ligands and polymeric residues or other ligands are annotated as covalent linkages in the _struct_conn category of the PDBx/mmCIF file.
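The parent and child relationship between the two categories can be illustrated with a small Python sketch. It assumes the category rows have already been parsed into dictionaries keyed by data item name; the row values, the `details` text and the helper name `residues_by_site` are hypothetical and serve only to show the join on the binding-site identifier.

```python
from collections import defaultdict

# Hypothetical rows keyed by mmCIF data item names; the values are
# illustrative and not copied from any PDB entry.
struct_site = [
    {"id": "AC1", "details": "binding site for ligand 1"},
    {"id": "AC2", "details": "binding site for ligand 2"},
]
struct_site_gen = [
    {"site_id": "AC1", "auth_comp_id": "GLU", "auth_asym_id": "A", "auth_seq_id": "195"},
    {"site_id": "AC1", "auth_comp_id": "HIS", "auth_asym_id": "B", "auth_seq_id": "165"},
    {"site_id": "AC2", "auth_comp_id": "ASP", "auth_asym_id": "B", "auth_seq_id": "167"},
]


def residues_by_site(sites, site_gen):
    """Group neighbouring residues under the binding site they belong to."""
    known_ids = {row["id"] for row in sites}
    grouped = defaultdict(list)
    for row in site_gen:
        site_id = row["site_id"]
        if site_id not in known_ids:
            # _struct_site_gen.site_id must point at an existing _struct_site.id
            raise KeyError(f"unknown binding site: {site_id}")
        grouped[site_id].append(
            f'{row["auth_comp_id"]} {row["auth_asym_id"]} {row["auth_seq_id"]}'
        )
    return dict(grouped)


site_residues = residues_by_site(struct_site, struct_site_gen)
```

Rejecting a child row whose identifier has no parent mirrors the primary-key role that the text gives to the binding-site identifier.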
In protein and DNA structures, metal ions are often found in coordination complexes with other molecules or ions. The metals or metal clusters present in metalloproteins such as haemoglobins, transferrins, cytochromes, nitrogenases and hydrogenases are bound to nitrogen, sulphur or oxygen atoms belonging to protein residues. The details of the bond angles for each of these coordinated metal centres are listed in REMARK 620 of the PDB file (Figure 8) or in the _pdbx_struct_conn_angle category of the PDBx/mmCIF file.
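As an illustration of what such an angle annotation records, the sketch below computes the angle subtended at a metal centre by two coordinating atoms from their Cartesian coordinates. The function name `angle_deg` and the coordinates are hypothetical; this is the standard dot-product formula, not a description of the annotation software itself.

```python
import math


def angle_deg(atom_a, metal, atom_b):
    """Angle atom_a-metal-atom_b in degrees, from Cartesian coordinates in Å."""
    v1 = [p - q for p, q in zip(atom_a, metal)]
    v2 = [p - q for p, q in zip(atom_b, metal)]
    dot = sum(x * y for x, y in zip(v1, v2))
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    # Clamp to [-1, 1] to guard against rounding error before acos.
    cos_theta = max(-1.0, min(1.0, dot / (n1 * n2)))
    return math.degrees(math.acos(cos_theta))


# Hypothetical coordinates: a metal at the origin with two donor atoms
# placed along two vertices of an ideal tetrahedron.
zn = (0.0, 0.0, 0.0)
donor_1 = (1.0, 1.0, 1.0)
donor_2 = (1.0, -1.0, -1.0)
tetrahedral_angle = angle_deg(donor_1, zn, donor_2)
```

For an ideal tetrahedral centre every such angle is close to 109.5°, so large departures from that value indicate a distorted coordination geometry.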
Annotation of peptide-like inhibitors and antibiotics
In 2011, a major remediation was carried out by wwPDB curators to improve the representation of peptide-like inhibitors and antibiotics so that they can be more easily identified and studied (16). Currently, there are nearly 1300 such structures present in the PDB archive.
The presence or absence of consecutive peptide bonds determines how these molecules are represented in the PDB. If a peptide-like molecule contains two or more consecutive peptide bonds, it is annotated as a standard polymer, with information about the recommended name, source organism and linkages between the amino acids and any non-standard residues present in the molecule. Cross-referencing to UniProt (17) for gene products, or to Norine (18) for non-ribosomal peptides, is attempted for such compounds.
If the ligand has fewer than two consecutive peptide bonds (examples include several PPACK inhibitors and other peptide-like molecules), it is annotated as a non-polymer like any other small molecule in the CCD, but with an identifiable subcomponent sequence running from the amino (N) terminus to the carboxy (C) terminus.
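The polymer versus non-polymer rule above can be written as a short decision function. This is a sketch under assumptions: the linkages along the molecule are taken as already classified, in order, as peptide bonds or not, and the function names are hypothetical rather than part of any wwPDB tool.

```python
def longest_peptide_run(linkages):
    """Length of the longest run of consecutive peptide bonds.

    `linkages` lists the bonds between successive components in order,
    with True marking a peptide bond (hypothetical input format).
    """
    best = run = 0
    for is_peptide in linkages:
        run = run + 1 if is_peptide else 0
        best = max(best, run)
    return best


def representation(linkages):
    """Apply the rule from the text: two or more consecutive peptide bonds
    give a polymer, fewer than two give a non-polymer."""
    return "polymer" if longest_peptide_run(linkages) >= 2 else "non-polymer"


tripeptide_like = representation([True, True])          # two consecutive bonds
interrupted = representation([True, False, True])       # never two in a row
```

Note that the rule counts consecutive bonds, so a molecule with two peptide bonds separated by another linkage still falls into the non-polymer class.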
Antibiotics such as teicoplanin and vancomycin have sugars or fatty acids attached to their peptide core. Such compounds are represented as a 'group' (16), which includes the polymeric and non-polymeric constituents of the antibiotic along with details about all the linkages between them (Figure 9).
Figure 3. α-D-Glucose can form α(1–4) glycosidic linkages with other carbohydrate molecules. During the oligomerization process, the O1 oxygen (highlighted in the figure) of the glucose is eliminated by the O4 oxygen of the other carbohydrate. To account for this condensation reaction, the O1 oxygen of α-D-glucose (GLC) is annotated in the CCD as a leaving atom. The two-dimensional diagram in this figure is a copy of the image from the RCSB PDB website (http://rcsb.org/pdb/ligand/ligandsummary.do?hetId=GLC). It was generated using the ChemAxon software (http://www.chemaxon.com).
Figure 4. Seven α-D-glucose (GLC) molecules undergo a condensation reaction to form the circular oligosaccharide β-cyclodextrin [from PDB entry 2v8l (30)].
Figure 5. Binding site for the Plk-2 inhibitor (7R)-8-cyclopentyl-7-ethyl-5-methyl-7,8-dihydropteridin-6(5H)-one (three-letter code 11G) in PDB entry 4i6b (31). The figure depicts the neighbouring residues that are within 3.7 Å of the ligand 11G.
Figure 6. The environment for the oligosaccharide poly-N-acetylglucosamine (PNAG) is annotated instead of listing the environment of individual sugars. This avoids repeating the same sugar molecule in multiple binding sites.
Figure 7. Diagram showing the relationship between the _struct_site and _struct_site_gen categories used for annotation of ligand-binding sites. The _struct_site category holds information about the ligands that are present in the PDB entry, and every ligand in this category is assigned an alphanumeric binding-site identifier. The _struct_site_gen category contains information about the residues that are present within the vicinity of the ligands described in the _struct_site category. The two categories are joined by the binding-site identifier.
BIRD
The large amount of chemical and biological information that emerged during the remediation of the peptide-like antibiotic and inhibitor molecules in the PDB required systematic referencing and documenting for curation of future depositions. A new resource, containing both chemical and biological information, was created to assist the future annotation of such compounds. The chemical information is stored in Peptide Reference Dictionary (PRD) files, whereas the biological function is documented in a separate family file. The chemical and biological information are annotated separately so that chemically similar molecules, either with a conserved core polymer sequence or with signature sequence motifs, can be grouped under the same family of antibiotics or peptide-like molecules. For example, antibiotics such as chlorionectin A, vancomycin, teicoplanin and balhimycin are part of the glycopeptide antibiotic family. For each chemically distinct peptide-like molecule, a new PRD description is created with a unique identifier and information about its chemical composition, connectivity, structural description (e.g. glycopeptide, anthracycline, lipopeptide, etc.) and function (e.g. immunosuppressant, enzyme inhibitor, thrombin inhibitor, etc.). Information about a family of molecules is stored in a FAM file, which mainly includes biological annotation (such as function, mechanism of action and
BIRD
The large amount of chemical and biological information
that emerged during the remediation of the peptide-like
antibiotic and inhibitor molecules in the PDB required sys-
tematic referencing and documenting for curation of future
depositions A new resource was created to assist the future
annotation of such compounds containing both chemical
and biological information The chemical information is
stored in Peptide Reference Dictionary (PRD) files
whereas the biological function is documented in a separ-
ate family file The chemical and biological information
are separately annotated so that chemically similar
Figure 5 Binding site for the Plk-2 inhibitor (7R)-8-cyclopentyl-7-ethyl-5-methyl-78-dihydropteridin-6(5H)-one (3 letter code 11 G) in PDB entry 4i6b
(31) The figure depicts the neighbouring residues that are within 37 A of the ligand 11 G
Figure 6 The environment for the oligosaccharide poly-N-acetylglucosamine (PNAG) is annotated instead of listing the environment of individual
sugars This avoids repeating the same sugar molecule in multiple binding sites
Page 6 of 11 Database Vol 2014 Article ID bau116
by guest on Novem
ber 22 2016httpdatabaseoxfordjournalsorg
Dow
nloaded from
molecules either with conserved core polymer sequence or
signature sequence motifs can be grouped under the same
family of antibiotics or peptide-like molecules For ex-
ample antibiotics such as chlorionectin A vancomycin
teicoplanin and balhimycin are part of the glycopeptide
antibiotic family For each chemically distinct peptide-like
molecule a new PRD description is created with a unique
identifier and information about its chemical composition
connectivity structural description (eg glycopeptide
anthracycline lipopeptide etc) and function (eg
immunosuppressant enzyme inhibitor thrombin inhibitor
etc) Information about a family of molecules is stored in
a FAM file which mainly includes biological annotation
(such as function mechanism of action and
Figure 7 Diagram showing the relationship between the_struct_site and _struct_site_gen categories used for annotation of ligand-binding sites The
_struct_site category holds information about the ligands that are present in the PDB entry and every ligand in this category is assigned a alphanu-
meric binding site identifier The _struct_site_gen category contains information of the residues that are present within the vicinity of the ligands
described in the struct_site category Both the categories are joined by the binding site identifier
Figure 8 Tetrahedrally coordinated Zn ion in entry 2VW4 (32) along with the annotation of the bond angles The REMARK 620 annotation indicates
the software calculated bond angle values between Zn A 503 and its surrounding residues The surrounding residues in anticlockwise direction are
Glu A 195 HisB 165 and Asp B 167 The sidechain carboxylate group of the Glu residue exists in two alternate conformation (A and B conformers)
The angle between GluA195B-Zn-HisB165 is 1178 GluA195B-Zn-Asp(OD1)B167 is 866 HisB165-Zn-Asp(OD1)B167 is 916 Glu195B-Zn-
ASP(OD2)B167 is 1052 HisB165-Zn-Asp(OD2)B167 is 1249 Asp(OD1)B167-Zn-Asp(OD2)B167 is 571 Glu(OE1)A195B-Zn-Glu(OE2)A195A is 241
HisB165-Zn- Glu(OE2)A195A is 1228 Asp(OD1)B167-Zn-Glu(OE2)A195A is 1093 and Asp(OD2)B167-Zn-Glu(OE2)A195A is 1107
Database Vol 2014 Article ID bau116 Page 7 of 11
by guest on Novem
ber 22 2016httpdatabaseoxfordjournalsorg
Dow
nloaded from
pharmacological action) and the list of family members in
the PRD (16) Remediated and newly released PDB entries
containing peptide-like compounds include PRD identifiers
in the PDBx files The new resource comprising both
chemical and biological information for these peptide-like
molecules is called the BIRD Both the PRD and family
files are distributed via the wwPDB ftp area (ftpftp
wwpdborgpubpdbdatabird) Currently there are nearly
200 families and more than 700 PRD entries in BIRD The
release of any new PRD entries or families is synchronized
with the release of the first PDB entry containing the cor-
responding peptide-like molecule
Figure 9. Annotation of the glycopeptide antibiotic teicoplanin involves ‘chopping up’ the molecule into its component chemical entities, which are validated against the CCD. The bonds highlighted in yellow demarcate the individual entities.
Validation of ligands in the PDB
The wwPDB structure validation process includes geometry checks for all standard and non-standard amino acids and nucleotides, as well as carbohydrates and ligands. Since 2008, wwPDB has convened several method-specific validation task forces (VTFs), composed of community experts, to provide recommendations on structure and data validation (19–21). The recommendations of the X-ray VTF have been implemented in a software pipeline that is used during deposition and annotation, as a stand-alone server and for validation of all legacy crystal structures in the archive (22). The validation pipeline uses a variety of external and internal software. To assess bond lengths, bond angles, torsion angles and chirality for every small molecule present in an entry, the pipeline uses the program Mogul (23), which compares the geometry to that encountered in high-quality small-molecule structures in the CSD (24). The validation pipeline also analyses the fit of ligands to the electron density using approaches introduced by the Uppsala Electron-Density Server (25). For every ligand in an entry, the local ligand density fit (LLDF) is calculated; this is essentially a Z-score of the ligand’s real-space R-value (RSR) (26) relative to the mean and standard deviation of the RSR values of the neighbouring polymeric standard residues and nucleotides (taking crystallographic symmetry into account), using a cut-off distance of 5 Å from any ligand atom. LLDF values >2 are highlighted in the report produced by the validation pipeline (Figure 10). The validation report generated during the annotation process is sent back to the depositor so that it can be included when a paper describing the structure is submitted for publication. Upon release of an entry, the validation pipeline is run on it again and the resulting report is made publicly available.
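The LLDF definition above amounts to a standard Z-score, which the following Python sketch illustrates. It rests on several assumptions: the function name lldf and the RSR values are hypothetical, the population standard deviation is used because the text does not say which estimator the pipeline applies, and the selection of neighbouring residues within 5 Å (including symmetry mates) is taken as already done.

```python
from statistics import mean, pstdev

LLDF_THRESHOLD = 2.0  # values above this are highlighted in the validation report


def lldf(ligand_rsr, neighbour_rsrs):
    """Z-score of a ligand's RSR relative to the RSRs of its neighbouring residues."""
    if len(neighbour_rsrs) < 2:
        raise ValueError("at least two neighbouring residues are needed")
    mu = mean(neighbour_rsrs)
    # Assumption: population standard deviation; the pipeline's estimator is not stated.
    sigma = pstdev(neighbour_rsrs)
    if sigma == 0:
        raise ValueError("neighbouring RSR values show no spread")
    return (ligand_rsr - mu) / sigma


# Hypothetical RSR values for polymer residues within 5 Å of the ligand.
neighbours = [0.10, 0.12, 0.14, 0.12, 0.12]
score = lldf(0.30, neighbours)
print(f"LLDF = {score:.2f}, highlighted: {score > LLDF_THRESHOLD}")
```

Because the score is relative to the local environment, a ligand with a poor RSR in a poorly ordered region is penalized less than the same RSR next to well-ordered residues.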
Figure 10. σA-weighted 2Fo-Fc map of a carbohydrate-binding protein, shown at a contour level of 0.35 e/Å^3. Very little electron density is observed for the oligosaccharide molecule. This is reflected in the high LLDF values (shown in parentheses) for each of the component carbohydrate moieties.
Conclusions and future directions
The PDB archive contains a wide variety of chemical entities, many of which play a crucial role in modulating biochemical reactions. Correct identification and representation of these small molecules at the annotation stage is crucial so that users can identify and compare them. Every structure deposited to the PDB is dissected to identify the individual chemical entities, which are subsequently compared and validated against the CCD. When a chemical entity matches an existing dictionary entry, it is assigned the same three-character identifier as in the CCD. In addition, the nomenclature of the entity’s atoms is updated to reflect the dictionary definition. Once the individual components have been correctly identified, information is added about all the non-standard covalent, H-bonding and ionic interactions that occur in the structure. The annotation software also generates binding-site information for every bound molecule. For all crystal structures, symmetry information is taken into account during the identification of binding sites and non-covalent interactions.
The rapid growth and increased complexity of the structures deposited to the PDB have prompted continuous review of the annotation tools and processes used for processing and handling the deposited data. To support scientific advances in the field of structural biology, the wwPDB partners are collaborating to develop a new joint deposition and annotation system (27). The new annotation system contains four major modules: a chemical component module, a sequence module, a validation module and a module for derived data. The chemistry module can perform simultaneous searches against both the CCD and BIRD. The new system has an interactive interface and visualization tools that aid in annotation and facilitate correct representation of the data (28). The chemistry module has been tested on previous depositions and has significantly improved processing efficiency and data quality.
Acknowledgements
The 3D images in the manuscript were generated using PyMOL (29). The authors are grateful to Prof. Gerard Kleywegt for helpful comments on this manuscript.
Funding
PDBe is funded by EMBL-EBI, Wellcome Trust (088944), BBSRC (BB/J007471/1, BB/I02576X/1, BB/K016970/1). RCSB is funded by NSF, NIH, DOE (DBI-1338415). PDBj is funded by JST-NBDC, Institute for Protein Research (IPR), Osaka University. The open access fee for the paper entitled ‘Small molecule annotation for the protein data bank’ will be paid by the Wellcome Trust (Grant 088944).
Conflict of interest. None declared.
References
1. Berman HM, Henrick K, Kleywegt G et al. (2012) The Worldwide Protein Data Bank. Int Tables Crystallogr, F, 827–832.
2. Rose PW, Bi C, Bluhm WF et al. (2013) The RCSB Protein Data Bank: new resources for research and education. Nucleic Acids Res, 41, D475–D482.
3. Gutmanas A, Alhroub Y, Battle GM et al. (2014) PDBe: Protein Data Bank in Europe. Nucleic Acids Res, 42, D285–D291.
4. Kinjo AR, Suzuki H, Yamashita R et al. (2012) Protein Data Bank Japan (PDBj): maintaining a structural data archive and resource description framework format. Nucleic Acids Res, 40, D453–D460.
5. Ulrich EL, Akutsu H, Doreleijers JF et al. (2008) BioMagResBank. Nucleic Acids Res, 36, D402–D408.
6. Dutta S, Burkhardt K, Young J et al. (2009) Data deposition and annotation at the Worldwide Protein Data Bank. Mol Biotechnol, 42, 1–13.
7. Berman HM, Kleywegt GJ, Nakamura H et al. (2013) How community has shaped the Protein Data Bank. Structure, 21, 1485–1491.
8. Editorial (2014) Hard data: it has been no small feat for the Protein Data Bank to stay relevant for 100,000 structures. Nature, 509, 260.
9. Weininger D (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Model, 28, 31–36.
10. Heller S, McNaught A, Stein S et al. (2013) InChI - the worldwide chemical structure identifier standard. J Cheminform, 5, 7.
11. Bourne PE, Berman HM, McMahon B et al. (1997) Macromolecular crystallographic information file. Methods Enzymol, 277, 571–590.
12. Gasteiger J, Rudolph C and Sadowski J (1990) Automatic generation of 3D-atomic coordinates for organic molecules. Tetrahedron Comp Method, 3, 537–547.
13. Ihlenfeldt WD, Takahasi Y, Abe H et al. (1992) CACTVS: a chemistry algorithm development environment. In: Machida K, Nishioka T (eds). Daijuukagakutouronkai Dainijuukai Kouzoukasseisoukan Shinpojiumu Kouenyoushishuu. Kyoto University Press, Kyoto, pp. 102–105.
14. Stahl M and Mauser H (2005) Database clustering with a combination of fingerprint and maximum common substructure methods. J Chem Inf Model, 45, 542–548.
15. Winn MD, Ballard CC, Cowtan KD et al. (2011) Overview of the CCP4 suite and current developments. Acta Crystallogr D, D67, 235–242.
16. Dutta S, Dimitropoulos D, Feng Z et al. (2014) Improving the representation of peptide-like inhibitor and antibiotic molecules in the Protein Data Bank. Biopolymers, 101, 659–668.
17. The UniProt Consortium (2011) Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res, 40, D71–D75.
18. Caboche S, Pupin M, Leclere V et al. (2008) NORINE: a database of nonribosomal peptides. Nucleic Acids Res, 36, D326–D331.
19. Read RJ, Adams PD, Arendall WB et al. (2011) A new generation of crystallographic validation tools for the Protein Data Bank. Structure, 19, 1395–1412.
20. Montelione GT, Nilges M, Bax A et al. (2013) Recommendations of the wwPDB NMR Structure Validation Task Force. Structure, 21, 1563–1570.
21. Henderson R, Sali A, Baker ML et al. (2012) Outcome of the first electron microscopy validation task force meeting. Structure, 20, 205–214.
22. Gore S, Velankar S and Kleywegt GJ (2012) Implementing an X-ray validation pipeline for the Protein Data Bank. Acta Crystallogr, D68, 478–483.
23. Bruno IJ, Cole JC, Kessler M et al. (2004) Retrieval of crystallographically-derived molecular geometry information. J Chem Inf Model, 44, 2133–2144.
24. Allen FH, Bellard S, Brice MD et al. (1979) The Cambridge Crystallographic Data Centre: computer-based search, retrieval, analysis and display of information. Acta Crystallogr B, 35, 2331–2339.
25. Kleywegt GJ, Harris MR, Zou JY et al. (2004) The Uppsala Electron-Density Server. Acta Crystallogr D, 60, 2240–2249.
26. Jones TA, Zou JY, Cowan SW et al. (1991) Improved methods for building protein models in electron density maps and the location of errors in these models. Acta Crystallogr A, 47, 110–119.
27. Quesada M, Westbrook J, Oldfield T et al. (2011) The wwPDB common tool for deposition and annotation. Acta Crystallogr, A67, C403–C404.
28. Young J, Feng Z, Dimitropoulos D et al. (2013) Chemical annotation of small and peptide-like molecules at the Protein Data Bank. Database, 2013, 1.
29. The PyMOL Molecular Graphics System, Version 1.2r3pre, Schrödinger, LLC. [Online]
30. Tung JY, Chang MD, Chou WI et al. (2008) Crystal structures of the starch-binding domain from Rhizopus oryzae glucoamylase reveal a polysaccharide-binding path. Biochem J, 416, 27–36.
31. Aubele DL, Hom RK, Adler M et al. (2013) Selective and brain-permeable polo-like kinase-2 (Plk-2) inhibitors that reduce alpha-synuclein phosphorylation in rat brain. ChemMedChem, 8, 1295–1313.
32. Ellis MJ, Buffey SG, Hough MA et al. (2008) On-line optical and X-ray spectroscopies with crystallography: an integrated approach for determining metalloprotein structures in functionally well defined states. J Synchrotron Radiat, 15, 433–439.
functionally well defined states J Synchrotron Radiat 15
433ndash439
Database Vol 2014 Article ID bau116 Page 11 of 11
by guest on Novem
ber 22 2016httpdatabaseoxfordjournalsorg
Dow
nloaded from
Validation of ligands in the PDB
The wwPDB structure validation process includes geometry checks
for all standard and non-standard amino acids and nucleotides, as
well as carbohydrates and ligands. Since 2008, wwPDB has convened
several method-specific validation task forces (VTFs), composed of
community experts, to provide recommendations on structure and data
validation (19–21). The recommendations of the X-ray VTF have been
implemented in a software pipeline that is used during deposition
and annotation, as a stand-alone server, and for validation of all
legacy crystal structures in the archive (22). The validation
pipeline uses a variety of external and internal software. To assess
bond lengths, bond angles, torsion angles and chirality for every
small molecule present in an entry, use is made of the program Mogul
(23), which compares the geometry to that encountered in high-quality
small-molecule structures in the CSD (24). The validation pipeline
also analyses the fit of ligands to the electron density using
approaches introduced by the Uppsala Electron-Density Server (25).
For every ligand in an entry, the local ligand density fit (LLDF) is
calculated; this is essentially a Z-score of the ligand's real-space
R-value (RSR) (26) relative to the mean and standard deviation of the
RSR values of the neighbouring polymeric standard residues and
nucleotides (taking crystallographic symmetry into account), using a
cut-off distance of 5 Å from any ligand atom. LLDF values >2 are
highlighted in the report produced by the validation pipeline
(Figure 10). The validation report generated during the annotation
process is sent back to the depositor so it can be included when a
paper describing the structure is submitted for publication. Upon
release of an entry, the validation pipeline is run on it again and
the resulting report is made publicly available.
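The LLDF calculation described in this section can be illustrated with a minimal Python sketch. It is not the validation pipeline's own code: the function names, the input layout (plain coordinate tuples and RSR values per residue) and the use of the population standard deviation are assumptions made for illustration, and symmetry-related neighbours are assumed to have been generated beforehand.

```python
import math
from statistics import mean, pstdev

CUTOFF_ANGSTROM = 5.0   # neighbour cut-off distance stated in the text
LLDF_THRESHOLD = 2.0    # scores above this are highlighted in the report


def neighbouring_residues(ligand_atoms, residues, cutoff=CUTOFF_ANGSTROM):
    """Return ids of residues with an atom within `cutoff` of any ligand atom.

    Assumption: `residues` maps a residue id to a list of (x, y, z) tuples
    and already contains any symmetry-related copies.
    """
    return [
        res_id
        for res_id, atoms in residues.items()
        if any(math.dist(a, b) <= cutoff for a in atoms for b in ligand_atoms)
    ]


def lldf(ligand_rsr, neighbour_rsrs):
    """Z-score of the ligand RSR against the neighbouring residues' RSRs."""
    if len(neighbour_rsrs) < 2:
        raise ValueError("need at least two neighbouring residues")
    sigma = pstdev(neighbour_rsrs)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("neighbouring RSR values have zero spread")
    return (ligand_rsr - mean(neighbour_rsrs)) / sigma


def is_highlighted(score, threshold=LLDF_THRESHOLD):
    """True when the score exceeds the reporting threshold."""
    return score > threshold


if __name__ == "__main__":
    # Hypothetical toy data: three residues, one single-atom ligand.
    residues = {"A1": [(0.0, 0.0, 0.0)], "A2": [(3.0, 0.0, 0.0)],
                "A3": [(20.0, 0.0, 0.0)]}
    rsr = {"A1": 0.10, "A2": 0.20, "A3": 0.50}
    near = neighbouring_residues([(1.0, 0.0, 0.0)], residues)
    score = lldf(0.45, [rsr[r] for r in near])
    print(near, round(score, 2), is_highlighted(score))
```

A ligand whose RSR is much worse than that of its surroundings gives a large positive score, which is the situation illustrated by Figure 10.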
Figure 10. σA-weighted 2Fo-Fc map of a carbohydrate-binding protein,
shown at a contour level of 0.35 e/Å³. Very little electron density is
observed for the oligosaccharide molecule. This is reflected in the
high LLDF values (shown in parentheses) for each of the component
carbohydrate moieties.
Conclusions and future directions
The PDB archive contains a wide variety of chemical entities, many of
which play a crucial role in modulating biochemical reactions. Correct
identification and representation of these small molecules at the
annotation stage is crucial so that users can identify and compare
them. Every structure deposited to the PDB is dissected to identify
the individual chemical entities, which are subsequently compared and
validated against the CCD. When a chemical entity matches an existing
dictionary entry, it is assigned the same three-character identifier
as in the CCD. In addition, the nomenclature of the entity's atoms is
updated to reflect the dictionary definition. Once the individual
components have been correctly identified, information is added about
all the non-standard covalent, H-bonding and ionic interactions that
occur in the structure. The annotation software also generates
binding-site information for every bound molecule. For all crystal
structures, symmetry information is taken into account during the
identification of binding sites and non-covalent interactions.
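The dictionary-matching step summarized above can be sketched as a simple lookup. The following Python fragment is a deliberately simplified illustration, not the wwPDB annotation software: the one-entry dictionary, the choice of an InChI string as the lookup key, and the names `DictionaryEntry`, `TOY_DICTIONARY` and `annotate_component` are all hypothetical. It also assumes that the deposited atoms are already listed in the dictionary's atom order, whereas a real system would have to derive that correspondence by comparing the chemical graphs.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class DictionaryEntry:
    comp_id: str                 # three-character identifier
    atom_names: Tuple[str, ...]  # atom names in dictionary order


# Toy stand-in for the CCD, keyed by a structure descriptor
# (assumption: an InChI string is used as the key).
TOY_DICTIONARY: Dict[str, DictionaryEntry] = {
    "InChI=1S/H2O/h1H2": DictionaryEntry("HOH", ("O", "H1", "H2")),
}


def annotate_component(
    descriptor: str,
    deposited_atom_names: List[str],
    dictionary: Dict[str, DictionaryEntry],
) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (identifier, {deposited name: dictionary name}) on a match.

    Returns None when the descriptor is not in the dictionary, i.e. the
    molecule would need a new dictionary definition.
    Assumption: deposited atoms are given in the dictionary's atom order.
    """
    entry = dictionary.get(descriptor)
    if entry is None:
        return None
    if len(deposited_atom_names) != len(entry.atom_names):
        raise ValueError("atom count differs from the dictionary definition")
    renaming = dict(zip(deposited_atom_names, entry.atom_names))
    return entry.comp_id, renaming


if __name__ == "__main__":
    print(annotate_component("InChI=1S/H2O/h1H2", ["OW", "HW1", "HW2"],
                             TOY_DICTIONARY))
```

Keying the lookup on a canonical descriptor keeps the example short; it says nothing about how the production tools rank partial or ambiguous matches.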
The rapid growth and increased complexity of the structures deposited
to the PDB have prompted continuous review of the annotation tools and
processes used to process and handle the deposited data. To support
scientific advances in the field of structural biology, the wwPDB
partners are collaborating to develop a new joint deposition and
annotation system (27). The new annotation system contains four major
modules: a chemical component module, a sequence module, a validation
module and a module for derived data. The chemistry module can perform
simultaneous searches against both the CCD and BIRD. The new system
has an interactive interface and visualization tools that aid in
annotation and facilitate correct representation of the data (28). The
chemistry module has been tested on previous depositions and has
significantly improved processing efficiency and data quality.
Acknowledgements
The 3D images in the manuscript were generated using PyMOL (29). The
authors are grateful to Prof. Gerard Kleywegt for helpful comments on
this manuscript.
Funding
PDBe is funded by EMBL-EBI, Wellcome Trust (088944), BBSRC
(BB/J007471/1, BB/I02576X/1, BB/K016970/1). RCSB is funded by NSF,
NIH, DOE (DBI-1338415). PDBj is funded by JST-NBDC, Institute for
Protein Research (IPR), Osaka University. The open access fee for
paper entitled 'Small molecule annotation for the protein data bank'
will be paid by the Wellcome Trust Grant 088944.
Conflict of interest. None declared.
References
1. Berman, H.M., Henrick, K., Kleywegt, G. et al. (2012) The
Worldwide Protein Data Bank. Int. Tables Crystallogr., F, 827–832.
2. Rose, P.W., Bi, C., Bluhm, W.F. et al. (2013) The RCSB Protein
Data Bank: new resources for research and education. Nucleic
Acids Res., 41, D475–D482.
3. Gutmanas, A., Alhroub, Y., Battle, G.M. et al. (2014) PDBe: Protein
Data Bank in Europe. Nucleic Acids Res., 42, D285–D291.
4. Kinjo, A.R., Suzuki, H., Yamashita, R. et al. (2012) Protein Data
Bank Japan (PDBj): maintaining a structural data archive and
resource description framework format. Nucleic Acids Res., 40,
D453–D460.
5. Ulrich, E.L., Akutsu, H., Doreleijers, J.F. et al. (2008)
BioMagResBank. Nucleic Acids Res., 36, D402–D408.
6. Dutta, S., Burkhardt, K., Young, J. et al. (2009) Data deposition
and annotation at the Worldwide Protein Data Bank. Mol.
Biotechnol., 42, 1–13.
7. Berman, H.M., Kleywegt, G.J., Nakamura, H. et al. (2013) How
community has shaped the Protein Data Bank. Structure, 21,
1485–1491.
8. Editorial (2014) Hard data: it has been no small feat for the
Protein Data Bank to stay relevant for 100,000 structures.
Nature, 509, 260.
9. Weininger, D. (1988) SMILES, a chemical language and information
system. 1. Introduction to methodology and encoding rules.
J. Chem. Inf. Model., 28, 31–36.
10. Heller, S., McNaught, A., Stein, S. et al. (2013) InChI - the
worldwide chemical structure identifier standard.
J. Cheminform., 5, 7.
11. Bourne, P.E., Berman, H.M., McMahon, B. et al. (1997)
Macromolecular crystallographic information file. Methods
Enzymol., 277, 571–590.
12. Gasteiger, J., Rudolph, C. and Sadowski, J. (1990) Automatic
generation of 3D-atomic coordinates for organic molecules.
Tetrahedron Comp. Method., 3, 537–547.
13. Ihlenfeldt, W.D., Takahasi, Y., Abe, H. et al. (1992) CACTVS: a
chemistry algorithm development environment. In: Machida, K. and
Nishioka, T. (eds), Daijuukagakutouronkai Dainijuukai
Kouzoukasseisoukan Shinpojiumu Kouenyoushishuu. Kyoto
University Press, Kyoto, pp. 102–105.
14. Stahl, M. and Mauser, H. (2005) Database clustering with a
combination of fingerprint and maximum common substructure
methods. J. Chem. Inf. Model., 45, 542–548.
15. Winn, M.D., Ballard, C.C., Cowtan, K.D. et al. (2011) Overview
of the CCP4 suite and current developments. Acta Crystallogr.
D, 67, 235–242.
16. Dutta, S., Dimitropoulos, D., Feng, Z. et al. (2014) Improving the
representation of peptide-like inhibitor and antibiotic molecules
in the Protein Data Bank. Biopolymers, 101, 659–668.
17. The UniProt Consortium (2011) Reorganizing the protein space
at the Universal Protein Resource (UniProt). Nucleic Acids Res.,
40, D71–D75.
18. Caboche, S., Pupin, M., Leclere, V. et al. (2008) NORINE: a
database of nonribosomal peptides. Nucleic Acids Res., 36,
D326–D331.
19. Read, R.J., Adams, P.D., Arendall, W.B. et al. (2011) A new
generation of crystallographic validation tools for the Protein
Data Bank. Structure, 19, 1395–1412.
20. Montelione, G.T., Nilges, M., Bax, A. et al. (2013)
Recommendations of the wwPDB NMR Structure Validation
Task Force. Structure, 21, 1563–1570.
21. Henderson, R., Sali, A., Baker, M.L. et al. (2012) Outcome of the
first electron microscopy validation task force meeting.
Structure, 20, 205–214.
22. Gore, S., Velankar, S. and Kleywegt, G.J. (2012) Implementing an
X-ray validation pipeline for the Protein Data Bank. Acta
Crystallogr. D, 68, 478–483.
23. Bruno, I.J., Cole, J.C., Kessler, M. et al. (2004) Retrieval of
crystallographically-derived molecular geometry information.
J. Chem. Inf. Model., 44, 2133–2144.
24. Allen, F.H., Bellard, S., Brice, M.D. et al. (1979) The Cambridge
Crystallographic Data Centre: computer-based search, retrieval,
analysis and display of information. Acta Crystallogr. B, 35,
2331–2339.
25. Kleywegt, G.J., Harris, M.R., Zou, J.Y. et al. (2004) The
Uppsala Electron-Density Server. Acta Crystallogr. D, 60, 2240–2249.
26. Jones, T.A., Zou, J.Y., Cowan, S.W. et al. (1991) Improved methods
for building protein models in electron density maps and the
location of errors in these models. Acta Crystallogr. A, 47, 110–119.
27. Quesada, M., Westbrook, J., Oldfield, T. et al. (2011) The
wwPDB common tool for deposition and annotation. Acta
Crystallogr. A, 67, C403–C404.
28. Young, J., Feng, Z., Dimitropoulos, D. et al. (2013) Chemical
annotation of small and peptide-like molecules at the Protein Data
Bank. Database, 2013, 1.
29. The PyMOL Molecular Graphics System, Version 1.2r3pre,
Schrödinger, LLC [Online].
30. Tung, J.Y., Chang, M.D., Chou, W.I. et al. (2008) Crystal
structures of the starch-binding domain from Rhizopus oryzae
glucoamylase reveal a polysaccharide-binding path. Biochem. J.,
416, 27–36.
31. Aubele, D.L., Hom, R.K., Adler, M. et al. (2013) Selective and
brain-permeable polo-like kinase-2 (Plk-2) inhibitors that reduce
alpha-synuclein phosphorylation in rat brain. ChemMedChem,
8, 1295–1313.
32. Ellis, M.J., Buffey, S.G., Hough, M.A. et al. (2008) On-line
optical and X-ray spectroscopies with crystallography: an
integrated approach for determining metalloprotein structures in
functionally well defined states. J. Synchrotron Radiat., 15,
433–439.