neha-paddillaya
8/9/2019 adv bi unit 1
contained 10,378,022 records with a total of 11,302,156,937 bases;
see the EMBL DB statistics page.
It can be accessed and searched through the SRS system at EBI, or
one can download the entire database as flat files. An example of what
an entry looks like is given for the human raf oncogene protein, ID:
HSRAFR.
GenBank www.ncbi.nlm.nih.gov/Genbank/
The GenBank nucleotide database is maintained by the
primary ones, or have a different organization of the data to better suit
some specific purpose. However, the nucleotide sequences themselves
should always be available in the EMBL/GenBank databases. In this
sense, the databases below are secondary databases.
UniGene www.ncbi.nlm.nih.gov/UniGene/
The UniGene system attempts to process the GenBank sequence data
into a non-redundant set of gene-oriented clusters. Each UniGene
cluster contains sequences that represent a unique gene, as well as
related information such as the tissue types in which the gene has
been expressed and map location.
SGD genome-www.stanford.edu/Saccharomyces/
The Saccharomyces Genome Database (SGD) is a scientific database
of the molecular biology and genetics of the yeast Saccharomyces
cerevisiae.
EBI Genomes www.ebi.ac.uk/genomes/
This web site provides access and statistics for the completed
genomes, and information about ongoing projects.
Genome Biology www.ncbi.nlm.nih.gov/Genomes/
The Genome Biology site at
Ensembl is a joint project between EMBL-EBI and the Sanger Centre
to develop a software system which produces and maintains automatic
annotation of eukaryotic genomes.
Protein Sequence
The two protein sequence databases SWISS-PROT and PIR are
different from the nucleotide databases in that they are both curated.
This means that groups of designated curators (scientists) prepare the
entries from literature and/or contacts with external experts.
SWISS-PROT, TrEMBL www.expasy.ch/sprot/
SWISS-PROT is a protein sequence database which strives to provide
a high level of annotations (such as the description of the function of
a protein, its domains structure, post-translational modifications,
variants, etc.), a minimal level of redundancy and high level of
integration with other databases.
It was started in 1986 by Amos Bairoch in the Department of Medical
Biochemistry at the University of Geneva. This database is generally
considered one of the best protein sequence databases in terms of the
quality of the annotation. Release 39.12 (11 Jan 2001) contained
92,211 entries.
TrEMBL is a computer-annotated supplement of SWISS-PROT
that contains all the translations of EMBL nucleotide sequence entries
not yet integrated in SWISS-PROT. The procedure that is used to
produce it was developed by Rolf Apweiler. Release 15.1 (5 Jan
2001) contained 378,152 entries. The annotation of an entry in
TrEMBL has not (yet) reached the standards required for inclusion
into SWISS-PROT proper.
SWISS-PROT and TrEMBL are developed by the SWISS-PROT
groups at Swiss Institute of Bioinformatics (SIB) and at EBI. The
databases can be accessed and searched through the SRS system at
ExPASy, or one can download the entire database as one single flat
file. An example of what an entry looks like is given for the human raf
oncogene protein, ID KRAF_HUMAN.
The SWISS-PROT database has some legal restrictions: the entries
themselves are copyrighted, but freely accessible and usable by
academic researchers. Commercial companies must pay a license fee
to SIB to use SWISS-PROT.
PIR pir.georgetown.edu/
The Protein Information Resource (PIR) is a division of the
PIR grew out of Margaret Dayhoff's work in the middle of the 1960s.
It strives to be comprehensive, well-organized, accurate, and
consistently annotated. However, it is generally believed that it does
not reach the level of completeness in the entry annotation as does
SWISS-PROT. Although SWISS-PROT and PIR overlap extensively,
there are still many sequences which can be found in only one of
them.
One can search for entries or do sequence similarity searches at the
PIR site. The database can also be downloaded as a set of files. An
example of what an entry looks like is given for the human raf-1
oncogene protein, ID TVHUF6.
PIR also produces the NRL-3D, which is a database of sequences
extracted from the three-dimensional structures in the Protein
Databank (PDB) (see also the following page in this lecture). The
domain, it contains a multiple alignment of a set of defining
sequences (the seeds) and the other sequences in SWISS-PROT and
TrEMBL that can be matched to that alignment.
The database was started in 1996 and is maintained by a consortium
of scientists, among them Erik Sonnhammer (CGR, KI, Sweden),
Sean Eddy (WashU, St Louis USA), Richard Durbin, Alan Bateman
and Ewan Birney (Sanger Centre, UK). Release 5.5 (Sep 2000)
contains 2478 families.
The alignments can be converted into hidden Markov
models (HMM), which can be used to search for domains in a query
protein sequence. The software HMMER (by Sean Eddy) is the
computational foundation for Pfam. The domain structure of protein
sequences in SWISS-PROT and TrEMBL is available directly from
the Pfam web sites, and it is also possible to search for domains in
other sequences using servers at the web sites.
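The idea of turning a seed alignment into a model that scores a query sequence can be illustrated with a much simpler stand-in than a real profile HMM. The sketch below is not HMMER and not Pfam's method: it assumes a gapless seed alignment, a uniform background distribution and a fixed pseudocount, and it ignores the insert and delete states that a profile HMM models. The seed sequences and function names are made up for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(seed_alignment, pseudocount=1.0):
    """Per-column log-odds scores versus a uniform background (assumption)."""
    length = len(seed_alignment[0])
    if any(len(row) != length for row in seed_alignment):
        raise ValueError("all seed sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*seed_alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def best_match(profile, query):
    """Slide the profile along the query; return (best score, start index)."""
    width = len(profile)
    best = (float("-inf"), -1)
    for start in range(len(query) - width + 1):
        window = query[start:start + width]
        # Residues outside the 20 standard letters score as neutral (0.0).
        score = sum(col.get(aa, 0.0) for col, aa in zip(profile, window))
        best = max(best, (score, start))
    return best


# Hypothetical four-column seed alignment, for illustration only.
SEEDS = ["GKSA", "GKTA", "GKSL"]
PROFILE = build_profile(SEEDS)
SCORE, START = best_match(PROFILE, "MMMGKSAMMM")
```

Here the best-scoring window starts where the query contains the conserved seed residues. A real profile HMM would additionally allow the match to contain insertions and deletions relative to the seed alignment.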
The technology behind Pfam/HMMER will be discussed in a lecture
later in this course.
The Pfam database can be searched, or used to identify domains in a
sequence, or downloaded from the websites above. An example of a
multiple sequence alignment that defines a protein family (domain) is
given for the Raf-like Ras-binding domain (Pfam name RBD,
accession code PF02196).
The Pfam database is licensed under the G
Primary and Secondary databases.
Primary Databases:
Databases consisting of data derived experimentally, such as nucleotide sequences
and three-dimensional structures, are known as primary databases.
Primary databases (consisting of data derived experimentally):
• have grown tremendously over the years
• contain information on the sequence or structure alone, plus associated
annotation information
Secondary Databases:
Data derived from the analysis or treatment of primary data, such as
secondary structures, hydrophobicity plots, and domains, are stored in secondary
databases.
• contain derived information from a primary database, like information
about conserved sequence, signature sequence and active site residues of
the protein families arrived at by multiple sequence alignment of a set of related
proteins
• a secondary structure database contains entries of the PDB in an organized
way (for instance, by classification of all PDB entries according to structures
like alpha-helix or β-sheets) and also information on conserved secondary
structure motifs of a particular protein
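As a small illustration of "derived" data, a hydrophobicity plot can be computed directly from a primary sequence. The sketch below assumes the Kyte-Doolittle hydropathy scale and a sliding-window average; the window size of 9 and the function name are illustrative choices, not something the notes prescribe.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9):
    """Return the mean hydropathy of every full window along the sequence."""
    seq = sequence.upper()
    if window < 1 or window > len(seq):
        raise ValueError("window must be between 1 and the sequence length")
    scores = [KYTE_DOOLITTLE[residue] for residue in seq]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


# Short made-up peptide: a hydrophobic start followed by charged residues.
PROFILE = hydropathy_profile("ILVKDE", window=3)
```

The values in `PROFILE` are what would be plotted against sequence position; the primary database stores only the sequence, while a secondary resource would store or serve the derived curve.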
Composite databases:
• join a variety of different primary database sources, which obviates the
need to search multiple resources
Primary databases -
GenBank,
• The GenBank sequence database is an open access,
annotated collection of all publicly
available nucleotide sequences and
their protein translations.
• This database is produced and maintained by
the
• In the more than 30 years since its establishment,
GenBank has become the most important and most
influential database for research in almost all biological
fields, whose data are accessed and cited by millions of
researchers around the world.
• GenBank is built by direct submissions from individual
laboratories, as well as from bulk submissions from
large-scale sequencing centers.
• Only original sequences can be submitted to GenBank.
Direct submissions are made to GenBank using BankIt,
which is a Web-based form, or the stand-alone
submission program, Sequin.
• Upon receipt of a sequence submission, the GenBank
staff examines the originality of the data, assigns
an accession number to the sequence and performs
quality assurance checks.
• The submissions are then released to the public database,
where the entries are retrievable by Entrez or
downloadable by FTP.
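Retrieval through Entrez can also be done programmatically. The sketch below only builds a request URL for NCBI's E-utilities `efetch` service, using its `db`, `id`, `rettype` and `retmode` parameters; the accession U49845 is used purely as an example, and the helper name is made up. Nothing is sent over the network here.

```python
from urllib.parse import urlencode

# Base address of the E-utilities efetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_efetch_url(accession):
    """Build an efetch URL asking for one nucleotide record in GenBank flat-file text."""
    params = {
        "db": "nuccore",    # nucleotide database
        "id": accession,    # accession number assigned by GenBank staff
        "rettype": "gb",    # GenBank flat-file format
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


URL = genbank_efetch_url("U49845")
```

Passing `URL` to `urllib.request.urlopen` would download the record, subject to NCBI's usage guidelines; that step is left out so the example runs offline.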
EMBL,
• The EMBL database (www.ebi.ac.uk/embl), maintained at the European
Bioinformatics Institute (EBI) near Cambridge, UK, is a
comprehensive collection of nucleotide sequences and
annotation from available public sources.
• The database is part of an international collaboration
with DDBJ (Japan) and GenBank (USA).
• Data are exchanged daily between the collaborating
institutes.
• Webin is the preferred tool for individual submissions of
nucleotide sequences, including Third Party Annotation
(TPA) and alignments.
• Automated procedures are provided for submissions
from large-scale sequencing projects and data from the
European Patent Office.
• Other tools are available for sequence similarity
searching (e.g. FASTA and BLAST).
DDBJ,
• The DNA Data Bank of Japan (DDBJ) is a biological
database that collects D
• DDBJ is primarily funded by the Japanese Ministry of
Education, Culture, Sports, Science and
Technology (MEXT).
• The principal purpose of DDBJ operations is to improve
the quality of I
• sequence database consists of sequence entries. Sequence
entries are composed of different line types,
each with its own format. For standardization purposes
the format of SWISS-PROT follows as closely as
possible that of the EMBL
• TrEMBL contains high-quality
computationally analyzed records, which are enriched
with automatic annotation.
• It was introduced in response to increased dataflow
resulting from genome projects, as the time- and labour-
consuming manual annotation process of
UniProtKB/Swiss-Prot could not be broadened to include
all available protein sequences.
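The "line types" of a flat-file entry can be made concrete with a small parser. The sketch below assumes the usual layout in which a two-letter line code occupies the first two columns, the data start at column 6, sequence lines have a blank code, and a `//` line terminates the entry. The sample entry is entirely made up (it is not a real SWISS-PROT record), and the function name is illustrative.

```python
from collections import defaultdict

# Made-up miniature entry, for illustration only.
SAMPLE_ENTRY = "\n".join([
    "ID   TEST_HUMAN     STANDARD;      PRT;    10 AA.",
    "AC   X00000;",
    "DE   Hypothetical example protein.",
    "OS   Homo sapiens (Human).",
    "SQ   SEQUENCE   10 AA;",
    "     MKTAYIAKQR",
    "//",
])


def parse_entry(entry_text):
    """Group the data of one flat-file entry by its two-letter line code."""
    fields = defaultdict(list)
    sequence_parts = []
    for line in entry_text.splitlines():
        if line.startswith("//"):        # terminator line ends the entry
            break
        code, data = line[:2], line[5:].strip()
        if code.strip():
            fields[code].append(data)
        elif data:                        # sequence lines carry a blank code
            sequence_parts.append(data.replace(" ", ""))
    result = dict(fields)
    result["sequence"] = "".join(sequence_parts)
    return result


ENTRY = parse_entry(SAMPLE_ENTRY)
```

Because the line codes are shared with the EMBL format, much of the same parsing logic would carry over to nucleotide entries; only the set of codes and the content of each line type differ.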
• The translations of annotated coding sequences in
the EMBL-Bank/GenBank/DDBJ nucleotide sequence
database are automatically processed and entered in
UniProtKB/TrEMBL. UniProtKB/TrEMBL also contains
sequences from PDB, and from gene prediction,
including Ensembl, RefSeq and CCDS.
• Due to the nature of the source, UniProtKB/TrEMBL is
highly redundant and the quality of the annotation is very
variable. As well as the original annotations carried over
from EMBL-Bank, additional annotations are added
based on a series of automated annotation workflows.
• As the entries in UniProtKB/TrEMBL are manually
reviewed by the UniProt curators they graduate into
UniProtKB/Swiss-Prot (the human curated section of
UniProtKB) and may be merged into existing entries
which describe the same gene in the same species.
• The usual Swiss-Prot annotation pipeline involves the
manual annotation of TrEMBL entries, their integration
into Swiss-Prot, with their original accession number,
and subsequent deletion from TrEMBL.
Secondary databases -
PROSITE,
• PROSITE is a protein database. It consists of entries
describing the protein families, domains and functional
sites as well as amino acid patterns and profiles in them.
• It is based on the observation that, while there is a huge
number of different proteins, most of them can be
grouped, on the basis of similarities in their sequences,
into a limited number of families.
• Proteins or protein domains belonging to a particular
family generally share functional attributes and are
derived from a common ancestor.
• PROSITE currently contains patterns and profiles
specific for more than a thousand protein families or
domains.
• Each of these signatures comes with documentation
providing background information on the structure and
function of these proteins.
• The ProRule section of PROSITE is constituted of
manually created rules that can automatically generate
annotation in the UniProtKB/Swiss-Prot format based on
PROSITE motifs.
• PROSITE's uses include identifying possible functions of
newly discovered proteins and analysis of known
proteins for previously undetermined activity.
• PROSITE offers tools for protein sequence analysis and
motif detection (see sequence motif, PROSITE patterns).
It is part of the ExPASy proteomics analysis servers.
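Motif detection with PROSITE patterns can be sketched by translating a pattern into a regular expression. The notes only refer to the pattern notation, so the summary used here is an assumption drawn from the standard PROSITE syntax: `x` is any residue, `[..]` lists allowed residues, `{..}` lists forbidden residues, `(n)` or `(n,m)` is a repeat count, `-` separates positions, and `<` and `>` anchor the pattern to the sequence ends. The function name and test sequences are illustrative, and rarer syntax such as anchors inside brackets is not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = pattern.rstrip(".")             # patterns may end with a period
    regex = regex.replace("-", "")          # position separators
    regex = regex.replace("x", ".")         # x = any amino acid
    regex = regex.replace("{", "[^").replace("}", "]")   # forbidden residues
    regex = regex.replace("(", "{").replace(")", "}")    # repeat counts
    regex = regex.replace("<", "^").replace(">", "$")    # terminal anchors
    return regex


# Illustrative pattern: [AC], any residue, V, four residues, then not E or D.
REGEX = prosite_to_regex("[AC]-x-V-x(4)-{ED}")
HIT = re.search(REGEX, "GGAKVTTTTQ")
MISS = re.search(REGEX, "GGAKVTTTTE")
```

A match only says that the sequence is compatible with the signature; the documentation attached to each PROSITE entry is what turns such a hit into a functional hypothesis.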
PRINTS,
• The PRINTS database is a collection of so-called
"fingerprints";
it provides both a detailed annotation resource for protein
families, and a diagnostic tool for newly determined
sequences.
• A fingerprint is a group of conserved motifs taken from
a multiple sequence alignment; together, the motifs form
a characteristic signature for the aligned protein family.
• The motifs themselves are not necessarily contiguous in
sequence, but may come together in 3D space to define
molecular binding sites or interaction surfaces.
• The particular diagnostic strength of fingerprints lies in
their ability to distinguish sequence differences at the
clan, superfamily, family and subfamily levels.
• This allows fine-grained functional diagnoses of
uncharacterised sequences, allowing, for example,
discrimination between family members on the basis of
the ligands they bind or the proteins with which they
interact, and highlighting potential oligomerisation or
allosteric sites.
• PRI
• View protein domain architectures
• Examine species distribution
• Follow links to other databases
• View known protein structures
The database can be searched by e-mail and World Wide
Web (WWW) servers (http://blocks.fhcrc.org/help) to
classify protein and nucleotide sequences.
The description of a protein family by its conserved
regions focuses on the family's characteristic and
distinctive sequence features, thus reducing noise.
Databases of conserved features of protein families can
be utilized to classify sequences from proteins, cD
BioGRID
• The Biological General Repository for Interaction
Datasets (BioGRID) is a curated biological
database of protein-protein and genetic interactions
created in 2003.
• It strives to provide a comprehensive resource
of protein-protein and genetic interactions for all
major model organism species while attempting to
remove redundancy to create a single mapping of
interactions.
• The Biological General Repository for Interaction
Datasets (BioGRID) database was developed to house
and distribute collections of protein and genetic
interactions from major model organism species.
• Users of The BioGRID can search for their protein of
interest and retrieve annotation, as well as physical
and genetic interaction data as reported by the primary
literature and compiled by in-house large-scale curation
efforts.
• Originally separated into organism specific databases, the
newest version now provides a unified front end allowing
for searches across several organisms simultaneously.
• The BioGRID is funded by the BBSRC,
• Each of the member databases of InterPro contributes
towards a different niche, from very high-level, structure-
based classifications (SUPERFAMILY and CATH-
Gene3D) through to quite specific sub-family
classifications (PRI
• The data, typically obtained by X-ray
crystallography or
• A motivation for this classification is to determine the
evolutionary relationship between proteins.
• Proteins with the same shapes but having little sequence
or functional similarity are placed in different
"superfamilies", and are assumed to have only a very
distant common ancestor.
• Proteins having the same shape and some similarity of
sequence and/or function are placed in "families", and
are assumed to have a closer common ancestor.
The SCOP database is freely accessible on the internet.
SCOP was created in 1994.
The source of protein structures is the Protein Data Bank.
The unit of classification of structure in SCOP is
the protein domain.
The shapes of domains are called "folds" in SCOP.
Domains belonging to the same fold have the same major
secondary structures in the same arrangement with the
same topological connections.
The levels of SCOP are as follows.
• Class: Types of folds, e.g., beta sheets.
• Fold: The different shapes of domains within a class.
• Superfamily: The domains in a fold are grouped into
superfamilies, which have at least a distant common
ancestor.
• Family: The domains in a superfamily are grouped into
families, which have a more recent common ancestor.
• Protein domain: The domains in families are grouped
into protein domains, which are essentially the same
protein.
• Species: The domains in "protein domains" are grouped
according to species.
• Domain: part of a protein. For simple proteins, it can be
the entire protein.
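The hierarchy above maps naturally onto a small data structure. The sketch below is only one possible representation: each domain carries one label per level, and the deepest level two domains share hints at how closely related they are assumed to be. All identifiers and labels are hypothetical placeholders, not real SCOP entries.

```python
from dataclasses import dataclass

# Levels listed from the most general to the most specific.
SCOP_LEVELS = ("class", "fold", "superfamily", "family",
               "protein domain", "species")


@dataclass(frozen=True)
class DomainClassification:
    domain_id: str
    lineage: tuple  # one label per entry of SCOP_LEVELS

    def level(self, name):
        """Return this domain's label at the named level."""
        return self.lineage[SCOP_LEVELS.index(name)]


def deepest_shared_level(a, b):
    """Return the most specific level at which two domains still agree."""
    shared = None
    for name, label_a, label_b in zip(SCOP_LEVELS, a.lineage, b.lineage):
        if label_a != label_b:
            break
        shared = name
    return shared


# Hypothetical domains, for illustration only.
D1 = DomainClassification(
    "dom1", ("all-beta", "fold-A", "sf-1", "fam-1", "protein-X", "human"))
D2 = DomainClassification(
    "dom2", ("all-beta", "fold-A", "sf-1", "fam-2", "protein-Y", "mouse"))
D3 = DomainClassification(
    "dom3", ("all-alpha", "fold-B", "sf-9", "fam-9", "protein-Z", "yeast"))
```

In this toy data `D1` and `D2` agree down to the superfamily level, which in SCOP's terms would correspond to at least a distant common ancestor, while `D1` and `D3` share no level at all.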
CATH,
The CATH Protein Structure Classification is a semi-
automatic, hierarchical classification of protein domains.
CATH shares many broad features with its principal
rival, SCOP; however, there are also many areas in which
the detailed classification differs greatly.
Only crystal structures solved to resolution better than
4.0 angstroms are considered, together with
Homologous superfamily: indicative of a demonstrable evolutionary
relationship. Equivalent to the superfamily level of SCOP.
Class is determined according to the secondary
structure composition and packing within the
structure. Three major classes are recognised:
mainly-alpha, mainly-beta and alpha-beta.
EuroCarbDB,
• EuroCarbDB is an EU-funded initiative for the creation
of software and standards for the systematic collection
of carbohydrate structures and their experimental data.
• The EUROCarbDB project is a design study for a
technical framework, which provides sophisticated,
freely accessible, open-source informatics tools and
databases to support glycobiology and glycomic
research.
• EUROCarbDB is a relational database containing glycan
structures, their biological context and, when available,
primary and interpreted analytical data from high-
performance liquid chromatography, mass spectrometry
and nuclear magnetic resonance experiments.
• Database content can be accessed via a web-based user
interface.
• The database is complemented by a suite of
glycoinformatics tools, specifically designed to assist the
elucidation and submission of glycan structure and
experimental data when used in conjunction with
contemporary carbohydrate research workflows.
• The project includes a database of known carbohydrate
structures and experimental data, specifically mass
spectrometry, HPLC and
• A specific design objective of the architecture of the
database was to allow for the extension and incorporation
of new modules and tools to support further types of
experimental data and workflows.
PubChem Compound,
PubChem is a database of chemical molecules and their
activities against biological assays. The system is
maintained by the
Substances,
BioAssay,
• PubChem Compound is a searchable database of
chemical structures with validated chemical depiction
information provided to describe substances in PubChem
Substance.
• Structures stored within PubChem Compounds are
pre-clustered and cross-referenced by identity and similarity
groups.
• PubChem Compound includes over 5M compounds.
• Molecular chemical properties, and
descriptors.
• Simple Elemental Searches (all compounds containing
Gallium) allow searching with specific element
restrictions.
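An elemental search of the kind described above ("all compounds containing Gallium") can be mimicked locally on a list of molecular formulas. This is only a sketch of the idea, not PubChem's search interface: it assumes simple formulas without parentheses or charges, and the function names and the short formula list are chosen for illustration.

```python
import re

# An element symbol is one capital letter, optionally one lowercase letter,
# followed by an optional count.
ELEMENT_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def element_counts(formula):
    """Count atoms per element in a simple formula such as 'Ga2O3'."""
    counts = {}
    for symbol, number in ELEMENT_TOKEN.findall(formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def containing_element(formulas, symbol):
    """Keep only the formulas that contain the given element symbol."""
    return [f for f in formulas if symbol in element_counts(f)]


FORMULAS = ["GaAs", "Ga2O3", "C6H12O6", "Ge", "CaCl2"]
GALLIUM_HITS = containing_element(FORMULAS, "Ga")
```

Parsing element symbols, rather than searching for a substring, is what keeps a query for carbon (`C`) from also returning calcium or chlorine compounds such as `CaCl2`.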
DrugBank,
• The database contains more than 30 million
unique molecules from over 450 data sources including:
U.S. Food and Drug Administration (FDA),
weight range, CAS numbers, suppliers, etc. The search
can be used to widen or restrict already found results.
• Structure searching on mobile devices can be done using
free apps for iOS (iPhone/iPod/iPad) and for
the Android (operating system).
and Cambridge Structural Database.
• The Cambridge Structural Database (CSD) is a
repository for small molecule crystal structures.
• Scientists use single-crystal x-ray crystallography to
determine the crystal structure of a compound.
• Once the structure is solved, information about the
structure is saved in a file (CIF format) and deposited in
the CSD.
• Other scientists can search and retrieve structures from
the database.
• The information consists of the space group symmetry of
the crystalline phase, its cell parameters, and the relative
atomic coordinates of all the atoms in the cell in 3D.
• Scientists can use the CSD to compare existing data with
that obtained from crystals grown in their laboratories.
• The information can also be used to visualize the
structure in a variety of software such
as atoms, powdercell etc.
• It is also possible to calculate what the
theoretical powder diffraction pattern of the phase would
look like. This option is particularly important for
analytical reasons because it facilitates the identification
of phases present in a crystalline powder mixture without
the need for growing crystals.
• Many of the small molecules are organic compounds of
the sort that could potentially act as medical drugs, and a
very important use of the CSD is for structural
comparisons among related molecules that can suggest
new leads for drug design.
• The CSD is compiled and maintained by the Cambridge
Crystallographic Data Centre.
• Each crystal structure undergoes extensive validation and
cross-checking by expert chemists and crystallographers
to ensure that the CSD is maintained to the highest
possible standards.
• Also, each database entry is enriched with bibliographic,
chemical and physical property information, adding
further value to the raw structural data.
• These editorial processes are vital for enabling scientists
to interpret structures in a chemically meaningful way.
• The CSD is continually updated with new structures
(>40,000 new structures each year) and with
improvements to existing entries.
• With regular web-updates and early online access to
newly published structures you can keep fully informed
of the latest research.
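The cell parameters and relative (fractional) atomic coordinates stored for a CSD entry are enough to place atoms in ordinary 3D space. The sketch below shows the standard conversion from fractional to Cartesian coordinates, assuming the common convention that the a axis lies along x and the b axis lies in the xy-plane. The cell values in the example are made up, and the function name is illustrative.

```python
import math


def fractional_to_cartesian(frac, a, b, c, alpha, beta, gamma):
    """Convert fractional coordinates to Cartesian ones.

    a, b, c are cell lengths; alpha, beta, gamma are cell angles in degrees.
    The result is in the same length unit as a, b and c.
    """
    cos_a, cos_b, cos_g = (
        math.cos(math.radians(angle)) for angle in (alpha, beta, gamma)
    )
    sin_g = math.sin(math.radians(gamma))
    # Cell volume divided by a*b*c.
    vol = math.sqrt(
        1 - cos_a ** 2 - cos_b ** 2 - cos_g ** 2 + 2 * cos_a * cos_b * cos_g
    )
    fx, fy, fz = frac
    x = a * fx + b * cos_g * fy + c * cos_b * fz
    y = b * sin_g * fy + c * (cos_a - cos_b * cos_g) / sin_g * fz
    z = c * vol / sin_g * fz
    return (x, y, z)


# Made-up orthorhombic cell: the centre of the cell is at half of each edge.
CENTRE = fractional_to_cartesian((0.5, 0.5, 0.5), 5.0, 6.0, 7.0, 90, 90, 90)
# Made-up hexagonal cell: the b axis makes 120 degrees with the a axis.
B_AXIS = fractional_to_cartesian((0.0, 1.0, 0.0), 4.0, 4.0, 6.0, 90, 90, 120)
```

Once atoms are in Cartesian coordinates, distances and angles between them follow from ordinary vector arithmetic, which is the basis for the structural comparisons described above.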