Database Resources
The National Center for Biotechnology Information (NCBI):
a primary resource for molecular biology information
www.ncbi.nih.gov
Slide 2
Slide 3
NCBI Mission
To develop new information technologies to aid in the understanding of
fundamental molecular and genetic processes that control health and disease.
What does this involve?
  creating automated systems for storing and analyzing knowledge about
  molecular biology, biochemistry, and genetics;
  facilitating the use of such databases and software by the research and
  medical community;
  coordinating efforts to gather biotechnology information both nationally
  and internationally;
  performing research into advanced methods of computer-based information
  processing for analyzing the structure and function of biologically
  important molecules.
Slide 4
What is a Database?
  A model or representation of some aspect of the real world
  An organized collection of data
  May contain many different types of data
  Coherent, consistent and designed for a specific purpose
  A computational system for managing and querying the data
Slide 5
What is a Database?
  A collection of information organized in such a way that a computer
  program can quickly select desired pieces of data.
  An electronic filing system
Traditional databases are organized by fields, records, and files. A field
is a single piece of information; a record is one complete set of fields;
and a file is a collection of records. For example, a telephone book is
analogous to a file. It contains a list of records, each of which consists
of three fields: name, address, and telephone number.
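The field/record/file vocabulary above can be sketched in a few lines of Python. This is only an illustration of the telephone book analogy: the class name `PhoneRecord`, the helper `find_by_name`, and the entries themselves are made up for the example.

```python
from dataclasses import dataclass, fields


@dataclass
class PhoneRecord:
    """One record: a complete set of fields (hypothetical example)."""
    name: str
    address: str
    telephone: str


# The "file" is a collection of records; these entries are invented.
phone_book = [
    PhoneRecord("Ada Example", "1 Sample Street", "555-0100"),
    PhoneRecord("Ben Placeholder", "2 Sample Street", "555-0101"),
]

# The field names shared by every record in the file.
field_names = [f.name for f in fields(PhoneRecord)]


def find_by_name(book, name):
    """Select the records whose name field matches exactly."""
    return [record for record in book if record.name == name]
```

Calling `find_by_name(phone_book, "Ada Example")` returns the single matching record, from which any field (for example `.telephone`) can be read.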
Slide 6
What is a Database?
To access information from a database, you need a database management
system (DBMS). This is a collection of programs that enables you to enter,
organize, and select data in a database.
Most molecular biology databases primarily use relational database
management systems (RDBMS).
Slide 7
Relational Database
A relational database is like a large spreadsheet. Each field is a column,
each row is an entry. Relational databases use a set of tables to organize
data.
Each entry must be unambiguously identified
  Names are not reliable, e.g. incorrectly assigned gene function
  Unique IDs (UIDs) are used; e.g. in GenBank these are called accession
  numbers

UID       Name            Sequence   Quality Value
BU039022  PP_LEa0001A01f  CATACAAAT  35
BU039057  PP_LEa0001B17f  TACGGCTAC  28
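As a minimal sketch of why a unique ID matters, the two example rows from this slide can be loaded into an in-memory SQLite table whose UID column is the primary key. SQLite, the table name `est`, and the helper `try_insert` are assumptions made for illustration; the slide does not say which RDBMS or schema GenBank uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table mirroring the four columns shown on the slide.
conn.execute(
    """
    CREATE TABLE est (
        uid           TEXT PRIMARY KEY,   -- unique ID (accession number)
        name          TEXT NOT NULL,
        sequence      TEXT NOT NULL,
        quality_value INTEGER
    )
    """
)

rows = [
    ("BU039022", "PP_LEa0001A01f", "CATACAAAT", 35),
    ("BU039057", "PP_LEa0001B17f", "TACGGCTAC", 28),
]
conn.executemany("INSERT INTO est VALUES (?, ?, ?, ?)", rows)


def try_insert(row):
    """Return True if the row was stored, False if its UID already exists."""
    try:
        conn.execute("INSERT INTO est VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


# A second entry reusing an existing UID is rejected by the database.
duplicate_accepted = try_insert(("BU039022", "another_name", "GGGG", 10))
row_count = conn.execute("SELECT COUNT(*) FROM est").fetchone()[0]
```

Because the primary key constraint is enforced by the database itself, an entry can never be confused with another one, whatever name it carries.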
Slide 8
Relational Database
Achieving consistency
  Repeated information is stored in a single place.
  Only one copy needs to be updated

Tables and their fields:
  Sequence:    UID, Definition, Locus, Accession, Taxonomy ID*, Sequence
  Taxonomy:    Taxonomy ID*, Genus, Species
  Ref Index*:  UID, Medline ID
  Ref Index:   Medline ID, Authors, Title, Journal

* May be referred to by a secondary ID
* May be referred to indirectly via an index
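The idea of storing repeated information once can be sketched with two of the tables named on this slide. The example below uses SQLite and placeholder organism names (`Genus_a`, `species_a`); the column types and the taxonomy ID value are assumptions, not details of the real GenBank schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Organism details are stored once, keyed by a taxonomy ID.
    CREATE TABLE taxonomy (
        taxonomy_id INTEGER PRIMARY KEY,
        genus       TEXT,
        species     TEXT
    );
    -- Each sequence entry only stores the taxonomy ID, not the names.
    CREATE TABLE sequence (
        uid         TEXT PRIMARY KEY,
        taxonomy_id INTEGER REFERENCES taxonomy(taxonomy_id),
        sequence    TEXT
    );
    INSERT INTO taxonomy VALUES (1, 'Genus_a', 'species_a');
    INSERT INTO sequence VALUES ('BU039022', 1, 'CATACAAAT');
    INSERT INTO sequence VALUES ('BU039057', 1, 'TACGGCTAC');
    """
)


def organisms():
    """Join the two tables to list each sequence with its organism."""
    return conn.execute(
        """
        SELECT s.uid, t.genus, t.species
        FROM sequence AS s
        JOIN taxonomy AS t ON t.taxonomy_id = s.taxonomy_id
        ORDER BY s.uid
        """
    ).fetchall()


before = organisms()
# Correcting the species name touches exactly one row ...
conn.execute("UPDATE taxonomy SET species = 'species_b' WHERE taxonomy_id = 1")
# ... yet every sequence that refers to it sees the corrected value.
after = organisms()
```

The single `UPDATE` changes what both sequence entries report, which is the consistency benefit the slide describes.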
Slide 9
Relational Database
The language used is SQL (Structured Query Language)
  Easy to understand (essentially English?)
  Relatively consistent across RDBMS
  Supplies a set of commands to define tables, insert data and make queries
Queries
  SELECT some fields FROM some table WHERE some condition is met
  E.g. SELECT accession, sequence FROM sequence WHERE accession = 'BU039022'
  returns: BU039022 CATACAAATACTGCTACHTAAATC...
More complex queries require two or more tables to be joined to produce a
result.
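The three kinds of command mentioned above (define a table, insert data, make a query) can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: SQLite stands in for whatever RDBMS a real sequence database uses, the two-column table is a simplification, and the stored sequence is only the truncated fragment shown on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# 1. Define a table (simplified to the two fields used in the slide's query).
conn.execute(
    "CREATE TABLE sequence (accession TEXT PRIMARY KEY, sequence TEXT)"
)

# 2. Insert data. The first sequence is the truncated fragment from the slide.
conn.execute(
    "INSERT INTO sequence VALUES (?, ?)",
    ("BU039022", "CATACAAATACTGCTACHTAAATC"),
)
conn.execute(
    "INSERT INTO sequence VALUES (?, ?)",
    ("BU039057", "TACGGCTAC"),
)

# 3. Make a query: SELECT some fields FROM some table WHERE a condition is met.
result = conn.execute(
    "SELECT accession, sequence FROM sequence WHERE accession = ?",
    ("BU039022",),
).fetchall()
```

The `?` placeholder passes the accession as a value rather than pasting it into the SQL text, which is equivalent to writing the quoted literal `'BU039022'` in the query.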
Slide 10
Relational Database
Most public databases do not allow users to query the underlying RDBMS
directly with SQL.
  An ill-formed query can overload or crash the system
  SQL still too complex for biologists?
Provide a search interface for the user instead
  E.g. the user enters a phrase and the system identifies which part of
  the database should be searched.
  Queries that come in through the web interface have to be translated
  into SQL.
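A rough sketch of such a translation layer is shown below. Everything in it is hypothetical: the function names `build_query` and `search`, the two-column table, the made-up definitions, and the accession pattern, which is a deliberate simplification of real GenBank accession formats. It illustrates the idea only and is not how Entrez is implemented.

```python
import re
import sqlite3

# Simplified assumption: one or two letters followed by five or six digits.
ACCESSION_PATTERN = re.compile(r"^[A-Z]{1,2}\d{5,6}$")


def build_query(user_text):
    """Translate free text from a search box into SQL plus parameters."""
    text = user_text.strip()
    if ACCESSION_PATTERN.match(text.upper()):
        # Looks like an accession: search only the accession field.
        sql = "SELECT accession, definition FROM sequence WHERE accession = ?"
        return sql, (text.upper(),)
    # Otherwise treat the input as a phrase to find in the definition field.
    sql = (
        "SELECT accession, definition FROM sequence "
        "WHERE definition LIKE ? ORDER BY accession"
    )
    return sql, (f"%{text}%",)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequence (accession TEXT PRIMARY KEY, definition TEXT)"
)
conn.executemany(
    "INSERT INTO sequence VALUES (?, ?)",
    [
        ("BU039022", "hypothetical EST clone A01"),
        ("BU039057", "hypothetical EST clone B17"),
    ],
)


def search(user_text):
    """Run the translated query; user text is only ever passed as a value."""
    sql, params = build_query(user_text)
    return conn.execute(sql, params).fetchall()
```

Because the user's text is bound as a parameter and never spliced into the SQL string, a malformed or malicious input can at worst return no rows; it cannot change the query that the database executes.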
Slide 11
Relational Database: Example GenBank Query
Slide 12
What Constitutes a Good Database?
  Broad coverage of the chosen topic
  Up-to-date information gathering
  Curated
  Support staff
  Commitment to the future
  Good query interface
Issues for Molecular Biological Databases?
  Annotation
  Archives
  Updates
  Redundancy
Slide 13
Issues for Molecular Biological Databases?
Annotation
  Adding biological information to genome sequence.
  Textual descriptive information
Correctness
  Many genes are incorrectly annotated.
  A function may be assigned to a novel gene from a similar sequence that
  is itself incorrectly annotated, so the error is propagated throughout
  the database.
  Routine error
Quality
  Expert or non-expert curation?
  Who provided the curation?
  Is there any biological verification?
  What vocabulary is used?
  Has there been any peer review?
Slide 14
Issues for Molecular Biological Databases?
Archival Quality
  Is the database archival or curated?
  Can the same data be recovered later?
  Don't overwrite the primary key (accession numbers)
  The best databases note any changes to the data.
Updates
  How often is the database updated?
  Major databases take direct submissions
  Only the direct submitter can make changes, even if you can prove it's
  wrong.
  When is a sequence finished?
  How is annotation updated as more knowledge becomes available?
Redundancy
  This is a major issue: how do we deal with it without losing potentially
  valuable information?
  Also relates to archival quality
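One way to keep data recoverable is to store every change as a new version under the same accession instead of overwriting the old row. The sketch below assumes a hypothetical `sequence_version` table and helper functions `add_version` and `get_sequence`; it illustrates the principle and makes no claim about how any particular database implements it. The second sequence is an invented correction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sequence_version (
        accession TEXT    NOT NULL,
        version   INTEGER NOT NULL,
        sequence  TEXT    NOT NULL,
        PRIMARY KEY (accession, version)  -- old versions are never replaced
    )
    """
)


def add_version(accession, sequence):
    """Store a new version of an entry and return its version number."""
    latest = conn.execute(
        "SELECT MAX(version) FROM sequence_version WHERE accession = ?",
        (accession,),
    ).fetchone()[0]
    new_version = 1 if latest is None else latest + 1
    conn.execute(
        "INSERT INTO sequence_version VALUES (?, ?, ?)",
        (accession, new_version, sequence),
    )
    return new_version


def get_sequence(accession, version=None):
    """Return the requested version, or the newest one if none is given."""
    if version is None:
        row = conn.execute(
            "SELECT sequence FROM sequence_version WHERE accession = ? "
            "ORDER BY version DESC LIMIT 1",
            (accession,),
        ).fetchone()
    else:
        row = conn.execute(
            "SELECT sequence FROM sequence_version "
            "WHERE accession = ? AND version = ?",
            (accession, version),
        ).fetchone()
    return row[0] if row else None


first = add_version("BU039022", "CATACAAAT")
second = add_version("BU039022", "CATACAAAG")  # invented correction
```

After the update, `get_sequence("BU039022")` gives the corrected sequence while `get_sequence("BU039022", 1)` still returns the original, so an analysis based on the earlier data can be reproduced.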
Slide 15
Slide 16
NCBI and GenBank
GenBank is the genetic sequence database of all publicly available DNA and
derived protein sequences, with annotations describing the biological
information in them.
  GenBank is hosted within NCBI
  Researchers submit their sequences to GenBank
  NCBI provides analysis and retrieval resources for the data in GenBank
  (and many other NCBI-hosted databases).
Slide 17
NCBI Databases
(http://www.ncbi.nlm.nih.gov/guide/all/#Databases_)
  Nucleotide Database
  EST (dbEST)
  GSS (dbGSS)
  Protein Database
  Structure Database
  Genome
  3D Domains
  Conserved Domains
  UniSTS
  Gene
  UniGene
  HomoloGene
  Reference Sequence (RefSeq)
  SNP (dbSNP)
  dbVar: large-scale genomic variation
  dbGaP: integration of genotype & phenotype
  PopSet Database
  Taxonomy Database
  GEO Profiles
  GEO Datasets
  Cancer Chromosomes
  Epigenomics
  PubMed Central
  Journals
  MeSH
  Bookshelf
  OMIM Database
Slide 18
Slide 19
Slide 20
Retrieving Data from NCBI using Entrez
Entrez is a text-based retrieval system that integrates all the information
resources available at the NCBI, such as:
1. Scientific literature
2. DNA and protein sequence databases
3. 3D protein structure and protein domain data
4. Population study datasets
5. Expression data
6. Assemblies of complete genomes
7. Taxonomic information
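Entrez can also be queried from a program through NCBI's E-utilities web service, which is not covered on these slides and is introduced here only as an illustration. The sketch below builds a search URL for the `esearch` endpoint without sending it; the helper name `build_esearch_url` is hypothetical, and the choice of the `nuccore` database and the `[ACCN]` field tag are assumptions made for the example. Check NCBI's current usage guidelines before sending real requests.

```python
from urllib.parse import urlencode

# Base address of the E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database, term, max_results=20):
    """Return the URL for a text search in one Entrez database.

    No network request is made here; the URL can be opened in a browser
    or fetched with any HTTP client.
    """
    params = {
        "db": database,         # which Entrez database to search
        "term": term,           # the text query, optionally with field tags
        "retmax": max_results,  # maximum number of IDs to return
    }
    return ESEARCH_URL + "?" + urlencode(params)


# Example: look up one accession number in the nucleotide database.
example_url = build_esearch_url("nuccore", "BU039022[ACCN]", max_results=5)
```

The same function can target other Entrez databases by changing the `database` argument, for instance `"pubmed"` for the scientific literature or `"taxonomy"` for taxonomic information.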
Slide 21
Slide 22
Slide 23
Slide 24
Slide 25
Slide 26
Slide 27
Slide 28
Slide 29
Slide 30
http://www.ncbi.nlm.nih.gov/guide/all/#howto_
Slide 31
Create an account on, or log in to, the My NCBI portal
Slide 32
Understanding GenBank records
Go to
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html#ModificationsDateB
Click on the links on the left to get a description of what each term
means. Copy each description into a Word document and, when you have
finished, save the document on your Drupal web site.
Slide 33
Entrez Sequences Help
http://www.ncbi.nlm.nih.gov/books/NBK44864/