Biological Databases November 30, 2006 Wailap V. Ng Institute of Biotechnology in Medicine Institute of Bioinformatics National Yang Ming University [email protected]


Macromolecules Related to Bioinformatics

• DNA (Deoxyribonucleic acid)

• RNA (Ribonucleic acid)

- mRNA (Messenger RNA)

- rRNA (Ribosomal RNA)

- tRNA (Transfer RNA)

• Proteins

- Enzymes

- Structural proteins

- Regulatory proteins

- Transporters

Nucleic acid and protein sequences store the essential bioinformation

DNA (A, C, G, T): A C G T G A A C C T

RNA (A, C, G, U): A C G U G A A C C U

Protein (20 amino acids): G A V L I S T C D M E N Q R K F H Y P W

Central Dogma

DNA --> mRNAs --> Proteins

- Replication: DNA --> DNA

- Transcription: DNA --> mRNA

- Translation: mRNA --> protein
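The transcription step can be illustrated with a minimal sketch. It assumes the input is the coding (sense) strand of the DNA, so the mRNA reads the same with U in place of T; the function name `transcribe` is a hypothetical helper, not part of any library.

```python
def transcribe(coding_strand: str) -> str:
    """Return the mRNA for a DNA coding strand by replacing T with U."""
    dna = coding_strand.upper().replace(" ", "")
    unexpected = set(dna) - set("ACGT")
    if unexpected:
        raise ValueError(f"unexpected characters in DNA: {sorted(unexpected)}")
    return dna.replace("T", "U")


# The ten-base DNA example used on the earlier slide.
print(transcribe("A C G T G A A C C T"))  # ACGUGAACCU
```

Working from the template strand instead would require taking the reverse complement first; that case is left out of this sketch.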

Basic structure of a bacterial gene

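[Figure: a promoter (P) followed by the gene on the DNA (5’ to 3’), with the start codon (ATG) and the stop codon (TAA) marked. Transcription gives the mRNA (5’ to 3’); translation gives the protein (N to C terminus).]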

A gene is a segment of DNA with an upstream start codon and a downstream stop codon that codes for the sequence of a polypeptide (protein)
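The definition above can be sketched in code: scan for the first ATG, read codons in frame, and stop at the first stop codon. This is only an illustration under simplifying assumptions (forward strand only, first ATG only, standard genetic code, plain A/C/G/T input); real gene finders do considerably more. The names `CODON_TABLE` and `first_orf_protein` are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def first_orf_protein(dna: str):
    """Translate from the first ATG to the first in-frame stop codon.

    Returns the protein as one-letter codes, or None if there is no ATG
    or no in-frame stop codon.
    """
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return None
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            return "".join(protein)
        protein.append(amino_acid)
    return None  # ran off the end without meeting a stop codon


# Toy sequence: CC | ATG GCG CAA TAA | GG
print(first_orf_protein("CCATGGCGCAATAAGG"))  # MAQ
```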

Information in Biological Databases

• DNA and protein sequences

• Protein structures

• Expression data (microarray, SAGE, etc.)

• Biological pathways

• Subcellular location of proteins

• Protein-protein interactions

• Molecular medicine

• Literature

• etc.

International Union of Pure and Applied Chemistry (IUPAC) codes for nucleotides and amino acids

IUPAC nucleotide code Base

A Adenine

C Cytosine

G Guanine

T (or U) Thymine (or Uracil)

R A or G (purine)

Y C or T (pyrimidine)

S G or C

W A or T

K G or T

M A or C

B C or G or T

D A or G or T

H A or C or T

V A or C or G

N any base

. or - gap
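The ambiguity codes in this table map directly onto character classes, which is one way to search a sequence for a degenerate motif. A minimal sketch, assuming DNA input (U is treated as T here); the helper name `iupac_to_regex` and the motif GANTC are illustrations only.

```python
import re

# IUPAC nucleotide codes from the table above, as sets of concrete bases.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_to_regex(motif: str) -> str:
    """Turn an IUPAC motif into a regular expression over A, C, G, T."""
    parts = []
    for code in motif.upper():
        bases = IUPAC_NUCLEOTIDES[code]  # KeyError for gaps or unknown codes
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return "".join(parts)


pattern = iupac_to_regex("GANTC")
print(pattern)                                  # GA[ACGT]TC
print(re.search(pattern, "TTGACTCAA").group())  # GACTC
```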

IUPAC amino acid code / Three-letter code / Amino acid

A Ala Alanine

C Cys Cysteine

D Asp Aspartic Acid

E Glu Glutamic Acid

F Phe Phenylalanine

G Gly Glycine

H His Histidine

I Ile Isoleucine

K Lys Lysine

L Leu Leucine

M Met Methionine

N Asn Asparagine

P Pro Proline

Q Gln Glutamine

R Arg Arginine

S Ser Serine

T Thr Threonine

V Val Valine

W Trp Tryptophan

Y Tyr Tyrosine
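As a small illustration of how the one-letter and three-letter codes relate, the table can be held as a lookup. This is a sketch only; `THREE_LETTER` and `to_three_letter` are hypothetical names, and only the 20 amino acids listed above are covered.

```python
# One-letter to three-letter amino acid codes, copied from the table above.
THREE_LETTER = {
    "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe",
    "G": "Gly", "H": "His", "I": "Ile", "K": "Lys", "L": "Leu",
    "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln", "R": "Arg",
    "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp", "Y": "Tyr",
}


def to_three_letter(protein: str) -> str:
    """Spell out a one-letter protein sequence with three-letter codes."""
    return "-".join(THREE_LETTER[residue] for residue in protein.upper())


print(to_three_letter("MAQ"))  # Met-Ala-Gln
```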

The Origins of Protein Sequence Databases

* Protein sequencing (Sanger and Tuppy, 1951)

• Atlas of Protein Sequence and Structure (Margaret Dayhoff and the National Biomedical Research Foundation (NBRF), 1965-1978)

• Protein Information Resource (PIR) (NBRF, 1984 - present)

• PIR-International Protein Sequence Database (NBRF, MIPS, and JIPID, 1988 – present)

The Origins of DNA Sequence Databases

* DNA double-helix structure (James Watson and Francis Crick, 1953)

* Recombinant DNA (Paul Berg et al., 1972)

* DNA sequencing (Maxam and Gilbert; Sanger - 1977)

• GenBank [Walter Goad et al., 1979 (prototype); 1982 -1992, LANL (Los Alamos National Lab.)]

• EMBL Data Library [1982 (1980) – present] – UK

• DDBJ [1986 (1984) – present] - Japan

http://www.infobiogen.fr/services/dbcat/ (Site closed)

Number of biological databases in 2005

Major Bioinformation Resources

• NCBI – National Institutes of Health

• EMBL – European Bioinformatics Institute

• DDBJ – National Institute of Genetics (Japan)

• ExPASy – Swiss Institute of Bioinformatics

• GenomeNet – Kyoto University

NCBI molecular databases

Nucleotide Sequence Databases Consist of the Following Sequences:

• DNA fragments

• cDNA [Expressed Sequence Tags (EST) and full length cDNA sequences - partial and complete mRNA]

• Genomes

Nucleic acid sequences provide the fundamental starting point for describing and understanding the structure, function, and development of genetically diverse organisms.

Common Sequence File Formats

• FASTA

• GenBank (DNA) or GenPept (Protein)

Each sequence has at least one unique identifier that allows you to retrieve it from the public db, e.g. Accession Number, gi_number, gene_ID, protein_ID, locus name, etc.

>gi|43500|emb|Y00534.1|HHGVPA Halobacterium halobium gvpA gene for major gas vesicle protein
AAGCTTTACACTCTCCGTACTTAGAAGTACGACTCATTACAGGAGACATAACGACTGGTGAAACCATACACATCCTTATGTGATGCCCGAGTATAGTTAGAGATGGGTTAATCCCAGATCACCAATGGCGCAACCAGATTCTTCAGGCTTGGCAGAAGTCCTTGATCGTGTACTAGACAAAGGTGTCGTTGTGGACGTGTGGGCTCGTGTGTCGCTTGTCGGCATCGAAATCCTGACCGTCGAGGCGCGGGTCGTCGCCGCCTCGGTGGACACCTTCCTCCACTACGCAGAAGAAATCGCCAAGATCGAACAAGCCGAACTTACCGCCGGCGCGAGGCGGCACCCGAGGCCTGACGCACAGGCCTCCCTTCGGCCGGCGTAAGGGAGGTGAATCGCTTGCAAACCATACTTTAACACCTTCTCGGGTAC

DNA sequence in FASTA format
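A FASTA record is a header line starting with ">" followed by one or more sequence lines. The sketch below reads such text and splits the header of the gvpA entry into its identifiers. It assumes the older five-field "gi|...|db|accession|locus" header layout shown on this slide; the function names are hypothetical, and only the first 40 bases of the sequence are used to keep the example short.

```python
def read_fasta(text: str):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_gi_header(header: str) -> dict:
    """Split a header of the assumed form gi|number|db|accession|locus description."""
    identifiers, _, description = header.partition(" ")
    fields = identifiers.split("|")
    return {
        "gi": fields[1],
        "source_db": fields[2],
        "accession": fields[3],
        "locus": fields[4],
        "description": description,
    }


example = """>gi|43500|emb|Y00534.1|HHGVPA Halobacterium halobium gvpA gene for major gas vesicle protein
AAGCTTTACACTCTCCGTAC
TTAGAAGTACGACTCATTAC
"""

header, sequence = read_fasta(example)[0]
info = split_gi_header(header)
print(info["accession"], info["locus"], len(sequence))  # Y00534.1 HHGVPA 40
```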

DNA sequence in GenBank format

Nucleotide Sequence Databases

• GenBank – NCBI (National Center for Biotechnology Information)

http://www.ncbi.nlm.nih.gov/

• EMBL (European Molecular Biology Laboratory) – EBI

http://www.ebi.ac.uk/

• DDBJ (DNA Data Bank of Japan) – NIG (National Institute of Genetics)

http://www.ddbj.nig.ac.jp/

When did the collaboration start?

In February, 1986, GenBank and EMBL began a collaborative effort [joined by DDBJ in 1987] to devise a common feature table format and common standards for annotation practice.

INSDC

International Nucleotide Sequence Database Collaboration

August 2005

National Center for Biotechnology Information (NCBI)

- Established in 1988

- Part of the National Library of Medicine, NIH, USA

- Creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information

- Host to the GenBank nucleotide sequence database since 1992 (1982 -1992, LANL)

NCBI Nucleotide Databases

• GenBank - INSDC collected DNA sequences

• RefSeq - a comprehensive, integrated, non-redundant set of sequences for major research organisms

• dbEST - contains sequence data on "single-pass" cDNA sequences (Expressed Sequence Tags)

• UniGene - a non-redundant set of gene-oriented clusters automatically partitioned from GenBank sequences

• dbSTS - sequence & mapping data on short genomic landmark sequences or Sequence Tagged Sites (PCR primer pairs)

• UniSTS - a comprehensive db of STSs derived from STS-based maps and other experiments

NCBI Nucleotide Databases (continued)

• dbSNP – Single nucleotide polymorphism database

• dbGSS - Genome survey sequence database

• PopSet - a set of DNA sequences collected to analyze the evolutionary relatedness of a population

• TPA - Third party annotation sequences

• Nucleotide - Entrez Nucleotides database of GenBank, RefSeq, and PDB sequences

• Trace Archive – Raw DNA sequence trace files

• HomoloGene – A system for automated detection of homologs among the annotated genes of several completely sequenced eukaryotic genomes

• http://www.ncbi.nlm.nih.gov/Database/

NCBI Nucleotide Databases (continued)

• MGC (Mammalian Gene Collection)

cDNA Sequence Related Databases

dbEST

Unigene

TIGR THC

Full-Length cDNA Sequences

What is dbEST?

dbEST (Nature Genetics 4:332-3;1993) is a division of GenBank that contains sequence data and other information on "single-pass" cDNA sequences, or Expressed Sequence Tags, from a number of organisms.

[Figure: DNA --> (transcription) --> mRNA --> (reverse transcription) --> cDNA --> (DNA sequencing) --> EST]

cDNA sequencing is a powerful tool for quick identification of new genes


dbEST

Unigene

TIGR THC, Human Gene Index


[Figure: Gene --> (transcription) --> mRNAs with poly(A) tails --> (cDNA cloning, cDNA sequencing) --> ESTs --> (EST clustering) --> UniGene --> expression profile]


dbEST

UniGene

THC (Tentative human consensus sequences) - The Institute for Genomic Research (www.tigr.org)





[Figure: Gene --> (transcription) --> mRNAs with poly(A) tails --> (cDNA cloning) --> full-length cDNA clone --> (DNA sequencing) --> ESTs --> (sequence assembly) --> full-length cDNA sequence]

http://hinv.ddbj.nig.ac.jp/

Protein Sequence Databases

Origins of Protein Sequences

Derived from:

• DNA fragment sequences

• mRNA sequences

• ESTs

• Genomes

Database name - Full name and/or description

• NCBI Protein database - All protein sequences: translated from GenBank and imported from other protein databases

• PIR-PSD - Protein Information Resource Protein Sequence Database, has been merged into the UniProt knowledgebase - Georgetown University

• PIR-NREF - PIR's Non-redundant Reference protein database - Georgetown University

• PRF - Protein research foundation database of peptides: sequences, literature and unnatural amino acids - Japan

• Swiss-Prot - Now UniProt/Swiss-Prot: expertly curated protein sequence database, section of the UniProt knowledgebase - Swiss Institute of Bioinformatics

• TrEMBL - Now UniProt/TrEMBL: computer-annotated translations of EMBL nucleotide sequence entries: section of the UniProt knowledgebase - SIB

• UniProt - Universal protein knowledgebase: merged data from Swiss-Prot, TrEMBL and PIR protein sequence databases – GU, SIB, EMBL

• UniRef - UniProt non-redundant reference database: clustered sets of related sequences (including splice variants and isoforms) – GU, SIB, EMBL

Protein sequence in FASTA format

Protein sequence in GenPept format – example 1

Protein sequence in GenPept format – example 2

Protein sequence in UniProt/Swiss-Prot format

Other Biological Databases

• Protein-Protein Interaction

• Gene Ontology

• Biological Pathways

• Protein structures

• Orthologs

• Gene expression

• Literature

Protein-Protein Interaction Databases

• Most proteins do not work alone in the cell

• Utilize the concept of ‘guilt by association’ to discover the functions of previously uncharacterized proteins
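One simple way to apply "guilt by association" is a majority vote: give an uncharacterized protein the function most common among its annotated interaction partners. The sketch below assumes this simplest form. The protein names, interactions and annotations are made up for illustration, and `predict_function` is a hypothetical helper.

```python
from collections import Counter


def predict_function(protein, interactions, annotations):
    """Return the most common function among the annotated partners of a protein.

    interactions: iterable of (protein_a, protein_b) pairs (undirected).
    annotations: dict mapping protein name to a known function.
    Returns None if no interaction partner is annotated.
    """
    partners = set()
    for a, b in interactions:
        if a == protein:
            partners.add(b)
        elif b == protein:
            partners.add(a)
    votes = Counter(annotations[p] for p in partners if p in annotations)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical toy network: X is uncharacterized.
interactions = [("X", "A"), ("X", "B"), ("C", "X"), ("A", "B")]
annotations = {"A": "membrane fusion", "B": "membrane fusion", "C": "cytokinesis"}
print(predict_function("X", interactions, annotations))  # membrane fusion
```

Ties between equally common functions are not resolved in any principled way here; a real analysis would also weigh the reliability of each interaction.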

Figure 1: (A) An interaction map of the yeast proteome assembled from published interactions. The map contains 1,548 proteins and 2,358 interactions. Proteins are colored according to their functional role as defined by the Yeast Protein Database [16]: proteins involved in membrane fusion (blue), chromatin structure (gray), cell structure (green), lipid metabolism (yellow), and cytokinesis (red). For other maps with different functional groups highlighted, see <http://depts.washington.edu/sfields/>. On-line maps can also be zoomed and searched for protein names. (B) Section of part A showing the clustering of proteins involved in membrane fusion (blue), lipid metabolism (yellow), and cell structure (green).

Schwikowski et al. 2000. Nat. Biotech.

Protein interaction map of Drosophila melanogaster

Giot et al. Science 302:1727-36, 2003

7,048 proteins & 20,405 interactions

Integrated physical-interaction network. Nodes represent genes and are labeled with their corresponding gene names. Connections between nodes display physical interactions as recorded in the public databases, where a yellow arrow directed from one node to another represents a protein --> DNA interaction, and a blue line between nodes represents a protein-protein interaction. Global changes in mRNA expression (in this case, in response to a deletion of GAL4 in the presence of galactose) are visually superimposed on the network. The grayscale intensity of each node indicates the change in mRNA expression of the corresponding gene, where medium gray represents no change, darker or lighter shades represent an increase or decrease in expression, respectively, and node diameter scales with the overall magnitude of change. GAL4 is colored in red to signify that its expression level has been perturbed by external means. Highly interconnected groups of genes tend to have common biological function and are annotated accordingly (rectangular labels).

Ideker et al. Science 292:929, 2001

• Database of Interacting Proteins (DIP)

(http://dip.doe-mbi.ucla.edu/; UCLA)

• Biomolecular Interaction Network Database (BIND)

(http://bind.ca/; Mount Sinai Hospital, Canada)

• Human Protein Reference Database (HPRD)

(http://www.hprd.org/; Johns Hopkins University and the Institute of Bioinformatics)

• MIPS Mammalian Protein-Protein Interaction Database

(http://mips.gsf.de/proj/ppi/; Munich Information Center for Protein Sequences)

More can be found in http://mips.gsf.de/proj/ppi/

Database of Interacting Proteins (DIP)

The DIP™ database catalogs experimentally determined interactions between proteins. It combines information from a variety of sources to create a single, consistent set of protein-protein interactions. The data stored in the DIP database are curated both manually, by expert curators, and automatically, using computational approaches that draw on knowledge about protein-protein interaction networks extracted from the most reliable, core subset of the DIP data.

BIND Database

The MIPS Mammalian Protein-Protein Interaction Database is a collection of manually curated, high-quality PPI data collected from the scientific literature by expert curators. Only data from individually performed experiments are included, since these usually provide the most reliable evidence for physical interactions.

Human Protein Reference Database

A centralized platform to visually depict and integrate information pertaining to domain architecture, post-translational modifications, interaction networks and disease association for each protein in the human proteome. All the information in HPRD has been manually extracted from the literature by expert biologists who read, interpret and analyze the published data.

Biological Pathway Databases

• KEGG (GenomeNet)

• BioCarta (NCI)

• BioPAX (Biological Pathway Exchange)

* Ingenuity Systems

* GeneGo

ARGININE AND PROLINE METABOLISM

Biocarta Pathways http://cgap.nci.nih.gov/Pathways/BioCarta_Pathways

http://www.biocarta.com/

http://www.biopax.org/

BioPAX Motivation

A common format will make data more accessible, promoting data sharing and distributed curation efforts

[Figure: connections among databases, applications and users before BioPAX and with BioPAX; >150 DBs and tools]

Ingenuity Systems - Analyze expression/other biological data in pathways/networks

Ingenuity Systems – example

Gene Ontology (GO)

http://www.geneontology.org/

Molecular Pathology and Disease Information Databases

Human disease databases:

• GeneCards

- Human genes, proteins and diseases db

- http://www.genecards.org/

• OMIM

- Online Mendelian Inheritance in Man

- http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM

COG/KOG

{COGNITOR/KOGNITOR}

Clusters of Orthologous Groups of proteins (COGs)

SAGE Database

Serial Analysis of Gene Expression

GEO (Gene Expression Omnibus)

http://www.ncbi.nlm.nih.gov/geo/

GEO record types:

• GPL - Platform descriptions

• GSM - Raw/processed spot intensities from a single slide/chip

• GSE - Grouping of slide/chip data, “a single experiment”

• GDS - Grouping of experiments, curated by NCBI

Submitted by experimentalists or by the manufacturer*

Search interfaces: Entrez GEO and Entrez GEO Datasets
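The four GEO record types (GPL, GSM, GSE, GDS) can be told apart by the prefix of the accession. A minimal sketch, assuming accessions of the form prefix plus digits; the accession numbers in the example are placeholders, not real records, and `geo_record_type` is a hypothetical helper.

```python
import re

# GEO accession prefixes and what each record type holds, as listed on the slide.
GEO_RECORD_TYPES = {
    "GPL": "platform description",
    "GSM": "raw/processed spot intensities from a single slide/chip",
    "GSE": "grouping of slide/chip data (a single experiment)",
    "GDS": "grouping of experiments, curated by NCBI",
}


def geo_record_type(accession: str) -> str:
    """Describe a GEO accession based on its three-letter prefix."""
    match = re.fullmatch(r"(GPL|GSM|GSE|GDS)\d+", accession.strip().upper())
    if match is None:
        raise ValueError(f"not a recognised GEO accession: {accession!r}")
    return GEO_RECORD_TYPES[match.group(1)]


for accession in ("GPL1", "GSM2", "GSE3", "GDS4"):  # placeholder accessions
    print(accession, "-", geo_record_type(accession))
```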

Submit and update data

Query the database:

• gene identifiers

• field information

• sequence

Browse datasets

Download data

Redesigned with new features

From Unigene: Hs.194143

Sequence and literature Search/Retrieval

• Entrez

• SRS

• ftp
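As one hedged illustration of scripted retrieval through Entrez, NCBI's E-utilities web service (not covered in these slides) accepts requests as plain URLs. The sketch below only builds such a URL and does not contact the server. The helper name `efetch_url` is hypothetical, and the choice of the `nuccore` database with FASTA text output is an assumption; the accession is the gvpA entry shown in the FASTA example.

```python
from urllib.parse import urlencode

# Base address of the EFetch utility (assumption: current public endpoint).
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession: str, db: str = "nuccore", rettype: str = "fasta") -> str:
    """Build an EFetch URL that asks for one record as plain text."""
    query = urlencode(
        {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    )
    return f"{EFETCH_BASE}?{query}"


print(efetch_url("Y00534.1"))
```

The resulting URL can be opened in a browser or passed to any HTTP client. For complete data sets, the ftp sites remain the more suitable route.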

Major sequence databases accessible through the Internet

1. GenBank - National Center for Biotechnology Information (NCBI), USA http://www.ncbi.nih.gov/Entrez/

2. European Molecular Biology Laboratory (EMBL) - European Bioinformatics Institute http://www.ebi.ac.uk/embl/index.html

3. DNA Data Bank of Japan (DDBJ) - Mishima, Japan http://www.ddbj.nig.ac.jp/

4. Protein Information Resource (PIR) - National Biomedical Research Foundation (NBRF), USA http://www-nbrf.georgetown.edu/pirwww/

5. Swiss-Prot - Swiss Institute of Bioinformatics http://www.expasy.org/cgi-bin/sprot-search-de

6. Sequence Retrieval System (SRS) - European Bioinformatics Institute http://srs6.ebi.ac.uk

Protein Structure Databases

• PDB (Protein Data Bank)

http://www.rcsb.org/pdb/

• Entrez Structure (NCBI)

ftp ftp.ncbi.nih.gov

ftp ftp.expasy.org

ftp ftp.ebi.ac.uk

ftp ftp.ddbj.nig.ac.jp

Retrieve complete sets of data

Retrieve Raw Sequencing Data from NCBI Trace Archive Database

Literature Searches

Entrez PubMed (NCBI)

Entrez PubMed Central (NCBI)

SRS (EMBL-EBI)

GoPubMed (Ontology-based literature search)

More biological databases can be found at Bioinformatics Web:

http://www.geocities.com/bioinformaticsweb/datalink.html
