Biological Databases
November 30, 2006
Wailap V. Ng
Institute of Biotechnology in Medicine
Institute of Bioinformatics
National Yang Ming [email protected]
• DNA (Deoxyribonucleic acid)
• RNA (Ribonucleic acid)
- mRNA (Messenger RNA)
- rRNA (Ribosomal RNA)
- tRNA (Transfer RNA)
• Proteins
- Enzymes
- Structural proteins
- Regulatory proteins
- Transporters
Macromolecules Related to Bioinformatics
DNA: A C G T G A A C C T
RNA: A C G U G A A C C U
Protein: G A V L I S T C D M E N Q R K F H Y P W
Nucleic acid and protein sequences store the essential bioinformation
DNA (A, C, G, T)
RNA (A, C, G, U)
Protein (20 amino acids)
Central Dogma
DNA → mRNAs → Proteins
DNA → DNA: Replication
DNA → mRNA: Transcription
mRNA → Protein: Translation
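The flow from DNA to mRNA to protein can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the input is assumed to be the coding (sense) strand, and the codon table holds just the four standard-genetic-code entries needed for the example.

```python
# Partial codon table (standard genetic code); only the codons used below.
CODON_TABLE = {
    "AUG": "M",  # start codon, methionine
    "AAA": "K",  # lysine
    "GGU": "G",  # glycine
    "UAA": "*",  # stop codon
}

def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T -> U)."""
    return coding_dna.upper().replace("T", "U")

def translate(mrna: str) -> str:
    """Translate codon by codon until a stop codon or the end of the sequence."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE.get(mrna[i:i + 3], "X")  # X = codon missing from this partial table
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

mrna = transcribe("ATGAAAGGTTAA")
print(mrna)             # AUGAAAGGUUAA
print(translate(mrna))  # MKG
```

A real translation routine would carry all 64 codons; the partial table keeps the sketch short.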
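Basic structure of a bacterial gene
[Figure: DNA (5’ to 3’) carrying a promoter (P) and a gene; transcription yields mRNA (5’ to 3’), and translation yields protein (N-terminus to C-terminus). The gene begins at the start codon (ATG) and ends at the stop codon (TAA).]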
A gene is a segment of DNA with an upstream start codon and a downstream stop codon that codes for the sequence of a polypeptide (protein)
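The definition above can be turned into a simple search for a start codon followed by an in-frame stop codon. The sketch below is an assumption-laden illustration, not a gene finder: `find_orf` is a hypothetical helper, it scans only the given strand, and it accepts TAG and TGA as stop codons in addition to the TAA named on the slide.

```python
# The slide shows TAA; TAG and TGA are the other standard stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orf(dna: str):
    """Return the first ATG...stop open reading frame on the given strand, or None."""
    dna = dna.upper()
    start = dna.find("ATG")
    while start != -1:
        # Read in-frame codons after the start codon until a stop codon appears.
        for i in range(start + 3, len(dna) - 2, 3):
            if dna[i:i + 3] in STOP_CODONS:
                return dna[start:i + 3]
        # No in-frame stop for this ATG; try the next one.
        start = dna.find("ATG", start + 1)
    return None

print(find_orf("CCATGGCGCAACCATAAGG"))  # ATGGCGCAACCATAA
```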
Information in Biological Databases
• DNA and protein sequences
• Protein structures
• Expression data (microarray, SAGE, etc.)
• Biological pathways
• Subcellular location of proteins
• Protein-protein interactions
• Molecular medicine
• Literature
• etc.
International Union of Pure and Applied Chemistry (IUPAC) codes for nucleotides and amino acids
IUPAC nucleotide code Base
A Adenine
C Cytosine
G Guanine
T (or U) Thymine (or Uracil)
R A or G (purine)
Y C or T (pyrimidine)
S G or C
W A or T
K G or T
M A or C
B C or G or T
D A or G or T
H A or C or T
V A or C or G
N any base
. or - gap
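One practical use of the ambiguity codes in the table above is pattern searching. The following sketch converts an IUPAC-coded pattern into a regular expression; the helper name `iupac_to_regex` is hypothetical, U is treated as T (an assumption made so that one table serves DNA patterns), and gap symbols are deliberately not handled.

```python
import re

# IUPAC nucleotide codes from the table above, mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def iupac_to_regex(pattern: str) -> str:
    """Convert an IUPAC-coded DNA pattern into a regular expression."""
    parts = []
    for code in pattern.upper():
        bases = IUPAC[code]  # raises KeyError for gaps or unknown symbols
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return "".join(parts)

regex = iupac_to_regex("GTYRAC")
print(regex)                                # GT[CT][AG]AC
print(bool(re.fullmatch(regex, "GTCGAC")))  # True
print(bool(re.fullmatch(regex, "GTAGAC")))  # False
```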
IUPAC amino acid code | Three letter code | Amino acid
A Ala Alanine
C Cys Cysteine
D Asp Aspartic Acid
E Glu Glutamic Acid
F Phe Phenylalanine
G Gly Glycine
H His Histidine
I Ile Isoleucine
K Lys Lysine
L Leu Leucine
M Met Methionine
N Asn Asparagine
P Pro Proline
Q Gln Glutamine
R Arg Arginine
S Ser Serine
T Thr Threonine
V Val Valine
W Trp Tryptophan
Y Tyr Tyrosine
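The table above maps directly onto a lookup dictionary. As a small sketch (the function name `to_three_letter` and the hyphen-separated output are illustrative choices, not a standard), a one-letter protein sequence can be spelled out in three-letter codes like this:

```python
# One-letter to three-letter amino acid codes, as listed in the table above.
THREE_LETTER = {
    "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe",
    "G": "Gly", "H": "His", "I": "Ile", "K": "Lys", "L": "Leu",
    "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln", "R": "Arg",
    "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp", "Y": "Tyr",
}

def to_three_letter(protein: str) -> str:
    """Spell out a one-letter protein sequence using three-letter codes."""
    return "-".join(THREE_LETTER[aa] for aa in protein.upper())

print(to_three_letter("MKG"))  # Met-Lys-Gly
```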
The Origins of Protein Sequence Databases
* Protein sequencing (Sanger and Tuppy, 1951)
• Atlas of Protein Sequence and Structure (Margaret Dayhoff and the National Biomedical Research Foundation (NBRF), 1965-1978)
• Protein Information Resource (PIR) (NBRF, 1984 - present)
• PIR-International Protein Sequence Database (NBRF, MIPS, and JIPID, 1988 – present)
The Origins of DNA Sequence Databases
* DNA double-helix structure (James Watson and Francis Crick, 1953)
* Recombinant DNA (Paul Berg et al., 1972)
* DNA sequencing (Maxam and Gilbert; Sanger - 1977)
• GenBank [Walter Goad et al., 1979 (prototype); 1982 -1992, LANL (Los Alamos National Lab.)]
• EMBL Data Library [1982 (1980) – present] – UK
• DDBJ [1986 (1984) – present] - Japan
http://www.infobiogen.fr/services/dbcat/ (Site closed)
Number of biological databases in 2005
Major Bioinformation Resources
• NCBI – National Institutes of Health
• EMBL – European Bioinformatics Institute
• DDBJ – National Institute of Genetics (Japan)
• Expasy – Swiss Institute of Bioinformatics
• GenomeNet – Kyoto University
NCBI molecular databases
Nucleotide Sequence Databases Consist of the Following Sequences:
• DNA fragments
• cDNA [Expressed Sequence Tags (EST) and full length cDNA sequences - partial and complete mRNA]
• Genomes
Nucleic acid sequences provide the fundamental starting point for describing and understanding the structure, function, and development of genetically diverse organisms.
Common Sequence File Formats
• Fasta
• GenBank (DNA) or GenPept (Protein)
Each sequence has at least one unique number to allow you to retrieve it from the public db – e.g. Accession Number, gi_number, gene_ID, protein_ID, locus name, etc.
>gi|43500|emb|Y00534.1|HHGVPA Halobacterium halobium gvpA gene for major gas vesicle protein
AAGCTTTACACTCTCCGTACTTAGAAGTACGACTCATTACAGGAGACATAACGACTGGTGAAACCATACACATCCTTATGTGATGCCCGAGTATAGTTAGAGATGGGTTAATCCCAGATCACCAATGGCGCAACCAGATTCTTCAGGCTTGGCAGAAGTCCTTGATCGTGTACTAGACAAAGGTGTCGTTGTGGACGTGTGGGCTCGTGTGTCGCTTGTCGGCATCGAAATCCTGACCGTCGAGGCGCGGGTCGTCGCCGCCTCGGTGGACACCTTCCTCCACTACGCAGAAGAAATCGCCAAGATCGAACAAGCCGAACTTACCGCCGGCGCGAGGCGGCACCCGAGGCCTGACGCACAGGCCTCCCTTCGGCCGGCGTAAGGGAGGTGAATCGCTTGCAAACCATACTTTAACACCT
TCTCGGGTAC
DNA sequence in FASTA format
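A FASTA record is a header line starting with ">" followed by one or more sequence lines, which makes it easy to parse. The sketch below is a minimal reader, assuming well-formed input; `parse_fasta` is a hypothetical helper, and the example record is the gvpA entry above truncated to its first 80 bases to keep the listing short.

```python
def parse_fasta(text: str):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.replace(" ", ""))
    if header is not None:
        yield header, "".join(chunks)

example = """>gi|43500|emb|Y00534.1|HHGVPA Halobacterium halobium gvpA gene for major gas vesicle protein
AAGCTTTACACTCTCCGTACTTAGAAGTACGACTCATTACAGGAGACATAACGACTGGTG
AAACCATACACATCCTTATG
"""

for header, seq in parse_fasta(example):
    ids = header.split()[0].split("|")
    # ids[1] is the gi number, ids[3] the accession.version
    print(ids[1], ids[3], len(seq))  # 43500 Y00534.1 80
```

The pipe-delimited header fields (`gi`, number, source database, accession, locus name) are read here purely by position, which holds for this record but is not guaranteed for every FASTA file.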
DNA sequence in GenBank format
Nucleotide Sequence Databases
• GenBank – NCBI (National Center for Biotechnology Information)
http://www.ncbi.nlm.nih.gov/
• EMBL (European Molecular Biology Laboratory) – EBI
http://www.ebi.ac.uk/
• DDBJ (DNA Data Bank of Japan) – NIG (National Institute of Genetics)
http://www.ddbj.nig.ac.jp/
When did the collaboration start?
In February, 1986, GenBank and EMBL began a collaborative effort [joined by DDBJ in 1987] to devise a common feature table format and common standards for annotation practice.
INSDC
International Nucleotide Sequence Database Collaboration
August 2005
National Center for Biotechnology Information (NCBI)
- Established in 1988
- Part of the National Library of Medicine, NIH, USA
- Creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information
- Host to the GenBank nucleotide sequence database since 1992 (1982 -1992, LANL)
NCBI Nucleotide Databases
• GenBank - INSDC collected DNA sequences
• RefSeq - a comprehensive, integrated, non-redundant set of sequences, for major research organisms
• dbEST - contains sequence data on "single-pass" cDNA sequences (Expressed Sequence Tags)
• UniGene - a non-redundant set of gene-oriented clusters automatically partitioned from GenBank sequences
• dbSTS - sequence & mapping data on short genomic landmark sequences or Sequence Tagged Sites (PCR primer pairs)
• UniSTS - a comprehensive db of STSs derived from STS-based maps and other experiments
NCBI Nucleotide Databases (continued)
• dbSNP – Single nucleotide polymorphism database
• dbGSS - Genome survey sequence database
• PopSet - a set of DNA sequences collected to analyze the evolutionary relatedness of a population
• TPA - Third party annotation sequences
• Nucleotide - Entrez Nucleotides database of GenBank, RefSeq, and PDB sequences
• Trace Archive – Raw DNA sequence trace files
• HomoloGene – A system for automated detection of homologs among the annotated genes of several completely sequenced eukaryotic genomes
• http://www.ncbi.nlm.nih.gov/Database/
NCBI Nucleotide Databases (continued)
• MGC (Mammalian Gene Collection)
cDNA Sequence Related Databases
dbEST
Unigene
TIGR THC
Full-Length cDNA Sequences
What is dbEST?
dbEST (Nature Genetics 4:332-3;1993) is a division of GenBank that contains sequence data and other information on "single-pass" cDNA sequences, or Expressed Sequence Tags, from a number of organisms.
[Figure: DNA → (transcription) → mRNA → (reverse transcription) → cDNA → (DNA sequencing) → EST]
cDNA sequencing is a powerful tool for quick identification of new genes
cDNA Sequence Related Databases
dbEST
Unigene
TIGR THC, Human Gene Index
Full-Length cDNA Sequences
[Figure: Gene → (transcription) → poly(A)-tailed mRNAs → (cDNA cloning, cDNA sequencing) → ESTs → (EST clustering) → Unigene → expression profile]
cDNA Sequence Related Databases
dbEST
Unigene
THC (Tentative human consensus sequences) - The Institute for Genomic Research (www.tigr.org)
Full-Length cDNA Sequences
[Figure: Gene → (transcription) → poly(A)-tailed mRNAs → (cDNA cloning) → full-length cDNA clone → (DNA sequencing) → ESTs → (sequence assembly) → full-length cDNA sequence]
http://hinv.ddbj.nig.ac.jp/
Protein Sequence Databases
Origins of Protein Sequences
Derived from:
• DNA fragment sequences
• mRNA sequences
• ESTs
• Genomes
Database name | Full name and/or description
NCBI Protein database | All protein sequences: translated from GenBank and imported from other protein databases
PIR-PSD | Protein Information Resource Protein Sequence Database, has been merged into the UniProt knowledgebase - Georgetown University
PIR-NREF | PIR's Non-redundant Reference protein database - Georgetown University
PRF | Protein research foundation database of peptides: sequences, literature and unnatural amino acids - Japan
Swiss-Prot | Now UniProt/Swiss-Prot: expertly curated protein sequence database, section of the UniProt knowledgebase - Swiss Institute of Bioinformatics
TrEMBL | Now UniProt/TrEMBL: computer-annotated translations of EMBL nucleotide sequence entries: section of the UniProt knowledgebase - SIB
UniProt | Universal protein knowledgebase: merged data from Swiss-Prot, TrEMBL and PIR protein sequence databases – GU, SIB, EMBL
UniRef | UniProt non-redundant reference database: clustered sets of related sequences (including splice variants and isoforms) – GU, SIB, EMBL
Protein sequence in FASTA format
Protein sequence in GenPept format – example 1
Protein sequence in GenPept format – example 2
Protein sequence in UniProt/Swiss-Prot format
Other Biological Databases
• Protein-Protein Interaction
• Gene Ontology
• Biological Pathways
• Protein structures
• Orthologs
• Gene expression
• Literature
Protein-Protein Interaction Databases
• Most proteins do not work alone in the cell
• Utilize the concept of ‘guilt by association’ to discover the functions of previously uncharacterized proteins
Figure 1: (A) An interaction map of the yeast proteome assembled from published interactions. The map contains 1,548 proteins and 2,358 interactions. Proteins are colored according to their functional role as defined by the Yeast Protein Database [16]; proteins involved in membrane fusion (blue), chromatin structure (gray), cell structure (green), lipid metabolism (yellow), and cytokinesis (red). For other maps with different functional groups highlighted, see <http://depts.washington.edu/sfields/>. On-line maps can also be zoomed and searched for protein names. (B) Section of part A showing the clustering of proteins involved in membrane fusion (blue), lipid metabolism (yellow), and cell structure (green).
Schwikowski et al. 2000. Nat. Biotech.
Protein interaction map of Drosophila melanogaster
Giot et al. Science 302:1727-36, 2003
7,048 proteins & 20,405 interactions
Integrated physical-interaction network. Nodes represent genes and are labeled with their corresponding gene names. Connections between nodes display physical interactions as recorded in the public databases, where a yellow arrow directed from one node to another represents a protein --> DNA interaction, and a blue line between nodes represents a protein-protein interaction. Global changes in mRNA expression (in this case, in response to a deletion of GAL4 in the presence of galactose) are visually superimposed on the network. The grayscale intensity of each node indicates the change in mRNA expression of the corresponding gene, where medium gray represents no change, darker or lighter shades represent an increase or decrease in expression, respectively, and node diameter scales with the overall magnitude of change. GAL4 is colored in red to signify that its expression level has been perturbed by external means. Highly interconnected groups of genes tend to have common biological function and are annotated accordingly (rectangular labels).
Ideker et al. Science 292:929, 2001
• Database of Interacting Proteins (DIP)
(http://dip.doe-mbi.ucla.edu/; UCLA)
• Biomolecular Interaction Network Database (BIND)
(http://bind.ca/; Mount Sinai Hospital, Canada)
• Human Protein Reference Database (HPRD)
(http://www.hprd.org/; Johns Hopkins University and the Institute of Bioinformatics)
• MIPS Mammalian Protein-Protein Interaction Database
(http://mips.gsf.de/proj/ppi/; Munich Information Center for Protein Sequences)
More can be found in http://mips.gsf.de/proj/ppi/
Database of Interacting Proteins (DIP)
The DIP™ database catalogs experimentally determined interactions between proteins. It combines information from a variety of sources to create a single, consistent set of protein-protein interactions. The data stored within the DIP database were curated both manually by expert curators and automatically, using computational approaches that draw on knowledge about protein-protein interaction networks extracted from the most reliable, core subset of the DIP data.
BIND Database
MIPS (Mammalian Protein-Protein Interaction) Database is a collection of manually curated high-quality PPI data collected from the scientific literature by expert curators. We took great care to include only data from individually performed experiments since they usually provide the most reliable evidence for physical interactions.
Human Protein Reference Database
A centralized platform to visually depict and integrate information pertaining to domain architecture, post-translational modifications, interaction networks and disease association for each protein in the human proteome. All the information in HPRD has been manually extracted from the literature by expert biologists who read, interpret and analyze the published data.
Biological Pathway Databases
• KEGG (GenomeNet)
• BioCarta (NCBI)
• BioPAX (Biological Pathway Exchange)
* Ingenuity Systems
* GeneGo
ARGININE AND PROLINE METABOLISM
Biocarta Pathways http://cgap.nci.nih.gov/Pathways/BioCarta_Pathways
http://www.biocarta.com/
http://www.biopax.org/
BioPAX Motivation
[Figure: data exchange before BioPAX and with BioPAX, linking databases, applications, and users across >150 DBs and tools]
Common format will make data more accessible, promoting data sharing and distributed curation efforts
Ingenuity Systems - Analyze expression/other biological data in pathways/networks
Ingenuity Systems – example
Gene Ontology (GO)
http://www.geneontology.org/
• GeneCards
• Human genes, proteins and diseases db
• http://www.genecards.org/
• OMIM
• Online Mendelian Inheritance in Man
• http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
Molecular Pathology and Disease Information Databases
Human disease database
GeneCards
OMIM
COG/KOG
{COGNITOR/KOGNITOR}
Clusters of Orthologous Groups of proteins (COGs)
SAGE Database
Serial Analysis of Gene Expression
GEO (Gene Expression Omnibus)
http://www.ncbi.nlm.nih.gov/geo/
GEO record types:
• GPL – Platform descriptions
• GSM – Raw/processed spot intensities from a single slide/chip
• GSE – Grouping of slide/chip data (“a single experiment”)
• GDS – Grouping of experiments
Records are curated by NCBI, submitted by experimentalists, or submitted by the manufacturer*.
Entrez GEO
Entrez GEO Datasets
• Submit and update data
• Query the database: gene identifiers, field information, sequence
• Browse datasets
• Download data
Redesigned with new features
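Because each GEO record type has its own accession prefix (GPL, GSM, GSE, GDS), an accession can be classified from its first three letters. The sketch below assumes accessions consist of the prefix followed by digits; `geo_record_type` is a hypothetical helper, and the accession numbers in the example are made up for illustration.

```python
import re

# GEO accession prefixes and the record types they denote.
GEO_RECORD_TYPES = {
    "GPL": "platform description",
    "GSM": "sample: raw/processed spot intensities from a single slide/chip",
    "GSE": "series: grouping of slide/chip data (a single experiment)",
    "GDS": "dataset: grouping of experiments",
}

def geo_record_type(accession: str) -> str:
    """Classify a GEO accession such as 'GSE1234' by its three-letter prefix."""
    match = re.fullmatch(r"(GPL|GSM|GSE|GDS)(\d+)", accession.strip().upper())
    if match is None:
        raise ValueError(f"not a GEO accession: {accession!r}")
    return GEO_RECORD_TYPES[match.group(1)]

print(geo_record_type("GDS1234"))  # dataset: grouping of experiments
print(geo_record_type("gpl96"))    # platform description
```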
From Unigene: Hs.194143
Sequence and literature Search/Retrieval
• Entrez
• SRS
• ftp
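Entrez retrieval can also be scripted. The slides do not cover this, so the following is an assumption: it uses NCBI's E-utilities `efetch` endpoint, which is a separate programmatic interface to Entrez, and only builds the request URL for the gvpA record (accession Y00534.1) rather than contacting the server. The helper name `efetch_url` is hypothetical.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities efetch service (not mentioned in the slides).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(accession: str, db: str = "nuccore", rettype: str = "fasta") -> str:
    """Build an efetch URL that returns one record as plain text."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return f"{EFETCH}?{urlencode(params)}"

url = efetch_url("Y00534.1")
print(url)
# To download the record, pass the URL to urllib.request.urlopen(url).
```

Building the URL separately from fetching it keeps the example runnable offline and makes the query parameters easy to inspect.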
Major sequence databases accessible through the Internet
1. GenBank - National Center for Biotechnology Information (NCBI), USA http://www.ncbi.nih.gov/Entrez/
2. European Molecular Biology Laboratory (EMBL) - European Bioinformatics Institute http://www.ebi.ac.uk/embl/index.html
3. DNA Data Bank of Japan (DDBJ) - Mishima, Japan http://www.ddbj.nig.ac.jp/
4. Protein Information Resource (PIR) - National Biomedical Research Foundation (NBRF), USA http://www-nbrf.georgetown.edu/pirwww/
5. Swiss-Prot - Swiss Institute for Experimental Cancer Research http://www.expasy.org/cgi-bin/sprot-search-de
6. Sequence Retrieval System (SRS) - European Bioinformatics Institute http://srs6.ebi.ac.uk
Protein Structure Databases
• PDB (Protein Data Bank)
http://www.rcsb.org/pdb/
• Entrez Structure (NCBI)
ftp ftp.ncbi.nih.gov
ftp ftp.expasy.org
ftp ftp.ebi.ac.uk
ftp ftp.ddbj.nig.ac.jp
Retrieve complete sets of data
Retrieve Raw Sequencing Data from NCBI Trace Archive Database
Literature Searches
Entrez PubMed (NCBI)
Entrez PubMed Central (NCBI)
SRS (EMBL-EBI)
GoPubMed (Ontology-based literature search)
http://www.geocities.com/bioinformaticsweb/datalink.html
More bio-db can be found in Bioinformatics web