Sequence databases and retrieval systems Guy Perrière [ replaced by Manolo Gouy ]

Sequence databases and retrieval systems

Guy Perrière

[ replaced by Manolo Gouy ]

Pôle Bio-Informatique Lyonnais
Laboratoire de Biométrie et Biologie Évolutive
UMR CNRS n° 5558
Université Claude Bernard – Lyon 1

In the beginning

First paper compilation in 1965 (Atlas of Protein Sequences).

Development of real databanks at the beginning of the 1980s:
– Fast access.
– Make possible analyses that require a lot of data:
  – Codon usage.
  – Molecular phylogeny.

General databanks

Nucleotide sequences: EMBL/GenBank/DDBJ.

Protein sequences:
– Simple translations of coding regions:
  – GenPept (from GenBank).
  – TrEMBL (from EMBL).
– Systems containing additional data:
  – SWISS-PROT.
  – PIR.

EMBL

Created in 1980 at the European Molecular Biology Laboratory in Heidelberg.

Maintained since 1994 at the European Bioinformatics Institute (EBI) near Cambridge.

Web server: http://www.ebi.ac.uk/embl

GenBank

Set up in 1979 at the Los Alamos National Laboratory in New Mexico, US.

Maintained since 1992 at the National Center for Biotechnology Information (NCBI) in Bethesda.

Web server: http://www.ncbi.nlm.nih.gov/Genbank/index.html

DDBJ

Active since 1984 at the National Institute of Genetics (NIG) in Mishima, Japan.

Web server: http://www.ddbj.nig.ac.jp

EMBL / GenBank / DDBJ

The International Nucleotide Sequence Database Collaboration: EMBL / GenBank / DDBJ.

New sequences are exchanged daily between the three centers, so the three banks have identical content.

Data are mainly provided by direct submissions from the authors through the Internet:
– Web forms.
– Email.

Data growth

[Figure: size of GenBank, EMBL, PIR and SWISS-PROT over time, from 03/83 to 03/03; vertical axis: log(number of residues), from 5 to 11.]

GenBank/EMBL size (April 2003)

31 × 10^9 nucleotides.

24 × 10^6 sequences.

1.8 million genes (proteins and RNA).

313,000 bibliographic references.

100 gigabytes on disk.

Growth of 63 % in 12 months.

Taxonomic sampling (April 2003)

There are 135,560 species for which at least one sequence is available.

Nine species (0.007 %) correspond to 62 % of the total.

77,900 species are represented by only one sequence!

The nine most represented species in GenBank/EMBL:

Species                    Share of total
Homo sapiens               27.3 %
Mus musculus               20.1 %
Zea mays                    3.0 %
Rattus norvegicus           2.9 %
Brassica oleracea           2.3 %
Arabidopsis thaliana        2.0 %
Danio rerio                 2.0 %
Drosophila melanogaster     1.4 %
Oryza sativa                0.9 %

Distribution format

The banks are distributed as a set of text files called divisions (292 for EMBL).

A division contains sequences related to:
– A taxon (e.g., bacteria, invertebrates, mammals).
– A class of sequences (EST, HTG, GSS).

Within a division, each sequence is called an entry.
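Since a division is a plain text file holding many entries, it can be read one entry at a time. The sketch below assumes the EMBL/GenBank flat-file convention that each entry ends with a line containing only `//`; the helper name `iter_entries` and the two toy entries are made up for illustration.

```python
from typing import Iterable, Iterator, List


def iter_entries(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one entry (a list of lines) at a time from a flat-file division."""
    entry: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip() == "//":      # end-of-entry terminator
            if entry:
                yield entry
            entry = []
        elif line.strip():            # skip blank lines between entries
            entry.append(line)
    if entry:                         # tolerate a missing final terminator
        yield entry


# Toy division with two made-up entries.
division = """ID   ENTRY1     standard; DNA; PRO; 10 BP.
SQ   Sequence 10 BP;
     acgtacgtac        10
//
ID   ENTRY2     standard; DNA; PRO; 4 BP.
SQ   Sequence 4 BP;
     ttga        4
//
"""
entries = list(iter_entries(division.splitlines()))
entry_names = [e[0].split()[1] for e in entries]
print(len(entries), entry_names)
```

Reading lazily with a generator matters here because real divisions are far too large to hold comfortably in memory as a single string.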

Entry structure

Information is introduced in structured fields.

The format differs in form between EMBL and GenBank/DDBJ, but not in substance.

ID, AC, SV and DT fields

Contain the identifiers and the creation and last-modification dates of an entry.

ID   BSAMYL     standard; DNA; PRO; 2680 BP.
XX
AC   V00101; J01547
XX
SV   V00101.1
XX
DT   13-JUL-1983 (Rel. 03, Created)
DT   12-NOV-1996 (Rel. 49, Last updated, Version 11)
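Because every line of an EMBL entry starts with a two-letter code, the header fields can be collected with very little code. This is a minimal sketch, not a full parser: `parse_line_codes` is a hypothetical helper, and it assumes the code occupies the first two characters of each line and that `XX` lines are only spacers.

```python
from collections import defaultdict
from typing import Dict, List


def parse_line_codes(entry_text: str) -> Dict[str, List[str]]:
    """Group the content of each line under its two-letter line code."""
    fields: Dict[str, List[str]] = defaultdict(list)
    for line in entry_text.splitlines():
        code, content = line[:2], line[2:].strip()
        if code == "XX" or not code.strip():   # spacer or blank line
            continue
        fields[code].append(content)
    return dict(fields)


header = """ID   BSAMYL     standard; DNA; PRO; 2680 BP.
XX
AC   V00101; J01547
XX
SV   V00101.1
XX
DT   13-JUL-1983 (Rel. 03, Created)
DT   12-NOV-1996 (Rel. 49, Last updated, Version 11)"""

fields = parse_line_codes(header)
entry_name = fields["ID"][0].split()[0]
accessions = [a.strip() for a in " ".join(fields["AC"]).split(";") if a.strip()]
print(entry_name, accessions, fields["SV"][0], len(fields["DT"]))
```

Keeping each code's lines in a list preserves repeated fields such as the two DT lines.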

DE, KW, OS and OC fields

Definition, Keywords, Taxonomy.

DE   Bacillus subtilis amylase gene.
XX
KW   amyE gene; amylase; amylase-alpha;
KW   regulatory region; signal peptide.
XX
OS   Bacillus subtilis
OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
OC   Bacillus/Staphylococcus group; Bacillus.

The NCBI maintains a unified taxonomy, largely based on sequence information.

RN, RX, RA and RT fields

Contain bibliographic information.

RN   [1]
RP   1-2680
RX   MEDLINE; 83143299.
RA   Yang M., Galizzi A., Henner D.J.;
RT   "Nucleotide sequence of the amylase gene from
RT   Bacillus subtilis";
RL   Nucleic Acids Res. 11:237-249(1983).
…

FT field

Contains the descriptions of functional regions.

     key             location and qualifiers
FT   promoter        369..374
FT                   /note="put. promoter sequence P2 [3] (amyR1)"
FT   RBS             414..419
FT                   /note="rRNA-binding site rbs-1 [3]"
FT   CDS             498..2480
FT                   /gene="amyE"
FT                   /db_xref="SWISS-PROT:P00691"
FT                   /product="alpha-amylase precursor"
FT                   /EC_number="3.2.1.1"
FT                   /protein_id="CAA23437.1"
FT                   /translation="MFAKRFKTSLLPLFAGFLLLFHLVLAGPAA
FT                   ASAETANKSNELTAPSIKSGTILHAWNWSFNTLKHNMKDIHDAG...

Intron/exon structure

FT   CDS             join(242..610,3397..3542,5100..5351)
FT                   /codon_start=1
FT                   /db_xref="SWISS-PROT:P01308"
FT                   /note="precursor"
FT                   /gene="INS"
FT                   /product="insulin"
...

[Figure: a sequence and the subsequence defined by the join(...) location.]
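A `join(...)` location lists the ranges that are concatenated to form the subsequence. The sketch below shows one way to turn such a location into coordinates and extract the corresponding bases. It is deliberately limited: it handles only plain `a..b` ranges and `join(...)`, and ignores `complement(...)`, partial-end markers and references to other entries. The function names and the toy sequence are made up for illustration.

```python
import re
from typing import List, Tuple


def parse_location(location: str) -> List[Tuple[int, int]]:
    """Return the (start, end) ranges, 1-based and inclusive, of a location.

    Only plain 'a..b' and 'join(a..b,c..d,...)' are handled in this sketch.
    """
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def extract(sequence: str, location: str) -> str:
    """Concatenate the ranges of `location` taken from `sequence`."""
    return "".join(sequence[start - 1:end] for start, end in parse_location(location))


exons = parse_location("join(242..610,3397..3542,5100..5351)")
print(exons)

# Made-up sequence: the two upper-case stretches play the role of exons.
toy = "ccATGAAAgtaagtTTTTAAcc"
spliced = extract(toy, "join(3..8,15..20)")
print(spliced)
```

The `start - 1` in the slice converts the 1-based, inclusive coordinates of the feature table to Python's 0-based, half-open indexing.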

SQ field

Contains the sequence itself.

SQ   Sequence 2680 BP; 825 A; 520 C; 642 G; 693 T; 0 other;
     gctcatgccg agaatagaca ccaaagaaga actgtaaaaa cgggtgaagc agcagcgaat        60
     agaatcaatt gcttgcgcct ttgcggtagt ggtgcttacg atgtacgaca gggggattcc       120
     ccatacattc ttcgcttggc tgaaaatgat tcttcttttt atcgtctgcg gcggcgttct       180
     gtttctgctt cggtatgtga ttgtgaagct ggcttacaga agagcggtaa aagaagaaat       240
     (...)
     gatggtttct tttttgttca taaatcagac aaaacttttc tcttgcaaaa gtttgtgaag      2580
     tgttgcacaa tataaatgtg aaatacttca caaacaaaaa gacatcaaag agaaacatac      2640
     cctgcaagga tgctgatatt gtctgcattt gcgccggagc                            2680
//
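The SQ header announces the length and base composition of the sequence, which gives a cheap consistency check when reading an entry. The sketch below assumes sequence lines made of lower-case bases in blocks, followed by a position counter; the function names and the 12-base toy entry are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, List


def read_sequence(sq_lines: List[str]) -> str:
    """Join the sequence lines, dropping spaces and the trailing position counters."""
    return "".join(re.sub(r"[^a-z]", "", line.lower()) for line in sq_lines)


def header_counts(sq_header: str) -> Dict[str, int]:
    """Read the per-base counts announced in the SQ header line."""
    return {base.lower(): int(n) for n, base in re.findall(r"(\d+) ([ACGT]);", sq_header)}


# Counts announced for the amylase entry: they should add up to its length.
real_header = "SQ   Sequence 2680 BP; 825 A; 520 C; 642 G; 693 T; 0 other;"
announced_total = sum(header_counts(real_header).values())
print(announced_total)

# Made-up 12-base entry used to compare announced and observed composition.
toy_header = "SQ   Sequence 12 BP; 4 A; 2 C; 3 G; 3 T; 0 other;"
toy_lines = ["     acgtacgtag at        12"]
sequence = read_sequence(toy_lines)
observed = {base: Counter(sequence)[base] for base in "acgt"}
print(sequence, observed == header_counts(toy_header))
```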

Errors in databanks

There are a lot of errors in the nucleotide sequence databanks:
– In annotations:
  – Inaccuracies, omissions, and even mistakes.
  – Inconsistencies between entries.
– In the sequences themselves:
  – Sequencing errors.
  – Cloning vectors inserted.

Redundancy

Another major problem is redundancy.

A lot of entries are partially or entirely duplicated:
– 20% of vertebrate sequences in GenBank.

Duplicated entries are often different in their sequence.

[Figure: partial and complete sequence duplications.]
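The difference between complete and partial duplication can be illustrated with exact string comparison. This is only a toy sketch with made-up entries: since duplicated entries often differ in their sequence, exact matching would miss many real cases, and a similarity search would be needed in practice.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def find_duplicates(seqs: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Report pairs of entries that are identical ('complete') or where one
    sequence is an exact substring of the other ('partial')."""
    found = []
    for (name_a, a), (name_b, b) in combinations(sorted(seqs.items()), 2):
        a, b = a.lower(), b.lower()
        if a == b:
            found.append((name_a, name_b, "complete"))
        elif a in b or b in a:
            found.append((name_a, name_b, "partial"))
    return found


toy = {
    "X1": "atggcgtacgttagc",
    "X2": "atggcgtacgttagc",   # same sequence submitted twice
    "X3": "gcgtacgt",          # fragment of X1 and X2
    "X4": "ttttccccgggg",      # unrelated
}
duplicates = find_duplicates(toy)
print(duplicates)
```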

Protein sequence databases

Translation of Coding DNA Sequences (CDS) from EMBL/GenBank/DDBJ.

Consultation of publications or patents.

Very small number of direct protein sequence submissions by authors.

In SWISS-PROT and PIR: additional annotations.
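Translating a CDS into a protein sequence amounts to reading it codon by codon. The sketch below uses the standard genetic code only; it ignores alternative genetic codes (used by some organisms and organelles) and the /codon_start qualifier. The test CDS is made up, with codons chosen arbitrarily.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cds: str) -> str:
    """Translate a CDS with the standard genetic code, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")   # 'X' for codons with ambiguous bases
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


protein = translate("atgtttgcaaaacgattcaaaacctcttaa")   # made-up CDS
print(protein)
```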

SWISS-PROT

Created by Amos Bairoch in 1986 at the Department of Medical Biochemistry in Geneva.

Maintained by the Swiss Institute of Bioinformatics (SIB) and funded by GeneBio, and, very recently, by NIH.

Web server: http://www.expasy.ch/sprot/sprot-top.html

SWISS-PROT characteristics

Almost no redundancy.

Cross-references with 60 other databanks.

High-quality annotations:
– Systematic control by a team of annotators.
– Help from a set of more than 200 volunteer experts.

Embedded in ExPASy, a WWW proteomics server (http://www.expasy.org).



Annotations

Protein function.

Post-translational modifications.

Structural or functional domains.

Secondary and quaternary structures.

Similarities with other proteins.

Conflicts between positions for CDS.

Disease-related mutations.

Associated databanks

TrEMBL, built using only annotated CDS from the EMBL data library.

ENZYME, for the international enzyme nomenclature.

PROSITE, for biologically significant sites, patterns and profiles.

SWISS-2DPAGE, for two-dimensional polyacrylamide gel electrophoresis maps.

PIR

PIR (The Protein Information Resource) was created by Margaret Dayhoff in 1965.

Aims:
– To provide exhaustive and non-redundant protein sequence data.
– To give a classification using taxonomic and similarity data: entries grouped in superfamilies, families and subfamilies.

Data maintenance

Three organizations collect and organize the data introduced in PIR:
– The National Biomedical Research Foundation (NBRF) in the United States.
– The Martinsried Institute for Protein Sequences (MIPS) in Germany.
– The Japan International Protein Sequence Information Database (JIPID) in Japan.

Results

Coverage is no better than that obtained with SWISS-PROT + TrEMBL.

Still contains redundancy.

Less comprehensive annotation.

Low number of cross-references.

PIR has recently joined forces with EBI and SIB to establish UniProt (United Protein Databases), the central resource of protein sequence and function.

Specialized databanks

A lot of specialized databanks have been developed, which are devoted to:
– Complete genomes.
– Families of homologous genes.
– Non-sequence data.

These systems are under the responsibility of curators:
– Data quality and homogeneity control.

Complete genomes

There is a large number of databanks devoted to specific organisms.

These banks are associated with sequencing or mapping projects.

For some model organisms there are often several competing systems.

Examples

Organism                    Available databanks
Bacillus subtilis           NRSub (Non-Redundant B. subtilis); SubtiList
Escherichia coli            Colibri; EcoGene (E. coli Gene Database); ECDC (E. coli Database Collection)
Various prokaryotes         CMR (Comprehensive Microbial Resource); EMGLib (Enhanced Microbial Genomes Library); Micado (Microbial Advanced Database Organization)
Saccharomyces cerevisiae    MYGD (MIPS Yeast Genome Database); SGD (Saccharomyces Genome Database); YPD (Yeast Proteome Database)
Drosophila melanogaster     FlyBase
Plasmodium falciparum       PlasmoDB (P. falciparum Database)
Caenorhabditis elegans      WormBase; WormPD (Worm Protein Database)
Arabidopsis thaliana        TAIR (The Arabidopsis Information Resource)

Gene family databanks

Built with automated procedures:
– Similarity search between sets of proteins (BLASTP, FASTP, Smith-Waterman).
– Clustering into homologous families using similarity criteria.

Include various data:
– Protein (and sometimes nucleotide) sequences.
– Multiple sequence alignments and trees.
– Taxonomy.
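The two automated steps above, pairwise similarity search followed by clustering, can be sketched with single-linkage clustering over precomputed similarity scores. This is an illustration only: the slides do not say which clustering rule each databank uses, and the protein names, scores and threshold below are made up.

```python
from typing import Dict, Iterable, List, Tuple


def cluster_families(
    proteins: Iterable[str],
    similarities: Iterable[Tuple[str, str, float]],
    threshold: float,
) -> List[List[str]]:
    """Single-linkage clustering: two proteins end up in the same family if a
    chain of pairs with similarity >= threshold connects them."""
    parent: Dict[str, str] = {p: p for p in proteins}

    def find(p: str) -> str:
        while parent[p] != p:
            parent[p] = parent[parent[p]]   # path halving
            p = parent[p]
        return p

    for a, b, score in similarities:
        if score >= threshold:
            parent[find(a)] = find(b)       # merge the two families

    families: Dict[str, List[str]] = {}
    for p in parent:
        families.setdefault(find(p), []).append(p)
    return sorted(sorted(members) for members in families.values())


# Made-up pairwise similarities (percent), as a similarity search might report.
pairs = [("P1", "P2", 82.0), ("P2", "P3", 55.0), ("P3", "P4", 30.0), ("P4", "P5", 91.0)]
families = cluster_families(["P1", "P2", "P3", "P4", "P5"], pairs, threshold=50.0)
print(families)
```

Single linkage is the simplest choice but it can chain distantly related proteins together through intermediates, which is one reason stricter similarity criteria are sometimes preferred.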

ProtFam

Developed at MIPS.

Built with PIR sequences.

Includes four levels of classification:
– Superfamilies (based on function and similarity criteria).
– Families (50% similarity).
– Subfamilies (80% similarity).
– Entries (≥95% similarity).

ProtFam characteristics

Allows visualization of alignments and dendrograms for the families.

Integrates Pfam domains.

Allows users to classify their own protein sequences.

Web server: http://mips.gsf.de

ProtoMap

Initially developed at the Hebrew University of Jerusalem; now hosted at Cornell University.

Built with SWISS-PROT & TrEMBL sequences.

Combines 3 sequence similarity measures (BLASTP, FASTA and Smith-Waterman).

ProtoMap characteristics

Alignments and trees are visualized with Java applets.

Users can submit sequences and classify them.

Web server: http://protomap.cornell.edu/index.html

Specialized systems

HOVERGEN (Homologous Vertebrate Genes Database): Based on GenBank CDS.

HOBACGEN (Homologous Bacterial Genes Database) for prokaryotes and yeast: Based on SWISS-PROT/TrEMBL.

HOBACGEN-CG for completely sequenced genomes: Based on SWISS-PROT/TrEMBL.

Other specialized systems

COG (Clusters of Orthologous Groups), also for complete genomes: Based on GenBank CDS.

NuReBase (Nuclear Receptors Database) for mammalian nuclear receptors: Based on EMBL CDS.

RTKdb (Tyrosine Kinase Receptors): Based on EMBL CDS.

Are COGs real orthologs?

Reciprocal best BLAST hit.

[Figure: tree of glutamate synthase large subunit sequences (including GLTB_ECOLI, GLTB_BACSU, GLTB_SYNY3, GLTS_SYNY3 and GLT1_YEAST) with node support values; species legend: Escherichia coli, Bacillus subtilis, Pseudomonas aeruginosa, Vibrio cholerae, Synechocystis sp.]
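The reciprocal best hit criterion named on this slide can be sketched as follows: gene a of genome A and gene b of genome B are paired when b is the best hit of a and a is the best hit of b. The hit lists, gene names and scores below are made up, and the sketch says nothing about how any particular databank applies the criterion.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, str, float]   # (query, subject, score)


def best_hits(hits: List[Hit]) -> Dict[str, str]:
    """Map each query to its highest-scoring subject."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: List[Hit], b_vs_a: List[Hit]) -> List[Tuple[str, str]]:
    """Return the (a, b) pairs that are each other's best hit."""
    forward, backward = best_hits(a_vs_b), best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Made-up similarity search results between two genomes.
a_vs_b = [("a1", "b1", 900.0), ("a1", "b2", 400.0), ("a2", "b1", 500.0), ("a2", "b2", 450.0)]
b_vs_a = [("b1", "a1", 880.0), ("b1", "a2", 510.0), ("b2", "a2", 300.0), ("b2", "a1", 420.0)]
rbh = reciprocal_best_hits(a_vs_b, b_vs_a)
print(rbh)
```

In this toy case only a1 and b1 are paired: a2 and b2 each point to a gene whose own best hit lies elsewhere. A pair that passes the test is still not guaranteed to be orthologous, which is the question the slide raises.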

Beyond protein families

ProtFam, HOVERGEN, HOBACGEN and COGs gather protein sequences that are homologous over their whole length.

Patterns, profiles, domains, … are covered in Terry Attwood’s lecture.

Non-sequence data

Data                      Available systems
Gene expression           GXD (Mouse Gene Expression Database); The Stanford Microarray Database
Mapping                   GDB (Genome Data Base); EMG (Encyclopedia of Mouse Genome); MGD (Mouse Genome Database); INE (Integrated Rice Genome Explorer)
Protein quantification    SWISS-2DPAGE; PDD (Protein Disease Database); Sub2D (B. subtilis 2D Protein Index)
3D structures             PDB (Protein Data Bank); MMDB (Molecular Modelling Data Base); NRL_3D (Non-Redundant Library of 3D Structures); SCOP (Structural Classification of Proteins)
Polymorphism              ALFRED (Allele Frequency Database)
Molecular interactions    DIP (Database of Interacting Proteins); BIND (Biomolecular Interaction Network Database)

Sequence Data retrieval

Made mainly through Internet access:
– With client software (e.g., Entrez, HobacFetch).
– By remote connections to servers providing on-line access to the banks (INFOBIOGEN).
– Using World-Wide Web servers and browsers.

Advantages and limitations

Users do not have to cope with the usual database problems:
– Storing large amounts of data.
– Daily updates.
– Software upgrades.

Simplicity of use.

Net access is sometimes very slow at peak hours: consider using servers other than NCBI.

The ACNUC retrieval system

Direct access to functional regions described in feature tables (CDS, tRNA, rRNA).

Selection of entries using various criteria:
– Sequence names and accession numbers.
– Bibliographic criteria.
– Keywords.
– Taxonomy.
– Organelle.

Developed at Lyon University

ACNUC: possible accesses

Graphical interface distributed along with the databases themselves: http://pbil.univ-lyon1.fr/databases/acnuc.html

Web access at Pôle Bio-Informatique Lyonnais (PBIL): http://pbil.univ-lyon1.fr/search/query.html

ACNUC characteristics

Can query any bank in PIR, SWISS-PROT, EMBL, or GenBank format.

Keywords and species browsing.

Complex queries.

Links with sequence analysis programs on the Web server (alignment, codon usage).

The Query form

Building queries to the sequence databases

Locally save the received sequence data.

Retrieving sequences

Browsing the species trees

HOVERGEN: families of homologous vertebrate genes

Access to family members

Download tree or alignment

SRS

Public version developed at EMBL by Etzold and Argos (1993).

Presently available on the different Web servers belonging to EMBnet:
– EBI (England).
– INFOBIOGEN (France).
– DKFZ (Germany).
– …

Characteristics

Database index built with the use of ODD (Object Design and Definition).

More than 250 databanks have been indexed and are accessible through 35 SRS servers.

Allows queries to operate simultaneously on different banks.

Databanks interconnection

[Figure: network of links between databanks: SWISS-PROT, SWISSNEW, TrEMBL, TrEMBLNEW, PIR, EMBL, EMNEW, GenBank, ENZYME, PDB, PDBFINDER, HSSP, DSSP, FSSP, ALI, NRL_3D, PMD, YPD, YPDREF, ProtFam, FlyGene, TFSITE, TFACTOR, ECDC, EPD, MOLPROBE, OMIM, MIMMAP, REBASE, PROSITE, PROSITEDOC, ProDom, Blocks, SWISSDOM.]

Entrez

Developed by Schuler et al. (1996) at NCBI.

Queries several US-made databases: GenBank, GenPept, NR, MMDB, MEDLINE.

Access through client software (Unix, Mac or Windows) or Web server: http://www.ncbi.nlm.nih.gov

Characteristics

Introduces the concept of neighbours between sequences, references and structures.

Sequence neighbours are established using similarity criteria.

No access to multiple alignments.

[Figure: Entrez links between Nucl. Seq. (GenBank), Prot. Seq. (GenPept), Structures (MMDB), Refs. (PubMed), Phylogeny (Taxman) and Complete Genomes.]

NAR 2003 database issue

http://nar.oupjournals.org/content/vol31/issue1/
