
Page 1: Jacques.van-Helden@univ-amu.frpedagogix-tagc.univ-mrs.fr/courses/bioinfo_intro/pdf... · 2014-05-09 · Okubo et al. (2006) NAR 34: D6-D9 Nucleic sequence databases ! Genbank (April

Biomolecular databases

Bioinformatics

Jacques van Helden

FORMER ADDRESS (1999-2011)
Université Libre de Bruxelles, Belgique
Bioinformatique des Génomes et des Réseaux (BiGRe lab)
http://www.bigre.ulb.ac.be/

NEW ADDRESS (since Nov 1st, 2011)
Jacques.van-Helden@univ-amu.fr
Université d'Aix-Marseille, France
Lab. Technological Advances for Genomics and Clinics (TAGC, INSERM Unit U1090)
http://tagc.univ-mrs.fr/

Contents

- Examples of biological databases
  - Nucleic sequences: GenBank, EMBL, and DDBJ
  - Protein sequences: UniProt
  - The Gene Ontology (GO) project
- Issues and perspectives for biological databases

Examples of biomolecular databases

Biomolecular Databases

Université Libre de Bruxelles, Belgique

Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe) http://www.bigre.ulb.ac.be/

Examples of biomolecular databases

- Sequence and structure databases
  - Protein sequences (UniProt)
  - DNA sequences (EMBL, GenBank, DDBJ)
  - 3D structures (PDB)
  - Structural motifs (CATH)
  - Sequence motifs (PROSITE, PRODOM)
- Genome sequences and annotations
  - Genome-specific databases (SGD, FlyBase, AceDB, PlasmoDB, ...)
  - Multiple genomes (Integr8, NCBI, KEGG, TIGR, ...)
- Molecular functions
  - Transcriptional regulation (TRANSFAC, RegulonDB, InteractDB)
  - Enzymatic catalysis (Expasy, LIGAND/KEGG, BRENDA)
  - Transport (YTPdb)
- Biological processes
  - Metabolic pathways (EcoCyc, LIGAND/KEGG, Biocatalysis/biodegradation)
  - Signal transduction pathways (CSNdb, Transpath)
  - Protein-protein interactions (DIP, BIND, MINT)
  - Gene networks (GeneNet, FlyNets)

Databases of databases

- There are hundreds of databases related to molecular biology and biochemistry, and new ones are created every year.
- Every year, the first issue of Nucleic Acids Research is dedicated to biological databases.
  - http://nar.oupjournals.org/
  - 2011 issue: http://nar.oxfordjournals.org/content/39/suppl_1
- The same journal maintains a database of databases: the Molecular Biology Database Collection.
  - http://www.oxfordjournals.org/nar/database/c/
- Some bioinformatics centres maintain multiple databases, with cross-links between them. The SRS server at EBI holds an impressive collection of databases.
  - http://srs.ebi.ac.uk/

Nucleic sequence databases: GenBank, EMBL, and DDBJ


Okubo et al. (2006) NAR 34: D6-D9

Nucleic sequence databases

- To publish an article dealing with a sequence, scientific journals require that the sequence first be deposited in a reference database.
- There are 3 main repositories for nucleic acid sequences: GenBank, EMBL, and DDBJ.
- Sequences deposited in any of these 3 databases are automatically synchronized to the other 2.

Adapted from Didier Gonze

The sequencing pace

- Nucleic sequences
  - GenBank (April 2011) http://www.ncbi.nlm.nih.gov/genbank/
    - 126,551,501,141 bases in 135,440,924 sequence records in the traditional GenBank divisions
    - 191,401,393,188 bases in 62,715,288 sequence records in the Whole Genome Shotgun (WGS) division
- Entire genomes
  - GOLD Release V.2 (Oct 2011) contains ~2000 completely sequenced genomes.
  - http://www.genomesonline.org/gold_statistics.htm
- Protein sequences
  - Essentially obtained by translating putative genes found in nucleic sequences (almost no direct protein sequencing).
  - UniProtKB/TrEMBL (2011) contains 17 million protein sequences.
  - http://www.ebi.ac.uk/swissprot/sptr_stats/index.html

Size of the nucleotide database

EMBL Nucleotide Sequence Database: Release Notes - Release 113, September 2012
http://www.ebi.ac.uk/embl/Documentation/Release_notes/current/relnotes.html

Class                                        entries      nucleotides
---------------------------------------------------------------------
CON: Constructed                           7,236,371  359,112,791,043
EST: Expressed Sequence Tag               73,715,376   40,997,082,803
GSS: Genome Sequence Scan                 34,528,104   21,985,922,905
HTC: High Throughput CDNA sequencing         491,770      594,229,662
HTG: High Throughput Genome sequencing       152,599   25,159,746,658
PAT: Patents                              24,364,832   12,117,896,594
STD: Standard                             13,920,617   37,665,112,606
STS: Sequence Tagged Site                  1,322,570      636,037,867
TSA: Transcriptome Shotgun Assembly        8,085,693    5,663,938,279
WGS: Whole Genome Shotgun                 88,288,431  305,661,696,545
                                         -----------  ---------------
Total                                    252,106,363  450,481,663,919

Division                                     entries      nucleotides
---------------------------------------------------------------------
ENV: Environmental Samples                30,908,230   14,420,391,278
FUN: Fungi                                 6,522,586   11,614,472,226
HUM: Human                                32,094,500   38,072,362,804
INV: Invertebrates                        31,907,138   52,527,673,643
MAM: Other Mammals                        40,012,731  145,678,620,711
MUS: Mus musculus                         11,745,671   19,701,637,499
PHG: Bacteriophage                             8,511       85,549,111
PLN: Plants                               52,428,994   55,570,452,118
PRO: Prokaryotes                           2,808,489   28,807,572,238
ROD: Rodents                               6,554,012   33,326,106,733
SYN: Synthetic                             4,045,013      782,174,055
TGN: Transgenic                              285,307      849,743,891
UNC: Unclassified                          8,617,225    4,957,442,673
VRL: Viruses                               1,358,528    1,518,575,082
VRT: Other Vertebrates                    22,809,428   42,568,889,857
                                         -----------  ---------------
Total                                    252,106,363  450,481,663,919

Genbank (NCBI - USA) http://www.ncbi.nlm.nih.gov/Genbank/

The EMBL Nucleotide Sequence Database (EBI - UK) http://www.ebi.ac.uk/embl/

DDBJ - DNA Data Bank of Japan http://www.ddbj.nig.ac.jp/


Size of the nucleic sequence databases

Database  URL                           Sequences  Bases (without shotgun)  Bases (including shotgun)  Organisms
DDBJ      http://www.ddbj.nig.ac.jp/    2.0E+06    1.7E+09
EMBL      http://www.ebi.ac.uk/embl/                                        1.0E+11                    2.0E+05
GenBank   http://www.ncbi.nlm.nih.gov/  4.6E+07    5.1E+10                  1.0E+11                    2.1E+05

- Summary of database contents for the 3 main databases of nucleic sequences.
- Source: NAR database issue, January 2006.

UniProt : protein sequences and functional annotations


UniProt - the Universal Protein Resource http://www.uniprot.org/

- Database content (Sept 2012)
  - UniProtKB:
    - 24,532,088 entries
    - Translation of EMBL coding sequences (non-redundant with Swiss-Prot)
  - UniProtKB/Swiss-Prot section (reviewed):
    - 537,505 entries
    - annotation by experts
    - high information content
    - many references to the literature
    - good reliability of the information
  - The rest (about 98% of the entries, given the counts above)
    - Automatic annotation by sequence similarity.
- Features
  - The most comprehensive protein database in the world.
  - A huge team: >100 annotators + developers.
  - Annotation by experts: annotators are specialized in different types of proteins or organisms.
  - Recognized worldwide as an essential resource.
- References
  - Bairoch et al. The SWISS-PROT protein sequence data bank. Nucleic Acids Res (1991) vol. 19 Suppl pp. 2247-9
  - The UniProt Consortium. The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res (2008). Database Issue.

Number of entries (polypeptides) in Swiss-Prot

http://www.expasy.org/sprot/relnotes/relstat.html
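The weight of the expert-reviewed section can be checked with simple arithmetic on the two entry counts quoted for September 2012 (537,505 Swiss-Prot entries and 24,532,088 UniProtKB entries). The sketch below is only that arithmetic. The slides do not say whether the larger figure already includes Swiss-Prot, so both readings are computed; the function and variable names are illustrative.

```python
# Entry counts quoted on the UniProt slide (Sept 2012).
SWISSPROT_ENTRIES = 537_505
UNIPROTKB_ENTRIES = 24_532_088


def reviewed_share(reviewed, other, other_includes_reviewed):
    """Fraction of entries that are expert-reviewed.

    Assumption: `other` is either the grand total (True) or the count of
    automatically annotated entries only (False).
    """
    total = other if other_includes_reviewed else other + reviewed
    return reviewed / total


for includes in (True, False):
    share = reviewed_share(SWISSPROT_ENTRIES, UNIPROTKB_ENTRIES, includes)
    print(f"includes Swiss-Prot={includes}: "
          f"reviewed {share:.1%}, automatic {1 - share:.1%}")
```

Under either reading, the reviewed section is roughly 2% of the entries, and the rest relies on automatic annotation.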

Taxonomic distribution of the sequences

Within Eukaryotes

UniProt example - Human Pax-6 protein Header : name and synonyms

UniProt example - Human Pax-6 protein Human-based annotation by specialists

UniProt example - Human Pax-6 protein Structured annotation : keywords and Gene Ontology terms


UniProt example - Human Pax-6 protein Protein interactions; Alternative products

UniProt example - Human Pax-6 protein Detailed description of regions, variations, and secondary structure

UniProt example - Human Pax-6 protein Peptidic sequence

UniProt example - Human Pax-6 protein References to original publications

UniProt example - Human Pax-6 protein Cross-references to many databases (fragment shown)

3D Structure of macromolecules


PDB - The Protein Data Bank http://www.rcsb.org/pdb/

Genome browsers


EnsEMBL Genome Browser (Sanger Institute + EBI) http://www.ensembl.org/

UCSC Genome Browser (University California Santa Cruz - USA) http://genome.ucsc.edu/

Human gene Pax6 aligned with Vertebrate genomes

UCSC Genome Browser (University California Santa Cruz - USA) http://genome.ucsc.edu/

Drosophila gene eyeless (homolog to Pax6) aligned with Insect genomes

UCSC Genome Browser (University California Santa Cruz - USA) http://genome.ucsc.edu/

Drosophila 120kb chromosomal region covering the Achaete-Scute Complex


ECR Browser http://ecrbrowser.dcode.org/

EnsEMBL - Example: Drosophila gene Pax6 http://www.ensembl.org/

Comparative genomics


Integr8 - access to complete genomes and proteomes http://www.ebi.ac.uk/integr8/

Integr8 - genome summaries http://www.ebi.ac.uk/integr8/

Integr8 - clusters of orthologous genes (COGs) http://www.ebi.ac.uk/integr8/


Integr8 - clusters of paralogous genes http://www.ebi.ac.uk/integr8/

Databases of protein domains


Prosite - protein domains, families and functional sites http://www.expasy.ch/prosite/

Prosite - aligned sequences and logo http://www.expasy.ch/prosite/

- Some of the sequences that were used to build the Prosite profile for the Zn(2)-C6 fungal-type DNA-binding domain (ZN2_CY6_FUNGAL_2, PS50048).
- The sequence logo (below) indicates the level of conservation of each residue in each column of the alignment.
- Note the 6 cysteines, characteristic of this domain.
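One common way to quantify the conservation that a sequence logo displays is the information content of each alignment column: the maximum possible entropy minus the observed entropy. The sketch below assumes a 20-letter protein alphabet, skips gaps, and applies no small-sample correction. The alignment is a made-up toy example, not the actual PS50048 sequences.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information content (bits) of one column: log2(alphabet) - entropy."""
    residues = [r for r in column if r != "-"]  # gaps are simply skipped
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


# Hypothetical alignment: columns 1 and 5 are fully conserved cysteines.
alignment = ["CRKAC", "CKRSC", "CRKTC", "CKKGC"]

for position, column in enumerate(zip(*alignment), start=1):
    bits = column_information(column)
    print(position, "".join(column), f"{bits:.2f} bits")
```

Fully conserved columns reach the maximum of log2(20) bits, which is why invariant cysteines stand out as the tallest letters in a logo.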

Prosite - Example of profile matrix http://www.expasy.ch/prosite/

Prosite - Example of sequence logo http://www.expasy.ch/prosite/


Prosite - Example of domain signature http://www.expasy.ch/prosite/

- The domain signature is a string-based pattern representing the residues that are characteristic of a domain.
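Because a signature is a string-based pattern, it can be matched against sequences like a regular expression. The sketch below translates the PROSITE pattern syntax (`x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repeats, `<` and `>` for sequence ends) into a Python regex. The pattern and the sequence are simplified illustrations of a six-cysteine spacing, not the actual PROSITE entry for this domain, and `prosite_to_regex` is a hypothetical helper.

```python
import re


def prosite_to_regex(pattern):
    """Translate PROSITE pattern syntax into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P}    -> [^P]
        element = element.replace("(", "{").replace(")", "}")   # x(2,4) -> x{2,4}
        element = element.replace("x", ".")                     # x      -> any residue
        element = element.replace("<", "^").replace(">", "$")   # sequence ends
        parts.append(element)
    return "".join(parts)


# Illustrative pattern only: six cysteines separated by spacers.
SIMPLIFIED_PATTERN = "C-x(2)-C-x(6)-C-x(5,12)-C-x(2)-C-x(6,8)-C"
regex = prosite_to_regex(SIMPLIFIED_PATTERN)

# Made-up sequence, written piece by piece so the cysteine spacing is visible.
sequence = ("MS" + "C" + "DI" + "C" + "RLKKLK" + "C" + "SKEKPK"
            + "C" + "AK" + "C" + "LKNNWE" + "C" + "RYS")

match = re.search(regex, sequence)
print(regex)
print(match.group(0) if match else "no match")
```

The replacement order matters: braces are converted before parentheses, otherwise the repeat counts produced in the second step would be rewritten as exclusion classes.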

PFAM (Sanger Institute - UK) http://pfam.sanger.ac.uk/ Protein families represented by multiple sequence alignments and hidden Markov models (HMMs)

CATH - Protein Structure Classification http://www.cathdb.info/

- CATH is a hierarchical classification of protein domain structures, which clusters proteins at four major levels:
  - Class (C)
  - Architecture (A)
  - Topology (T)
  - Homologous superfamily (H)
- The boundaries and assignments for each protein domain are determined using a combination of automated and manual procedures which include computational techniques, empirical and statistical evidence, literature review and expert analysis.
- References
  - Orengo et al. The CATH Database provides insights into protein structure/function relationships. Nucleic Acids Res (1999) vol. 27 (1) pp. 275-9
  - Cuff et al. The CATH classification revisited--architectures reviewed and new ways to characterize structural divergence in superfamilies. Nucleic Acids Res (2008) pp.

CATH - Protein Structure Classification http://www.cathdb.info/

InterPro (EBI - UK) http://www.ebi.ac.uk/interpro/

InterPro (EBI - UK) Antennapedia-like Homeobox (entry IPR001827)


The Gene Ontology (GO) database


Ontology definition

- Ontologie: partie de la métaphysique qui s'intéresse à l'être en tant qu'être, indépendamment de ses déterminations particulières
- Ontology: the part of metaphysics that focuses on being as being, independently of its particular determinations
  Source: Le Petit Robert - dictionnaire alphabétique et analogique de la langue française. 1993.

The "bio-ontologies"

- Answer to the problem of inconsistencies in the annotations
  - Controlled vocabulary
  - Hierarchical classification between the terms of the controlled vocabulary
- E.g.: The Gene Ontology
  - molecular function ontology
  - process ontology
  - cellular component ontology
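The combination of a controlled vocabulary and a hierarchy between its terms can be sketched as a small graph of "is a" links. The example below is a toy hierarchy with hypothetical gene products (`protA`, `protB`, `protC`); it is not an excerpt of the Gene Ontology, and the function names are illustrative.

```python
# Toy controlled vocabulary: each term lists its direct parents ("is a").
IS_A = {
    "plasma membrane": ["membrane"],
    "membrane": ["cellular component"],
    "nucleus": ["organelle"],
    "organelle": ["cellular component"],
    "cellular component": [],
}

# Hypothetical annotations: gene product -> terms from the vocabulary.
ANNOTATIONS = {
    "protA": {"plasma membrane"},
    "protB": {"nucleus"},
    "protC": {"membrane"},
}


def ancestors(term):
    """All terms reachable from `term` by following "is a" links upwards."""
    found = set()
    stack = list(IS_A.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(IS_A.get(parent, []))
    return found


def annotated_to(term):
    """Gene products annotated to `term` or to any of its descendants."""
    return sorted(
        product
        for product, terms in ANNOTATIONS.items()
        if any(t == term or term in ancestors(t) for t in terms)
    )


print(annotated_to("membrane"))
print(annotated_to("cellular component"))
```

A query on a general term also returns products annotated with a more specific term, which free-text annotations cannot guarantee.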

Gene ontology: processes

Gene ontology: molecular functions

Gene ontology: cellular components


Gene Ontology Database http://www.geneontology.org/

Gene Ontology Database (http://www.geneontology.org/)

Example: methionine biosynthetic process

Status of GO annotations (NAR DB issue 2006)

- Term definitions
  - Biological process terms: 9,805
  - Molecular function terms: 7,076
  - Cellular component terms: 1,574
  - Sequence Ontology terms: 963
- Genomes with annotation: 30
  - Excludes annotations from UniProt, which represent 261 annotated proteomes.
- Annotated gene products
  - Total: 1,618,739
  - Electronic only: 1,460,632
  - Manually curated: 158,107

QuickGO (http://www.ebi.ac.uk/QuickGO/)

- Web site: http://www.ebi.ac.uk/QuickGO/
- A user-friendly Web interface to the Gene Ontology.
- Graphical display of the hierarchical relationships between terms.
- Convenient browsing between classes.

Remarks on "bio-ontologies"

- Improvement compared to free text
  - controlled vocabulary (choice among synonyms)
  - hierarchical relationships between the concepts
- Nothing to do with the philosophical concept of ontology
  - A "bio-ontology" is usually nothing more than a taxonomical classification of the terms of a controlled vocabulary
- Multiple possibilities of classification criteria
  - e.g. compartment subtypes (plasma membrane is a membrane)
  - e.g. compartment locations (nucleus is inside cytoplasm, which is inside plasma membrane)
- To be useful, should remain purpose-based
  - each biologist might wish to define his/her own classification based on his/her needs and scope of interest
  - impossible to define a unifying standard for all biologists
- No representation of molecular interactions
  - relationships between objects are only hierarchical, not horizontal or cyclic
  - e.g. does not describe which genes are the targets of a given transcription factor

What is biological function ?

- A general definition
  - Fonction: action, rôle caractéristique d'un élément, d'un organe, dans un ensemble (souvent opposé à structure). Source: Le Petit Robert - dictionnaire alphabétique et analogique de la langue française. 1982.
  - Function: characteristic action (role) of an element (organ) within a set (often opposed to structure)
- Function and gene ontology
  - Understanding the function requires establishing the link between the molecular activity and the context in which it takes place (process).
  - Multifunctionality
    - The same activity can play different roles in different processes.
      - Example: the scute gene in Drosophila melanogaster encodes a transcription factor (activity) involved in sex determination, determination of neural precursors and malpighian tubules (3 processes).
    - Multiple activities of the same protein in a given process
      - Example: PutA in Escherichia coli contains 2 enzymatic domains (enzymatic activities) + a DNA-binding domain (DNA-binding transcription factor) -> 3 molecular activities in the same process (proline utilization).


Small compounds, reactions and metabolic pathways


LIGAND - Small compounds and metabolic reactions

KEGG - Kyoto Encyclopedia of Genes and Genomes

EcoCyc, BioCyc and MetaCyc - Metabolic pathways

Protein interaction networks and transduction pathways


Microarray databases


Human genome resources


HapMap http://www.hapmap.org/

- The International HapMap Project is a multi-country effort to identify and catalog genetic similarities and differences in human beings.
- Associations between genetic variations (SNPs, ...) and diseases + response to pharmaceuticals.

Issues for biomolecular databases


Issues for biological databases

- Dealing with biological complexity
- Data content
  - Coverage
  - Information content
- Data quality
  - Data structure
  - Consistency
- Query capabilities
- Interfaces
  - User interfaces
  - Programmatic interfaces
- Annotation
- Funding

Towards biological complexity

- The main databases currently available each focus on one type of molecular entity: nucleic sequences, proteins, compounds, ...
- This type of organization is very convenient as long as the information to be represented is simple (e.g. DNA sequences, structures of small molecules and macromolecules).
- It becomes more difficult if we want to represent
  - the interactions between biological objects,
  - the integration of various elements in a biological process (metabolic pathways, protein interaction networks, regulatory networks, ...)
  - complex concepts such as "biological function"

Data content

- Scope of the database
  - types of biological objects represented
- Number of entries
  - coverage of the current knowledge
- Information content
  - Level of detail in the description of the biological objects
- References to the source of information


Data quality

- Data consistency
  - always use the same name to indicate the same object (this seems trivial, but it is unfortunately still not always the case)
  - even better: define an ID for each object, and allow it to be retrieved by any of its synonyms
  - spelling mistakes
- Data structure
  - distinct fields for distinct attributes of the biological objects
- Reliability
  - Evidence? Level of confidence?
  - Assignment of function by similarity
    - recursive process -> propagation of errors
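The idea of one ID per object, retrievable by any of its synonyms, can be sketched as a lookup table from normalized names to identifiers. This is a minimal illustration: the identifier `OBJ:0001`, the class `ObjectRegistry`, and the normalization rule (ignore case, spaces and punctuation) are all assumptions, not the scheme of any particular database.

```python
def normalize(name):
    """Ignore case, spaces and punctuation so that 'PAX-6' and 'pax 6' agree."""
    return "".join(ch for ch in name.casefold() if ch.isalnum())


class ObjectRegistry:
    """Maps every known name of an object to a single identifier."""

    def __init__(self):
        self.primary_names = {}  # object ID -> preferred name
        self._id_by_name = {}    # normalized name -> object ID

    def add(self, object_id, primary_name, synonyms=()):
        self.primary_names[object_id] = primary_name
        for name in (object_id, primary_name, *synonyms):
            key = normalize(name)
            # A name already attached to another object is an inconsistency.
            existing = self._id_by_name.setdefault(key, object_id)
            if existing != object_id:
                raise ValueError(f"{name!r} already refers to {existing}")

    def resolve(self, name):
        """Return the object ID for any name or synonym, or None if unknown."""
        return self._id_by_name.get(normalize(name))


registry = ObjectRegistry()
# Hypothetical ID; the names are synonyms used for the Pax-6 protein.
registry.add("OBJ:0001", "Pax6", ["PAX-6", "paired box protein 6", "oculorhombin"])

print(registry.resolve("pax 6"))
print(registry.resolve("Oculorhombin"))
print(registry.resolve("unknown name"))
```

Rejecting a name that already points to a different object is one simple way to surface inconsistencies at submission time rather than at query time.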

Query capabilities

- Browsing (click and read)
- Simple search
  - select records with some constraints
- More elaborate search
  - select specific fields of some records with constraints on some fields (~SQL SELECT)
- Complex querying
  - ability to return an answer that results from a "live" computation, and was not part of any record of the database
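The three levels of search can be illustrated with an in-memory SQLite database. The table layout, identifiers and values below are made up for the example; only the kinds of query correspond to the levels listed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protein (id TEXT PRIMARY KEY, name TEXT, organism TEXT, length INTEGER)"
)
# Hypothetical records, for illustration only.
conn.executemany(
    "INSERT INTO protein VALUES (?, ?, ?, ?)",
    [
        ("X1", "protA", "Escherichia coli", 300),
        ("X2", "protB", "Escherichia coli", 500),
        ("X3", "protC", "Homo sapiens", 420),
    ],
)

# Simple search: whole records that satisfy a constraint.
simple = conn.execute(
    "SELECT * FROM protein WHERE organism = ?", ("Homo sapiens",)
).fetchall()

# More elaborate search: selected fields, with a constraint on another field.
fields = conn.execute(
    "SELECT name, length FROM protein WHERE length > ? ORDER BY name", (350,)
).fetchall()

# Complex query: the answer is computed on the fly and stored in no record.
computed = conn.execute(
    "SELECT organism, AVG(length) FROM protein GROUP BY organism ORDER BY organism"
).fetchall()

print(simple)
print(fields)
print(computed)
```

The average length per organism exists in none of the stored records, which is what distinguishes the last query from the first two.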

Interfaces

- User interfaces
  - user-friendly
  - convenient browsing
  - intuitive query forms
  - visualization (graphical output)
- Programmatic interfaces
  - communication with external programs:
    - other databases (concept of distributed database)
    - analysis tools

Annotation

- Problem
  - The flow of available data is increasing exponentially
- Strategies
  - internal curators
  - selected external experts
  - public submission
  - computer-based extraction of information from biological texts

Funding

- Public funding
  - Problem: it is easier to obtain public funds for creating a new database than for maintaining or expanding existing resources
- Private funding
  - Industrial companies are
    - ready to invest in good data and good query capabilities
    - interested in academic expertise
- Solutions
  - All users pay (per query, for example)
    - Note: academic users are in any case funded by public money
  - Hybrid solution
    - access is free for academic users, not for companies
    - companies can buy the whole database and install it in-house (+ add their own private data)
    - the academia-industry interface is often ensured by a spin-off company