Kahraman Weiss 2004

    BIOINFORMATICS APPLICATIONS NOTE

    Vol. 21 no. 3 2005, pages 418-420

    doi:10.1093/bioinformatics/bti010

    PhenomicDB: a multi-species genotype/phenotype database for comparative phenomics

    Abdullah Kahraman1, Andrey Avramov2, Lyubomir G. Nashev2, Dimitar Popov2, Rainer Ternes3, Hans-Dieter Pohlenz3 and Bertram Weiss3,*

    1Department of Bioinformatics, University of Applied Science Giessen, 35596 Giessen, Germany, 2Metalife AG, Im Metapark 1, 79297 Winden, Germany and 3Research Laboratories, Schering AG, 13342 Berlin, Germany

    Received on June 25, 2004; revised on August 11, 2004; accepted on August 30, 2004

    Advance Access publication September 16, 2004

    ABSTRACT

    Summary: We have created PhenomicDB, a multi-species genotype/phenotype database, by merging public genotype/phenotype data from a wide range of model organisms and Homo sapiens. Until now these data were available in distinct organism-specific databases (e.g. WormBase, OMIM, FlyBase and MGI). We compiled this wealth of data into a single integrated resource by coarse-grained semantic mapping of the phenotypic data fields, by including common gene indices (NCBI Gene), and by the use of associated orthology relationships. With its use-case-oriented user interface, PhenomicDB allows scientists to compare and browse known phenotypes for a given gene or a set of genes from different organisms simultaneously.

    Availability: PhenomicDB has been implemented at Schering AG as described below. A PhenomicDB implementation differing in some technical details has been set up for the public at Metalife AG: http://www.phenomicDB.de

    Contact: [email protected]

    Supplementary information: database model, semantic mapping table.

    *To whom correspondence should be addressed.

    MOTIVATION AND CONCEPT

    More and more phenotypic data are being generated for both model and non-model organisms. New technologies such as RNAi now make genome-wide knock-down studies feasible and have already been applied in a high-throughput manner, for instance, to Homo sapiens (Berns et al., 2004; Fraser, 2004; Paddison et al., 2004). Valuable resources for phenotypic data are already available, but only for a given organism, for example, OMIM (Hamosh et al., 2002), WormBase (Harris et al., 2004), FlyBase (The FlyBase Consortium, 2003) and MGD (Blake et al., 2003). Scientists have realized that there is an additional need to make phenotypic data from different organisms simultaneously searchable, visible and, most importantly, comparable (Lussier and Li, 2004).

    Currently, research scientists looking for genes involved in a given disease have to search different phenotype databases. They need to figure out manually the orthology relationships among all genes concerned in order to understand the different genotypic effects on the phenotype of a certain gene in different organisms. These species-specific databases are scattered over the Internet and tailored to different objectives, and they store phenotypic data in different formats. Tedious manual work is therefore necessary to compare the phenotype of a gene in different organisms. A simple meta-search engine for these databases alone does not resolve this kind of problem, and this is exactly the functionality we were aiming to develop.

    Currently, the different source databases all use different gene loci description systems (i.e. gene indices) and the orthology relationships are not always obvious, so that many important phenotypic relationships may be difficult to discover. As others (Claustres et al., 2002; Lussier and Li, 2004) have already stated, a common data model combining the data with a common gene index is required. Orthology data must be available and a use-case-oriented user interface should facilitate access to phenotypic data. Most data are available, but to the best of our knowledge, an integrative system, as described here, is not yet available.

    To remedy this situation, we set out to gather phenotype and genotype data from the different public resources and to map the data semantically into a single data model. To allow for direct comparison of phenotypes of orthologous genes from yeast to humans, we also uploaded these mapped data together with a gene index-like database [NCBI Gene (Pruitt and Maglott, 2001)] and the associated orthology data [HomoloGene database (Wheeler et al., 2004)].

    Bioinformatics vol. 21 issue 3 © Oxford University Press 2004; all rights reserved.

    PhenomicDB is intended as a first step towards comparative phenomics and will improve our understanding of gene function by combining the knowledge about phenotypes from several organisms. PhenomicDB has to compromise between data depth as available in the source databases and data compatibility. It is not intended to compete with the much more dedicated primary source databases but tries to compensate for its partial loss of depth by linking back to the primary sources. The basic functional concept of PhenomicDB is an integrated meta-search engine for phenotypes.

    Users should be aware that comparing genotypes or even phenotypes between organisms as different as yeast and humans may involve serious scientific hurdles. Nevertheless, finding, for instance, that the phenotype of a given mouse gene is described as similar to psoriasis and at the same time that the human orthologue has been described as a gene linked to skin defects can lead to novel and interesting hypotheses. Similarly, a gene involved in cancer in mammalian organisms could show a proliferation phenotype in a lower organism such as yeast, and this knowledge may lead to further insights.

    IMPLEMENTATION

    We implemented scripts to download phenotype/genotype data from public databases for Mus musculus (MGD), H. sapiens (OMIM), Drosophila melanogaster (FlyBase), Caenorhabditis elegans and Caenorhabditis briggsae (WormBase), Arabidopsis thaliana (MAtDB) (Schoof et al., 2004) and Saccharomyces cerevisiae (CYGD) (Mewes et al., 2004). In addition, NCBI Gene and HomoloGene were downloaded. Whenever possible, the given source genotypes (here meant as the equivalent of gene loci) were mapped to an NCBI Gene entry.

    We performed coarse-grained semantic field mapping to bring the very heterogeneous phenotype data from different organisms into a common data model. Fields in the different source databases with the same content-type (e.g. containing the phenotype description) were identified and, irrespective of their original name there, uploaded into the corresponding PhenomicDB data field (e.g. phenotypedescription). Details of all semantic mappings and how we connected the data are shown in the semantic mapping table and the database schemata, both available as Supplementary Information.
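
    The field mapping described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the source names (`source_a`, `source_b`) and all source field names are invented for illustration, and the actual mappings are given only in the Supplementary Information. Only the target field names (phenotypedescription, symbol, organism) are taken from the text.

```python
from typing import Dict

# Source field -> common data field. All source names and source field
# names below are hypothetical; the real mapping table is supplementary.
FIELD_MAP: Dict[str, Dict[str, str]] = {
    "source_a": {"pheno_text": "phenotypedescription", "gene_sym": "symbol"},
    "source_b": {
        "remark": "phenotypedescription",
        "locus": "symbol",
        "species": "organism",
    },
}


def map_record(source: str, record: Dict[str, str]) -> Dict[str, str]:
    """Rename the fields of one source record into the common data model.

    Fields without a mapping are dropped (the coarse-grained loss of
    depth); the source name is kept so a link back remains possible.
    """
    mapping = FIELD_MAP[source]
    common = {
        target: record[field]
        for field, target in mapping.items()
        if field in record
    }
    common["source"] = source
    return common
```

    In this sketch, two records from different sources that both carry a free-text phenotype description end up with the same key, which is what makes them searchable and comparable side by side.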

    PhenomicDB was designed as a normalized relational Oracle v. 8.1.7.4 database. The database schema comprises three parts: common data, genotype data and phenotype data. The common part is used for information shared between genotypes as well as phenotypes, e.g. name, symbol, organism or literature references. The genotype-specific part contains specific genotype information (e.g. gene description, chromosomal location, gene ontology, etc.). Each genotype entry can relate to an NCBI Gene identifier in order to bin identical genes that are represented by, for example, different transcript identifiers in the source databases. The use of NCBI Gene identifiers is a prerequisite for making use of the orthology relationships uploaded from the HomoloGene database. These relationships are captured as pairs of orthologous NCBI Gene identifiers (as determined by HomoloGene). The phenotype-specific part stores the free-text phenotype descriptions, and the data that describe the underlying experiments (e.g. mutagenesis, RNAi and k.o. mice). Associated phenotype keywords or catalogue terms are stored as well. The genotype and phenotype parts are treated separately in our database. The connection between the two parts is implemented by genotype and phenotype id-mapping. Owing to the nature of the Oracle RDBMS, all knowledge-containing fields are searchable. For performance reasons, those search functions which are accessible through the user interface are supported by interMedia text indices (Oracle).
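
    As a rough illustration of the three-part layout, the following Python sketch builds a heavily simplified schema in an in-memory SQLite database (not Oracle, and without text indices). All table names, column names, identifiers and sample rows are assumptions made for this example; the published schema is available only as Supplementary Information.

```python
import sqlite3

# Hypothetical, simplified schema: names are assumptions, not the
# published PhenomicDB schema.
SCHEMA = """
CREATE TABLE entry (               -- common part
    entry_id INTEGER PRIMARY KEY,
    name TEXT, symbol TEXT, organism TEXT
);
CREATE TABLE genotype (            -- genotype-specific part
    entry_id INTEGER PRIMARY KEY REFERENCES entry(entry_id),
    chromosomal_location TEXT,
    ncbi_gene_id INTEGER           -- bins source identifiers of one gene
);
CREATE TABLE phenotype (           -- phenotype-specific part
    entry_id INTEGER PRIMARY KEY REFERENCES entry(entry_id),
    description TEXT,
    experiment TEXT
);
CREATE TABLE genotype_phenotype (  -- id-mapping between the two parts
    genotype_id INTEGER REFERENCES genotype(entry_id),
    phenotype_id INTEGER REFERENCES phenotype(entry_id)
);
CREATE TABLE orthology (           -- pairs of orthologous gene identifiers
    gene_id_a INTEGER, gene_id_b INTEGER
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create the schema and load a few invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO entry VALUES (?, ?, ?, ?)", [
        (1, "transcript 1 of gene X", "geneX", "Mus musculus"),
        (2, "transcript 2 of gene X", "geneX", "Mus musculus"),
        (3, "phenotype of transcript 1", None, "Mus musculus"),
        (4, "phenotype of transcript 2", None, "Mus musculus"),
    ])
    # Two source entries share one (made-up) gene identifier, 1001.
    conn.executemany("INSERT INTO genotype VALUES (?, ?, ?)",
                     [(1, "chr1", 1001), (2, "chr1", 1001)])
    conn.executemany("INSERT INTO phenotype VALUES (?, ?, ?)",
                     [(3, "reduced viability", "RNAi"),
                      (4, "skin defects", "k.o. mouse")])
    conn.executemany("INSERT INTO genotype_phenotype VALUES (?, ?)",
                     [(1, 3), (2, 4)])
    conn.commit()
    return conn


def phenotypes_for_gene(conn: sqlite3.Connection, ncbi_gene_id: int) -> list:
    """Collect phenotype descriptions across all genotype entries
    binned under one gene identifier, via the id-mapping table."""
    rows = conn.execute(
        """
        SELECT p.description
        FROM genotype g
        JOIN genotype_phenotype gp ON gp.genotype_id = g.entry_id
        JOIN phenotype p ON p.entry_id = gp.phenotype_id
        WHERE g.ncbi_gene_id = ?
        ORDER BY p.entry_id
        """,
        (ncbi_gene_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

    Calling `phenotypes_for_gene(build_demo_db(), 1001)` gathers the phenotypes of both transcript-level entries, which is the binning effect the shared gene identifier is meant to provide.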

    The user is presented with a search interface that allows free-text search with the ability to restrict the search to certain fields (e.g. gene name, organism, etc.) or to show only genotypes that have an associated phenotype, or vice versa. The result is a list of genotypes with their associated phenotypes. From the result list, the user can select genotypes/phenotypes of further interest and expand these with their orthologues and the phenotypes of their orthologues. This allows direct and simultaneous comparison of all available phenotypes for a certain gene in all available organisms. Each result is hyperlinked to a dedicated genotype/phenotype report and directly to the source database. Genotype/phenotype reports contain the associated data about the genotype (usually a gene) and the associated phenotype(s) including phenotype experiments.
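
    The search-then-expand workflow can be sketched in a few lines of Python. This is a toy in-memory model under stated assumptions, not the PhenomicDB interface: the gene identifiers, symbols and phenotype strings are invented, and `search`, `orthologues` and `expand` are hypothetical helper names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Genotype:
    gene_id: int
    symbol: str
    organism: str
    phenotypes: List[str] = field(default_factory=list)


# Invented toy data; identifiers and descriptions are not real records.
GENOTYPES = [
    Genotype(1, "geneA", "Mus musculus", ["skin lesions similar to psoriasis"]),
    Genotype(2, "GENEA", "Homo sapiens", ["skin defects"]),
    Genotype(3, "geneB", "Saccharomyces cerevisiae", []),
]
# Orthology stored as unordered pairs of gene identifiers.
ORTHOLOGY_PAIRS: Set[Tuple[int, int]] = {(1, 2)}


def search(term: str, search_field: Optional[str] = None,
           require_phenotype: bool = False) -> List[Genotype]:
    """Case-insensitive substring search, optionally restricted to one
    field ('symbol', 'organism' or 'phenotype')."""
    term = term.lower()
    hits = []
    for g in GENOTYPES:
        if require_phenotype and not g.phenotypes:
            continue
        values = {"symbol": [g.symbol], "organism": [g.organism],
                  "phenotype": g.phenotypes}
        if search_field:
            haystack = values[search_field]
        else:
            haystack = [v for vs in values.values() for v in vs]
        if any(term in v.lower() for v in haystack):
            hits.append(g)
    return hits


def orthologues(gene_id: int) -> Set[int]:
    """Return the identifiers paired with gene_id in either position."""
    ids = set()
    for a, b in ORTHOLOGY_PAIRS:
        if a == gene_id:
            ids.add(b)
        elif b == gene_id:
            ids.add(a)
    return ids


def expand(hits: List[Genotype]) -> Dict[int, List[Genotype]]:
    """For each hit, list the hit itself followed by its orthologues."""
    by_id = {g.gene_id: g for g in GENOTYPES}
    return {
        g.gene_id: [g] + [by_id[i] for i in sorted(orthologues(g.gene_id))
                          if i in by_id]
        for g in hits
    }
```

    Searching the phenotype field for "psoriasis" and expanding the single mouse hit places the human orthologue and its phenotype next to it, mirroring the cross-species comparison described above.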

    DISCUSSION AND OUTLOOK

    Data-integrative approaches over a range of organisms decrease conceptual accuracy or eliminate data: a genotype in our database can mean a gene, a mutated gene or a chromosomal region. Phenotype descriptions range from the mere mention of non-viability in yeast to the detailed characterization of a knockout mouse including all the experimental details. However, only these general concepts allow for data integration. The detailed and extensive description of the data or dedicated mining tools, e.g. PhenoBlast (Gunsalus et al., 2004), should stay within the realm of the organism-specific source databases. Others (Lussier and Li, 2004) have started with the integration of phenotypic notation and terminology over several species or have proposed common semantics for genome-wide phenotype databases (Claustres et al., 2002). They have also discussed in more detail the associated difficulties of integrating phenotypic terminology that differs significantly between each organism-specific research community. We adopted a practical approach clearly intended to allow for high-level data integration and easy integration of new upcoming data. The data content will be updated every 8 weeks. PhenomicDB now has to prove that the method of integration applied here can add value to the scientific exploitation of phenome data.

    Most of the valuable phenotypic data reside in the public literature not captured in databases. Effective text mining is needed to gather these data as well. A prerequisite for text mining, however, is the availability of specified thesauri, catalogues and validated terms. Those are not yet available for phenotypic data (Gunsalus et al., 2004). First steps are underway (Lussier and Li, 2004) and PhenomicDB could be used as a resource to extract such phenotype-specific vocabulary. We have started compiling thesauri from PhenomicDB to use them for the extraction of phenotypic data from literature by text mining.

    ACKNOWLEDGEMENTS

    We thank Dr Bernard Haendler (Schering AG) for useful discussion of the manuscript, and Dr Stephan Brock (Metalife) and Prof. Michael Schoenemann (Metalife) for their continuous support.

    REFERENCES

    Berns,K., Hijmans,E.M., Mullenders,J., Brummelkamp,T.R., Velds,A., Heimerikx,M., Kerkhoven,R.M., Madiredjo,M., Nijkamp,W., Weigelt,B. et al. (2004) A large-scale RNAi screen in human cells identifies new components of the p53 pathway. Nature, 428, 431-437.

    Blake,J.A., Richardson,J.E., Bult,C.J., Kadin,J.A. and Eppig,J.T. (2003) MGD: the Mouse Genome Database. Nucleic Acids Res., 31, 193-195.

    Claustres,M., Horaitis,O., Vanevski,M. and Cotton,R.G. (2002) Time for a unified system of mutation description and reporting: a review of locus-specific mutation databases. Genome Res., 12, 680-688.

    The FlyBase Consortium (2003) The FlyBase database of the Drosophila genome projects and community literature. Nucleic Acids Res., 31, 172-175.

    Fraser,A. (2004) RNA interference: human genes hit the big screen. Nature, 428, 375-378.

    Gunsalus,K.C., Yueh,W.C., MacMenamin,P. and Piano,F. (2004) RNAiDB and PhenoBlast: web tools for genome-wide phenotypic mapping projects. Nucleic Acids Res., 32 (Database issue), D406-D410.

    Hamosh,A., Scott,A.F., Amberger,J., Bocchini,C., Valle,D. and McKusick,V.A. (2002) Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res., 30, 52-55.

    Harris,T.W., Chen,N., Cunningham,F., Tello-Ruiz,M., Antoshechkin,I., Bastiani,C., Bieri,T., Blasiar,D., Bradnam,K., Chan,J. et al. (2004) WormBase: a multi-species resource for nematode biology and genomics. Nucleic Acids Res., 32 (Database issue), D411-D417.

    Lussier,Y.A. and Li,J. (2004) Terminological mapping for high throughput comparative biology of phenotypes. Pac. Symp. Biocomput., 202-213.

    Mewes,H.W., Amid,C., Arnold,R., Frishman,D., Guldener,U., Mannhaupt,G., Munsterkotter,M., Pagel,P., Strack,N., Stumpflen,V., Warfsmann,J. and Ruepp,A. (2004) MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res., 32 (Database issue), D41-D44.

    Paddison,P.J., Silva,J.M., Conklin,D.S., Schlabach,M., Li,M., Aruleba,S., Balija,V., O'Shaughnessy,A., Gnoj,L., Scobie,K. et al. (2004) A resource for large-scale RNA-interference-based screens in mammals. Nature, 428, 427-431.

    Pruitt,K.D. and Maglott,D.R. (2001) RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res., 29, 137-140.

    Schoof,H., Ernst,R., Nazarov,V., Pfeifer,L., Mewes,H.W. and Mayer,K.F. (2004) MIPS Arabidopsis thaliana Database (MAtDB): an integrated biological knowledge resource for plant genomics. Nucleic Acids Res., 32 (Database issue), D373-D376.

    Wheeler,D.L., Church,D.M., Edgar,R., Federhen,S., Helmberg,W., Madden,T.L., Pontius,J.U., Schuler,L.M., Schriml,G.D., Sequeira,E. et al. (2004) Database resources of the National Center for Biotechnology Information: update. Nucleic Acids Res., 32 (Database issue), D35-D40.
