Upload
abdullahkahraman
View
220
Download
0
Embed Size (px)
Citation preview
8/14/2019 Kahraman Weiss 2004
1/3
BIOINFORMATICS APPLICATIONS NOTEVol. 21 no. 3 2005, pages 418420
doi:10.1093/bioinformatics/bti010
PhenomicDB: a multi-species
genotype/phenotype database for
comparative phenomics
Abdullah Kahraman1, Andrey Avramov2, Lyubomir G. Nashev2,Dimitar Popov2, Rainer Ternes3 , Hans-Dieter Pohlenz3
and Bertram Weiss3,
1Department of Bioinformatics, University of Applied Science Giessen,
35596 Giessen, Germany, 2Metalife AG, Im Metapark 1, 79297 Winden, Germany
and3Research Laboratories, Schering AG, 13342 Berlin, Germany
Received on June 25, 2004; revised on August 11, 2004; accepted on August 30, 2004
Advance Access publication September 16, 2004
ABSTRACT
Summary: We have created PhenomicDB, a multi-species
genotype/phenotype database by merging public genotype/
phenotype data from a wide range of model organisms and
Homo sapiens. Until now these data were available in dis-
tinct organism-specific databases (e.g. WormBase, OMIM,
FlyBase and MGI). We compiled this wealth of data into a
single integrated resource by coarse-grained semantic map-
ping of the phenotypic data fields, by including common gene
indices (NCBI Gene), and by the use of associated ortho-
logy relationships. With its use-case-oriented user interface,PhenomicDB allows scientists to compare and browse known
phenotypes for a given gene or a set of genes from different
organisms simultaneously.
Availability: PhenomicDB has been implemented at Schering
AG as described below. A PhenomicDB implementation differ-
ing in some technical details has been set up for the public at
Metalife AG http://www.phenomicDB.de
Contact: [email protected]
Supplementary information: database model, semantic
mapping table.
MOTIVATION AND CONCEPTMore and more phenotypic data are being generated for both
model and non-model organisms. New technologies such as
RNAi now make genome-wide knock-down studies feasible
and have already been applied in a high-throughput manner,
for instance, to Homo sapiens (Berns et al., 2004; Fraser,
2004; Paddison et al., 2004). Valuable resources for pheno-
typic data are already available, but only for a given organism,
for example, OMIM (Hamosh et al., 2002), WormBase
(Harris etal., 2004), FlyBase (The FlyBase Consortium 2003)
and MGD (Blake et al., 2003). Scientists have realized that
To whom correspondence should be addressed.
there is an additional need to make phenotypic data from
different organisms simultaneously searchable, visible and,
most importantly, comparable (Lussier and Li, 2004).
Currently, research scientists looking for genes involved in
a given disease have to search different phenotype databases.
They need to figure out manually the orthology relationships
among all genes concerned in order to understand the different
genotypic effects on the phenotype of a certain gene in differ-
ent organisms. These species-specific databases are scattered
over the Internet and tailored to different objectives, and they
store phenotypic data in different formats. Tedious handworkis therefore necessary to compare the phenotype of a gene in
different organisms. A simple meta-search engine for these
databases alone does not resolve this kind of problem, and
this is exactly the functionality we were aiming to develop.
Currently, the different source databases all use differ-
ent gene loci description systems (i.e. gene indices) and
the orthology relationships are not always obvious, so that
many important phenotypic relationships may be difficult
to discover. As others (Claustres et al., 2002; Lussier and
Li, 2004) have already stated, a common data model com-
bining the data with a common gene index is required.
Orthology data must be available and an use-case-orienteduser interface should facilitate access to phenotypic data.
Most data are available, but to the best of our know-
ledge, an integrative system, as described here, is not yet
available.
In order to remedy to this situation, we set out to gather
phenotype and genotype data from the different public
resources and to map the data semantically into a single data
model. To allow for directcomparison of phenotypes of ortho-
logous genes from yeast to humans, we also uploaded these
mapped data together with a gene index-like database [NCBI
Gene (Pruitt et al., 2001)] and the associated orthology data
[HomoloGene database (Wheeler et al., 2004)].
418 Bioinformatics vol. 21 issue 3 Oxford University Press 2004; all rights reserved.
http://www.phenomicdb.de/http://www.phenomicdb.de/8/14/2019 Kahraman Weiss 2004
2/3
Multi-species genotype/phenotype database
PhenomicDB is thought as a first step towards comparative
phenomics and will improve our understanding of gene
function by combining the knowledge about phenotypes from
several organisms. PhenomicDB has to compromise between
data depth as available in the source databases and data com-patibility. It is not intended to compete with the much more
dedicated primary source databases but tries to compensate its
partial loss of depth by linking back to the primary sources.
The basic functional concept of PhenomicDB is an integrated
meta-search engine for phenotypes.
Users should be aware that comparing genotypes or even
phenotypes between organisms as different as yeast and
humans may involve serious scientific hurdles. Nevertheless,
finding, for instance, that thephenotype of a given mouse gene
is described as similar to psoriasis and at the same time that
the human orthologue has been described as a gene linked
to skin defects can lead to novel and interesting hypotheses.Similarly, a gene involved in cancer in mammalian organisms
could show a proliferation phenotypein a lowerorganism such
as yeast, and this knowledge may lead to further insights.
IMPLEMENTATION
We implemented scripts to download phenotype/genotype
data from public databases for Mus musculus (MGD),
H.sapiens (OMIM), Drosophila melanogaster (FlyBase),
Caenorhabditis elegans and Caenorhabditis briggsae
(WormBase), Arabidopsis thaliana (MAtDB) (Schoof et al.,
2004) and Saccharomyces cerevisae (CYGD) (Mewes et al.,
2004). In addition, NCBI Gene and HomoloGene were down-loaded. Whenever possible, the given source genotypes (here
meant as equivalent of gene loci) were mapped to a NCBI
Gene entry.
We performed coarse-grained semantic field mapping to
bring the very heterogeneous phenotype data from different
organisms into a common data model. Fields in the different
source databases with the same content-type (e.g. containing
the phenotype description) were identified and, irrespective
of their original name there, uploaded in the corresponding
PhenomicDB data field (e.g. phenotypedescription). Details
of all semantic mappings and how we connected the data
are shown in the semantic mapping table and the databaseschemata both available as Supplementary Information.
PhenomicDB was designed as a normalized relational
Oracle v. 8.1.7.4 database. The database scheme comprises
three parts: common data, genotype data and phenotype data.
The common part is used for information shared between
genotypesas well as phenotypes, e.g. name, symbol, organism
or literature references. The genotype-specific part contains
specific genotype information (e.g. gene description, chro-
mosomal location, gene ontology, etc.). Each genotype entry
can relate to a NCBI Gene identifier in order to bin identical
genes that are represented by, for example, different transcript
identifiers in the source databases. The use of NCBI Gene
identifiers is a prerequisite for making use of the orthology
relationships uploaded from the HomoloGene database. They
are captured as pairs of orthologous NCBI Gene identifiers
(as determined by HomoloGene). The phenotype-specific part
stores the free-text phenotype descriptions, and the data thatdescribe the underlying experiments (e.g. mutagenesis, RNAi
and k.o. mice). Associated phenotype keywords or catalogue
terms arestored as well. The genotype andphenotype parts are
treated separately in our database. The connection between
the two parts is implemented by genotype and phenotype
id-mapping. Owing to the nature of the Oracle RDBMS, all
knowledge containing fields are searchable. For performance
reasons, those search functions which are accessible through
the user interface are supported by interMedia text indices
(Oracle).
User is presentedwith a search interfacethat allows free text
search with the ability to restrict the search to certain fields(e.g. gene name, organism, etc.) or to show only genotypes,
if they have a phenotype associated or vice versa. The res-
ult is a list of genotypes with their associated phenotypes.
From the result list, user can select genotypes/phenotypes
of further interest and expand these with their orthologues
and the phenotypes of their orthologues. This allows dir-
ect and simultaneous comparison of all available phenotypes
for a certain gene in all available organisms. Each result is
hyperlinked to a dedicated genotype/phenotype report and
directly to the source database. Genotype/phenotype reports
contain the associated data about the genotype (usually a
gene) and the associated phenotype(s) including phenotype
experiments.
DISCUSSION AND OUTLOOK
Data-integrative approaches over a range of organisms
decrease conceptual accuracy or eliminate data: a genotype
in our database can mean a gene, a mutated gene or a chro-
mosomal region. Phenotype description ranges from the mere
mention of non-viability in yeast to the detailed character-
ization of a knockout mouse including all the experimental
details. However, only these general concepts allow for data
integration. The detailed and extensive description of the data
or dedicated mining tools, e.g. PhenoBlast (Gunsalus et al.,2004) should stay within the realm of the organism-specific
source databases. Others (Lussier and Li, 2004) have started
with the integration of phenotypic notation and terminology
over several species or have proposed common semantics for
genome-wide phenotype databases (Claustres et al., 2002).
They have also discussed in more detail the associated dif-
ficulties of integrating phenotypic terminology that differs
significantly between each organism-specific research com-
munity. We adopted a practical approach clearly intended to
allow for high-level data integration and easy integration of
new upcoming data. The data content will be updated every
8 weeks. PhenomicDB now has to prove that the method
419
8/14/2019 Kahraman Weiss 2004
3/3
A.Kahraman et al.
of integration applied here can add value to the scientific
exploitation of phenome data.
Most of the valuable phenotypic data reside in the public
literature not captured in databases. Effective text mining is
needed to gather these data as well. A prerequisite for textmining, however, is the availability of specified thesauri, cata-
logues and validated terms. Those are not yet available for
phenotypic data (Gunsalus et al., 2004). First steps are under-
way (Lussier and Li, 2004) and PhenomicDB could be used
as a resource to extract such phenotype-specific vocabulary.
We have started compiling thesauri from PhenomicDB to use
them for the extraction of phenotypic data from literature by
text mining.
ACKNOWLEDGEMENTS
We thank Dr Bernard Haendler (Schering AG) for useful
discussion of the manuscript, and Dr Stephan Brock(Metalife) and Prof. Michael Schoenemann (Metalife) for
their continuous support.
REFERENCES
Berns,K., Hijmans,E.M., Mullenders,J., Brummelkamp,T.R.,
Velds,A., Heimerikx,M., Kerkhoven,R.M., Madiredjo,M.,
Nijkamp,W., Weigelt,B. et al. (2004) A large-scale RNAi screen
in human cells identifies new components of the p53 pathway.
Nature, 428, 431437.
Blake,J.A., Richardson,J.E., Bult,C.J., Kadin,J.A. and Eppig,J.T.
(2003) MGD: the Mouse Genome Database. Nucleic Acids Res.,
31, 193195.
Claustres,M., Horaitis,O., Vanevski,M. and Cotton,R.G. (2002)Time for a unified system of mutation description and reporting:
a review of locus-specific mutation databases. Genome Res., 12,
680688.
The FlyBase Consortium (2003) The FlyBase database of the
Drosophila genome projects and community literature. Nucleic
Acids Res., 31, 172175.
Fraser,A. (2004) RNA interference: human genes hit the big screen.
Nature, 428, 375378.
Gunsalus,K.C., Yueh,W.C., MacMenamin,P. and Piano,F. (2004)
RNAiDB and PhenoBlast: web tools for genome-wide pheno-
typic mapping projects. Nucleic Acids Res.,32
(Database issue),D406D410.
Hamosh,A., Scott,A.F., Amberger,J., Bocchini,C., Valle,D. and
McKusick,V.A., (2002) Online Mendelian Inheritance in Man
(OMIM), a knowledgebase of human genes and genetic disorders.
Nucleic Acids Res., 30, 5255.
Harris,T.W., Chen,N., Cunningham,F., Tello-Ruiz,M.,
Antoshechkin,I., Bastiani,C., Bieri,T., Blasiar,D., Bradnam,K.,
Chan,J. et al. (2004) WormBase: a multi-species resource
for nematode biology and genomics. Nucleic Acids Res., 32
(Database issue), D411D417.
Lussier,Y.A. and Li,J. (2004) Terminological mapping for high
throughput comparative biology of phenotypes. Pac. Symp.
Biocomput., 202213.
Mewes,H.W., Amid,C., Arnold,R., Frishman,D., Guldener,U.,
Mannhaupt,G., Munsterkotter,M., Pagel,P., Strack,N., Stump-
flen,V., Warfsmann,J. and Ruepp,A. (2004) MIPS: analysis and
annotation of proteins from whole genomes. Nucleic Acids Res.,
32 (Database issue), D41D44.
Paddison,P.J., Silva,J.M., Conklin,D.S., Schlabach,M., Li,M.,
Aruleba,S., Balija,V., OShaughnessy,A., Gnoj,L., Scobie,K.,
(2004) A resource for large-scale RNA-interference-based screens
in mammals. Nature, 428, 427431.
Pruitt,K.D. and Maglott,D.R. (2001) RefSeq and LocusLink: NCBI
gene-centered resources. Nucleic Acids Res, 29, 13740.
Schoof,H., Ernst,R., Nazarov,V., Pfeifer,L., Mewes,H.W. and
Mayer,K.F. (2004) MIPS Arabidopsis thaliana Database
(MAtDB): an integrated biological knowledge resource for plantgenomics. Nucleic Acids Res., 32 (Database issue), D373D376.
Wheeler,D.L., Church,D.M., Edgar,R., Federhen,S., Helmberg,W.,
Madden,T.L., Pontius,J.U., Schuler,L.M., Schriml,G.D.,
Sequeira,E. et al. (2004) Database resources of the National
Center for Biotechnology Information: update. Nucleic Acids
Res., 32 (Database issue), D35D40.
420