Kahraman Weiss 2004

8/14/2019 Kahraman Weiss 2004

1/3

BIOINFORMATICS APPLICATIONS NOTEVol. 21 no. 3 2005, pages 418420

doi:10.1093/bioinformatics/bti010

PhenomicDB: a multi-species

genotype/phenotype database for

comparative phenomics

Abdullah Kahraman1, Andrey Avramov2, Lyubomir G. Nashev2,Dimitar Popov2, Rainer Ternes3 , Hans-Dieter Pohlenz3

and Bertram Weiss3,

1Department of Bioinformatics, University of Applied Science Giessen,

35596 Giessen, Germany, 2Metalife AG, Im Metapark 1, 79297 Winden, Germany

and3Research Laboratories, Schering AG, 13342 Berlin, Germany

Received on June 25, 2004; revised on August 11, 2004; accepted on August 30, 2004

Advance Access publication September 16, 2004

ABSTRACT

Summary: We have created PhenomicDB, a multi-species

genotype/phenotype database by merging public genotype/

phenotype data from a wide range of model organisms and

Homo sapiens. Until now these data were available in dis-

tinct organism-specific databases (e.g. WormBase, OMIM,

FlyBase and MGI). We compiled this wealth of data into a

single integrated resource by coarse-grained semantic map-

ping of the phenotypic data fields, by including common gene

indices (NCBI Gene), and by the use of associated ortho-

logy relationships. With its use-case-oriented user interface,PhenomicDB allows scientists to compare and browse known

phenotypes for a given gene or a set of genes from different

organisms simultaneously.

Availability: PhenomicDB has been implemented at Schering

AG as described below. A PhenomicDB implementation differ-

ing in some technical details has been set up for the public at

Metalife AG http://www.phenomicDB.de

Contact: [email protected]

Supplementary information: database model, semantic

mapping table.

MOTIVATION AND CONCEPTMore and more phenotypic data are being generated for both

model and non-model organisms. New technologies such as

RNAi now make genome-wide knock-down studies feasible

and have already been applied in a high-throughput manner,

for instance, to Homo sapiens (Berns et al., 2004; Fraser,

2004; Paddison et al., 2004). Valuable resources for pheno-

typic data are already available, but only for a given organism,

for example, OMIM (Hamosh et al., 2002), WormBase

(Harris etal., 2004), FlyBase (The FlyBase Consortium 2003)

and MGD (Blake et al., 2003). Scientists have realized that

To whom correspondence should be addressed.

there is an additional need to make phenotypic data from

different organisms simultaneously searchable, visible and,

most importantly, comparable (Lussier and Li, 2004).

Currently, research scientists looking for genes involved in

a given disease have to search different phenotype databases.

They need to figure out manually the orthology relationships

among all genes concerned in order to understand the different

genotypic effects on the phenotype of a certain gene in differ-

ent organisms. These species-specific databases are scattered

over the Internet and tailored to different objectives, and they

store phenotypic data in different formats. Tedious handworkis therefore necessary to compare the phenotype of a gene in

different organisms. A simple meta-search engine for these

databases alone does not resolve this kind of problem, and

this is exactly the functionality we were aiming to develop.

Currently, the different source databases all use differ-

ent gene loci description systems (i.e. gene indices) and

the orthology relationships are not always obvious, so that

many important phenotypic relationships may be difficult

to discover. As others (Claustres et al., 2002; Lussier and

Li, 2004) have already stated, a common data model com-

bining the data with a common gene index is required.

Orthology data must be available and an use-case-orienteduser interface should facilitate access to phenotypic data.

Most data are available, but to the best of our know-

ledge, an integrative system, as described here, is not yet

available.

In order to remedy to this situation, we set out to gather

phenotype and genotype data from the different public

resources and to map the data semantically into a single data

model. To allow for directcomparison of phenotypes of ortho-

logous genes from yeast to humans, we also uploaded these

mapped data together with a gene index-like database [NCBI

Gene (Pruitt et al., 2001)] and the associated orthology data

[HomoloGene database (Wheeler et al., 2004)].

418 Bioinformatics vol. 21 issue 3 Oxford University Press 2004; all rights reserved.
http://www.phenomicdb.de/http://www.phenomicdb.de/


2/3

Multi-species genotype/phenotype database

PhenomicDB is thought as a first step towards comparative

phenomics and will improve our understanding of gene

function by combining the knowledge about phenotypes from

several organisms. PhenomicDB has to compromise between

data depth as available in the source databases and data com-patibility. It is not intended to compete with the much more

dedicated primary source databases but tries to compensate its

partial loss of depth by linking back to the primary sources.

The basic functional concept of PhenomicDB is an integrated

meta-search engine for phenotypes.

Users should be aware that comparing genotypes or even

phenotypes between organisms as different as yeast and

humans may involve serious scientific hurdles. Nevertheless,

finding, for instance, that thephenotype of a given mouse gene

is described as similar to psoriasis and at the same time that

the human orthologue has been described as a gene linked

to skin defects can lead to novel and interesting hypotheses.Similarly, a gene involved in cancer in mammalian organisms

could show a proliferation phenotypein a lowerorganism such

as yeast, and this knowledge may lead to further insights.

IMPLEMENTATION

We implemented scripts to download phenotype/genotype

data from public databases for Mus musculus (MGD),

H.sapiens (OMIM), Drosophila melanogaster (FlyBase),

Caenorhabditis elegans and Caenorhabditis briggsae

(WormBase), Arabidopsis thaliana (MAtDB) (Schoof et al.,

2004) and Saccharomyces cerevisae (CYGD) (Mewes et al.,

2004). In addition, NCBI Gene and HomoloGene were down-loaded. Whenever possible, the given source genotypes (here

meant as equivalent of gene loci) were mapped to a NCBI

Gene entry.

We performed coarse-grained semantic field mapping to

bring the very heterogeneous phenotype data from different

organisms into a common data model. Fields in the different

source databases with the same content-type (e.g. containing

the phenotype description) were identified and, irrespective

of their original name there, uploaded in the corresponding

PhenomicDB data field (e.g. phenotypedescription). Details

of all semantic mappings and how we connected the data

are shown in the semantic mapping table and the databaseschemata both available as Supplementary Information.

PhenomicDB was designed as a normalized relational

Oracle v. 8.1.7.4 database. The database scheme comprises

three parts: common data, genotype data and phenotype data.

The common part is used for information shared between

genotypesas well as phenotypes, e.g. name, symbol, organism

or literature references. The genotype-specific part contains

specific genotype information (e.g. gene description, chro-

mosomal location, gene ontology, etc.). Each genotype entry

can relate to a NCBI Gene identifier in order to bin identical

genes that are represented by, for example, different transcript

identifiers in the source databases. The use of NCBI Gene

identifiers is a prerequisite for making use of the orthology

relationships uploaded from the HomoloGene database. They

are captured as pairs of orthologous NCBI Gene identifiers

(as determined by HomoloGene). The phenotype-specific part

stores the free-text phenotype descriptions, and the data thatdescribe the underlying experiments (e.g. mutagenesis, RNAi

and k.o. mice). Associated phenotype keywords or catalogue

terms arestored as well. The genotype andphenotype parts are

treated separately in our database. The connection between

the two parts is implemented by genotype and phenotype

id-mapping. Owing to the nature of the Oracle RDBMS, all

knowledge containing fields are searchable. For performance

reasons, those search functions which are accessible through

the user interface are supported by interMedia text indices

(Oracle).

User is presentedwith a search interfacethat allows free text

search with the ability to restrict the search to certain fields(e.g. gene name, organism, etc.) or to show only genotypes,

if they have a phenotype associated or vice versa. The res-

ult is a list of genotypes with their associated phenotypes.

From the result list, user can select genotypes/phenotypes

of further interest and expand these with their orthologues

and the phenotypes of their orthologues. This allows dir-

ect and simultaneous comparison of all available phenotypes

for a certain gene in all available organisms. Each result is

hyperlinked to a dedicated genotype/phenotype report and

directly to the source database. Genotype/phenotype reports

contain the associated data about the genotype (usually a

gene) and the associated phenotype(s) including phenotype

experiments.

DISCUSSION AND OUTLOOK

Data-integrative approaches over a range of organisms

decrease conceptual accuracy or eliminate data: a genotype

in our database can mean a gene, a mutated gene or a chro-

mosomal region. Phenotype description ranges from the mere

mention of non-viability in yeast to the detailed character-

ization of a knockout mouse including all the experimental

details. However, only these general concepts allow for data

integration. The detailed and extensive description of the data

or dedicated mining tools, e.g. PhenoBlast (Gunsalus et al.,2004) should stay within the realm of the organism-specific

source databases. Others (Lussier and Li, 2004) have started

with the integration of phenotypic notation and terminology

over several species or have proposed common semantics for

genome-wide phenotype databases (Claustres et al., 2002).

They have also discussed in more detail the associated dif-

ficulties of integrating phenotypic terminology that differs

significantly between each organism-specific research com-

munity. We adopted a practical approach clearly intended to

allow for high-level data integration and easy integration of

new upcoming data. The data content will be updated every

8 weeks. PhenomicDB now has to prove that the method

419


3/3

A.Kahraman et al.

of integration applied here can add value to the scientific

exploitation of phenome data.

Most of the valuable phenotypic data reside in the public

literature not captured in databases. Effective text mining is

needed to gather these data as well. A prerequisite for textmining, however, is the availability of specified thesauri, cata-

logues and validated terms. Those are not yet available for

phenotypic data (Gunsalus et al., 2004). First steps are under-

way (Lussier and Li, 2004) and PhenomicDB could be used

as a resource to extract such phenotype-specific vocabulary.

We have started compiling thesauri from PhenomicDB to use

them for the extraction of phenotypic data from literature by

text mining.

ACKNOWLEDGEMENTS

We thank Dr Bernard Haendler (Schering AG) for useful

discussion of the manuscript, and Dr Stephan Brock(Metalife) and Prof. Michael Schoenemann (Metalife) for

their continuous support.

REFERENCES

Berns,K., Hijmans,E.M., Mullenders,J., Brummelkamp,T.R.,

Velds,A., Heimerikx,M., Kerkhoven,R.M., Madiredjo,M.,

Nijkamp,W., Weigelt,B. et al. (2004) A large-scale RNAi screen

in human cells identifies new components of the p53 pathway.

Nature, 428, 431437.

Blake,J.A., Richardson,J.E., Bult,C.J., Kadin,J.A. and Eppig,J.T.

(2003) MGD: the Mouse Genome Database. Nucleic Acids Res.,

31, 193195.

Claustres,M., Horaitis,O., Vanevski,M. and Cotton,R.G. (2002)Time for a unified system of mutation description and reporting:

a review of locus-specific mutation databases. Genome Res., 12,

680688.

The FlyBase Consortium (2003) The FlyBase database of the

Drosophila genome projects and community literature. Nucleic

Acids Res., 31, 172175.

Fraser,A. (2004) RNA interference: human genes hit the big screen.

Nature, 428, 375378.

Gunsalus,K.C., Yueh,W.C., MacMenamin,P. and Piano,F. (2004)

RNAiDB and PhenoBlast: web tools for genome-wide pheno-

typic mapping projects. Nucleic Acids Res.,32

(Database issue),D406D410.

Hamosh,A., Scott,A.F., Amberger,J., Bocchini,C., Valle,D. and

McKusick,V.A., (2002) Online Mendelian Inheritance in Man

(OMIM), a knowledgebase of human genes and genetic disorders.

Nucleic Acids Res., 30, 5255.

Harris,T.W., Chen,N., Cunningham,F., Tello-Ruiz,M.,

Antoshechkin,I., Bastiani,C., Bieri,T., Blasiar,D., Bradnam,K.,

Chan,J. et al. (2004) WormBase: a multi-species resource

for nematode biology and genomics. Nucleic Acids Res., 32

(Database issue), D411D417.

Lussier,Y.A. and Li,J. (2004) Terminological mapping for high

throughput comparative biology of phenotypes. Pac. Symp.

Biocomput., 202213.

Mewes,H.W., Amid,C., Arnold,R., Frishman,D., Guldener,U.,

Mannhaupt,G., Munsterkotter,M., Pagel,P., Strack,N., Stump-

flen,V., Warfsmann,J. and Ruepp,A. (2004) MIPS: analysis and

annotation of proteins from whole genomes. Nucleic Acids Res.,

32 (Database issue), D41D44.

Paddison,P.J., Silva,J.M., Conklin,D.S., Schlabach,M., Li,M.,

Aruleba,S., Balija,V., OShaughnessy,A., Gnoj,L., Scobie,K.,

(2004) A resource for large-scale RNA-interference-based screens

in mammals. Nature, 428, 427431.

Pruitt,K.D. and Maglott,D.R. (2001) RefSeq and LocusLink: NCBI

gene-centered resources. Nucleic Acids Res, 29, 13740.

Schoof,H., Ernst,R., Nazarov,V., Pfeifer,L., Mewes,H.W. and

Mayer,K.F. (2004) MIPS Arabidopsis thaliana Database

(MAtDB): an integrated biological knowledge resource for plantgenomics. Nucleic Acids Res., 32 (Database issue), D373D376.

Wheeler,D.L., Church,D.M., Edgar,R., Federhen,S., Helmberg,W.,

Madden,T.L., Pontius,J.U., Schuler,L.M., Schriml,G.D.,

Sequeira,E. et al. (2004) Database resources of the National

Center for Biotechnology Information: update. Nucleic Acids

Res., 32 (Database issue), D35D40.

420

Documents

Kahraman Weiss 2004