
BrainProfileDB – a platform for integration of functional genomics data


TECHNICAL BRIEF


Johannes Schuchhardt 1, Alexander Glintschert 1, Daniela Hartl 2, Martin Irmler 3, Johannes Beckers 3,4, Christian Stephan 5, Katrin Marcus 5, Joachim Klose 2, Helmut E. Meyer 5 and Arif Malik 1

1 MicroDiscovery GmbH, Berlin, Germany
2 Institute for Human Genetics, Charité-Universitätsmedizin-Berlin, Berlin, Germany
3 Institute of Experimental Genetics, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany
4 Technical University Munich, Center of Life and Food Sciences Weihenstephan, Chair of Experimental Genetics, Weihenstephan, Germany
5 Medizinisches Proteom Center, Ruhr Universität Bochum, Bochum, Germany

BrainProfileDB is a database system for integrating large sets of high-throughput functional genomics data of the Human Brain Proteome Project (HBPP). Within HBPP (http://www.smp-proteomics.de/) the molecular pathology of neurodegenerative diseases is investigated using complementary methods from transcriptomics, proteomics, toponomics and interaction measurements. The aim of the database system is to provide the broad spectrum of scientific users joined in the consortium with a practical, integrated view of their data. By employing appropriate mapping techniques and levels of data representation, the system relieves the user of the technical details of gene identification and experimental measurement technique.

Received: July 30, 2007
Revised: November 15, 2007
Accepted: November 25, 2007

Keywords:

Data analysis / Functional genomics database / High-throughput data management / Integrating genomics and proteomics / Neurodegenerative diseases

1162 Proteomics 2008, 8, 1162–1164

Correspondence: Dr. Johannes Schuchhardt, MicroDiscovery GmbH, Marienburger Str. 1, 10405 Berlin, Germany
E-mail: [email protected]
Fax: +49(0)30-4435090-10

Abbreviations: HBPP, Human Brain Proteome Project; MIAPE, minimum information about a proteomics experiment

DOI 10.1002/pmic.200700743

© 2008 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.proteomics-journal.com

Great efforts are under way in the bioinformatics community to standardize and integrate data from different sources [1]. The Microarray Gene Expression Data (MGED) society (http://www.mged.org) successfully established MIAME, the ‘minimum information about a microarray experiment’ [2]. The Proteomics Standards Initiative (PSI) of the HUPO defines community standards for data representation in proteomics in MIAPE (‘minimum information about a proteomics experiment’) [3]. BrainProfileDB (http://BrainProfileDB.microdiscovery.de) is designed for compatibility with the defined standards of these initiatives without neglecting the main requirement of the project: to provide an easy-to-handle, user-friendly interface. The database can represent different study designs, species or tissues in a uniform manner. Beyond that, it can be extended dynamically with new biomaterial and assay attributes. This makes the database a powerful tool for the integration and analysis of data from different sources.

Within the Human Brain Proteome Project (HBPP), transcript and protein expression profiling data are collected according to a unified, project-wide study design. The BrainProfileDB system provides information on the quantitative expression of these biomolecules to all partners of the HBPP consortium. The broad spectrum of professional backgrounds represented within HBPP poses a specific challenge to the design of the system: information should be made available to biologists, physicians and bioinformaticians alike. In designing BrainProfileDB we addressed this challenge by confining the system in several ways: we avoided the complexity of image data or probe-level raw data, and we restricted the set of data analysis and visualization tools to those considered most important and useful to the consortium. Some key properties of the system are: intuitive user guidance with graphical user interfaces; simple queries owing to automatic mapping of biomolecule identifiers; distributed access through a web interface; and compatibility with world-wide standards, employing concepts from microarray and gene expression (MAGE), MIAPE and the PRoteomics IDEntifications database (PRIDE) (http://www.ebi.ac.uk/pride/). On the technical side, BrainProfileDB is extensible owing to modern relational database technology and a client/server architecture based on the Java Platform, Enterprise Edition (JEE) (http://java.sun.com/javaee/).

Incorporating ongoing efforts on genomics and proteomics data integration, we designed a data model representing key functional and semantic aspects of the experiments carried out in the project, together with an elaborate model of study design. A specific study design implies a set of appropriate operations on the data, e.g. comparing case and control groups or searching for patterns in a time series. The system's central data structure is termed ‘Study’ and contains meta-information such as a short description of the experiments and links to the literature. The study describes a set of related assays and biomaterials and serves as a container for data from all experiments to which these assays and biomaterials have been subjected. Assays and biomaterials are equipped with extensible dimensional attributes such as treatment, disease state or time stamp, which are used as experimental factors for grouping assays and defining specific study designs such as case/control or time series experiments (http://www.mged.org/Workgroups/MAGE/mage-om.html).
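The relationship between studies, assays, biomaterials and experimental factors may be easier to follow in code. The following Python sketch is only an illustration of the concept: the class names, field names and the example study are assumptions made here and do not reflect the actual BrainProfileDB schema, which is implemented on a relational database with a JEE architecture.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Biomaterial:
    """A biological sample with free-form, extensible attributes (hypothetical)."""
    name: str
    attributes: dict = field(default_factory=dict)  # e.g. {"disease_state": "case"}


@dataclass
class Assay:
    """One measurement run performed on a biomaterial (hypothetical)."""
    name: str
    biomaterial: Biomaterial
    attributes: dict = field(default_factory=dict)  # e.g. {"treatment": "none"}

    def factor(self, key):
        # Assay attributes take precedence; otherwise fall back to the biomaterial.
        return self.attributes.get(key, self.biomaterial.attributes.get(key))


@dataclass
class Study:
    """Container for related assays plus descriptive meta-information."""
    title: str
    description: str = ""
    literature: list = field(default_factory=list)
    assays: list = field(default_factory=list)

    def group_by(self, factor_key):
        """Group assay names by the value of one experimental factor."""
        groups = defaultdict(list)
        for assay in self.assays:
            groups[assay.factor(factor_key)].append(assay.name)
        return dict(groups)


# Illustrative time series design with two replicates per time point.
study = Study(title="Example developmental study")
for day, replicate in [(9.5, 1), (9.5, 2), (11.5, 1), (11.5, 2)]:
    sample = Biomaterial(f"sample_d{day}_{replicate}", {"time_point": day})
    study.assays.append(Assay(f"assay_d{day}_{replicate}", sample))

groups = study.group_by("time_point")
print(groups)
```

Because attributes are stored as open key/value pairs, a new experimental factor can be added without changing the classes, which mirrors the extensibility described in the text.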

The study concept enables a top-down approach to data access: a study may be identified by keyword search, and its structure and content may be visualized in a study inspection view. Finally, individual measurements are used to perform analyses of biological states and expression profiles. Special attention was given to an interactive graphical interface providing informative views of the data. Different plot types are available for data analysis; two of them are shown as examples in Fig. 1. A volcano plot at the protein level allows the selection of differentially expressed proteins based on ratio and p-value. A correlation plot is used to gain insight into the effects of transcriptional regulation on gene and protein expression levels. For any study and comparison of interest, this kind of analysis can be performed within just a few configuration and selection operations. Integration work is largely done by the system in an automated fashion, by selecting mapping tables that match genes and proteins properly and by calculating means of experimental replicates as defined in the study design. For a more advanced analysis comparing gene and protein datasets, see the work of Irmler et al. in this issue. All analyses in this application can be combined freely using a clipboard mechanism that transfers groups of gene names from one view to another. The user can thereby analyse a specific set of differentially expressed genes in different studies as well as from different viewpoints. Each graphical view has a corresponding tabular view providing the relevant numeric data. Using the export facility of the tabular view, data can easily be transferred to a local spreadsheet application. A typical operation might be to select a set of data points in the volcano plot or the gene versus protein view and send it to the clipboard. Clipboard content can always be accessed using the ‘show clipboard content’ functionality. The clipboard content can then serve as a gene group for functional characterization based on the term schema of the Gene Ontology [4]. BrainProfileDB offers access to the GOSSIP tool [5], enabling a fully fledged evaluation of gene groups for term enrichment. Functional analysis results in a p-value-ranked list of Gene Ontology terms. The ‘see result as graph’ functionality provided above the term list generates and displays a graph of Gene Ontology terms. Statistically significant terms are shown in a box with a dark-coloured background, and a click on a term box provides a short description of the term. Using different views of the study as starting points for the analysis, a great deal of information on functional properties can be gathered interactively.

Figure 1. Dedicated views combined with efficient visualization and selection tools enable rapid access to general project information and specific measurements. Mean values are calculated automatically for repeat groups as defined in the study design. Graphical selection is used to define subsets of data (marked in dark). (A) Volcano plot of a set of protein expression data derived from a developmental study on mouse fetal brain. Proteins overexpressed at day 11.5 versus 9.5 and p-values are highlighted. (B) Scatter plot correlating gene and protein expression. For protein isoforms mapped to a single gene, the user can choose whether each spot is represented individually or whether the sum of intensities is formed (used here). Analysis results and study content can be exported for local data analysis.
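The workflow of selecting points in the volcano plot, sending them to the clipboard and ranking annotation terms by enrichment can be sketched as follows. This is a minimal illustration under stated assumptions, not the GOSSIP implementation: the thresholds, function names, gene names and term names are hypothetical, a one-sided hypergeometric test is used as a common choice for term enrichment, and no multiple-testing correction is applied.

```python
import math


def select_volcano(rows, min_abs_log2_ratio=1.0, max_p=0.05):
    """Return the gene names passing both the ratio and the p-value cut-off.

    rows: iterable of (gene, log2_ratio, p_value) tuples.
    """
    return {
        gene
        for gene, log2_ratio, p_value in rows
        if abs(log2_ratio) >= min_abs_log2_ratio and p_value <= max_p
    }


def hypergeom_enrichment_p(n_universe, n_annotated, n_selected, n_hits):
    """One-sided p-value P(X >= n_hits) under the hypergeometric distribution."""
    upper = min(n_annotated, n_selected)
    total = math.comb(n_universe, n_selected)
    tail = sum(
        math.comb(n_annotated, k)
        * math.comb(n_universe - n_annotated, n_selected - k)
        for k in range(n_hits, upper + 1)
    )
    return tail / total


def rank_terms(clipboard, universe, term_to_genes):
    """Rank annotation terms by enrichment p-value for the clipboard genes."""
    clipboard = clipboard & universe  # ignore genes that were never measured
    ranked = []
    for term, genes in term_to_genes.items():
        annotated = genes & universe
        hits = annotated & clipboard
        p = hypergeom_enrichment_p(
            len(universe), len(annotated), len(clipboard), len(hits)
        )
        ranked.append((term, len(hits), p))
    return sorted(ranked, key=lambda item: item[2])


# Hypothetical measurements: (gene, log2 ratio, p-value).
rows = [
    ("G1", 2.0, 0.001), ("G2", 1.5, 0.01), ("G3", 0.2, 0.5),
    ("G4", -1.2, 0.03), ("G5", 1.1, 0.2), ("G6", 0.1, 0.9),
]
universe = {gene for gene, _, _ in rows}
clipboard = select_volcano(rows)

# Hypothetical term annotations.
terms = {"term_A": {"G1", "G2", "G4"}, "term_B": {"G1", "G3", "G5"}}
ranking = rank_terms(clipboard, universe, terms)
for term, hits, p in ranking:
    print(term, hits, round(p, 3))
```

In a real analysis the p-values would need correction for the large number of terms tested, which this sketch deliberately leaves out.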

Database curation is handled by MicroDiscovery, using dedicated curation software that ensures data integrity, consistency and quality. Data are submitted using specified form sheets for study description, defined in accordance with the MIAME/MIAPE standards. The content of these form sheets is largely reflected in the study representation of the database as presented for inspection by the ‘find study’ user interface. When data are imported, quality control is performed on three levels: single assays, groups of assays (e.g. repeat measurements) and the entire study (Fig. 2). Furthermore, the importer tool allows iterative updating of identification information for protein data. A flexible updating schema is required for protein identification because, owing to the laborious procedures of mass spectrometric identification, protein names are often provided successively at later stages of an investigation. A name mapping system was developed that maps external identifiers to unique internal identifiers. Mapping information is extracted from the ENSEMBL database (http://www.ensembl.org/index.html) [6], GENBANK (http://www.ncbi.nlm.nih.gov/Genbank/) [7] and RefSeq (http://www.ncbi.nlm.nih.gov/RefSeq/) [8]. All mapping tables are updated on a regular basis.

Figure 2. Snapshot of an importer tool applied to a set of microarray data. A series of quality control steps is performed according to a defined workflow. The first steps are dedicated to the definition of the study design; later steps assess data quality. Data quality control is performed on the levels of single assays, groups of assays and entire studies. Correlation matrices, pairs plots and density distributions serve to identify outlier measurements. Single data points as well as entire chips or 2-D gel runs failing quality control can be flagged.
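The name mapping and the iterative updating of protein identifications described in the text can be illustrated with a small sketch. Everything here is an assumption made for illustration: the class and function names, the source labels and the identifiers are hypothetical and are not taken from BrainProfileDB or from the ENSEMBL, GENBANK or RefSeq databases.

```python
from collections import defaultdict


class NameMapper:
    """Maps external identifiers from several sources to one internal ID."""

    def __init__(self):
        self._to_internal = {}  # (source, external_id) -> internal_id

    def load_table(self, source, rows):
        """Load or refresh one mapping table given as (external_id, internal_id) rows."""
        for external_id, internal_id in rows:
            self._to_internal[(source, external_id)] = internal_id

    def resolve(self, source, external_id):
        """Return the internal ID, or None while the identifier is still unknown."""
        return self._to_internal.get((source, external_id))


def sum_by_gene(mapper, source, spots):
    """Sum spot intensities of protein isoforms that map to the same internal ID."""
    totals = defaultdict(float)
    unmapped = []
    for external_id, intensity in spots:
        internal_id = mapper.resolve(source, external_id)
        if internal_id is None:
            unmapped.append(external_id)  # may be resolved by a later update
        else:
            totals[internal_id] += intensity
    return dict(totals), unmapped


mapper = NameMapper()
mapper.load_table("source_x", [("EXT_A1", "gene_1"), ("EXT_A2", "gene_1")])

spots = [("EXT_A1", 10.0), ("EXT_A2", 5.5), ("EXT_B1", 3.0)]
first_totals, first_unmapped = sum_by_gene(mapper, "source_x", spots)
print(first_totals, first_unmapped)

# A later identification update adds the missing mapping.
mapper.load_table("source_x", [("EXT_B1", "gene_2")])
second_totals, second_unmapped = sum_by_gene(mapper, "source_x", spots)
print(second_totals, second_unmapped)
```

Keeping unresolved identifiers in a separate list, rather than discarding them, is one way to support identifications that arrive at a later stage of an investigation.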

While the system currently supports the representation of gene and protein expression data, we envision several directions of further development. An obvious improvement will be the incorporation of additional data types. Concepts are currently being developed for the integration of several types of molecular interaction data (see Schweinsberg et al. in this issue) and of toponomics data (see Bode et al. in this issue). Structured data export that directly supports the formats defined by the MGED and HUPO-PSI consortia is another extension we are aiming at. For expert users, access to lower-level data such as single-probe information would be desirable; however, this might lead to a substantial increase in data load and probably to additional complexity in managing data access.

This work was supported by the German ministry for education and research (BMBF) within the National Genome Research Network 2 (NGFN2), subprojects FKZ01GR0440, FKZ01GR0442, FKZ01GR0444 and FKZ01GR0448.

The authors have declared no conflict of interest.

References

[1] Hwang, D., Rust, A. G., Ramsey, S., Smith, J. J. et al., A data integration methodology for systems biology. Proc. Natl. Acad. Sci. USA 2005, 102, 17296–17301.

[2] Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G. et al., Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat. Genet. 2001, 29, 365–371.

[3] Hermjakob, H., Montecchi-Palazzi, L., Bader, G., Wojcik, J. et al., The HUPO PSI’s molecular interaction format–a community standard for the representation of protein interaction data. Nat. Biotechnol. 2004, 22, 177–183.

[4] Ashburner, M., Ball, C., Blake, J., Botstein, D. et al., Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 2000, 25, 25–29.

[5] Blüthgen, N., Brand, K., Cajavec, B., Swat, M. et al., Biological profiling of gene groups utilizing Gene Ontology. Genome Inform. 2005, 16, 106–115.

[6] Hubbard, T. J., Aken, B. L., Beal, K., Ballester, B. et al., Ensembl 2007. Nucleic Acids Res. 2007, 35, D610–D617.

[7] Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., Wheeler, D. L., GenBank. Nucleic Acids Res. 2007, 35, D21–D25.

[8] Pruitt, K. D., Tatusova, T., Maglott, D. R., NCBI Reference Sequence (RefSeq): A curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2005, 33, D501–D504.
