

An Automated System for Deep Proteome Annotation

Gary Van Domselaar†, Savita Shrivastava, Paul Stothard and David S. Wishart‡

[Figure: workflow diagram with numbered steps 1-5. Labels: Unannotated Genome Sequence Data; Unannotated Protein Sequence Data; Internal Processing; External Processing; Annotated Protein Sequence Data; Deeply Annotated Model Organisms (Human, Mouse, E. coli, D. melanogaster, S. cerevisiae, C. elegans, etc.); Visualization and Mining Software.]

Abstract

Most biological databases in existence today are focused on a narrow biological domain. As such, they are unable to address biological questions outside of that domain. Researchers wishing to address broad biological questions must manually compile data from several biological data sources.

This poster describes our progress on the development of an automated system for deeply annotating the proteomes of model organisms (and others), together with an intuitive data mining and data visualization system that provides detailed information for broad biological queries. The Deep Annotation system is part of the PENCE Proteome Analyst project.

Local Databases: SwissProt

Description

1. The system accepts proteomic or genomic data. If the user submits genomic data, gene predictions can be performed with Glimmer or Genscan.

2. The unannotated sequences enter the Proteome Annotation System.

3. Sequences are compared against existing deeply annotated databases. Sequences with sufficient homology inherit the appropriate annotations. Other annotations are computed locally.

4. Annotations unavailable locally are obtained by querying servers and databases across the Internet.

5. The annotated sequence data is added to the database of annotated organisms and made available for viewing and querying.

6. Annotations are viewable over the Web using CGView for circular chromosomes and LGView for linear chromosomes. Broad queries can be made across organisms for an arbitrary subset of the available annotations.
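The lookup order in steps 3 and 4 (inherit from a deeply annotated reference, otherwise compute locally, otherwise query a remote server) can be sketched as follows. This is a minimal illustration and not the Proteome Analyst code: the `AnnotatedProtein` record, the `annotate` function, and the three toy lookup functions are hypothetical names introduced here, and the toy lookups stand in for real homology searches, prediction modules, and Internet queries.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical lookup signature: (sequence, annotation_name) -> value or None.
Lookup = Callable[[str, str], Optional[str]]


@dataclass
class AnnotatedProtein:
    """Hypothetical record holding annotations and where each one came from."""
    identifier: str
    sequence: str
    annotations: Dict[str, str] = field(default_factory=dict)
    sources: Dict[str, str] = field(default_factory=dict)


def annotate(identifier: str, sequence: str, names: List[str],
             inherit: Lookup, compute_local: Lookup,
             query_remote: Lookup) -> AnnotatedProtein:
    """Fill each requested annotation from the first source that supplies it."""
    protein = AnnotatedProtein(identifier, sequence)
    ordered_sources = (
        ("inherited", inherit),       # step 3: transfer from a homologue
        ("computed", compute_local),  # step 3: computed locally
        ("remote", query_remote),     # step 4: external servers
    )
    for name in names:
        for label, lookup in ordered_sources:
            value = lookup(sequence, name)
            if value is not None:
                protein.annotations[name] = value
                protein.sources[name] = label
                break  # stop at the first source that answers
    return protein


# Toy stand-ins (assumptions for illustration only).
def toy_inherit(sequence: str, name: str) -> Optional[str]:
    reference = {"MKT": {"name": "toy kinase"}}
    return reference.get(sequence, {}).get(name)


def toy_compute(sequence: str, name: str) -> Optional[str]:
    return str(len(sequence)) if name == "length" else None


def toy_remote(sequence: str, name: str) -> Optional[str]:
    return "cytoplasm" if name == "location" else None


result = annotate("p1", "MKT", ["name", "length", "location", "pathway"],
                  toy_inherit, toy_compute, toy_remote)
print(result.annotations)
print(result.sources)
```

Recording the source next to each value keeps inherited annotations distinguishable from computed or remotely fetched ones, which matters when an expert later verifies the results. An annotation that no source can supply (here `pathway`) is simply left out rather than guessed.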


Progress

The workflow engine, database comparison, data input/output and HTML rendering systems are in place. A number of annotation computing modules have been implemented (Pfam, PROSITE, Protein Name Finder, Orthologues, Paralogues, Molecular Weight, pI, Subcellular Location Prediction, and Function Prediction). Many more are being written. We are currently working on improving the data storage and querying systems. An initial release is planned for mid-summer 2004.

An early test version of the output (on H. influenzae) is available at: http://redpoll.pharmacy.ualberta.ca/~savita/ha_series/
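Of the locally computed annotations listed under Progress, molecular weight is the simplest to illustrate. The sketch below is not the project's module; it is one common way to estimate the average molecular weight of an unmodified protein, by summing standard average residue masses and adding one water for the free termini. The function name `molecular_weight` and the decision to reject non-standard residue codes are assumptions made here.

```python
# Average residue masses in daltons (amino acid minus one water).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # added once for the N- and C-terminal groups


def molecular_weight(sequence: str) -> float:
    """Estimate the average molecular weight (Da) of an unmodified protein."""
    sequence = sequence.strip().upper()
    if not sequence:
        raise ValueError("empty sequence")
    unknown = sorted(set(sequence) - set(AVERAGE_RESIDUE_MASS))
    if unknown:
        # Assumption: ambiguous codes such as X or B are rejected, not guessed.
        raise ValueError(f"unsupported residue codes: {unknown}")
    return sum(AVERAGE_RESIDUE_MASS[r] for r in sequence) + WATER_MASS


print(round(molecular_weight("G"), 2))  # free glycine, about 75.07 Da
```

The estimate ignores post-translational modifications, disulfide bonds, and signal peptide cleavage, so it describes the translated sequence rather than the mature protein.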

References

Proteome Analyst: http://www.cs.ualberta.ca/~bioinfo/PA/
Cybercell: http://redpoll.pharmacy.ualberta.ca/CCDB/
MagPie: http://magpie.ucalgary.ca/
GeneQuiz: http://jura.ebi.ac.uk:8765/ext-genequiz/
Pedant: http://pedant.gsf.de/
Ensembl: http://ensembl.org/

Introduction

Prior to the advent of high-throughput sequencing, most biologists annotated or characterized genes and proteins manually, one at a time. For genome-scale annotation, however, it is too time-consuming to predict the properties of each protein sequence or to organize the results of many prediction tools by hand. Furthermore, the enormous volume of biological information, the sheer number of different data sources, and their growing heterogeneity have created an 'information labyrinth' in which one can easily lose one’s way. Clearly, a high degree of automation is required to cope with the analysis of the huge number of sequences generated by genome sequencing projects and to ensure consistent and reproducible results. This automation could free the expert to verify and refine these analyses and to follow up new discoveries. A number of systems developed over the past few years permit automated genome-wide or proteome-wide annotation, such as the Ensembl system, PEDANT, Magpie, GeneQuiz, and Proteome Analyst.

The above-mentioned systems are web-based tools designed to identify genes, parse data, translate sequences, search against public databases, identify domains or motifs and perform predictive analyses. Many of these packages provide user-customizable searches and graphical, hyperlinked output. The level of interpretation or inference offered by these annotation systems varies widely, with some offering only raw data in a consolidated format and others inferring function or ontology through detailed analysis.

A common problem with many existing automated annotation systems is that the “depth” of annotation for any given gene or protein is quite limited or “shallow”, typically consisting of 10-15 pieces of information. We are working on an automated system (the Proteome Analyst system) for deeply annotating the proteomes of model organisms, and on an intuitive data mining and data visualization system. Deep annotation means that the proteome or genome is annotated to a level that includes such items as predicted protein location, 2D or 3D structure, detailed or specific functions, post-translational modifications, expression levels, interacting partners, domains, active sites, substrates, ligands, pathways, cofactors, copy numbers, etc. An example of this kind of “deep” annotation can be seen in the Cybercell database.

This “deep annotation” project contains a software engineering component that integrates existing data and methods to perform a scientific analysis of the integrated data. The results are therefore of interest from both the scientific and the software engineering points of view. The system may be used to support a wide range of biologists and could be a platform for further developments. Because the similarity of functions between related proteins varies substantially with species context and evolutionary distance, the relevant analyses and annotations also differ between the kingdoms (viruses, archaebacteria, protista, fungi, animalia, eubacteria, plantae). The major challenge of this project is to develop custom analysis pipelines for each kingdom.
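One way to organize kingdom-specific pipelines is a small registry in which some analysis steps are shared by every kingdom and others are attached only to the kingdoms that need them. The sketch below shows that structure under stated assumptions: the registry names (`SHARED_STEPS`, `KINGDOM_STEPS`), the decorators, and both example steps are hypothetical, and which real analyses belong to which kingdom is deliberately left open.

```python
from typing import Callable, Dict, List

# A step takes a protein sequence and returns zero or more annotations.
Step = Callable[[str], Dict[str, str]]

SHARED_STEPS: List[Step] = []              # run for every kingdom
KINGDOM_STEPS: Dict[str, List[Step]] = {}  # extra steps keyed by kingdom


def shared(step: Step) -> Step:
    """Register a step that applies to all kingdoms."""
    SHARED_STEPS.append(step)
    return step


def for_kingdoms(*kingdoms: str) -> Callable[[Step], Step]:
    """Register a step that applies only to the named kingdoms."""
    def decorator(step: Step) -> Step:
        for kingdom in kingdoms:
            KINGDOM_STEPS.setdefault(kingdom, []).append(step)
        return step
    return decorator


def run_pipeline(kingdom: str, sequence: str) -> Dict[str, str]:
    """Run the shared steps, then the steps registered for this kingdom."""
    annotations: Dict[str, str] = {}
    for step in SHARED_STEPS + KINGDOM_STEPS.get(kingdom, []):
        annotations.update(step(sequence))
    return annotations


@shared
def sequence_length(sequence: str) -> Dict[str, str]:
    return {"length": str(len(sequence))}


# Placeholder only: the kingdoms chosen here are an arbitrary illustration.
@for_kingdoms("fungi", "animalia", "plantae")
def toy_extra_step(sequence: str) -> Dict[str, str]:
    return {"extra": "placeholder"}


print(run_pipeline("eubacteria", "MKT"))
print(run_pipeline("fungi", "MKT"))
```

With this layout, adding a new kingdom or a new analysis changes only the registration lines, not the code that runs the pipeline.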

Department of Computing Science and Biological Sciences
University of Alberta
Edmonton AB T6E 2E9
†[email protected]
‡[email protected]