  • An Automated System for Deep Proteome Annotation

    Gary Van Domselaar, September 27, 2003

  • The Problem

    • Most existing biological databases cover a narrow biological aspect.
      – PDB: biomolecular coordinate data
      – Ensembl: human gene predictions
      – GO: Gene Ontology (process, function, location)

    • Each has a custom interface.

    • Each can answer questions in its own domain but cannot answer questions that span multiple domain boundaries, for example: ‘Which human gene products located in the endoplasmic reticulum have experimental coordinate data?’
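The example question can only be answered by combining all three kinds of source. A minimal sketch of that combination follows; the protein identifiers and the three lookup tables are hypothetical stand-ins made up for illustration, not real PDB, Ensembl, or GO records.

```python
# Hypothetical, hand-made stand-ins for three single-domain databases.
# None of these identifiers or records come from the real resources.
human_gene_products = {"P1", "P2", "P3", "P4"}       # gene-prediction source
cell_location = {                                    # ontology-style location source
    "P1": "endoplasmic reticulum",
    "P2": "nucleus",
    "P3": "endoplasmic reticulum",
    "P5": "endoplasmic reticulum",
}
coordinate_entries = {                               # structure source
    "P1": [],
    "P2": ["struct_A"],
    "P3": ["struct_B"],
    "P5": ["struct_C"],
}


def human_er_products_with_structures(genes, locations, structures):
    """Answer the cross-domain question by intersecting the three sources."""
    in_er = {pid for pid, loc in locations.items() if loc == "endoplasmic reticulum"}
    with_coords = {pid for pid, entries in structures.items() if entries}
    return sorted(genes & in_er & with_coords)


print(human_er_products_with_structures(
    human_gene_products, cell_location, coordinate_entries))
```

No single table can produce the answer: each one holds only one of the three conditions in the question.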

  • The Solution: Integrated Biological Databases

    3 main approaches:

    1. Link Integration. Researchers begin their query with one data source, then follow hypertext links to related information in other data sources. Examples: DAS, NCBI LinkOut.

    2. View Integration. A ‘super interface’ is created that makes the source databases appear as one. Example: Kleisli.

    3. Data Warehousing. All the data are brought under one roof. Examples: GeneCards, GeneMine, CyberCell database.
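Data warehousing can be sketched with an in-memory SQLite database: records from each source are loaded into one schema, so a question spanning several domains becomes a single join. The table names, columns, and rows below are hypothetical and only illustrate the approach; they do not reflect the schema of any database named above.

```python
import sqlite3


def build_warehouse():
    """Load toy records from three hypothetical sources into one database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE gene_product (id TEXT PRIMARY KEY, organism TEXT);
        CREATE TABLE location     (id TEXT, compartment TEXT);
        CREATE TABLE structure    (id TEXT, structure_ref TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO gene_product VALUES (?, ?)",
        [("P1", "human"), ("P2", "human"), ("P3", "mouse")],
    )
    conn.executemany(
        "INSERT INTO location VALUES (?, ?)",
        [("P1", "endoplasmic reticulum"), ("P2", "nucleus"),
         ("P3", "endoplasmic reticulum")],
    )
    conn.executemany(
        "INSERT INTO structure VALUES (?, ?)",
        [("P1", "struct_A"), ("P3", "struct_B")],
    )
    conn.commit()
    return conn


def human_er_with_structure(conn):
    """One join across what used to be three separate sources."""
    rows = conn.execute(
        """
        SELECT DISTINCT g.id
        FROM gene_product AS g
        JOIN location  AS l ON l.id = g.id
        JOIN structure AS s ON s.id = g.id
        WHERE g.organism = 'human'
          AND l.compartment = 'endoplasmic reticulum'
        ORDER BY g.id
        """
    ).fetchall()
    return [row[0] for row in rows]


conn = build_warehouse()
print(human_er_with_structure(conn))
```

The trade-off of this approach is that the warehouse must be reloaded whenever a source changes, whereas link and view integration read the sources in place.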

  • An Automated Proteome Annotation System for Proteome Analyst

    • Proteome Analyst provides annotations in the form of a ‘PA Card’.

    • This system will provide a much fuller set of annotations.

  • Annotations

    • 2D_Gel_Image
    • Accession_No.
    • Alternate_Names
    • Availability
    • Centisome Position
    • Cofactors
    • Copy Number
    • Cys/Met_Content
    • EC_Number
    • Entry_ID
    • Following_Gene
    • Gene_Name
    • Gene_Ontology
    • Gene_Position
    • General_Function
    • General_Reaction
    • Gene_Sequence
    • Homologues
    • Important_Sites
    • Inhibitor
    • Interacting_Partners
    • Kcat_Value_[1/min]
    • Km_Value_[mM]
    • Location
    • Metabolic_Importance
    • Metals_Ions
    • Molecular_Weight
    • No._of_Amino_Acids
    • Other_Databases
    • Paralogues
    • Pfam_Domain/Function
    • Preceding_Gene
    • Products
    • PROSITE_Motif
    • Quaternary_Structure
    • Resolution
    • Riley_Cell_Function
    • Riley_Gene_Function
    • RNA_Copy_No.
    • Secondary_Structure
    • Sequence
    • Similarity
    • Specific_Activity
    • Specific_Function
    • Specific_Reaction
    • Structure_CLASS
    • Substrates
    • SWISS_PROT_(AC_&_ID)
    • Theoretical_pI
    • Transmembrane
    • Upstream_100_bases
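One way to picture a record carrying these fields is a mapping from field name to value, with a helper that reports which fields are still empty. This is an illustrative sketch only: the field subset, the `AnnotationCard` class, and the sample values are assumptions, not the actual PA Card format.

```python
from dataclasses import dataclass, field

# A small subset of the field names from the slide, used for illustration.
FIELDS = ("Gene_Name", "EC_Number", "Location", "Molecular_Weight", "Theoretical_pI")


@dataclass
class AnnotationCard:
    """Hypothetical container for one protein's annotations."""
    entry_id: str
    values: dict = field(default_factory=dict)

    def add(self, name, value):
        # Reject names outside the known field list to catch typos early.
        if name not in FIELDS:
            raise KeyError(f"unknown annotation field: {name}")
        self.values[name] = value

    def missing(self):
        """Return the fields that have not been filled in yet."""
        return [name for name in FIELDS if name not in self.values]


card = AnnotationCard("example_entry")
card.add("Gene_Name", "exampleA")
card.add("Location", "cytoplasm")
print(card.missing())
```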

  • Concept

    Genomic Sequence Data

    Genomic data analysis must be tailored to the major kingdoms:

    • Viruses and prokaryotes: Glimmer

    • Eukaryotes: Genscan

  • Concept

    Genomic Sequence Data

    Gene Identification and Translation

    Proteomic Sequence Data

    Processing

    Internal Processing:

    • Secondary Structure

    • Homology Modeling

    • Mol. Wt

    • pI

    • Etc.
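Of the internal processing steps listed, molecular weight is the simplest to illustrate: sum the average residue masses and add one water for the free termini. A minimal sketch, using standard average residue masses in daltons; the function name is hypothetical, and this is not necessarily how the system itself computes the value.

```python
# Average residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # added once for the N-terminal H and C-terminal OH


def molecular_weight(sequence):
    """Return the average molecular weight of an unmodified protein chain."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    unknown = set(seq) - set(RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unsupported residues: {sorted(unknown)}")
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


print(round(molecular_weight("MKT"), 2))
```

The sketch ignores post-translational modifications and ambiguous residue codes, which a production pipeline would need to handle.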

  • Concept

    Genomic Sequence Data

    Gene Identification and Translation

    Proteomic Sequence Data

    Processing

    Internal Processing

    Internal DBs:

    • CCDB: a deeply annotated database for E. coli

    • CCDB++: other deeply annotated model organisms from each kingdom

    • SWISS-PROT

    • PDB

  • CyberCell (CCDB)

    • A comprehensive collection of detailed enzymatic, biological, chemical, genetic, and molecular biological data about E. coli (strain K12, MG1655).

  • Concept

    Genomic Sequence Data

    Gene Identification and Translation

    Proteomic Sequence Data

    Processing

    Internal Processing

    External Processing

    Internal DBs

    External DBs

  • Data Sources

    • GenBank • SwissProt • Prosite • pI/MW Tool • Geneiz • PIR PEC/Shigen • Echobase • Wisconsin • ExpressDB • GeneOntology • GenProtEC • EcoGene • PsiPred

    • EcoCyc • PDB • CATH • Swiss2D PAGE • SwissModel • BRENDA • TargetDB • Rosetta • PsortB • KEGG • Chemfinder • Babel

  • Concept

    Genomic Sequence Data

    Gene Identification and Translation

    Proteomic Sequence Data

    Processing

    Internal Processing

    External Processing

    Internal DBs

    External DBs

    Annotated Proteomic Sequence Data

    Viewing and Mining Software:

    • Proteome Analyst

    • Multiple Protein Extraction and Report System

  • Data Mining and Visualization

  • Concept

    Genomic Sequence Data

    Gene Identification and Translation

    Proteomic Sequence Data

    Processing

    Internal Processing

    External Processing

    Internal DBs

    External DBs

    Annotated Proteomic Sequence Data

    Viewing and Mining Software

    Discoveries
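The flow in the diagram can be read as a chain of stages, each consuming the previous stage's output. A minimal sketch under that reading follows; every function is a hypothetical placeholder with stubbed logic, and none of it is the system's actual code.

```python
def identify_and_translate(genomic_records):
    """Stage stub: genomic sequence data -> protein records.

    Assumes each input record already carries a translated protein,
    standing in for a real gene finder plus translation step.
    """
    return [{"id": rec["id"], "protein": rec["protein"]} for rec in genomic_records]


def internal_processing(protein):
    """Stub for locally computed properties (here: only sequence length)."""
    return {"No._of_Amino_Acids": len(protein["protein"])}


def external_processing(protein, external_db):
    """Stub for lookups against external databases, keyed by record id."""
    return dict(external_db.get(protein["id"], {}))


def annotate(genomic_records, external_db):
    """Run all stages and return annotated proteomic records."""
    annotated = []
    for protein in identify_and_translate(genomic_records):
        record = dict(protein)
        record.update(internal_processing(protein))
        record.update(external_processing(protein, external_db))
        annotated.append(record)
    return annotated


records = [{"id": "gene1", "protein": "MKT"}]
external = {"gene1": {"Location": "cytoplasm"}}
print(annotate(records, external))
```

Keeping each stage as a separate function means a new analysis or data source can be added without touching the others.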

  • Progress

    • Currently working on the H. influenzae reference genome.

    • Modules written for generating protein sequence data from gene predictions (using Glimmer).

    • Currently writing the analysis modules and automation scripts.
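Generating protein sequence from a predicted gene involves extracting the coding region and translating it codon by codon. A minimal sketch, assuming the gene finder's output has already been parsed into 1-based inclusive start and end coordinates plus a strand; parsing Glimmer's own output is deliberately left out, and the function names are hypothetical. The standard genetic code is used, so alternative bacterial start codons are not given special treatment.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate_gene(genome, start, end, strand="+"):
    """Translate the gene at 1-based inclusive [start, end] on the given strand."""
    coding = genome[start - 1:end].upper()
    if strand == "-":
        coding = reverse_complement(coding)
    protein = []
    for i in range(0, len(coding) - 2, 3):
        aa = CODON_TABLE.get(coding[i:i + 3], "X")  # X marks ambiguous codons
        if aa == "*":  # stop codon ends the protein
            break
        protein.append(aa)
    return "".join(protein)


genome = "CCATGAAAACCTAAGG"
print(translate_gene(genome, 3, 14, "+"))
```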


  • Acknowledgments

    P.I.s: David Wishart, Dwayne Szaffron, Paul Lu, Russel Greiner

    CyberCell Database: Shan Sundararaj, An Chi Guo, Bahram Habibi Nazhad

    Proteome Analyst: Alona Fyshe, David Meeuwis, Roman Eisner, Brett Poulin, Zhiyong Lu, John Anvik, Cam Macdonnel
