Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
1 SRI International Bioinformatics
Pathway Bioinformatics: Inference, Visualization, and Analysis
Peter D. Karp, Ph.D.Bioinformatics Research Group
BioCyc.orgEcoCyc.org, MetaCyc.org, HumanCyc.org
2 SRI International Bioinformatics
Overview
MotivationsPathway/genome databases
BioCyc collectionEcoCyc, MetaCyc
Pathway Tools softwareVisualization, Editing, AnalysisInference tools: Pathway hole filler, Transport inference parserReachability analysis of metabolic network
ApplicationsNew levels of genome annotationOmics data analysisPublish curated pathway/genome database on the web
3 SRI International Bioinformatics
Model Organism Databases /Organism Specific Databases
DBs that describe the genome and other information about an organism
Every sequenced organism with an active experimental community requires a MOD
Integrate genome data with information about the biochemical and genetic network of the organismIntegrate literature-based information with computational predictions
Curated by experts for that organismNo one group can curate all the world’s genomesDistribute workload across a community of experts to create a community resource
4 SRI International Bioinformatics
Rationale for MODs
Each “complete” genome is incomplete in several respects:40%-60% of genes have no assigned functionRoughly 7% of those assigned functions are incorrectMany assigned functions are non-specific
Need continuous updating of annotations with respect to new experimental data and computational predictions
Gene positions, sequence, gene functions, regulatory sites, pathways
MODs are platforms for global analyses of an organismInterpret omics data in a pathway contextIn silico prediction of essential genesCharacterize systems properties of metabolic and genetic networks
5 SRI International Bioinformatics
BioCyc Collection of Pathway/Genome Databases
Pathway/Genome Database (PGDB) –combines information about
Pathways, reactions, substratesEnzymes, transportersGenes, repliconsTranscription factors/sites, promoters, operons
Tier 1: Literature-Derived PGDBsMetaCycEcoCyc -- Escherichia coli K-12
Tier 2: Computationally-derived DBs, Some Curation -- 20 PGDBs
HumanCycMycobacterium tuberculosis
Tier 3: Computationally-derived DBs, No Curation -- 349 DBs
6 SRI International Bioinformatics
Pathway Tools Overview
Pathway/GenomeEditors
Pathway/GenomeDatabase
PathoLogicAnnotatedGenome
MetaCycReference
Pathway DB
Pathway/GenomeNavigator
7 SRI International Bioinformatics
Pathway Tools Software: PathoLogic
Computational creation of new Pathway/Genome Databases
Transforms genome into Pathway Tools schema and layers inferred information above the genome
Predicts operonsPredicts metabolic networkPredicts which genes code for missing enzymes in metabolic pathways Infers transport reactions from transporter names
Bioinformatics 18:S225 2002
8 SRI International Bioinformatics
Pathway Tools Software:Pathway/Genome Editors
Interactively update PGDBs with graphical editors
Support geographically distributed teams of curators with object database system
Gene editorProtein editorReaction editorCompound editorPathway editorOperon editorPublication editor
9 SRI International Bioinformatics
Pathway Tools Software:Pathway/Genome Navigator
Querying and visualization of:PathwaysReactionsMetabolitesProteinsGenesChromosomes
Two modes of operation:Web modeDesktop modeMost functionality shared, but each has unique functionality
10 SRI International Bioinformatics
Pathway Tools Ontology / Schema
Ontology classes: 1621Many datatypes from genomes to pathwaysClassification schemes for pathways, chemical compounds, enzymatic reactions (EC system)Cell Component Ontology Protein Feature ontology
Comprehensive set of 221 attributes and relationshipsEvidence codes, supporting citations
11 SRI International Bioinformatics
BioCyc Tier 3
Tier 3 DBs created by subjecting annotated genomes to PathoLogic computational processing pipeline:
Pathway predictionPathway Hole FillingOperon prediction (bacteria)Transport Inference Parser
349 PGDBsSource: CMR database
12 SRI International Bioinformatics
Pathway Tools Software: PGDBs Created Outside SRI
1,300+ licensees: 75+ groups applying software to 200+ organisms
Saccharomyces cerevisiae, SGD project, Stanford UniversityMouse, MGD, Jackson LaboratorydictyBase, Northwestern UniversityUnder development:
CGD (Candida albicans), Stanford UniversityDrosophila, P. Ebert in collaboration with FlyBaseC. elegans, P. Ebert in collaboration with WormBase
Planned:RGD (Rat), Medical College of Wisconsin
Arabidopsis thaliana, TAIR, Carnegie Institution of WashingtonPlantCyc, ~20 plant PGDBs, Carnegie Institution of WashingtonSix Solanaceae species, Cornell University GrameneDB, Cold Spring Harbor LaboratoryMedicago truncatula, Samuel Roberts Noble Foundation
13 SRI International Bioinformatics
EcoCyc Project – EcoCyc.orgE. coli Encyclopedia
Review-level Model-Organism Database for E. coliTracks evolving annotation of the E. coli genome and cellular networksThe two paradigms of EcoCyc
“Multi-dimensional annotation of the E. coli K-12 genome”Positions of genes; functions of gene products – 76% / 66% expGene Ontology terms; MultiFun termsGene product summaries and literature citationsEvidence codesMultimeric complexesMetabolic pathwaysRegulation of transcription initiation
Nuc. Acids Res. 35:7577 2007 ASM News 70:25 2004 Science 293:2040
Karp, Gunsalus, Collado-Vides, Paulsen
14 SRI International Bioinformatics
EcoCyc = E.coli Dataset + Pathway/Genome Navigator
Genes: 4,471
Proteins: 4,464Complexes: 808
RNAs: 285
Reactions:Metabolic: 923Transport: 259
Pathways: 224
Compounds: 1,227
URL: EcoCyc.org
Gene Regulation:Operons: 3,187Trans Factors: 179Promoters: 1,343
TF Binding Sites: 1,189
Citations: 16,153
15 SRI International Bioinformatics
Paradigm 1:EcoCyc as Textual Review Article
All gene products for which experimental literature exists are curated with a minireview summary
Found on protein and RNA pages, not gene pages!3257 gene products contain summaries
Summaries cover function, interactions, mutant phenotypes, crystal structures, regulation, and more
Additional summaries found in pages for operons, pathways
EcoCyc cites 15,880 publications
16 SRI International Bioinformatics
Paradigm 2: EcoCyc as Computational Symbolic Theory
Highly structured, high-fidelity knowledge representation provides computable informationEach molecular species defined as a DB object
Genes, proteins, small moleculesEach molecular interaction defined as a DB object
Metabolic reactionsTransport reactionsTranscriptional regulation of gene expression
220 database fields capture extensive properties and relationships
17 SRI International Bioinformatics
EcoCyc Procedures
DB updates performed by 5 staff curatorsInformation gathered from biomedical literature
Enter data into structured database fieldsAuthor extensive summariesUpdate evidence codes
Corrections submitted by E. coli researchers
Four releases per year
Quality assurance of data and softwareEvaluate database consistency constraintsPerform element balancing of reactionsRun other checking programs
18 SRI International Bioinformatics
EcoCyc Accelerates Science
ExperimentalistsE. coli experimentalistsExperimentalists working with other microbesAnalysis of expression data
Computational biologists Biological research using computational methodsGenome annotationStudy connectivity of E. coli metabolic networkStudy phylogentic extent of metabolic pathways and enzymes in all domains of life
BioinformaticistsTraining and validation of new bioinformatics algorithms – predict operons, promoters, protein functional linkages, protein-protein interactions,
Metabolic engineers“Design of organisms for the production of organic acids, amino acids, ethanol, hydrogen, and solvents “
Educators
19 SRI International Bioinformatics
MetaCyc: Metabolic Encyclopedia
Describe a representative sample of every experimentally determined metabolic pathwayDescribe properties of metabolic enzymes
Literature-based DB with extensive references and commentaryPathways, reactions, enzymes, substrates
Jointly developed by P. Karp, R. Caspi, C. Fulcher, SRI InternationalL. Mueller, A. Pujar, Cornell UnivS. Rhee, P. Zhang, Carnegie Institution
Nucleic Acids Research 2008
20 SRI International Bioinformatics
Applications of MetaCyc
Reference source on metabolic pathways
Metabolic engineeringFind enzymes with desired activities, regulatory propertiesDetermine cofactor requirements
Predict pathways from genomes
Systematic studies of metabolism
Computer-aided education
21 SRI International Bioinformatics
MetaCyc Data -- Version 12.0
16,335Citations
1,108Organisms
6,719Small Molecules
4,731Enzymes
6,739Reactions
1036Pathways
22 SRI International Bioinformatics
Taxonomic Distribution ofMetaCyc Pathways – version 12.0
86Archaea
154Fungi
119Mammals
452Green Plants
691Bacteria
23 SRI International Bioinformatics
Family of Pathway/GenomeDatabases
MetaCyc
EcoCycCauloCycAraCyc
MtbRvCycHumanCyc
24 SRI International Bioinformatics
PathoLogic Step 1: Translate Genome to PGDBPathway/Genome
DatabaseAnnotated Genomic
Sequence
Genes/ORFs
Gene Products
DNA Sequences
Reactions
Pathways
Compounds
Multi-organism PathwayDatabase (MetaCyc)
PathoLogic Software
Integrates genome and pathway data to identify
putative metabolic networks
Genomic Map
Genes
Gene Products
Reactions
Pathways
Compounds
25 SRI International Bioinformatics
PathoLogic Step 4: Pathway Hole Filler
Definition: Pathway Holes are reactions in metabolic pathways for which no enzyme is identified
holesholes
LL--aspartateaspartate iminoaspartateiminoaspartate
quinolinatequinolinate
nicotinate nicotinate nucleotidenucleotide
deamidodeamido--NADNAD
NADNAD
1.4.31.4.3..--
2.7.7.182.7.7.18
quinolinate synthetasequinolinate synthetasenadAnadA
n.n. pyrophosphorylasen.n. pyrophosphorylasenadCnadC
NAD+ synthetase, NH3 NAD+ synthetase, NH3 --dependentdependent
CC3619CC3619
6.3.5.16.3.5.1
26 SRI International Bioinformatics
organism 1 enzyme A
organism 2 enzyme A
organism 5 enzyme A
organism 4 enzyme A
organism 3 enzyme A
organism 8 enzyme A
organism 7 enzyme A
organism 6 enzyme A
Step 1: Query UniProt for all sequences having
EC# of pathway hole
Step 2: BLAST against target
genome Step 3 & 4: Consolidate
hits and evaluate evidence
7 queries have high-scoring hits to sequence Y
gene Ygene X
gene Z
27 SRI International Bioinformatics
Bayes Classifier
P(protein has function X|P(protein has function X|EE--value, avg. rank, aln. length, etc.)value, avg. rank, aln. length, etc.)
protein has protein has function Xfunction X
% of query % of query alignedaligned
avg. rank in avg. rank in BLAST outputBLAST output
bestbestEE--valuevalue
Number of Number of queriesqueries
pwy pwy directondirecton
adjacent adjacent rxnsrxns
28 SRI International Bioinformatics
Pathway Hole Filler
Why should hole filler find things beyond the original genome annotation?
Reverse BLAST searches more sensitiveReverse BLAST searches find second domainsIntegration of multiple evidence types
29 SRI International Bioinformatics
Caulobacter crescentus Pathway Holes
130 pathways containing 582 reactions92 pathways contain 236 pathway holes
Caulobacter holes filled:77 holes filled at P >0.9
Previous functions of candidate hole fillers:No predicted functionCorrectly assigned single functionIncorrectly assigned functionImprecise functional assignment
BMC Bioinformatics 5:76 2004
30 SRI International Bioinformatics
PathoLogic Step 5:Transport Inference Parser
Problem: Write a program to query a genome annotation to compute the substrates an organism can transport
Typical genome annotations for transporters:ATP transporter for riboseribose ABC transporterD-ribose ATP transporterABC transporter, membrane spanning protein [ribose]ABC transporter, membrane spanning protein [D-ribose]
31 SRI International Bioinformatics
Transport Inference Parser
Input: “ATP transporter of phosphonate”Output: Structured description of transport activity
Locates most transporters in genome annotation using keyword analysis
Parse product name using a series of rules to identify:Transported substrate, co-substrateInflux/effluxEnergy coupling mechanism
Creates transport reaction object:
phosphonate[periplasm] + H2O + ATP = phosphonate + Pi + ADP
32 SRI International Bioinformatics
Transport Inference Parser
Permits symbolic computation with transport activities:Compute transportable substrates of the cellCompute connectivity among compartments for substratesFacilitate reasoning about transport/metabolism connectionsDraw transport cartoon in protein pages, cellular overview
33 SRI International Bioinformatics
Tools
34 SRI International Bioinformatics
Pathway Tools Overviews and Omics Viewers
Designed to avoid the hairball effectGenerated automatically from PGDBMagnify, interrogateOmics viewers paint omics data onto
overview diagramsDifferent perspectives on same datasetUse animation for multiple time points or conditionsPaint any data that associates numbers with genes, proteins, reactions, or metabolites
Provide genome-scale visualizations of cellular networksHarness human visual system to interpret patterns in biological
contexts
35 SRI International Bioinformatics
36 SRI International Bioinformatics
Infer Anti-Microbial Drug TargetsInfer drug targets as genes coding for enzymes that encode chokepoint reactions
Two types of chokepoint reactions:
Chokepoint analysis of Plasmodium falciparum:216/303 reactions are chokepoints (73%)All 3 clinically proven anti-malarial drugs target chokepoints21/24 biologically validated drug targets are chokepoints11.2% of chokepoints are drug targets3.4% of non-chokepoints are drug targets=> Chokepoints are significantly enriched for drug targets
Genome Research 14:917 2004
37 SRI International Bioinformatics
Reachability Analysis of Metabolic Network
Given:A PGDB for an organismA set of initial metabolites
Infer:What set of products can be synthesized by the small-molecule metabolism of the organism
Can known growth medium yield known essential compounds?
Romero and Karp, Pacific Symposium on Biocomputing, 2001
38 SRI International Bioinformatics
Algorithm: Forward PropagationThrough Production System
Each reaction becomes a production ruleEach metabolite in nutrient set becomes an axiom
Nutrientpool
Metabolitepool
“Fire”reactions
Transport
Products
Reactants
PGDBreaction
set
39 SRI International Bioinformatics
A + B W
E + F Y
C + D X
W + Y Z
Nutrients: A, B, C, E, F
Produced Compounds: W, Y, Z
40 SRI International Bioinformatics
Initial Metabolite Nutrient Set (Total: 21 compounds)
Nutrients (8) (M61 Minimal growth medium)
H+, Fe2+, Mg2+, K+, NH3, SO4
2-, PO42-, Glucose
Nutrients (10) (Environment)
Water, Oxygen, Trace elements (Mn2+, Co2+, Mo2+, Ca2+, Zn2+, Cd2+, Ni2+, Cu2+)
Bootstrap Compounds (3) ATP, NADP, CoA
41 SRI International Bioinformatics
Essential CompoundsE. coli Total: 41 compounds
Proteins (20)Amino acids
Nucleic acids (DNA & RNA) (8)Nucleosides
Cell membrane (3)Phospholipids
Cell wall (10)Peptidoglycan precursorsOuter cell wall precursors (Lipid-A, oligosaccharides)
42 SRI International Bioinformatics
Results
Phase I: Forward propagation21 initial compounds yielded only half of the 41 essential compounds for E. coli
Phase II: Manually identifyBugs in EcoCyc (e.g., two objects for tryptophan)
A B B’ CIncomplete knowledge of E. coli metabolic network
A + B C + D“Bootstrap compounds”Missing initial protein substrates (e.g., ACP)
Protein synthesis not represented
Phase III: Forward propagation with 11 more initial metabolites
Yielded all 41 essential compounds
43 SRI International Bioinformatics
Summary
Pathway/Genome DatabasesMetaCyc non-redundant DB of literature-derived pathways370 organism-specific PGDBs available through SRI at BioCyc.orgComputational theories of biochemical machinery
Pathway Tools softwareExtract pathways from genomesOmics data analysisQuery, visualization, WWW publishing
44 SRI International Bioinformatics
BioCyc and Pathway Tools Availability
BioCyc.org Web site and database files freely available to all
Pathway Tools freely available to non-profitsMacintosh, PC/Windows, PC/Linux
45 SRI International Bioinformatics
Acknowledgements
SRISuzanne Paley, Ron Caspi, Ingrid Keseler, Carol Fulcher, Markus Krummenacker, Alex Shearer, Tomer Altman, Joe Dale, Fred Gilham, Pallavi Kaipa
EcoCyc CollaboratorsJulio Collado-Vides, Robert Gunsalus, Ian Paulsen
MetaCyc CollaboratorsSue Rhee, Peifen Zhang, Kate DreherLukas Mueller, Anuradha Pujar
Funding sources:NIH National Center for Research ResourcesNIH National Institute of General Medical SciencesNIH National Human Genome Research Institute
BioCyc.org
Learn more from BioCyc webinars: biocyc.org/webinar.shtml