Pathway Tools Meeting - December 1, 2005, Geneva (SIB)
Putting together synteny and metabolic information to achieve relevant expert annotation of microbial genomes
Dr Claudine Médigue
What is MaGe? Yet another bacterial annotation platform!…
Its development started in Oct. 2002
Context: the Acinetobacter sp. ADP1 genome annotation (Summer 2004)
Shares functionalities with other existing annotation systems:
• An automatic annotation process:
  Syntactic and functional annotations
  Functional annotation and classification inferences
• A relational database (MySQL) used to store the sequences and the analysis results.
• A WEB interface allowing multiple users to simultaneously annotate a genome.
• Connectivity to other databases or systems
Developed by biologists involved in manual expert annotation
• Graphical interface which focuses on gene context and synteny results with available bacterial proteomes.
Introduction to the Prokaryotic Genome DataBase (PkGDB)
Relational DBMS (MySQL)
Purpose: storage of ‘clean’ and complete annotation data which are subsequently used in the comparative genomic analysis.
• Complete bacterial genomes (RefSeq NCBI and Genome Reviews EBI)
• New bacterial genomes (annotation projects)
• Annotation tool results:
  Intrinsic: genes, signals, repeats, …
  Extrinsic: BLAST, InterPro, COG, synteny, …
Integration in PkGDB
Management of frameshifts
Correction of obvious errors
Syntactic re-annotation
Add missing gene annotations
NAR (WS), 2003
NAR (WS), 2005
Simplified structure of PkGDB
Genomic Objects
Automatic and manual functional assignations
Published genomes / Newly sequenced genomes
Gene prediction: AMIGene
Re-annotation project / Annotation project
Annotation history
Sequence updates and annotation transfer
Functional Classification
Annotator management
Functional predictions
Orthologs & Paralogs
Syntenies
Protein similarities
Domains and motifs
Enzymatic functions
helixes and signal peptides
Uniprot, KEGG, COG, Interpro
Reference annotation for model organisms
Specific regions
Ecogene, Geneprotec, Subtilist
GenomeReviews
NCBI RefSeq
Annotation management
MultiFun, GeneOntology
BioCyc
Project customization
• Multiple correspondences
• Local rearrangements (ins/del)
Boyer et al. Bioinformatics (Nov 2005)
How to read the synteny maps?
ACIAD0574 (hutH)
Two ‘homologs’ to ACIAD0574 on the P. aeruginosa genome
This P. syringae gene (PSPTO0599/hutH-1) is a putative ‘ortholog’ to ACIAD0574 and is involved in a synteny group containing 17 genes (in green)
These two P. syringae genes (PSPTO5274/hutH-2 and 5276/hutH-3) are similar to ACIAD0574 (putative paralogs of PSPTO0599)
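The idea of a synteny group read off the map above (homologous genes that are also neighbours on both genomes) can be sketched in a few lines. This is only an illustration under stated assumptions, not the method of Boyer et al.: the function name `synteny_groups`, the `max_gap` parameter and the rank-based positions are hypothetical, and the grouping is a plain connected-components pass over homolog pairs.

```python
from itertools import combinations

def synteny_groups(pairs, pos_a, pos_b, max_gap=5):
    """Group homolog pairs whose genes are close together on both genomes.

    pairs   : list of (gene_a, gene_b) putative homolog pairs (assumed input)
    pos_a/b : dict gene -> rank of the gene along genome A / genome B
    max_gap : largest rank difference still counted as "neighbouring";
              a value > 1 tolerates small local insertions/deletions
    """
    parent = list(range(len(pairs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(pairs)), 2):
        (a1, b1), (a2, b2) = pairs[i], pairs[j]
        if a1 == a2 or b1 == b2:
            # Same gene matched twice: a multiple correspondence,
            # which by itself says nothing about gene neighbourhood.
            continue
        close_a = abs(pos_a[a1] - pos_a[a2]) <= max_gap
        close_b = abs(pos_b[b1] - pos_b[b2]) <= max_gap
        if close_a and close_b:
            parent[find(i)] = find(j)

    groups = {}
    for i, pair in enumerate(pairs):
        groups.setdefault(find(i), []).append(pair)
    # A single isolated pair is a similarity hit, not a synteny group.
    return [g for g in groups.values() if len(g) > 1]
```

In this sketch a gene with several similar genes on the other genome (as with hutH-1, hutH-2 and hutH-3) only joins a group through the copy whose neighbours are conserved, which mirrors the ortholog/paralog reading of the map.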
A larger view of the previous Acinetobacter ADP1 region
0562 (hisS), 0574 (hutH), 0582-0583 (fabG-fabF)
4 of 138 genomes in PkGDB
9 of 284 complete microbial proteomes (RefSeq section)
How are genes organized in a synteny group?
Synteny with Ralstonia solanacearum Mega Plasmid
Synteny with Ralstonia solanacearum chromosome
Synteny maps are useful to annotate gene fusion/fission
Colored rectangles represent the part of the protein which aligns with the corresponding Acinetobacter protein.
Fusion of genes involved in DNA replication:
• dnaQ (DNA polIII, epsilon subunit + proofreading 3’-5’ exonuclease)
• rnhA (degradation of Okazaki fragments)
(dnaQ) YPO1082 | YPO1081 (rnhA)
(dnaQ) STM0264 | STM0263 (rnhA)
(dnaQ) NMB1514 | NMB1618 (rnhA)
(dnaQ) PA1816 | PA1815 (rnhA)
(dnaQ) PSPTO3711 | PSPTO3712 (rnhA)
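The fusion/fission reading of the map (two genes of another genome each aligning with a different part of one Acinetobacter protein) can be expressed as a simple interval test. The sketch below is an assumption-laden illustration, not MaGe's implementation: `fusion_candidates`, its thresholds and the alignment coordinates used to call it are all hypothetical.

```python
from itertools import combinations

def fusion_candidates(query_len, hits, max_overlap=10, min_cover=0.8):
    """Return pairs of genes that together cover one query protein.

    query_len   : length (aa) of the query protein
    hits        : list of (gene, start, end) alignments, in query
                  coordinates, 1-based inclusive (assumed input format)
    max_overlap : residues the two aligned parts may share
    min_cover   : fraction of the query the two parts must cover together
    """
    ordered = sorted(hits, key=lambda h: h[1])
    pairs = []
    for (g1, s1, e1), (g2, s2, e2) in combinations(ordered, 2):
        overlap = max(0, min(e1, e2) - max(s1, s2) + 1)
        covered = (e1 - s1 + 1) + (e2 - s2 + 1) - overlap
        if overlap <= max_overlap and covered >= min_cover * query_len:
            pairs.append((g1, g2))
    return pairs
```

With made-up coordinates, `fusion_candidates(450, [("PA1816", 1, 240), ("PA1815", 245, 440)])` reports the pair, while two hits covering largely the same part of the query are rejected.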
Simplified structure of PkGDB
PRIAM: http://bioinfo.genopole-toulouse.prd.fr/priam/
Position-specific scoring matrices ('profiles') built with SwissProt proteins
KEGG: www.genome.jp/kegg/
Dynamic requests
Local installation
BioCyc: http://www.biocyc.org/
Setting up a new annotation project: an example
Newly sequenced genomes
• Bradyrhizobium sp. ORS278 (Genoscope) -> 1 chr (7.5 Mb)
• Bradyrhizobium sp. BTAi (DOE/JGI) -> 1 chr (8.5 Mb)
Genomes in public DataBanks
• Mesorhizobium loti (00)
• Sinorhizobium meliloti (01)
• Bradyrhizobium japonicum (02)
• Rhodopseudomonas palustris (03)
Available related sequences
• Rhizobium leguminosarum (Sanger Center)
• Rhodobacter sphaeroides (DOE/JGI)
• Rhodospirillum rubrum (DOE/JGI)
Complete pipeline of automatic annotations
Re-annotation process (pseudogenes, missing genes)
Automatic syntactic annotations (in some cases, functional annotations)
Searching for synteny groups with complete proteomes available in the RefSeq section (NCBI, 284 to date) and in PkGDB (curated genomes, 138 to date)
PkGDB
AcinetoScope, RhizoScope, YersiniaScope, ColiScope, CloacaScope, FrankiaScope
Pathway Tools
Metabolic pathway reconstruction
BradyBTCyc, BradyORCyc, BrajapCyc, RhizoCyc
Ocelot object model
BioWareHouse relational model
Comparative Metabolic Capabilities: an example
Reaction content comparisons between the 3 Bradyrhizobium organisms (BioWareHouse SQL query on reactions having gene -> protein -> reaction correspondences)
Organisms compared: Bradyrhizobium sp. ORS278, Bradyrhizobium sp. BTAi, Bradyrhizobium japonicum USDA 110
Values shown in the comparison figure: 14, 43, 127, 873, 897, 830, 16, 76, 30, 724

ORS278 | BTAi genes coding the same reaction | Pathway | Reaction
BRAOR5732 | BRABT1389, BRABT0754, BRABT0723, BRABT0755, BRABT0724 | protocatechuate degradation I | PROTOCATECHUATE-4,5-DIOXYGENASE-RXN
BRAOR5733 | BRABT1389, BRABT0754, BRABT0723, BRABT0755, BRABT0724 | protocatechuate degradation I | PROTOCATECHUATE-4,5-DIOXYGENASE-RXN
BRAOR5771 | BRABT1389, BRABT0754, BRABT0723, BRABT0755, BRABT0724 | protocatechuate degradation I | PROTOCATECHUATE-4,5-DIOXYGENASE-RXN
BRAOR5772 | BRABT1389, BRABT0754, BRABT0723, BRABT0755, BRABT0724 | protocatechuate degradation I | PROTOCATECHUATE-4,5-DIOXYGENASE-RXN
BRAOR5776 | BRABT0759 | protocatechuate degradation I | RXN-2463
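A reaction content comparison of this kind reduces to set operations once each organism's reactions have been extracted. The sketch below assumes the reaction IDs have already been pulled out of the database into Python sets; it does not reproduce the BioWareHouse SQL query, and the function name `venn_regions` is hypothetical.

```python
from collections import defaultdict

def venn_regions(reactions_by_org):
    """Split reactions by the exact set of organisms that have them.

    reactions_by_org : dict organism name -> set of reaction IDs
                       (assumed to come from gene -> protein -> reaction links)
    Returns a dict mapping a frozenset of organism names to the set of
    reactions found in exactly those organisms.
    """
    membership = defaultdict(set)
    for org, reactions in reactions_by_org.items():
        for rxn in reactions:
            membership[rxn].add(org)

    regions = defaultdict(set)
    for rxn, orgs in membership.items():
        regions[frozenset(orgs)].add(rxn)
    return dict(regions)
```

The size of each returned set corresponds to one region of a three-way comparison figure; reactions shared by only two of the three organisms, such as those listed in the table above, are the natural starting point for a closer look at the region.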
Bradyrhizobium ORS278 region containing CDS 5771 & 5772
BRAOR5771-5772-5773
“Cloning and Characterization of the Genes Encoding Enzymes for the Protocatechuate Meta-degradation Pathways of Pseudomonas ochraceae NGJ1” Maruyama et al. (2004) Biosci. Biotechnol. Biochem., 68, 1434-1441. PubMed ID: 15277747
AUTOmatic vs EXPert annotation of the region

CDS | Annotation | PRODUCT | EC-number | Gene | Evidence
BRAOR5770 | AUTO | 4-carboxy-2-hydroxymuconate-6-semialdehyde dehydrogenase | 1.1.1.18 | ligC | BLAST R. palus, PRIAM (medium)
BRAOR5770 | EXP | 4-carboxy-2-hydroxymuconate-6-semialdehyde dehydrogenase | 1.2.1.45 | ligC | BLAST P. testosteroni, Publication + Enzyme
BRAOR5771 | AUTO = EXP | Protocatechuate 4,5-dioxygenase, alpha subunit | 1.13.11.8 | ligB | BLAST R. palus, PRIAM (high)
BRAOR5772 | AUTO = EXP | Protocatechuate 4,5-dioxygenase, beta subunit | 1.13.11.8 | ligA | BLAST R. palus, PRIAM (high)
BRAOR5773 | AUTO | 2-pyrone-4,6-dicarboxylic acid hydrolase | none | ligI | BLAST R. palus
BRAOR5773 | EXP | 2-pyrone-4,6-dicarboxylic acid hydrolase | 3.1.1.57 | ligI | BLAST R. palus, Publication + Enzyme
BRAOR5774 | AUTO | Putative dehydrogenase | none | none | BLAST R. palus
BRAOR5774 | EXP | Putative dehydrogenase with NAD binding protein | 1.1.1.- | none | BLAST R. palus, InterproScan
BRAOR5775 | AUTO | Putative acyl transferase | none | fidZ | BLAST R. palus
BRAOR5775 | EXP | 4-hydroxy-4-methyl-2-oxoglutarate aldolase | 4.1.3.17 | ligK | BLAST P. ochraceae, Publication + Enzyme
BRAOR5776 | AUTO | 4-oxalomesaconate hydratase | none | ligJ | BLAST R. palus
BRAOR5776 | EXP | 4-oxalomesaconate hydratase | 4.2.1.83 | ligJ | BLAST R. palus, Publication + Enzyme
Bradyrhizobium ORS278 region after expert annotation
BRAOR5770: ligC (1.2.1.45)
BRAOR5771-72: ligBA (1.13.11.8)
BRAOR5773: ligI (3.1.1.57)
BRAOR5775: ligK (4.1.3.17)
BRAOR5776: ligJ (4.2.1.83)
BRAOR5777, BRAOR5778
Connectivity to KEGG database
Enzymes encoded by genes in the MaGe region
Enzymes encoded by genes elsewhere in the Bradyrhizobium genome
Additional enzymes in E. coli
4.2.1.83 ?
5771, 5772, 5773, 5775
Bradyrhizobium ORS278 region after expert annotation
BRAOR5770_ligC: 4-carboxy-2-hydroxymuconate-6-semialdehyde dehydrogenase (1.2.1.45)
BRAOR5776_ligJ: 4-oxalomesaconate hydratase (4.2.1.83)
The reactions catalyzed by 1.2.1.45 and 4.2.1.83 exist in MetaCyc but they are not involved in a pathway.
BRAOR5777: probable protocatechuate transporter
BRAOR5778 (ligR): probable transcriptional regulator of protocatechuate degradation
Enzymatic activity predictions (PRIAM): some results
Comparison of PRIAM predictions [P] and Expert annotations [E]

Genome | Acinetobacter ADP1 | Pseudoalteromonas haloplanktis | Frankia alni | Pseudomonas entomophila
Total genes | 3325 | 3514 | 6861 | 5182
Nb EC_[P] vs EC_[E] | 1012 / 947 | 927 / 993 | 1729 / 1498 | 1455 / 1232
EC_[P] = EC_[E] | 632 (62.5%) | 697 (75.2%) | 912 (52.8%) | 820 (56.3%)
EC_[P](3 digit) = EC_[E] | 47 (4.6%) | 23 (2.5%) | 68 (3.9%) | 46 (3.2%)
EC_[P] <> EC_[E] | 131 (12.9%) | 102 (11.0%) | 401 (23.2%) | 285 (19.6%)
EC_[P] & (NO EC_[E]) | 202 (20.0%) | 105 (11.3%) | 348 (20.1%) | 304 (20.9%)
EC_[E] & (NO EC_[P]) | 111 (11.7%) | 152 (15.3%) | 111 (7.4%) | 90 (7.3%)
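The comparison categories used in this table can be made explicit with a small classifier. This is a hedged sketch: the function name `compare_ec` is hypothetical, and reading "3 digit" as "the first three EC fields agree" is an assumption about how the slide defines that category.

```python
def compare_ec(predicted, expert):
    """Classify one gene by its PRIAM EC number [P] and expert EC number [E].

    predicted, expert : EC number as a string such as "1.13.11.8",
                        or None when no EC number is assigned.
    """
    if predicted and expert:
        if predicted == expert:
            return "EC_[P] = EC_[E]"
        # Assumed meaning of "3 digit": class, subclass and sub-subclass agree.
        if predicted.split(".")[:3] == expert.split(".")[:3]:
            return "EC_[P](3 digit) = EC_[E]"
        return "EC_[P] <> EC_[E]"
    if predicted:
        return "EC_[P] & (NO EC_[E])"
    if expert:
        return "EC_[E] & (NO EC_[P])"
    return "no EC number"
```

Applied to the protocatechuate region, the medium-confidence prediction 1.1.1.18 against the expert value 1.2.1.45 falls in the "<>" category, while 1.13.11.8 against 1.13.11.8 is an exact match. Counting the categories over all genes of a genome (for instance with `collections.Counter`) gives the rows of the table; in the figures above, the first four categories are percentages of the EC_[P] total and the last one is a percentage of the EC_[E] total.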
Limitations of PRIAM sequence-based enzyme prediction
• Availability of at least one UniProt/SwissProt sequence in the Enzyme entry!
• Existence of closely related enzymes with different substrate specificity
• Several wrong predictions in case of Medium/Low PRIAM confidence
• Relaxed substrate specificity exhibited by some enzymes
PGDBs built at Genoscope
To date: about 60 Tier 3 PGDBs
16 PGDBs common to SRI/EBI PGDBs Tier 3* (and 4 with Tier 2*):
• The number of enzymes and pathways is slightly greater in our PGDBs (source of annotations + process of PathoLogic file format generation)
• Important discrepancies with Sinorhizobium meliloti (44 predicted pathways in the SRI/EBI PGDB vs 259 in the Genoscope PGDB)
18 PGDBs: other published bacterial genomes
25 PGDBs for newly sequenced and annotated bacterial genomes
«Expansion of the BioCyc collection of pathway/genome databases to 160 genomes» Karp et al. Nucleic Acids Research, 2005, 33: 6083-6089.
*Tier 3: Computationally-Derived Databases Subject to No Curation
*Tier 2: Computationally-Derived Databases Subject to Moderate Curation
Automatic updates of PathoLogic predictions: every week
NO curation to date (Tier 3* Databases), except for Acinetobacter ADP1 -> Metabolic Thesaurus project
Our PGDBs are currently available in the MaGe interface
MaGe’s training courses include a quick overview of how to explore PathoLogic results to perform relevant expert annotation
HomePage: http://www.genoscope.cns.fr/agc/mage/
Some Questions / Perspectives
Better correspondences between BioCyc and MaGe
• Optional fields in the PathoLogic file format (PubMedID, Funcat, …)
How to tackle the pseudogene information?
Pathway X doesn’t exist because:
• No enzyme has been found
• Some enzymes correspond to pseudogenes
Curation of PGDB?
Remove false-positive pathways (Tier 3 -> Tier 2)
• Automatic reduction of false positive pathway predictions stored in the PGDBs: integration and evaluation of Pathway Hole Filler
• Finding a way to get a list of false positive pathways at the end of the manual process of annotation.
Tier 2 -> Tier 1*, especially creation of new metabolic pathways:
• PGDBs freely available for «adoption» by biologists
!!! Not an easy task !!! (a strong knowledge of metabolism is required)
*Tier 1: Intensively Curated Databases
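The pseudogene question raised above (a pathway may be absent because no enzyme was found, or because some of its enzymes are encoded only by pseudogenes) can be illustrated with a small check. This is a sketch under assumptions, not a Pathway Tools feature: the function `pathway_status`, its inputs and the status labels are hypothetical.

```python
def pathway_status(pathway_reactions, genes_by_reaction, pseudogenes):
    """Explain why a predicted pathway may be a false positive.

    pathway_reactions : list of reaction IDs making up the pathway
    genes_by_reaction : dict reaction ID -> list of genes assigned to it
    pseudogenes       : set of genes annotated as pseudogenes
    """
    no_enzyme = []
    pseudogene_only = []
    for rxn in pathway_reactions:
        genes = genes_by_reaction.get(rxn, [])
        if not genes:
            no_enzyme.append(rxn)
        elif all(g in pseudogenes for g in genes):
            pseudogene_only.append(rxn)

    if no_enzyme or pseudogene_only:
        status = "questionable"
    else:
        status = "supported"
    return {"status": status,
            "no_enzyme": no_enzyme,
            "pseudogene_only": pseudogene_only}
```

A report of this shape, produced at the end of manual annotation, would be one possible way to obtain the list of candidate false-positive pathways mentioned in the slide.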
Metabolic Thesaurus project at Genoscope
Annotation
3325 Acinetobacter ADP1 annotated genes
Knock-out collection
2240 ADP1 genes knocked out
Metabolism prediction (Vincent Schächter’s bioinformatics team)
Flux models
Model
Network reconstruction
Biological evidence
Accurate phenotyping
Systematic phenotyping
Transcriptome analyses
Biochemical studies
Functional complementation
Véronique de Berardinis’s team
Metabolic Pathway Reconstruction / Experimental Data
Metabolic Thesaurus: Acinetobacter ADP1 KO collection
ColiScope: sequencing of 2 commensal and 4 pathogenic E. coli strains
Phenotypic analysis: growth assay on different nutrient sources
+ Metabolome analysis: LC/MS and CE/MS
Data Integration and Comparative Analysis
Evolution of metabolic capabilities => adaptation of microorganisms: commensalism / virulence emergence
Link enzymatic activity to genes of unknown function
Participating teams
AGC team:
David Vallenet
Stéphane Cruveiller
Zoé Rouy
Aurélie Lajus
Genoscope informatic system team
Laurent Sainte-Marthe
Claude Scarpelli
Sylvain Bonneval
… and with the help of : François Lefèvre (V. Schächter team)
MaGe users’ feedback helps to improve many functionalities of our system!
Claudine Médigue