Automated Prokaryotic Annotation at the JCVI

Daniel Haft, 2008


DESCRIPTION

Conference: Annual BRC Meeting (BRC6), Oct 28-29, 2008 in Ft. Lauderdale, Florida. Presenter: Dan Haft


Page 1: Automated Prokaryotic Annotation at JCVI

Automated Prokaryotic Annotation at the JCVI

Daniel Haft, 2008

Page 2: Automated Prokaryotic Annotation at JCVI

A Dual-Use Pipeline

• Multiple types of stored evidence
• Persistent & Flexibly Interleaved
• Supports selective re-annotation
• Features annotation-driving databases
  - CHAR
  - TIGRFAMs
  - Genome Properties
  - BrainGrab Rules
• Evidence used by Machine and by Experts
• MANATEE interface for annotators
• Capture new rules with BrainGrab

Page 3: Automated Prokaryotic Annotation at JCVI

Computable objects: Output from one program becomes input to another.

HMM results drive Genome Properties

Genome Properties guide GO process assignments

GO process terms
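The chaining of computable objects on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not the JCVI pipeline's code: the model accessions, the property name, its steps, and the GO term are hypothetical placeholders, and a property is treated as supported only when every step has an HMM hit.

```python
# Hypothetical sketch: HMM hits -> Genome Property -> GO process terms.
# All accessions, property names, and GO terms are illustrative placeholders.

# Each Genome Property lists the HMM model that provides evidence for each step.
PROPERTY_STEPS = {
    "example_pathway": {"step_1": "TIGR_A", "step_2": "TIGR_B"},
}

# GO process term suggested for proteins that fill a step of a supported property.
PROPERTY_GO_PROCESS = {
    "example_pathway": "GO:example_process",
}


def supported_properties(hmm_hits):
    """Return properties whose every step is backed by at least one HMM hit.

    hmm_hits maps a protein id to the set of HMM models it hit.
    """
    models_seen = set().union(*hmm_hits.values())
    return [
        name
        for name, steps in PROPERTY_STEPS.items()
        if all(model in models_seen for model in steps.values())
    ]


def go_process_assignments(hmm_hits):
    """Assign a GO process term to each protein filling a step of a supported property."""
    assignments = {}
    for prop in supported_properties(hmm_hits):
        step_models = set(PROPERTY_STEPS[prop].values())
        for protein, models in hmm_hits.items():
            if models & step_models:
                assignments.setdefault(protein, []).append(PROPERTY_GO_PROCESS[prop])
    return assignments


hits = {"prot1": {"TIGR_A"}, "prot2": {"TIGR_B"}, "prot3": {"TIGR_C"}}
print(go_process_assignments(hits))
```

Each function consumes the previous one's output, which is the sense in which the evidence is "computable": the HMM result table is enough to derive the property state, and the property state is enough to propose process-level terms.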

Page 4: Automated Prokaryotic Annotation at JCVI

Identification of Genome Features

Genome Sequence

IMM built: Glimmer builds a statistical model from the training set

ORFs

Other Genome Features:

• rRNA, tRNA, Rfam
• IS elements · Phage regions · Repeats

Page 5: Automated Prokaryotic Annotation at JCVI

Gene Finding: Glimmer & friends, homology methods

Homology Searches (gathering evidence): BLAST-Extend-Repraze, Hidden Markov Models, misc.

Structural Curation (ORF Management): Auto_Gene_Curate (start sites, overlaps), InterEvidence

Functional Assignments: Auto_Annotate, Manual, Mapped

Data Availability

Page 6: Automated Prokaryotic Annotation at JCVI

Homology Searches

• HMM searches: TIGRFAMs & Pfam
• BLAST searches: against internal NIAA
• PROSITE motifs
• InterPro
• TmHMM
• SignalP
• Lipoprotein
• Psort
• Generate Paralogous Families
• Custom database searches (TransportDB, Rules)

Page 7: Automated Prokaryotic Annotation at JCVI

Gene Model Curation

• Overlaps resolved by evidence competition
• Start site curation
• Missed genes / unsupported gene calls
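One way to picture "evidence competition" is a greedy pass that keeps the best-supported gene call wherever two calls overlap. The slide does not say how the pipeline scores or compares evidence, so the single numeric `evidence_score`, the `max_overlap` tolerance, and the function names below are all assumptions made for illustration.

```python
# Hypothetical sketch of overlap resolution by evidence competition.
# Gene calls are (gene_id, start, end, evidence_score) with half-open coordinates.


def overlap_bp(a, b):
    """Overlap in bp between two (start, end) half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def resolve_overlaps(genes, max_overlap=0):
    """Keep gene calls greedily, best evidence first.

    A call is dropped when it overlaps an already kept call by more than
    max_overlap bp. Returns the kept calls ordered by start coordinate.
    """
    kept = []
    for gene in sorted(genes, key=lambda g: g[3], reverse=True):
        span = (gene[1], gene[2])
        if all(overlap_bp(span, (k[1], k[2])) <= max_overlap for k in kept):
            kept.append(gene)
    return sorted(kept, key=lambda g: g[1])


calls = [
    ("g1", 0, 300, 5.0),
    ("g2", 200, 600, 9.0),   # overlaps g1 by 100 bp and has stronger evidence
    ("g3", 650, 900, 1.0),
]
print([g[0] for g in resolve_overlaps(calls)])
print([g[0] for g in resolve_overlaps(calls, max_overlap=100)])
```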

Page 8: Automated Prokaryotic Annotation at JCVI

Evidence can Overhang the Gene

Blast-Extend-Repraze (BER)

The extensions help in the detection of frameshifts (FS) and point mutations resulting in in-frame stop codons (PM). This is indicated when similarity extends outside the coordinates of the protein-coding sequence. Blue line indicates predicted protein-coding sequence, green line indicates up- and downstream extensions. Red line is the match protein.

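[Figure: ORFxxxxx, running from end5 to end3, with a 300 bp extension on each side, shown as the search protein aligned against match proteins. Cases illustrated: a normal full-length match; similarity that extends in the same frame through a stop codon; similarity that extends upstream through a start, or downstream past a frameshift.]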
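The coordinate test behind "evidence overhanging the gene" can be sketched as follows. This covers only the bookkeeping, assuming an alignment has already been computed against the ORF plus its 300 bp flanks; it does not reproduce BER's alignment step. The function names and the `min_overhang` threshold are hypothetical.

```python
# Hypothetical sketch of the overhang test: does an alignment to a match
# protein extend beyond the predicted coding sequence into the flanking
# extensions? Coordinates are 0-based, half-open, on the extended sequence.

EXTENSION = 300  # bp added on each side of the ORF, as on the slide


def overhang(orf_length, aln_start, aln_end, extension=EXTENSION):
    """Return (upstream_bp, downstream_bp) of aligned sequence outside the ORF."""
    orf_start = extension
    orf_end = extension + orf_length
    upstream = max(0, min(aln_end, orf_start) - aln_start)
    downstream = max(0, aln_end - max(aln_start, orf_end))
    return upstream, downstream


def flag_for_review(orf_length, aln_start, aln_end, min_overhang=30):
    """Flag the ORF when similarity overhangs either end by at least min_overhang bp.

    min_overhang is an arbitrary illustrative threshold, not a pipeline setting.
    """
    up, down = overhang(orf_length, aln_start, aln_end)
    return up >= min_overhang or down >= min_overhang


# A 900 bp ORF occupies positions 300..1200 of the 1500 bp extended sequence.
print(overhang(900, 300, 1200))   # alignment confined to the ORF
print(overhang(900, 300, 1350))   # alignment runs past the predicted stop
print(flag_for_review(900, 300, 1350))
```

A flagged ORF is a candidate for the situations the slide names: a frameshift, a point mutation creating an in-frame stop, or a start site called too far downstream.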

Page 9: Automated Prokaryotic Annotation at JCVI

Pfam vs. TIGRFAMs

TIGRFAMs: RULES

• Functional assignments to proteins
• Granularity tuned for single-hit equivalogs (mono-functional !)
• Generates computable objects --> pathway reconstructions

Pfam: Explanations

• Names for homology domains in proteins
• Granularity tuned for twilight-level sequence similarity detection
• Explains things to annotator

Page 10: Automated Prokaryotic Annotation at JCVI

TIGRFAMs equivalogs vs. Pfam domains

[Figure: family members labeled X, X, X, Y, Z, bracketed by a TIGRxxxxx model and a PFxxxxx model.]

Page 11: Automated Prokaryotic Annotation at JCVI

TIGRFAMs as annotation rules

• EC number: computable !
• GO term: computable !
• protein name: computable ?
• HMM hit: computable !!
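Treating a model as an annotation rule amounts to attaching the name, EC number, and GO terms to the model and transferring them whenever a hit clears the model's cutoff. The sketch below assumes a single score cutoff per model; the field names, the `HmmRule` and `Annotation` types, and the example values are hypothetical placeholders rather than the TIGRFAMs data format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class HmmRule:
    """One equivalog-level model treated as an annotation rule (hypothetical fields)."""
    accession: str
    trusted_cutoff: float          # hits scoring at or above this are accepted
    protein_name: str
    ec_number: Optional[str]
    go_terms: Tuple[str, ...]


@dataclass(frozen=True)
class Annotation:
    protein_name: str
    ec_number: Optional[str]
    go_terms: Tuple[str, ...]
    evidence: str


def apply_rule(rule: HmmRule, hit_score: float) -> Optional[Annotation]:
    """Transfer the rule's annotation when the HMM hit clears the cutoff."""
    if hit_score < rule.trusted_cutoff:
        return None
    return Annotation(
        protein_name=rule.protein_name,
        ec_number=rule.ec_number,
        go_terms=rule.go_terms,
        evidence=f"HMM:{rule.accession}",
    )


# Placeholder rule; none of these values come from a real model.
rule = HmmRule("TIGRxxxxx", 100.0, "example protein", None, ("GO:placeholder",))
print(apply_rule(rule, 150.0))
print(apply_rule(rule, 40.0))
```

The HMM hit is the only input, which matches the slide's emphasis: the hit itself is fully computable, while the protein name it carries is the least standardized of the transferred fields.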

Page 12: Automated Prokaryotic Annotation at JCVI

Isology (homology) types: ranking our rules

• EXCEPTION: additional info, e.g. “vegetative”
• EQUIVALOG: the SAME (in enough ways) to receive the same name across multiple genomes, reflecting one specific function.
• SUBFAMILY: can name a whole class
• DOMAIN: class name for a protein region

(and apply these classifications also to Pfam)
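If the types are read as a ranking, choosing which hit should name a protein becomes a sort. The sketch below assumes the slide lists the types from most to least specific and that ties are broken by score; both are assumptions, and the model identifiers are placeholders.

```python
# Assumed ranking, most specific first, following the order on the slide.
ISOLOGY_RANK = {"EXCEPTION": 0, "EQUIVALOG": 1, "SUBFAMILY": 2, "DOMAIN": 3}


def best_hit(hits):
    """Pick the hit with the most specific isology type; ties go to the higher score.

    hits is a list of (model_id, isology_type, score) tuples.
    Returns None when there are no hits.
    """
    if not hits:
        return None
    return min(hits, key=lambda h: (ISOLOGY_RANK[h[1]], -h[2]))


hits = [
    ("PFxxxxx", "DOMAIN", 250.0),
    ("TIGRxxxxx", "EQUIVALOG", 180.0),
]
print(best_hit(hits))
```

Here the equivalog wins despite its lower score, because a more specific classification outranks a stronger match to a broader one.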

Page 13: Automated Prokaryotic Annotation at JCVI

CHAR: Experimentally Characterized Protein Database

• Highly curated database of experimentally characterized proteins; connects protein accessions, known function, and the scientific literature.

• What does it include:
  – Controlled vocabulary describes the type of experimentation performed in each publication
  – Key annotation fields (protein name, gene symbol, Enzyme Commission (EC) number, taxonomic data, Gene Ontology (GO) terms) are extracted
  – Synonymous protein accessions obtained from public databases (GenBank and UniProt) are stored

Page 14: Automated Prokaryotic Annotation at JCVI

Annotation Proceeds from …

Inside --> out (e.g. AutoAnnotate): for every protein

• Collect evidence
• Best-guess annotation

Outside --> in (e.g. TIGRFAMs): for every model

• Search tool + cutoff + standards = annotation rule
• Achieves partial coverage

Hybrid (BrainGrab): for every unfinished protein

• Look for means to annotate: blastp, synteny, hole-filling, etc.
• Capture annotator logic as a new rule
• Add to library of rules/models for all future genomes

Page 15: Automated Prokaryotic Annotation at JCVI

[Figure: diagram of a Subject Genome. Labels: Trusted, Complete, Automatic; Proper Realm of Annotator Attention; RULES; BrainGrab; NEW; genome; share; validate; IMPORT.]

Page 16: Automated Prokaryotic Annotation at JCVI

A Teachable Moment

EcHS_A1984 is manually annotated confidently because it is similar enough to:

SP|P07363|CHEA_ECOLI Chemotaxis protein cheA EC 2.7.13.3

(method: defines “similar enough”)

BLASTP_MATCH [SP|P07363, 1600, 95, 92, 60, 1]

Must be the only protein in the genome that scores >= 1600 by blastp, covering >= 95% of the length of the characterized protein and >= 92% of the target protein, with >= 60% sequence identity.
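The BLASTP_MATCH rule on this slide is compact enough to parse and evaluate directly. The sketch below is a guess at how such a rule could be represented, not BrainGrab's implementation: the class and field names are hypothetical, the sixth field is read as the maximum number of qualifying proteins per genome (matching "must be the only protein"), and the example hits are made-up numbers.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class BlastpMatchRule:
    accession: str              # characterized protein the rule points to
    min_score: float            # blastp score threshold
    min_char_coverage: float    # % of the characterized protein covered
    min_target_coverage: float  # % of the target protein covered
    min_identity: float         # % sequence identity
    max_hits: int               # assumed meaning: qualifying proteins allowed per genome


@dataclass(frozen=True)
class BlastpHit:
    protein_id: str
    score: float
    char_coverage: float
    target_coverage: float
    identity: float


def parse_rule(text):
    """Parse 'BLASTP_MATCH [acc, score, char_cov, target_cov, identity, max_hits]'."""
    m = re.fullmatch(r"BLASTP_MATCH\s*\[(.*)\]", text.strip())
    if m is None:
        raise ValueError(f"not a BLASTP_MATCH rule: {text!r}")
    fields = [f.strip() for f in m.group(1).split(",")]
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    acc, score, char_cov, tgt_cov, ident, max_hits = fields
    return BlastpMatchRule(acc, float(score), float(char_cov),
                           float(tgt_cov), float(ident), int(max_hits))


def apply_rule(rule, hits):
    """Return ids of proteins the rule annotates; empty if none or too many qualify."""
    passing = [
        h.protein_id
        for h in hits
        if h.score >= rule.min_score
        and h.char_coverage >= rule.min_char_coverage
        and h.target_coverage >= rule.min_target_coverage
        and h.identity >= rule.min_identity
    ]
    return passing if 0 < len(passing) <= rule.max_hits else []


rule = parse_rule("BLASTP_MATCH [SP|P07363, 1600, 95, 92, 60, 1]")
# Illustrative hits with invented values, not real search results.
genome_hits = [
    BlastpHit("protein_A", 1650.0, 98.0, 97.0, 85.0),
    BlastpHit("protein_B", 300.0, 40.0, 35.0, 30.0),
]
print(apply_rule(rule, genome_hits))
```

Storing the rule as data rather than code is what lets a captured annotator decision be reapplied to every later genome.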

Page 17: Automated Prokaryotic Annotation at JCVI

a sample of expert opinion: “For This Particular Protein Family”

I (D.H.H.) assert that any > 75 %-identical, full-length match is the same protein.

Ditto any > 65 % match, as long as the region is clearly syntenic.

Ditto any single-copy > 50 % match, as long as it fills this hole in this otherwise mostly complete pathway.
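The three tiers of expert opinion translate directly into a small decision function. The thresholds are taken from the slide; the function name, the boolean inputs, and the tier labels are hypothetical, and the sketch reads the slide literally in requiring a full-length match only for the first tier.

```python
def matching_tier(identity, full_length=False, syntenic=False,
                  single_copy=False, fills_pathway_hole=False):
    """Return which tier of the expert rule accepts the match, or None.

    identity is percent sequence identity. The context flags are assumed to
    be computed elsewhere (alignment coverage, gene neighborhood, copy
    number, pathway state).
    """
    if identity > 75 and full_length:
        return "identity"
    if identity > 65 and syntenic:
        return "synteny"
    if identity > 50 and single_copy and fills_pathway_hole:
        return "hole-filling"
    return None


print(matching_tier(80, full_length=True))
print(matching_tier(70, syntenic=True))
print(matching_tier(55, single_copy=True, fills_pathway_hole=True))
print(matching_tier(55, single_copy=True))
```

Lower identity is accepted only when independent context supports the call, which is why each tier pairs a looser threshold with a stricter side condition.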

Page 18: Automated Prokaryotic Annotation at JCVI

B “Bag of Genes”

G Genome Properties

E Evidence to drive other programs

Image from Gödel, Escher, Bach: an Eternal Golden Braid by Douglas Hofstadter, 1979

Page 19: Automated Prokaryotic Annotation at JCVI

Genome Properties: annotation at the level of systems

• pathway (glyoxylate shunt)
• system (type III secretion)
• structure (outer membrane)
• genometrics (GC content)
• phenotype (motility, pathogenesis)

Property states: YES · some evidence · not supported · NO
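The four states shown for a Genome Property can be modeled as an enumeration computed from which required steps have evidence. The slide does not say how the states are decided, so the logic below is an assumption: YES when every required step is found, "some evidence" when only part is found, "not supported" when nothing is found, and NO only when the caller explicitly rules the property out.

```python
from enum import Enum


class PropertyState(Enum):
    YES = "YES"
    SOME_EVIDENCE = "some evidence"
    NOT_SUPPORTED = "not supported"
    NO = "NO"


def evaluate_property(required_steps, steps_with_evidence, ruled_out=False):
    """Assign a state to one Genome Property for one genome.

    ruled_out is a hypothetical flag for cases where absence is asserted by
    other means; this sketch never infers NO from missing hits alone.
    """
    if ruled_out:
        return PropertyState.NO
    required = set(required_steps)
    found = required & set(steps_with_evidence)
    if required and found == required:
        return PropertyState.YES
    if found:
        return PropertyState.SOME_EVIDENCE
    return PropertyState.NOT_SUPPORTED


steps = ["step_1", "step_2", "step_3"]
print(evaluate_property(steps, ["step_1", "step_2", "step_3"]).value)
print(evaluate_property(steps, ["step_2"]).value)
print(evaluate_property(steps, []).value)
```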

Page 20: Automated Prokaryotic Annotation at JCVI

Some Novel Genome Properties

12 subtypes of CRISPR/Cas system

PEP-CTERM / exo-sortase: Biofilm-associated protein sorting

Type VI secretion (53 loci in B. mallei 23344)

Post-translational selenium-modified enzymes

Heterocycle-containing bacterial toxin production: BA_2677 = “heterocyclo-anthracin”

Page 21: Automated Prokaryotic Annotation at JCVI

A family of variable putative toxins with patterns of CGG insertions.

Page 22: Automated Prokaryotic Annotation at JCVI

Future Annotation Pipeline Enhancements

• Populate the Characterized Protein Database
• Develop META-RULES from CHAR
• BrainGrab for novel content
• Import additional computable evidence
• Improve exchanges of validation sets
• Build a protein names ontology

Page 23: Automated Prokaryotic Annotation at JCVI

Acknowledgements

Ramana Madupu

Jeremy Selengut

Alex Richter

JCVI microbial annotation team