To provide structured controlled vocabularies for the representation of biological knowledge in biological databases.
Manifesto of Liberation Bioinformatics
• Be open source
• Use open standards
• Make data & code available without constraint
• Involve your community
Gene Ontology - 1998
FlyBase Drosophila Cambridge, EBI, Harvard, Berkeley & Bloomington.
SGD Saccharomyces Stanford.
MGI Mus Jackson Labs., Bar Harbor.
Gene Ontology - 2004
• Fruitfly - FlyBase
• Budding yeast - Saccharomyces Genome Database (SGD)
• Mouse - Mouse Genome Database (MGD & GXD)
• Rat - Rat Genome Database (RGD)
• Weed - The Arabidopsis Information Resource (TAIR)
• Worm - WormBase
• Dictyostelium discoideum - Dictybase
• InterPro/UniProt at EBI - InterPro
• Fission yeast - Pombase
• Human - UniProt, Ensembl, NCBI, Incyte, Celera, Compugen
• Parasites - Plasmodium, Trypanosoma, Leishmania - GeneDB - Sanger
• Microbes - Vibrio, Shewanella, B. anthracis, … - TIGR
• Grasses - rice & maize - Gramene database
• Zebrafish - ZFIN
• Coming: Xenopus, Chlamydomonas, Tetrahymena, Gallus & more.
GO: Three (Orthogonal) Ontologies
Biological Process
• Goal or objective within cell, tissue ..
Molecular Function
• Elemental activity or task
Cellular Component
• Location or complex
• molecular function 7,422 terms
• biological process 8,972 terms
• cellular component 1,472 terms
• all 17,866 terms
• definitions 16,600 (93%)
Content of GO
What is the least complex data structure that is sufficient?
• Key word list?
• Hierarchical tree?
• Directed acyclic graph?
• Other?
What data structure to use?
ISA (hypernymy/hyponymy)
• as in: an elephant is a mammal
PARTOF (meronymy/holonymy)
• as in: a trunk is part of an elephant
REGULATES
• as in: regulation of carbohydrate metabolism regulates carbohydrate metabolism
Classes of parent-child relationship
Cellular component
 %membrane
  %vacuolar membrane
  %nuclear membrane
 %intracellular
 %cell
  <cytoplasm
   <vacuole
    <vacuolar membrane
    <vacuolar lumen
  <nucleus
   <nuclear membrane
[Figure: cellular component terms (cell, intracellular, membrane, cytoplasm, nucleus, vacuole, vacuolar membrane, vacuolar lumen, nuclear membrane); legend: ISA (%), PARTOF (<)]
Structure of the Ontologies
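As an illustration of why a directed acyclic graph is needed rather than a tree, the cellular component example can be sketched as a list of typed edges. This is only a minimal sketch: the edge list is one reading of the slide's outline (the original indentation was lost, and "intracellular" is left out), and the names `EDGES`, `build_parents` and `ancestors` are hypothetical, not part of any GO software.

```python
from collections import defaultdict

# Assumed edges, read from the cellular component slide: (child, relation, parent).
# "%" on the slide is rendered here as is_a, "<" as part_of.
EDGES = [
    ("vacuolar membrane", "is_a", "membrane"),
    ("nuclear membrane", "is_a", "membrane"),
    ("cytoplasm", "part_of", "cell"),
    ("nucleus", "part_of", "cell"),
    ("vacuole", "part_of", "cytoplasm"),
    ("vacuolar membrane", "part_of", "vacuole"),
    ("vacuolar lumen", "part_of", "vacuole"),
    ("nuclear membrane", "part_of", "nucleus"),
]


def build_parents(edges):
    """Map each child term to its list of (relation, parent) pairs."""
    parents = defaultdict(list)
    for child, relation, parent in edges:
        parents[child].append((relation, parent))
    return parents


def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


PARENTS = build_parents(EDGES)
# "vacuolar membrane" has two parents, which a strict tree could not express.
print(PARENTS["vacuolar membrane"])
print(sorted(ancestors("vacuolar membrane", PARENTS)))
```

A term with more than one parent, such as "vacuolar membrane" here, is the case that rules out a simple hierarchical tree.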
term: chloroplast
go_id: GO:0009507
definition: A chlorophyll-containing plastid with thylakoids organized into grana and frets, or stroma thylakoids, and embedded in a stroma.
definition_reference: ISBN:0471245208

term: ketone catabolism
go_id: GO:0042182
definition: The breakdown into simpler components of ketones, a class of organic compounds that contain the carbonyl group, CO, and in which the carbonyl group is bonded only to carbon atoms. The general formula for a ketone is RCOR', where R and R' are alkyl or aryl groups.
definition_reference: GO:curators
GO terms are defined & have unique IDs
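A term record of this kind is easy to read into a key/value structure. The sketch below assumes the "key: value" layout shown on the slide, one field per line; this is the slide's display layout and not necessarily the format of the ontology files themselves, and `parse_term_record` is a hypothetical helper.

```python
def parse_term_record(text):
    """Split a 'key: value' record into a dict, cutting at the first colon only."""
    record = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(":")
        record[key.strip()] = value.strip()
    return record


# The chloroplast record from the slide, one field per line.
RECORD = """\
term: chloroplast
go_id: GO:0009507
definition: A chlorophyll-containing plastid with thylakoids organized into grana and frets, or stroma thylakoids, and embedded in a stroma.
definition_reference: ISBN:0471245208
"""

chloroplast = parse_term_record(RECORD)
print(chloroplast["go_id"])
```

Splitting at the first colon only matters because the identifiers themselves contain colons (GO:0009507, ISBN:0471245208).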
• Literature curation:
  • Inferred from Mutant Phenotype
  • Inferred from Direct Assay
  • Inferred from Genetic Interaction
  • Inferred from Physical Interaction
  • Inferred from Expression Pattern
  • Traceable Author Statement
  • Non-traceable Author Statement
• “Homologies”:
  • Inferred from Sequence Similarity
• Computed annotation:
  • Inferred from Electronic Annotation
Annotation of GO terms to gene products
GO Gene Association Tables
Herpes viruses
Vibrio cholerae, B. anthracis, Coxiella burnetii, Pseudomonas syringae, Shewanella oneidensis …
Dictyostelium discoideum
Saccharomyces cerevisiae, Schizosaccharomyces pombe
Trypanosoma brucei, Leishmania major, Plasmodium falciparum
Caenorhabditis elegans
Drosophila melanogaster, Glossina morsitans
Danio rerio
Mus “domesticus”, Rattus norvegicus, Homo sapiens bioinformaticus
Arabidopsis thaliana, Oryza sativa
FB FBgn0015567 &agr;-Adaptin GO:0005886 FB:FBrf0093110|PMID:9118220 IDA C
FB FBgn0015567 &agr;-Adaptin GO:0007269 FB:FBrf0108281|PMID:10218159 NAS P
FB FBgn0015567 &agr;-Adaptin GO:0016192 FB:FBrf0124164 NAS P
FB FBgn0015567 &agr;-Adaptin GO:0030122 FB:FBrf0115359 NAS C
FB FBgn0015567 &agr;-Adaptin GO:0030122 FB:FBrf0124164 NAS C
FB FBgn0015567 &agr;-Adaptin GO:0006901 FB:FBrf0108281|PMID:10218159 TAS P
FB FBgn0015567 &agr;-Adaptin GO:0008021 FB:FBrf0108281|PMID:10218159 TAS C
FB FBgn0015567 &agr;-Adaptin GO:0016181 FB:FBrf0141528|PMID:11697879 TAS P
FB FBgn0015567 &agr;-Adaptin GO:0016183 FB:FBrf0108281|PMID:10218159 TAS P
FB FBgn0015567 &agr;-Adaptin GO:0030135 FB:FBrf0108281|PMID:10218159 TAS C
FB FBgn0010215 &agr;-Cat GO:0003779 FB:FBrf0132100 ISS F
FB FBgn0010215 &agr;-Cat GO:0007016 FB:FBrf0129868|PMID:10908592 ISS P
FB FBgn0010215 &agr;-Cat GO:0008092 FB:FBrf0132100 ISS F
FB FBgn0010215 &agr;-Cat GO:0016342 FB:FBrf0129868|PMID:10908592 ISS C
FB FBgn0010215 &agr;-Cat GO:0016343 FB:FBrf0129868|PMID:10908592 ISS F
FB FBgn0010215 &agr;-Cat GO:0005912 FB:FBrf0151280|PMID:12147138 NAS C

SGD S0004660 AAC1 GO:0005743 SGD_REF:12031|PMID:2167309 TAS C
SGD S0004660 AAC1 GO:0006854 SGD_REF:12031|PMID:2167309 IDA P
SGD S0004660 AAC1 GO:0005471 SGD_REF:12031|PMID:2167309 IDA F
SGD S0000289 AAC3 GO:0005743 SGD_REF:13606|PMID:1915842 TAS C
SGD S0000289 AAC3 GO:0006854 SGD_REF:13606|PMID:1915842 IMP P
SGD S0000289 AAC3 GO:0005471 SGD_REF:13606|PMID:1915842 IMP F ADP/ATP translocator YBR085W|ANC3 gene taxid:4932 20010213 SGD
go/gene_associations/
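Each association row links one gene product to one GO term, with the supporting references, an evidence code and the ontology it belongs to. The sketch below parses the simplified, whitespace-separated, seven-field layout shown on the slide. That layout is an assumption made for illustration: the distributed gene association files are tab-delimited and carry more columns. The names `Association` and `parse_association` are hypothetical.

```python
from collections import namedtuple

# Field names chosen for this sketch; they follow the seven columns on the slide.
Association = namedtuple(
    "Association", "db object_id symbol go_id references evidence aspect"
)

# Evidence codes, matching the categories listed for GO annotation.
EVIDENCE_CODES = {
    "IMP": "Inferred from Mutant Phenotype",
    "IDA": "Inferred from Direct Assay",
    "IGI": "Inferred from Genetic Interaction",
    "IPI": "Inferred from Physical Interaction",
    "IEP": "Inferred from Expression Pattern",
    "TAS": "Traceable Author Statement",
    "NAS": "Non-traceable Author Statement",
    "ISS": "Inferred from Sequence Similarity",
    "IEA": "Inferred from Electronic Annotation",
}

# One-letter aspect codes for the three ontologies.
ASPECTS = {
    "F": "molecular function",
    "P": "biological process",
    "C": "cellular component",
}


def parse_association(line):
    """Parse one simplified association row into an Association tuple."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
    db, object_id, symbol, go_id, refs, evidence, aspect = fields
    return Association(db, object_id, symbol, go_id, refs.split("|"), evidence, aspect)


# Sample rows copied from the tables above.
ROWS = [
    "SGD S0004660 AAC1 GO:0005743 SGD_REF:12031|PMID:2167309 TAS C",
    "SGD S0004660 AAC1 GO:0006854 SGD_REF:12031|PMID:2167309 IDA P",
    "FB FBgn0010215 &agr;-Cat GO:0003779 FB:FBrf0132100 ISS F",
]

for row in ROWS:
    a = parse_association(row)
    print(a.symbol, a.go_id, EVIDENCE_CODES[a.evidence], "/", ASPECTS[a.aspect])
```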
Expression studies:
• Human ontogenic tumor gene expression
• Human breast cancer gene expression
• Human endothelial cell gene expression
• Human fibrosarcoma cell cDNAs
• Human osteoblast progenitor cell gene expression
• Human fibrosarcoma cell gene expression
• Mouse cDNAs - FANTOM/FANTOM2 Projects
• Mouse lung gene expression
• Mouse dendritic cell gene expression
• Mouse hepatic and hippocampal gene expression
• Mouse liver tumor gene expression
• Drosophila gene expression during aging
• Drosophila embryo gene expression
• Affymetrix Probe Sets
Protein annotation:
• Vertebrate nuclear proteins
• Human GPCR proteins
• Mouse proteome
• PANTHER protein families
EST collections:
• Cattle ESTs, Pig ESTs, Dog ESTs
• Paracoccidioides brasiliensis ESTs
• Plasmodium falciparum ESTs
• Honey bee ESTs
• Schizophyllum commune ESTs
• Meloidogyne incognita ESTs
• Plasmodium vivax ESTs
• Amblyomma variegatum ESTs
Genomic annotation:
• Drosophila melanogaster genome
• Caenorhabditis briggsae genome
• Anopheles gambiae genome
• Schizosaccharomyces pombe genome
• Plasmodium yoelii genome
• Plasmodium falciparum genome
• Dictyostelium genome
• Rice genome
• Plant alternatively spliced genes
• Human pseudogenes
http://www.geneontology.org/GO.biblio.html
Combinatoric explosion
Process; Body part; Regulation (Negative or Positive)
2 * 1 * (# of processes - 1)
Induction
2 * 2 * (# of processes - 2)
2 * 2 * (# of processes - 2) * (# of body parts)
The OBOL System
Approach: annotation-time term composition vs tools for maintenance of large directed acyclic graphs
• Requires new generalization hierarchies
• Term decomposition using grammars
• Generating computable logical definitions
• Using logical definitions – term creation and error checking
A Formal Grammar for OBO terms
All GO terms are NOUN-PHRASES
A NOUN-PHRASE is (recursively) made from
• a NOUN (includes inflected verbs; e.g. binding)
• an ADJECTIVE followed by a NOUN-PHRASE
• a NOUN-PHRASE preceded by a NOUN-PHRASE acting as ADJECTIVE; e.g. clathrin coat
• a NOUN-PHRASE then PREPOSITION then NOUN-PHRASE; e.g. regulation of transcription
• an (optional) NOUN-PHRASE then a RELATIONAL ADJECTIVE then a NOUN-PHRASE; e.g. clathrin-coated vesicle
Precedence rules are also required to prune the parse forest
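The grammar rules above can be tried out on a few term names with a very small parser. This is a toy sketch, not the OBOL implementation: the lexicon is a hand-made assumption, the RELATIONAL ADJECTIVE rule is omitted, and instead of building a parse forest and pruning it, one precedence order is hard-coded (prepositions bind loosest, then adjectives, then noun compounds). All names (`LEXICON`, `tag`, `parse`) are hypothetical.

```python
# Tiny hand-made lexicon, assumed for illustration only.
LEXICON = {
    "regulation": "NOUN",
    "transcription": "NOUN",
    "binding": "NOUN",  # inflected verb treated as a noun, as on the slide
    "clathrin": "NOUN",
    "coat": "NOUN",
    "negative": "ADJ",
    "of": "PREP",
}


def tag(token):
    """Look up the word class of a token, rejecting unknown words."""
    try:
        return LEXICON[token]
    except KeyError:
        raise ValueError(f"unknown word: {token!r}") from None


def parse(tokens):
    """Parse a token list into a nested tuple following the NOUN-PHRASE rules."""
    if not tokens:
        raise ValueError("empty noun phrase")
    tags = [tag(t) for t in tokens]
    if "PREP" in tags:
        # NOUN-PHRASE then PREPOSITION then NOUN-PHRASE (split at first preposition)
        i = tags.index("PREP")
        return ("PREP-NP", parse(tokens[:i]), tokens[i], parse(tokens[i + 1:]))
    if tags[0] == "ADJ":
        # ADJECTIVE followed by a NOUN-PHRASE
        return ("ADJ-NP", tokens[0], parse(tokens[1:]))
    if len(tokens) == 1:
        return ("NOUN", tokens[0])
    # NOUN-PHRASE acting as adjective, followed by the head noun
    return ("NN-NP", parse(tokens[:-1]), parse(tokens[-1:]))


for name in ["regulation of transcription", "clathrin coat",
             "negative regulation of transcription"]:
    print(name, "->", parse(name.split()))
```

With the precedence fixed in this way, "negative regulation of transcription" is read as "(negative regulation) of transcription"; a real system would need the precedence rules mentioned on the slide to choose among the alternative parses.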
Gene Ontology Software
• Browsers - AmiGO
• Database - MySQL
• Editor - DAG-Edit
geneontology.sourceforge.net
•Third party software (e.g. Spotfire; TreeMap; GoFish; FatiGO)
OBO-Edit - a powerful editor for directed acyclic graphs.
• data adaptors
• multiple edits on same graph
• define your own relationship types
• plug-in architecture - e.g. add an external in-line dictionary