
GO Annotation Fall 2005

Introduction: Proteins & Function

Guest Lecturer: Bindu Nanduri

Post genomic world

The Genome Covers

2001 IHGSC and Celera corporation

From Sequence to Function

•The genomic sequence identifies the 'parts'; the next trick is understanding gene function

•Post genomic era = functional genomics

•Critical concept: genes of similar sequence may have similar functions

•Inferring function for a new gene begins with searching for its nearest neighbor (or homolog) of known function

Evolution Allows us to Infer Function

• The most powerful method for inferring function of a gene or protein is by similarity searching a sequence database.

• Our ability to characterize biological properties of a protein using sequence data alone stems from properties conserved through evolutionary time.

• Homologous (evolutionarily related) proteins always share a common 3-dimensional folding structure.

• They often contain common active sites or binding domains.

• They frequently share common functions.

Orthologs

• Homologs = genes that are evolutionarily related

• There are two kinds of homologs:

• Orthologs = genes in different species that have diverged from a common gene in an ancestral species.

• Paralogs = genes that have diverged due to gene duplication.

• Orthologs are more likely than paralogs to have conserved function.

• Orthologs cannot be identified using BLAST or FASTA sequence comparison alone.

• Reliable ortholog identification requires phylogenetic methods.

Example Gene Tree (with plant genes)

[Figure: gene tree with leaves Rice-2b, Rice-2a, Maize-2, Wheat-2, Sorghum-2, Barley-1, Wheat-1, Maize-1, Sorghum-1 and Arabidopsis; groups of orthologs and paralogs are marked.]

The outgroup, Arabidopsis, is a dicot. The cereals are monocots. Dicots and monocots diverged ~230 million years ago. These monocots diverged from each other ~60 mya.

Why shouldn’t we depend on inferences based on paralogs?

• Paralogs emerge after a gene duplication.

• Possible fates of duplicated genes:

– Loss of function for one of the duplicates - lack of selective pressure allows gene to mutate beyond recognition

– Emergence of new functional paralogs - one duplicate acquires a new function, so selection favors its maintenance in the genome

– Sub-functionalization - both duplicates are required to maintain the function of the original

“Central Dogma” (Francis Crick):

1 gene gives 1 mRNA gives 1 protein (predicted hundreds of thousands of genes in humans)

Information flow

DNA --> Proteins --> Function

Finding Genes in Genomic DNA

• Translate (in all 6 reading frames) and look for similarity to known protein sequences

• Look for long Open Reading Frames (ORFs) between start and stop codons (start=ATG, stop=TAA, TAG, TGA)

• Look for known gene markers

– TATA box, intron splice sites, etc.

• Statistical methods (codon preference)
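The ORF search described above can be sketched in a few lines of Python. This is a toy illustration only: the function names and the min_codons cutoff are assumptions made for this example, and real gene finders do far more than this.

```python
# Toy ORF scan over all six reading frames.
# Names (find_orfs, min_codons) are illustrative, not from any gene-finding tool.
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def orfs_in_frame(dna, frame, min_codons):
    """Yield (start, end) of ATG...stop stretches in one forward frame."""
    start = None
    for i in range(frame, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if start is None and codon == START_CODON:
            start = i
        elif start is not None and codon in STOP_CODONS:
            # Count the codons before the stop codon.
            if (i - start) // 3 >= min_codons:
                yield start, i + 3
            start = None


def find_orfs(dna, min_codons=2):
    """Return (strand, frame, orf_sequence) for ORFs in all six frames."""
    dna = dna.upper()
    results = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            for start, end in orfs_in_frame(seq, frame, min_codons):
                results.append((strand, frame, seq[start:end]))
    return results


print(find_orfs("ccatgaaattttaggg"))
```

In practice min_codons would be set much higher, since short ORFs occur by chance in any sequence.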

Intron/exon boundaries

• Gene finding programs work well in bacteria

• None of the gene prediction programs do an adequate job predicting intron/exon boundaries

• The only reasonable gene models are based on alignment of cDNAs to genome sequence

• Some human genes still do not have a correct coding sequence defined (transcription start, intron splice sites)

Proteins As Modules

• Proteins are derived from a limited number of basic building blocks (Domains)

• Evolution has shuffled these modules giving rise to a diverse repertoire of protein sequences

• As a result, proteins can share a global or local relationship

(Higgins, 2000)

Protein Domains

Motifs describe the domain

Janus Kinase 2 Modular Sequence Architecture

SH2 Motif

Protein Families

• Protein Family - a group of proteins that share a common function and/or structure, that are potentially derived from a common ancestor (set of homologous proteins)

• Characterizing a Family - Compare the sequence and structure patterns of the family members to reveal shared characteristics that potentially describe common biological properties

• Motif/Domain - sequence and/or structure patterns common to protein family members (a trait)

Protein Families

[Figure: Families A, B, C, D and E. Separate families can be interrelated.]

Creating Protein Families

• Use domains to identify family members

– Use a sequence to search a database and characterize a pattern/profile

– Use a specific pattern/profile to identify homologous sequences (family members)

Find Domains Find Family Members

Pattern/Profile Searches

BLAST and Alignments

Motifs are built from Multiple Alignments

Protein Motif Databases

• Known protein motifs have been collected in databases

• One such database is PROSITE

Each domain is defined by a simple pattern

– Patterns can have alternate amino acids in each position and defined spaces, but no gaps

– Pattern searching is by exact matching, so any new variant will not be found (can allow mismatches, but this weakens the algorithm)
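To make the exact-matching point concrete, here is a minimal sketch that turns a PROSITE-style pattern into a Python regular expression. The converter is an illustration written for these notes, not PROSITE's own software. It handles only the common elements (x, [..] sets, {..} exclusions, (n) or (n,m) repeats, < and > anchors). The N-glycosylation pattern N-{P}-[ST]-{P} is used as the example.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern, e.g. 'N-{P}-[ST]-{P}', to a regex string."""
    pattern = pattern.rstrip(".")
    parts = []
    for element in pattern.split("-"):
        # Optional anchors: '<' = N-terminus, '>' = C-terminus.
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Optional repeat count such as x(2) or x(2,4).
        repeat = ""
        m = re.fullmatch(r"(.+?)\((\d+(?:,\d+)?)\)", element)
        if m:
            element, repeat = m.group(1), "{" + m.group(2) + "}"
        if element == "x":
            core = "."                          # any amino acid
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"   # any amino acid except these
        else:
            core = element                      # single residue or [ABC] set
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}")
print(regex)
print(bool(re.search(regex, "AANGSAA")))  # site present
print(bool(re.search(regex, "AANPSAA")))  # proline variant is not found
```

The second sequence differs by one residue and is missed entirely, which is the weakness of pattern searching that profiles address.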

Profiles

• Profiles are tables of amino acid frequencies at each position in a motif

• They are built from multiple alignments

• PROSITE entries also contain profiles built from an alignment of proteins that match the pattern

• Profile searching is more sensitive than pattern searching - uses an alignment algorithm, allows gaps
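A frequency table of this kind can be sketched as follows. The four aligned sequences are made up for illustration, and the scoring function is a deliberate simplification: real profiles use log-odds scores and an alignment algorithm with gaps, neither of which is shown here.

```python
from collections import Counter


def build_profile(alignment):
    """Return one {residue: frequency} dict per column of a gap-free alignment."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({res: n / len(column) for res, n in counts.items()})
    return profile


def score(profile, segment):
    """Multiply per-position frequencies; 0.0 if a residue was never observed."""
    result = 1.0
    for freqs, residue in zip(profile, segment):
        result *= freqs.get(residue, 0.0)
    return result


# Hypothetical alignment of four motif instances.
alignment = ["NGSA", "NATA", "NGTG", "NGSA"]
profile = build_profile(alignment)
print(profile[1])               # frequencies at the second position
print(score(profile, "NGSA"))   # a frequent combination
print(score(profile, "NATG"))   # not in the alignment, but still scores above zero
```

Unlike the exact pattern match, a segment that was never seen as a whole can still score well if each of its residues is common at its position.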

Hidden Markov Models

• Hidden Markov Models (HMMs) are a more sophisticated form of profile analysis.

• Rather than build a table of amino acid frequencies at each position, they model the transition from one amino acid to the next.

• Pfam is built with HMMs.

So far……..

Given a gene/protein sequence we have learned ways to determine the function

With this wealth of information, there arises a need for ANNOTATION

an·no·ta·tion (ăn'ō-tā'shən)

The act or process of furnishing critical commentary or explanatory notes

Genome sequence annotation can be

– structural: demarcate functional nucleotide sequences in the genome (ORFs)

– functional: assign functions to the identified functional elements (gene products)

ONTOLOGY: a controlled, structured vocabulary

Gene Ontology (GO)

•Molecular function: tasks performed by gene product

•Biological process: broad biological goals accomplished by ordered assemblies of molecular functions

•Cellular component: subcellular structures, locations and macromolecular complexes

Example, Bioworks output from a MudPIT experiment (Bovine)

Gene association file

Guilt-by-Association

Compare T with seqs of known function in a db

Assign to T same function as homologs

Confirm with suitable wet experiments

Discard this function as a candidate

Basic Local Alignment Search Tool (BLAST)

Altschul et al., JMB, 215:403-410, 1990

– find db seqs with short perfect matches to the query seq

– find seqs with good flanking alignment
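The two steps above (short perfect matches, then flanking alignment) can be illustrated with a toy seed-and-extend sketch. This is not the BLAST algorithm itself: real BLAST scores matches with a substitution matrix and stops extension with a score drop-off, whereas this example extends only while residues are identical. The sequences and function names are made up for illustration.

```python
def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a k-letter word matches exactly."""
    index = {}
    for i in range(len(query) - k + 1):
        index.setdefault(query[i:i + k], []).append(i)
    seeds = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            seeds.append((i, j))
    return seeds


def extend(query, subject, qpos, spos, k=3):
    """Extend a seed left and right, without gaps, while residues are identical."""
    left = 0
    while (qpos - left > 0 and spos - left > 0
           and query[qpos - left - 1] == subject[spos - left - 1]):
        left += 1
    right = k
    while (qpos + right < len(query) and spos + right < len(subject)
           and query[qpos + right] == subject[spos + right]):
        right += 1
    return query[qpos - left:qpos + right]


query, subject = "MKTAYIAK", "GGTAYIAQ"
seeds = find_seeds(query, subject)
print(seeds)
print(extend(query, subject, *seeds[0]))
```

Indexing the query words first is what makes the seed step fast: each database word is looked up once instead of being compared against the whole query.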

Annotating the Chicken Genome

[Figure: flowchart for the chicken genome. Gene product sources: known gene products (UniProt), 'predicted' gene products (NCBI/UniParc), unique gene products (Ensembl?). Outcomes: annotation (IEA annotation, non-IEA annotation) or no annotation (requires manual annotation, further experimentation). Percentages shown on the figure: 35%, 35%, 30%, (0.91%).]

Gene Ontology Annotation Training: Introduction

Fiona McCarthy November 2005

GO Annotation Training

• Primarily focus on training to be a biocurator, ie. training to assign GO terms to gene products

• GO Users: introduction to GO

Overview

1. Bio-Ontologies and the Gene Ontology
2. The Gene Ontology Consortium and AgBase
3. The 3 Gene Ontologies
4. GO Terms
5. DAGs
6. Annotating to GO
7. Nomenclature
8. GO Term Rules and Relationships

1. Bio-Ontologies and the Gene Ontology (GO)

What is an Ontology?

Gruber, 1993:

• “ontologies provide controlled, consistent vocabularies to describe concepts and relationships, thereby enabling knowledge sharing”

OBO: Open Biomedical Ontologies

Why is Gene Ontology Needed?

• The post-genome ‘problem’!

- a huge body of information with an extremely large vocabulary to describe it

• Vocabulary used is poorly defined

- one word can have different meanings

- different names for the same concept

• Biological systems are complex and our knowledge of such systems is incomplete

• Results in large databases which are difficult to mine computationally

Lewis SE, Genome Biology 2004. PMID:15642104

What is the Gene Ontology?

Emily Dimmer, GOA EBI, 2004: GO: “a controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing”

• a collaborative effort to address the need for consistent descriptions of gene products in different databases

• facilitates uniform queries across them

• structured to be queried at different levels, eg. you can use GO to find all the maize gene products in the genome that are involved in signal transduction, or you can zoom in on all the receptor tyrosine kinases

• annotators can assign properties to gene products at different levels, depending on how much is known about a gene product

• provides a standard, species-neutral way of representing biological function

• GO covers ‘normal’ functions and processes

- no pathology, no experimental conditions

What GO is NOT

• GO is not a database of gene sequences or gene products

• GO is not a unified database; rather it is a shared vocabulary

• GO does not attempt to describe every aspect of biology

• GO is not a dictated standard

2. The GO Consortium and AgBase

The GO Consortium

• development of ontologies that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner

• There are three separate aspects to this effort:

1. Writing and maintaining the ontologies;
2. Making associations between the ontologies and the genes and gene products in the collaborating databases;
3. Developing tools that facilitate the creation, maintenance and use of ontologies.

© Emily Dimmer, GOA


AgBase Data Submission

• new associations are made by biocurators

• gene associations are prepared as a gene association file

• the gene association file is submitted to AgBase for QC checking

• AgBase collates and submits gene associations to the GO Consortium via GOA

• AgBase submissions are QC checked by GOA and MGI

• AgBase gene associations are available through the AgBase homepage, GOA and UniProt

3. The Gene Ontologies

http://www.geneontology.org/

The Gene Ontologies

The three organizing principles of GO are:

1. Molecular Function (mf or F)

- describes activities, such as catalytic or binding activities, at the molecular level

- represents activities rather than the entities (molecules or complexes) that perform the actions

- eg. catalytic activity, transporter activity

2. Biological Process (bp or P)

- series of events accomplished by one or more ordered assemblies of molecular functions

- a bp term does not represent a pathway

- eg. signal transduction or pyrimidine metabolism

3. Cellular Component (cc or C)

- a component of a cell, which may be a subcellular structure or a gene product group

- eg. nucleus or proteasome

Gene products can be described by more than one term in each ontology.

1. Molecular Function (mf or F): cytochrome c can be described by the mf term electron transporter activity

2. Biological Process (bp or P): cytochrome c is used in several bp: oxidative phosphorylation and induction of cell death

3. Cellular Component (cc or C): the gene product cytochrome c has 2 cc: mitochondrial matrix and mitochondrial inner membrane.

Ontology Guides

• Cellular Component ontology guide: http://www.geneontology.org/GO.component.guidelines.shtml

• Molecular Function ontology guide: http://www.geneontology.org/GO.function.guidelines.shtml

• Biological Process ontology guide: http://www.geneontology.org/GO.process.guidelines.shtml

• Many gene products associate into entities that function as complexes, or 'gene product groups'

– eg. hemoglobin contains the gene products alpha-globin and beta-globin, and the small molecule heme

– eg. the ribosome is a complex assembly of numerous different gene products

• To avoid confusion between gene products and molecular function, mf terms are described as an activity

– eg. alcohol dehydrogenase can be annotated to the term alcohol dehydrogenase activity

4. GO Terms

What is a GO Term?

• The purpose of GO is to define particular attributes of gene products

• A GO Term is simply the text string used to describe an entry in GO

– eg. signal transduction

• GO Terms do not include

– mutant or disease specific terms, eg. oncogenesis

– protein domains or structural features, eg. the death domain

• A GO Node is a GO Term and all its children

The Anatomy of a GO Term

GO Terms are composed of:

• term name

• unique 7-digit GO ID

• term definition (93% of GO terms are defined)

• synonyms (optional)

• database cross-references (optional)

• relationships to other GO terms

The Anatomy of a GO Term

[Figure: annotated GO term record. Labels: term name; unique GO ID; ontology that the GO term fits into; synonyms for the term name (to aid searching); term definition; relationship to other GO terms; record of term update.]

General Conventions for GO Terms

• Use “U.S. English”

• Avoid abbreviations

• Use full element names, not symbols, eg. “zinc” not “Zn”

• Spell Greek symbols out in full, eg. alpha

• Use lower case, except where demanded by context, eg. DNA

• Use singular form where possible

• Do not use anatomical qualifiers, eg. mitochondrial DNA polymerase

• Avoid gene product names where possible (may be used as a synonym)

• Cross-reference other databases where possible

GO ID Numbers

• A GO ID is a 7-digit number, padded with zeros if necessary

– numbers are assigned as they are required

• A GO ID is NEVER deleted: GO IDs should be conserved at all times so that, even if a term is defunct or has a new GO ID, someone searching using the old GO ID can find it.

• A GO ID is associated with a definition rather than with the term name, eg.

– if the wording but not the meaning of a term is changed, the GO ID stays the same,

– a new meaning requires a new GO ID, even if the text string doesn't change.

Changing GO ID Numbers

• Assume that we have a term 'mouse', GO ID GO:0000123, in an ontology; it is defined as a small furry mammal.

• We decide to change the wording to 'Mus musculus', keeping the definition the same. In this case we merely update the text; the GO ID stays the same because the meaning stays the same. We may choose to keep 'mouse' as a synonym, but there would still only be one ID associated with the term.

• We decide that the term 'mouse' should instead mean a piece of computer equipment. In this case, the old term and ID are moved to the 'obsolete' category, and 'mouse' as newly defined gets a new GO ID, GO:0000456. The old GO ID and definitions are saved for posterity in case we ever need to know what happened to them.

mouse
Accession: GO:0000123
Aspect: biological_process
Synonyms: Mickey
Definition: small, furry rodent
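The 'mouse' example can be replayed in a short sketch. The TermRegistry class is hypothetical (GO is not maintained this way); it only illustrates the two rules: a change of wording keeps the GO ID, a change of meaning makes the old entry obsolete and issues a new ID, and no ID is ever deleted.

```python
import re

GO_ID_RE = re.compile(r"GO:\d{7}")


def format_go_id(number):
    """Format an integer as a zero-padded 7-digit GO ID."""
    if not 0 <= number <= 9_999_999:
        raise ValueError("GO IDs have at most 7 digits")
    return f"GO:{number:07d}"


def is_valid_go_id(text):
    """True if text looks like 'GO:' followed by exactly 7 digits."""
    return GO_ID_RE.fullmatch(text) is not None


class TermRegistry:
    """Hypothetical in-memory store used only to illustrate the ID rules."""

    def __init__(self):
        self.terms = {}        # GO ID -> {"name", "definition", "obsolete"}
        self.next_number = 1

    def add(self, name, definition):
        go_id = format_go_id(self.next_number)
        self.next_number += 1
        self.terms[go_id] = {"name": name, "definition": definition,
                             "obsolete": False}
        return go_id

    def rename(self, go_id, new_name):
        """Wording changes, meaning does not: the ID stays the same."""
        self.terms[go_id]["name"] = new_name

    def redefine(self, go_id, new_definition):
        """Meaning changes: keep the old entry as obsolete, issue a new ID."""
        old = self.terms[go_id]
        old["obsolete"] = True
        return self.add(old["name"], new_definition)


registry = TermRegistry()
old_id = registry.add("mouse", "a small furry mammal")
new_id = registry.redefine(old_id, "a piece of computer equipment")
print(old_id, registry.terms[old_id]["obsolete"])
print(new_id, registry.terms[new_id]["definition"])
```

Because the old entry is flagged rather than removed, a search on the old ID still finds out what happened to the term.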

GO Term Definition

• Always define new terms

– If you create a new term, or refine a term, you should add a definition for it, and note the references used in composing the definition

– Refer to Oxford Dictionary of Chemistry and Molecular Biology

– Wherever a 'standard' definition exists for a group of related terms, it should be used (standard definitions available from the ontology guides)

• Write definitions carefully in full sentences

• Document where your definition came from

• Always define which ontology the term belongs to

• Show where new terms fit into the ontology

– show all possible paths

– all possible paths must be true

GO Term Synonyms

• Synonyms are important as they facilitate database searching

• Don’t add synonyms where the only difference is case

• Synonyms may not always be strictly 'synonymous', eg.

– may be broader or narrower than the term definition

– may be a related phrase

– may be alternative wording, spelling or use a different system of nomenclature

GO Term Synonyms

The synonym relationship types are:

• the term is an exact synonym: ornithine cycle is an exact synonym of urea cycle

• the terms are related: cytochrome bc1 complex is a related synonym of ubiquinol-cytochrome-c reductase activity

• the synonym is broader than the term name: cell division is a broad synonym of cytokinesis

• the synonym is narrower or more precise than the term name: pyrimidine-dimer repair by photolyase is a narrow synonym of photoreactive repair

• the synonym is related to the term, but is not exact, broader or narrower: virulence has a synonym type of other related to the term pathogenesis

Cross-referencing Other Databases

General database cross references (general dbxrefs) should be used whenever a GO term has an identical meaning to an object in another database.

5. Fitting GO Terms together: DAGs

Directed Acyclic Graphs

• The ontologies are organized as directed acyclic graphs (DAGs)

• DAGs differ from hierarchies in that a “child”, or more specialized, term can have many “parents”, or less specialized, terms

• Terms are linked by two types of relationships:- is_a & part_of

AmiGO Browser http://www.godatabase.org/cgi-bin/amigo/go.cgi?search_constraint=terms&action=replace_tree

MGI GO Browser http://www.informatics.jax.org/searches/GO_form.shtml

Browsing the GO DAGs

There are three GO Browsers online:

• AmiGO browser: http://www.godatabase.org/cgi-bin/amigo/go.cgi

• QuickGO browser: http://www.ebi.ac.uk/ego/

• MGI browser: http://www.informatics.jax.org/searches/GO_form.shtml

6. Annotating to GO: attributing function

Annotating to GO

• Using GO terms to represent the activities and localizations of a gene product

• Annotations contributed by members of the GO Consortium– model organism databases– cross-species databases, eg. UniProt

• Annotations made freely available from GO website under ‘downloads’

Annotating to GO

Information required:

1. database object (eg. a protein or gene identifier)
2. reference ID (eg. PubMed ID)
3. GO Term ID (eg. GO:0004674)
4. evidence code (eg. IDA)

Two Types of Annotation:

1. electronic annotation
2. manual annotation

All annotations must:

• Be attributed to a source, which may be a literature reference, another database or a computational analysis.

• Indicate what kind of evidence is found in the cited source to support the association between the gene product and the GO term.

Making Gene Associations

• GO terms should be associated with database objects representing gene products rather than genes

– a single gene may encode very different products with very different attributes

– if the database object is a gene, it is associated with all GO terms applicable to any of its products

• GO annotations should be attributed to a source

• Each annotation should indicate the evidence on which it is based

Evidence Codes

IMP inferred from mutant phenotype
IGI inferred from genetic interaction [with <database:gene_symbol[allele_symbol]>]
IPI inferred from physical interaction [with <database:protein_name>]
ISS inferred from sequence similarity [with <database:sequence_id>]
IDA inferred from direct assay
IEP inferred from expression pattern
IEA inferred from electronic annotation [to <database:id>]
TAS traceable author statement
NAS non-traceable author statement
ND no biological data available
RCA inferred from reviewed computational analysis
IC inferred by curator
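For curators who script their checks, the code list above fits naturally into a lookup table. The sketch below is an illustration only; the names EVIDENCE_CODES, CODES_WITH_CROSS_REFERENCE and describe are assumptions made for this example, and the set of codes that carry a 'with' or 'to' cross-reference simply mirrors the bracketed notes on the slide.

```python
# Evidence codes as listed on the slide.
EVIDENCE_CODES = {
    "IMP": "inferred from mutant phenotype",
    "IGI": "inferred from genetic interaction",
    "IPI": "inferred from physical interaction",
    "ISS": "inferred from sequence similarity",
    "IDA": "inferred from direct assay",
    "IEP": "inferred from expression pattern",
    "IEA": "inferred from electronic annotation",
    "TAS": "traceable author statement",
    "NAS": "non-traceable author statement",
    "ND": "no biological data available",
    "RCA": "inferred from reviewed computational analysis",
    "IC": "inferred by curator",
}

# Codes shown on the slide with a bracketed [with ...] or [to ...] note.
CODES_WITH_CROSS_REFERENCE = {"IGI", "IPI", "ISS", "IEA"}


def describe(code):
    """Return a one-line description, or raise ValueError for an unknown code."""
    if code not in EVIDENCE_CODES:
        raise ValueError(f"unknown evidence code: {code!r}")
    note = " (carries a cross-reference)" if code in CODES_WITH_CROSS_REFERENCE else ""
    return f"{code}: {EVIDENCE_CODES[code]}{note}"


print(describe("IDA"))
print(describe("ISS"))
```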

• Because a single gene may encode very different products with very different attributes, GO recommends associating GO terms with database objects representing gene products rather than genes.

• At present, however, many participating databases are unable to associate GO terms to gene products, and therefore use genes instead.

• If the database object is a gene, it is associated with all GO terms applicable to any of its products.

The Gene Association File

• collaborating databases export to GO a tab delimited file, the "gene association file", of links between database objects and GO terms

• the database object may represent a gene or a gene product (transcript or protein)

• gene association files are available from both collaborating databases and the GO Consortium

• the gene association file is the mechanism by which gene product functions are shared/released

The Gene Association File

To make gene associations we have to fill in up to 15 fields; 11 are compulsory
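A line of such a file can be split and checked with a few lines of Python. The slides confirm only that there are 15 tab-delimited fields, that 11 are compulsory, and that DB_Object_ID and DB_Object_Symbol are the second and third columns. The remaining column names, their order and the choice of optional columns below are assumptions based on the gene association file layout of the time, and the example record is entirely made up.

```python
# Assumed column layout (15 fields); only columns 2 and 3 are named in the slides.
GAF_COLUMNS = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "Qualifier", "GO_ID",
    "DB_Reference", "Evidence", "With_From", "Aspect", "DB_Object_Name",
    "DB_Object_Synonym", "DB_Object_Type", "Taxon", "Date", "Assigned_By",
]
# Assumed optional columns; the other 11 are treated as compulsory.
OPTIONAL_COLUMNS = {"Qualifier", "With_From", "DB_Object_Name", "DB_Object_Synonym"}


def parse_association_line(line):
    """Split one tab-delimited line into a dict and check compulsory fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GAF_COLUMNS):
        raise ValueError(f"expected {len(GAF_COLUMNS)} fields, got {len(fields)}")
    record = dict(zip(GAF_COLUMNS, fields))
    missing = [name for name in GAF_COLUMNS
               if name not in OPTIONAL_COLUMNS and not record[name]]
    if missing:
        raise ValueError("missing compulsory fields: " + ", ".join(missing))
    return record


# Hypothetical record: every identifier here is invented for illustration.
example = "\t".join([
    "UniProt", "P00000", "EXMPL1", "", "GO:0004674", "PMID:0000000",
    "IDA", "", "F", "", "", "protein", "taxon:9031", "20051101", "AgBase",
])
record = parse_association_line(example)
print(record["DB_Object_Symbol"], record["GO_ID"], record["Evidence"])
```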

7. Nomenclature

Why Do We Care?

DB_Object_Symbol

To make gene associations we have to fill in 11 compulsory fields

Why Do We Care?

• The 3rd gene association field is “DB_Object_Symbol”

• The entry in the DB_Object_Symbol field should be a symbol that means something to a biologist, wherever possible (gene symbol, for example).

• It is not an ID or an accession number (the second column, DB_Object_ID, provides the unique identifier), although IDs can be used in DB_Object_Symbol if there is no more biologically meaningful symbol available (e.g., when an unnamed gene is annotated).

Nomenclature Guidelines

• chicken & cow will follow human guidelines

– controlled by HUGO Gene Nomenclature Committee (HGNC)

– http://www.gene.ucl.ac.uk/nomenclature/

• Arabidopsis:

– http://www.arabidopsis.org/jsp/processor/genesymbol/symbol_main.jsp

• Rice nomenclature committee: Gramene, MaizeGDB

• catfish: “Gene Nomenclature for Protein-Coding Loci in Fish”, Transactions of the American Fisheries Society, Vol. 119, No. 1, pp. 2–15.

Guidelines for HGNC

• Hester M. Wain, Elspeth A. Bruford, Ruth C. Lovering, Michael J. Lush, Mathew W. Wright and Sue Povey. Genomics 79(4):464-470 (2002)

• Available online (last updated 2004): http://www.gene.ucl.ac.uk/nomenclature/guidelines.html

Guidelines Summary

1. Each approved gene symbol must be unique.
2. Symbols are short-form representations (or abbreviations) of the descriptive gene name.
3. Symbols should only contain Latin letters and Arabic numerals.
4. Symbols should not contain punctuation.
5. Symbols should not end in "G" for gene.
6. Symbols do not contain any reference to species, for example "H/h" for human.

Designating Gene Symbols

• The initial character of the symbol should always be a letter. Subsequent characters may be other letters, or if necessary, Arabic numerals.

• No superscripts or subscripts or Roman numerals may be used.

• All Greek letters should be changed to letters in the Latin alphabet and placed at the end of the gene symbol, eg. GLA "galactosidase, alpha"

• No punctuation may be used, with the exception of the HLA, immunoglobulin and T cell receptor gene symbols (which may be hyphenated).

• Gene symbols will not usually be assigned to alternative transcripts.

• Tissue specificity or molecular weight should be avoided.

• Some letters or combinations of letters are used as prefixes or suffixes in a symbol to give a specific meaning, and their use for other meanings should be avoided (Section 10).

• Oncogenes are given symbols corresponding to the homologous retroviral oncogene, but without the "v-" or "c-" prefixes.
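The character-level rules above are easy to check mechanically. The sketch below is a hypothetical helper, not an HGNC tool: it tests only that the first character is a letter and that the symbol contains nothing but Latin letters and Arabic numerals, with hyphens allowed on request for the HLA, immunoglobulin and T cell receptor exception. Rules that need human judgement (Roman numerals, a trailing "G" meaning gene, species references, uniqueness) are not checked.

```python
def symbol_problems(symbol, allow_hyphen=False):
    """Return a list of character-level problems with a proposed gene symbol."""
    if not symbol:
        return ["symbol is empty"]
    problems = []
    first = symbol[0]
    if not (first.isascii() and first.isalpha()):
        problems.append("first character must be a letter")
    allowed_extra = "-" if allow_hyphen else ""
    for ch in symbol:
        # Only Latin letters and Arabic numerals (plus '-' if allowed).
        if not (ch.isascii() and (ch.isalnum() or ch in allowed_extra)):
            problems.append(f"character {ch!r} is not allowed")
    return problems


print(symbol_problems("GLA"))                       # follows the rules
print(symbol_problems("GLα"))                       # Greek letter not spelled in Latin
print(symbol_problems("HLA-A"))                     # hyphen rejected by default
print(symbol_problems("HLA-A", allow_hyphen=True))  # exception applied
```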

Gene Symbols can be assigned to:

• Genes - a DNA segment that contributes to phenotype/function

- includes pseudogenes, genes coded by the opposite (antisense) strand that overlap a known gene

• Locus - a point in the genome, identified by a marker, which can be mapped by some means. It does not necessarily correspond to a gene.

• Chromosome Region - a genomic region which has been associated with a particular syndrome or phenotype, particularly when there is a possibility that several genes within it may be involved in the phenotype.

• Transcribed but untranslated functional DNA segments, e.g. XIST "X (inactive)-specific transcript".

• EST clusters which suggest a putative gene, e.g. C1orf1 "chromosome 1 open reading frame 1".

• Predicted genes (in silico) which show a high degree of sequence homology to well characterized genes will be assigned the same symbol with an "L" for like.

• Gene symbols will not usually be assigned to alternative transcripts or to genes predicted solely from in silico data (with no other supporting evidence, e.g. significant homology to a characterized gene).

Homologies With Other Species

• ORTHOLOGS: to determine orthologs, use NCBI Entrez Gene/RefSeq; PHiGs

– recognizable orthologs should be named the same as human

• HOMOLOGS: (across species) can be named “-like”, “-homolog” or “-related”

– gene symbols for homologs can be designated with an “L”

• PARALOGS: (within the same species) can be named “-related sequence”

Genes Identified from Sequence Information

• Antisense - A gene of unknown function, encoded at the same genomic locus (with overlapping exons) as another gene, should have its own symbol.

• Opposite Strand - Genes of unknown function on the opposite strand should be assigned the suffix OS for "opposite strand".

• Untranslated Functional RNAs - These may be assigned symbols; however, the approved name should contain "untranslated RNA".

• Related (-like) sequences - The designation of the suffix "L" is used where no other functional information is available and there is some sequence similarity with a known gene.

• Genes of unknown function - Genes predicted with EST evidence, but showing no structural or functional homology, are regarded as putative. These are designated by the chromosome of origin, the letters "orf" for open reading frame and a number in a series, e.g. C2orf1, "chromosome 2 open reading frame 1".

• Pseudogenes - pseudogenes will usually be assigned the next number in the relevant symbol series, suffixed by a "P" for pseudogene (or "PS" in specific cases) if requested; however, the designation "pseudogene" will remain in the gene name.

Enzymes and Proteins

• The rules described in the sections on gene names and symbols apply, but in addition the names of genes coding for enzymes should be based on those recommended by the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology

• These can be found at http://ca.expasy.org/enzyme/

• Names of genes encoding plasma proteins, hemoglobins, and specialized proteins are based on standard names and those recommended by their respective committees, e.g. HBA1 "hemoglobin, alpha 1".


8. GO term rules and relationships

For the most up to date information, refer to the GO Annotation Guide:

http://www.geneontology.org/GO.annotation.shtml

The True Path Rule

"the pathway from a child term all the way up to its top-level parent(s) must always be true"

GO Term Relationships

• A GO Term can have multiple parents

– Need to keep in mind relationships between ‘child’ and ‘parent’ terms

– A child term can have one of two different relationships to its parent(s): is_a or part_of.

– The same term can have different relationships to different parents:

[Figure: example term whose parent links are labeled is_a and part_of]

The is_a relationship

• an is_a relationship means that the term is a subclass of its parent

• eg. mitotic cell cycle is_a cell cycle

• The is_a relationship is transitive, which means that if 'GO term A' is a subclass of 'GO term B', and 'GO term B' is a subclass of 'GO term C', 'GO term A' is also a subclass of 'GO term C':

[Figure: diagram with three is_a links]

The part_of relationship

There are four basic levels of restriction for a part_of relationship:

The True Path Rule

the part_of relationship is restricted to those types where a child term must always be part_of its parent

The part_of relationship

2. 'necessarily is_part', means that wherever the child exists, it is as part of the parent, eg. replication fork is part_of chromosome, so whenever replication fork occurs, it is as part_of chromosome, but chromosome does not necessarily have part replication fork.

The nucleus always has_part chromosome, but chromosome isn't necessarily part_of nucleus

The part_of relationship

4. The final type is a combination of both two and three, 'has_part' and 'is_part', eg. nuclear membrane is part_of nucleus. So nucleus always has_part nuclear membrane, and nuclear membrane is always part_of nucleus.

• The part_of relationship used in GO is usually type two, 'necessarily is_part'.

• Like ‘is_a’, ‘part_of’ is transitive, so that if 'GO term A' is part_of 'GO term B', and 'GO term B' is part_of 'GO term C', 'GO term A' is part_of 'GO term C':

[Figure: diagram with three part_of links]

For example: laminin-1 is part_of basal lamina. basal lamina is part_of basement membrane. laminin-1 is part_of basement membrane.
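Transitivity and the True Path Rule can be illustrated with a tiny graph walk. The PARENTS table below is a hand-made fragment holding only the examples from these slides, not the real ontology, and the function name is an assumption. The walk follows both is_a and part_of links upward; every term it reaches must be a true statement about the starting term.

```python
# Hand-made fragment of a DAG: child -> list of (relationship, parent).
PARENTS = {
    "laminin-1": [("part_of", "basal lamina")],
    "basal lamina": [("part_of", "basement membrane")],
    "mitotic cell cycle": [("is_a", "cell cycle")],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("laminin-1")))
print(sorted(ancestors("mitotic cell cycle")))
```

Tracking visited terms matters in a DAG because a child with several parents can reach the same ancestor along more than one path.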

is_a & part_of Designations

Designations shown by GO Browsers

Plain Text Designations: is_a = %, part_of = <

Species-specific terms

• GO nodes should aggressively avoid using species-specific definitions.

• Nevertheless, many functions, processes and components are not common to all life forms.

• The current convention is to include any term that can apply to more than one taxonomic class of organism.

• There are cases where a word or phrase has different meanings when applied to different organisms, eg. embryonic development

• Such terms are distinguished from one another by their definitions and by the sensu designation (sensu means 'in the sense of'), as in the term embryonic development (sensu Insecta)

To find taxonomic groups used with sensu: http://www.geneontology.org/GO.usage.shtml#sensu

Fiona’s Homework

1. Maize Nomenclature

With thanks to Bela: http://www.maizegdb.org/maize_nomenclature.php

“Unknown” & Unannotated

Unknown vs Unannotated: "Unknown" means that someone has tried annotating the gene, but didn't find any information. Absence of annotation implies that no one has looked.

AgBase: ‘unknown’ will be used when a curator has looked at all the current literature for the product AND tried to assign function using sequence similarity (ISS).

Evidence Codes & Gene Association Files



Fiona McCarthy July 7 2005

1. Evidence Codes




IMP: inferred from mutant phenotype

• Any gene mutation/knockout

• Overexpression/ectopic expression of wild-type or mutant genes

• Anti-sense experiments

• RNAi experiments

• Specific protein inhibitors

• Comment: anything that is concluded from looking at mutations or abnormal levels of the product(s) only of the gene of interest is IMP (compare IGIs). Use this code for experiments that use antibodies or other specific inhibitors of RNA or protein activity, even though no gene may be mutated (the rationale is that IMP is used where an abnormal situation prevails in a cell or organism).

• In addition to such 'abnormal' situations, IMP may also be used for experiments (e.g. QTL analyses) in which an inference is based on the observed effects of naturally occurring variations in a gene.

• IMP also covers phenotypic similarity: a phenotype that is informative because it is similar to that of another independent phenotype (which may have been described earlier or documented more fully) is IMP (not IGI).

IGI: inferred from genetic interaction

• "Traditional" genetic interactions such as suppressors, synthetic lethals, etc.

• Functional complementation, eg. when a gene from one organism complements a deletion or other mutation in another species. (See notes for WITH field.)

• Rescue experiments

• Inference about one gene drawn from the phenotype of a mutation in a different gene

• Includes any combination of alterations in the sequence (mutation) or expression of more than one gene/gene product. (Covers any of the IMP experiments that are done in a non-wild-type background.)

• When redundant copies of a gene must all be mutated to see an informative phenotype, that's IGI.

• Use IMP for "phenotypic similarity," as described below.

• We have also decided to use this category for situations where a mutation in one gene (gene A) provides information about the function, process, or component of another gene (gene B; i.e. annotate gene B using IGI).

IPI: inferred from physical interaction

• 2-hybrid interactions

• Co-purification

• Co-immunoprecipitation

• Ion/protein binding experiments

• Covers physical interactions between the gene product of interest and another molecule (or ion, or complex).

For functions such as protein binding or nucleic acid binding, a binding assay is simultaneously IPI and IDA; IDA is preferred because the assay directly detects the binding. For both IPI and IGI, it would be good practice to qualify them with the gene/protein/ion. We thought that antibody binding experiments were not suitable as evidence for function or process.

ISS: inferred from sequence or structural similarity

This code can be used for any analysis based on sequence alignment, structure comparison, or evaluation of sequence features such as composition, including:

• Sequence similarity (homologue of/most closely related to)

• Recognized domains

• Structural similarity

• Protein features, predicted or observed (e.g. hydrophobicity, sequence composition)

• Southern blotting

• Use this code for BLAST (or other sequence similarity detection method) results that have been reviewed for accuracy by a curator. If the result has not been reviewed, use IEA.

• ISS can also be used for sequence similarities reported in published papers, if the curator thinks the result is reliable enough.

Usage of the ISS code within GOA

http://www.ebi.ac.uk/GOA/goaHelp.html#6

There are three ways in which a curator can use the ISS evidence code:

1. If a curator reads a paper that provides functional information and states an orthology between two proteins, a curator can transfer the annotation to the corresponding orthologs by changing the evidence code to 'ISS' in the target protein's entry. The original literature identifier is still shown. Any information that was previously in the 'with' column of the original entry is changed in the target entry to contain the original entry's accession number. This allows the source of the annotation to be traced.

2. If a curator is confident that a protein shows high similarity to another (e.g. from using BLAST or UniRef90), and it seems reasonable to infer that the two proteins have a common function, then annotation can be transferred from one protein to another using the 'ISS' code. The evidence code in the target annotation will change to 'ISS'. The target entry's accession number will replace the journal identifier and any information that was previously contained in the 'with' column of the original entry is changed in the target entry to contain the original entry's accession number. This allows the source of the annotation to be traced.

3. If sequence similarity and evidence for human annotation was reported in two different papers, the annotation can be transferred using the 'ISS' evidence code. In the target entry, the evidence code changes to 'ISS' and the curator will add a new journal identifier. Any information that was previously contained in the 'with' column of the original entry is changed in the target entry to the original entry's accession number. This allows the source of the annotation to be traced.

IDA: Inferred from Direct Assay

• Enzyme assays

• In vitro reconstitution (e.g. transcription)

• Immunofluorescence (for cellular component)

• Cell fractionation (for cellular component)

• Physical interaction/binding assay (sometimes appropriate for cellular component or molecular function)

• Important: this code is used to indicate a direct assay for the function, process, or component indicated by the GO term. Curators therefore need to be careful, because an experiment considered as direct assay for a term from one ontology may be a different kind of evidence for the other ontologies. In particular, we thought of more kinds of direct assays for cellular component than for function or process. For example, a fractionation experiment might provide "direct assay" evidence that a gene product is in the nucleus, but "protein interaction" evidence for its function or process. Binding assays can provide direct assay evidence for ... binding molecular function terms.

IEP: inferred from expression pattern

• Transcript levels (e.g. Northerns, microarray data)

• If a GO term is inferred from the results of a microarray experiment, this code will usually be used. There are cases, however, where RCA may be appropriate, such as studies that combine results of microarray and other types of experiments.

• Protein levels (e.g. Western blots)

• Covers cases where the annotation is inferred from the timing or location of expression of a gene. Expression data will be most useful for process annotation rather than function. For example, several of the heat shock proteins are thought to be involved in the process of stress response because they are upregulated during stress conditions. Use this category with caution!

• Note: The "database identifier" column in the gene_association file should be filled in whenever possible, to help avoid circular annotations between GO and other databases.

Notes on IEP

• Notes: Addition of the IEP category generated a lot of discussion via email. One theme that emerged is that curators and users will have to be careful when interpreting expression results, especially if there's no other kind of evidence linking a gene product with a process. For instance, we certainly don't want to look at a cluster of genes, and, based on previous knowledge of one of them being involved in protein folding, annotate the rest of the genes in that cluster to the same process. This is certainly a dangerous thing to do. But having the IEP code allows curators to include expression data when they deem it appropriate, and allows researchers to make their own decisions/judgments about the reliability of the annotation. (Midori Harris, 2000-03-08; updated 2000-03-09)

• Another important theme, indeed one of the reasons we opted to add the category, is that systematic analysis will prove to be very informative. It was especially well stated by Richard Baldarelli of MGI, so I've included his message here:

It seems that expression data will be very useful for process and cellular component mapping, but caution should be used for function mapping (as Allan and Kara point out [in email messages]). While conventional expression assays will provide useful evidence in several cases, the real benefit will come from expression profiling. The rationale behind expression profiling from chip data is that genes that are coordinately regulated over a range of environments are likely to be involved in the same biological processes, and thus may have interrelated functions. As expression technology evolves to consider other aspects of gene expression (e.g. transcription and post-transcription chips, Mass-spec on 2D protein data), profiling will become an even more valuable tool for process implication.

With the genome sequences here or on the way, the most significant information we may have for many genes will be expression profiling data (at least for a while anyway). Accuracy levels for process implication aside, this type of evidence is necessarily indirect. Having an evidence type "expression" takes this into account and remains fairly non-specific.

IEA: inferred from electronic annotation

• Annotations based on "hits" in sequence similarity searches, if they have not been reviewed by curators (curator-reviewed hits would get ISS)

• Annotations transferred from database records, if not reviewed by curators (curator-reviewed items may use NAS, or the reviewing process may lead to print references for the annotation)

• Comment: Used for annotations that depend directly on computation or automated transfer of annotations from a database. The key feature that distinguishes this evidence code from others is what a curator has done: IEA is used when no curator has checked the annotation to verify its accuracy. The actual method used (BLAST search, SwissProt keyword mapping, etc.) doesn't matter. An entry may be made in the 'with' column if relevant (e.g. for sequence comparisons).

TAS: traceable author statement

• Anything in a review article where the original experiments are traceable through that article (material from introductions to non-review papers will sometimes meet this standard; discussion sections should usually be regarded with greater skepticism)

• Anything found in a text book or dictionary; usually text book material has become common knowledge (e.g. "everybody" knows that enolase is a glycolytic enzyme).

• TAS and NAS are both used for cases where the publication that a curator uses to support an annotation doesn't show the evidence (experimental results, sequence comparison, etc.). With a few exceptions, TAS should be used only if references to the original work are available. TAS is meant for the more reliable cases, such as reviews (presumably written by experts) or material sufficiently well established to appear in a text book, but there isn't really a sharp cutoff between TAS and NAS. Curator discretion is advised!

• Comment: Formerly ASS ("author said so").

NAS: non-traceable author statement

• Database entries that don't cite a paper (e.g. UniProt Knowledgebase records)

• Statements in papers (abstract, introduction, or discussion) that a curator cannot trace to another publication

• Comment: Formerly NA (not available).

ND: no biological data available

• Used for annotations to "unknown" molecular function, biological process, or cellular component.

• Curators at the contributing database found no information supporting an annotation to any term from the ontology in question (molecular function, biological process, or cellular component) as of the date indicated.

• This reference documents the fact that a database curator has reviewed the literature describing a gene product but found no biologically useful information*; it should only be cited for annotations to "unknown," i.e. molecular function unknown ; GO:0005554, biological process unknown ; GO:0000004 or cellular component unknown ; GO:0008372.

• This code is used only for annotations to "unknown," and it is the only evidence code recommended for annotations to unknown (except in cases where a cited source explicitly says that something is unknown). It should be accompanied by a reference that explains that curators looked but found no information. The GO Reference collection includes a generic reference that can be used with ND; to use it insert "GO_REF:0000015" in the reference column of a gene_association file.

• Note that ND can be used with any one (or two) of the 'unknown' terms, even if there is data available to support annotation to a term from one or both of the other ontologies (e.g. ND can be used with cellular component unknown ; GO:0008372 if the function and process are known but component is not).

* under review: GOA to decide whether or not this code can be used if there is ISS data.

RCA: inferred from reviewed computational analysis

• Predictions based on large-scale experiments (e.g. genome-wide two-hybrid, genome-wide synthetic interactions)

• Predictions based on integration of large-scale datasets of several types

• Text-based computation (e.g. text mining)

• This code is used for annotations based on a non-sequence-based computational method, where the results have been reviewed by an author or a curator.

• IEA should be used for any computational annotations that are not checked for accuracy by a curator (or by the authors of a paper describing such analysis), and sequence comparisons that have been reviewed use ISS. For microarray results alone, IEP is preferred, but RCA is used when microarray results are combined with results of other types of large-scale experiments.

• Cellular component using DDF-MudPIT?
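The choice between IEA, ISS, IEP and RCA described above can be read as a small decision rule. The following is only a sketch of that rule; the function name and its three boolean parameters are hypothetical and are not part of any GO tool.

```python
def computational_evidence_code(sequence_based: bool,
                                reviewed: bool,
                                microarray_only: bool = False) -> str:
    """Pick an evidence code for a computationally derived annotation.

    Hypothetical helper that encodes the rules stated on the slides:
    - not checked by a curator (or by the paper's authors) -> IEA
    - reviewed sequence comparison                         -> ISS
    - reviewed microarray results on their own             -> IEP
    - other reviewed, non-sequence-based analyses          -> RCA
    """
    if not reviewed:
        return "IEA"
    if sequence_based:
        return "ISS"
    if microarray_only:
        return "IEP"
    return "RCA"


# An unreviewed BLAST hit stays IEA; once a curator reviews it, it becomes ISS.
print(computational_evidence_code(sequence_based=True, reviewed=False))
print(computational_evidence_code(sequence_based=True, reviewed=True))
```

Real curation involves judgment that a three-flag function cannot capture; the sketch only shows how the four codes partition the computational cases.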

IC: Inferred by Curator

• To be used for those cases where an annotation is not supported by any evidence, but can be reasonably inferred by a curator from other GO annotations, for which evidence is available.

• An example would be when there is evidence (be it direct assay, sequence similarity or even from electronic annotation) that a particular gene product has the function transcription factor activity. There is no evidence whatsoever that this gene product has the cellular location nucleus, but this would be a perfectly reasonable inference for a curator to make (if the curator is annotating a eukaryotic gene product, of course).

• Note that the With/From field should always be filled in with a GO id when using this evidence code.

• Example:

gene_product: jubi

reference: Ashburner et al. 2006 J. irreprod. data 107:11989-11990

molecular_function: general RNA polymerase II transcription factor ; GO:0016251 | inferred from sequence similarity

cellular_location: nucleus ; GO:0005634 | inferred by curator from GO:0016251

Gene Association Files

Two Types of Annotation:

1. electronic annotation – IEA (see July 19 lecture)

2. manual annotation – sequence data (ISS); papers (all other codes)

Making Gene Associations

• GO terms should be associated with database objects representing gene products rather than genes

– a single gene may encode very different products with very different attributes

– if the database object is a gene, it is associated with all GO terms applicable to any of its products

• GO annotations should be attributed to a source

• Each annotation should indicate the evidence on which it is based

The Gene Association File

• the gene association file is the mechanism by which gene product functions are shared/released

• collaborating databases export to GO a tab delimited file, the "gene association file" of links between database objects and GO terms

• the database object may represent a gene or a gene product (transcript or protein)

The Gene Association File

To make gene associations we have to fill in up to 15 fields; 11 are compulsory.

Attributing a Source

• Every annotation must be attributed to a source, which may be a literature reference, another database or a computational analysis.

• The annotation must indicate what kind of evidence is found in the cited source to support the association between the gene product and the GO term.

General Recommendations (1)

• A gene product can be annotated to zero or more nodes of each ontology.

• Annotation of a gene product to one ontology is independent of its annotation to other ontologies.

• Annotate gene products in each species database to the most detailed level in the ontology that correctly describes the biology of the gene product.

• Keep to the True Path Rule: annotating to a term implies annotation to all parents via any path, so check the parentage of a term before annotating.
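To make the True Path Rule concrete, the sketch below collects every term that an annotation implies by walking up all parent paths. The small graph is a toy chosen for illustration; it is not the real GO structure, and the names TOY_PARENTS and implied_terms are hypothetical.

```python
# Toy parent map (child -> list of parents). NOT the real GO graph.
TOY_PARENTS = {
    "nucleolus": ["nucleus", "non-membrane-bound organelle"],
    "nucleus": ["organelle"],
    "non-membrane-bound organelle": ["organelle"],
    "organelle": ["cellular component"],
    "cellular component": [],
}


def implied_terms(term, parents):
    """Return every ancestor of `term`, following all parent paths."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Annotating to "nucleolus" implies annotation to all four ancestors,
# so each of them must also be true for the gene product.
print(sorted(implied_terms("nucleolus", TOY_PARENTS)))
```

Checking this list before annotating is a quick way to spot a parent that would not be true for the gene product.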

General Recommendations (2)

• "Unknown" vs Unannotated: "Unknown" means that someone has tried annotating the gene, but didn't find any information. Absence of annotation implies that no one has looked.

• Annotate to terms from all three ontologies, using "unknown" if necessary, citing a reference within their database that explains that they found no relevant biological information in the literature (or any other sources they may have considered).

• Uncertain knowledge of where a gene product operates should be denoted by annotating it to two nodes, one of which can be a parent of the other, eg. a gene product known to be in the nucleolus, but also experimentally observed in the nucleus generally, can be annotated to both nucleolus and nucleus in the cell component ontology. The two annotations may have the same or different supporting evidence.

• Annotate to multiple nodes that conflict with each other if there are conflicting claims in the literature.

General Recommendations (3)

• If the database object is a gene, it is associated with all GO terms applicable to any of its products.

• "Normal" depends on the point of view taken by the annotator, e.g. many viruses use host proteins to carry out viral processes. The host protein is then doing something abnormal from the perspective of the host, but completely normal from the perspective of the virus.

• In this case, use two taxon IDs in the "Taxon" column of the gene association file, the first being the organism that encodes the gene product, and the second that of the organism that uses the gene product, and whose perspective is considered "normal" for that annotation.


1. DB

• refers to the database contributing the gene_association file

• this field is mandatory

• currently, we use ‘UniProt’

• ‘UniParc’ and ‘Genbank’ are also acceptable

2. DB_Object_ID

• refers to a unique identifier in the database for the gene product being annotated

• may or may not correspond exactly to what is described in a paper, e.g. a paper describing a protein may support annotations to the gene encoding the protein (gene ID in DB_Object_ID field) or annotations to a protein object (protein ID in DB_Object_ID field)

• this field is mandatory

• we use the UniProt accession

3. DB_Object_Symbol

• refers to a unique (and valid) symbol to which the DB_Object_ID is matched

• this should be a gene symbol, wherever possible

• it is not an ID or an accession number (the DB_Object_ID provides the unique identifier), although IDs can be used if there is no more biologically meaningful symbol available (e.g., when an unnamed gene is annotated)

• can use ORF name for otherwise unnamed gene or protein

• this field is mandatory

4. Qualifier

• flags that modify the interpretation of an annotation

• one (or more) of NOT, contributes_to, colocalizes_with (pipes delimited)

• this field is not mandatory

Allowable values:

1. ‘NOT’

Use NOT when a gene product is not associated with the GO term, to document conflicting claims in the literature. NOT is used when there is some reason to expect an association, but experimental evidence proves the expectation wrong.

2. ‘Contributes to’ (used with GO Function Ontology)

Distinguishes between individual subunit functions and whole complex functions.

3. ‘Colocalizes with’ (used with GO Component Ontology)

Transiently or peripherally associated with an organelle or complex, or cases where the resolution of an assay is not accurate.

5. GO ID

• refers to the GO ID number for the term attributed to the DB_Object_ID

• this field is mandatory

Note that the GO term name is NOT used in the gene association file.

6. DB:Reference

• refers to the unique identifier that gives authority for the attribution of the GO ID to the DB_Object_ID

• for manual curation, this will be a PubMed ID number

• may also be a database record, e.g. electronic annotations & annotations to unknown function (ND) will require that AgBase have a protocol code

• only one reference can be cited on a single line

• when database references also refer to PMID, entries are pipes delimited (separated by a “|”)

• this field is mandatory

7. Evidence

• either IMP, IGI, IPI, ISS, IDA, IEP, IEA, TAS, NAS, ND, IC or RCA

• after filling in the evidence code, check to see if you need to fill in ‘With (or) From’ field

• this field is mandatory

• evidence codes are changing & updated: check definitions regularly

8. With (or) From

• This field is used to hold an additional identifier for annotations using certain evidence codes (IC, IEA, IGI, IPI, ISS).

• IC: if you are inferring an association based on another, populate the From field with the associated GO ID number. Mandatory for IC.

• Example: there is evidence that a particular gene product has the function transcription factor activity. There is no evidence whatsoever that this gene product has the cellular location nucleus, but this would be a perfectly reasonable inference for a curator to make (if the curator is annotating a eukaryotic gene product, of course). Both associations would have the same reference.

• molecular_function: general RNA polymerase II transcription factor ; GO:0016251 | inferred from sequence similarity

• cellular_location: nucleus ; GO:0005634 | inferred by curator from GO:0016251

• IEA and ISS: use “with” to populate the field with a gene identifier (DB:gene_id) unless ISS is used to denote predicted sequence features (such as hydrophobicity, alpha-helices, etc.).

• IGI and IPI: use ‘with’ to include an identifier for the "other" gene involved in the interaction. The entry in the "with" column does not have to refer to the same species that is being annotated.
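The With (or) From rules above lend themselves to a simple automated check. The sketch below is one possible reading of those rules, not an official validator: the function name check_with_field is hypothetical, and flagging a With entry on codes outside the listed five is an assumption drawn from the first bullet of this slide.

```python
import re

# Evidence codes that the slide lists as using the With (or) From field.
WITH_CODES = {"IC", "IEA", "IGI", "IPI", "ISS"}
GO_ID = re.compile(r"^GO:\d{7}$")


def check_with_field(evidence, with_value):
    """Return a list of problems found in the With/From column (empty if none)."""
    problems = []
    ids = [v for v in with_value.split("|") if v]
    if evidence == "IC":
        # The slide states that With/From is mandatory for IC and holds a GO ID.
        if not ids:
            problems.append("IC requires a GO ID in With/From")
        elif not all(GO_ID.match(v) for v in ids):
            problems.append("IC With/From entries must be GO IDs")
    elif ids and evidence not in WITH_CODES:
        # Assumption: an entry here is unexpected for any other code.
        problems.append(f"{evidence} does not normally take a With/From entry")
    return problems


print(check_with_field("IC", ""))            # one problem reported
print(check_with_field("IC", "GO:0016251"))  # no problems
```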

IGI, IPI and ‘WITH’

For IGI & IPI codes, we recommend making an entry in the "with" column (i.e. include an identifier for the "other" gene involved in the interaction). If more than one independent genetic interaction supports the association, use separate lines for each. In cases where the gene of interest interacts simultaneously with more than one other gene, put both/all of the interacting genes on the same line (separate identifiers by pipes in the "with" column).

To help clarify:

GOterm IGI FB:gene1|FB:gene2

means that the GO term is supported by evidence from its interaction with both of these genes; i.e. neither of these statements is true:

GOterm IGI FB:gene1

GOterm IGI FB:gene2
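The difference between one line with piped identifiers and several separate lines can be shown with a short sketch. The function name interacting_sets and the three-column row layout are hypothetical simplifications of a full gene association line.

```python
def interacting_sets(rows):
    """Turn (go_term, evidence, with_column) rows into evidence sets.

    Each row is one independent piece of evidence. Identifiers separated
    by pipes within a row were all required together for that evidence.
    """
    return [(go, ev, frozenset(w.split("|"))) for go, ev, w in rows if w]


# One line: the inference needs gene1 AND gene2 acting together.
simultaneous = interacting_sets([("GOterm", "IGI", "FB:gene1|FB:gene2")])

# Two lines: two independent interactions, each sufficient on its own.
independent = interacting_sets([
    ("GOterm", "IGI", "FB:gene1"),
    ("GOterm", "IGI", "FB:gene2"),
])

print(len(simultaneous), len(independent))
```

Keeping each row as its own set preserves the distinction the slide draws, which would be lost if all identifiers were pooled into one list.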

9. Aspect

• either P (biological process), F (molecular function) or C (cellular component) • this field is mandatory

10. DB_Object_Name

• name of gene or gene product

• white space allowed

• this field is not mandatory

11. DB_Object_Synonym

• any alternative names (pipes delimited)

• aids searching so be thorough

• generally, names that differ only by case are not recorded, but names that differ by punctuation are, e.g. CAP22|CAP-22

• white space allowed

• this field is not mandatory

12. DB_Object_Type

• refers to the database entry, i.e. does it match a gene, transcript, protein, protein_structure or complex

• we will enter ‘protein’

• MUST match the database entry identified by DB_Object_ID

• this column does not reflect anything about the GO term or the evidence on which the annotation is based, e.g. if your database entry represents a gene, then 'gene' goes in the DB_Object_Type column, even if the annotation is relevant to the localization of a protein product of the gene

• several alternative transcripts from one gene may be annotated separately, each with a unique transcript DB_Object_ID, but list the same gene symbol in the DB_Object_Symbol column.

• this field is mandatory

13. taxon

• refers to the taxon identifier(s)

• usually use the ID of the species encoding the gene product

• can also have 2 IDs: the first ID is that of the species encoding the gene product; the second ID is that of the species using the gene product (pipes delimited)

Chicken: 9031

Corn: 4577

Cow: 9913 (Bos taurus)

Channel catfish: 7998 (Ictalurus punctatus)
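A taxon column with one or two IDs can be split as in the sketch below. The function name parse_taxon is hypothetical. Accepting an optional "taxon:" prefix is an assumption, since the slides show bare numbers only; the two-ID example simply reuses the cow and chicken IDs and does not describe a real annotation.

```python
def parse_taxon(column):
    """Split a taxon column into (encoding_species_id, using_species_id).

    The second value is None when only one ID is given.
    Assumption: IDs may be written bare ("9031") or as "taxon:9031".
    """
    ids = [int(part.split(":")[-1]) for part in column.split("|")]
    if not 1 <= len(ids) <= 2:
        raise ValueError("taxon column must hold one or two IDs")
    encoding = ids[0]
    using = ids[1] if len(ids) == 2 else None
    return encoding, using


print(parse_taxon("9031"))                    # chicken only
print(parse_taxon("taxon:9913|taxon:9031"))   # illustrative two-ID value
```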

14. Date

• date on which the association was made

• YYYYMMDD format

15. Assigned_by

• refers to the database that made the association (AgBase; ‘AB’)

• used for tracking the source of the annotation
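Putting the 15 fields together, one line of a gene association file can be parsed and checked roughly as follows. This is a minimal sketch, not an official parser. The slides do not say whether With (or) From, taxon, Date and Assigned_by are mandatory; treating With (or) From as optional and the other three as mandatory is an assumption that matches the count of 11 compulsory fields. The accession, symbol and PubMed ID in the example line are placeholders, not real records.

```python
from datetime import datetime

# The 15 columns in slide order (names adapted to be valid dictionary keys).
FIELDS = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "Qualifier", "GO_ID",
    "DB_Reference", "Evidence", "With_From", "Aspect", "DB_Object_Name",
    "DB_Object_Synonym", "DB_Object_Type", "Taxon", "Date", "Assigned_by",
]
# Assumption: these four may be left empty; the remaining 11 are compulsory.
OPTIONAL = {"Qualifier", "With_From", "DB_Object_Name", "DB_Object_Synonym"}
EVIDENCE_CODES = {"IMP", "IGI", "IPI", "ISS", "IDA", "IEP",
                  "IEA", "TAS", "NAS", "ND", "IC", "RCA"}


def parse_association(line):
    """Parse one tab-delimited line into a dict, raising ValueError on problems."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(values)}")
    record = dict(zip(FIELDS, values))
    missing = [n for n in FIELDS if n not in OPTIONAL and not record[n]]
    if missing:
        raise ValueError(f"missing mandatory fields: {missing}")
    if record["Evidence"] not in EVIDENCE_CODES:
        raise ValueError(f"unknown evidence code: {record['Evidence']}")
    if record["Aspect"] not in {"P", "F", "C"}:
        raise ValueError(f"aspect must be P, F or C: {record['Aspect']}")
    if len(record["Date"]) != 8:
        raise ValueError("date must be in YYYYMMDD format")
    datetime.strptime(record["Date"], "%Y%m%d")  # ValueError if not a real date
    return record


# Placeholder line: P00000, GENE1 and PMID:0000000 are made up for illustration.
example = "\t".join([
    "UniProt", "P00000", "GENE1", "", "GO:0005634", "PMID:0000000",
    "IDA", "", "C", "", "", "protein", "9031", "20051114", "AB",
])
record = parse_association(example)
print(record["GO_ID"], record["Evidence"], record["Aspect"])
```

Note that the file carries only the GO ID, never the term name, so a parser like this has nothing to cross-check the ID against without loading the ontology itself.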


HOMEWORK!

1. Curate these papers:

PMID: 15356338 7490283 1512296 12097608 11119244

2. Annotate this gene product:

UniProt: Q9IAK1

Try to get comprehensive GO annotation

Gene Ontology Annotation Training: Electronic Annotation and GO Tools

Fiona McCarthy November 2005

Overview

• Electronic Annotation Strategies

– Interpro2GO

– spkw2GO

– ec2GO

• Blast Strategies

– SeqHound

– UniRef

• GO Slims

• GO Tools

– GO slims

– GO tools available from the GO Homepage

– Other sources for GO tools

– GO Tools available from AgBase

IEA: Electronic Annotation

November 14, 2005

Mappings of External Classification Systems to GO

http://www.geneontology.org/GO.indices.shtml

IEA Mappings at AgBase

• IEA mappings provide “higher order” or more general GO terms

• allow large scale assignment of function

• ALL IEA mappings need to be updated monthly

• currently can take the IEA mappings from UniProt/GOA

• Process needs to be updateable (relies on protocol references)

• Need to be able to look for obsolete terms and to change mappings as GO terms & their mappings change.

ISS: Blast Strategies

SeqHound

• uses gi numbers

• processes multiple entries

• can process whole taxon entries

• uses ‘ontoglyphs’ – may lose granularity

• shows sequence neighbors (but not their alignments)

UniRef Database

GO Tools

1. GO Slims

What is a GO Slim?

• GO slims are cut-down versions of the GO ontologies containing a subset of the terms in the whole GO. They give a broad overview of the ontology content without the detail of the specific fine grained terms.

• GO slims are particularly useful for giving a summary of the results of GO annotation.

• GO slims are created by users according to their needs, and may be specific to species or to particular areas of the ontologies.

• GO provides a generic GO slim which should be suitable for most purposes. Alternatively, users can create their own GO slims; TAIR (plant), SGD (yeast) and GOA (generic) have submitted GO slims which have been integrated into the GO flat file.

http://www.geneontology.org/GO.slims.shtml

map2slim.pl

• The map2slim.pl script, distributed as part of the go-perl package, maps a set of annotations up to their parent GO slim terms.

• Further documentation, installation help and instructions on running the script are available from:

http://www.godatabase.org/dev/pod/scripts/map2slim.html

http://search.cpan.org/~cmungall/go-perl-0.01/
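The idea of mapping annotations up to slim terms can be illustrated with a few lines of Python. This is a simplified sketch of the concept only. It is not the algorithm used by map2slim.pl, the graph and slim set are toys rather than real GO data, and the names TOY_PARENTS, TOY_SLIM and map_to_slim are hypothetical.

```python
from collections import Counter

# Toy parent map (child -> parents) and toy slim. NOT real GO data.
TOY_PARENTS = {
    "nucleolus": ["nucleus"],
    "nucleus": ["organelle"],
    "organelle": ["cellular component"],
    "cellular component": [],
}
TOY_SLIM = {"nucleus", "cellular component"}


def map_to_slim(term, parents, slim):
    """Return the nearest slim term(s) at or above `term`."""
    if term in slim:
        return {term}
    found, seen, stack = set(), set(), [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent in seen:
                continue
            seen.add(parent)
            if parent in slim:
                found.add(parent)      # stop climbing along this path
            else:
                stack.append(parent)
    return found


# Summarise a set of annotations as counts per slim term.
annotated = ["nucleolus", "nucleus", "organelle"]
counts = Counter(
    slim_term
    for term in annotated
    for slim_term in map_to_slim(term, TOY_PARENTS, TOY_SLIM)
)
print(counts)
```

Counts of this kind are what a summary chart of slim categories would be drawn from.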

Figure 3. GO Slim Viewer and its Output

A. The GO Slim Viewer Page.

B. GO Slim Viewer Output and a Chart plotted in Excel using this output.

[Chart] Membrane Proteins: Biological Process. Series: B-cells, Stroma. Categories: ion/proton transport, cell migration, cell adhesion, cell growth, apoptosis, immune response, cell cycle/cell proliferation, cell-cell signaling, function unknown, development, endocytosis, proteolysis and peptidolysis, protein modification, signal transduction.

[Chart] Nuclear Proteins: Biological Process. Series: B-cells, Stroma. Categories: chromosome organization and biogenesis, DNA metabolism, chromatin assembly or disassembly, cell differentiation, cell growth, chromosome segregation, chromatin modification, cell cycle/cell proliferation, function unknown, nuclear organization and biogenesis, protein catabolism, protein modification, RNA processing, nuclear transport, regulation of transcription (DNA-dependent), signal transduction.

2. GO tools available from the GO Homepage

3. Other sources for GO tools

Using GO Tools

• new GO tools may not be listed on the GO Consortium webpage

• check the PubMed literature for new GO Tools

• always check when the latest DAG version and GO annotations were loaded into a GO tool

4. GO Tools available from AgBase

Online Demonstration

Nan Wang Computer Science & Engineering, MSU
