Protein and RNA Families Function Prediction. Tell me what you do and I will tell you who you are...


Protein and RNA Families

Function Prediction

Tell me what you do

and I will tell you who you are …

From multiple alignments we can derive:

• A motif
• A profile (PSSM)
• A Hidden Markov Model

MOTIF

Rxx(F,Y,W)(R,K)SAQ
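
A motif written this way maps directly onto a regular expression. The following is a minimal sketch, assuming that `x` stands for any residue and that a parenthesized list gives the residues allowed at one position; the function name and the test sequence are invented for illustration.

```python
import re

# Slide motif Rxx(F,Y,W)(R,K)SAQ, assuming x = any residue and
# (A,B,C) = exactly one of the listed residues.
MOTIF = re.compile(r"R..[FYW][RK]SAQ")

def find_motif(sequence):
    """Return (start, matched_text) for every non-overlapping motif hit."""
    return [(m.start(), m.group()) for m in MOTIF.finditer(sequence)]

# Hypothetical sequence, made up for this example.
print(find_motif("MKRGAFKSAQLL"))
```

A motif either matches or it does not; it carries no notion of how well a position fits, which is what profiles and HMMs add.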

Profile Scoring

Profile Hidden Markov Model (profile HMM)

• An MSA can be described by an HMM
• An HMM is a probabilistic model of the MSA consisting of a number of interconnected states
• The different states are match, delete or insert
• Each position is modeled independently
• The concatenation of the probabilistic models of the positions is the protein model

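Profile HMM

[Figure: profile HMM for alignment columns 16 to 19, with delete states D16 to D19, match states M16 to M19 and insert states I16 to I19. Insert states emit any residue (X).]

Alignment (columns 16 17 18 19):

D R T R
D R T S
S - - S
S P T R
D R T R
D P T S
D - - S
D - - S
D - - S
D - - R

Match-state emission probabilities:

M16: D 0.8, S 0.2
M17: P 0.4, R 0.6
M18: T 1.0
M19: R 0.4, S 0.6

Transition labels shown in the figure: 50% and 50% where the path splits between match and delete (half of the sequences have a gap in columns 17 and 18), and 100% on the other labeled transitions.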
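
The emission probabilities in the figure can be recovered by counting residues in each alignment column. The sketch below assumes plain relative frequencies with no pseudocounts (real profile-HMM tools usually add them), and the split of the slide's alignment into ten rows of four columns is a reading of the figure; `column_model` is a name chosen for this example.

```python
from collections import Counter

# Ten aligned rows for columns 16-19, as read from the slide; "-" is a gap.
ALIGNMENT = [
    "DRTR", "DRTS", "S--S", "SPTR", "DRTR",
    "DPTS", "D--S", "D--S", "D--S", "D--R",
]

def column_model(rows, col):
    """Return (emissions, match_fraction) for one alignment column.

    emissions: residue -> relative frequency among the non-gap rows.
    match_fraction: share of rows that have a residue (not a gap) here.
    """
    residues = [row[col] for row in rows if row[col] != "-"]
    counts = Counter(residues)
    emissions = {aa: n / len(residues) for aa, n in counts.items()}
    return emissions, len(residues) / len(rows)

for offset in range(4):
    emissions, match_fraction = column_model(ALIGNMENT, offset)
    print(16 + offset, emissions, match_fraction)
```

Columns 17 and 18 have a match fraction of 0.5, which corresponds to the 50% split between the match and delete paths.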

Protein Domains

• Domains can be considered as building blocks of proteins.

• Some domains can be found in many proteins with different functions, while others are only found in proteins with a certain function.

• The presence of a particular domain can be indicative of the function of the protein.

C2H2 Zinc-Finger

DNA-binding domain: Zinc-Finger

PROSITE

• PROSITE is a database of protein domains that can be searched with either regular expression patterns or sequence profiles.

Zinc_Finger_C2H2 Cx{2,4}Cx3(L,I,V,M,F,Y,W,C)x8Hx{3,5}H
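
Searching with such a pattern amounts to translating it into a regular expression. The sketch below assumes the slide's pattern is equivalent to the PROSITE-style string `C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H`; the helper `prosite_to_regex` is written for this example and covers only the elements used here (single residues, `x`, bracketed alternatives and repeat counts), not the full PROSITE syntax.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex string.

    Handles only single residues, x (any residue), [..] alternatives,
    and (n) or (n,m) repeat counts.
    """
    parts = []
    for element in pattern.split("-"):
        element = element.replace("x", ".")                    # x = any residue
        element = element.replace("(", "{").replace(")", "}")  # repeat counts
        parts.append(element)
    return "".join(parts)

# The slide's zinc-finger pattern in PROSITE-style syntax (assumed equivalent).
ZF_C2H2 = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
ZF_REGEX = re.compile(prosite_to_regex(ZF_C2H2))

print(ZF_REGEX.pattern)
# Hypothetical sequence constructed to satisfy the pattern.
print(bool(ZF_REGEX.search("GGCAACAAAFAAAAAAAAHAAAH")))
```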

Pfam

• The Pfam database is based on two distinct classes of alignments:
– Seed alignments, which are deemed to be accurate and are used to produce Pfam-A
– Alignments derived by automatic clustering of SwissProt, which are less reliable and give rise to Pfam-B

• Pfam is a database that contains a large collection of multiple sequence alignments and profile hidden Markov models (HMMs).

• High-quality seed alignments are used to build HMMs to which sequences are aligned

Pfam coverage

[Figure: Pfam Coverage. x-axis: Number of Families (0 to 8000); y-axis: Percentage Coverage of UniProt (0 to 100).]

● The first 2000 families covered ~65% of UniProt
● Currently, 7503 families cover 74% of UniProt

InterPro

Built from protein classification databases, such as:

• PROSITE
• ProDom
• SMART
• Pfam
• PRINTS

A total of 10403 entries

Uses UniProt (Swiss-Prot and TrEMBL)

Applications of InterPro

Diagnostic protein family signature database for:

• Classification of proteins through text and sequence search tools

• Large-scale classification

• Enhancing genome annotation (fly, human, rice, mouse)

• Proteome Analysis

GO (Gene Ontology) http://www.geneontology.org/

• The GO project aims to develop three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes (P), cellular components (C) and molecular functions (F) in a species-independent manner. There are three separate aspects to this effort: first, to write and maintain the ontologies themselves; second, to make associations between the ontologies and the genes and gene products in the collaborating databases; and third, to develop tools that facilitate the creation, maintenance and use of ontologies.

Ontology is a description of the concepts and relationships that can exist for an agent or a community of agents

InterPro to GO

InterPro: IPR000003 Retinoic acid receptor > GO: DNA binding GO:0003677

InterPro: IPR000005 AraC type helix-turn-helix > GO: transcription factor GO:0003700
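
Mapping lines of this shape are easy to parse into structured records. The sketch below assumes the line layout shown on the slide (accession, entry name, `>`, GO term name, GO identifier); the distributed mapping file may differ in punctuation, and `parse_mapping` is a name chosen for this example.

```python
import re

# Line layout assumed from the slide:
#   InterPro: <accession> <entry name> > GO: <term name> <GO id>
LINE = re.compile(
    r"InterPro:\s*(?P<ipr>IPR\d{6})\s+(?P<entry>.+?)\s*>\s*"
    r"GO:\s*(?P<term>.+?)\s+(?P<go>GO:\d{7})\s*$"
)

def parse_mapping(line):
    """Split one mapping line into its four fields, or return None."""
    match = LINE.match(line.strip())
    return match.groupdict() if match else None

print(parse_mapping(
    "InterPro: IPR000003 Retinoic acid receptor > GO: DNA binding GO:0003677"
))
```

Once parsed, any protein that matches the InterPro entry can be assigned the corresponding GO term.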

Database and Tools for protein families and domains

• InterPro - Integrated Resource of Protein Domains and Functional Sites
• PROSITE - A database of protein families and domains
• BLOCKS - BLOCKS db
• Pfam - Protein families db (HMM derived)
• PRINTS - Protein motif fingerprint db
• ProDom - Protein domain db (automatically generated)
• PROTOMAP - An automatic hierarchical classification of Swiss-Prot proteins
• SBASE - SBASE domain db
• SMART - Simple Modular Architecture Research Tool
• TIGRFAMs - TIGR protein families db

Clusters of Orthologous Groups of proteins (COGs)

Classification of conserved genes according to their homologous relationships (Koonin et al., NAR).

Homologs - Proteins with a common evolutionary origin

Paralogs - Proteins encoded within a given species that arose from one or more gene duplication events.

Orthologs - Proteins from different species that evolved by vertical descent (speciation).

Clusters of Orthologous Groups of proteins (COGs)

Each COG consists of individual orthologous proteins or orthologous sets of paralogs from at least three lineages.

Orthologs typically have the same function, allowing transfer of functional information from one member to an entire COG.

COGS - Clusters of orthologous groups

* All-against-all sequence comparison of the proteins encoded in completed genomes (paralogs/orthologs)

* For a given protein “a” in genome A, if there are several similar proteins in genome B, the most similar one is selected

* If, when protein “b” is used as a query, protein “a” in genome A is selected as the best hit, then “a” and “b” can be included in a COG

* Proteins in a COG are more similar to other proteins in the COG than to any other protein in the compared genomes

* A COG is defined when it includes at least three homologous proteins from three distant genomes
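
The steps above can be sketched as a search for reciprocal best hits that close into a triangle across three genomes. This is a simplified illustration only: the similarity scores, protein names and function names are invented, and the published COG procedure includes further steps (for example, handling of paralogs) that are not covered here.

```python
from itertools import combinations

def best_hits(scores, genome_of):
    """For each protein, find its most similar protein in every other genome.

    scores: dict mapping (query, subject) -> similarity score.
    genome_of: dict mapping protein name -> genome name.
    Returns a dict mapping (query, other_genome) -> best subject.
    """
    best = {}
    for (query, subject), score in scores.items():
        if genome_of[query] == genome_of[subject]:
            continue  # same genome: not a cross-genome hit
        key = (query, genome_of[subject])
        if key not in best or score > scores[(query, best[key])]:
            best[key] = subject
    return best

def minimal_cogs(scores, genome_of):
    """Return triangles of mutually best hits from three different genomes."""
    best = best_hits(scores, genome_of)

    def reciprocal(a, b):
        return (best.get((a, genome_of[b])) == b
                and best.get((b, genome_of[a])) == a)

    triangles = []
    for trio in combinations(sorted(genome_of), 3):
        if len({genome_of[p] for p in trio}) < 3:
            continue  # need three distinct genomes
        if all(reciprocal(a, b) for a, b in combinations(trio, 2)):
            triangles.append(trio)
    return triangles

# Hypothetical proteins and scores; a2 is a weaker paralog of a1 in genome A.
genome_of = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}
pair_scores = {("a1", "b1"): 90, ("a1", "c1"): 80, ("b1", "c1"): 85,
               ("a2", "b1"): 40, ("a2", "c1"): 35}
scores = {}
for (p, q), s in pair_scores.items():
    scores[(p, q)] = s  # treat similarity as symmetric
    scores[(q, p)] = s

print(minimal_cogs(scores, genome_of))
```

In this toy data set a2 finds b1 as its best hit in genome B, but b1's best hit in genome A is a1, so a2 is not placed in the triangle.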

Distribution of functional categories in the COGs database

[Figure: distribution of functional categories; labeled categories include "Function unknown" and "General function prediction only".]

Information in COGS

* Annotation of proteins by members of known structure/function

* Phylogenetic patterns - presence or absence of proteins in a given organism --> Enables following metabolic pathways

* Multiple alignments
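
A phylogenetic pattern can be represented as a presence/absence string over a fixed list of genomes. The sketch below is illustrative only; the genome labels, protein names and the function name are hypothetical.

```python
def phylogenetic_pattern(cog_members, genomes):
    """Return a '+'/'-' string marking which genomes contribute to a COG.

    cog_members: iterable of (genome, protein) pairs in the COG.
    genomes: ordered list of genome names to report on.
    """
    present = {genome for genome, _protein in cog_members}
    return "".join("+" if g in present else "-" for g in genomes)

# Hypothetical COG with members from genomes A and C only.
print(phylogenetic_pattern([("A", "a1"), ("C", "c1")], ["A", "B", "C"]))
```

Comparing such patterns across the COGs of one pathway shows which steps appear to be missing in a given organism.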

Discovering common motifs in unaligned sequences

MEME can be used for protein sequences as well as for DNA sequences

RNA families

• Rfam : General non-coding RNA database

(most of the data is taken from specific databases)

http://www.sanger.ac.uk/Software/Rfam/

Includes many families of non-coding RNAs and functional motifs, as well as their alignments and secondary structures

Rfam (currently version 6.1)

• 379 different RNA families or functional motifs from mRNA UTRs, etc.

[Figure labels: GENE, INTRON, Cis ELEMENTS]

An example of an RNA family: miR-1 microRNAs