Protein and RNA Families Function Prediction

Protein and RNA Families Function Prediction. Tell me what you do and I will tell you who you are …

Protein and RNA Families

Function Prediction

Tell me what you do and I will tell you who you are …

From multiple alignments we can derive:

• A motif
• A profile (PSSM)
• A Hidden Markov Model

MOTIF

Rxx(F,Y,W)(R,K)SAQ
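A motif written this way maps directly onto a regular expression. The sketch below assumes that x stands for any single residue and that a parenthesized, comma-separated group lists the residues allowed at that position. The function name and the example sequence are made up for illustration.

```python
import re

# Slide motif Rxx(F,Y,W)(R,K)SAQ as a regular expression.
# Assumption: "x" = any residue, "(F,Y,W)" = one of F, Y or W.
MOTIF = re.compile(r"R..[FYW][RK]SAQ")


def find_motif(sequence):
    """Return (0-based start, matched text) for each non-overlapping hit."""
    return [(m.start(), m.group()) for m in MOTIF.finditer(sequence)]


# Hypothetical sequence, not taken from any database.
print(find_motif("MKRLAFKSAQGG"))  # [(2, 'RLAFKSAQ')]
```

A motif match is all-or-nothing: a single substitution at a constrained position (for example A instead of F, Y or W) removes the hit entirely, which is the limitation that profiles and HMMs address.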

Profile Scoring
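One common way to score a sequence against a profile is a log-odds position-specific scoring matrix. The sketch below is a minimal illustration, not the scheme from the original slide: the uniform background, the pseudocount of 1, and the function names are all assumptions, and the toy alignment is gap-free.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Build a log-odds PSSM from gap-free, equal-length sequences.

    Assumptions: uniform background frequency (1/20) and a flat
    pseudocount added to every residue in every column.
    """
    background = 1.0 / len(ALPHABET)
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return pssm


def score(pssm, segment):
    """Sum the per-position log-odds scores for a segment of matching length."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Toy gap-free alignment (illustrative only).
pssm = build_pssm(["DRTR", "DRTS", "SPTR", "DRTR", "DPTS"])
print(score(pssm, "DRTR") > score(pssm, "AAAA"))  # True
```

Unlike a motif, the profile gives every residue a score at every position, so a sequence that deviates at one position is penalized rather than rejected outright.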

Profile Hidden Markov Model (profile HMM)

• An MSA can be described by a HMM
• HMM is a probabilistic model of the MSA consisting of a number of interconnected states
• The different states are match, delete or insert.
• Each position is modeled independently
• The concatenation of the probabilistic models of the positions is the protein model.

Profile HMM

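[Figure: profile HMM for alignment columns 16 to 19]

States:

Delete: D16 D17 D18 D19
Match: M16 M17 M18 M19
Insert: I16 I17 I18 I19

Match-state emission probabilities:

M16: D 0.8, S 0.2
M17: P 0.4, R 0.6
M18: T 1.0
M19: R 0.4, S 0.6

Transition labels shown in the figure: 50%, 50%, and 100% (four arrows); four transitions are marked X.

Alignment:

16 17 18 19
D  R  T  R
D  R  T  S
S  -  -  S
S  P  T  R
D  R  T  R
D  P  T  S
D  -  -  S
D  -  -  S
D  -  -  S
D  -  -  R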
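The emission probabilities on this slide can be reproduced by counting residues per alignment column. The sketch below uses the ten alignment rows from the slide and plain frequency counts with no pseudocounts or priors, which is an assumption made for illustration; production tools apply more elaborate estimates. The function names are hypothetical.

```python
from collections import Counter

# The ten rows of the slide's alignment, columns 16 to 19 ("-" = gap).
ALIGNMENT = [
    "DRTR", "DRTS", "S--S", "SPTR", "DRTR",
    "DPTS", "D--S", "D--S", "D--S", "D--R",
]


def match_emissions(alignment):
    """Per-column residue frequencies, ignoring gaps (no pseudocounts)."""
    emissions = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        counts = Counter(residues)
        emissions.append({aa: n / len(residues) for aa, n in counts.items()})
    return emissions


def gap_fraction(alignment, col):
    """Fraction of sequences with a gap in the 0-based column `col`."""
    return sum(row[col] == "-" for row in alignment) / len(alignment)


for position, probs in zip(range(16, 20), match_emissions(ALIGNMENT)):
    print(f"M{position}", probs)
print("gap fraction at column 17:", gap_fraction(ALIGNMENT, 1))
```

Half of the sequences have a gap at columns 17 and 18, which is consistent with the two 50% transition labels in the figure (one branch continuing through match states, the other through delete states).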

Protein Domains

• Domains can be considered as building blocks of proteins.

• Some domains can be found in many proteins with different functions, while others are only found in proteins with a certain function.

• The presence of a particular domain can be indicative of the function of the protein.

C2H2 Zinc-Finger

DNA Binding domain

Zinc-Finger

PROSITE

• ProSite is a database of protein domains that can be searched by either regular expression patterns or sequence profiles.

Zinc_Finger_C2H2 Cx{2,4}Cx3(L,I,V,M,F,Y,W,C)x8Hx{3,5}H
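To search a sequence with a pattern like this, it can be translated into a regular expression. The sketch below converts the slide's shorthand (x for any residue, x3 for a fixed repeat, x{2,4} for a range, and parenthesized comma-separated alternatives). This shorthand is not the official PROSITE syntax, which writes the same pattern with hyphens, x(n) and square brackets; the converter name and the test peptide are hypothetical.

```python
import re

# One token of the slide's shorthand: x{a,b} | xN | x | (A,B,...) | literal residue.
_TOKEN = re.compile(r"x\{(\d+),(\d+)\}|x(\d+)|x|\(([A-Z,]+)\)|([A-Z])")


def slide_pattern_to_regex(pattern):
    """Translate the slide's motif shorthand into a Python regex string."""
    parts = []
    pos = 0
    while pos < len(pattern):
        m = _TOKEN.match(pattern, pos)
        if m is None:
            raise ValueError(f"cannot parse pattern at offset {pos}: {pattern!r}")
        low, high, repeat, group, literal = m.groups()
        if low is not None:          # x{2,4} -> .{2,4}
            parts.append(".{%s,%s}" % (low, high))
        elif repeat is not None:     # x3 -> .{3}
            parts.append(".{%s}" % repeat)
        elif group is not None:      # (L,I,V) -> [LIV]
            parts.append("[" + group.replace(",", "") + "]")
        elif literal is not None:    # C -> C
            parts.append(literal)
        else:                        # bare x -> .
            parts.append(".")
        pos = m.end()
    return "".join(parts)


regex = slide_pattern_to_regex("Cx{2,4}Cx3(L,I,V,M,F,Y,W,C)x8Hx{3,5}H")
print(regex)  # C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H

# Hypothetical peptide written to fit the pattern.
hit = re.search(regex, "YKCPECGKSFSQKSNLQKHQRTHTG")
print(hit.span(), hit.group())  # (2, 23) CPECGKSFSQKSNLQKHQRTH
```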

Pfam

• The Pfam database is based on two distinct classes of alignments
  – Seed alignments, which are deemed to be accurate and used to produce Pfam A
  – Alignments derived by automatic clustering of SwissProt, which are less reliable and give rise to Pfam B

• Database that contains a large collection of multiple sequence alignments and profile hidden Markov models (HMMs).

• High-quality seed alignments are used to build HMMs to which sequences are aligned

Pfam coverage

[Figure: Pfam Coverage. x-axis: Number Of Families (0 to 8000); y-axis: Percentage Coverage Of UniProt (0 to 100)]

● First 2000 families covered ~65% of UniProt
● Currently, 7503 families cover 74% of UniProt

InterPro

Was built from protein classification databases, such as:

• PROSITE
• ProDom
• SMART
• Pfam
• PRINTS

A total of 10403 entries

Uses UniProt = SWISSPROT and TrEMBL

Applications of InterPro

Diagnostic protein family signature database for:

• Classification of proteins through text and sequence search tools

• Large-scale classification

• Enhancing genome annotation: fly, human, rice, mouse

• Proteome Analysis

GO (Gene Ontology)

http://www.geneontology.org/

• The GO project aims to develop three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes (P), cellular components (C) and molecular functions (F) in a species-independent manner. There are three separate aspects to this effort: first, to write and maintain the ontologies themselves; second, to make associations between the ontologies and the genes and gene products in the collaborating databases; and third, to develop tools that facilitate the creation, maintenance and use of ontologies.

An ontology is a description of the concepts and relationships that can exist for an agent or a community of agents.

InterPro to GO

InterPro: IPR000003 Retinoic acid receptor > GO: DNA binding GO:0003677

InterPro: IPR000003 AraC type helix-turn-helix > GO: transcription factor GO:0003700
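Mapping lines of this shape are easy to turn into structured records. The sketch below parses a line as it is printed on the slide; the regular expression, the function name, and the tolerance for an optional semicolon before the GO identifier are assumptions, not a description of any official file parser.

```python
import re

# Assumed layout: "InterPro: <IPR id> <entry name> > GO: <term name> <GO id>",
# optionally with ";" between the term name and the GO id.
_LINE = re.compile(
    r"InterPro:\s*(IPR\d{6})\s+(.*?)\s*>\s*GO:\s*(.*?)\s*;?\s*(GO:\d{7})\s*$"
)


def parse_mapping(line):
    """Split one InterPro-to-GO mapping line into its four fields."""
    m = _LINE.match(line)
    if m is None:
        raise ValueError(f"unrecognized mapping line: {line!r}")
    ipr_id, entry_name, go_term, go_id = m.groups()
    return {
        "interpro": ipr_id,
        "entry": entry_name,
        "go_term": go_term,
        "go_id": go_id,
    }


record = parse_mapping(
    "InterPro: IPR000003 Retinoic acid receptor > GO: DNA binding GO:0003677"
)
print(record["interpro"], "->", record["go_id"])  # IPR000003 -> GO:0003677
```

Once parsed, any protein that matches the InterPro entry can inherit the associated GO term, which is how domain matches are turned into functional annotation.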

Databases and Tools for protein families and domains

• InterPro - Integrated Resource of Protein Domains and Functional Sites
• Prosite - A database of protein families and domains
• BLOCKS - BLOCKS db
• Pfam - Protein families db (HMM derived)
• PRINTS - Protein Motif fingerprint db
• ProDom - Protein domain db (Automatically generated)
• PROTOMAP - An automatic hierarchical classification of Swiss-Prot proteins
• SBASE - SBASE domain db
• SMART - Simple Modular Architecture Research Tool
• TIGRFAMs - TIGR protein families db

Clusters of Orthologous Groups of proteins (COGs)

Classification of conserved genes according to their homologous relationships. (Koonin et al., NAR)

Homologs - Proteins with a common evolutionary origin

Paralogs - Proteins encoded within a given species that arose from one or more gene duplication events.

Orthologs - Proteins from different species that evolved by vertical descent (speciation).

Clusters of Orthologous Groups of proteins (COGs)

Each COG consists of individual orthologous proteins or orthologous sets of paralogs from at least three lineages.

Orthologs typically have the same function, allowing transfer of functional information from one member to an entire COG.

COGS - Clusters of orthologous groups

* All-against-all sequence comparison of the proteins encoded in completed genomes (paralogs/orthologs)

* For a given protein “a” in genome A, if there are several similar proteins in genome B, the most similar one (“b”) is selected

* If, when protein “b” is used as a query, protein “a” in genome A is selected as the best hit, then “a” and “b” can be included in a COG

* Proteins in a COG are more similar to other proteins in the COG than to any other protein in the compared genomes

* A COG is defined when it includes at least three homologous proteins from three distant genomes
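The rules above can be sketched as a reciprocal-best-hit search followed by a check for three mutually consistent proteins from three genomes. This is a simplified illustration under stated assumptions: similarity scores are symmetric, proteins are named "genome:protein" (a hypothetical convention), paralogs are not merged into a COG, and overlapping triangles are not joined as the full COG procedure does. All function names and scores are made up.

```python
from itertools import combinations


def genome(protein):
    """Proteins are named 'genome:protein' in this sketch."""
    return protein.split(":")[0]


def best_hits(scores):
    """Map (query, target genome) -> most similar protein in that genome."""
    best = {}
    for (p, q), score in scores.items():
        for query, target in ((p, q), (q, p)):  # similarity treated as symmetric
            if genome(query) == genome(target):
                continue  # within-genome hits (paralogs) are ignored here
            key = (query, genome(target))
            if key not in best or score > best[key][1]:
                best[key] = (target, score)
    return {key: hit for key, (hit, _) in best.items()}


def reciprocal_pairs(best):
    """Pairs where each protein is the other's best hit in its genome."""
    pairs = set()
    for (query, _), target in best.items():
        if best.get((target, genome(query))) == query:
            pairs.add(frozenset((query, target)))
    return pairs


def minimal_cogs(pairs):
    """Triples from three genomes in which every pair is reciprocal."""
    proteins = sorted({p for pair in pairs for p in pair})
    return [
        trio
        for trio in combinations(proteins, 3)
        if len({genome(p) for p in trio}) == 3
        and all(frozenset(edge) in pairs for edge in combinations(trio, 2))
    ]


# Hypothetical similarity scores between proteins of genomes A, B and C.
SCORES = {
    ("A:a1", "B:b1"): 90, ("A:a1", "B:b2"): 40, ("A:a1", "C:c1"): 85,
    ("B:b1", "C:c1"): 80, ("B:b2", "C:c1"): 30,
}
print(minimal_cogs(reciprocal_pairs(best_hits(SCORES))))
# [('A:a1', 'B:b1', 'C:c1')]
```

In this toy data b2 is excluded: its best hit in genome A is a1, but a1's best hit in genome B is b1, so the relationship is not reciprocal.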

Distribution of functional categories in the COGs database

[Figure: distribution of functional categories; legible labels include "Function unknown" and "General function, prediction only"]

Information in COGS

* Annotation of proteins by members of known structure/function

* Phylogenetic patterns - presence or absence of proteins in a given organism --> Enables following metabolic pathways

* Multiple alignments
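A phylogenetic pattern can be represented as a presence/absence string over a fixed list of genomes. The sketch below builds such strings and groups COGs that share a pattern, since families with identical patterns are candidates for belonging to the same pathway. The genome names, COG labels, "+"/"-" encoding, and function names are hypothetical.

```python
from collections import defaultdict


def phylogenetic_pattern(genomes_with_member, all_genomes):
    """'+' where the COG has a member in the genome, '-' where it does not."""
    return "".join("+" if g in genomes_with_member else "-" for g in all_genomes)


def group_by_pattern(cog_to_genomes, all_genomes):
    """Group COG labels by their presence/absence pattern."""
    groups = defaultdict(list)
    for cog, present in sorted(cog_to_genomes.items()):
        groups[phylogenetic_pattern(present, all_genomes)].append(cog)
    return dict(groups)


# Hypothetical genomes and COG memberships.
GENOMES = ["A", "B", "C", "D"]
COGS = {
    "COG_x": {"A", "B", "D"},
    "COG_y": {"A", "B", "D"},
    "COG_z": {"A", "B", "C", "D"},
}
print(group_by_pattern(COGS, GENOMES))
# {'++-+': ['COG_x', 'COG_y'], '++++': ['COG_z']}
```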

Discovering common motifs in unaligned sequences

MEME: can be used for protein sequences as well as for DNA sequences

RNA families

• Rfam: general non-coding RNA database (most of the data is taken from specific databases)

http://www.sanger.ac.uk/Software/Rfam/

Includes many families of non-coding RNAs and functional motifs, as well as their alignments and secondary structures

Rfam (currently version 6.1)

• 379 different RNA families or functional motifs from mRNA UTRs etc.

[Figure labels: GENE, INTRON, Cis ELEMENTS]

An example of an RNA family: miR-1 microRNAs