
Percentage of Domain Sequences in Genomes


Page 1: Percentage of Domain Sequences in Genomes

[Figure: Coverage per sequence. X-axis: Families, ordered by size (0 to 6000). Y-axis: Percentage of Domain Sequences in Genomes (0 to 90). Series: all; excluding singletons; excluding singletons and filtering.]

Genome Coverage by Domain Superfamilies

~50% of domain sequences in the genomes are contained in ~1000 CATH/SCOP domain superfamilies

Further ~20% of sequences belong to ~1400 Pfam superfamilies with no structural relative

PSI2 currently targeting these ~1400 superfamilies
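
A coverage curve of this kind can be sketched as follows. This is a minimal illustration, not the analysis behind the slide: the function names and the toy family sizes are hypothetical, and the only assumption is that families are ranked from largest to smallest and their member counts accumulated.

```python
from itertools import accumulate


def cumulative_coverage(family_sizes):
    """Cumulative percentage of sequences covered as families are added,
    largest family first."""
    ordered = sorted(family_sizes, reverse=True)
    total = sum(ordered)
    if total == 0:
        return []
    return [100.0 * running / total for running in accumulate(ordered)]


def families_needed(family_sizes, target_percent):
    """Number of largest families needed to reach target_percent coverage,
    or None if the target is never reached."""
    for rank, covered in enumerate(cumulative_coverage(family_sizes), start=1):
        if covered >= target_percent:
            return rank
    return None


# Made-up family sizes, for illustration only.
toy_sizes = [50, 20, 10, 10, 5, 3, 1, 1]
curve = cumulative_coverage(toy_sizes)
```

Dropping families of size 1 from `family_sizes` before calling `cumulative_coverage` would give a curve that excludes singletons, if that is how the "excluding singletons" series was built.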

Page 2: Percentage of Domain Sequences in Genomes

[Diagram labels: Pfam superfamily; close sequence family (30%); ‘unique family’]

PSI2 targeting ~1400 LARGE superfamilies with no close structural relative

Page 3: Percentage of Domain Sequences in Genomes

[Diagram labels: All sequence families; Near/distant PDB relative; no PDB relative]

Targeting ~1400 Pfam superfamilies, but these contain tens of thousands of subfamilies

[Figure: X-axis: Subfamilies ordered by size (0 to 60000). Y-axis: Percentage of sequences (0 to 100). Additional label: Unique families ordered by size.]

Page 4: Percentage of Domain Sequences in Genomes

[Diagram labels: Pfam superfamily; close sequence family (30%); ‘unique family’]

target ~1400 LARGE Pfam superfamilies common with no structural relative

target clusters of families predicted to have different functions

Gene3D annotations: COG, GO, EC, DIP, BIND, Y2H, Microarray data, phylogenetic profiles

[Diagram label: functional group]

Page 5: Percentage of Domain Sequences in Genomes

Model quality v sequence identity for 78,545 structural genomics homology models, built by Modeller 8v1, assessed using ProSa II

Methods like ProSa and GA341 can identify reasonable models at low sequence identities.

Comparison of models built by different methods may help in identifying reliable regions
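
One way to summarise model quality against sequence identity is to bin models by target-template identity and report the fraction passing a quality cutoff in each bin. The sketch below is illustrative only: the function name, the input layout of (identity, score) pairs, the 10-point bins, the 0.7 cutoff on a 0 to 1 score, and the toy data are all assumptions, not values taken from the slide.

```python
from collections import defaultdict


def pass_rate_by_identity(models, bin_width=10, cutoff=0.7):
    """Fraction of models scoring at or above `cutoff`, per identity bin.

    models: iterable of (sequence_identity_percent, quality_score) pairs.
    Returns {bin_start: fraction_passing}, ordered by bin start.
    """
    counts = defaultdict(lambda: [0, 0])  # bin start -> [passed, total]
    for identity, score in models:
        # Clamp so that 100% identity falls into the last bin.
        start = min(int(identity // bin_width) * bin_width, 100 - bin_width)
        counts[start][1] += 1
        if score >= cutoff:
            counts[start][0] += 1
    return {start: passed / total
            for start, (passed, total) in sorted(counts.items())}


# Made-up (identity %, score) pairs, for illustration only.
toy_models = [(15, 0.8), (18, 0.4), (35, 0.9), (37, 0.95), (62, 1.0)]
rates = pass_rate_by_identity(toy_models)
```

A non-zero pass rate in the low-identity bins would correspond to the slide's point that reasonable models can still be identified at low sequence identities.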

Page 6: Percentage of Domain Sequences in Genomes

In combination with analysis of other features e.g. domain context, homology models may help in suggesting functional subgroups within a superfamily

Dissimilarity in electrostatic potential as an indicator of dissimilarity in function of PH domains. Blomberg et al. (1999) Classification of Protein Sequences by Homology Modeling and Quantitative Analysis of Electrostatic Similarity. Proteins 37:379-387

Electrostatic potential tends to be conserved to relatively low sequence identity between target and template. Chakravarty et al. (2005) Accuracy of structure-derived properties in simple comparative models of protein structures. Nucleic Acids Res. 33:244-259

Human pleckstrin Human exo84 signalling complex
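
The slides do not say how electrostatic similarity is scored. One widely used measure for two potentials sampled at the same grid points is the Hodgkin similarity index, 2(a·b)/(a·a + b·b), sketched below. Whether the cited studies use exactly this index is an assumption here, and the function name and inputs are hypothetical.

```python
def hodgkin_similarity(potential_a, potential_b):
    """Hodgkin similarity index between two potentials sampled at the same
    points: 1 for identical, 0 for uncorrelated, -1 for exactly opposite."""
    if len(potential_a) != len(potential_b):
        raise ValueError("potentials must be sampled at the same points")
    dot = sum(x * y for x, y in zip(potential_a, potential_b))
    norm = (sum(x * x for x in potential_a)
            + sum(y * y for y in potential_b))
    if norm == 0:
        raise ValueError("both potentials are zero everywhere")
    return 2.0 * dot / norm
```

In practice the two structures would first have to be superimposed and their potentials (for example from APBS) sampled on a common set of points; that step is outside this sketch.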

Page 7: Percentage of Domain Sequences in Genomes

GeMMA http://www.biochem.ucl.ac.uk/~dlee/GeMMA

• Currently ~ 80,000 models built by Modeller

• Update requires ~ 1 month every 6 months

• Modelling alignments from SAM-T99 HMMs

• Residue conservation calculated by Scorecons

• Electrostatic potential calculated by APBS

• Model quality assessed by ProSa 2003 and GA341