- Representative Proteomes and Genomes: A standardized, stable and
unbiased set of proteomes and genomes.
http://pir.georgetown.edu/rps/
Raja Mazumder (mazumder@gwu.edu)
- Nomenclature: Representative Proteomes (primarily computational);
Reference Proteomes/QFO Proteomes/Blessed Proteomes (primarily
manual); Extension of Reference proteomes
- Representative Proteomes (RP) and Representative genomes (RG)
http://pir.georgetown.edu/rps/
- Procedure to generate the RPs:
  1. Compute the pair-wise co-membership value (X) in UniRef50 for
     all proteomes.
  2. For each proteome, compute the mean co-membership between this
     proteome and the other proteomes.
  3. Create a ranked proteome list based on the mean co-membership.
  4. RPG generation starts: for a given CMT, take the first proteome
     in the ranked list and the ones with X >= CMT to form an RPG,
     and remove them from the ranked list.
  5. Is the ranked list empty? If no, repeat step 4; if yes, RPG
     generation ends.
  6. Select a Representative Proteome for each RPG (manually
     inspected by curator).
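The grouping loop in the procedure above can be sketched in Python. This is a minimal illustration, not the PIR implementation: it assumes the pair-wise co-membership values X are already computed and supplied as a symmetric nested dict (the slide does not give the formula for X), that the list is ranked by descending mean co-membership (the slide does not state the direction), and that a proteome joins a group when its X with the seed is at least the CMT. The function name `build_rpgs` and the toy values are hypothetical.

```python
def build_rpgs(x, cmt):
    """Greedily group proteomes into RPGs at one co-membership threshold.

    x   -- symmetric nested dict, x[a][b] = co-membership of a and b (%)
    cmt -- co-membership threshold (%)
    """
    proteomes = sorted(x)
    others = max(len(proteomes) - 1, 1)  # avoid dividing by zero for one proteome

    # Mean co-membership of each proteome with all the others.
    mean = {
        p: sum(x[p][q] for q in proteomes if q != p) / others
        for p in proteomes
    }

    # ASSUMPTION: highest mean first; ties broken by name for determinism.
    ranked = sorted(proteomes, key=lambda p: (-mean[p], p))

    groups = []
    while ranked:  # "ranked list empty?" check from the flowchart
        seed = ranked[0]
        members = [seed] + [p for p in ranked[1:] if x[seed][p] >= cmt]
        groups.append(members)
        taken = set(members)
        ranked = [p for p in ranked if p not in taken]
    return groups


# Toy, made-up co-membership values for four hypothetical proteomes.
pairs = {("A", "B"): 80, ("A", "C"): 60, ("A", "D"): 10,
         ("B", "C"): 50, ("B", "D"): 5, ("C", "D"): 20}
toy = {p: {} for p in "ABCD"}
for (a, b), value in pairs.items():
    toy[a][b] = value
    toy[b][a] = value

rpgs_55 = build_rpgs(toy, 55)
rpgs_75 = build_rpgs(toy, 75)
```

Raising the threshold splits groups apart, which matches the idea that a higher CMT yields more RPGs. Choosing the representative within each group is left out, since the slide says that step is inspected by a curator.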
- Co-membership of Proteome A and Proteome B in UniRef50 (figure).
UniRef sequence clusters (UniRef100, 90, 50): from any organism;
part of the UniProt production cycle. PMID: 17379688
- RPs at Different CMTs (figure): number of RPs, % reduction in
proteomes (1) and % reduction in sequences (2, in million proteins)
plotted against CMT (%) at 75, 55, 35 and 15.
1 Based on 1144 complete genomes.
2 Based on 4.3 million sequences (complete genomes only).
UniProtKB total: 13.46 million sequences.
- The RP at a higher level is used to cluster the lower levels. RPGs
are constructed based on co-membership, not taxonomy.
- Manual mapping of UniProt and NCBI genomes: The taxonomy ID of
each proteome present in UniProt is mapped to the NCBI
RefSeq/GenBank genome project IDs. When more than one genome is
available for the same taxonomy ID, the genomes are ranked
according to the availability of a RefSeq genome, number of related
publications, number of citations for each publication, and date of
sequencing. The highest-ranking genome is mapped to the UniProtKB
proteome.
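The ranking described above can be pictured as a sort on a tuple of criteria. The sketch below is only an illustration under stated assumptions: the slide says the mapping is manual and does not say how the criteria are weighted, so the code assumes they are applied in the listed order as tie-breakers, collapses the per-publication citation counts into one total, and assumes a more recent sequencing date ranks higher. The class `GenomeCandidate`, its fields, and the project IDs are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class GenomeCandidate:
    project_id: str       # hypothetical genome project identifier
    has_refseq: bool      # is a RefSeq genome available?
    n_publications: int   # number of related publications
    n_citations: int      # ASSUMPTION: citations summed over publications
    sequenced: date       # date of sequencing


def rank_genomes(candidates):
    """Return candidates for one taxonomy ID, best first.

    ASSUMPTION: criteria are compared in the order listed on the slide,
    and a later sequencing date wins the final tie-break.
    """
    return sorted(
        candidates,
        key=lambda g: (g.has_refseq, g.n_publications, g.n_citations, g.sequenced),
        reverse=True,
    )


# Made-up candidates sharing one taxonomy ID.
candidates = [
    GenomeCandidate("G1", True, 2, 10, date(2005, 1, 1)),
    GenomeCandidate("G2", False, 9, 99, date(2009, 1, 1)),
    GenomeCandidate("G3", True, 2, 10, date(2008, 1, 1)),
]
ranked_ids = [g.project_id for g in rank_genomes(candidates)]
```

The first entry of the ranked list would be the genome mapped to the UniProtKB proteome.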
- RefSeq genomes and proteomes: Mapping allows us to retrieve
genomes and proteomes from RefSeq.
ftp://ftp.pir.georgetown.edu/databases/rps/rg
ftp://ftp.pir.georgetown.edu/databases/rps/rp_in_refseq_sequences/
- RP55 Over Time (figure): number of complete proteomes, number of
RPGs, % species in multiple RPGs and % stable RPGs plotted by
UniProtKB release (2004_1, 2005_4, 2006_7, 2007_10, 2008_13,
2009_15, 2010_09).
- Coverage Statistics (RP55): 95% of all InterPro families contain
at least one protein from the RP set. InterPro covers ~75% of all
proteins in UniProtKB, and this number holds true as well for RP55.
93% of the experimentally characterized proteins are retained in
the RP set.
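A family-coverage figure like the first one above is the fraction of families that contain at least one protein from the RP set. The sketch below shows that arithmetic on made-up data; the function name `family_coverage`, the family names and the protein accessions are all hypothetical, and it says nothing about how the reported percentages were actually computed.

```python
def family_coverage(families, rp_proteins):
    """Fraction of families with at least one member in the RP set.

    families    -- dict mapping family name -> set of protein accessions
    rp_proteins -- set of protein accessions in the representative set
    """
    if not families:
        return 0.0
    covered = sum(
        1 for members in families.values()
        if not rp_proteins.isdisjoint(members)
    )
    return covered / len(families)


# Made-up families and accessions, for illustration only.
toy_families = {
    "FAM1": {"P1", "P2"},
    "FAM2": {"P3"},
    "FAM3": {"P4"},
    "FAM4": {"P2", "P5"},
}
toy_rp = {"P2", "P3"}
coverage = family_coverage(toy_families, toy_rp)
```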
- Coverage and use
- Downloads
- Make your own set
- Visualizing the all-against-all proteome correlation matrix vs.
the taxonomy tree: Developed a method to graphically visualize
NCBI's taxonomy tree and overlay the proteome correlation tree (PCT)
to illustrate genomic similarity between organisms that may
otherwise be considered to have distant ancestry. Computed
all-against-all correlation values between all complete proteomes.
The comparison network can be browsed in the Cytoscape network
software to easily identify nodes in the taxonomy tree that are not
supported by PCT data. Development tools: CytoscapeWeb,
CytoscapeRPC, Perl.
- Family Enterobacteriaceae (figure): genera Shigella, Escherichia,
Enterobacter and Klebsiella are placed by distance based on the
taxonomy tree; proteomes ENT38, ECOLI, ESCF3, ECO24, ENTAK, ECOK1,
KLEP7, SHIFL and SHIF8 are placed by distance based on the
correlation table.
- Example: Examine correlation scores of AGRT5, Agrobacterium
tumefaciens (AGRT5)
http://pir.georgetown.edu/cgi-bin/rps_tree.pl?point_id=r15p176299&on=1&on100=1&file_id=122063&p=#-5
- Can easily identify genomic neighbors: The top 2 levels are
family and genus nodes arranged according to taxonomic position. The
bottom nodes are complete proteomes with a heuristic force-directed
layout applied according to all-against-all correlation. Although
AGRT5 and AGRVS share the same genus, they are relatively distant
from each other (~28%), compared to AGRT5 and AGRSH (~70%).
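Finding the genomic neighbors of one proteome amounts to sorting the other proteomes by their correlation with it. The sketch below does this for the two approximate AGRT5 values quoted above; the function name `rank_neighbors` and the pair-keyed dict layout are assumptions made for illustration and are not the format used by the RPS site.

```python
def rank_neighbors(correlations, query):
    """List (proteome, correlation) pairs for `query`, highest first.

    correlations -- dict mapping an unordered pair (a, b) to a
                    correlation value in percent
    """
    scores = {}
    for (a, b), value in correlations.items():
        if a == query:
            scores[b] = value
        elif b == query:
            scores[a] = value
    # Highest correlation first; ties broken by proteome code.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Approximate values taken from the slide text (~28% and ~70%).
slide_correlations = {
    ("AGRT5", "AGRVS"): 28.0,
    ("AGRT5", "AGRSH"): 70.0,
}
neighbors = rank_neighbors(slide_correlations, "AGRT5")
```

With these values AGRSH is listed ahead of AGRVS, even though AGRVS shares a genus with AGRT5.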
- Sequence search: cleaner BLAST/phmmer results
- Conclusions: High-quality RPs generated computationally and
inspected by curators. A standardized, stable and unbiased set of
proteomes and genomes. Completely integrated into the
UniProt/UniRef production pipeline, with monthly releases.
Automatically selects the QFO/UniProt RF (if available in the RPG)
as RP (feedback is provided to QFO and others if there is a
discrepancy). Extended to RefSeq
(ftp://ftp.pir.georgetown.edu/databases/rps/rg;
ftp://ftp.pir.georgetown.edu/databases/rps/rp_in_refseq_sequences/).
RGs can help placement of unknown metagenomic sequences into the
correct clusters.
- Acknowledgements: Chuming Chen (PIR), Darren Natale (PIR),
Hongzhan Huang (PIR), Jian Zhang (PIR), Peter McGarvey (PIR), Cathy
Wu (PIR), Mona Motwani (GWU), Jamal Theodore (GWU), Robert Finn
(HHMI Janelia Farm/Pfam), Eleanor Stanley (EBI), Kim Pruitt (NCBI),
Yuri Wolf (NCBI), UniProt Consortium