Proteome analysis in silico
Peer Bork, EMBL & MDC, Heidelberg & Berlin
http://www.bork.embl-heidelberg.de/
omes: use and misuse
Original intention, exemplified by the genome:
- ome: entirety of biomolecular objects (ALL genes etc.)
- omics: research on an entirety of biomolecular objects
- Proteomics: research on the entirety of proteins (so far in an organism); coined at the beginning of the 1990s
Common practice:
- omics: used to describe large-scale approaches (whereby "large" is sometimes 1)
- Proteomics: used for research on many proteins (whereby "many" might mean 3)
Protein profiling and interaction proteomics
Originally two main directions:
- Protein profiling: establishment of protein inventories under controlled conditions (organelles, tissues, organisms)
- Interaction proteomics: identification of temporally and spatially defined functional modules formed by proteins
Bioinformatics analysis is essential in both areas
Proteome analysis in silico
Part I: Protein detection and annotation by homology and orthology (function in 1D)
Part II: Protein interactions and protein networks (function in 2D)
Temporal and spatial considerations (function in 3D+4D)
Alternative splicing; genome annotation; domain analysis; protein networks; literature mining coupled to genomic data
Bork et al. J Mol Biol 1998
70% prediction accuracy is great!
Concepts in function prediction
Homology-based (intrinsic molecular features)
- Sequence and domain DBs (Blast, Pfam, Smart)
- Function transfer by orthology
- Feature analysis
Gene context (functional associations)
- Gene neighbourhood, fusion, co-occurrence
- Shared regulatory elements
Other (residue level, functional class)
- Correlated mutations
- Interaction threading
I. Homology-based protein annotation
- Homology detection and domain annotation
- Metazoan genome annotation: the dark side
- Metazoan proteome analysis: human vs chicken
- Evolution of protein function
Status of homology-based function prediction
Many homologues and an increasing number of predictable folds, but tough times for automatic function prediction
Molecular functions have to be defined on a domain basis, i.e. separately for each structurally independent unit within a sequence
Henikoff et al. 1997 Science 278, 609
History of signaling domain discovery
Systematic discovery by
1) searching in between regions
2) starting with repeats
Doerks et al. 2002 Genome Res.; Ponting et al. 2001 Genome Res.
Domain discovery in disease genes
SMART (www.smart.embl-heidelberg.de)
- Blast-like input
- Access to different databases
- Domain annotation & architecture
- Alerting
Collaboration with Chris Ponting
SMART: digested output (www.smart.embl-heidelberg.de)
- Signal sequence, coiled coil and TM
- Pfam integrated
- Comparison of domain context
MIT: a putative transport-associated microtubule-binding domain
Proteins shown with an MIT domain:
- Calpain7
- Spastin
- SKD1 protein / VPS4p ATPase (vacuolar protein sorting factor 4A and 4B)
- Tobacco mosaic virus helicase domain-binding protein
- Sorting nexin 15
- RSK-like protein (similar to ribosomal protein S6 kinase)
- CG8866
- Spartin (figure labels: Mutation, MIT, Plant-related)
Unifying disorders associated with hereditary spastic paraplegia?
Ciccarelli, F. D., et al. Genomics 81(03)437; Patel, H. et al. Nat Genet 31(02)347
I. Homology-based genome annotation
Metazoan genome annotation: the dark side
Number of human genes in time
[Figure: estimated number of human genes (in thousands), Feb 00 to Apr 01, by source: HGS, Incyte and co; textbooks, public opinion; Celera; HGP; others. The basis for the Feb 01 publications is marked. Shown alongside: NEMAX50 index and TecDAX index (to Jan 05).]
Annotation of pseudogenes changes gene numbers
1. Similarity search in intergenic regions
- Masking of known repeats and already predicted genes
- 1.5-2 million fragments
- BLASTX vs nr protein db, E-value < 0.001: fragments with significant sequence similarity
- Exclusion of transposon- and virus-derived sequence
- Merging of fragments of the same element: regions containing independent elements
- GENEWISE against the closest known protein (first blast hit)
- Ka/Ks functionality check
Ca. 20,000 detectable pseudogenes in each of human, mouse, rat
Torrents, Suyama, Bork Genome Res. 13(2003)2550
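The filtering and merging steps of this pipeline can be sketched as follows. This is a minimal illustration, not the published implementation: the `Hit` record, the function names and the `max_gap` merging distance are hypothetical, and only the E-value cutoff of 0.001 comes from the slide.

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 1e-3  # threshold quoted on the slide


@dataclass(frozen=True)
class Hit:
    """One BLASTX hit of an intergenic fragment (hypothetical record layout)."""
    chrom: str
    start: int
    end: int
    protein: str   # closest known protein (first BLAST hit)
    evalue: float


def significant(hits, cutoff=E_VALUE_CUTOFF):
    """Keep only hits whose E-value is below the cutoff."""
    return [h for h in hits if h.evalue < cutoff]


def merge_fragments(hits, max_gap=1000):
    """Merge hits to the same protein on the same chromosome.

    max_gap is an assumed parameter: hits separated by more than this
    many bases are treated as independent elements.
    """
    merged = []
    for h in sorted(hits, key=lambda h: (h.chrom, h.protein, h.start)):
        last = merged[-1] if merged else None
        if (last is not None and last.chrom == h.chrom
                and last.protein == h.protein
                and h.start - last.end <= max_gap):
            # Extend the previous region and keep the best E-value.
            merged[-1] = Hit(last.chrom, last.start, max(last.end, h.end),
                             last.protein, min(last.evalue, h.evalue))
        else:
            merged.append(h)
    return merged
```

Each merged region would then be passed, together with its closest known protein, to a GENEWISE-style alignment and a Ka/Ks check; those steps are not sketched here.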
Annotation of pseudogenes changes gene numbers
2. Consistency check of gene predictions
Still >3000 pseudogenes among the predicted human genes mid 2004 (build 34)
[Figure: predicted gene at Mm chr1:7608644-7681026 (80 kb, exons e1-e6) that overlaps two processed pseudogenes, found by Genewise predictions using sptrembl|Q9HBM5 and SwissProt|RS2_RAT; stop codons or frameshifts are marked.]
Arrays, chips et al. 20% off?
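The "stop codon or frameshift" criterion can be illustrated on a spliced coding sequence. The sketch below is a simplified stand-in: the slide's check rests on Genewise alignments against known proteins, whereas this hypothetical `cds_problems` helper only inspects the reading frame of a sequence that is assumed to start in frame.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_problems(cds):
    """Return reasons why a coding sequence looks disrupted (empty if none)."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of 3 (possible frameshift)")
    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # A stop codon is only expected as the final codon.
    for index, codon in enumerate(codons[:-1]):
        if codon in STOP_CODONS:
            problems.append(f"in-frame stop codon {codon} at codon {index + 1}")
    return problems
```

A prediction with a non-empty result would be a candidate pseudogene to inspect further, not a confirmed one.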
Protein diversity: what do we count?
- 20-40k genes
- >100k transcripts
- >1000k proteins?
Rate of detectable alternative splicing depends on EST coverage and library range
[Figure: AS per mRNA, axis range 2.0 to 2.8]
Brett et al. Nature Genet. 30(2002)29
Boue et al. Bioessays 03
SMART domain DB links to genomes (www.smart.embl-heidelberg.de)
Homology-based predictions of exons and alternative transcripts
Top 10 domains* in human: 30% diff.!
Species              human           fly      worm
Immunoglobulin       765 (381)       140      64
C2H2 zinc finger     706 (607)       357      151
Protein kinase       575 (501)       319      437
Rhod.-like GPCR      569 (616)       97       358
P-loop NTPase        433             198      183
Rev. transcriptase   350             10       50
RRM (RNA-binding)    300 (224)       157      96
WD40 (G-protein)     277 (136)       162      102
Ankyrin repeat       276 (145)       105      107
Homeobox             267 (160)       148      109
Total no. genes      26500 (26500)   13300    18200
*Only no. of genes given, no. of domains higher; note that only around 90% is sequenced
Nature 409 (01)860; Science 291(01)1304
Metazoan genome annotation: an ongoing process and far from complete
- >2000 pseudogenes in mammalian gene sets: only now are they about to be included in prediction pipelines
- Ca. 150 retro-related genes in mammalian gene sets (>1000 in 2004), but true human genes sometimes suppressed
- Annotation of gene clusters needs considerable improvement
- Alternative splicing still a major unknown
- Considerable human factor in annotation
I. Homology-based genome annotation
Metazoan proteome analysis: human vs chicken
Genome publications: Human: Nature Feb 2001; Mosquito: Science Oct 2002; Mouse: Nature Dec 2002; Rat: Nature Apr 2004; Chicken: Nature Dec 2004
[Figure: species tree of human, chimp, mouse, rat, chicken, fugu, mosquito, D. mel. and C. eleg., with divergence times labelled 5, 40, 75, 250MY, 310MY, 450MY and 600-1200MY?]
Chicken genome analysis
[Figure: values labelled 15% and 45%]
Zdobnov et al. Science 02; Hillier et al. Nature 04
Chicken genome analysis: orthology and cellular processes
- 75.4% identity (median) between chicken and human 1:1 orthologs
- Immune response evolves fastest
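A median identity of this kind can be computed from pairwise alignments of 1:1 orthologs. The sketch below is only illustrative: it assumes identity is counted over aligned, non-gap columns, which the slide does not specify, and the function names are hypothetical.

```python
from statistics import median


def percent_identity(aligned_a, aligned_b):
    """Percent identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


def median_identity(alignments):
    """Median percent identity over an iterable of (seq_a, seq_b) alignments."""
    return median(percent_identity(a, b) for a, b in alignments)
```

Grouping the ortholog pairs by functional category before taking the median would give a per-process comparison of the kind the slide alludes to.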
Chicken genome analysis: innovation and expansion of domain families
Orthology analysis reveals more subtle functional changes
Evolution by duplication: burst of an olfactory receptor family thought to recognize MHC diversity
221 copies in chicken, given a total of ca. 300 ORs in chicken and 450 in human
Chicken genome analysis: Evolution of function by domain accretion Scavenger receptor cysteine-rich domain acquired by a fibrinogen-domain containing protein (identified an