Proteome analysis in silico. Peer Bork, EMBL & MDC, Heidelberg & Berlin. bork@embl.de, http://www.bork.embl-heidelberg.de/

Transcript

  • Slide 1
  • Proteome analysis in silico. Peer Bork, EMBL & MDC, Heidelberg & Berlin. bork@embl.de, http://www.bork.embl-heidelberg.de/
  • Slide 2
  • -omes: use and misuse. Original intention, exemplified by the genome: -ome = entirety of biomolecular objects (ALL genes etc.); -omics = research on an entirety of biomolecular objects; proteomics = research on the entirety of proteins (so far in an organism), coined at the beginning of the 1990s. Common practice: -omics is used to describe large-scale approaches (whereby "large" is sometimes 1); proteomics is used for research on many proteins (whereby "many" might mean 3).
  • Slide 3
  • Protein profiling and interaction proteomics. Originally two main directions. Protein profiling: establishment of protein inventories under controlled conditions (organelles, tissues, organisms). Interaction proteomics: identification of temporally and spatially defined functional modules formed by proteins. Bioinformatics analysis is essential in both areas.
  • Slide 4
  • Proteome analysis in silico (Part I, Part II): protein detection and annotation by homology and orthology (function in 1D); protein interactions and protein networks (function in 2D); temporal and spatial considerations (function in 3D+4D).
  • Slide 5
  • Alternative splicing; genome annotation; domain analysis; protein networks; literature mining coupled to genomic data. Bork et al., J. Mol. Biol. 1998.
  • Slide 6
  • 70% prediction accuracy is great!
  • Slide 7
  • Concepts in function prediction. Categories: homology-based (intrinsic molecular features); gene context (functional associations); other (residue level, functional class). Approaches listed: sequence and domain DBs (BLAST, Pfam, SMART); gene neighbourhood, fusion, co-occurrence; shared regulatory elements; correlated mutations; interaction threading; function transfer by orthology; feature analysis.
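To make the "co-occurrence" idea on this slide concrete, here is a minimal sketch, not taken from the talk: it scores two genes by how similar their presence/absence patterns across genomes are, using Jaccard similarity as one simple, assumed choice of measure. The gene names, genome names and the functions `cooccurrence_score` and `rank_partners` are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def cooccurrence_score(profile_a: Set[str], profile_b: Set[str]) -> float:
    """Jaccard similarity of two presence profiles (sets of genome names)."""
    union = profile_a | profile_b
    if not union:
        return 0.0  # neither gene is present anywhere: no evidence
    return len(profile_a & profile_b) / len(union)


def rank_partners(query: str,
                  profiles: Dict[str, Set[str]]) -> List[Tuple[str, float]]:
    """Rank all other genes by profile similarity to the query gene."""
    scores = [(other, cooccurrence_score(profiles[query], genomes))
              for other, genomes in profiles.items() if other != query]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical toy data: in which genomes does each gene have a homologue?
profiles: Dict[str, Set[str]] = {
    "geneA": {"g1", "g2", "g3", "g4"},
    "geneB": {"g1", "g2", "g3"},
    "geneC": {"g5"},
}
```

Calling `rank_partners("geneA", profiles)` puts `geneB` first, because the two genes occur together in three of four genomes. A high score only suggests a functional association; it says nothing about molecular function by itself.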
  • Slide 8
  • I. Homology-based protein annotation. Outline: homology detection and domain annotation; metazoan genome annotation: the dark side; metazoan proteome analysis: human vs chicken; evolution of protein function.
  • Slide 9
  • Status of homology-based function prediction: many homologues, an increasing number of predictable folds, but tough times for automatic function prediction.
  • Slide 10
  • Molecular functions have to be defined on a domain basis, i.e. separately for each structurally independent unit within a sequence. Henikoff et al., Science 278 (1997) 609.
  • Slide 11
  • Slide 12
  • History of signaling domain discovery. Systematic discovery by 1) searching in between regions, 2) starting with repeats. Doerks et al., Genome Res. 2002; Ponting et al., Genome Res. 2001.
  • Slide 13
  • Domain discovery in disease genes
  • Slide 14
  • SMART (www.smart.embl-heidelberg.de), collaboration with Chris Ponting: BLAST-like input; access to different databases; domain annotation & architecture; alerting.
  • Slide 15
  • SMART digested output (www.smart.embl-heidelberg.de): signal sequence, coiled coil and TM; Pfam integrated; comparison of domain context.
  • Slide 16
  • MIT: a putative transport-associated microtubule-binding domain. Proteins shown with an MIT domain: Calpain 7; Spastin; SKD1 protein; VPS4p ATPase (vacuolar protein sorting factor 4A and 4B); Tobacco mosaic virus helicase domain-binding protein; Sorting nexin 15; RSK-like protein; similar to ribosomal protein S6 kinase; CG8866; Spartin. Further figure labels: Mutation; Plant-related. Unifying disorders associated with hereditary spastic paraplegia? Ciccarelli, F. D., et al., Genomics 81 (2003) 437; Patel, H., et al., Nat Genet 31 (2002) 347.
  • Slide 17
  • I. Homology-based genome annotation. Outline: homology detection and domain annotation; metazoan genome annotation: the dark side; metazoan proteome analysis: human vs chicken; evolution of protein function. Highlighted: metazoan genome annotation: the dark side.
  • Slide 18
  • Number of human genes in time. [Figure: estimated number of human genes in thousands (axis 0-120), Feb 2000 to Apr 2001; sources labelled HGS, Incyte and co.; textbooks, public opinion; Celera; HGP; HGS; others; data labels 38, 32, 52, 39, 27, 24, 22, 21; the basis for the Feb 2001 publications is marked. Overlaid: NEMAX50 index and, up to Jan 2005, TecDAX index (axis 2T-10T).]
  • Slide 19
  • Improvement of gene cluster predictions. Mouse chr4:94-94.6 Mb, p450 (CYP2J) region: 8 genes / 11 pseudogenic fragments. Known genes: cyp2j6, cyp2j9, cyp2j5, cyp2j13. Tracks compared: ESTs; Twinscan (1 gene); GeneID (3 genes); fgenesh++ (13 genes); ENSEMBL (9 genes). (Comparison performed in 2004.)
  • Slide 20
  • BLAST2GENE finds independent gene copies. BLAST of cyp2j13 protein vs. mouse chr4:94-94.6 Mb: ~150 alignments. [Figure: alignment coordinates before and after BLAST2GENE processing.] Hundreds; often considerable differences to current gene prediction pipelines!
  • Slide 21
  • Annotation of pseudogenes changes gene numbers. 1. Similarity search in intergenic regions. Steps shown: masking of known repeats and already predicted genes; 1.5-2 million fragments; BLASTX vs nr protein db, E-value < 0.001; fragments with significant sequence similarity; exclusion of transposon- and virus-derived sequence; merging of fragments of the same element; regions containing independent elements; closest known protein (first BLAST hit); GENEWISE; Ka/Ks functionality check. Ca. 20,000 detectable pseudogenes in each of human, mouse, rat. Torrents, Suyama, Bork, Genome Res. 13 (2003) 2550.
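Two of the steps named on this slide, keeping hits below the E-value cutoff and merging fragments of the same element, can be sketched as follows. This is only an illustration under stated assumptions, not the published pipeline: the `Hit` record, the `max_gap` parameter and the rule "same closest protein and nearby on the genome means same element" are hypothetical simplifications. Only the 0.001 cutoff comes from the slide.

```python
from dataclasses import dataclass
from typing import List

E_VALUE_CUTOFF = 0.001  # cutoff stated on the slide


@dataclass
class Hit:
    start: int     # genomic start of the similar fragment
    end: int       # genomic end of the similar fragment
    protein: str   # closest known protein (first BLAST hit)
    evalue: float  # BLASTX E-value


def significant_hits(hits: List[Hit]) -> List[Hit]:
    """Keep only fragments with significant sequence similarity."""
    return [h for h in hits if h.evalue < E_VALUE_CUTOFF]


def merge_fragments(hits: List[Hit], max_gap: int = 1000) -> List[Hit]:
    """Merge nearby fragments that hit the same protein (assumed rule)."""
    merged: List[Hit] = []
    for h in sorted(hits, key=lambda h: (h.protein, h.start)):
        last = merged[-1] if merged else None
        if last and last.protein == h.protein and h.start - last.end <= max_gap:
            last.end = max(last.end, h.end)           # extend the element
            last.evalue = min(last.evalue, h.evalue)  # keep the best E-value
        else:
            merged.append(Hit(h.start, h.end, h.protein, h.evalue))
    return merged
```

For example, two significant fragments at 100-200 and 250-400 that both hit the same protein would be merged into one candidate element spanning 100-400, while a fragment several kilobases away would remain a separate element. The later steps on the slide (GENEWISE alignment, Ka/Ks check) are not covered by this sketch.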
  • Slide 22
  • Annotation of pseudogenes changes gene numbers. 2. Consistency check of gene predictions. Still >3000 pseudogenes among the predicted human genes in mid 2004 (build 34). [Figure: predicted gene at Mm chr1:7608644-7681026, 80 kb, exons e1-e6; processed pseudogenes found by Genewise predictions using sptrembl|Q9HBM5 and SwissProt|RS2_RAT; stop codon or frameshift marked.] Arrays, chips et al.: 20% off?
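The "stop codon or frameshift" criterion on this slide can be illustrated with a deliberately simplified check on a predicted coding sequence. This is an assumed stand-in, not the Genewise-based procedure from the talk: it only looks for in-frame stop codons before the final codon and uses "length not a multiple of three" as a crude proxy for a frameshift. The function name `disruptions` is hypothetical.

```python
from typing import List

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def disruptions(cds: str) -> List[str]:
    """List simple signs that a predicted coding sequence is disrupted."""
    cds = cds.upper()
    problems: List[str] = []
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of 3 (possible frameshift)")
    # Check every complete codon except the last, which may be the real stop.
    n_codons = len(cds) // 3
    for i in range(n_codons - 1):
        codon = cds[3 * i:3 * i + 3]
        if codon in STOP_CODONS:
            problems.append(f"in-frame stop codon {codon} at position {3 * i}")
    return problems
```

An empty result does not show that a gene is functional; it only means this particular check found nothing. A protein-guided alignment, as on the slide, can locate frameshifts that a length test cannot.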
  • Slide 23
  • Protein diversity: 20-40k genes, >100k transcripts, >1000k proteins? What do we count?
  • Slide 24
  • Rate of detectable alternative splicing depends on EST coverage and library range. [Figure: AS per mRNA (x), axis range 2.0-2.8.] Brett et al., Nature Genet. 30 (2002) 29.
  • Slide 25
  • Boue et al., Bioessays 2003.
  • Slide 26
  • Homology-based predictions of exons and alternative transcripts (www.smart.embl-heidelberg.de). SMART domain DB links to genomes.
  • Slide 27
  • Top 10 domains* in human: 30% diff.! Number of genes as human / fly / worm: Immunoglobulin 765 (381) / 140 / 64; C2H2 zinc finger 706 (607) / 357 / 151; Protein kinase 575 (501) / 319 / 437; Rhod.-like GPCR 569 (616) / 97 / 358; P-loop NTPase 433 / 198 / 183; Rev. transcriptase 350 / 10 / 50; RRM (RNA-binding) 300 (224) / 157 / 96; WD40 (G-protein) 277 (136) / 162 / 102; Ankyrin repeat 276 (145) / 105 / 107; Homeobox 267 (160) / 148 / 109. Total no. of genes: 26500 (26500) / 13300 / 18200. *Only no. of genes given, no. of domains higher; note that only around 90% is sequenced. Nature 409 (2001) 860; Science 291 (2001) 1304.
  • Slide 28
  • Metazoan genome annotation: an ongoing process and far from complete. >2000 pseudogenes in mammalian gene sets; only now are they about to be included in prediction pipelines. Ca. 150 retro-related genes in mammalian gene sets (>1000 in 2004), but true human genes are sometimes suppressed. Annotation of gene clusters needs considerable improvement. Alternative splicing is still a major unknown. Considerable human factor in annotation.
  • Slide 29
  • I. Homology-based genome annotation. Outline: homology detection and domain annotation; metazoan genome annotation: the dark side; metazoan proteome analysis: human vs chicken; evolution of protein function. Highlighted: metazoan genome annotation: the dark side; metazoan proteome analysis: human vs chicken.
  • Slide 30
  • Genome publications: Human, Nature Feb 2001; Mosquito, Science Oct 2002; Mouse, Nature Dec 2002; Rat, Nature Apr 2004; Chicken, Nature Dec 2004. [Figure: species tree with human, chimp, mouse, rat, chicken, fugu, D. melanogaster, mosquito and C. elegans; divergence labels 5, 40, 75, 250 MY, 310 MY, 450 MY, 600-1200 MY?]
  • Slide 31
  • Chicken genome analysis. [Figure labels: 15%, 45%.] Zdobnov et al., Science 2002; Hillier et al., Nature 2004.
  • Slide 32
  • Chicken genome analysis: orthology and cellular processes. 75.4% identity (median) between chicken and human 1:1 orthologs. Immune response evolves fastest.
  • Slide 33
  • Chicken genome analysis: innovation and expansion of domain families
  • Slide 34
  • Orthology analysis reveals more subtle functional changes
  • Slide 35
  • Evolution by duplication: burst of an olfactory receptor family thought to recognize MHC diversity. 221 copies in chicken, given ca. 300 ORs in chicken and 450 in human. [Figure labels: chicken, human.]
  • Slide 36
  • Chicken genome analysis: Evolution of function by domain accretion Scavenger receptor cysteine-rich domain acquired by a fibrinogen-domain containing protein (identified an