bork@embl.de bork.embl-heidelberg.de

DESCRIPTION

Proteome analysis in silico. Peer Bork, EMBL & MDC, Heidelberg & Berlin. bork@embl.de, http://www.bork.embl-heidelberg.de/. omes: use and misuse. Original intention exemplified by the genome: -ome, the entirety of biomolecular objects (ALL genes etc.).

Transcript

  • Proteome analysis in silico. Peer Bork, EMBL & MDC, Heidelberg & Berlin. bork@embl.de, http://www.bork.embl-heidelberg.de/

  • omes: use and misuse
    -ome: entirety of biomolecular objects (ALL genes etc.)
    Original intention, exemplified by the genome:
    - -omics: research on an entirety of biomolecular objects
    - Proteomics: research on the entirety of proteins (so far in an organism), coined at the beginning of the 1990s
    Common practice:
    - -omics: used to describe large-scale approaches (whereby large is sometimes 1)
    - Proteomics: used for research on many proteins (whereby many might mean 3)

  • Protein profiling and interaction proteomics. Originally two main directions:
    - Protein profiling: establishment of protein inventories under controlled conditions (organelles, tissues, organisms).
    - Interaction proteomics: identification of temporally and spatially defined functional modules formed by proteins.
    Bioinformatics analysis is essential in both areas.

  • Proteome analysis in silico
    - Part I: Protein detection and annotation by homology and orthology (function in 1D)
    - Part II: Protein interactions and protein networks (function in 2D)
    - Temporal and spatial considerations (function in 3D+4D)

  • Alternative splicing; genome annotation; domain analysis; protein networks; literature mining coupled to genomic data. Bork et al. J Mol Biol 1998

  • 70% prediction accuracy is great!

    Prediction of | acc*cov | % acc | % cov of reference set | reference
    Human promoters | .35 | 50% | 70% of annotated test set | Prestridge, 1995; Bucher, pers. comm.
    Human regulatory RNA elements | .34 | 85% | 40% of new DNA | Dandekar & Sharma, 1998
    Human genes (only presence) | .49 | 70% | 70% of chromosome 22 | Dunham et al., 1999 and refs therein
    Human SNPs by EST comparison | .21 | 70% | 30% of all proteins with SNP | Sunyaev et al., 2000; Buetow et al., 1999
    Human alternative splicing | .45 | 90% | 50% of all splice sites | Hanke et al., 1999
    Transmembranes (only presence) | .85 | 85% | 99% of annotated test set | Tusnady & Simon, 1998 and refs therein
    Signal peptides (only presence) | .90 | 90% | 100% of annotated test set | Nielsen et al., 1999
    GPI anchors (incl. cleavage site) | .72 | 72% | 100% of annotated test set | Eisenhaber et al., 1999
    Coiled coil (only presence) | .81 | 90% | 90% of annotated coiled coil | Lupas, 1996
    Secondary structure (3 states) | .77 | 77% | 100% of 3D test set | Jones, 1999 and refs therein
    Buried or exposed residues | .74 | 74% | 100% of 3D test set | Rost, 1996
    Residue hydration | .72 | 72% | 100% of 3D test set | Ehrlich et al., 1998
    Protein folds (in Mycoplasma) | .49 | 98% | 50% of Mycoplasma ORFs | Teichmann et al., 1999 and refs therein
    Homology (several methods) | .49 | 98% | 50% of 3D test set | Muller et al., 1999 and refs therein
    Functional features by homology | .63 | 90% | 70% unicellular genomes | Bork and Koonin, 98; Brenner, 99
    Function association by context | .25 | 50% | 10% high confidence in yeast | Marcotte et al., 1999b
    Cellular localization (2 states) | .77 | 77% | 100% of annotated test set | Andrade et al., 1998
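The acc*cov column in the table above reads as accuracy multiplied by coverage, so a method that is very accurate on only half of the reference set scores no better than a moderately accurate one that covers all of it. A minimal sketch of that arithmetic, using four rows copied from the table; the function name `combined_score` and the variable names are illustrative assumptions, not part of the slides:

```python
# Rows copied from the table: (prediction task, accuracy, coverage), both as fractions.
rows = [
    ("Human promoters", 0.50, 0.70),
    ("Human regulatory RNA elements", 0.85, 0.40),
    ("Signal peptides (only presence)", 0.90, 1.00),
    ("Protein folds (in Mycoplasma)", 0.98, 0.50),
]


def combined_score(accuracy: float, coverage: float) -> float:
    """Hypothetical helper: accuracy weighted by the fraction of cases covered."""
    return accuracy * coverage


# Round to two decimals, matching the precision used on the slide.
scores = {name: round(combined_score(acc, cov), 2) for name, acc, cov in rows}

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Fold prediction at 98% accuracy drops to .49 once its 50% coverage is taken into account, which is the point of the slide title.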

  • Concepts in function prediction
    Homology-based (intrinsic molecular features)
    - Sequence and domain DBs (Blast, Pfam, Smart)
    - Function transfer by orthology
    - Feature analysis
    Gene context (functional associations)
    - Gene neighbourhood, fusion, co-occurrence
    - Shared regulatory elements
    Other (residue level, functional class)
    - Correlated mutations
    - Interaction threading

  • I. Homology-based protein annotation
    - Homology detection and domain annotation
    - Metazoan genome annotation: the dark side
    - Evolution of protein function
    - Metazoan proteome analysis: human vs chicken

  • Status of homology-based function prediction: many homologues, an increasing number of predictable folds, but tough times for automatic function prediction

  • Molecular functions have to be defined on a domain basis, i.e. separately for each structurally independent unit within a sequence. Henikoff et al. 1997 Science 278, 609

  • History of signaling domain discovery. Systematic discovery by 1) searching in between regions, 2) starting with repeats. Doerks et al. 2002 Genome Res.; Ponting et al. 2001 Genome Res.

  • Domain discovery in disease genes

    gene/protein | disease | domains | reference
    dystrophin | Muscular dystrophy | WW | Bork & Sudol: TIBS 19 (94) 531
    X11 | Friedreich's ataxia (c) | PI/PTB+PDZ | Bork & Margolis: Cell 80 (95) 693
    PKD1 | Polycystic kidney | many (PKD1) | Int. PKD1 consortium: Cell 81 (95) 298
    HD | Huntington's | HEAT repeats | Andrade & Bork: Nat. Genet. 11 (95) 115
    BRCA2 | Breast cancer | BRC repeats | Bork et al.: Nat. Genet. 13 (96) 22
    BRCA1 | Breast cancer | BRCT | Koonin et al.: Nat. Genet. 13 (96) 266
    dsh | DiGeorge syndrome | DEP | Ponting & Bork: TIBS 21 (96) 245
    X25 (FRDA) | Friedreich's ataxia | CyaY | Gibson et al.: TINS 19 (96) 465
    beige/CH | Chediak-Higashi | BEACH | Nagle et al.: Nat. Genet. 14 (96) 307
    RB | Retinoblastoma | BRCT | Bork et al.: FASEB J. 11 (97) 68
    9 incl. HML1 | Colon cancer | HSP90 | Mushegian et al.: PNAS 94 (97) 5831
    TSG101 | Breast cancer | UBC | Ponting, Cai & Bork: JMM 75 (97) 467
    WRN/BLM | Werner + Bloom syn. | HRDC | Morozov et al.: TIBS 22 (97) 417
    2 incl. pyrin | Mediterranean fever | SPRY | Schultz et al.: PNAS 95 (98) 5857
    p73 | various tumors? | SAM | Bork & Koonin: Nat. Genet. 18 (98) 313
    mahogany | Obesity | PSI | Nagle et al.: Nature 398 (99) 148
    Parkin | AP-J Parkinsonism | IBR | Morett & Bork: TIBS 24 (99) 229

  • SMART (www.smart.embl-heidelberg.de), in collaboration with Chris Ponting
    - Blast-like input
    - Access to different databases
    - Domain annotation & architecture
    - Alerting

  • SMART (www.smart.embl-heidelberg.de): digested output
    - signal sequence, coiled coil and TM
    - Pfam integrated
    - comparison of domain context

  • MIT: a putative transport-associated microtubule-binding domain. Unifying disorders associated to hereditary spastic paraplegia?
    - Spastin
    - Spartin
    - SKD1 protein
    - VPS4p ATPase (Vacuolar protein sorting factor 4A and 4B)
    - Tobacco mosaic virus helicase domain-binding protein
    - Sorting nexin 15
    Figure labels: Mutation; Plant-related.
    Ciccarelli, F. D., et al. Genomics 81(03)437; Patel, H. et al. Nat Genet 31(02)347

  • I. Homology-based genome annotation
    - Homology detection and domain annotation
    - Metazoan genome annotation: the dark side
    - Evolution of protein function
    - Metazoan proteome analysis: human vs chicken

  • Number of human genes in time (figure). Vertical axis: no. of human genes in thousands (0 to 120); time axis: Feb 00 to Jan 05. Series: HGS, Incyte and co; textbooks, public opinion; Celera; HGP; others. Marked: basis for Feb 01 publications.

  • Improvement of gene cluster predictions. Mouse chr4:94-94.6 Mb, p450 (CYP2J) region: 8 genes / 11 pseudogenic fragments (comparison performed in 2004)

  • BLAST2GENE finds independent gene copies. BLAST of cyp2j13 protein vs. mouse chr4:94-94.6 Mb: ~150 alignments (figure: alignment coordinates and BLAST2GENE output). Hundreds of often considerable differences to current gene prediction pipelines!
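To illustrate what finding independent gene copies among many BLAST alignments can mean, here is a minimal sketch. It is not BLAST2GENE's published algorithm: it is a simple greedy grouping under the assumption that alignments belonging to one copy advance along the query protein as they advance along the genome, and that a jump back to the start of the query signals a new copy. The `Hsp` class, `group_into_copies`, the `max_query_overlap` parameter and all coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hsp:
    """One protein-vs-genome alignment (hypothetical, simplified record)."""
    q_start: int  # start on the query protein (residues)
    q_end: int    # end on the query protein
    g_start: int  # start on the genomic sequence (bases)
    g_end: int    # end on the genomic sequence
    strand: str   # "+" or "-"


def group_into_copies(hsps, max_query_overlap=10):
    """Greedily group alignments into candidate gene copies, strand by strand."""
    copies = []
    for strand in ("+", "-"):
        # Walk in the direction of transcription: ascending genomic position
        # on the plus strand, descending on the minus strand.
        ordered = sorted(
            (h for h in hsps if h.strand == strand),
            key=lambda h: h.g_start,
            reverse=(strand == "-"),
        )
        current = []
        for h in ordered:
            # If the query coordinate falls back past the previous alignment,
            # the same part of the protein is matched again: start a new copy.
            if current and h.q_start < current[-1].q_end - max_query_overlap:
                copies.append(current)
                current = []
            current.append(h)
        if current:
            copies.append(current)
    return copies


# Invented coordinates: two tandem copies on "+" and one copy on "-".
example = [
    Hsp(1, 100, 1000, 1300, "+"),
    Hsp(101, 200, 2000, 2300, "+"),
    Hsp(1, 100, 5000, 5300, "+"),
    Hsp(95, 200, 6000, 6318, "+"),
    Hsp(1, 200, 9000, 9600, "-"),
]
copies = group_into_copies(example)
print([len(c) for c in copies])
```

A real pipeline would also have to weigh alignment scores, intron lengths and frame, which this sketch deliberately leaves out.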

  • Annotation of pseudogenes changes gene numbers. 1. Similarity search in intergenic regions. Torrents, Suyama, Bork Genome Res. 13(2003)2550

  • Annotation of pseudogenes changes gene numbers. 2. Consistency check of gene predictions: still >3000 pseudogenes among the predicted human genes mid 2004 (build 34). Arrays, chips et al. 20% off?

  • Protein diversity: what do we count? 20-40k genes, >100k transcripts, >1000k proteins?

  • Rate of detectable alternative splicing depends on EST coverage and library range (figure; axis: AS per mRNA (x)). Brett et al. Nature Genet. 30(2002)29

    ESTs | % AS mouse | % AS human | (unlabelled) | (unlabelled)
    100,000 | 7.5 | 9.6 | 1.03 | 0.99
    250,000 | 10.9 | 16.8 | 1.16 | 1.04
    500,000 | 16.9 | 26 | 1.3 | 0.96
    1,000,000 | 24.4 | 36.2 | 1.46 | 1.34
    1,500,000 | 25.8 | 42.6 | 1.48 | 1.36
    1,950,876 | 33 | 44 | 1.37 | 1.45
    3,200,000 | | 45 | | 1.49
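The slide's claim, that the detectable rate of alternative splicing depends on EST coverage, can be checked directly against the numbers in the data sheet above. The sketch below assumes one reading of the flattened sheet: first column number of ESTs, second and third columns % AS for mouse and human, with no mouse value at 3,200,000 ESTs. The helper name `rises_with_coverage` is illustrative.

```python
# (number of ESTs, % AS mouse, % AS human); None where no value is given.
# Column assignment is an assumption about the flattened data sheet.
coverage = [
    (100_000, 7.5, 9.6),
    (250_000, 10.9, 16.8),
    (500_000, 16.9, 26.0),
    (1_000_000, 24.4, 36.2),
    (1_500_000, 25.8, 42.6),
    (1_950_876, 33.0, 44.0),
    (3_200_000, None, 45.0),
]


def rises_with_coverage(values):
    """True if the detected rate never decreases as EST coverage grows."""
    seen = [v for v in values if v is not None]
    return all(a <= b for a, b in zip(seen, seen[1:]))


mouse_ok = rises_with_coverage([mouse for _, mouse, _ in coverage])
human_ok = rises_with_coverage([human for _, _, human in coverage])

print("mouse rate rises with coverage:", mouse_ok)
print("human rate rises with coverage:", human_ok)
```

Under this reading, both species show a detected rate that only grows as more ESTs are sampled, so a difference between species at unequal coverage says little about biology.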

  • Boue et al. Bioessays 03

  • Homology-based predictions of exons and alternative transcripts (www.smart.embl-heidelberg.de). SMART domain DB links to genomes.

  • Top 10 domains* in human: 30% diff.!
    Domain | human | fly | worm
    Immunoglobulin | 765 (381) | 140 | 64
    C2H2 zinc finger | 706 (607) | 357 | 151
    Protein kinase | 575 (501) | 319 | 437
    Rhod.-like GPCR | 569 (616) | 97 | 358
    P-loop NTPase | 433 | 198 | 183
    Rev. transcriptase | 350 | 10 | 50
    RRM (RNA-binding) | 300 (224) | 157 | 96
    WD40 (G-protein) | 277 (136) | 162 | 102
    Ankyrin repeat | 276 (145) | 105 | 107
    Homeobox | 267 (160) | 148 | 109
    Total no. of genes | 26500 (26500) | 13300 | 18200
    * Only no. of genes given, no. of domains higher; note that only around 90% is sequenced.
    Nature 409 (01) 860; Science 291 (01) 1304

  • Metazoan genome annotation: an ongoing process and far from complete. >2000