Comparative genome analysis
www.bork.embl-heidelberg.de/
Peer Bork
EMBL & MDC, Heidelberg & Berlin
Hard data and soft interpretations?
Sequenced eukaryotic genomes
Bork and Copley, Nature 409 (01) 818
Sources of uncertainties (human genome draft)
• Assembly accuracy
• Sequence coverage
• Sequence accuracy
• Polymorphism
• Annotation accuracy
70% prediction accuracy is great!
Prediction of | acc*cov | % acc | % cov of reference set | reference
Human promoters | .35 | 50% | 70% of annotated test set | Prestidge, 1995; Bucher, pers. comm.
Human regulatory RNA elements | .34 | 85% | 40% of new DNA | Dandekar & Sharma, 1998
Human genes (only presence) | .49 | 70% | 70% of chromosome 22 | Dunham et al., 1999 and refs therein
Human SNPs by EST comparison | .21 | 70% | 30% of all proteins with SNP | Sunyaev et al., 2000; Buetow et al., 1999
Human alternative splicing | .45 | 90% | 50% of all splice sites | Hanke et al., 1999
Transmembranes (only presence) | .85 | 85% | 99% of annotated test set | Tusnady & Simon, 1998 and refs therein
Signal peptides (only presence) | .90 | 90% | 100% of annotated test set | Nielsen et al., 1999
GPI anchors (incl. cleavage site) | .72 | 72% | 100% of annotated test set | Eisenhaber et al., 1999
Coiled coil (only presence) | .81 | 90% | 90% of annotated coiled coil | Lupas, 1996
Secondary structure (3 states) | .77 | 77% | 100% of 3D test set | Jones, 1999 and refs therein
Buried or exposed residues | .74 | 74% | 100% of 3D test set | Rost, 1996
Residue hydration | .72 | 72% | 100% of 3D test set | Ehrlich et al., 1998
Protein folds (in Mycoplasma) | .49 | 98% | 50% of Mycoplasma ORFs | Teichmann et al., 1999 and refs therein
Homology (several methods) | .49 | 98% | 50% of 3D test set | Muller et al., 1999 and refs therein
Functional features by homology | .63 | 90% | 70% unicellular genomes | Bork and Koonin, 98; Brenner, 99
Function association by context | .25 | 50% | 10% ‘high confidence’ in yeast | Marcotte et al., 1999b
Cellular localization (2 states) | .77 | 77% | 100% of annotated test set | Andrade et al., 1998
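The acc*cov column in the table above reads as the product of accuracy and coverage, both taken as fractions. A minimal sketch of that arithmetic follows; the function name `combined_score` is hypothetical, and the three (accuracy, coverage) pairs are copied from table rows purely for illustration.

```python
def combined_score(accuracy, coverage):
    """Product of accuracy and coverage, both given as fractions in [0, 1]."""
    if not (0.0 <= accuracy <= 1.0 and 0.0 <= coverage <= 1.0):
        raise ValueError("accuracy and coverage must be fractions between 0 and 1")
    return accuracy * coverage


# (accuracy, coverage) pairs copied from three rows of the table.
rows = {
    "Human promoters": (0.50, 0.70),
    "Human genes (only presence)": (0.70, 0.70),
    "Signal peptides (only presence)": (0.90, 1.00),
}

# Round to two decimals, the precision used on the slide.
scores = {name: round(combined_score(a, c), 2) for name, (a, c) in rows.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

A method with 70% accuracy that covers only 70% of the reference set is therefore right for just under half of it, which is the point of the slide title.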
Comparative genome analysis
Prediction of genes and pseudogenes
Homology-based function prediction
Context-based function prediction
1. Co-occurrence of genes
2. Gene neighbourhood
Number of human genes in time
[Figure: estimated number of human genes in thousands (axis 0 to 120) from Feb 00 to Apr 01. Series: Textbooks, public opinion; HGS, Incyte and co; HGS; Celera; HGP; others. Data labels: 38, 32, 52, 39, 27, 24, 22. Annotation: basis for Feb 01 publications. Overlaid: NEMAX50 index (axis 2T to 10T).]
Hunting for pseudogenes: homology search of all human intergenic regions
HUMAN GENOME
- Masking for repetitive elements and ENSEMBL sequences
- Filtering of query and database for Low Complexity Regions
- BLASTX vs nr95 prot. db. (cutoff E < e-8)
- Merging and extension of fragments
- Construction of gene structure
- Removal of all virus derived sequences
- BLASTX vs ENSEMBL database
Counts shown along the pipeline: 3.3·10^9 nucleotides; 1.4·10^6 DNA fragments; 4.4·10^4 DNA fragments; 3.6·10^4 DNA fragments; 2.3·10^4 DNA fragments
Result: 12526 elements (pseudogenes or genes) with sequence similarity to known proteins
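The E-value cutoff and the fragment-merging step listed above can be sketched as below. This is a minimal illustration, not the authors' implementation: the `Hit` record is hypothetical, "E < e-8" is read here as an E-value below 1e-8, and `max_gap` is an illustrative parameter that does not appear on the slides.

```python
from dataclasses import dataclass

# Assumption: the slide's "cutoff E < e-8" is read as E-value < 1e-8.
E_VALUE_CUTOFF = 1e-8


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one BLASTX hit on a genomic sequence."""
    start: int       # first matched nucleotide (inclusive)
    end: int         # last matched nucleotide (inclusive)
    e_value: float


def filter_hits(hits, cutoff=E_VALUE_CUTOFF):
    """Keep only hits whose E-value is strictly below the cutoff."""
    return [h for h in hits if h.e_value < cutoff]


def merge_fragments(hits, max_gap=0):
    """Merge hits that overlap, touch, or lie within max_gap nucleotides.

    max_gap is an illustrative parameter, not a value from the slides.
    Returns (start, end) tuples sorted by start.
    """
    merged = []
    for h in sorted(hits, key=lambda hit: hit.start):
        if merged and h.start <= merged[-1][1] + max_gap + 1:
            # Extend the current fragment so that it also covers this hit.
            merged[-1] = (merged[-1][0], max(merged[-1][1], h.end))
        else:
            merged.append((h.start, h.end))
    return merged


# Made-up example hits, for illustration only.
hits = [
    Hit(100, 250, 1e-20),
    Hit(200, 400, 1e-12),
    Hit(900, 1000, 1e-3),   # fails the cutoff and is dropped
    Hit(2000, 2100, 1e-9),
]
fragments = merge_fragments(filter_hits(hits))
print(fragments)
```

Sorting by start position first lets a single pass merge all overlapping hits, since each hit only needs to be compared with the most recent fragment.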
Synonymous/non-synonymous (dS/dN) substitution rates of functional and pseudogenic human sequences
[Figure: histogram. x-axis: log (dS/dN), 0 to >1.2 in steps of 0.1; y-axis: % of sequences, 0 to 30. Series: Pseudogenes reference set (856 seq.), SWISSPROT (1935 seq.), RefSeq (1103 seq.).]
Synonymous/non-synonymous (dS/dN) substitution rates of unannotated regions with homology to known genes
[Figure: histogram. x-axis: log (dS/dN), 0 to >1.2 in steps of 0.1; y-axis: % of sequences, 0 to 16. Regions marked: PSEUDOGENES, UNCERTAIN, GENES.]
Analyzed = 3712: 1858 (50%) pseudogenes, 1161 (31%) uncertain, 693 (19%) genes
Total = 12526: 8205, 4321
693 novel genes detected; >4300 expected in our set
E value distribution of pseudogenic, uncertain and functional exons
3712 sequences (BLASTX vs nr95 database)
[Figure: histogram. x-axis: E value, down to < e-180; y-axis: counts, 0 to 300. Series: pseudogenes, uncertain, functional.]
Homology-based function prediction
Mycoplasma pneumoniae predictions
[Figure: bar chart, axis 0 to 100. Bars: 1995 Function, 1995 Structure, 1999 Function, 1999 Structure. Legend: Homology, Twilight, Context, Folds, fold twilight.]
Molecular functions have to be defined on a domain basis, i.e. separately for each structurally independent unit within a sequence.
Henikoff et al. 1997 Science 278, 609
SMART: Blast-like input
- ID or AC sufficient
- Access to different databases
- Domain annotation
www.smart.embl-heidelberg.de
SMART: Digested output
- signal sequence
- transmembrane regions
- comparison of domain context
Domain organization of TAP
[Figure: domain architecture of TAP (619 aa; scale bar 100 aa). Domain labels: LRR (four times), NTF2-like, UBA. Region labels: RNA-binding, p15-binding, np-bind. Marked regions: directed mutagenesis, random mutagenesis.]
Collaboration with Elisa Izaurralde
Directed mutagenesis confirms predicted TAP/p15 interaction
[Figure: NTF2-like domain and p15. Red - loss of binding; Blue - no effect on binding; Gray - alanine scan.]
Top 10 domains* in human
Species | man | fly | worm | yeast | cress
Immunoglobulin | 765(381) | 140 | 64 | 0 | 1
C2H2 zinc finger | 706(607) | 357 | 151 | 48 | 115
Protein kinase | 575(501) | 319 | 437 | 121 | 1049
Rhod.-like GPCR | 569(616) | 97 | 358 | 0 | 16
P-loop NTPase | 433 | 198 | 183 | 97 | 331
Rev. transcriptase | 350 | 10 | 50 | 6 | 80
RRM (RNA-binding) | 300(224) | 157 | 96 | 54 | 255
WD40 (G-protein) | 277(136) | 162 | 102 | 91 | 210
Ankyrin repeat | 276(145) | 105 | 107 | 19 | 120
Homeobox | 267(160) | 148 | 109 | 9 | 118
Total no genes | 26500(26500) | 13300 | 18200 | 6100 | 25700
*Only no of genes given, no of domains higher; note that only around 90% is sequenced
Nature 409 (01) 860; Science 291 (01) 1304
Context-based function prediction: 1. Co-occurrence of genes
Function prediction via genomic context information
Gene context:
- Gene fusion as distinct neighborhood subset
- Conserved gene neighborhood in genomes
- Conserved co-occurrence of genes in species (‘phylogenetic profile’, ‘COG pattern’)
- Surrounding and shared regulatory elements
Knowledge-based context:
- Pathway data (can overrule homology!)
- Gene expression data (co-expression etc.)
- Protein interaction / localisation
- Scientific literature
Context methods in Mycoplasma: fusion, neighborhood, co-occurrence
MG total: 480 genes
Presence in conserved operons: 213
[Figure: overlap diagram of three methods: Fusion, Conserved neighborhood, Co-occurrence in genomes. Values shown: 27, 54, 178.]
Orthology vs paralogy … within homology
[Figure: Genome A carries gene A1 and gene A2; Genome B carries gene B1 and gene B2. A history tree links gene, gene 1, gene 2, gene A1, gene B1, gene A2 and gene B2; orthology and paralogy relationships are marked.]
Exploiting the absence of genes
Huynen et al., 1998, FEBS Lett 426, 1-5
Predicting functional interactions between proteins by the co-occurrence of their genes in genomes
Distribution of four M. genitalium genes among 25 genomes
MG299 (pta)  0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1
MG357 (ackA) 0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1
MG019 (dnaJ) 0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1
MG305 (dnaK) 0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1
Using the mutual information between genes as a scoring heuristic for their co-occurrence.
M(pta, ackA) = 0.69 (phosphotransacetylase, acetate kinase)
M(dnaJ, dnaK) = 0.55 (heat shock proteins)
M(dnaJ, ackA) = 0.19
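The scores above can be reproduced from the four presence/absence profiles with a plain plug-in estimate of mutual information. The sketch below assumes the natural logarithm (nats) and simple frequency counts; the slide states neither, but that choice matches the three quoted values. The helper names `mutual_information` and `parse` are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(profile_x, profile_y):
    """Mutual information (in nats) between two equal-length 0/1 profiles."""
    if len(profile_x) != len(profile_y):
        raise ValueError("profiles must have the same length")
    n = len(profile_x)
    joint = Counter(zip(profile_x, profile_y))
    count_x = Counter(profile_x)
    count_y = Counter(profile_y)
    mi = 0.0
    for (x, y), n_xy in joint.items():
        # p(x,y) * ln(p(x,y) / (p(x) * p(y))); unobserved combinations
        # never appear in `joint`, so they contribute nothing.
        mi += (n_xy / n) * log(n_xy * n / (count_x[x] * count_y[y]))
    return mi


def parse(profile):
    """Turn a space-separated string of 0/1 flags into a list of ints."""
    return [int(flag) for flag in profile.split()]


# Profiles copied from the slide (presence of each gene in 25 genomes).
pta = parse("0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1")
ackA = parse("0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1")
dnaJ = parse("0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1")
dnaK = parse("0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1")

print(f"M(pta, ackA)  = {mutual_information(pta, ackA):.2f}")
print(f"M(dnaJ, dnaK) = {mutual_information(dnaJ, dnaK):.2f}")
print(f"M(dnaJ, ackA) = {mutual_information(dnaJ, ackA):.2f}")
```

Identical profiles score their own entropy, so pta/ackA (present in 12 of 25 genomes) score higher than dnaJ/dnaK (present in 19 of 25): a nearly ubiquitous gene carries less information even when its partner matches it exactly.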
Phylogenetic distribution of iron-sulfur cluster assembly proteins
[Figure: presence/absence matrix. Rows: hscB Jac1; hscA; ssq1; Nfu1; iscA Isa1-2; fdx Yah1; Arh1; ORF1; ORF2; ORF3; iscS Nfs1; iscU Isu1-2; cyaY Yfh1 (frataxin). Columns: A. aeolicus, Synechocystis, B. subtilis, M. genitalium, M. tuberculosis, D. radiodurans, R. prowazekii, C. crescentus, M. loti, N. meningitidis, X. fastidiosa, P. aeruginosa, Buchnera, V. cholerae, H. influenzae, P. multocida, E. coli, A. pernix, M. jannaschii, A. thaliana, S. cerevisiae, C. jejuni, C. albicans, S. pombe, H. sapiens, C. elegans, H. pylori, D. melan.]
The phylogenetic distribution of cyaY (frataxin) is identical to that of hscB/Jac1, indicating a functional role of cyaY in iron-sulfur cluster assembly on proteins, specifically in conjunction with Jac1.
Huynen et al., Hum. Mol. Genet. 2001
Context-based function prediction: 2. Gene neighbourhood
Conservation of gene neighborhood
Pairwise comparison of 20 prokaryotic genomes
[Figure: genome alignment; conservation of gene neighborhood on a log scale against time. Labelled genome pairs: EC-HI, MG-MP.]
Nucleotide salvage/degradation pathway in gram-positive bacteria
STRING server for context retrieval
www.bork.embl-heidelberg.de/STRING
Tryptophan biosynthesis
Gene neighborhood reflects connections between Tryptophan and Shikimate biosynthesis
Modularity in “genomic association space”
[Figure: gene network. Nodes: hemK, tyrA, aroB, aroE, aroC, asd, truA, hyp, hyp, 2c-rr, trpF, trpC, trpA, trpB, trpD, trpG, trpE. Labelled groups: Tryptophan synthesis pathway; Shikimate pathway.]
Networks based on conserved gene neighborhood reveal ‘natural’ subsystems
(pseudo)genes
Yan Yuan
Mikita Suyama
David Torrents
Rich Copley
Ivica Letunic
SMART
*Martijn (NL), *Frank (D), Yan (C), Peer (D), Tobias (D), *Luis (E), *Jörg (D), Berend (NL), Warren (US), Miguel (E), Shamil (RU), *Birgit (D), Mikita (J), Richard (UK), *Vassily (RU), *Gert (D)
David (E), Ivica (Hr), Caroline (E), Steffen (D), Francesca (I), Jan (D), Parantu (In), Christian (D)
*left EMBL