Comparative genome analysis: hard data and soft interpretations?

Peer Bork
EMBL & MDC, Heidelberg & Berlin
http://www.bork.embl-heidelberg.de/

Sequenced eukaryotic genomes

Bork and Copley, Nature 409 (01) 818


Sources of uncertainties (human genome draft)

• Assembly accuracy
• Sequence coverage
• Sequence accuracy
• Polymorphism
• Annotation accuracy

70% prediction accuracy is great!

Prediction of | acc*cov | % acc | % cov of reference set | Reference
Human promoters | .35 | 50% | 70% of annotated test set | Prestidge, 1995; Bucher, pers. comm.
Human regulatory RNA elements | .34 | 85% | 40% of new DNA | Dandekar & Sharma, 1998
Human genes (only presence) | .49 | 70% | 70% of chromosome 22 | Dunham et al., 1999 and refs therein
Human SNPs by EST comparison | .21 | 70% | 30% of all proteins with SNP | Sunyaev et al., 2000; Buetow et al., 1999
Human alternative splicing | .45 | 90% | 50% of all splice sites | Hanke et al., 1999
Transmembranes (only presence) | .85 | 85% | 99% of annotated test set | Tusnady & Simon, 1998 and refs therein
Signal peptides (only presence) | .90 | 90% | 100% of annotated test set | Nielsen et al., 1999
GPI anchors (incl. cleavage site) | .72 | 72% | 100% of annotated test set | Eisenhaber et al., 1999
Coiled coil (only presence) | .81 | 90% | 90% of annotated coiled coil | Lupas, 1996
Secondary structure (3 states) | .77 | 77% | 100% of 3D test set | Jones, 1999 and refs therein
Buried or exposed residues | .74 | 74% | 100% of 3D test set | Rost, 1996
Residue hydration | .72 | 72% | 100% of 3D test set | Ehrlich et al., 1998
Protein folds (in Mycoplasma) | .49 | 98% | 50% of Mycoplasma ORFs | Teichmann et al., 1999 and refs therein
Homology (several methods) | .49 | 98% | 50% of 3D test set | Muller et al., 1999 and refs therein
Functional features by homology | .63 | 90% | 70% unicellular genomes | Bork and Koonin, 98; Brenner, 99
Function association by context | .25 | 50% | 10% ‘high confidence’ in yeast | Marcotte et al., 1999b
Cellular localization (2 states) | .77 | 77% | 100% of annotated test set | Andrade et al., 1998
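The acc*cov column reads as the product of accuracy and coverage. A minimal sketch of that arithmetic for a few rows follows. The row selection and the names `rows` and `combined_score` are illustrative assumptions and do not come from the slides.

```python
# A few rows copied from the table: (feature, accuracy, coverage) as fractions.
rows = [
    ("Human promoters", 0.50, 0.70),
    ("Human genes (only presence)", 0.70, 0.70),
    ("Human alternative splicing", 0.90, 0.50),
    ("Signal peptides (only presence)", 0.90, 1.00),
    ("Protein folds (in Mycoplasma)", 0.98, 0.50),
]


def combined_score(accuracy, coverage):
    """Return accuracy times coverage (hypothetical helper for the acc*cov column)."""
    return accuracy * coverage


# Round to two decimals, the precision used in the table.
scores = {name: round(combined_score(acc, cov), 2) for name, acc, cov in rows}

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

A method that is 98% accurate but covers only half of the reference set scores the same as one that is 70% accurate on 70% of it, which is the point of reporting both numbers.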


Comparative genome analysis

• Prediction of genes and pseudogenes
• Homology-based function prediction
• Context-based function prediction: 1. Co-occurrence of genes
• Context-based function prediction: 2. Gene neighbourhood


Number of human genes in time

[Figure: estimates of the number of human genes in thousands (axis 0 to 120) from Feb 00 to Apr 01, with series for HGS, Incyte and co; textbooks and public opinion; Celera; and HGP, plotted together with the NEMAX50 index (axis 2T to 10T, series HGS and others). Data labels as extracted: 38 32; 5239; 27 24 22. Annotation: basis for Feb 01 publications.]

Homology search of all human intergenic regions

HUMAN GENOME

1. Masking for repetitive elements and ENSEMBL sequences
2. Filtering of query and database for low-complexity regions
3. BLASTX vs nr95 prot. db. (cutoff E < e-8)
4. Merging and extension of fragments
5. Construction of gene structure
6. Removal of all virus-derived sequences
7. BLASTX vs ENSEMBL database

Counts shown: 3.3·10^9 nucleotides; 1.4·10^6 DNA fragments; 12526 elements (pseudogenes or genes) with sequence similarity to known proteins
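Two of the steps in this search, the E value cutoff and the merging of fragments, can be sketched as below. This is an illustration only and not the pipeline used for the slides. The `Hit` record, the `MAX_GAP` distance and the sample coordinates are all hypothetical, and only the cutoff E < e-8 is taken from the slide.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    region: str    # identifier of the intergenic query region
    start: int     # start of the match on the query (1-based)
    end: int       # end of the match on the query (inclusive)
    evalue: float  # E value reported for the match


E_CUTOFF = 1e-8  # cutoff stated on the slide
MAX_GAP = 100    # hypothetical: largest gap (nt) bridged when merging


def filter_hits(hits, cutoff=E_CUTOFF):
    """Keep only matches with an E value below the cutoff."""
    return [h for h in hits if h.evalue < cutoff]


def merge_hits(hits, max_gap=MAX_GAP):
    """Merge matches on the same region that overlap or lie within max_gap."""
    merged = []
    for h in sorted(hits, key=lambda h: (h.region, h.start)):
        if merged and merged[-1][0] == h.region and h.start - merged[-1][2] <= max_gap:
            region, start, end = merged[-1]
            merged[-1] = (region, start, max(end, h.end))
        else:
            merged.append((h.region, h.start, h.end))
    return merged


# Made-up matches, for illustration only.
hits = [
    Hit("region1", 100, 250, 1e-20),
    Hit("region1", 300, 420, 1e-12),
    Hit("region1", 5000, 5100, 1e-3),  # fails the E value cutoff
    Hit("region2", 10, 90, 1e-9),
]

fragments = merge_hits(filter_hits(hits))
print(fragments)
```

Sorting by region and start position first means that each match only has to be compared with the most recently merged fragment.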

Hunting for pseudogenes: synonymous/non-synonymous (dS/dN) substitution rates of functional and pseudogenic human sequences

[Figure: histogram of % of sequences (axis 0 to 30) against log (dS/dN) (bins 0 to >1.2) for the pseudogenes reference set (856 seq.), SWISSPROT (1935 seq.) and RefSeq (1103 seq.).]

Synonymous/non-synonymous (dS/dN) substitution rates of unannotated regions with homology to known genes

[Figure: histogram of % of sequences (axis 0 to 16) against log (dS/dN) (bins 0 to >1.2), divided into PSEUDOGENES, UNCERTAIN and GENES.]

Analyzed = 3712: 1858 (50%) pseudogenes, 1161 (31%) uncertain, 693 (19%) genes
Total = 12526: 8205 pseudogenes, 4321 genes

693 novel genes detected; >4300 expected in our set
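The percentages on this slide can be checked with a few lines of arithmetic. The sketch assumes that the analyzed set of 3712 regions splits into 1858, 1161 and 693, and that these counts belong to the pseudogene, uncertain and gene classes in that order. That assignment is a reading of the slide layout rather than something the slide states.

```python
# Counts read from the slide; the class assignment is an assumption.
analyzed = {"pseudogenes": 1858, "uncertain": 1161, "genes": 693}

total = sum(analyzed.values())

# Share of each class in percent, rounded to whole numbers as on the slide.
shares = {label: round(100 * count / total) for label, count in analyzed.items()}

print(total)
print(shares)
```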

E value distribution of pseudogenic, uncertain and functional exons (BLASTX vs nr95 database)

[Figure: number of exons (axis 0 to 300) per E value bin, down to < e-180, for pseudogenes, uncertain and functional sequences; 3712 sequences.]









Mycoplasma pneumoniae predictions

[Figure: stacked bars (axis 0 to 100) for function and structure predictions in 1995 and 1999; categories: homology, twilight, context, folds, fold twilight.]

Molecular functions have to be defined on a domain basis, i.e. separately for each structurally independent unit within a sequence

Henikoff et al. 1997 Science 278, 609

SMART (www.smart.embl-heidelberg.de)

Blast-like input
- ID or AC sufficient
- Access to different databases
- Domain annotation

Digested output
- signal sequence
- transmembrane regions
- comparison of domain context


Domain organization of TAP

[Figure: domain diagram of TAP (619 aa, scale bar 100 aa) showing four LRR repeats, an NTF2-like domain and a UBA domain; region labels: RNA-binding, p15-binding, np-bind.; regions probed by directed mutagenesis and random mutagenesis are marked.]

Collaboration with Elisa Izaurralde

Directed mutagenesis confirms predicted TAP/p15 interaction

[Figure: NTF2-like domain with p15. Red - loss of binding; Blue - no effect on binding; Gray - alanine scan]

Top 10 domains* in human

Domain | man | fly | worm | yeast | cress
Immunoglobulin | 765(381) | 140 | 64 | 0 | 1
C2H2 zinc finger | 706(607) | 357 | 151 | 48 | 115
Protein kinase | 575(501) | 319 | 437 | 121 | 1049
Rhod.-like GPCR | 569(616) | 97 | 358 | 0 | 16
P-loop NTPase | 433 | 198 | 183 | 97 | 331
Rev. transcriptase | 350 | 10 | 50 | 6 | 80
RRM (RNA-binding) | 300(224) | 157 | 96 | 54 | 255
WD40 (G-protein) | 277(136) | 162 | 102 | 91 | 210
Ankyrin repeat | 276(145) | 105 | 107 | 19 | 120
Homeobox | 267(160) | 148 | 109 | 9 | 118
Total no genes | 26500(26500) | 13300 | 18200 | 6100 | 25700

*Only no of genes given, no of domains higher; note that only around 90% is sequenced

Nature 409 (01) 860; Science 291 (01) 1304








Context-based function prediction: 1. Co-occurrence of genes


Function prediction via genomic context information

Gene context:
- Gene fusion as distinct neighborhood subset
- Conserved gene neighborhood in genomes
- Conserved co-occurrence of genes in species (‘phylogenetic profile’, ‘COG pattern’)
- Surrounding and shared regulatory elements

Knowledge-based context:
- Pathway data (can overrule homology!)
- Gene expression data (co-expression etc.)
- Protein interaction / localisation
- Scientific literature


Context methods in Mycoplasma: fusion, neighborhood, co-occurrence

MG total: 480 genes
Presence in conserved operons: 213

[Figure: Venn diagram of the sets conserved neighborhood, fusion and co-occurrence in genomes; values shown: 27, 54, 178.]


Orthology vs paralogy … within homology

[Figure: gene history tree. Labels: gene, gene 1, gene 2; Genome A with gene A1 and gene A2; Genome B with gene B1 and gene B2; relationships marked as orthology and paralogy.]

Exploiting the absence of genes

www.bork.embl-heidelberg.de
Huynen et al., 1998, FEBS Lett 426, 1-5


Predicting functional interactions between proteins by the co-occurrence of their genes in genomes

Distribution of four M. genitalium genes among 25 genomes

MG299 (pta)  0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1
MG357 (ackA) 0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1
MG019 (dnaJ) 0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1
MG305 (dnaK) 0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1

Using the mutual information between genes as a scoring heuristic for their co-occurrence.

M(pta, ackA) = 0.69 (phosphotransacetylase, acetate kinase)
M(dnaJ, dnaK) = 0.55 (heat shock proteins)
M(dnaJ, ackA) = 0.19
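The scores above can be reproduced from the four presence/absence profiles. The sketch below computes mutual information from empirical frequencies with the natural logarithm. The choice of logarithm base is an assumption, made because it matches the three values on the slide, and the function name `mutual_information` is illustrative.

```python
from collections import Counter
from math import log

# Presence (1) or absence (0) of each gene in the 25 genomes, copied from the slide.
profiles = {
    "pta":  "0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1".split(),
    "ackA": "0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1".split(),
    "dnaJ": "0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1".split(),
    "dnaK": "0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1".split(),
}


def mutual_information(x, y):
    """Mutual information (in nats) between two equally long profiles."""
    n = len(x)
    count_x, count_y = Counter(x), Counter(y)
    count_xy = Counter(zip(x, y))
    # Sum p(a,b) * ln(p(a,b) / (p(a) * p(b))) over the observed combinations.
    return sum(
        (c / n) * log((c * n) / (count_x[a] * count_y[b]))
        for (a, b), c in count_xy.items()
    )


for g1, g2 in [("pta", "ackA"), ("dnaJ", "dnaK"), ("dnaJ", "ackA")]:
    print(f"M({g1}, {g2}) = {mutual_information(profiles[g1], profiles[g2]):.2f}")
# prints:
# M(pta, ackA) = 0.69
# M(dnaJ, dnaK) = 0.55
# M(dnaJ, ackA) = 0.19
```

Identical profiles score their own entropy, so pta/ackA (12 of 25 genomes) scores higher than dnaJ/dnaK (19 of 25 genomes). A gene pair present in nearly all genomes carries little information even when the profiles match perfectly.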

Phylogenetic distribution of iron-sulfur cluster assembly proteins

[Figure: presence/absence matrix. Protein labels: iscS Nfs1, iscU Isu1-2, iscA Isa1-2, hscB Jac1, hscA, ssq1, fdx Yah1, Nfu1, Arh1, ORF1, ORF2, ORF3, cyaY Yfh1 (frataxin). Species labels: A.aeolicus, Synechocystis, B.subtilis, M.genitalium, M.tuberculosis, D.radiodurans, R.prowazekii, C.crescentus, M.loti, N.meningitidis, X.fastidiosa, P.aeruginosa, Buchnera, V.cholerae, H.influenzae, P.multocida, E.coli, A.pernix, M.jannaschii, A.thaliana, S.cerevisiae, C.jejuni, C.albicans, S.pombe, H.sapiens, C.elegans, H.pylori, D.melan.]

The phylogenetic distribution of cyaY (frataxin) is identical to that of hscB/Jac1, indicating a functional role of cyaY in iron-sulfur cluster assembly on proteins, specifically in conjunction with Jac1.

Huynen et al., Hum. Mol. Genet. 2001







Context-based function prediction: 2. Gene neighbourhood

Conservation of gene neighborhood

[Figure: genome alignment; pairwise comparison of 20 prokaryotic genomes, with axes labelled (time) and (log); marked genome pairs: EC-HI, MG-MP.]

STRING server for context retrieval

www.bork.embl-heidelberg.de/STRING

[Figures: STRING output for the nucleotide salvage/degradation pathway in gram-positive bacteria and for tryptophan biosynthesis.]


Gene neighborhood reflects connections between tryptophan and shikimate biosynthesis

[Figure: gene neighborhood network with nodes hemK, tyrA, aroB, aroE, aroC, asd, truA, hyp, hyp, 2c-rr, trpF, trpC, trpA, trpB, trpD, trpG, trpE.]

Modularity in “genomic association space”

[Figure: network modules labelled tryptophan synthesis pathway and shikimate pathway.]

Networks based on conserved gene neighborhood reveal ‘natural’ subsystems

(pseudo)genes
Yan Yuan
Mikita Suyama
David Torrents

Rich Copley

Ivica Letunic
SMART


*Martijn (NL), *Frank (D), Yan (C), Peer (D), Tobias (D), *Luis (E), *Jörg (D), Berend (NL), Warren (US), Miguel (E), Shamil (RU), *Birgit (D), Mikita (J), Richard (UK), *Vassily (RU), *Gert (D), David (E), Ivica (Hr), Caroline (E), Steffen (D), Francesca (I), Jan (D), Parantu (In), Christian (D)

*left EMBL
