
Comparative genome analysis. Peer Bork (bork@embl-heidelberg.de), EMBL & MDC, Heidelberg & Berlin. Hard data and soft interpretations?


Page 1: Comparative genome analysis bork@embl-heidelberg.de  Peer Bork EMBL & MDC Heidelberg & Berlin Hard data and soft interpretations?

Comparative genome analysis

bork@embl-heidelberg.de, http://www.bork.embl-heidelberg.de/

Peer Bork, EMBL & MDC

Heidelberg & Berlin

Hard data and soft interpretations?

Page 2:

www.bork.embl-heidelberg.de

Sequenced eukaryotic genomes

Bork and Copley, Nature 409 (01) 818

Page 3:

Sources of uncertainties (human genome draft)

• Assembly accuracy
• Sequence coverage
• Sequence accuracy
• Polymorphism
• Annotation accuracy

Page 4:

70% prediction accuracy is great!

Prediction of | acc*cov | % acc | % cov of reference set | Reference
Human promoters | .35 | 50% | 70% of annotated test set | Prestidge, 1995; Bucher, pers. comm.
Human regulatory RNA elements | .34 | 85% | 40% of new DNA | Dandekar & Sharma, 1998
Human genes (only presence) | .49 | 70% | 70% of chromosome 22 | Dunham et al., 1999 and refs therein
Human SNPs by EST comparison | .21 | 70% | 30% of all proteins with SNP | Sunyaev et al., 2000; Buetow et al., 1999
Human alternative splicing | .45 | 90% | 50% of all splice sites | Hanke et al., 1999
Transmembranes (only presence) | .85 | 85% | 99% of annotated test set | Tusnady & Simon, 1998 and refs therein
Signal peptides (only presence) | .90 | 90% | 100% of annotated test set | Nielsen et al., 1999
GPI anchors (incl cleavage site) | .72 | 72% | 100% of annotated test set | Eisenhaber et al., 1999
Coiled coil (only presence) | .81 | 90% | 90% of annotated coiled coil | Lupas, 1996
Secondary structure (3 states) | .77 | 77% | 100% of 3D test set | Jones, 1999 and refs therein
Buried or exposed residues | .74 | 74% | 100% of 3D test set | Rost, 1996
Residue hydration | .72 | 72% | 100% of 3D test set | Ehrlich et al., 1998
Protein folds (in Mycoplasma) | .49 | 98% | 50% of Mycoplasma ORFs | Teichmann et al., 1999 and refs therein
Homology (several methods) | .49 | 98% | 50% of 3D test set | Muller et al., 1999 and refs therein
Functional features by homology | .63 | 90% | 70% unicellular genomes | Bork and Koonin, 98; Brenner, 99
Function association by context | .25 | 50% | 10% ‘high confidence’ in yeast | Marcotte et al., 1999b
Cellular localization (2 states) | .77 | 77% | 100% of annotated test set | Andrade et al., 1998
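
The acc*cov column appears to be the product of the accuracy and coverage columns (for example, 70% accuracy at 70% coverage gives .49 for human genes). A minimal Python sketch of that arithmetic, using a few rows copied from the table; the name `combined_score` is an illustrative choice and does not come from the slides:

```python
# Rows copied from the table: (prediction target, accuracy, coverage).
ROWS = [
    ("Human genes (only presence)", 0.70, 0.70),
    ("Homology (several methods)", 0.98, 0.50),
    ("Functional features by homology", 0.90, 0.70),
    ("Signal peptides (only presence)", 0.90, 1.00),
]


def combined_score(accuracy: float, coverage: float) -> float:
    """Product of accuracy and coverage, as in the acc*cov column."""
    return accuracy * coverage


for name, acc, cov in ROWS:
    print(f"{name}: {combined_score(acc, cov):.2f}")
```

Read this way, a method with 98% accuracy that only covers half of the reference set scores the same as one with 70% accuracy and 70% coverage.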

Page 5:

Comparative genome analysis

Prediction of genes and pseudogenes

Homology-based function prediction

Context-based function prediction
1. Co-occurrence of genes
2. Gene neighbourhood

Page 6:

Number of human genes in time

[Line chart: x-axis from Feb00 to Apr01 (ticks Feb00, Aug00, Oct00, Dec00, Feb01, Apr01); left y-axis number of human genes in thousands, 0 to 120; right y-axis NEMAX50 index, 2T to 10T. Series labels: HGS, Incyte and co; Textbooks, public opinion; Celera; HGP; HGS; others. Data labels as extracted: 38, 32, 5239, 27, 24, 22. Annotation: Basis for Feb 01 publications.]

Page 7:

Hunting for pseudogenes: homology search of all human intergenic regions

HUMAN GENOME

1. Masking for repetitive elements and ENSEMBL sequences
2. Filtering of query and database for Low Complexity Regions
3. BLASTX vs nr95 prot. db. (cutoff E < e-8)
4. Merging and extension of fragments
5. Construction of gene structure
6. Removal of all virus derived sequences
7. BLASTX vs ENSEMBL database

Counts along the pipeline: 3.3·10^9 nucleotides, 1.4·10^6 DNA fragments, 4.4·10^4 DNA fragments, 3.6·10^4 DNA fragments, 2.3·10^4 DNA fragments

Result: 12526 elements (pseudogenes or genes) with sequence similarity to known proteins
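
The E-value cutoff and the fragment-merging step of this pipeline can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Hit` record, the function name `filter_and_merge` and the `max_gap` parameter are hypothetical, and the slide does not say how fragments were merged or extended.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One BLASTX hit on the genome (hypothetical record layout)."""
    contig: str
    start: int
    end: int
    evalue: float


def filter_and_merge(hits, max_evalue=1e-8, max_gap=0):
    """Keep hits with E < max_evalue, then merge hits on the same contig
    that overlap or lie within max_gap bases of each other."""
    kept = sorted(
        (h for h in hits if h.evalue < max_evalue),
        key=lambda h: (h.contig, h.start),
    )
    merged = []
    for h in kept:
        if merged and merged[-1][0] == h.contig and h.start <= merged[-1][2] + max_gap:
            contig, start, end = merged[-1]
            merged[-1] = (contig, start, max(end, h.end))  # extend the fragment
        else:
            merged.append((h.contig, h.start, h.end))
    return merged


# Invented example hits, for illustration only.
example_hits = [
    Hit("chr1", 100, 200, 1e-20),
    Hit("chr1", 180, 300, 1e-10),
    Hit("chr1", 1000, 1100, 1e-3),  # fails the E-value cutoff
    Hit("chr2", 50, 80, 1e-9),
]
print(filter_and_merge(example_hits))
```

The cutoff of 1e-8 mirrors the "E < e-8" given on the slide; everything else about the merge rule is an assumption.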

Page 8:

Synonymous/non-synonymous (dS/dN) substitution rates of functional and pseudogenic human sequences

[Histogram: x-axis log (dS/dN), binned from 0 to >1.2 in steps of 0.1; y-axis % of sequences, 0 to 30. Series: Pseudogenes reference set (856 seq.), SWISSPROT (1935 seq.), RefSeq (1103 seq.)]

Page 9:

Synonymous/non-synonymous (dS/dN) substitution rates of unannotated regions with homology to known genes

[Histogram: x-axis log (dS/dN), binned from 0 to >1.2 in steps of 0.1; y-axis % of sequences, 0 to 16. Regions marked PSEUDOGENES, UNCERTAIN, GENES]

Analyzed = 3712: pseudogenes 1858 (50%), uncertain 1161 (31%), genes 693 (19%)

Total = 12526: 8205, 4321

693 novel genes detected; >4300 expected in our set
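
The three-way split into pseudogenes, uncertain and genes rests on the idea that functional genes are under purifying selection (dS well above dN), while pseudogenes drift neutrally (dS close to dN, so log (dS/dN) near 0). A rough sketch of such a classifier follows. The two thresholds and the use of a base-10 logarithm are assumptions made for illustration; the slides do not state the cutoffs or the log base.

```python
import math


def classify_by_ds_dn(ds: float, dn: float,
                      pseudo_max: float = 0.2,   # hypothetical cutoff
                      gene_min: float = 0.6) -> str:  # hypothetical cutoff
    """Classify a sequence from its synonymous (ds) and non-synonymous (dn)
    substitution rates using log10(ds/dn)."""
    if ds <= 0 or dn <= 0:
        return "uncertain"  # ratio undefined, no call possible
    score = math.log10(ds / dn)
    if score <= pseudo_max:
        return "pseudogene"
    if score >= gene_min:
        return "gene"
    return "uncertain"


print(classify_by_ds_dn(0.30, 0.03))  # strong excess of synonymous changes
print(classify_by_ds_dn(0.10, 0.10))  # neutral-looking
print(classify_by_ds_dn(0.20, 0.08))  # in between
```

In practice the cutoffs would be calibrated on reference sets such as the pseudogene, SWISSPROT and RefSeq sets shown on the previous slide.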

Page 10:

E value distribution of pseudogenic, uncertain and functional exons (BLASTX vs nr95 database)

[Histogram of 3712 sequences: x-axis E value, from < e-18 to e-2; y-axis count, 0 to 300. Series: pseudogenes, uncertain, functional]

Page 11:

Comparative genome analysis

Homology-based function prediction

Page 12:

Mycoplasma pneumoniae predictions

[Bar chart, scale 0 to 100. Bars: 1995 Function, 1995 Structure, 1999 Function, 1999 Structure. Legend: Homology, Twilight, Context, Folds, fold twilight.]

Page 13:

Molecular functions have to be defined on a domain basis, i.e. separately for each structurally independent unit within a sequence

Henikoff et al. 1997 Science 278, 609

Page 14:

Page 15:

SMART

Blast-like input

- ID or AC sufficient
- Access to different databases
- Domain annotation

www.smart.embl-heidelberg.de

Page 16:

SMART

Digested output

- signal sequence
- transmembrane regions
- comparison of domain context

www.smart.embl-heidelberg.de

Page 17:

Domain organization of TAP

[Domain diagram of TAP (619 aa; scale bar 100 aa): four LRR repeats, an NTF2-like domain and a UBA domain. Labelled regions: RNA-binding, p15-binding, np-bind. Panels: directed mutagenesis and random mutagenesis, with NTF2-like and p15 labelled.]

Collaboration with Elisa Izaurralde

Page 18:

Directed mutagenesis confirms predicted TAP/p15 interaction

Red - loss of binding

Blue - no effect on binding

Gray - alanine scan

Page 19:

Top 10 domains* in human

Species | man | fly | worm | yeast | cress
Immunoglobulin | 765(381) | 140 | 64 | 0 | 1
C2H2 zinc finger | 706(607) | 357 | 151 | 48 | 115
Protein kinase | 575(501) | 319 | 437 | 121 | 1049
Rhod.-like GPCR | 569(616) | 97 | 358 | 0 | 16
P-loop NTPase | 433 | 198 | 183 | 97 | 331
Rev. transcriptase | 350 | 10 | 50 | 6 | 80
RRM (RNA-binding) | 300(224) | 157 | 96 | 54 | 255
WD40 (G-protein) | 277(136) | 162 | 102 | 91 | 210
Ankyrin repeat | 276(145) | 105 | 107 | 19 | 120
Homeobox | 267(160) | 148 | 109 | 9 | 118
Total no genes | 26500(26500) | 13300 | 18200 | 6100 | 25700

*Only no of genes given, no of domains higher; note that only around 90% is sequenced

Nature 409 (01) 860; Science 291 (01) 1304

Page 20:

Comparative genome analysis

Context-based function prediction
1. Co-occurrence of genes

Page 21:

Function prediction via genomic context information

Gene context:

- Gene fusion as distinct neighborhood subset
- Conserved gene neighborhood in genomes
- Conserved co-occurrence of genes in species (‘phylogenetic profile’, ‘COG pattern’)
- Surrounding and shared regulatory elements

Knowledge-based context:

- Pathway data (can overrule homology!)
- Gene expression data (co-expression etc.)
- Protein interaction / localisation
- Scientific literature

Page 22:

Context methods in Mycoplasma: fusion, neighborhood, co-occurrence

[Diagram. MG total: 480 genes. Presence in conserved operons: 213. Other categories shown: conserved neighborhood, fusion, co-occurrence in genomes. Remaining values as extracted: 27, 54, 178.]

Page 23:

Orthology vs paralogy … within homology

[Diagram: Genome A carries gene A1 and gene A2; Genome B carries gene B1 and gene B2. A gene tree (history) leads from an ancestral gene via gene 1 and gene 2 to gene A1, gene B1, gene A2 and gene B2. Relationships are labelled orthology and paralogy.]

Page 24:

Exploiting the absence of genes

Huynen et al., 1998, FEBS Lett 426, 1-5

Page 25:

Predicting functional interactions between proteins by the co-occurrence of their genes in genomes

Distribution of four M.genitalium genes among 25 genomes

MG299 (pta)  | 0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1
MG357 (ackA) | 0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1
MG019 (dnaJ) | 0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1
MG305 (dnaK) | 0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1

Using the mutual information between genes as a scoring heuristic for their co-occurrence.

M(pta, ackA)=0.69 (phosphotransacetylase, acetate kinase)
M(dnaJ, dnaK)=0.55 (heat shock proteins)
M(dnaJ, ackA)=0.19
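
The mutual information scores can be recomputed from the presence/absence profiles listed on this slide. The sketch below assumes the standard definition M(X, Y) = Σ p(x, y) ln [p(x, y) / (p(x) p(y))] with the natural logarithm; the slide does not state the log base, but this choice reproduces the three quoted values. The function name `mutual_information` is illustrative.

```python
from collections import Counter
from math import log


def mutual_information(a: str, b: str) -> float:
    """Mutual information (in nats) between two presence/absence profiles
    given as equal-length strings of '0' and '1'."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * log((c / n) / ((count_a[x] / n) * (count_b[y] / n)))
        for (x, y), c in count_ab.items()
    )


# Profiles copied from the slide (25 genomes each).
pta = "0001100001101011000101111"
ackA = "0001100001101011000101111"
dnaJ = "0011111101111011100111111"
dnaK = "0011111101111011100111111"

print(f"M(pta, ackA)  = {mutual_information(pta, ackA):.2f}")   # 0.69
print(f"M(dnaJ, dnaK) = {mutual_information(dnaJ, dnaK):.2f}")  # 0.55
print(f"M(dnaJ, ackA) = {mutual_information(dnaJ, ackA):.2f}")  # 0.19
```

Two identical profiles score the entropy of the profile itself, which is why the pta/ackA pair (present in 12 of 25 genomes) scores higher than the dnaJ/dnaK pair (present in 19 of 25): a pattern close to half presence, half absence carries the most information.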

Page 26:

Phylogenetic distribution of iron-sulfur cluster assembly proteins

[Presence/absence matrix. Rows (as extracted): hscB Jac1, hscA, ssq1, Nfu1, iscA Isa1-2, fdx Yah1, Arh1, ORF1, ORF2, ORF3, iscS Nfs1, iscU Isu1-2, cyaY Yfh1 (frataxin). Columns (as extracted): A.aeolicus, Synechocystis, B.subtilis, M.genitalium, M.tuberculosis, D.radiodurans, R.prowazekii, C.crescentus, M.loti, N.meningitidis, X.fastidiosa, P.aeruginosa, Buchnera, V.cholerae, H.influenzae, P.multocida, E.coli, A.pernix, M.jannaschii, A.thaliana, S.cerevisiae, C.jejuni, C.albicans, S.pombe, H.sapiens, C.elegans, H. pylori, D.melan.]

The phylogenetic distribution of cyaY (frataxin) is identical to that of hscB/Jac1, indicating a functional role of cyaY in iron-sulfur cluster assembly on proteins, specifically in conjunction with Jac1.

Huynen et al., Hum. Mol. Genet. 2001
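
Finding genes whose phylogenetic distribution is identical, as reported here for cyaY and hscB/Jac1, amounts to grouping genes by their presence/absence profile. A minimal sketch follows; the gene names and profiles in `toy_profiles` are invented for illustration and are not the data behind the slide.

```python
from collections import defaultdict


def group_by_profile(profiles: dict) -> list:
    """Return groups of two or more genes that share exactly the same
    presence/absence profile across genomes."""
    groups = defaultdict(list)
    for gene, profile in profiles.items():
        groups[tuple(profile)].append(gene)
    return [sorted(genes) for genes in groups.values() if len(genes) > 1]


# Hypothetical profiles over six genomes ('1' = present, '0' = absent).
toy_profiles = {
    "geneA": "110101",
    "geneB": "110101",
    "geneC": "111111",
    "geneD": "010001",
}
print(group_by_profile(toy_profiles))
```

Exact matching is the strictest criterion; a score such as mutual information relaxes it to profiles that are merely similar.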

Page 27:

Comparative genome analysis

Context-based function prediction
2. Gene neighbourhood

Page 28:

Genome alignment

Page 29:

Conservation of gene neighbourhood

Pairwise comparison of 20 prokaryotic genomes

[Scatter plot: x-axis labelled (time), y-axis labelled (log); two point series drawn as x and o; the genome pairs EC-HI and MG-MP are marked.]
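
One simple way to quantify conservation of gene neighbourhood between two genomes is the fraction of adjacent gene pairs in one genome that are also adjacent in the other. The sketch below assumes each genome is given as an ordered list of ortholog identifiers; it ignores strand and chromosome circularity, and the function names and example gene orders are hypothetical rather than taken from the slides.

```python
def adjacent_pairs(order):
    """Unordered pairs of neighbouring genes in a linear gene order."""
    return {frozenset(pair) for pair in zip(order, order[1:]) if pair[0] != pair[1]}


def conserved_neighbour_fraction(genome_a, genome_b) -> float:
    """Fraction of adjacent pairs in genome_a, restricted to genes present
    in both genomes, that are also adjacent in genome_b."""
    shared = set(genome_a) & set(genome_b)
    candidates = {p for p in adjacent_pairs(genome_a) if p <= shared}
    if not candidates:
        return 0.0
    return len(candidates & adjacent_pairs(genome_b)) / len(candidates)


# Invented gene orders, for illustration only.
genome_a = ["trpE", "trpD", "trpC", "trpB", "trpA"]
genome_b = ["trpE", "trpD", "x", "trpB", "trpA", "trpC"]
print(conserved_neighbour_fraction(genome_a, genome_b))
```

Computed for every pair of genomes, such a fraction could be plotted against a measure of divergence, as in the pairwise comparison on this slide.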

Page 30:

Nucleotide salvage/degradation pathway in gram-positive bacteria

Page 31:

STRING server for context retrieval

Tryptophan biosynthesis

www.bork.embl-heidelberg.de/STRING

Page 32:

Gene neighborhood reflects connections between Tryptophan and Shikimate biosynthesis

Page 33:

Modularity in “genomic association space”

[Network diagram. Nodes: hemK, tyrA, aroB, aroE, aroC, asd, truA, hyp, hyp, 2c-rr, trpF, trpC, trpA, trpB, trpD, trpG, trpE. Labelled modules: Tryptophan synthesis pathway, Shikimate pathway.]

Networks based on conserved gene neighborhood reveal ‘natural’ subsystems
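
A very simple way to pull subsystems out of a gene-association network is to take its connected components. This is a simplification: modules inside a single connected network would need a proper clustering method, and the slides do not say which method was used. In the sketch below the gene names come from the slide, but the edges are invented for illustration.

```python
from collections import defaultdict


def connected_components(edges):
    """Return the connected components of an undirected graph
    given as a list of (node, node) edges."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal from the start node
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical neighbourhood links (NOT read off the slide).
toy_edges = [
    ("trpA", "trpB"), ("trpB", "trpC"), ("trpC", "trpD"), ("trpD", "trpE"),
    ("aroB", "aroE"), ("aroE", "aroC"),
]
for component in connected_components(toy_edges):
    print(sorted(component))
```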

Page 34:

(pseudo)genes

Yan Yuan

Mikita Suyama

David Torrents

Page 35:

SMART

Rich Copley

Ivica Letunic

Page 36:

*Martijn (NL)

*Frank (D) Yan (C) Peer (D)

Tobias (D)

*Luis (E)

*Jörg (D), Berend (NL), Warren (US)

Miguel (E)

Shamil (RU)

*Birgit (D) Mikita (J)

Richard (UK)

*Vassily (RU)

*Gert (D)

David (E), Ivica (Hr), Caroline (E), Steffen (D), Francesca (I), Jan (D), Parantu (In), Christian (D)

*left EMBL