Genome Paleontology: Discoveries from complete genomes
Steven L. Salzberg
The Institute for Genomic Research (TIGR) and Johns Hopkins University
© 2003 Steven L. Salzberg
What is genome paleontology?
Compare genomes to uncover:
• history of species
• genome transformations
• recent mutations such as SNPs
• evolution
Outline (time permitting)
• An algorithm for rapid large-scale alignment
  A.L. Delcher, S. Kasif, R.D. Fleischmann, J. Peterson, O. White, and S.L. Salzberg. Alignment of whole genomes. Nucleic Acids Res 27:11 (1999), 2369-76. MUMmer 2: Delcher et al., NAR, 2002.
• Alignments and analyses of bacterial genomes
  J.A. Eisen, J.F. Heidelberg, O. White, and S.L. Salzberg. Evidence for symmetric chromosomal inversions around the replication origin in bacteria. Genome Biology 1:6 (2000), 1-9.
• Large-scale genome duplications: plant and human
  The Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408 (2000), 796-815.
  J.C. Venter et al. The sequence of the human genome. Science 291 (2001), 1304-1351.
• Lateral gene transfer between humans and bacteria
  S.L. Salzberg, O. White, J. Peterson, and J.A. Eisen. Microbial genes in the human genome: lateral transfer or gene loss? Science 292 (2001), 1903-1906.
Genomes completed and published by TIGR and our collaborators, 1995-present
Organism                     Reference
Arabidopsis thaliana         Lin et al., Nature 402:761-8 (1999)
Archaeoglobus fulgidus       Klenk et al., Nature 390:364-370 (1997)
Bacillus anthracis Ames      Read et al., Nature 423:81-86 (2003)
Bacillus anthracis Florida   Read et al., Science 296:2028-33 (2002)
Borrelia burgdorferi         Fraser et al., Nature 390:580-586 (1997)
Brucella suis                Paulsen et al., PNAS 99 (2002)
Caulobacter crescentus       Nierman et al., PNAS 98 (2001)
Chlamydia pneumoniae         Read et al., Nucl. Acids Res. 28 (2000)
Chlamydia muridarum          Read et al., Nucl. Acids Res. 28 (2000)
Chlamydophila caviae         Read et al., Nucl. Acids Res. 31 (2003)
Chlorobium tepidum           Eisen et al., PNAS 99:9509-9514 (2002)
Coxiella burnetii RSA 493    Seshadri et al., PNAS 100:5455-60 (2003)
Deinococcus radiodurans      White et al., Science 286 (1999)
Enterococcus faecalis        Paulsen et al., Science 299:2071-2074 (2003)
Haemophilus influenzae       Fleischmann et al., Science 269 (1995)
Helicobacter pylori          Tomb et al., Nature 388:539-547 (1997)
Methanococcus jannaschii     Bult et al., Science 273:1058-1073 (1996)
Mycobacterium tuberculosis   Fleischmann et al., J. Bact. 184 (2002)
Mycoplasma genitalium        Fraser et al., Science 270:397-403 (1995)
Neisseria meningitidis       Tettelin et al., Science 287 (2000)
Oryza sativa (rice) chr 10   Wing et al., Science 300:1566-1569 (2003)
Plasmodium falciparum        Gardner et al., Nature 419:531-534 (2002)
Plasmodium yoelii            Carlton et al., Nature 419:512-519 (2002)
Porphyromonas gingivalis     Nelson et al., J. Bact., in revision
Pseudomonas putida           Nelson et al., Envir. Microbiol. (2002)
Shewanella oneidensis        Heidelberg et al., Nat. Biotech. 20 (2002)
Streptococcus agalactiae     Tettelin et al., PNAS 99 (2002)
Streptococcus pneumoniae     Tettelin et al., Science 293 (2001)
Sulfolobus islandicus virus  Arnold et al., Virology 15:252-66 (2000)
Thermotoga maritima          Nelson et al., Nature 399:323-329 (1999)
Treponema pallidum           Fraser et al., Science 281:375-388 (1998)
Vibrio cholerae              Heidelberg et al., Nature 406 (2000)
Genomes in progress or recently completed
Fibrobacter succinogenes
Prevotella intermedia
Pseudomonas fluorescens
Silicibacter pomeroyi DSS-3
Streptococcus agalactiae A909
Streptococcus gordonii
Streptococcus mitis
Streptococcus pneumoniae 670
Acidobacterium capsulatum
Bacillus anthracis A01055
Bacillus anthracis A0402
Bacillus anthracis Ames 0581
Burkholderia thailandensis
Campylobacter coli RM2228
Campylobacter upsaliensis RM3195
Clostridium perfringens SM101
Epulopiscium fishelonii
Hyphomonas neptunium
Listeria monocytogenes F6854
Listeria monocytogenes H7858
Mycoplasma arthritidis
Mycoplasma capricolum
Myxococcus xanthus
Prevotella ruminicola
Pyrococcus furiosus
Verrucomicrobium spinosum
Actinomyces naeslundii
Bacillus anthracis A0071
Bacillus anthracis Kruger B
Erwinia chrysanthemi
Gemmata obscuriglobus
Mycobacterium tuberculosis
Ruminococcus albus
Streptococcus sobrinus
Aspergillus fumigatus
Brugia malayi
Coccidioides immitis
Cryptococcus neoformans
Entamoeba histolytica
Oryza sativa Chromosome 3 & 10
Plasmodium vivax
Schistosoma mansoni
Solanum spp.
Tetrahymena thermophila
Toxoplasma gondii
Theileria parva
Trichomonas vaginalis
Trypanosoma brucei
Trypanosoma cruzi
Acidithiobacillus ferrooxidans
Bacillus anthracis Kruger B
Burkholderia mallei
Clostridium perfringens ATCC13124
Dehalococcoides ethenogenes
Desulfovibrio vulgaris
Ehrlichia chaffeensis
Ehrlichia sennetsu
Geobacter sulfurreducens
Listeria monocytogenes
Methylococcus capsulatus
Mycobacterium avium 104
Mycobacterium smegmatis
Pseudomonas syringae
Staphylococcus aureus
Staphylococcus epidermidis
Treponema denticola
Wolbachia sp.
Anaplasma phagocytophila
Bacillus cereus 10987
Bacteroides forsythes
Brucella ovis
Baumannia cicadellinicola
Campylobacter jejuni
Carboxydothermus hydrogenoformans
Colwellia sp. 34H
Dichelobacter nodosus
Genome-Scale Sequence Alignment
• Efficiently compute alignments between entire genomes and chromosomes, for example:
  • Two strains of B. anthracis, each 5.1 Mb (< 30 CPU seconds)
  • Two chromosomes of A. thaliana, each 20-30 Mb (< 5 minutes)
  • Two chromosomes of human, 100+ Mb each (< 30 minutes)
MUMmer alignments
• MUMs: Maximal Unique Matches
  • Algorithm finds ALL matches
  • String them together and align gaps
• Suffix trees
  • Very fast alignment of long DNA sequences
  • Linear time and space requirements
• Software at: http://www.tigr.org/software/mummer/
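The "string them together" step can be pictured as choosing the largest set of MUMs that appear in the same order in both genomes. The following is a minimal sketch of that idea using a standard longest-increasing-subsequence routine. It assumes each MUM is given as a hypothetical (pos_a, pos_b, length) tuple, and it ignores reverse-strand matches and overlaps; it is an illustration, not MUMmer's own code.

```python
from bisect import bisect_left

def chain_mums(mums):
    """Return the longest subset of MUMs whose positions increase in both genomes.

    mums: list of (pos_a, pos_b, length) tuples (a hypothetical format).
    """
    ordered = sorted(mums)          # sort by position in genome A
    tails = []                      # tails[k]: smallest pos_b that ends a chain of k+1 MUMs
    tails_idx = []                  # index in `ordered` of the MUM ending that chain
    prev = [-1] * len(ordered)      # back-pointers used to recover the chain
    for i, (_, pos_b, _) in enumerate(ordered):
        k = bisect_left(tails, pos_b)
        if k == len(tails):
            tails.append(pos_b)
            tails_idx.append(i)
        else:
            tails[k] = pos_b
            tails_idx[k] = i
        prev[i] = tails_idx[k - 1] if k > 0 else -1
    chain = []
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:                  # walk the back-pointers from the last MUM
        chain.append(ordered[i])
        i = prev[i]
    return chain[::-1]
```

MUMs left out of the chain (here, any match that would break the shared order) are candidates for rearrangements; the regions between consecutive chained MUMs are the gaps that still need to be aligned.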
Suffix Trees
• A trie: a tree with edges labelled by strings
• Each leaf represents a sequence: the labels on the path to it from the root
• Holds all suffixes of a set of sequences
• The suffix tree for sequences A and B:
  • Contains |A| + |B| leaf nodes.
  • Can be constructed in O(|A| + |B|) time!
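To make the leaf count concrete, here is a minimal sketch that builds an uncompressed suffix trie from nested dictionaries. It is deliberately naive (quadratic time and space), so it is not the linear-time construction the slide refers to. The end markers are an assumption of this sketch: each input string gets its own marker so that every suffix ends at a distinct leaf, and the lengths counted below include that marker.

```python
def build_suffix_trie(strings):
    """Build an uncompressed suffix trie (nested dicts) for up to four strings.

    Each string gets its own end marker so every suffix ends at a distinct leaf.
    The input strings must not contain the marker characters.
    """
    markers = "$#%&"                      # hypothetical end markers, one per string
    root = {}
    for text, marker in zip(strings, markers):
        text += marker
        for start in range(len(text)):    # insert every suffix, one character at a time
            node = root
            for ch in text[start:]:
                node = node.setdefault(ch, {})
    return root

def count_leaves(node):
    """Count the nodes that have no children."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())
```

For the string atgtgtgtc, which becomes atgtgtgtc$ with its marker, the trie has 10 leaves, one per suffix. For two strings the leaf count is the sum of the two marked lengths.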
Maximal Unique Matches (MUMs)
Sequences in genomes A and B that:
• Occur exactly once in A and in B
• Are not contained in any larger matching sequence
[Figure: a MUM shown between sequences A and B; it occurs only here, with a mismatch at both ends]
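The two conditions in the definition can be checked directly on small strings. The following brute-force sketch is for illustration only: it is far slower than a suffix-tree approach, handles the forward strand only, and the `min_len` parameter and the (pos_a, pos_b, length) output format are assumptions of this example.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count

def find_mums(a, b, min_len=3):
    """Return (pos_a, pos_b, length) for every maximal unique match of at least min_len."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip starts that could be extended to the left (not maximal).
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right until a mismatch or the end of either string.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k < min_len:
                continue
            match = a[i:i + k]
            # Keep the match only if it occurs exactly once in each sequence.
            if count_occurrences(a, match) == 1 and count_occurrences(b, match) == 1:
                mums.append((i, j, k))
    return mums
```

For example, `find_mums("ttgattacacc", "gggattacagg", min_len=4)` reports the single match `gattaca` at position 2 in both strings, bounded by a mismatch on each side. A match such as `acgt` between `acgtacgt` and `acgt` is rejected because it occurs twice in the first string.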
MUMmer 2 streaming algorithm
[Figure: suffix tree for the string atgtgtgtc$ (positions 1-10), shown alongside a streaming string ...atgtcc...]
MUMmer results: M. tuberculosis CDC1551 vs. H37Rv
     A    C    G    T
A    -    66   164  9
C    48   -    81   169
G    164  89   -    44
T    11   159  61   -
a MUM
Duplication and Gene Loss?
[Figure: a gene order A to F is duplicated, giving copies A to F and A' to F'; the E. coli and V. cholerae lineages then each retain a different subset of the duplicated genes]
Symmetric Inversions Model
[Figure: circular genomes with genes numbered 1-32. A common ancestor of A and B gives rise to lineages A1, A2, A3 and B1, B2, B3 through successive inversions around the origin and inversions around the terminus.]
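The effect of a symmetric inversion can be sketched on a small circular gene list. This example assumes the origin sits between the last and first elements of the list and models only inversions centred on the origin (an inversion around the terminus would be the same operation centred on the opposite side of the circle). The function names and the list representation are hypothetical.

```python
def invert_around_origin(genome, k):
    """Invert the segment made of the last k and first k genes of a circular genome.

    genome: list of gene labels in circular order; the origin is assumed to lie
    between the last and the first element.
    """
    n = len(genome)
    if not 0 <= 2 * k <= n:
        raise ValueError("k must satisfy 0 <= 2k <= len(genome)")
    segment = genome[n - k:] + genome[:k]   # segment spanning the origin
    segment.reverse()                       # the inversion itself
    return segment[k:] + genome[k:n - k] + segment[:k]

def distance_from_origin(genome, gene):
    """Number of genes between `gene` and the origin, on whichever side is nearer."""
    i = genome.index(gene)
    return min(i, len(genome) - 1 - i)
```

With eight genes and `k = 2`, the order `[1, 2, 3, 4, 5, 6, 7, 8]` becomes `[8, 7, 3, 4, 5, 6, 2, 1]`. Every gene keeps its distance from the origin but the inverted genes move to the other side of it, which is the property the symmetric inversions model relies on.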
Arabidopsis genome paleontology
Compare all chromosomes to each other....
Diorama by B.E. Dahlgren, © The Field Museum, Chicago
The hunt for genome-scale duplications
• S. cerevisiae? 16% duplicated (Seoighe & Wolfe, 1999)
• Maize? 10 chromosomes vs. 5 in some related grasses; segmental allotetraploid? (Gaut & Doebley, 1997)
• Drosophila melanogaster: no duplications
• Vertebrates: much speculation but little evidence (Skrabanek & Wolfe, 1998)
• Arabidopsis thaliana: yes!
First discovery: large-scale duplication between chromosomes 2 and 4 (Lin et al., 1999)
[Figure: chr. 2 compared with chr. 4]
• Over 60% of the genome is covered by duplicated regions
• Centromeres cover much of the rest
• Strikingly, only about 1/3 of the genes in each block remain as duplicates
No triplications!
• 19-24 large-scale duplications
• >60% of the genome duplicated
• If duplications occurred over time, triplications highly likely
• Duplications likely happened as one event (on evolutionary time scale)
• Conclusion: whole genome duplication
Warning: Salzberg’s speculation follows
Start with 4 ancestral chromosomes
[Figure: chromosome diagram with labels I, III, IV, V]
Warning: data quality control
• Until December 2000, Arabidopsis data in GenBank was all BAC-based
• Errors included:
  • BACs on the wrong chromosome
  • BACs entered twice with different IDs, different annotation (sequenced twice), slightly different sequence
• For duplications analysis, these errors would prove disastrous
• Many of these errors are still in GenBank
  • Old BACs are not automatically deleted
Human Genome
• Analysis used Celera’s assembly and annotation
  • 26,588 genes, ordered along each of 24 chromosomes
• MUMmer 2.0 used to align whole chromosomes
  • Nothing found in DNA-level alignments
  • Proteome alignments used instead
• Recently re-computed using latest human genome annotation (Ensembl)
Human whole-genome alignment
• Create 24 “mini-proteomes” by concatenating all proteins on each chromosome
• Use MUMmer to align each mini-proteome to the complete proteome (9,675,713 amino acids)
• Search for conserved clusters of proteins
• Confirmed analysis by looking at Blast hits of all vs. all
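Concatenating proteins is only useful if a match in the concatenated string can be traced back to the gene it came from. The following sketch shows one way to keep that bookkeeping. The (name, sequence) input format and the function names are assumptions of this example, and it concatenates without separators, which the slides do not specify.

```python
from bisect import bisect_right

def build_mini_proteome(proteins):
    """Concatenate (name, sequence) pairs given in chromosome order.

    Returns the concatenated string, the start offset of each protein,
    and the protein names in the same order.
    """
    starts, names, parts = [], [], []
    offset = 0
    for name, seq in proteins:
        starts.append(offset)
        names.append(name)
        parts.append(seq)
        offset += len(seq)
    return "".join(parts), starts, names

def protein_at(position, starts, names):
    """Map a coordinate in the concatenated string back to the protein containing it."""
    return names[bisect_right(starts, position) - 1]
```

Because gene order along the chromosome is preserved, a run of matches that map to consecutive proteins in two different mini-proteomes is what a conserved cluster looks like in these coordinates. A match that spans a protein boundary would need extra handling that this sketch omits.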
What we’re looking for
Not looking for:
• tandem duplications
• domain hits (very common, often give highly significant Blast hits)
Summary results
• 1077 duplicated blocks
  • 10,310 “gene pairs”; “pair” = 2 genes that match between two blocks
  • 296 blocks with 3-4 gene pairs
  • 781 blocks with 5 or more gene pairs
• 3522 distinct genes, many duplicated more than once
• Large block: 33 genes on chr 2 and chr 14
  • spans 63 Mbp on chr 14, over 70% of chr 14’s length
  • spread over 97 genes on chr 2 and 332 genes on 14
  • includes two of four known Hox clusters, an ancient duplication
• Large block: 64 genes on chr 18 and chr 20
  • previously undiscovered
• Shuffled data: 370 gene pairs (3.6% false positive rate)
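The counts on this slide can be cross-checked with a few lines of arithmetic. This sketch assumes the false positive rate is the number of gene pairs found in shuffled data divided by the number found in the real data, which is consistent with the figures shown but is not stated explicitly.

```python
# Numbers taken from the summary slide.
blocks_3_to_4 = 296        # blocks with 3-4 gene pairs
blocks_5_plus = 781        # blocks with 5 or more gene pairs
gene_pairs = 10_310        # gene pairs found in the real data
shuffled_pairs = 370       # gene pairs found after shuffling

total_blocks = blocks_3_to_4 + blocks_5_plus
false_positive_rate = shuffled_pairs / gene_pairs

print(total_blocks)                    # 1077
print(f"{false_positive_rate:.1%}")    # 3.6%
```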
Human-mouse genome mapping
• Close evolutionary distance permits DNA-level alignments
• Protein similarity even greater than DNA
• MUMmer quickly aligns each mouse mini-proteome to its human counterparts
• Blast finds most (not all) of the same matches (and is far slower)
• 77% (566/731) of Mouse16 genes are found in syntenic regions of human
• 2.5% (18/731) of Mouse16 genes are unique to mouse, not found in human
Have bacteria transferred their genes directly into the human genome?
“Startling” discovery, Feb. 2001: 223 bacterial genes were laterally transferred into a vertebrate ancestor of humans (from the Nature human genome paper)
Horizontal gene transfer in Arabidopsis thaliana chr 2 (Lin et al., Nature, 1999)
135 genes most closely related to cyanobacterial genes and thus likely were transferred from chloroplast to the nucleus
Very recent transfer of > 250 kb section of mitochondrial genome
Many additional older mitochondrial → nuclear gene transfers
Examples of Horizontal Transfers
• Antibiotic resistance genes on plasmids
• Pathogenicity islands
• Toxin resistance genes on plasmids
• Agrobacterium Ti plasmid
• Viruses and viroids
• Organelle to nucleus transfers
Mechanisms of Horizontal Transfer
• Plasmid exchange (prokaryotes)
• Mating/conjugation (prokaryotes)
• Viruses and viroids
• Organelle to nucleus exchange (eukaryotes)
• Scavenging from environment
• Passive absorption
• Fusion of cells
Nature human genome paper (2001): Evidence for transfer?
• Evidence: Genes match bacteria, but do not match non-vertebrate eukaryotes
  • Or, genes really are in non-vertebrates, but have stronger match to bacteria
  • Measured by BLAST E-value
• 113 of the 223 genes found in a broad spectrum of prokaryotic species
Alternative explanations
• Gene loss from a small sample of non-vertebrate eukaryotes
  • Only 4 non-vertebrates used for analysis: fruit fly, nematode, yeast, and mustard weed (Arabidopsis)
  • Large and diverse set of prokaryotes (over 30 organisms, including extremophiles) used as well
• Rapid divergence in non-vertebrate eukaryotes (evolutionary rate variation)
• Still-incomplete genomes (e.g., D. melanogaster)
• Erroneous annotation/gene finding
• Contamination
Re-analysis: number of “transfers” decreases with # of genomes analyzed
[Figure: bar chart. x-axis: number of protein sets removed (1, 2, 3, 4, 5, Other); y-axis: number of genes in lateral transfer candidate set (0 to 2000); protein sets: Fruit fly, C. elegans, Arabidopsis, Yeast, Parasites]
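The shrinking candidate set follows from how candidates are defined: a gene stays on the list only while no non-vertebrate protein set contains a match to it. A minimal sketch of that filter, assuming the BLAST comparisons have already been reduced to sets of gene IDs (the function name, input format, and all IDs below are hypothetical):

```python
def lateral_transfer_candidates(bacterial_hits, eukaryote_hits_by_set):
    """Return genes with a bacterial match but no match in any non-vertebrate set.

    bacterial_hits: set of human gene IDs with a significant bacterial match.
    eukaryote_hits_by_set: dict mapping a protein-set name to the set of human
        gene IDs that have a significant match in that protein set.
    """
    candidates = set(bacterial_hits)
    for hits in eukaryote_hits_by_set.values():
        candidates -= hits          # any non-vertebrate match removes the gene
    return candidates

# Made-up gene IDs, for illustration only.
bacterial = {"g1", "g2", "g3", "g4", "g5"}
protein_sets = {"fruit fly": {"g1"}, "C. elegans": {"g1", "g2"}, "yeast": {"g6"}}
print(len(lateral_transfer_candidates(bacterial, protein_sets)))   # 3

protein_sets["parasites"] = {"g3"}
print(len(lateral_transfer_candidates(bacterial, protein_sets)))   # 2
```

Adding a protein set can only remove candidates, never add them, so the count can only fall as more genomes are included; removing sets has the opposite effect.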
Trees Don’t Support Transfer
[Figure: phylogenetic tree with clade labels Bacteria, Vertebrates, and Virus, and groups I, II, III; scale bar 0.2. Taxa include Paramecium bursaria Chlorella virus 1; vertebrate sequences from Homo sapiens (including HAS1), Mus musculus (including HAS1), Rattus norvegicus, Bos taurus, Gallus gallus, Xenopus laevis, and Danio rerio; and bacterial sequences from Rhizobium (including NodC), Bradyrhizobium, Mesorhizobium, Sinorhizobium, Azorhizobium, Stigmatella aurantiaca, Streptomyces coelicolor, and Streptococcus (including S. pyogenes HASA).]
Birney et al., Nature special issue on human genome
“The unfinished human genomic DNA may contain contamination, particularly from bacteria but also from other sources. ... If the predicted gene matches a bacterial gene more closely than any vertebrate gene then it will almost always be a contaminant.”
Were genes really transferred? NO
• Our re-analysis finds just 41 genes (Ensembl) or 46 (Celera) with best hits to bacteria, not 223
  • All of these could be explained by alternative mechanisms
• More genomes will likely eliminate these remaining candidates
  • At least 3 have already been found in Drosophila, 10 more in other species
• Great care is needed in order to make assertions of transfer from bacteria to humans
  • Implications would be significant; e.g., GMOs
• Even more care is needed when working with unfinished data
• Nature erratum to human genome paper, August 2001: “We agree.”
Acknowledgements
MUMmer: Arthur Delcher, Jeremy Peterson, Rob Fleischmann, Owen White, Simon Kasif, Jonathan Allen, Sam Angiuoli, Adam Phillippy
X alignments: Jonathan Eisen, Owen White, John Heidelberg
Arabidopsis duplications:
  TIGR: Maria Ermolaeva, Owen White, Jonathan Eisen, Xiaoying Lin, Samir Kaul
  AGI collaborators: Klaus Meyer and all his MIPS colleagues, Mike Bevan
Human duplications: Mark Yandell, Mark Adams, Mani Subramanian, Craig Venter (all formerly Celera), Ron Wides (Bar-Ilan University), Art Delcher
Lateral transfer: Jonathan Eisen, Owen White, Jeremy Peterson
Funding support:
  National Institutes of Health (NHGRI, NLM)
  National Science Foundation (CISE, BIO)