Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Proc. Natl. Acad. Sci. USAVol. 93, pp. 5854-5859, June 1996Evolution
Similarities and dissimilarities of phage genomesB. EDWIN BLAISDELLt, ALLAN M. CAMPBELLt, AND SAMUEL KARLINt§Departments of tMathematics and tBiological Sciences, Stanford University, Stanford, CA 94305-2125
Contributed by Allan M. Campbell, February 22, 1996
ABSTRACT Genomic similarities and contrasts are inves-tigated in a collection of 23 bacteriophages, including phageswith temperate, lytic, and parasitic life histories, with variedsequence organizations and with different hosts and with dif-ferent morphologies. Comparisons use relative abundances ofdi-, tri-, and tetranucleotides from entire genomes. We highlightseveral specific findings. (i) As previously shown for cellulargenomes, each viral genome has a distinctive signature of shortoligonucleotide abundances that pervade the entire genome anddistinguish it from other genomes. (ii) The enteric temperatedouble-stranded (ds) phages, like enterobacteria, exhibit signif-icantly high relative abundances ofGpC = GC and significantlylow values of TA, but no such extremes exist in ds lytic phages.(ii) The tetranucleotide CTAG is of statistically low relativeabundance in most phages. (iv) The DAM methylase site GATCis of statistically low relative abundance in most phages, but notin P1. This difference may relate to controls on replication (e.g,actions of the host SeqA gene product) and to MutH cleavagepotential oftheEscherichia coiDAMmismatch repair system. (v)The enteric temperate dsDNA phages form a coherent group:they are relatively close to each other and to their bacterial hostsin average differences of dinucleotide relative abundance values.By contrast, the lytic dsDNA phages do not form a coherentgroup. This difference may come about because the temperatephages acquire more sequence characteristics ofthe host becausethey use the host replication and repair machinery, whereas theanalyzed ytic phages are replicated by their own machinery. (vi)The nonenteric temperate phages with mycoplasmal and myco-bacterial hosts are relatively close to their respective hosts andrelatively distant from any ofthe enteric hosts and from the otherphages. (vii) The single-stranded RNA phages have dinucleotiderelative abundance values closest to those for random sequences,presumably attributable to the mutation rates of RNA phagesbeing much greater than those of DNA phages.
In previous publications (1-4), data and analyses were presentedsupporting the hypothesis that the 16-dinucleotide relative abun-dance values of genomes provide a robust unique genomicsignature for prokaryotic and eukaryotic sequences. Genomicsequences are compared with respect to: (i) di-, tri-, and tet-ranucleotide relative abundance extremes; (ii) distances betweendinucleotide relative abundance values (profiles) for each pair ofgenomes; and (iii) consistent similarity of individual sequences tovarious consensus sequences. In many cases, distances calculatedfrom dinucleotide relative abundances correlate well with tradi-tional phylogenies (1, 2). Dinucleotide relative abundance valuesare essentially equivalent to the "general designs" derived fromnearest-neighbor frequency analysis evaluated extensively duringthe 1960s and 1970s in samples of genomic DNA from manyorganisms (5, 6). Our recent studies have demonstrated that thedinucleotide relative abundance profiles of different DNA se-quence samples from the same organism are generally muchmore similar to each other than they are to profiles from otherorganisms and that closely related organisms generally have moresimilar profiles than do distantly related organisms (1-4, 7).
The publication costs of this article were defrayed in part by page chargepayment. This article must therefore be hereby marked "advertisement" inaccordance with 18 U.S.C. §1734 solely to indicate this fact.
These highly stable DNA-doublet patterns suggest that there maybe genome-wide factors, such as functions of the replication andrepair machinery that impose limits on the compositional andstructural patterns of a genomic sequence, and that the set ofdinucleotide relative abundance values constitutes a genomicsignature that reflects the influence of these factors. In cellularorganisms the existence of genomic signatures and their corre-lation with phylogeny may reflect either the evolutionary devel-opment of the causative factors or a very slow rate of change inresponse to any alterations in them.We have studied the similarities and differences between the
DNA sequences of 23 phages for which complete genomes areavailable or for which many segments of sufficient aggregatelength have been sequenced. These phages possess diversecharacteristics including: double-stranded (ds) or single-stranded (ss); DNA or RNA; lytic, temperate, or parasitic lifecycles; icosahedral or filamentous capsids; a considerablerange of G+C content; a large variety of morphologicalclassifications; and some variety in bacterial hosts althoughmost are Escherichia coli (Table 1).The signatures of bacteriophage genomes have a special
interest because there is no accepted molecular taxonomy ofphages. Models for phage evolution frequently postulate achimeric origin for phage genomes (8). At least some lateraltransfer between distantly related phages or from host toinfecting phage is known (9). Despite these facts, the resultsin this paper show that the dinucleotide relative abundanceprofile provides a genomic signature that pervades eachphage genome. The following questions are also of interest.To what extent are morphology, life cycle, virion organization,and genomic replication properties reflected in the genomicsequence similarities and contrasts? To what extent are thevarious bacteriophage genomes similar to or different from theirbacterial host genomes or to other possible hosts? Which phagegenomes have the most (or least) bias in dinucleotide relativeabundances? Finally, to what extent do temperate, lytic, andfilamentous phages define coherent groups?
DATAThe data include every phage having a complete genomicsequence and several nonredundant sets of sequences ofsufficient aggregate length that were available in GenBank asof July 1994. Very similar replicates were removed. For example,among complete genomes, PX1 (= bX174) and S13 are -98%identical, and only PX1 is retained. Other similar selections areshown in the legend ofTable 1. The data also include dinucleotiderelative abundance profiles for collections of appropriate bacte-rial sequences as listed in the legend of Table 1.
METHODSThe statistical methods of this paper were elaborated inprevious publications (1, 3) and are applied in this paper tocompare diverse bacteriophage and bacterial host sequences.
Oligonucleotide Relative Abundances. A standard (singlestrand) measure of dinucleotide bias is the odds ratio pxy =
fxY/fxfr, where fx denotes the frequency of the nucleotide X and
Abbreviations: ds, double-stranded; ss, single-stranded.§To whom reprint requests should be addressed.
5854
Dow
nloa
ded
by g
uest
on
Nov
embe
r 1,
202
0
Proc. Natl. Acad. Sci. USA 93 (1996) 5855
Table 1. Phage sequences studied
G + C,Genome name % x 10 Length, bp Host
tmp* enteric dsDNALAMt 498 48,502t EcoMU 494 11,997 ManyP1 428 20,825 ManyP4 495 11,624t EcoP22 474 30,002 Sty
tmp nonenteric dsDNAL5 632 52,297t MsmL2 320 11,965t Ala
lyt* dsDNAT2 378 8,883 EcoT4 357 168,903t EcoT3 498 26,457 EcoT7 484 39,936t EcoPZA 397 19,366t BsuPRD 481 14,925t Many
prod* ssDNAfl 409 6407t Eco122 427 6744t EcoIKE 406 6883t EcoPF3 454 5833t Pae
Lytic ssDNAPXlt 448 5386t EcoG4 457 5577t EcoCP1 365 4877t Cps
RNA46 (ds) 559 13,286 PsyGA (ss) 479 3466t EcoMS2 (ss) 521 3569t Eco
Eco, E. coli; Bsu, Bacillus subtilis; Pae, Pseudomonas aeruginosa; Sty,Salmonella typhimurium; Cps, Chlamydia psittaci; Psy, Pseudomonassyringae; Ala, Acholeplasma laidlawii; and, Msm, Mycobacterium smeg-matis. Of the very similar PX1 and S13, only PX1 is kept; of fl, M13and fd, only fl is kept; and of PZA and (p29, only PZA is kept.*tmp, temperate; lyt, lytic; and prod, parasitic or filamentous (fil).tLAM, A; and PX1, GenBank name for 4)X174.tSignifies a complete genome sequence.
fxY is the frequency of the dinucleotide XY. However, becauseDNA properties are influenced by oligonucleotide compositionsof both strands, the formula for pxy is modified by combining thegiven sequence and its inverted complement sequence. In this way,the frequency fA is symmetrized to f f= fT = (fA + fr)/2 andfc = fG = (fc + fG)/2. Similarly, fGT = fic = (fGT + fAc)/2,etc. A symmetrized dinucleotide relative abundance measureis PGT = PAC = fGT/fGf; = 2(fGT + fAC)/(fG + fc)(fT + fA),etc. The deviation of pGT from 1 can be construed as a measureof the dinucleotide bias of GT/AC. Corresponding trinucle-otide and tetranucleotide measures are y and T, respectively(1). Table 2 shows the dinucleotide relative abundance p*vectors (also referred to as the {pxy} profiles) of all theorganisms of Table 1, of three consensus sequences, and ofeight host bacteria.
Consensus p* Values. For natural groups of phage sequences(e.g., enteric temperate dsDNA {LAM, Mu, P1, P4, P22},filamentous {F1, IKE, 122, PF3}, and lytic dsDNA {T2, T4, T3,T7, PZA}), we calculate a consensus dinucleotide relative abun-dance p* vector by averaging the fy and fi values of the relevantsequences. For example, let tmp denote the enteric temperateconsensus phage. We calculate
(tm)= (LAM) +(MU) +(Pfl.(tmp) = -5-Wj(LAM) + f(MU) + rfij(P1)+ fj(P4) + fj(P22)] [1]
Table 2. Symmetrized dinucleotide relative abundances (p*)CC AA AC AG CA GA
CG GC TA AT GG TT GT CT TG TC
LAMMUP1P4P22
L5L2
T2T4T3T7PZAPRD
flI22IKEPF3
tmp enteric dsDNA103 120 71 109 94 115 88108 124 69 102 96 125 9198 124 79 100 100 117 86104 122 74 106 101 124 8598 126 76 106 92 112 83
tmp nonenteric dsDNA108 91 5159 107 92
104 114 8994 117 8290 100 8488 95 8892 88 86125 154 91
88 119 85101 133 8999 127 8380 117 68
PX1 99 129 76G4 99 108 82CP1 81 110 86
66 113 101 67GA 104 102 89MS2 110 96 95
tmpfillyt
EcoStyBsuPaeCpsMsmAlaPsy
103 123 7492 124 8294 103 87
116 127 76123 129 82104 126 65109 117 5881 120 79124 100 4576 108 89105 121 64
96 86 9099 108 103lyt dsDNA98 109 11399 102 11286 91 10185 91 9793 98 105112 116 146
fil ssDNA93 112 12197 97 11596 100 11692 106 123lyt ssDNA93 90 11895 93 10993 97 108RNA
94 83 10499 103 11194 98 106Consensus
105 97 11995 104 11994 98 108
Host110 90 121114 91 122101 96 123111 87 11483 99 119116 81 9588 109 112108 88 115
87 11681 11591 11085 11096 113
98909693100
105 104 104 13298 98 118 95
928510410710871
85888988
889580
10391100
868898
898576878311110088
90 10295 106110 110112 10895 11266 96
98 9897 10593 10894 118
98 111100 107118 94
99 10698 95100 92
88 11295 10799 106
83 11182 10392 10899 109115 9883 108101 10993 115
9410410610710264
95909393
97104116
117105107
9593101
939310710810512489
101All values multiplied by 100. See legends to Tables 1 and 5. tmp,
consensus of five temperate dsDNA phages; fil, consensus of fourfilamentous ssDNA phages; and lyt, consensus of five lytic dsDNAphages excluding PRD, which is an extreme outlier (see Table 3).and similarly for f;(tmp) and define p,(tmp) = f,(tmp)/f,(tmp)fj(tmp). A random sequence would have each of itsdinucleotide relative abundances about 1.00, so we designatethe 16-component vector with all unit values as equ.
Distances Between p* Vectors. We calculate the pairwiseabsolute distances between the phage g and h p* vectors,(Table 3).
1 16
8*(g,h) = 1-lpjP(g) - pj(h)l1=l[2]
RESULTS AND DISCUSSIONA Genomic Signature of Dinucleotide Relative Abundance
Values. Our methods of genomic comparisons based on shortoligonucleotide (mainly dinucleotide) relative abundance val-ues can be applied to sets of genomic sequences for whichcommon functional gene sequences are not available-e.g., toa broad range of eukaryotic viruses-to diverse sets of protist
Evolution: Blaisdell et al.
Dow
nloa
ded
by g
uest
on
Nov
embe
r 1,
202
0
Proc. Natl. Acad. Sci. USA 93 (1996)
Table 3. Absolute distances (8*) of dinucleotide relativeabundance (p*) vectors of 23 phagesMU P1 P4 P22 L5 L2 T2 T4 T3 T7 PZA PRD
46 43* 63
*
434036*k
43794162*
185205190210175
121140107132116193*
748857698918488*
67993768461659962*
13116012015310811210613091*
155 109 228184 135 200145 95 214178 126 189132 94 239108 132 377122 88 283143 101 199110 73 24024 58 329* 65 342
* 300
fl I22 IKE PF3 PX1 G4 CP1 46 GA MS2
90103566883
2041134863144160118185
6977456960185112545612613810120359*
5564204755188104474411713892
2125630*
7172626285
213867975141163115239698563*
5175366536167119846010012497
24370474071
7810768
10160125907542587760
27391706410457
*f
1521841291591231691481349410811112828611412513114613097*
129 97149 122134 79154 103119 8773 153140 100133 54109 4977 9992 11189 84
321 233152 62132 71132 74154 111108 8872 52130 91* 102
*
1321511201461221241159790848770
27210410311215311762988048
LAMMUP1P4P22L5L2T2T4T3T7PZA
LAMMUP1P4P22L5L2T2T4T3T7PZAPRDflI22IKEPF3PX1G4CP146GA
All values multiplied by 1000.
or bacterial sequences (1, 3) and to bacteriophages as we havedone here. The traditional means of assessing similaritiesbetween organisms is to examine aligned sequences of homol-ogous genes by various tree-building algorithms. In the case ofthe 23 phages at hand, it is neither possible to find highlysimilar genes common to all of them, or even to most pairs, norto make reasonable alignments of them, even pairwise. Con-servation of the order of genes coding for proteins of similarfunction has been used as evidence for the similarity of blocksof contiguous genes in phages. However, different sets of suchblocks are conserved in different sets of phages so that nooverall similarity of whole genomes is realized (10).
Previous work has shown that dinucleotide relative abundancesprovide a unique signature for each genome, which is observedin all segments -20 kb of that genome (1, 2, 4). In fact, the 8*distances (see Methods) between dinucleotide relative abundancevalues of long samples from an organism are almost always lessthan the corresponding distances between similarly long samplesfrom different organisms. This relation has been found to begenerally true for a wide variety of organisms (vertebrate, inver-tebrate, fungus, protist, and prokaryote) and is here confirmedfor a variety of phages (see Table 6). We highlight below some ofthe main results emanating from our dinucleotide relative abun-dance comparisons and some interpretations of them.Extreme Relative Abundances of Dinucleotides (Table 4).
The dinucleotide TA is significantly low in all the enterictemperate dsDNA phages and in the nonenteric temperate
phage L5 but normal in the lytic dsDNA phages. The threeparasitic ssDNA phages of E. coli generally carryTA in the lownormal range, while TA is significantly low in PF3, a phage ofP. aeruginosa, and in the lytic ssDNA phage PX1. The tem-perate and parasitic phages but not lytic dsDNA coliphagesgenerally exhibit significantly high relative abundances of thedinucleotide GC (not CG, which is, among the phages, low onlyin L2 and among the hosts, significantly low only in itsmycoplasmal host). Notably, sequences of y-proteobacteriainfecting mammalian hosts (e.g., E. coli, S. typhimurium,Haemophilus influenzae, and Klebsiella pneumoniae) but not ofsoil-dwelling y-proteobacteria (e.g., P. aeruginosa and Azoto-bacter vinelandii) are significantly high in GC relative abun-dances (4).
Enteric Temperate Versus Lytic dsDNA Phages. The set ofenteric temperate dsDNA phages and the set of lytic dsDNAphages both show generally significantly low relative abun-dances for CTAG and GATC. On the other hand, CGCG islow for all the lytic but none of the enteric temperate;CCCC/GGGG is low or low-normal for all the lytic buthigh-normal for all the enteric temperate. For 6* distances, thefive enteric temperate phages are all close to the host E. coli,whereas lytic phages are relatively distant. Nonenteric tem-perate phages are much closer to their respective hosts than tothe enteric hosts and are closer than the dsDNA lytic phagesare to their hosts (Table 5).Low Relative Abundance of the DAM Site GATC in Coliph-
ages. GATC is low in all enteric dsDNA phages except P1 andtends to be lower in lytic coliphages than in temperate coliph-ages (Table 4). McClelland (11) has noticed that GATC is lowin some phages but not in prokaryotic chromosomes (e.g.,normal in E. coli). This has been attributed to selection againstit in phages that would otherwise occasionally be cut atunmethylated GATC sites by the MutH activity of the host(12). This cutting is prevented in P1 by its own EcoP1 proteinthat methylates GATC (13). The life history of P1 as asingle-copy plasmid requires elaborate machinery to stop P1replication at one additional copy (sequestration) and toensure that one copy goes to each daughter cell at host celldivision (partition). Important to this control is the state ofmethylation of four GATC sites embedded in the heptadAGATCC(C/A) repeats near oriR, where the host sequestra-tion protein SeqA can bind at hemimethylated GATC (14). Itis reasonable to guess that P1 developed its own GATCmethylase to assure methylation of this vital cluster of GATC.
Relative abundances of GATC in dsDNA phages are, exceptin PZA, consistently lower in lytic dsDNA phages than indsDNA temperate phages. GATC is much lower in phages T3and T7 than in phages T2 and T4. Also, except in PZA,GAAC/GTTC and GACC/GGTC, both differing fromGATC by a single substitution, are higher in lytic phages thanin temperate phages and are higher in T3 and T7 than in T2and T4. While integrated into the host genome, the GATCs oftemperate dsDNA phages are methylated by the host DAMmethylase, whereas lytically replicating dsDNA phage DNAmay not be. In this context, protection against MutH cutting isattained by severely reducing the number of GATC occur-rences in lytic phages and to a lesser extent in the temperatephages. The T2 and T4 phages contain their own DAMmethylases that are different from those of the host E. coli andall cytosine bases are converted to hydroxymethylcytosine(frequently glycosylated). These modifications may also inhibithost MutH cutting of GATC. On the other hand, the T-oddphages have no means of specific protection of GATC andpresumably compensate by maintaining very low levels ofoccurrence. The host for PZA is B. subtilis, which does notpossess a DAM methylase and may not possess an analog ofthe MutH endonuclease that recognizes GATC. Thus, PZAmay not need to reduce its GATC level to that found in theother lytic dsDNA coliphages.
5856 Evolution: Blaisdell et aL.
111111111
Dow
nloa
ded
by g
uest
on
Nov
embe
r 1,
202
0
Proc. Natl. Acad. Sci. USA 93 (1996) 5857
Table 4. Extreme symmetrized relative abundances
Phage TA AA* GC CCC* CTA* CCA* CGC* CCCC* CGCG CCGG CTAG GATC TATA TCGA CCGC* CTAC* GAAC* GACC* GCCC*
71 (120)69 125 123(79) 12473 123 (122)76 126
51
6864
125128
(120)(78) (121)
(79)
69 135125
(80) (120)78 124
146 154
(120)
68 123
76
133127 71
125(80) 125
129 69 76 132 (120)
37 127 135
70
7167
125 63
Enteric tmp dsDNA52 72
6339
4733 54
Nonenteric tmp dsDNA55 76
47lyt dsDNA
74 (81) 2671 (81) 4870 52 1275 57 511 54 78
60 8Parasitic ssDNA (filamentous)
26 2958 52
124 68(121) 70 65
00
66(80)
63lyt ssDNA65 3766 2937RNA
4350
58 78
(121)(122)
12353
6477
59
757373
00 6717 71
77
00
7476
142
123129
166
126
123123 125
127
(122)(120)132128
Low Relative Abundance of CTAG. CTAG is broadly low inGram-negative proteobacterial sequences (for example, see
ref. 15). Interpretations center on structural defects (kinking)or special functional roles associated with this tetranucleotide(15). CTAG is generally low in all classes of our phages.However, CTAG is not low in three dsDNA phages: P4, MU,and PZA (r* = 0.93, 0.97, and 1.10, respectively). We speculatethat CTAG has an important but as yet undetermined regu-latory function in these phages. The CTAG sites tend to occurin small clusters in each of these phages, perhaps as bindingsites for regulatory proteins.
Restriction Avoidance. The low values for palindromictetranucleotides in Table 4 may reflect to some extent restric-tion avoidance by the various phages (16, 17). Supporting thishypothesis is the observation that all the extreme high relativeabundance tetranucleotides can be obtained from one of theextreme low palindromic tetranucleotides via a single basesubstitution. For example, high relative abundances forGAAC/GTTC and GACC/GGTC are paired with low rela-tive abundances of GATC in the same phages. Similarly, thehigh values of CTAC/GTAG are paired with low values ofCTAG. The extreme occurrences of CGCG and of GACC/GGTC are almost exclusively associated with lytic phages.CGCG is also low and CCCG/CGGG is high in L2, which,although eventually lysogenic, is initially noncytocidally pro-ductive (18). TATA is low and TAAA/TTTA is high in L5,that is phenotypically temperate but has its own DNA poly-merase, a characteristic generally of lytic phages (19). Thetetranucleotide restriction sites in B. subtilis, CCGG, CGCG,and GGCC (T* = 0.08), are much lower in its phage PZA thanin any other lytic dsDNA phage and are accompanied by manyhigh tetranucleotides differing from them by one substitution
(data not shown). These facts are consistent with the principleof restriction avoidance.
Dinucleotide Relative Abundance Distances BetweenPhages. Pairwise distances (see Methods) between phages are
shown in Table 3 and phage distances to some consensus
phages and bacteria are shown in Table 5. The enteric tem-
perate dsDNA phages are among the closest to E. coli and are
generally closest to each other (in the range of distances ofhuman to mammals, see (2)). The smallest 8* distances to E.coli are attained by MU (8* = 0.043), P4 (8* = 0.041), andLAM (8* = 0.045), and each of these is closer to E. coli thanto any other bacterium in our sample. Also, moderately close toE. coli are P1 and P22. Similarly, temperate L2 is closer to itshost than to any other bacterium and to any other phage, andtemperate L5 is closer to its host than to any other bacteriumand to any other phage except )6. Also, CP1 is closer to its hostthan to any other bacterium and to any other phage (data notshown). CP1 is assumed to be lytic because of its similarity toPX1 and G4 (20), but our result suggests that it is phenotyp-ically temperate. None of the lytic phages is as close to its hostas is any of the enteric temperate phages to its host. Temperatecoexistence apparently produces 8* distances within the setand to the host that are smaller than those of the lytic phages.
In the distance data the five lytic dsDNA phages form threedistinguishable subsets, T-even (T2 and T4), T-odd (T3 andT7), and PZA. They are also distinguished in morphology andG+C composition (Table 1). In extreme relative abundances,PZA is often different from the other four, possibly because itinfects a different host with its different restriction systems.Coherence ofTemperate Phages. The five enteric temperate
dsDNA phages all have in common some distinguishing ex-
treme relative abundance properties not possessed by any ofour lytic phage genomes-i.e., low TA and high GC (also true
LAMMUP1P4P22
L2L5
T2T4T3T7PZAPRD
fl122IKEPF3
PX1G4CP1
46PGAMS2
123
134 167 147
132
67
125136123
124127126(121)
73
(122)
134
All values multiplied by 100.*Symmetrized value for oligonucleotide and its inverted complement. Significance criteria were low <78 and high >122. Values are shown for diif there was more than one occurrence, tri more than 2, and tetra more than 3. Some values (in parentheses) nearly satisfying the criteria are shownwhen they occur in accepted blocks.
Evolution: Blaisdell et al.
Dow
nloa
ded
by g
uest
on
Nov
embe
r 1,
202
0
Proc. Natl. Acad. Sci. USA 93 (1996)
Table 5. Absolute distances (8*) of dinucleotide relativeabundance vectors of 23 phages relative to four bacterial genomesand three consensus sequences
LAMMUP1P4P22
L5L2
T2T4T3T7PZAPRD
flI22IKEPF3
Eco
4543574163
199152
8894147172138195
98755992
PX1 52G4 102CP1 179
66 137GA 123MS2 150
Bsu
6772565962
178161
10265140157129216
91836889
6988129
124101130
Sty7065705681
211170
8797164183156168
897571114
76113170
154122149
Msm
164184193189182
Ala12212994115111
96 186212 57
195180153152154334
235212198228
192155220
89185168
67859210685
255
89828079
9976129
13387103
tmp*2742252344
194121
6660136160108210
68564165
4584145
13896
131
lyt*931197410776
13474
7049617540268
876867102
732796
774951
equ*11814496127104
13183
8180758566
269
999193137
10761121
955546
All values multiplied by 1000.*tmp, consensus of {LAM, MU, P1, P4, P22}; lyt, consensus of {T2,T4, T3, T7, PZA}; and equ, a vector of 16 ones.
for E. coli and S. typhimurium). In 8* distances, they constitutethe five single sequences closest to their consensus sequence.The major host E. coli is generally close to each of them andto their consensus. No other group of three or more phagesequences is as cohesive in its behavior. However, the enterictemperate group is diverse in a number of other properties. Forexample, the set contains exemplars of three capsid forms:short-tailed (P22), long-tailed (LAM), and contractile-tailed(MU, P1, and P4). Also, one occurs as an independent plasmid(P1), one can act as a transposon (MU), one infects S.typhimurium (P22), and two infect many hosts (P1 and MU).However, these differences are submerged beneath a similarityof relative abundances, probably because all are replicated andrepaired by a common host machinery. The average 8* dis-tance of E. coli to S. typhimurium, based on many 50-kbsamples from each, is -0.040 (7). None of the temperatephages is this close to either host. The Salmonella phage P22is farthest, but not very far, from both hosts, presumably forreasons other than its host preference.
Variation among the filamentous phages. The four filamen-tous parasitic ssDNA phages do not constitute a coherentgroup. They show substantial differences in relative abundanceextremes and in 6* distances. These differences can be asso-ciated with morphology, life history, and host differences. Inno instance do all four phages possess an extreme relativeabundance of the same oligonucleotide. In extreme relativeabundance PF3 often stands alone: low in TA, low in CGCGand high in CTAC/GTAG, and not high in GC and not low inTATA. PF3 infects P. aeruginosa and, in this host, the TA levelis very low (p* = 0.60) and the GC level is normal. In /*
distance, PF3 is farthest from the filamentous consensus. 122and IKE, both of which attach to I pili, are very close (6* =
0.030), and the next closest pairs are twice as distant (>0.063).Phage fl, which attaches to F pili, is alone very low in GATC,very low in TCGA (in fact, absent), and high in GACC/GGTCand in CCGA (p* = 1.29). These four phages also requirevariable numbers of host proteins for the three stages of RFform replication and for transcription stage 2.
Similarities and Differences of PX1 (= tpX174) and G4. Thetwo lytic ssDNA phages PX1 and G4 both have small icosahedralcapsids with no tails, the same host range, approximately the samegenome size and G+C content, and the same gene organization.The aligned amino acid sequences are, on average, 66% identical(21) but are substantially different in codon usage. They have sixextreme palindromic tetranucleotides in common, more than anyother pair of phages, including GATC, which is absent from PX1and occurs only twice in G4. However, the two also showsubstantial differences: PX1 has five extreme di- and trinucleotiderelative abundances, while G4 has none. Their mutual distance ismoderate (6* = 0.057). Several temperate and filamentousphages and their consensuses, but no lytic phages, are closer toPX1 than is G4. No temperate or filamentous phages is closer toG4 than PX1 is.There are precedents for substantial amino acid identities
associated with only a moderate level of genome similarities.For example, human herpes simplex virus I and varicella zostervirus show significant amino acid identities and highly con-cordant gene organization in the unique long region. However,herpes simplex virus I entails a genome G+C content of 68%,compared with varicella zoster virus with 46%. Among allvertebrate alpha herpes viruses, herpes simplex virus I andvaricella zoster virus are relatively distant with genomic S*distance = 0.081 (3). Human and mouse are often highlyconcordant in protein sequence comparisons but have genome6* distance = 0.057, the same level as PX1 with G4. E. coli andShigella flexneri are highly similar on the protein level but withmoderate 6* distance (0.062; ref. 7).
Lytic ssRNA Phages GA and MS2 Are Closest to Random.Distances between computer-generated random sequences oflength 10 kb fall in the range 0.000-0.018 (2). No phage samplein our collection is very close to random. The closest individualphage sequences to random in distances are the ssRNA MS2(0.046) and GA (0.055). This observation is consistent with thefact that these two phages show much fewer extreme dinucle-otide relative abundances than do any other phages. Theseresults are in accord with the indication that mutation rates ofssRNA are 10,000 times those of dsDNA (22) and thatsequences tend to approach randomness with increasing num-bers of substitutions. Among the most nonrandom are all theenteric temperate phages (Table 5). This is consistent with thefinding that the enteric temperate dsDNA phages are amongthose having the most extreme oligonucleotides (Table 4).
Phages That Are Far from Most Other Phages. The set ofphages {CP1, #6, L2, L5, MS2, T3, T7, PRD} are generallyfarthest from any other phage not in this set. It is interestingthat CP1, 46, L2, and L5 are phages of distinct hosts differentfrom E. coli. MS2 is the phage closest to random and,therefore, different from the other phages, which generallyhave their own characteristic profiles. T3 and T7 infect E. coli,and PRD infects many bacteria including E. coli, but they arelytic and use their own, presumably characteristic, DNA poly-merases. PRD is generally farthest from any of the otherphages, which is consistent with its having four extremeoligonucleotides possessed by no other phage. PRD is uniqueamong these 23 phages in having lipid inside the capsid.
Distances Between 20-kb Samples of Four Phage and ThreeBacterial Sequences. Karlin et al. (1, 2) found that distancesbetween p* vectors based on 50-kb sample sequences fromprokaryotic or eukaryotic species were generally considerablylarger than distances between 50-kb samples taken from single
-
5858 Evolution: Blaisdell et aL.
Dow
nloa
ded
by g
uest
on
Nov
embe
r 1,
202
0
Proc. Natl. Acad. Sci. USA 93 (1996) 5859
Table 6. Average, minimum, and maximum of absolute distances(6*) of dinucleotide relative abundance (p*) vectors of=20-kb samplesOrgan-ism Eco Bsu Sty Msm LAM P22 L5 T4 T7
Eco 39 87 51 161 61 75 196 98 172(n = 20) 9 48 10 132 25 48 128 37 104
90 135 114 194 90 94 226 133 211Bsu 39 96 183 76 66 178 73 160(n = 20) 13 53 160 39 50 151 35 127
87 153 202 106 86 198 115 189Sty 36 174 82 89 214 105 187(n = 20) 13 134 30 48 167 62 140
75 214 120 115 252 133 227Msm 39 162 182 96 182 155(n = 2) 39 146 156 68 162 151
39 177 207 120 204 161LAM 31 46 179 68 150(n = 3) 24 32 168 46 131
36 66 191 104 169P22 42 175 52 132(n = 2) 42 152 32 106
42 197 69 158L5 33 165 109(n = 3) 26 147 91
39 188 121T4 29 113(n = 8) 13 95
42 141T7 37(n = 2) 37
37
All values multiplied by 1000.
species. Because of our phage sequence size limitations, wehave taken samples of length -20 kb from the larger phagesequence sets: genomes T7 (2 samples), LAM (3 samples), L5(3 samples), T4 (8 samples), and the sequence collection P22(2 samples). We also randomly selected twenty 20-kb samplesfrom each of the collections of B. subtilis, E. coli, and S.typhimurium, and two from M. smegmatis.The ranges and averages within and between organism sample
distances are shown in Table 6. Examination reveals that theaverage distance between organism samples always exceeds thecorresponding within organism sample average distances.
Conclusion. The dinucleotide relative abundance measureof distance between DNA sequences appears to providemeaningful measures of similarities. We suggest that the shortoligonucleotide (di-, tri-, and tetranucleotide) relative abun-dance values relate to DNA structures (4). Several factors thatcan impact on DNA structures have been identified-e.g.,dinucleotide stacking energies, curvature, superhelicity, meth-ylation and other short oligonucleotide modifications, context-dependent mutation biases, and DNA replication and repairmechanisms (4, 7).The distances between relative abundance vectors of sam-
ples from the same genome are generally smaller than dis-tances between samples from different genomes (Table 6). Wereiterate the hypothesis (4) that the replication and repairmachinery that processes the whole genome contributes mostto maintaining the constancy and uniqueness of the relativeabundance vectors of the genome of an organism and thatevolutionary differences in this machinery can create evolu-tionarily meaningful genomic differences between differentorganisms. Among phages, the relative abundance valuesshould therefore be strongly influenced by the extent to whichhost machinery is used and by the nature of the host. Acomprehensive correlation of distance values with degree of
host dependence requires more information than we haveabout the natural life styles and host preferences of the phagesavailable for analysis. However, our results support a picturewhere those phages dependent on host replication machinery(temperate phages and, to a lesser extent, parasitic ssDNAphages) converge toward the DNA signatures of the host,whereas autologously replicating phages (T4 and T7) eachdiverge in their own direction.Can similarity of temperate phages to their hosts arise
through extensive incorporation of host genes? Whereas suchincorporation certainly can and does occur, we consider thisexplanation unlikely. The genome signature depends on thetotality of the DNA and cannot be significantly affected by thepresence of a few recently acquired DNA segments. Forexample, in the case of A and E. coli (8* = 0.046), a search forlong approximately matching common blocks (unpublishedresults) found three of total length 878 nt, or 1.8% of the lengthof A. It is plausible that the signature has been imposed uponthe viral sequence during its long residence in E. coli, where thereplication and repair systems and context dependent muta-tional biases of the host act on it. We would expect that anyforeign sequence replicated in E. coli (or any other host) for along period of time should likewise approach the host signa-ture. The phylogenetic significance of the 8* distances may below within the temperate phages, moderate to appreciable forthe ssDNA phages, and substantial for the autologously rep-licating phages as it is for cellular organisms.
This study was supported in part by National Institutes of HealthGrants 2R01GM10452-31, 5R01HG00335-07, and 9R01 GM51117-27, and National Science Foundation Grant DMS 9403553.
1. Karlin, S. & Cardon, L. R. (1994) Annu. Rev. Microbiol. 48,619-654.
2. Karlin, S. & Ladunga, I. (1994) Proc. Natl. Acad. Sci. USA 91,12832-12836.
3. Karlin, S., Mocarski, E. S. & Schachtel, G. A. (1994) J. Virol. 68,1886-1902.
4. Karlin, S. & Burge, C. (1995) Trends Genet. 11, 283-290.5. Josse, J., Kaiser, A. D. & Korberg, A. (1961)J. Biol. Chem. 263,
864-875.6. Russel, G. J., Walker, P. M. B., Elton, R. A. & Subak-Sharpe,
J. H. (1976) J. Mol. Biol. 108, 1-28.7. Karlin, S. (1996) Genomic Signature and Bacterial Phylogeny in
Bacterial Genomes: Physical Structures and Analysis, eds. deBruijn, F. J., Lupski, J. R. & Weinstock, G. (Chapman and Hall,New York), in press.
8. Campbell, A. M. (1994) Annu. Rev. Microbiol. 48, 193-222.9. Haggard-Ljungquist, E., Holling, C. & Callender, R. (1992) J.
Bacteriol. 174, 1462-1477.10. Casjens, S., Hatfull, G. & Hendrix, R. (1992) Semin. Virol. 3,
383-397.11. McClelland, M. (1985) J. Mol. Evol. 27, 317-322.12. Deschavanne, P. & Radman, M. (1991)J. Mol. Evol. 33, 125-132.13. Yarmolinsky, M. B. & Sternberg, N. (1988) in The Bacterio-
phages, ed. Calendar, R. (Plenum, New York), Vol. 1.14. Brendler, T. G., Abeles, A. L. & Austin, S. J. (1995)J. Cell Biol.,
Suppl. 19A, A2-115 (abstr.).15. Burge, C., Campbell, A. M. & Karlin, S. (1992) Proc. Natl. Acad.
Sci. USA 89, 1358-1362.16. Kessler, C. & V. Manta. (1990) Gene 92, 1-248.17. Sharp, P. M. (1986) Mol. Biol. Evol. 3, 75-83.18. Maniloff, J., Campo, G. J. & Dascher, C. C. (1994) Gene 141,
1-8.19. Hatfull, G. F. & Sarkis, G. J. (1993) Mol. Microbiol. 7, 395-405.20. Storey, C. G., Lusher, M. & Richmond, S. J. (1989)J. Gen. Virol.
70, 3381-3390.21. Godson, G. N., Fiddes, J. C., Barrell, B. G. & Sanger, F. (1978)
in The Single-Stranded DNA Phages, eds. Denhardt, D.T.,Dressier, D. & Ray, D. S. (Cold Spring Harbor Lab. Press,Plainview, NY).
22. Holland, J. (1993) in Emerging Viruses, ed. S. S. Morse (Oxford,New York).
Evolution: Blaisdell et al.
Dow
nloa
ded
by g
uest
on
Nov
embe
r 1,
202
0