Comparative sequence analysis of the icm/dot genes in Legionella

Irina Morozova,a,1 Xiaoyan Qu,a,1 Shundi Shi,a,1 Gifty Asamani,a Joseph E. Greenberg,a Howard A. Shuman,b and James J. Russo a,*

a Columbia Genome Center, Columbia University College of Physicians and Surgeons, 1150 St. Nicholas Avenue, New York, NY 10032, USA
b Department of Microbiology, Columbia University College of Physicians and Surgeons, 701 W. 168th Street, New York, NY 10032, USA

Received 18 August 2003, revised 21 November 2003

* Corresponding author. Fax: +212-851-5215. E-mail address: [email protected] (J.J. Russo).
1 These three authors contributed equally to the work.

Plasmid 51 (2004) 127–147
www.elsevier.com/locate/yplas
0147-619X/$ - see front matter © 2004 Elsevier Inc. All rights reserved. doi:10.1016/j.plasmid.2003.12.004

Abstract

The icm/dot genes in Legionella pneumophila are essential for the ability of the bacteria to survive within macrophages in lung infections such as Legionnaires' disease, or amoebae in nature. The 22 genes of the complex, thought to encode a transport apparatus for transfer of effector molecules into the host cell cytoplasm, are located in two chromosomal loci. We demonstrate that these genes are present in all the L. pneumophila strains examined herein, but display a wide range of sequence variation among the different strains, none of which are clearly associated with virulence potential. The strains fall within seven phylogenetic groups, but discrepancies among the gene trees indicate a complicated evolutionary history for the icm/dot loci, with perhaps two independent gene acquisition events and subsequent genomic rearrangements. Significant findings include a probable t-SNARE domain in IcmG that may indicate a direct role for this putative inner membrane protein in altering the host's membrane fusion machinery, a potential functional domain in the central hydrophobic portion of IcmK that may allow it to participate in forming the pore of the secretion complex, and strict conservation of the amino acid physicochemical characteristics in the IcmP region corresponding to the trbA domain that could play a role in molecular transfer.

© 2004 Elsevier Inc. All rights reserved.

Keywords: Legionella pneumophila; icm/dot genes; Evolution; Phylogenetic analysis; Virulence

1. Introduction

Legionella pneumophila, the causative agent of Legionnaires' disease, an occasionally fatal pneumonia, as well as much more common mild "flu"-like lung infections, is found in fresh water throughout the world. The Philadelphia 1 isolate of L. pneumophila, named after the site of the originally described outbreak in 1976 (Fraser et al., 1977), is a member of the most prevalent serogroup 1 (Fields et al., 2002). Isolates associated with at least 15 other serogroups have been



    described in the ensuing years (Helbig et al., 2002; Yu et al., 2002). In addition, L. pneumophila is one of about 42 known species within the genus Legionella (Fields et al., 2002; Yu et al., 2002), many of which can be associated with clinical symptoms.

    As part of their life cycle, Legionella bacteria are taken up by and survive within phagocytic cells (e.g., amoebae in the environment, macrophages in the human lung). The bacteria replicate within intracellular vacuoles, and eventually kill the original host cell, whereupon they may infect nearby phagocytes (Swanson and Hammer, 2000). Among other virulence genes, two regions with some features of pathogenicity islands, the so-called icm/dot gene clusters, appear to be essential for their ability to survive within and kill macrophages and amoebae (Andrews et al., 1998; Berger and Isberg, 1993; Sadosky et al., 1993; Segal and Shuman, 1997; Vogel et al., 1998).

    The icm/dot² cluster I includes seven genes (dotA–D, icmV,W,X) and the larger cluster II contains the remaining 17 members (icmT,S,R,Q,P,O,N,M,L,K,E,G,C,D,J,B,F). Their encoded proteins are thought to translocate effector molecules into the host cell that somehow prevent the latter from killing the proliferating bacteria, probably by preventing phagosome–lysosome fusion in macrophages (Christie, 2001; Nagai et al., 2002). The icm/dot loci are highly similar to the transfer region of plasmid R64 and other IncI1 plasmids, and it has previously been suggested that icm/dot virulence genes share a common ancestor with plasmid conjugation genes (Komano et al., 2000; Segal and Shuman, 1999; Segal et al., 1998; Sexton and Vogel, 2002). It is unclear whether the icm/dot genes derive from a single plasmid, after which they separated into the two gene clusters, or whether there were multiple gene transfer events. Of the more than 100 bacteria for which complete genome sequence is available, only Coxiella burnetii has homologs of the full set of icm/dot genes; in Coxiella, all the icm/dot genes are contained in a single locus (Seshadri et al., 2003). Besides the icm/dot genes, Legionella, like many other bacteria, contain most members of a Type IV secretion system, the lvh/lvr genes, which have virulence properties in some organisms, though apparently not in L. pneumophila (Segal et al., 1999).

    ² Many of these genes were discovered at about the same time in two laboratories and referred to as either icm (intracellular multiplication) or dot (defective in organellar trafficking). Where a particular gene has two names, we use the icm designation in this paper.

    Since most of the icm/dot proteins are clearly implicated in the ability of Legionella to grow and survive within macrophages, it would be of interest to know if any of them are missing or considerably different in strains of lower pathogenicity. In the present work, we demonstrated the presence of several of the icm/dot genes in at least seven non-pneumophila Legionella species and one or more lvh/lvr genes in most of the Legionella species tested.

    The evolution of the icm/dot genes may be distinct from the majority of the other Legionella genes, particularly the housekeeping genes, either because they are part of the bacteria's virulence gene set, or because of their presumed plasmid origin. Virulence genes are often subject to diversifying selection and evolve faster than the rest of the genome to avoid the host's response to the infection. But Legionella, with their largely intracellular lifestyle, are only briefly exposed to the mammalian immune system, and are not known to establish infections in serial hosts; therefore they are unlikely to undergo adaptive evolution. Indeed, the mip gene, which encodes a possible virulence factor, was found to have relatively few polymorphisms in L. pneumophila strains even though it encodes an outer membrane protein (Bumbaugh et al., 2002). In addition, although the quite variable dotA gene product was found to be less conservative in its outer domains, the ratio of synonymous and nonsynonymous nucleotide substitutions in this region did not indicate adaptive evolution according to these same investigators (Bumbaugh et al., 2002). In this study we addressed the question of whether the remaining genes of the icm/dot loci show the same elevated level of variability as dotA relative to other portions of the genome, particularly housekeeping genes.

    A possible plasmid origin of the icm/dot genes would account for differences in evolutionary history between these genes as a group and the rest of Legionella's chromosomal genes. According to this hypothesis, the icm/dot loci would constitute relatively pliable regions susceptible to repeated regional gene rearrangements. Indications of multiple rearrangements in the icm/dot region have indeed been described (Ko et al., 2002a,b).

    We sequenced 18 icm/dot genes and 4 other genes in 18 different strains of L. pneumophila. Comparative sequence analysis reveals a wide spectrum of variability among the icm/dot genes, some as conservative as mip and housekeeping genes, others more variable than dotA. A search for protein functional motifs, together with the distribution of variable and conservative regions along the gene sequence, gave additional information on the location of functionally important domains in some icm/dot gene products. We did not observe clear-cut associations of any particular gene variations with either known serogroups or virulence phenotypes. Phylogenetic analysis indicated that different L. pneumophila strains displayed distinct acquisition histories for some subsets of the icm/dot genes, consistent with an evolutionary scenario for the L. pneumophila species encompassing gene rearrangements as well as repeated horizontal gene transfer events.

    2. Materials and methods

    2.1. Bacterial strains

    The Legionella species and L. pneumophila strains used in this study are enumerated in Table 1.

    Table 1
    Legionella strains

    L. pneumophila
    Leg #     Serogroup   Isolate
    Leg 1     1           Bellingham 1
    Leg 2     11
    Leg 3     1           Philadelphia 1
    Leg 4     2           Togus 1
    Leg 5     3           Bloomington 2
    Leg 6     4           Los Angeles 1
    Leg 7     5           Dallas
    Leg 8     6           Chicago 2
    Leg 9     7           Chicago 8
    Leg 10    8           Concord 3
    Leg 11    9           IN-23-G1-C2
    Leg 30    13          82A31053
    Leg 31    1           Knoxville 1
    Leg 32    10          Leiden 1
    Leg 33    11          797-PA-H
    Leg 34    12          570-CO-H
    Leg 35    1           Amsterdam B-1
    Leg 36    1           Amsterdam B-2

    Other Legionella species
    Leg #     Legionella species     Isolate
    Leg 12    L. dumoffii            NY-23
    Leg 13    L. longbeachae 1       LB-4
    Leg 14    L. longbeachae 2       Tucker 1
    Leg 15    L. gormanii            LS-13
    Leg 16    L. micdadei            TATLOCK
    Leg 17    L. wadsworthii         81-716
    Leg 18    L. oakridgensis        Oak Ridge-10
    Leg 19    L. feeleii 1           WO-44C-C3
    Leg 20    L. feeleii 2           691-WI-H
    Leg 21    L. sainthelensis       Mt.St.Helens-4
    Leg 22    L. jordanis            BL-54D
    Leg 23    L. spiritensis         Mt.St.Helens-9
    Leg 25    L. jamestowniensis     JA-26-G1-E2
    Leg 26    L. cherrii             ORW
    Leg 27    L. steigerwaltii       SC-18-C9
    Leg 28    L. parisiensis         PF-209C-C2
    Leg 29    L. rubrilucens         WA-270A-C2

    Sources of strains: Leg 35 and 36, provided by Dr. R. van Ketel, were derived from the Legionnaires' disease outbreak at the 1999 Netherlands flower show. Leg 35 was identified in 28 and Leg 36 in 1 out of 29 patients. The rest of the strains were obtained from Dr. B. Fields at the CDC.

    2.2. Hybridization

    Specific primers were designed to amplify all or a portion of each gene in the Philadelphia 1 strain of L. pneumophila. In general, PCR was carried out using 150–200 ng DNA, 1.5 mM MgCl2, 1× reaction buffer, 0.2 µM each dNTP, 10 pmol each primer, and 2 U Taq polymerase (Invitrogen) in 50 µl reaction volumes with the following PCR profile: 5 min at 95 °C; 35 cycles of 95 °C, 30 s; 52 °C, 30 s; 72 °C, 30 s; 7 min at 72 °C. The PCR product was radiolabeled using random primer labeling kits RTS RadPrime (Life Technologies) or Redi-Prime 2 (Amersham) with [α-32P]dATP according to the manufacturers' instructions. EcoRI-digested Southern blots of 15 species of Legionella other than pneumophila (17 strains in all), and 18 strains of L. pneumophila were probed with each gene-specific amplimer under standard conditions (overnight hybridization at 65 °C in 0.5 M NaHPO4, pH 7.2, 7% sodium dodecyl sulfate (SDS), 1% bovine serum albumin, 1 mM ethylenediamine tetraacetic acid, 125 µg/ml sheared single-stranded salmon sperm DNA), and then washed to a relatively low stringency (75 mM NaCl/7.5 mM Na3citrate/0.1% SDS, 65 °C). In a few cases, hybridization temperature was reduced (45 °C) and washing eliminated (just a single 300 mM NaCl/30 mM Na3citrate/0.1% SDS room temperature rinse).
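The two wash buffers in the hybridization protocol are given as NaCl/Na3citrate concentrations. Readers used to SSC notation may find the following small sketch helpful; it assumes the conventional definition of 1× SSC as 150 mM NaCl with 15 mM sodium citrate, which the paper itself does not state, and the function name is purely illustrative.

```python
# Assumed reference composition of 1x SSC (standard saline citrate), in mM.
SSC_1X_MM = {"NaCl": 150.0, "Na3citrate": 15.0}


def ssc_strength(nacl_mm: float, citrate_mm: float) -> float:
    """Return the SSC multiple (e.g. 0.5 for 0.5x SSC) of a NaCl/citrate buffer.

    Raises ValueError if the two components do not imply the same multiple,
    i.e. the buffer is not a simple dilution of SSC.
    """
    from_nacl = nacl_mm / SSC_1X_MM["NaCl"]
    from_citrate = citrate_mm / SSC_1X_MM["Na3citrate"]
    if abs(from_nacl - from_citrate) > 1e-9:
        raise ValueError("NaCl and citrate are not in the 10:1 SSC proportion")
    return from_nacl


standard_wash = ssc_strength(75.0, 7.5)    # wash carried out at 65 °C
relaxed_rinse = ssc_strength(300.0, 30.0)  # single room temperature rinse
print(standard_wash, relaxed_rinse)
```

Under that assumption the standard wash corresponds to 0.5× SSC and the relaxed rinse to 2× SSC, both with 0.1% SDS.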

    2.3. PCR and sequencing

    The same primer pairs were used to attempt to amplify genes from various Legionella strains and species. When amplification failed, at least one additional attempt was made with alternative primer pairs; moreover, in some cases it was necessary to adjust the annealing temperatures for individual strains or genes. After PCR, oligonucleotides were dephosphorylated and primers degraded with shrimp alkaline phosphatase and exonuclease I, respectively (incubation at 37 °C, 90 min; enzymatic denaturation at 72 °C, 15 min). The same primers were then used for bidirectional sequencing. With genes longer than about 500 bases, additional internal oligonucleotides were designed for priming sequencing reactions. Most sequencing reactions were done with ABI big dye terminator kits or Amersham 377 energy transfer kits, according to the manufacturers' instructions, and following isopropanol precipitation, the sequencing products were separated on ABI 377 gel systems (PerkinElmer). Individual sequence reads were assembled into contigs using the Phrap assembler (Green P., http://bozeman.mbt.washington.edu/) or SeqMan (Lasergene System, DNASTAR, Madison, WI). The quality of each base in each sequence was checked both automatically and manually. In a few cases where there was uncertain base calling even after repeated sequencing attempts, these positions were eliminated from the analyses.

    The sequences obtained in this study have been submitted to GenBank (National Center for Biotechnology Information, Bethesda, MD). Accession numbers and the sequences themselves are available at http://genome3.cpmc.columbia.edu/~legion/comp_proj.htm.
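The exclusion of uncertain base calls described above can be pictured as dropping whole alignment columns. The sketch below is only an illustration under assumptions: it supposes that uncertain calls are written as non-ACGT symbols such as N, and the function name and toy reads are invented here rather than taken from the study.

```python
from typing import Dict, List

UNAMBIGUOUS = set("ACGT-")  # called bases plus the alignment gap character


def drop_uncertain_columns(alignment: Dict[str, str]) -> Dict[str, str]:
    """Remove every alignment column in which any strain has an uncertain base call."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_cols = lengths.pop()
    keep: List[int] = [
        col
        for col in range(n_cols)
        if all(seq[col].upper() in UNAMBIGUOUS for seq in alignment.values())
    ]
    return {name: "".join(seq[c] for c in keep) for name, seq in alignment.items()}


# Toy, made-up reads; "N" marks a base that could not be called.
toy = {"Leg3": "ATGCNTGA", "Leg7": "ATGCATGA", "Leg31": "ATNCATGA"}
cleaned = drop_uncertain_columns(toy)
print(cleaned)
```

Removing the column from every strain, rather than only from the affected one, keeps the remaining positions comparable across strains in later pairwise calculations.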

    2.4. Additional gene sequences

    Homologs of the icm/dot genes from C. burnetii [sequence data were provided by The Institute for Genomic Research (TIGR) under an academic license agreement] and Legionella longbeachae (GenBank Accession No. gi18693262) were used as outgroups in L. pneumophila gene analysis.

    2.5. Sequence alignment and analysis

    The ClustalW (Thompson et al., 1994) program was selected for aligning nucleotide or translated amino acid sequences. BioEdit version 5.0.6 (TA Hall, http://www.mbio.ncsu.edu/BioEdit/bioedit.html) and GeneDoc programs (H.B. Nicholas Jr, http://www.psc.edu/biomed/genedoc/) were used to manipulate the alignments and to build the protein hydrophilicity profiles. The number of nonsynonymous (leading to amino acid substitutions) (Kn) and synonymous (silent) (Ks) nucleotide substitutions were calculated using the MEGA 2.1 package (Kumar et al., 2001, http://www.megasoftware.net/). Both Kn and Ks were calculated per corresponding site to avoid the influence of overall gene composition. Four physico-chemical properties (volume, polarity, charge, and hydrophobicity) were used to characterize the results of amino acid substitutions in comparisons of translated homologous sequences (Bogardt et al., 1980; Kawashima and Kanehisa, 2000). Corresponding dG values were obtained using Miyata's matrix (Miyata et al., 1979) and were calculated per one amino acid substitution so that they would not depend on the rates of nucleotide substitutions per se. Protein secondary structure prediction was done using SSpro (Pollastri et al., 2002), APSSP2 (Raghava, 2000), and PHD programs (Rost, 1996).

    2.6. Homology search

    The generic BLAST program (Altschul et al., 1990) and PARACEL BLASTER machine were used for the searches against the TIGR databases for completed bacterial genomes (http://www.tigr.org) and the NCBI nonredundant databases (http://www.ncbi.nlm.nih.gov).

    2.7. Domain search

    The SMART server (http://smart.embl-heidelberg.de (Letunic et al., 2002)) and PFAM database (Bateman et al., 2002) were used to search for protein functional domains and coiled-coil structures.

    2.8. Phylogenetic analyses

    Tree reconstruction and visualization were accomplished using the MEGA 2.1 package (Kumar et al., 2001). The Li distance approach (Li et al., 1985) was used for building the distance matrix. The neighbor-joining (NJ) tree-building algorithm (Saitou and Nei, 1987), which builds a branching tree diagram from the distance matrix by successively clustering pairs together, was used for phylogenetic inference. Confidence levels of inferred relationships were estimated following 1000 bootstrap iterations. To address uncertainties of tree branching, the split decomposition method of the program SplitsTree (Huson, 1998) was utilized. Unlike most tree building methods, which force data into a tree-like phylogeny, this method portrays the data in a mesh-like graph allowing conflicting phylogenetic information to be visualized, estimated, and compared.

    3. Results

    3.1. Gene composition of Legionella species

    Low stringency hybridization to EcoRI-digested Legionella DNA was carried out using labeled amplified regions of selected genes from the Philadelphia 1 strain of L. pneumophila. The left side of the autoradiogram shown in Fig. 1 depicts some typical results. A table (Supplementary Table SI) compiling the results for every gene and strain is available online at http://genome3.cpmc.columbia.edu/~legion/comp_project/comp_proj.htm. In addition to 16S rRNA genes, positive signals were obtained in most species for housekeeping (e.g., asd) genes; in contrast, only some icm/dot genes were detected in the non-pneumophila species (see right side of autoradiogram in Fig. 1), chiefly in L. longbeachae (Leg14), and to a lesser extent in Legionella dumoffii, Legionella wadsworthii, Legionella gormanii, Legionella micdadei, Legionella feeleii, and Legionella sainthelensis. Intermediate results were obtained for the lvh/lvr genes (lvrA-E, lvhB2-11, D4). For the most part, they generated strong hybridization signals in L. dumoffii, L. longbeachae (Tucker 1), L. wadsworthii, L. oakridgensis, and L. cherrii, but weak or no signals in the remaining species tested. These genes have previously been shown not to be essential for growth of L. pneumophila in macrophages (Segal et al., 1999). As an alternative to hybridization, primer pairs from Philadelphia 1 were used in an attempt to amplify genes directly from the other strains and species. The results were usually consistent with those obtained using hybridization; with few exceptions, if negative results were obtained using one approach, negative results were also obtained with the alternative procedure (see supplementary Table SII at the above URL for all the PCR results). Still, it is important to realize that under the stringency conditions we utilized for hybridization, we would not expect to identify genes with less than 70% identity at the nucleic acid level; similarly, at least 90% conservation of primer sequence would be required for consistently successful PCR amplification. Thus, an unobserved signal may be due either to true absence of a gene, or perhaps more likely, substantial variation in the gene's sequence compared to that of Philadelphia 1.

    Fig. 1. Gene distribution in different legionellae as scored by hybridization. Comparison of hybridization results using Philadelphia 1 PCR amplified probes in L. pneumophila and other Legionella species for 16S rRNA, aspartate β-semialdehyde dehydrogenase (asd), one of the lvh and three of the icm genes. From left to right: strains Leg 7–17. Presence of multiple bands for 16S rRNA likely due to multiple copies of the rDNA in the bacterial species (there are at least three partial or complete loci in the Philadelphia 1 strain of L. pneumophila based on the genomic sequence). Variation in banding patterns for other genes could be due to presence or absence of paralogs, or to EcoRI restriction site polymorphisms within or near the gene, in the different organisms. In the case of the lvhB4 and other lvh/lvr genes, variable patterns could reflect the fact that they can be located on a plasmid, as supported by recent data from our laboratory (not shown).

    Among the L. pneumophila strains, high signal strength was obtained for nearly every gene (housekeeping, icm/dot, and lvr/lvh). This would imply that different strain virulence phenotypes are not accounted for by simple presence or absence of these orthologs. Therefore, to determine if more subtle genetic features were involved, we carried out a comparative sequence analysis on all these genes in the different L. pneumophila strains as well as the L. longbeachae icm/dot genes available from GenBank. (Although we were able to amplify a few of the icm/dot genes in the non-pneumophila species, the comparative sequencing described below was restricted to the pneumophila strains.)

    3.2. Level of interstrain and interspecies variation in L. pneumophila

    There are now about 48 known Legionella species (Perez-Luz et al., 2002) and about 15 L. pneumophila serogroups, comprising approximately 70 known serogroups in the genus overall. Despite detailed analyses, there are complications in some of the assignments (see review by Benson and Fields, 1998). Bearing in mind that taxonomic positioning cannot always accurately reflect evolutionary distance (Rosello-Mora and Amann, 2001), we summarize in Table 2 the icm and non-icm sequence diversity based on our data for different strains of L. pneumophila, and in a lesser number of cases, other Legionellae, as well as published gene sequence data. The differences between Legionella species are within or close to the standard boundaries of speciation (for review, see Rosello-Mora and Amann, 2001): 95% homology for 16S rRNA and 70% for other genes. As can be seen from the table, icm/dot genes have a higher level of interspecies diversity (62–79% homology, with a mean value of 70%) than non-icm/dot genes, though as of today, only L. pneumophila vs L. longbeachae comparisons for several icm genes are available.

    Table 2
    Variations among Legionella genes

                               16S rRNA a      Non-icm genes b                  icm/dot genes c
                                               DNA              Protein         DNA              Protein
    Within L. pneumophila      99.2%           89–100% (96%)    97.9%           96–98% (97%)     94–100% (98%)
    Between Legionella spp.    91–99% (96%)    69–99%           75–99%          62–79% (70%)     58–91% (74%)
    With Coxiella d            85.5%                                            39–66%           23–63%

    The data in the above table represent averages or ranges (and averages) of the percent homology for different genes.
    a Based on our data and that of Adeleke et al. (1996).
    b Based on information available for 8 non-icm/dot genes, our data plus that of Ratcliff et al. (1997), Doyle et al. (1998), Ratcliff et al. (1998), and Avison and Simm (2002).
    c Based on our data plus gene sequences for L. longbeachae submitted to GenBank by Rogers et al. (2002) (AF288617). Ko et al. (2002b) have shown a wider range of nucleic acid homology for dotA within L. pneumophila (78–100%), when they include in the comparisons the L. pneumophila subsp. fraseri and subsp. pneumophila.
    d TIGR data compared with the Philadelphia 1 strain of L. pneumophila.
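To make the percent-homology figures and the quoted speciation boundaries concrete, here is a minimal sketch of how a pairwise value could be computed from two aligned sequences and compared with the 95% (16S rRNA) and 70% (other genes) thresholds mentioned in the text. It treats homology as simple percent identity over ungapped columns, which is an assumption about the calculation, and the sequences and function names are invented for illustration.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent of aligned, ungapped columns at which two sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Boundaries quoted in the text (Rosello-Mora and Amann, 2001).
SPECIES_BOUNDARY = {"16S rRNA": 95.0, "other genes": 70.0}


def within_species_range(identity: float, gene_class: str) -> bool:
    """True if the identity is at or above the quoted boundary for that gene class."""
    return identity >= SPECIES_BOUNDARY[gene_class]


# Made-up 20-column alignment with three mismatches and one gapped column.
toy_a = "ATGGCTAAAGGTCTGACCGA"
toy_b = "ATGGCAAAAGGTTTGAC-GT"
identity = percent_identity(toy_a, toy_b)
print(round(identity, 1), within_species_range(identity, "other genes"))
```

In this toy case 16 of 19 compared columns match, so the pair would sit above the 70% boundary for protein-coding genes but below the 95% boundary used for 16S rRNA.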

    There is a considerable range of variability for the different icm genes among the L. pneumophila strains examined both at the nucleotide and protein levels (Table 3). Some genes have a very low percentage of variable positions, and even silent substitutions are rare, while others, such as icmX, have many polymorphic sites. There were no major insertions or deletions in the sequenced genes, though there were a few 1 or 2 amino acid insertions and deletions (e.g., in icmG in Leg7 and Leg31; icmX in several strains).
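The distinction between polymorphic nucleotide sites and polymorphic amino acid sites (the "nucl" and "aa" columns of Table 3) can be illustrated with a toy alignment. This is a hedged sketch, not the procedure used in the paper: the three 9-base sequences are made up, the codon table covers only the codons appearing in them, and the function names are hypothetical.

```python
from typing import Dict

# Deliberately partial codon table: only the codons used in the toy data below.
TOY_CODONS = {"ATG": "M", "AAA": "K", "AAG": "K", "CAA": "Q", "CTG": "L", "CTA": "L"}


def translate(seq: str) -> str:
    """Translate an in-frame toy sequence using the partial codon table above."""
    return "".join(TOY_CODONS[seq[i:i + 3]] for i in range(0, len(seq), 3))


def count_polymorphic_sites(alignment: Dict[str, str]) -> int:
    """Count ungapped alignment columns in which at least two sequences differ."""
    seqs = [s.upper() for s in alignment.values()]
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    count = 0
    for column in zip(*seqs):
        if "-" in column:
            continue  # gapped columns are not scored
        if len(set(column)) > 1:
            count += 1
    return count


# Made-up sequences for three strains (not real icm/dot data).
toy_dna = {"Leg3": "ATGAAACTG", "Leg7": "ATGAAGCTG", "Leg31": "ATGCAACTA"}
toy_protein = {name: translate(seq) for name, seq in toy_dna.items()}

nucl_sites = count_polymorphic_sites(toy_dna)
aa_sites = count_polymorphic_sites(toy_protein)
print(nucl_sites, aa_sites)
```

Here three nucleotide positions vary but only one of the changes alters the encoded protein, which is the pattern seen in genes where most substitutions are silent.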

    The number of synonymous (Ks) and nonsynonymous (Kn) nucleotide substitutions was determined per corresponding site, and the mean of all pairwise strain comparisons was calculated for each gene. The icmX, W, V, and dotA genes, which are all members of icm/dot region I (the small icm locus), are quite variable, showing consistently higher Ks and Kn values (with the exception of Kn in the case of icmW), compared to most of the genes from icm/dot region II. The icmX, V and dotA genes have Kn values approximately 10 times higher than most of the rest of the icm/dot genes.

    Table 3
    Sequence variations in icm/dot and non-icm genes

    Gene       Number of   Gene length   %          Number of            Mean pairwise      Kn/Ks    dG per one
               strains     in Leg 3      finished   polymorphic sites    value per site     ratio    aa change
               sequenced                            nucl      aa         Kn        Ks
    icmF       15          2922          100        278       50         0.005     0.075    0.067    0.923
    icmB       16          3030          100        276       16         0.002     0.107    0.019    0.615
    icmJ       17          627           99         46        5          0.002     0.105    0.019    1.167
    icmD       14          399           99         39        4          0.002     0.092    0.022    0.566
    icmC       18          582           100        52        10         0.006     0.111    0.054    0.688
    icmG       18          807           99         94        27         0.012     0.131    0.092    1.063
    icmK       18          1083          98         141       30         0.008     0.169    0.047    0.373
    icmL       18          639           100        57        3          0.001     0.087    0.011    0.633
    icmM       17          285           100        20        6          0.008     0.063    0.127    1.433
    icmN       15          570           88         59        7          0.004     0.07     0.057    0.120
    icmP       14          1131          90         95        12         0.002     0.077    0.026    0.034
    icmQ       16          576           98         34        3          0.002     0.079    0.025    0.135
    icmR       17          363           100        40        10         0.01      0.083    0.120    1.148
    icmS       17          345           100        32        3          0.003     0.164    0.018    1.299
    icmT       17          261           100        19        2          0.002     0.097    0.021    0.850
    Mean for the locus                                                   0.005     0.101    0.048    0.736

    icmV       17          456           100        56        19         0.024     0.147    0.163    0.927
    icmW       17          456           100        34        3          0.002     0.119    0.017    0.998
    icmX       18          1404          99         298       78         0.036     0.307    0.117    1.131
    dotA       5           3189          100        557       139        0.042     0.352    0.118
    Mean for the locus                                                   0.026     0.231    0.104    1.019

    tphA       17          1257          99         159       37         0.008     0.097    0.082    0.987
    asd        35          1020          99         29        3          0.001     0.031    0.032    0.374
    rpp        17          690           100        74        11         0.006     0.147    0.040    1.084
    RNAseH     11          573           100        22        7          0.006     0.046    0.130    1.769
    mip        17          699           100        54        4          0.002     0.070    0.025

    Both nonsynonymous (Kn) and synonymous (Ks) nucleotide substitutions are calculated per corresponding site to avoid the influence of gene composition and dG values are calculated per one amino acid change. Numbers in bold vary the most from the remainder. Abbreviations: asd, aspartate β-semialdehyde dehydrogenase; rpp, flagellar L-ring protein precursor; mip, macrophage inflammatory peptide.
    *Data from Bumbaugh et al., 2002.

    The ratio of nonsynonymous to synonymous nucleotide substitutions is usually taken as an indicator of the functional and structural restrictions on gene variability and is independent of the time of gene diversification. The icm/dot genes show a wide distribution in their Kn/Ks ratios, with, for example, icmV having a ratio nearly 15 times higher than that of icmL. The highly conserved genes (icmL, W, S, B, J, T, D, P, and Q) have lower Kn/Ks ratios than even the very conservative housekeeping gene encoding aspartate β-semialdehyde dehydrogenase (asd), which shows as much as 62% homology even with its relatively distant Vibrio cholerae ortholog. In contrast, the most variable genes (icmV, M, R, and X) have Kn/Ks ratios close to or even higher than dotA, which is considered a relatively variable gene (Bumbaugh et al., 2002).
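The comparisons in this paragraph can be checked directly against the Kn/Ks column of Table 3. The short sketch below copies those tabulated ratios (the variable names are ours, and the tabulated ratios are used as printed rather than recomputed from the rounded Kn and Ks values) and derives the fold difference between icmV and icmL, the genes falling below asd, and the most variable genes.

```python
# Kn/Ks ratios as printed in Table 3.
KN_KS_RATIO = {
    # icm/dot region II
    "icmF": 0.067, "icmB": 0.019, "icmJ": 0.019, "icmD": 0.022, "icmC": 0.054,
    "icmG": 0.092, "icmK": 0.047, "icmL": 0.011, "icmM": 0.127, "icmN": 0.057,
    "icmP": 0.026, "icmQ": 0.025, "icmR": 0.120, "icmS": 0.018, "icmT": 0.021,
    # icm/dot region I
    "icmV": 0.163, "icmW": 0.017, "icmX": 0.117, "dotA": 0.118,
}
ASD_RATIO = 0.032  # housekeeping gene used as the reference in the text

# Fold difference between the most and least variable icm/dot genes.
fold_v_over_l = KN_KS_RATIO["icmV"] / KN_KS_RATIO["icmL"]

# Genes whose ratio is lower than that of asd.
below_asd = sorted(gene for gene, ratio in KN_KS_RATIO.items() if ratio < ASD_RATIO)

# Genes ranked from highest to lowest ratio.
ranked = sorted(KN_KS_RATIO, key=KN_KS_RATIO.get, reverse=True)

print(round(fold_v_over_l, 1))
print(below_asd)
print(ranked[:5])
```

With these values the icmV/icmL fold difference comes out at about 14.8, the nine genes below asd are exactly those listed as highly conserved, and the five highest ratios belong to icmV, icmM, icmR, dotA, and icmX.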

    Fig. 2. Comparison of Kn/Ks and dG values for icm/dot and

    non-icm/dot genes. Kn/Ks ratios and dG values for the icm/dot

    genes shown in order from highest (icmV) to lowest (icmL) Kn/Ks values. Dashed lines correspond to locus II mean values.

    Not all amino acid substitutions in the genes with

    low Kn/Ks ratios are conservative, as assessed by

    changes in amino acid physico-chemical properties, and there are cases of genes with relatively conservative

    amino acid substitutions that nonetheless

    have a high level of gene variability as judged by Kn/Ks

    ratios (Fig. 2). For example, the IcmJ, S, and W

    protein products, despite displaying relatively low Kn/Ks values, have amino acid substitutions that

    result in drastic changes in their properties; on the

    other hand, three genes (icmN, P, and Q) have

    close to locus II average (0.05) Kn/Ks ratios, but their

    encoded proteins have extremely low dG values,

    indicating that only substitutions in amino acids

    with similar physico-chemical properties have been

    permitted. Since nucleotide substitutions may exert their influence on the function of the final protein

    product at any of several levels (e.g., DNA, mRNA

    or protein), Kn/Ks ratios reflect general restrictions

    on gene and protein variability. On the other hand,

    dG values reflect variation purely in protein

    structural and functional features, indicating some

    restrictions on the amino acid substitutions at

    the level of the final functioning product. In this sense, icmN, P, and Q may be considered the most

    conservative of the icm/dot genes.

    There is no obvious correlation between the

    predicted cell localization of the protein products

    of these genes and their variability levels. While

    IcmN is thought to be an outer membrane protein

    and not necessary for macrophage killing, IcmK is

    an indispensable periplasmic or outer membrane protein, IcmP is an inner membrane protein, and IcmQ is a soluble

    cytoplasmic protein required for pore formation

    (Andrews et al., 1998; Coers et al., 2000; Dumenil

    and Isberg, 2001; Segal and Shuman, 1998a;

    Watarai et al., 2001).

    Overall, the levels of sequence variation found

    among the non-icm genes in L. pneumophila strains

    (last group of genes in Table 3) and most of the icm genes from locus II were comparable to the

    level of diversity in, for example, Salmonella ent-

    erica housekeeping genes reported by Boyd et al.

    (1997) (where the mean nonsynonymous to syn-

    onymous nucleotide substitution ratio was 0.032).

    The level of polymorphism among icm genes from

    locus I (second group in Table 3) and some locus II

    members significantly exceeds that for both Legionella and Salmonella housekeeping genes and

    most of the genes from icm/dot locus II, and cor-

    responds to the variability level for the spaM and

    spaN genes of the S. enterica inv-spa pathogen

    invasion complex (Boyd et al., 1997).

    The order of the icm/dot genes was apparently

    the same in all 18 strains we examined, as assessed

    by our ability to amplify these genes using primers from the expected surrounding genes.

    3.3. Paralogs of icm/dot genes in Philadelphia strain

    of L. pneumophila

    It is not unusual to find distant homologs

    among the genes of a single organism. These may

    represent members of a gene family that carry out related but not identical functions, or they may no

    longer have any functional properties in common.

    Among the icm/dot genes, four partial homologs

    (paralogs) for the 3′ part of icmL (134 aa), one for its 5′ portion (79 aa), and one for icmC were

    identified in a search of the now essentially complete Philadelphia 1 genome

    (http://genome3.cpmc.columbia.edu/~legion/). The icmL paralogs are located in different regions of the genome, and

    the icmC1 paralog is separated from the locus II

    icmC gene by 23 kbps. In each case, the paralogs

    are surrounded by genomic housekeeping genes.

    The average protein sequence homology between

    IcmL and its 3′ paralogs is relatively low but clear: 31% identity and 52% similarity over an approximately

    120 aa (amino acid) stretch. For comparison, the L. pneumophila IcmL has 91% amino acid

    identity with L. longbeachae IcmL over a 220 aa

    stretch; 39% identity to C. burnetii IcmL over 200

    aa; and 25–30% identity to traM genes (Klebsiella

    oxytoca, Pseudomonas syringae, Escherichia coli,

    and Salmonella typhimurium plasmids) over 160–180 aa.

    IcmC and IcmC1 have 40% identity over 171 aa.

    3.4. Further analysis of individual icm/dot genes

    Multiple alignments of the icm/dot genes in all

    the L. pneumophila strains under study permitted

    more detailed sequence analyses. The sequence

    variation patterns at both the nucleotide and amino acid levels, and dG and hydrophilicity

    profiles along the length of each ORF were determined,

    as well as potential structural and

    functional motifs. In Fig. 3, the distribution of

    nucleotide and amino acid substitutions along the

    nucleotide and corresponding amino acid sequences

    is compared for all the sequenced icm

    genes in L. pneumophila strains. Apparently, in many cases, nonsynonymous substitutions (leading

    to amino acid changes in encoded proteins) are

    distributed unevenly along the sequence. The gene

    regions with low or no nonsynonymous substitutions

    and close to average number of synonymous

    substitutions are of special interest since the observed

    conservatism cannot be explained merely by

    too little evolutionary time for the compared sequences to diverge. These regions, conservative at

    the protein level, especially those preserved also in

    distant homologous proteins, may correspond to

    important protein domains, so where possible,

    comparisons were made with distant homologs in

    other bacteria in conjunction with the functional

    motif predictions. A more detailed description of some of the icm genes (icmP, G, N, and K) follows.

    Fig. 3. Distribution of nucleotide and amino acid substitutions in the icm genes among L. pneumophila strains. Every nucleotide

    (bottom halves of each bar) and amino acid substitution (upper bar halves) along the icm sequences from all the L. pneumophila strains

    is indicated with a vertical hatchmark.

    3.5. IcmP

    IcmP is believed to be an inner membrane

    protein, possibly involved in DNA transfer, and

    absolutely indispensable for macrophage killing (Segal and Shuman, 1998a). The gene product is

    predicted to have a signal peptide (aa 1–35), transmembrane

    regions (aa 17–39 and 92–114) and a

    trbA domain (aa 204–372). trbA is one of the genes

    found within the transfer region of IncI1 plasmids

    such as R64, and is absolutely required for conjugal

    transfer of these plasmids (Furuya and Komano,

    1996). Although distant homologs of icmP are found in Coxiella, Pseudomonas, and Salmonella,

    they display a low overall level of sequence similarity

    (18–35% identity at the protein level); only in

    the region of the trbA domain is slightly increased

    homology found. Nonetheless, all the homologs

    have very comparable hydrophilicity profiles over

    their entire lengths (Fig. 4). Since the gene has not

    been allowed to accumulate significant variable amino acid positions, it is likely to share a closely

    related function in these fairly diverse genera.

    Taking the 15 L. pneumophila strains as a

    group, both synonymous and nonsynonymous

    substitutions are distributed evenly along the icmP

    sequence, but the gene appears to be very conservative, both at the nucleotide and amino acid

    levels, with the lowest dG value of all the icm and

    housekeeping proteins sequenced, especially in the

    trbA region.

    Fig. 4. Hydrophilicity profiles of IcmP and distant homologs.

    Red: L. pneumophila IcmP; blue: Coxiella IcmP homolog;

    green: Pseudomonas sp. PyR19 plasmid conjugal-transfer related

    sequence SAT (gi 2642198); brown: Salmonella typhimurium

    R64 plasmid trbA gene (gi 20521502).

    Fig. 5. Hopp and Woods hydrophilicity profiles for IcmG and

    its homologs. Blue: L. pneumophila IcmG; red: TraP of plasmid

    R64, gi 4903119; green: C. burnetii IcmG homolog.

    Fig. 6. Alignment of t-SNARE domains in assorted proteins. Sequences (from top to bottom): L. pneumophila IcmG; Bradyrhizobium

    japonicum Blr2548 protein (BAC47813); Clostridium acetobutylicum methyl-accepting chemotaxis protein (AE007559); Pseudomonas

    aeruginosa probable chemotaxis transducer (AE004706), t-SNARE domains predicted by SMART system; human SNAP25 C-terminal

    end (D21267); SNAP25 N-terminal end (D21267); Saccharomyces cerevisiae SEC9p protein, putative t-SNARE (NP_011523).

    The conservative amino acids are highlighted.

    Fig. 9. Hydrophilicity profile of IcmK and distant homologs.

    Red: L. pneumophila; blue: L. longbeachae; green: C. burnetii;

    brown: Shigella TraN; black: Klebsiella TraN.

    3.6. IcmG

    IcmG has also been predicted to be an inner membrane protein; mutation of this gene leads to a

    partial reduction in the bacteria's ability to kill macrophages (Segal et al., 1998). When Legionella

    pneumophila strains are compared, IcmG shows

    elevated variability, both at the nucleotide and

    protein levels. Variable positions are almost evenly

    distributed along the sequence, except in the vicinity

    of the C- and N-termini that lack even synonymous substitutions.

    Fig. 5 shows hydrophilicity profile comparisons

    for icmG in Legionella and two distant homologs, C.

    burnetii IcmG and plasmid TraP. Despite relatively

    low sequence homology among the three genes (less

    than 20% at the protein level), their predicted sec-

    ondary structures (not shown) and hydrophilicity

    profiles display significant similarity. Preservation of the protein structure in some cases may be more

    important for a protein's function than the amino acid sequence itself, and probably because of this,

    structure-based methods of searching for distant

    homologs are more efficient than sequence-based

    approaches (Pawlowski et al., 2001; Sauder et al.,

    2000). Examples of related bacterial proteins with

    very low sequence identity but nearly identical structures are not uncommon (Bauer et al., 2001;

    Ginalski et al., 2000; Girardeau et al., 2000).
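    The hydrophilicity profiles compared above can be illustrated with a minimal sketch of a Hopp and Woods sliding-window calculation. This is an assumption-laden example rather than the software used for Figs. 4, 5, and 9: the function name, the default window of six residues (the hexapeptide window of the original Hopp and Woods method), and the short test sequences are all chosen here for illustration only.

```python
# Hopp and Woods hydrophilicity values for the 20 standard amino acids.
HOPP_WOODS = {
    "R": 3.0, "D": 3.0, "E": 3.0, "K": 3.0, "S": 0.3,
    "N": 0.2, "Q": 0.2, "G": 0.0, "P": 0.0, "T": -0.4,
    "A": -0.5, "H": -0.5, "C": -1.0, "M": -1.3, "V": -1.5,
    "I": -1.8, "L": -1.8, "Y": -2.3, "F": -2.5, "W": -3.4,
}


def hydrophilicity_profile(sequence: str, window: int = 6) -> list[float]:
    """Average Hopp-Woods hydrophilicity over each window along a protein sequence.

    Returns one value per window position (len(sequence) - window + 1 values).
    Raises KeyError for residues outside the 20 standard amino acids.
    """
    seq = sequence.upper()
    if window < 1 or window > len(seq):
        raise ValueError("window must be between 1 and the sequence length")
    values = [HOPP_WOODS[aa] for aa in seq]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical toy peptide: a charged stretch followed by a hydrophobic stretch.
toy_profile = hydrophilicity_profile("KKDEKRLLIVFW")
```

    Profiles computed this way for two homologs can be overlaid position by position along an alignment, which is how similar overall shapes can be recognized even when sequence identity is low.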

    For the IcmC–TraQ (not shown) and IcmG–TraP

    comparisons, the protein similarity at these

    higher structural levels is indeed stronger than at

    the sequence level. Thus, despite sequence discrepancies,

    the major function of these distant

    homologs may remain intact. Local dissimilarities of the protein profiles, as in the case of IcmG–TraP

    at positions 165–185 (Fig. 5), require additional

    analysis. The Legionella and Coxiella IcmG

    proteins, unlike their TraP homolog, are predicted

    to have a t-SNARE domain precisely in this region

    (aa 142–210 in the Legionella IcmG protein; aa 95–194

    in the C. burnetii homolog, which correspond to positions 153–221 in the aligned sequences in Fig. 5), and this similarity extends beyond the coiled-coil

    structural features predicted for all three homologs

    in this area (positions 123–179) (Segal and Shuman,

    1998b). [Weimbs et al. (1997, 1998) even

    screen out coiled-coil features when performing

    t-SNARE domain searches.] Proteins with t-SNARE

    domains play important roles in membrane

    fusion in eukaryotes (Weber et al., 1998). While the t-SNARE domains are highly diverse,

    they usually possess a central glutamine (Q) resi-

    due and preserve the overall domain structure

    (Gotte and von Mollard, 1998; Weimbs et al.,

    1998). There are only a few bacterial proteins

    known to have similarity to the t-SNARE domain

    (SMART Accession No. SM0397); most of these

    are bacterial sensor and chemotaxis integral membrane proteins. Several examples of these are

    aligned with IcmG in Fig. 6. It will be interesting

    to see if the t-SNARE domain is conserved in non-

    pneumophila Legionella species with icm/dot loci. If

    it is required for IcmG function during infection,

    this feature may differentiate the global function

    of the Legionella icm/dot system from that of its

    homologs in other organisms.

    3.7. IcmN

    IcmN is a putative outer membrane lipoprotein,

    containing a signal peptide, and is dispensable for

    macrophage killing (Segal et al., 1998). The se-

    quence is well conserved, especially at the protein

    level: amino acid substitutions among L. pneumophila strains occur only in the N-terminal half of the

    protein, and the alternative amino acids always

    have very similar physico-chemical properties (Figs.

    2 and 3). An alignment of L. pneumophila and L.

    longbeachae sequences also reveals that the C-ter-

    minal half (after aa 90) is more conserved than the

    N-terminal portion (Fig. 7). Starting at aa 83, the

    IcmN protein shows weak homology to the OmpA domain (Pfam PF00691), which is found in bacterial

    porin-like integral-membrane proteins and lipo-

    proteins, most of which, like IcmN, have a con-

    served OmpA domain within the C-terminal half

    and a variable N-terminal portion. Some members

    of this protein group have antigenic determinants,

    but IcmN does not display obvious hypervariable

    regions. The alignment with several distant homologs

    reveals two extremely conserved motifs:

    QGVD at aa 147 and RVEIT at the C-terminus

    (boxed in Fig. 7).

    Fig. 7. Alignment of IcmN gene product with distant homologs. Dot[...]

    (top to bottom with NCBI accession numbers): L. pneumophila IcmN [...]

    hypothetical protein (NP_249524); E. coli putative outer membrane pr[...]

    3.8. IcmK

    The IcmK product is putatively a periplasmic or outer membrane protein, and possesses a secretion

    signal peptide (Andrews et al., 1998); the protein is

    needed for pore formation (Kirby et al., 1998),

    indispensable for macrophage killing, but not

    necessary for conjugation (Andrews et al., 1998;

    Segal and Shuman, 1998a). It is homologous to the

    plasmid traN gene product. According to the Pfam

    database, the TraN domain starts at position 62 of both the protein alignment (Fig. 8) and the hydrophilicity

    profile for L. pneumophila icmK and

    its distant homologs (Fig. 9); the alignment shown

    before that point is uncertain owing to very low

    homology.

    As seen in the alignment, the homology level

    between the orthologs is quite low with

    [...] hydrophobic portion among distant homologs

    suggests that this region is a functionally impor-

    tant domain.

    3.9. Phylogenetic relationships between strains

    based on icm gene sequence

    The dot/icm genes were presumably introduced

    into the Legionella genomes from a plasmid

    (Komano et al., 2000; Segal and Shuman, 1998a),

    possibly prior to their separating into two loci. It is

    unknown, though, if this was a one-time event or

    the region(s) were lost and re-introduced repeatedly during Legionella evolution. Often when gene

    transfer occurs from a distant organism with different

    nucleotide content, the transferred region is

    evident due to its different GC content compared

    to the rest of the genome. In the case of Legionella's

    icm/dot loci, their GC content is equivalent to the genome average (38%). Moreover, the regions

    are distinct from those of their homologs in Coxiella

    and the R64 plasmid where most icm/dot gene

    homologs are around 44 and 50% GC, respectively.

    It is possible that the transfer occurred from a different plasmid with similar GC content to that

    of Legionella. Based on the differences between

    phylogenetic trees built for mip and dotA (Bumbaugh

    et al., 2002) and dotA and rpoB genes (Ko

    et al., 2002a,b), it has been suggested that repeated

    events of genetic exchange or loss and acquisition

    led to the current complex composition of these

    loci.

    Fig. 8. Alignment of IcmK and TraN gene products. Dots represent identical amino acids and dashes are gaps in the alignment.

    Sequences from top to bottom: L. pneumophila, L. longbeachae (AF288617), and C. burnetii IcmK; E. coli (Shigella sonnei) plasmid

    ColIb-P9 TraN (BAA75158) (has only 1 aa difference with Salmonella typhimurium IncI1 plasmid R64 TraN, BAB91663); Klebsiella

    oxytoca plasmid pACM1 primase (AF139719).

    Fig. 10. icmK variability profiles. A window size of 15 amino acids or codons was used. See text for details.

    To determine if the rates of molecular evolution

    of icm genes are disparate in different L. pneumophila

    strains, a comparison of icm genes from

    all available strains was undertaken, using their

    C. burnetii orthologs as outgroups (Sexton and

    Vogel, 2002). The distances in synonymous and

    nonsynonymous substitutions per corresponding

    site were analyzed separately, as was done by Whittam and Bumbaugh (2002). All analyzed

    genes from all the L. pneumophila strains showed

    approximately equal relative substitution rates

    (data not shown). It is probable, though, that

    minor differences were missed, using such distant

    homologs from Coxiella. In the future, when more

    of the closer homologs, e.g., icm/dot genes from

    other Legionella species, are available, it should be possible to obtain a finer resolution.

    A detailed phylogenetic analysis was carried

    out. Phylogenetic trees were built for 18 icm genes,

    3 housekeeping genes, and the icmB/tphA intergenic

    region as well as combined trees built for icm locus subregions (i.e., concatenated icm genes

    from extensive portions of the two loci or the entire

    loci). The presented trees were built by two

    methods: NJ, with 1000 bootstrap iterations to

    estimate confidence level for the tree topology, and

    the split decomposition method which displays

    branching alternatives in a single representation.

    While trees were built for each gene and several icm/dot subregions, only some representative

    examples are included in Fig. 11.
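    The bootstrap step mentioned above for the NJ trees can be sketched, under stated assumptions, as resampling alignment columns with replacement to produce pseudo-replicate alignments of the same size; a tree would then be rebuilt from each replicate and the fraction of replicates containing a given grouping reported as its bootstrap value. The function name, the dictionary layout, and the toy sequences below are hypothetical, and tree building itself is deliberately left out.

```python
import random


def bootstrap_alignment(alignment: dict[str, str], rng: random.Random) -> dict[str, str]:
    """Return one bootstrap pseudo-replicate of a multiple sequence alignment.

    Columns are sampled with replacement, and the same sampled columns are
    applied to every sequence so that each replicate column is a real column.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_cols = lengths.pop()
    sampled = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {
        name: "".join(seq[i] for i in sampled)
        for name, seq in alignment.items()
    }


# Hypothetical toy alignment; strain labels are placeholders only.
toy_alignment = {
    "strain_a": "ATGCCGTA",
    "strain_b": "ATGCCGTT",
    "strain_c": "ATGACGTA",
}

# A fixed seed keeps the example reproducible; 1000 replicates mirrors the text.
rng = random.Random(0)
replicates = [bootstrap_alignment(toy_alignment, rng) for _ in range(1000)]
```

    Groupings that depend on only a few alignment columns tend to disappear in many replicates, which is why low bootstrap values make a branching order hard to interpret.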

    Based on the combined phylogenetic trees, the

    strains consistently group into seven subsets: [Leg

    5, 1, 9], [6, 11, 32], [{36, 10}, {30, 35}], [3, 4, 8, 34],

    [2, 33], and [7, 31], though the separation between

    groups {36, 10} and {30, 35} is less consistent (cf.

    Figs. 11A and C). This clustering is almost identical, with a few exceptions, for the icm genes of

    the two loci, housekeeping genes and the icmB/tphA

    intergenic region and is supported by high boot-

    strap values on almost all of the trees. Exceptions

    to this clustering were most frequently found with

    the Leg6 strain, which, for 6 icm and 3 house-

    keeping genes, merges with the (5, 1, 9) group (see

    for example the tree for icmK in Fig. 11D). It appears that in the case of trees built for genes of

    the small locus (icmV, W, and X) and those at one

    end of the large icm locus (icmF, tphA, icmB, J, D,

    C, and G), Leg 6 belongs to the (11, 32) group,

    whereas based on trees built for many of the genes

    at the other end of locus II (icmK, L, N, R, S, and

    T), this strain falls into the (5, 1, 9) group. In only a very few cases was the clustering violated by

    other strains (e.g., Leg 30 and 35 are in separate

    branches in icmV, W, X, F, and B individual gene

    trees).

    Despite the largely consistent strain clustering,

    the relationship between clusters is not as clear,

    that is, the groups as a whole can switch their

    relative positions in different trees and sometimes cannot be positioned unambiguously (for example,

    see Fig. 11D). In many cases, these cluster relocations

    have low bootstrap values, making it

    difficult to judge whether they correspond to actual

    gene transfer or to recombination events.

    In all the trees Leg 7 and 31 constitute a separate

    group, so distant from the remaining strain

    clusters that it almost has the appearance of an outgroup. But when strains 7 and 31 are considered

    independently, they seem to be almost as

    distant from each other as from the remaining

    strains (Fig. 11C). Thus, they probably do not

    form an actual group, but are merely the two most

    divergent strains of L. pneumophila examined. It

    was previously shown that L. pneumophila strain

    Dallas, serogroup 5, which corresponds to our Leg 7 strain, belongs to L. pneumophila subspecies

    fraseri (Brenner et al., 1988) and that the dotA and

    mip genes from this strain were most distant from their homologs in other L. pneumophila strains

    (Bumbaugh et al., 2002).

    The observed strain clustering does not correlate

    with serogroups. Thus, while both Leg 2 and 33

    belong to serogroup 11 and also to one cluster, none

    of the five strains of serogroup 1 for which we have

    sequences (Leg 1, 3, 31, 35, and 36) group together.

    Trees built for the locus I icm genes vary the most from the locus II genes (compare the two

    combined trees in Fig. 11B, left and right). The

    initial trees were aligned by rotating branches

    around internal nodes, while preserving the

    branching pattern, to accentuate the differences

    between the two resulting topologies. Branches

    corresponding to strains Leg 5, 35, 1, and 6 could

    not be aligned.

    Fig. 11. Phylogenetic trees. The gene sets do not include dotA, B, C, icmO or icmE. Numbers at the nodes are bootstrap values. (A)

    Combined NJ tree for all icm genes. (B) Aligned NJ trees for the two icm loci. Notable differences in the tree topologies are detailed in

    the text. Left: locus II (all locus II icm genes except icmF). Right: locus I (icmX, W, and V only). (C) Combined split decomposition tree

    for all icm genes. This figure shows the strain clustering, emphasizing the divergence of Leg 7 and 31 from each other and from the rest

    of the strains. (D) Split decomposition tree for IcmK, demonstrating the complicated picture of group branching. Rectangles represent

    alternative branching.

    4. Discussion

    The icm/dot gene loci are present in each of the

    L. pneumophila serogroups and strains we sequenced.

    Moreover, based on our ability to amplify

    and sequence across genes of interest using primers in expected surrounding adjacent genes, it

    appears that gene order within these clusters is also

    retained within the L. pneumophila strains. Other

    investigators have shown that among nine strains

    of L. pneumophila, eight from serogroup 1 including

    three commonly used in laboratory studies

    (AA100, JR32, and Lp01), the presence or absence

    of two loci involved in Type IV secretion (traI and lvh) and the rtxA locus, may correlate to some

    extent with the strains' pathogenicities (Samrakandi et al., 2002). More specifically, the lvh and

    rtxA loci were found more commonly in strains

    generally associated with disease, whereas the traI

    locus was not. These authors also were able

    to detect and discriminate these genes by hybridization

    in some non-pneumophila species. More recently, dissection of an expanded locus surrounding

    a set of the so-called tra/trb genes, presumably

    involved in pilus assembly, distinct from

    the traI locus of the AA100 strain, as well as from

    the icm/dot and lvr/lvh loci, revealed it to be a likely

    pathogenicity island, containing additional genes

    for putative virulence factors such as methionine sulfoxide reductases, as well as plasmid mobility elements; while present in Philadelphia 1-derived

    strains, it appears to be missing in part or in its

    entirety from JR32 and several clinical isolates

    (Brassinga et al., 2003). Interestingly, this locus

    contains paralogs of the lvrA, B, and C genes of

    the lvr/lvh Type IV secretion locus. Perhaps the

    most intriguing finding was the presence of a 30 kb

    unstable genetic element in strain Olda but not in Philadelphia 1 strains, possibly phage derived,

    involved in phase variation (Luneberg et al., 2001).

    When the element is integrated into the chromosome, the strain

    is virulent, but when it is excised and replicating as a

    high-copy plasmid, it results in a mutant phenotype

    with a modified lipopolysaccharide O-antigen

    epitope associated with reduced virulence.

    At this point, we have insufficient evidence to determine if the icm/dot genes are absent or present

    in most other Legionella species, with the excep-

    tion of L. longbeachae where good hybridization

    signals were obtained for icm C, D, G, K, L, M, O,

    P, and T; weak signals with J, Q, R, S, V, and X;

    and no signal for icm B, E, and F. Six L. long-

    beachae icm/dot genes from the center of locus II

    have been submitted to GenBank by T. Rogers, S. List, R.M. Doyle, and M.W. Heuzenroeder. In

    cases where we do not obtain positive hybridiza-

    tion signals using L. pneumophila probes, it is

    probable that the orthologs are too dissimilar

    in their sequence, at least in the region between

    where the primers were designed, to be detected by

    even the reduced stringency hybridization or amplification

    used in this study. Their characterization thus awaits large-scale sequencing of other

    species, or the use of degenerate oligonucleotide-

    based PCR. Terry Alli et al. (2003) recently re-

    ported the presence of the icm/dot loci in every

    Legionella species they examined based on hy-

    bridization, even under high stringency conditions,

    using pooled regional probes. While we did get

    weak signals with many icm/dot genes in non-pneumophila species similar to the ones they dis-

    played in their paper, we are unable to explain the

    several cases of disagreement, except that we used

    single gene probes which might have been too

    species-specific. Since we probed the same blots

    subsequently with several other probes for 16S

    rRNA, housekeeping or lvh/lvr genes and obtained

    excellent signals, the absence of hybridization with those icm/dot gene probes cannot be due to the

    quality of the DNA itself.

    The two icm/dot clusters may have been subject

    to substantial changes in the course of their intra-

    species evolution. Given that the icm/dot loci are

    present in all the L. pneumophila strains from the

    15 different serogroups we examined and the fact

    that these strains have a 100-fold range in their ability to replicate within macrophages (data not

    shown), it might be expected that the strains' differences in virulence depend on sequence variations

    within the genes, especially in functionally

    important gene and protein regions, such as those

    responsible for efficient transport of effector molecules.

    Of course, it is also possible that altered

    regulation of these genes (when and where they are expressed), or changes in the effector molecules themselves,

    can contribute to the pathogenic phenotype. It is

    worth noting that even though the entire icm gene

    set is present in the Coxiella genome (Seshadri

    et al., 2003), its lifestyle is very different from that

    of Legionella. In particular, Coxiella does not seem

    to depend on the disruption of phagosome–lysosome

    fusion for its survival, which is considered to be the main function of the icm/dot system in Legionella.

    In the current study, we assessed the level

    of diversity among genes of the dot/icm loci, fo-

    cusing on the putative functional domains that are

    preserved even in distant homologs.

    The dot/icm genes display a wide range of

    variability, some being more conservative than

    an average housekeeping gene (icmP, Q, D, T, J, B, S, W, and L), while others are 5–10 times

    more variable (icmM, R, V, X, and dotA), as

    indicated by the ratio of nonsynonymous and

    synonymous nucleotide substitutions. Low vari-

    ability at the sequence level, though, does not

    necessarily mean that all the observed amino

    acid substitutions are conservative with regard to

    their physico-chemical properties. For example, it appears that IcmT, J, S, and W proteins are

    permitted rather dramatic amino acid substitu-

    tions. In contrast, IcmN, P, and Q are extremely

    conservative at this level, but not as much at the

    sequence level. In general, genes from locus I

    show higher diversity compared to locus II, both

    at the gene and protein levels.

    A second category of intra-species variation is positional, with some portions of the genes and

    their products more dissimilar than others. For

    instance, the IcmK and IcmV proteins have many

    more amino acid substitutions in their N-terminal

    than their C-terminal portions. Most variation at

    the amino acid level is found at the ends of IcmP,

    but centrally in IcmG. At the same time, the silent

    nucleotide changes are often distributed evenly along the gene, indicating that the preservation of

    amino acid sequence in some regions is not simply

    due to time of gene divergence, but rather to the

    presence of important functional domains, especially

    when the sequence, or at least the protein

    structure, is preserved in distant orthologs. It is

    interesting in this regard that remote homology

    detection by structural methods has helped predict the function of many otherwise uncharacter-

    ized proteins in several sequenced genomes

    (Pawlowski et al., 1999, 2001; Rychlewski et al.,

    1998).

    For some icm/dot genes (icmP, G, N, and K) the

    combination of relatively high regional sequence

    conservatism and the presence of predicted domains

    and sequence and/or structure preservation in distant homologs in the same areas serve as

    indicators of the presence of a functional domain,

    though they await experimental proof. Features

    such as the t-SNARE-like domain in IcmG and its

    Coxiella homolog, occur rarely enough in bacterial

    genes as to make them noteworthy. If the

    t-SNARE domain is functional in IcmG, it may

    compete with the host's membrane fusion SNARE system, potentially altering its normal vesicular

    trafficking pathways, and preventing phagosome–lysosome

    fusion, for the bacteria's own ends. Thus these findings may provide the impetus for future

    experimental studies to more directly determine

    the function of these proteins.

    Phylogenetic analysis for individual genes as

    well as locus subregions largely reveals similar strain groupings, as in Fig. 11C. However, some

    branches either switch their positions on different

    trees or cannot be unambiguously positioned.

    Though it is tempting to speculate that these rep-

    resent instances of lateral transfer within the locus,

    it is not possible to determine this with any

    certainty.

    Not only are the locus I genes more variable than most of locus II, but interestingly, genes of

    the smaller locus (icmW, V, X, and dotA) have

    accumulated more silent nucleotide substitutions

    per site (Ks values) than most of those from locus

    II. If both loci were acquired, probably from a

    plasmid, at the same time, this may mean that

    locus I is evolving at a higher rate. Alternatively,

    under the assumption that the evolutionary rateshave been the same and unchanged for both loci,

    genes from the smaller locus must be older than

    most of those in the large icm cluster. This, taken

    with the fact that the most disparate branching

    patterns are observed when either individual or

    combined trees for icm/dot locus I vs locus II are

    compared, leads to the assumption that the icm/

    dot region has a rather complex history of geneacquisition and rearrangment events. In Coxiella

    all the icm genes are located next to each other

    whereas in L. pneumophila they are split into two

    icm/dot loci that are located on opposite sides of

    the circular genome (http://genome3.cpmc.colum-

    bia.edu/~legion/index.html). This may serve as an

    additional indication that two loci in Legionella

    were acquired separately or rearranged after-wards.
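    To make the notion of silent versus amino acid-changing substitutions concrete, the
    sketch below classifies codon differences between two aligned, in-frame coding sequences
    using the standard genetic code. It is a simplified illustration under stated
    assumptions: it reports raw counts of differing codons, not the per-site Ks values
    discussed above, which require an estimator such as that of Li et al. (1985). The
    function name `classify_codon_differences` and the example sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_codon_differences(seq_a: str, seq_b: str) -> tuple:
    """Return (synonymous, nonsynonymous) counts of differing codons.

    Assumes the two sequences are already aligned, gap-free, and in frame.
    Codons containing characters other than A, C, G, T are skipped.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must have equal length divisible by 3")
    synonymous = nonsynonymous = 0
    for i in range(0, len(seq_a), 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous


# Made-up example: CTT -> CTC is silent (Leu), GAA -> GAT changes Glu to Asp.
print(classify_codon_differences("ATGCTTGAA", "ATGCTCGAT"))
```

    Counting whole codons treats a codon with two changed bases as a single event, which is
    one of the reasons proper Ks estimation weights individual sites and substitution
    pathways instead.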

    So far, full icm/dot gene sets have only been found in two relatively close species
    (Legionella and Coxiella), and this system differs substantially from the known Type IV
    systems. Nonetheless, given the limited but obvious homology between most icm/dot genes
    from both loci and tra/trb genes, it is possible that they derived from the same ancestor.
    This ancestor may be of plasmid origin or may have been assembled from various chromosomal
    components in ancestral bacteria; in the latter case, these genes may subsequently have
    been incorporated into a plasmid, support for which would come from the fact that many
    different bacteria possess tra-like genes (e.g., Type IV secretion systems).

    Other researchers have also pointed out that

    the icm/dot region may have a complicated evolutionary history in L. pneumophila.
    Bumbaugh et al. (2002) compared dotA and mip (a 24 kDa surface protein with
    peptidyl-prolyl-cis/trans isomerase activity that may be involved in establishment of
    infections, but not intracellular survival (Cianciotto et al., 1990)) in 17 clinical and
    environmental isolates. Compared to mip, DotA, a cytoplasmic membrane spanning protein,
    was extremely and perhaps unexpectedly variable, and the neighbor-joining trees produced
    for the two genes were discordant at several branch points with high bootstrap values.
    The authors considered this an indication of lateral gene transfer and recombination and
    relatively recent gene dispersal. Ko et al. (2002b) compared the dotA and rpoB alleles in
    79 Korean isolates of L. pneumophila from six clonal populations. The most parsimonious
    tree produced using rpoB distinguished four closely related L. pneumophila pneumophila
    subspecies and two closely related L. pneumophila fraseri subspecies. In contrast, in the
    case of dotA, one of the pneumophila subspecies seemed more closely related to the fraseri
    subspecies than to the other three pneumophila. Some caution should be exercised, however,
    in that these authors previously showed that the rpoB trees themselves differed
    substantially from 16S rRNA and mip trees, which was the basis for distinguishing the six
    clonal populations (Ko et al., 2002a). Our comparisons, taking into consideration nearly
    all the members of the icm/dot loci, may point out additional subpopulations, especially
    for those genes showing substantial variation.

    In the future, comparisons with icm and lvh plasmid gene orthologs may be especially
    interesting. Since the lvh/lvr locus is likely to have been inherited as a plasmid unit,
    as we discovered during the sequencing of the Philadelphia 1 genome (manuscript in
    preparation), with a substantially higher GC content (43%) than the rest of the genome
    (Segal et al., 1999), we intend to compare its history with that of the icm/dot islands,
    which have only some of the classic features of pathogenicity islands (apparent absence
    of essential genes, all-or-none presence of the complete gene set), but not others (GC
    content the same as the remainder of the genome, separation into two subsets). The
    separate tra/trb locus also appears to be a pathogenicity island, the central core of
    which has an elevated GC content (Brassinga et al., 2003), and is thus another good
    candidate for such comparative sequence analysis.
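    Regions with atypical GC content, such as the lvh/lvr and tra/trb loci mentioned above,
    are usually spotted by scanning a sequence in windows. The following is a minimal sketch
    of such a scan, assuming a plain nucleotide string as input; the function names
    `gc_content` and `windowed_gc` and the window parameters are illustrative choices, not
    taken from the study.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of seq."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        return 0.0
    return sum(base in "GC" for base in counted) / len(counted)


def windowed_gc(seq: str, window: int, step: int) -> list:
    """Return (start, GC fraction) for each full window along seq."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Made-up sequence: a GC-rich stretch followed by an AT-rich stretch.
for start, fraction in windowed_gc("GGGGCCCCAAAATTTT", window=8, step=8):
    print(start, fraction)
```

    Windows whose GC fraction departs clearly from the genome-wide value are candidates for
    horizontally acquired segments, although composition alone is not conclusive, as the
    icm/dot loci themselves illustrate.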

    Acknowledgments

    Strains Leg 1–Leg 34 were kindly provided by Dr. Barry Fields at the CDC; Leg 35 and
    Leg 36, specimens from an outbreak at a Dutch flower show, were a generous gift from
    Dr. Ruud van Ketel at the University of Amsterdam. We thank Huitao Sheng for assistance
    in sequence submission and Dr. Pavel Morozov for helpful comments throughout the course
    of this work. This work was supported by NIH Grant U01 1 AI 44371 awarded to J.J.R., and
    funds generously provided by the Columbia Genome Center.

    References

    Adeleke, A., Pruckler, J., Benson, R., Rowbotham, T., Halablab, M., Fields, B., 1996.
    Legionella-like amebal pathogens: phylogenetic status and possible role in respiratory
    disease. Emerg. Infect Dis. 2, 225–230.

    Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J., 1990. Basic local
    alignment search tool. J. Mol. Biol. 215, 403–410, doi: 10.1006/jmbi.1990.9999.

    Andrews, H.L., Vogel, J.P., Isberg, R.R., 1998. Identification of linked Legionella
    pneumophila genes essential for intracellular growth and evasion of the endocytic
    pathway. Infect. Immun. 66, 950–958, id: 0019-9567/98/$04.00+0.

    Avison, M.B., Simm, A.M., 2002. Sequence and genome context analysis of a new molecular
    class D β-lactamase gene from Legionella pneumophila. J. Antimicrob. Chemother. 50,
    331–338, doi: 10.1093/jac/dkf135.

    Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwiller, L., Eddy, S.R.,
    Griffiths-Jones, S., Howe, K.L., Marshall, M., Sonnhammer, E.L., 2002. The Pfam protein
    families database. Nucleic Acids Res. 30, 276–280.

    Bauer, F., Schweimer, K., Kluver, E., Conejo-Garcia, J.-R., Forssmann, W.-G., Rosch, P.,
    Adermann, K., Sticht, H., 2001. Structure determination of human and murine β-defensins
    reveals structural conservation in the absence of significant sequence similarity.
    Protein Sci. 10, 2470–2479.

    Benson, R., Fields, B., 1998. Classification of the genus Legionella. Semin. Respir.
    Infect. 13, 90–99.

    Berger, K.H., Isberg, R.R., 1993. Two distinct defects in intracellular growth
    complemented by a single genetic locus in Legionella pneumophila. Mol. Microbiol. 7,
    7–19.

    Bogardt, R.A., Jones, B.N., Dwulet, F.E., Garner, W.H., Lehman, L.D., Gurd, F.R., 1980.
    Evolution of the amino acid substitution in the mammalian myoglobin gene. J. Mol. Evol.
    15, 197–218.

    Boyd, E.F., Li, J., Ochman, H., Selander, R.K., 1997. Comparative genetics of the
    inv-spa invasion gene complex of Salmonella enterica. J. Bacteriol. 179, 1985–1991,
    id: 0021-9193/97/$04.00+0.

    Brassinga, A.K.C., Hiltz, M.F., Sisson, G.R., Morash, M.G., Hill, N., Garduno, E.,
    Edelstein, P.H., Garduno, R.A., Hoffman, P.S., 2003. A 65-kilobase pathogenicity island
    is unique to Philadelphia-1 strains. J. Bacteriol. 185, 4630–4637,
    doi: 10.1128/JB185.15.4630-4637.2003.

    Brenner, D.J., Steigerwalt, A.G., Epple, P., Bibb, W.F., McKinney, R.M., Starnes, R.W.,
    Colville, J.M., Selander, R.K., Edelstein, P.H., Moss, C.W., 1988. Legionella
    pneumophila serogroup Lansing 3 isolated from a patient with fatal pneumonia, and
    descriptions of L. pneumophila subsp. pneumophila subsp. nov., L. pneumophila subsp.
    fraseri subsp. nov., and L. pneumophila subsp. pascullei subsp. nov. J. Clin.
    Microbiol. 26, 1695–1703.

    Bumbaugh, A.C., McGraw, E.A., Page, K.L., Selander, R.K., Whittam, T.S., 2002. Sequence
    polymorphism of dotA and mip alleles mediating invasion and intracellular replication
    of Legionella pneumophila. Curr. Microbiol. 44, 314–322,
    doi: 10.1007/s0024-01-0024-6.

    Christie, P.J., 2001. Type IV secretion: intercellular transfer of macromolecules by
    systems ancestrally related to conjugation machines. Mol. Microbiol. 40, 294–305,
    doi: 10.1046/j.1365-2958.

    Cianciotto, N.P., Eisenstein, B.I., Mody, C.H., Engleberg, N.C., 1990. A mutation in the
    mip gene results in an attenuation of Legionella pneumophila virulence. J. Infect. Dis.
    162, 121–126.

    Coers, J., Kagan, J.C., Matthews, M., Nagai, H., Zuckman, D.M., Roy, C.R., 2000.
    Identification of Icm protein complexes that play distinct roles in the biogenesis of an
    organelle permissive for Legionella pneumophila intracellular growth. Mol. Microbiol.
    38, 719–736, doi: 10.1046/j.1365-2958.2000.02176.x.

    Doyle, R.M., Steele, T.W., McLennan, A.M., Parkinson, I.H., Manning, P.A.,
    Heuzenroeder, M.W., 1998. Sequence analysis of the mip gene of the soilborne pathogen
    Legionella longbeachae. Infect. Immun. 66, 1492–1499, id: 0019-9567/98/$04.00+0.

    Dumenil, G., Isberg, R., 2001. The Legionella pneumophila IcmR protein exhibits
    chaperone activity for IcmQ by preventing its participation in high-molecular-weight
    complexes. Mol. Microbiol. 40, 1113–1127, doi: 10.1046/j.1365-2958.2001.02454.x.

    Fields, B.S., Benson, R.F., Besser, R.E., 2002. Legionella and Legionnaires' disease:
    25 years of investigation. Clin. Microbiol. Rev. 15, 506–526,
    doi: 10.1128/CMR.15.3.506-526.2002.

    Fraser, D.W., Tsai, T.R., Orenstein, W., Parkin, W.E., Beecham, H.J., Sharrar, R.G.,
    Harris, J., Mallison, G.F., Martin, S.M., McDade, J.E., Shepard, C.C., Brachman, P.S.,
    1977. Legionnaires' disease: description of an epidemic of pneumonia. N. Engl. J. Med.
    297, 1189–1197.

    Furuya, N., Komano, T., 1996. Nucleotide sequence and characterization of the trbABC
    region of the IncI1 plasmid R64: existence of the pnd gene for plasmid maintenance
    within the transfer region. J. Bacteriol. 178, 1491–1497, id: 0021-9193/96/$04.00+0.

    Ginalski, K., Venclovas, C., Lesyng, B., Fidelis, K., 2000. Structure-based sequence
    alignment for the beta-trefoil subdomain of the clostridial neurotoxin family provides
    residue level information about the putative ganglioside binding site. FEBS Lett. 482,
    119–124, doi: 10.1016/S0014-5793(00)01954-2.

    Girardeau, J.P., Bertin, Y., Callebaut, I., 2000. Conserved structural features in
    class I major fimbrial subunits (Pilin) in gram-negative bacteria. Molecular basis of
    classification in seven subfamilies and identification of intrasubfamily sequence
    signature motifs which might be implicated in quaternary structure. J. Mol. Evol. 50,
    424–442, ISSN: 0022-2844.

    Gotte, M., von Mollard, G.F., 1998. A new beat for the SNARE drum. Trends Cell. Biol. 8,
    215–218, doi: 10.1016/S0962-8924(98)01272-0.

    Helbig, J.H., Bernander, S., Castellani Pastoris, M., Etienne, J., Gaia, V., Lauwers, S.,
    Lindsay, D., Luck, P.C., Marques, T., Mentula, S., Peeters, M.F., Pelaz, C.,
    Struelens, M., Uldum, S.A., Wewalka, G., Harrison, T.G., 2002. Pan-European study on
    culture-proven Legionnaires' disease: distribution of Legionella pneumophila serogroups
    and monoclonal subgroups. Eur. J. Clin. Microbiol. Infect Dis. 21, 710–716,
    doi: 10.1007/s10096-002-0820-3.

    Huson, D., 1998. SplitsTree: analyzing and visualizing evolutionary data.
    Bioinformatics 14, 68–73.

    Kawashima, S., Kanehisa, M., 2000. AAIndex: amino acid index database. Nucleic Acids
    Res. 28, 374.

    Kirby, J.E., Vogel, J.P., Andrews, H.L., Isberg, R.R., 1998. Evidence for pore-forming
    ability by Legionella pneumophila. Mol. Microbiol. 27, 323–336,
    doi: 10.1046/j.1365-2958.1998.00680.x.

    Ko, K.S., Lee, H.K., Park, M.Y., Lee, K.-H., Yun, Y.-J., Woo, S.-Y., Miyamoto, H.,
    Kook, Y.-H., 2002a. Application of RNA polymerase beta-subunit gene (rpoB) sequences for
    the molecular differentiation of Legionella species. J. Clin. Microbiol. 40, 2653–2658,
    doi: 10.1128/JCM.40.7.2653-2658.2002.

    Ko, K.S., Lee, H.K., Park, M.-Y., Park, M.-S., Lee, K.-H., Woo, S.-Y., Yun, Y.-J.,
    Kook, Y.-H., 2002b. Population genetic structure of Legionella pneumophila inferred
    from RNA polymerase gene (rpoB) and DotA gene (dotA) sequences. J. Bacteriol. 184,
    2123–2130, doi: 10.1128/JB.184.8.2123-2130.2002.

    Komano, T., Yoshida, S., Narahara, K., Furuya, N., 2000. The transfer region of IncI1
    plasmid R64: similarities between R64 tra and Legionella icm/dot genes. Mol. Microbiol.
    35, 1348–1359, doi: 10.1046/j.1365-2958.2000.01769.x.

    Kumar, S., Tamura, K., Jakobsen, I.B., Nei, M., 2001. MEGA2: molecular evolutionary
    genetics analysis software. Bioinformatics 17, 1244–1245.

    Letunic, I., Goodstadt, L., Dickens, N.J., Doerks, T., Schultz, J., Mott, R.,
    Ciccarelli, F., Copley, R.R., Ponting, C.P., Bork, P., 2002. Recent improvements to the
    SMART domain-based sequence annotation resource. Nucleic Acids Res. 30, 242–244.

    Li, W.H., Wu, C.I., Luo, C.C., 1985. A new method for estimating synonymous and
    nonsynonymous rates of nucleotide substitution considering the relative likelihood of
    nucleotide and codon changes. Mol. Biol. Evol. 2, 150–174,
    id: 0737-4038/85/0202-0201$02.00.

    Luneberg, E., Mayer, B., Daryab, N., Koolstra, O., Zahringer, U., Rohde, M., Swanson, J.,
    Frosch, M., 2001. Chromosomal insertion and excision of a 30 kb unstable genetic element
    is responsible for phase variation of lipopolysaccharide and other virulence
    determinants in Legionella pneumophila. Mol. Microbiol. 39, 1259–1271,
    doi: 10.1046/j.1365-2958.2001.02314.x.

    Miyata, T., Miyazawa, S., Yasunaga, T., 1979. Two types of amino acid substitutions in
    protein evolution. J. Mol. Evol. 12, 219–236.

    Nagai, H., Kagan, J.C., Zhu, X., Kahn, R.A., Roy, C.R., 2002. A bacterial guanine
    nucleotide exchange factor activates ARF on Legionella phagosomes. Science 295,
    679–682.

    Pawlowski, K., Rychlewski, L., Zhang, B., Godzik, A., 2001. Fold predictions for
    bacterial genomes. J. Struct. Biol. 134, 219–231, doi: 10.1006/jsbi.2001.4394.

    Pawlowski, K., Zhang, B., Rychlewski, L., Godzik, A., 1999. The Helicobacter pylori
    genome: from sequence analysis to structural and functional predictions. Proteins:
    Struct., Funct., Genet. 36, 20–30,
    doi: 10.1002/(SICI)1097-0134(19990701)36.13.0.CO;2-X.

    Perez-Luz, S., Fernandez, J., Rodriguez-Valera, F., Pascual, L., Moreno, C., Amo, A.,
    Apraiz, D., Catalan, V., 2002. Sequence diversity of the internal transcribed spacer
    (ITS) region of the rRNA operons among different serogroups of Legionella pneumophila
    isolates. Syst. Appl. Microbiol. 25, 212–219, doi: 10.1078/072320202320386370.

    Pollastri, G., Przybylski, D., Rost, B., Baldi, P., 2002. Improving the prediction of
    protein secondary structure in three and eight classes using recurrent neural networks
    and profiles. Proteins 47, 228–235, online ISSN: 1097-0134; print ISSN: 0887-3585.

    Raghava, G.P.S., 2000. Protein secondary structure prediction using nearest neighbor and
    neural network approach. CASP 4, 75–76.

    Ratcliff, R., Donnellan, S.C., Lanser, J.A., Manning, P.A., Heuzenroeder, M.W., 1997.
    Interspecies sequence differences in the Mip protein from the genus Legionella:
    implications for function and evolutionary relatedness. Mol. Microbiol. 25, 1149–1158.

    Ratcliff, R.M., Lanser, J.A., Manning, P.A., Heuzenroeder, M.W., 1998. Sequence-based
    classification scheme for the genus Legionella targeting the mip gene. J. Clin.
    Microbiol. 36, 1560–1567, id: 0095-1137/98/$04.00+0.

    Rosello-Mora, R., Amann, R., 2001. The species concept for prokaryotes. FEMS Microbiol.
    Rev. 25, 39–67, doi: 10.1016/S0168-6445(00)00040-1.

    Rost, B., 1996. PHD: predicting one-dimensional protein structure by profile based
    neural networks. Methods Enzymol. 266, 525–539.

    Rychlewski, L., Zhang, B., Godzik, A., 1998. Fold and function predictions for
    Mycoplasma genitalium proteins. Fold Des. 3, 229–238, ISSN: 1359-0278.

    Sadosky, A., Wiater, L.A., Shuman, H.A., 1993. Identification of Legionella pneumophila
    genes required for growth within and killing of human macrophages. Infect. Immun. 61,
    5361–5373.

    Saitou, N., Nei, M., 1987. The Neighbor-Joining Method: a new method for reconstructing
    phylogenetic trees. Mol. Biol. Evol. 4, 406–425, id: 0737-4038/87/0.

    Samrakandi, M.M., Cirillo, S.L.G., Ridenour, D.A., Bermudez, L.E., Cirillo, J.D., 2002.
    Genetic and phenotypic differences between Legionella pneumophila strains. J. Clin.
    Microbiol. 40, 1352–1362, doi: 10.1128/JCM.40.4.1352-1362.2002.

    Sauder, J.M., Arthur, J.W., Dunbrack Jr., R.L., 2000. Large-scale comparison of protein
    sequence alignment algorithms with structure alignments. Proteins: Struct., Funct.,
    Genet. 40, 6–22, online ISSN: 1097-0134, print ISSN: 0887-3585.

    Segal, G., Shuman, H.A., 1997. Characterization of a new region required for macrophage
    killing by Legionella pneumophila. Infect. Immun. 65, 5057–5066,
    id: 0019-9567/$04.00+0.

    Segal, G., Shuman, H.A., 1998a. Intracellular multiplication and human macrophage
    killing by Legionella pneumophila are inhibited by conjugal components of IncQ plasmid
    RSF1010. Mol. Microbiol. 30, 197–208.

    Segal, G., Shuman, H.A., 1998b. How is the intracellular fate of the Legionella
    pneumophila phagosome determined? Trends Microbiol. 6, 253–255,
    doi: 10.1016/S0966-842X(98)01308-0.

    Segal, G., Shuman, H.A., 1999. Possible origin of the Legionella pneumophila virulence
    genes and their relation to Coxiella burnetii. Mol. Microbiol. 33, 669–670,
    doi: 10.1046/j.1365-2958.1999.01511.x.

    Segal, G., Purcell, M., Shuman, H.A., 1998. Host cell killing and bacterial conjugation
    require overlapping sets of genes within a 22-kb region of the Legionella pneumophila
    genome. Proc. Natl. Acad. Sci. USA 95, 1669–1674.

    Segal, G., Russo, J.J., Shuman, H.A., 1999. Relationships between a new type IV
    secretion system and the icm/dot virulence system of Legionella pneumophila. Mol.
    Microbiol. 34, 799–809, doi: 10.1046/j.1365-2958.1999.01642.x.

    Seshadri, R., Paulsen, I.T., Eisen, J.A., Read, T.D., Nelson, K.E., Nelson, W.C.,
    Ward, N.L., Tettelin, H., Davidsen, T.M., Beanan, M.J., Deboy, R.T., Daugherty, S.C.,
    Brinkac, L.M., Madupu, R., Dodson, R.J., Khouri, H.M., Lee, K.H., Carty, H.A.,
    Scanlan, D., Heinzen, R.A., Thompson, H.A., Samuel, J.E., Fraser, C.M.,
    Heidelberg, J.F., 2003. Complete genome sequence of the Q-fever pathogen Coxiella
    burnetii. Proc. Natl. Acad. Sci. USA 100, 5455–5460, doi 10.1073.

    Sexton, J.A., Vogel, J.P., 2002. Type IVB secretion by intracellular pathogens.
    Traffic 3, 178–185, doi: 10.1034/j.1600-0854.2002.030303.x.

    Swanson, M.S., Hammer, B.K., 2000. Legionella pneumophila pathogenesis: a fateful
    journey from amoebae to macrophages. Annu. Rev. Microbiol. 54, 567–613.

    Terry Alli, O.A., Zink, S., von Lackum, N.K., Abu-Kwaik, Y., 2003. Comparative
    assessment of virulence traits in Legionella spp. Microbiology 149, 631–641,
    doi: 10.1099/mic.0.25980-0.

    Thompson, J.D., Higgins, D.G., Gibson, T.J., 1994. CLUSTAL W: improving the sensitivity
    of progressive multiple sequence alignment through sequence weighting, position-specific
    gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673–4680.

    Vogel, J.P., Andrews, H.L., Wong, S.K., Isberg, R.R., 1998. Conjugative transfer by the
    virulence system of Legionella pneumophila. Science 279, 873–876.

    Watarai, M., Andrews, H.L., Isberg, R.R., 2001. Formation of a fibrous structure on the
    surface of Legionella pneumophila associated with exposure of DotH and DotO proteins
    after intracellular growth. Mol. Microbiol. 39, 313–329,
    doi: 10.1046/j.1365-2958.2001.02193.x.

    Weber, T., Zemelman, B.V., McNew, J.A., Westermann, B., Gmachl, M., Parlati, F.,
    Sollner, T.H., Rothman, J.E., 1998. SNAREpins: minimal machinery for membrane fusion.
    Cell 92, 759–772.

    Weimbs, T., Low, S.H., Chapin, S.J., Mostov, K.E., Bucher, P., Hofmann, K., 1997. A
    conserved domain is present in different families of vesicular fusion proteins: a new
    superfamily. Proc. Natl. Acad. Sci. USA 94, 3046–3051.

    Weimbs, T., Mostov, K., Low, S.H., Hofmann, K., 1998. A model for structural similarity
    between different SNARE complexes based on sequence relationships. Trends Cell Biol. 8,
    260–262, doi: 10.1016/S0962-8924(98)01285-9.

    Whittam, T.S., Bumbaugh, A.C., 2002. Inferences from whole-genome sequences of bacterial
    pathogens. Curr. Opin. Genet. Dev. 12, 719–725, doi: 10.1016/S0959-437X(02)0036-1.

    Yu, V.L., Plouffe, J.F., Castellani Pastoris, M., Stout, J.E., Schousboe, M., Widmer, A.,
    Summersgill, J., File, T., Heath, C.M., Paterson, D.L., Chereshsky, A., 2002.
    Distribution of Legionella species and serogroups isolated by culture in patients with
    sporadic community-acquired legionellosis: an international collaborative survey.
    J. Infect. Dis. 186, 127–128, id: 0022-1899/2002/18601-0020$15.00.

    Communicated by R. Novick
