Primers4clades, A Web Server to Design Lineage-Speciﬁc ...cimi.ccg.unam.mx/~vinuesa/Papers_PDFs/Sachman_primers4c...Primers4clades was mainly written in Perl and uses several BioPerl

de Bruijn Vol. I c51.tex V1 - 02/08/2011 5:16 P.M. Page 441

Chapter 51

Primers4clades, A Web Server toDesign Lineage-Specific PCR Primersfor Gene-Targeted Metagenomics

Bernardo Sachman-Ruiz, Bruno Contreras-Moreira, EnriqueZozaya, Cristina Martınez-Garza, and Pablo Vinuesa

51.1 INTRODUCTION

This chapter presents a practical overview and casestudy of the usage and capabilities of the primers4cladesversion 1.0 web server [Contreras-Moreira et al., 2009],a software tool useful to design degenerate PCR primersfor gene-targeted analysis of metagenomic DNA (seealso Chapter 28, Vol. I) and multilocus sequence analysis.Although direct shotgun sequencing of community DNAis now feasible [Venter et al., 2004], polymerase chainreaction (PCR) remains the most widely used technologyto gain molecular markers for studies in molecularecology and systematics. With the ongoing accumulationof fully sequenced genomes in public sequence databases,the focus of microbial ecology studies is increasinglyshifting to the analysis of protein coding genes andsequences (CDSs) to understand ecological, metabolic,and evolutionary processes in nature [Falkowski et al.,2008; Frias-Lopez et al., 2008; Hunt et al., 2008].This trend is reflected in the huge interest of studyingthe diversity and expression patterns of “functionalgenes” in the environment, such as those encoding forantibiotic resistance and virulence [Castiglioni et al.,2008; Manning et al., 2008], photosynthesis [Yutin andBeja, 2005], or enzymes involved in the nitrogen cycle[Smith et al., 2007; Zehr et al., 2003], to mention afew. Furthermore, multilocus sequence analysis (MLSA)and typing (MLST) of protein-coding genes are thenew standards in molecular systematics [Edwards, 2009;

Handbook of Molecular Microbial Ecology, Volume I: Metagenomics and Complementary Approaches, First Edition. Edited by Frans J. de Bruijn.© 2011 Wiley-Blackwell. Published 2011 by John Wiley & Sons, Inc.

Gevers et al., 2005; Vinuesa, 2010] and molecular epi-demiology [Maiden, 2006; Smith et al., 2009]. However,it still remains a major challenge to design optimal PCRprimers to specifically amplify CDSs from target lineagesdirectly from environmental DNA samples or fromnovel organisms. Primers4clades is a publicly availableweb server that uses phylogenetic trees to aid in thedesign of lineage-specific degenerate PCR primers for theabove-mentioned purposes, which takes into account bothprotein and the corresponding codon multiple sequencealignments [Contreras-Moreira et al., 2009].

The major advantages of using a lineage-specificgene-targeted approach to metagenomics lie in the highsampling coverage and long sequence reads that can beachieved by sequencing a relatively low number of clonesusing classic Sanger sequencing technology. Furthermore,amplification biases derived from sequence compositionheterogeneity [Polz and Cavanaugh, 1998] are reduceddue to the relatively homogeneous composition of thesequences targeted with lineage-specific PCR primers.Their lower degeneracy also contributes to a reduction inamplification biases.

Here we present an empirical validation studythat demonstrates the highly specific amplification ofMycobacterium spp. rpoB sequences from complexmetagenomic DNA samples gained from two tropicalMexican forest soils: (i) the humid evergreen forestreserve of Los Tuxtlas, Veracruz, and (ii) a conservedpatch of seasonally dry deciduous forest in the Biosphere

441


442 Chapter 51 Primers4clades, A Web Server to Design Lineage-Specific PCR Primers

Reserve of Sierra de Huautla, Morelos. Rarefaction curvesfor a random subsample of 50 clones of each library arecompared with rarefaction curves for a 16S rRNA-basedlibrary of 98 clones obtained from Amazonian pastureand forest soils using universal primers [Bornemanand Triplett, 1997], which illustrates the usefulness ofthe lineage-specific metagenomics analysis approach toimprove sampling coverage and statistical power.

Q1

51.2 METHODS

51.2.1 Implementation, Input DataProcessing, and Overview of theTwo Run Modes Using the Server’sActinobacteriales DemonstrationDatasetPrimers4clades (primers for clades) is an easy-to-useweb server developed for researchers interested in thedesign of PCR primers for cross-species amplificationof novel sequences from metagenomic DNA or fromuncharacterized organisms belonging to user-specifiedphylogenetic clades or taxa. Degenerate PCR primersare derived from conserved motifs in protein multiplesequence alignments using the CODEHOP primer designstrategy described below, but considering also theunderlying codon alignments to obtain what we call“corrected CODEHOP” primer formulations. This is oneof the diverse components of the extended CODEHOPalgorithm that the server implements [Contreras-Moreiraet al., 2009].

Primers4clades was mainly written in Perl and usesseveral BioPerl modules [Stajich et al., 2002] along withthe open source software cited below to perform the dif-ferent computations involved in the extended CODEHOPalgorithm. The input for the server is a set of homolo-gous protein-coding genes in FASTA format and in frame+1, which may be aligned or not, with or without introns(Fig. 51.1).

We will demonstrate the usage of the software bychoosing the server’s Actinobacteriales dataset, one of thethree provided as tutorial material, which can be selectedby clicking on the “demo” button. Although no registra-tion is required, we recommend that the user provide hisemail address, which will allow access for one week tothe analyses stored on the server and also avoids potentialloss of results due to eventual browser timeout events dueto often long computation times (see FAQ section in theonline tutorial for the details). Note that the taxon names(Latin binomials) should be placed within brackets, whichis how the header from FASTA-formatted sequences aredownloaded from GenBank, as shown in Figure 51.1.

This is required for the server to parse the taxonnames in order to select the corresponding codon usage

tables, as explained below. The server excises introns iftheir coordinates are indicated in the FASTA header (seethe server’s documentation and the fungal alpha-tubulinsdemo dataset), collapses redundant sequences to haplo-types, translates the CDSs with user-selected translationtables, and aligns them using Muscle [Edgar, 2004]. Thealignment step is skipped, if the server detects that theuploaded DNA sequences are aligned. The protein align-ment is projected on the underlying DNA sequences tocompute the corresponding codon alignment, along withmaximum likelihood (ML) distance matrices from the pro-tein (WAG+G) or the codon (HKY85+G) alignments,using Tree-Puzzle [Schmidt et al., 2002]. If the server isrun in the “advanced” and interactive “cluster sequences”mode, the above-mentioned distance matrices are used tocompute and display a neighbor-joining (NJ) tree with“neighbor” from the PHYLIP package [Felsenstein, 2004],based either on the protein or codon alignments. Theuser can then select a clade from the displayed NJ treeto target the primer design toward its sequence members(Fig. 51.2). Note that the first sequence in the alignmentwill be used to root the tree. In the default, noninteractive“get primers” run mode, all the uploaded sequences willbe considered to compute the primer formulations.

51.2.2 The Standard and ExtendedCODEHOP Primer Design StrategiesPrimers4clades implements an extended and fullyautomated CODEHOP (Consensus Degenerate HybridOligonucleotide Primer) design strategy, which is basedon both DNA and protein multiple sequence alignmentsof protein-coding sequences [Contreras-Moreira et al.,2009]. The standard CODEHOP design strategy was firstpublished by Rose et al. [1998, 2003]. It represents anovel PCR primer design method devised to amplifydistantly related members of viral protein families. Theprimers are derived from conserved amino acid BLOCKSwithin a protein multiple sequence alignment that don’tcontain gaps. Each hybrid primer consists of (a) a short(11–12 nucleotides long) 3′ degenerate core region thatbinds to the codons encoding 3–4 highly conservedamino acids and (b) a longer 5′ consensus clamp regionthat contains the most abundant codon for each aminoacid as predicted on the basis of a user-selected codonusage table (CUT). The basic structure of a CODEHOPcan be seen from the link # 1 provided in the InternetResources section. Amplification initiates by annealingand extension of primers in the pool with the highestsimilarity in the 3′ core region, which confers thespecificity of the amplification, and their stabilization bythe consensus clamp, which only partially matches thetarget template. Because all primers are identical at their5′ consensus clamp region, they will all anneal at high


51.2 Methods 443

Figure 51.1 Data input and analysis customizationspage of the primers4clades server. Three demodatasets are provided for rapid testing of the server’sinterface and capabilities. The tutorial presented in thischapter will focus on the actinobacterial dataset andthe advanced, interactive “cluster sequences” runmode. The selected “cluster distance metric” optionindicates to the server that the reference NJ tree shouldbe computed from the protein sequence alignmentgenerated from the uploaded coding sequences usingNCBI’s translation table number 11. The user can alsochange the preferred size range for the amplicons, Tm

of the consensus clamp, and whether the phylogeneticinformation content of the resulting amplicon setsshould be evaluated at the DNA or protein levelsunder a variety of substitution models.

(A)

(B)

Figure 51.2 Reference NJ tree displayed on thefirst output page generated by primers4clades run inthe interactive “cluster sequences” mode using theactinobacterial demonstration dataset. (A) Labeled NJtree computed from the protein alignment of the inputdataset. (B) Cluster selection panel based on the labelsdisplayed on the NJ tree. Hitting the re-cluster buttonparses the alignment to use only the selectedsequences. Hitting the get primers button starts theprimer design and evaluation steps.



stringency during subsequent rounds of amplification.This increases the efficiency of the PCR amplification,making it in principle more efficient than PCR withstandard consensus or degenerate oligonucleotides andallowing the amplification of divergent sequences ofprotein families. If no oligonucleotide formulations arereturned by the server, or these are very long, it may benecessary to decrease the Tm of the consensus clamp (setto 55◦C), as explained below. Primers4clades automati-cally uses the CUTs for all species identified in the inputdataset for which a CUT is available at the Codon UsageDatabase [Nakamura et al., 2000]. It also computes analignment-specific CUT on the fly. Taking into accountseveral nonredundant CUTs notably increases the amountand quality of the primers found for a given dataset,as shown by our genome-scale benchmark analysisresults reported in the original primers4clades publication[Contreras-Moreira et al., 2009]. This represents a notableperformance improvement of our extended CODEHOPalgorithm with respect to that implemented in the originalCODEHOP and recent iCODEHOP servers (links #2 and#3 of Internet Resources).

51.2.3 Customization of thePrimers4clades Server BehaviorAs mentioned above, the server can be run either inthe basic “get primers” or in the advanced “clustersequences” modes. However, further customization ofthe server’s behavior may be required for optimal primerdesign in either mode. For demonstration purposes, wewill use the advanced and interactive “cluster sequences”run mode. Clicking on the “customize settings” buttonopens a window with self-explanatory options (Fig. 51.1).First of all, the input DNA protein-coding sequence canbe translated by any of the current NCBI translationtables, with the “bacterial plastid” translation table 11being the correct choice for our actinobacterial demon-stration data. Next, the user can select whether to workat the DNA or protein levels to estimate a first referenceNJ tree using the “cluster metric” option. As shownin Figure 51.1, we selected the default protein choice.Next, the user can choose a desired size range for theamplicons, as well as the Tm for the consensus clamp.Lastly, we choose the HKY85 + G DNA substitutionmodel for the evaluation of the phylogenetic informationcontent of the different amplicon sets, as explainedbelow. Please read the servers FAQ section of the onlinedocumentation if you need some background informationon selecting DNA substitution models.

A unique feature of the primers4clades server is itsability to compute a robust phylogenetic information con-tent parameter for each theoretical amplicon set, whichis based on a recently developed Shimodaira–Hasegawa

(SH)-like test for the significance of branches in amaximum likelihood tree, originally implemented inPhyML v2.4.5 [Anisimova and Gascuel, 2006]. Inbrief, the test assesses whether the branch being studiedprovides a significant likelihood gain, in comparisonwith the null hypothesis that involves collapsing thatbranch, but leaving the rest of the tree topology identical.The resulting SH-like branch support values thereforeindicate the probability that the corresponding split issignificant. The phylogenetic information content ofeach amplicon set is calculated as the mean and medianSH-like branch support values for the correspondingmaximum likelihood tree inferred under the user-specifiedsubstitution model or matrix. As a rule of thumb, thephylogenetic information content of the targeted sequenceregion increases as branch support values approach to 1.This strategy of calculating the phylogenetic informationcontent of different sequence alignments was recentlydeveloped by Vinuesa et al. [2008].

Once all customization choices have been properlyset, we are ready to hit the submit button to let the serverstart its work. After a few seconds, we get a new page dis-playing the NJ phylogeny (see Fig. 51.2A) based on thetranslation products of the actinobacterial rpoB sequenceswith some summary information of the alignment and dis-tance matrix written on top of it (not shown). Noticethat the two Corynebacterium sequences form an out-group clade to the target Mycobacterium spp. ingroupclade on which we will now instruct the server to focus thesearch for PCR primers. To do so, we provide the clusterboundary identification numbers of the target clade in thecorresponding box, as shown in Figure 51.2B. This canbe simply done by copying and pasting the numbers fromthe displayed tree that demarcate the target clade, sepa-rating them with a coma without spaces. In our example,this is 002_0,013_0. After hitting the re-cluster button anew page will be presented, displaying the same tree butonly with the target sequences selected. We could still atthis point exclude further sequences or define a differentsubcluster for primer design, but in our example we arenow ready to hit the “get-primers button.”

51.2.4 Results Returned to theUser by the Primers4clades ServerA very useful feature of primers4clades is that it returnsa nonredundant set of primer pair formulations, rankedaccording to their thermodynamic properties (the theoret-ically best primers are displayed first). The server checksthat the resulting amplicon sets for the primer pairs do notoverlap more than 80%, ensuring a high coverage of thetarget locus, but filtering excessive redundancy, as shownon the amplicon distribution maps generated by the serverafter some seconds or minutes (Fig. 51.3), depending on


51.2 Methods 445

0k 0.1k39 (quality = 60%)

33 (quality = 70%) 4 (quality = 100%)

32 (quality = 80%)

19 (quality = 90%) 12 (quality = 100%)

6 (quality = 100%)

20 (quality = 90%)21 (quality = 90%)

36 (quality = 70%)

22 (quality = 90%)

40 (quality = 50%)

41 (quality = 50%)

37 (quality = 70%)

3 (quality = 100%)

1 (quality = 100%)

7 (quality = 100%)

11 (quality = 100%)

10 (quality = 100%)

13 (quality = 100%)

15 (quality = 100%)

27 (quality = 90%)

28 (quality = 80%)

34 (quality = 70%)

24 (quality = 90%)

2 (quality = 100%)

23 (quality = 90%)

9 (quality = 100%)

16 (quality = 100%)

18 (quality = 100%)

30 (quality = 80%)

31 (quality = 80%)

8 (quality = 100%)

26 (quality = 90%) 17 (quality = 100%)

35 (quality = 70%)

14 (quality = 100%)

5 (quality = 100%) 38 (quality = 70%)

25 (quality = 90%)

29 (quality = 80%)

0.2k 0.3k 0.4k 0.5k 0.6k 0.7k 0.8k 0.9k 1k 1.1k

Figure 51.3 Amplicon distribution map for the actinobacterial demonstration dataset generated using the parameters shown in Figure 51.1 forthe selected strains as shown in Fig. 51.2. The positions of the different amplicons are mapped on the first sequence of the input alignment atthe protein level. A quality value is assigned to each amplicon, with 100% being best. This value is computed based on thermodynamicproperties of the primer pair (see text for the details). The amplicon distribution map and all primer pair formulations along with theirthermodynamic properties can be downloaded from the server.

the size of the alignment and number of primer pairsfound. During the process of primer pair calculation, theserver displays information about the redundancy amongthe codon usage tables (CUTs) for each taxon and thenumber of primer pairs found for each nonredundant CUT(see the section on “Technical Information” of the server’sonline documentation page for a detailed explanation ofhow it computes the nonredundant CUTs).

The amplicon distribution map can be downloadedfrom the server as well as a TAB-delimited text file con-taining all the primer pair formulations, amplicon size, anda comprehensive set of their thermodynamic propertiessuch as degeneracy, Tm , hairpin-formation potential, andcross-hybridization potential computed with subroutinesborrowed from the Amplicon software [Jarman, 2004].The quality % value shown on top of each bar represent-ing an amplicon indicates how “good” the correspondingprimer pairs are, based on the above-mentioned thermo-dynamic properties. A 100% score indicates that none ofthe property values is worse than a predetermined cutoffvalue for each parameter that we have empirically chosenbased on the evaluation of dozens of well-working primerpairs. Note also that the bars representing amplicons arecolor-coded according to their corresponding primer pairquality scores.

In addition to the standard CODEHOP, theprimers4clades server computes three additional primerformulations: the corrected CODEHOP, relaxed correctedCODEHOP, and fully degenerate oligonucleotides.

The former two are based on the original CODEHOPformulation, but adjusting the degeneracy of the latterby taking into account the sequences of the underlyingcodon alignment, as shown in Figure 51.4. The relaxedcorrected CODEHOP extends the 3′ core region of thecorrected CODEHOP toward the 5′ end until it reachesa degeneracy level of 24. If the corrected CODEHOPalready has a degeneracy equal or greater than 24, thenthe latter two formulations will be identical. The fullydegenerated oligonucleotide formulation is computedto reflect the full nucleotide sequence variation of thetargeted binding site. Table 1 of the server’s onlinedocumentation page summarizes the recommended usesof the four oligonucleotide formulations computed bythe server. We recommend the use of the correctedCODEHOP formulation for most purposes.

After displaying the amplicon distribution map,the server enters the second computing intensive phase.Maximum likelihood phylogenies (in our case study usingthe HKY85+G model) will be estimated from each ofthe theoretical amplicon sets displayed on the amplicondistribution map. It is from these phylogenies that thephylogenetic information content of the amplicons iscalculated based on Shimodaira–Hasegawa-like branchsupport values [Anisimova and Gascuel, 2006; Guindonet al., 2009], as we have described recently [Vinuesa,2010; Vinuesa et al., 2008]. The best scoring amplicons(based on the quality parameter explained above) areevaluated first.



Figure 51.4 Example primer formulation output panel (only the reverse formulation of the first primer pair returned by the server is shown)for the actinobacterial example dataset. The server returns the standard CODEHOP (bold) and three codon-based primer formulations, alignedwith the underlying codons. An ‘!’ sign denotes positions corrected by the system based on the codon alignment. Also shown are the summaryoutputs for key primer pair and corresponding amplicon characteristics, as well as the associated phylogenetic information content of thetheoretical amplicon sets (see the text for the details).

After the server is done with the ML search for thefirst amplicon set, the corresponding primer pairs areshown on screen, aligned with the underlying codons(Fig. 51.4). A header line indicates the codon usagetable used to calculate the corresponding primer pair,in this case a data-specific one computed from theinput alignment (not shown on Fig. 51.4). The first lineafter the header shows the sequence of the CODEHOPformulation, as calculated by the original CODEHOPprogram, showing also the coordinates of the forward(fw) and reverse (rev) oligos on the original (full) proteinalignment. The lowercase letters on the 3′ end of the oligocorrespond to the degenerated portion of the CODEHOP,while the uppercase residues toward the 5′ end correspondto the consensus clamp. The lines below the CODEHOPformulation are the corresponding target sites at the DNAlevel. This codon alignment is used to calculate thecorrected CODEHOP formulation. The ′!′ and ′?′ char-acters just above the corrected CODEHOP formulationline indicate the discrepancies between the CODEHOPformulation and the codon alignment that are consideredto formulate the corrected CODEHOP sequence, the ′!′indicating nondegenerate sites, and the ′?′ highlighting

degenerate sites (Fig. 51.4). The ′.′ symbol indicatesmatches between the original CODEHOP formulation andthe codon alignment. The relaxed corrected CODEHOPformulation and the fully degenerated oligonucleotideformulations are also displayed (Fig. 51.4).

A primer pair quality summary report is displayedon screen after each primer pair, based on computationsperformed on the corrected CODEHOP formulations. Thefirst line provides a summary score of the “primer qual-ity” in thermodynamic terms for each primer pair (range100% to 0%, best to worst). The next three lines indicatethe expected amplicon length in bps and the computed Tm

ranges for the fw- and rev-corrected codehop formulations(Fig. 51.4). If the quality score is <100%, this means thatat least one oligo has a thermodynamic parameter valueworse than the cutoff values used by the server (thesevalues are presented in Table 2 of the server’s online doc-umentation page).

Lastly, we get a summary of the phylogenetic eval-uation of the corresponding amplicon (Fig. 51.4), whichindicates the substitution matrix/model used for the evalu-ation, and the value of the corresponding shape parameter(alpha). The next line is very useful for users interested in

contrera

Nota adhesiva

no debería ser 'A' en vez de 'An'?

contrera

Nota adhesiva

Igual queda mejor: "Example primer formulation (only the reverse primer of the first pair returned by the server is shown)"


51.3 Results 447

using the new molecular markers for phylogenetics, sinceit provides the mean and median Shimodaira–Hasegawa-like bipartition support values computed by the PhyMLmaximum likelihood algorithm. The closer to 1, the higherthe resolution level of the tree inferred from the theoreti-cal amplicon set and therefore the higher its phylogeneticinformation content. A FASTA-formatted file containingthe aligned amplicon sets can be downloaded (in this caseat the protein level). The ML trees for each amplicon canbe either directly visualized or downloaded.

51.2.5 Purification of Soil DNA,rpoB Amplification, Clone LibraryConstruction, and AnalysisTen soil samples of approximately 100 g were randomlysampled within a 10-m × 10-m area in both the tropicalhumid forest Reserve of Los Tuxtlas, Estacion Biologicade la UNAM, Veracruz, Mexico (18◦34′23.24′′N,95◦9′40.10′′O), and a conserved patch of seasonallydry deciduous tropical forest near “El Limon” in theBiosphere Reserve of Sierra de Huautla, Morelos,Mexico (18◦32′33.20′′N, 98◦56′11.19′′O). The 10 soilsamples were from each site were pooled, mixed, andsieved (2 mm) to obtain two composite samples. A 2-galiquot from each of the composite samples was usedfor DNA extraction using the MoBio PowerMax SoilDNA Extraction Kit. DNA integrity was checked on anagarose gel, and the lack of strong PCR inhibitors wasconfirmed by amplification of 16S rRNA genes using theuniversal fD1 and rD1 primers [Weisburg et al., 1991].The rpoB amplicons were generated using a hot-startprotocol and Taq-platinum (Invitrogen), with annealingat 60◦C, and cloned using the PCR Topo cloning system2.1 (Invitrogen). Plasmids were purified using a 96-wellplate format and commercially sequenced at LANGEBIO,Mexico, using the Sanger method at 2 × coverage withthe M13f and M13rev primers that bind at vector posi-tions flanking its cloning site. The two sequence reads foreach plasmid were automatically assembled and validatedby using a custom Perl script that runs Phred/Phrap

[Ewing and Green, 1998; Ewing et al., 1998] followedby a BlastX [Altschul et al., 1997] search of each contigagainst a locally formatted rpoB sequence database ofActinobacteria assembled from fully sequenced genomesof these bacteria. Only sequences aligning globally withentries from the database were further processed bysubjecting them to a sequence-to-profile alignment usingClustalw 2.1 [Larkin et al., 2007]. The reference profileconsists of a codon alignment of 25 nonredundant rpoBsequences of reference Mycobacterium strains, trimmedto the segment corresponding to the expected amplicon.The resulting multiple sequence alignment was furthercorrected manually.

Phylogenetic analyses were performed with PhyMLv. 3.0 [Guindon et al., 2009], as previously described[Vinuesa, 2010; Vinuesa et al., 2008]. All communityecology analyses of the environmental rpoB -ampliconlibraries were performed with Mothur version 1.7[Schloss et al., 2009].

51.3 RESULTS

51.3.1 Design and Evaluation ofthe Specificity of Mycobacteriumspp. PCR Primers Targeting the RNAPolymerase Beta Subunit (rpoB)Gene Using Soil Metagenomic DNAUsing the primers4clades server and selected referencesequences (a few more than those provided at the server’sactinobacterial demonstration sequence set), we designedthe primer pair shown in Table 51.1. These primers wereused to generate PCR amplicons from metagenomic DNApurified from the two contrasting tropical forest soils inMexico (humid evergreen forest at Los Tuxtlas, Veracruz;deciduous seasonally dry forest from Sierra de Huautla,Morelos), as described in Materials and Methods.

Specific amplification products could be obtainedfor both soils, as shown in Figure 10 of the server’sonline documentation page. We sequenced 50 clonesfrom each of the two amplicon libraries to evaluate

Table 51.1 Basic Thermodynamic Parameters and Sequence Formulation of the Corrected Codehop Primers Used inThis Study to Amplify Mycobacterium-Specific Sequences from Two Contrasting Soil Samples

Name La Sequence Tm Corrected Degeneracyb Full Degeneracyc h_potd c_pote

rpoB2_N326 26 GAGAACCTGTTCTTCaaggagaagcg 63.3 0 0 0.52 0.50rpoB2_C757 25 CGTCCTCGTAGTTGtgrccytccca 66.6 4 4 0.00 0.50

a Oligonucleotide length.bDegeneracy of the corrected codehop oligonucleotide (see technical details in the server’s tutorial page).cFull degeneracy of the target alignment site (see technical details in the server’s tutorial page).d Hairpin formation potential.eCross-hybridization potential.



Figure 51.5 Maximum likelihood phylogeny under the GTR+G+I substitution model for 32 reference sequences [Corynebacterium (7),Mycobacterium (17), Nocardia (1), Rhodococcus (3), Saccharopolyspora (1), Salinispora (2), and Tsukamurella (1)] and 100 soil clones (50from the Reserve of Los Tuxtlas, labeled as TUX; 50 from the Sierra de Huautla Reserve, labeled as REB). For the sake of clarity, thenon-Mycobacterium sequence clades are collapsed. Eighty-nine percent of the soil clones clustered within the Mycobacterium spp. clade,demonstrating the high selectivity of the rpoB primers shown in Table 51.1 to target this group of organisms in complex metagenomic DNAsamples.

the specificity of the oligonucleotides in a phylogeneticcontext. Figure 51.5 shows a maximum likelihood phy-logeny (GTR+G) of the 100 environmental sequencesalong with 15 reference outgroup strains in the generaCorynebacterium (7), Nocardia (1), Rhodococcus (3),Saccharopolyspora (1), Salinispora (2), and Tsukamurella(1), which are shown as collapsed clades in Figure 51.5,and 17 Mycobacterium spp. ingroup reference strains.From this phylogeny we could unambiguously define theperfectly supported Mycobacterium spp. clade, which

clustered 89 of the environmental clones. The remaining11 environmental sequences clustered with the closelyrelated actinobacterial genera used as outgroups.

In summary, 89% of the environmental rpoB ampli-cons could be unambiguously assigned to the genusMycobacterium , indicating that the primers have a highspecificity for the target genus. Only ∼5% of the envi-ronmental Mycobacterium sequences had suspiciouslylong branches, which might reflect sequence artifactssuch as chimeras, heteroduplexes, or Taq errors.


51.4 Discussion 449

51.3.2 Statistical Evaluation of theSampling Effort Achieved by theLineage-Targeted ApproachA recent review on the use of rpoB sequences inMycobacterium molecular systematics suggests that97.7–98.2% similarity cutoff values for rpoB sequencescould be used to delimit species within the genus[Adekambi et al., 2009]. However, in this study we use aslightly more conservative cutoff value (3% divergence)to define operational taxonomic units (OTUs; approxi-mately equivalent to species) for our Mycobacterium spp.diversity study.

Figures 51.6A and 51.6B show the rarefactioncurves [Hughes et al., 2001] generated for the 50Mycobacterium spp. and closely related actinobacterialsequences obtained from both rpoB soil clone librariesat different cutoff values. Notice that for both soilsthe rarefaction curves are close to reach an asymptoteat the 3% divergence OTU definition, indicating thatthe sequencing effort has been reasonably good. Asa reference point, Figure 51.6C shows the rarefactioncurves at the same cutoff values for a classical 16S rRNAgene clone library made from a Brazilian tropical soilusing universal primers [Borneman and Triplett, 1997].These curves indicate that the 98 clones sequenced in thatstudy are far from being representative of the diversityat the species level, which is classically defined by acutoff level of 97–98% of sequence similarity (2–3%divergence). Figures 51.6D and 51.6E show collectorcurves for the two libraries using the nonparametricChao1 richness estimator [Hughes et al., 2001], whichindicate that the Mycobacterium community at seasonallydry deciduous forest of El Limon is more diverse (19predicted species, with a 95% credibility interval 14–46)than that found in the humid tropical forest of Los Tuxtlas(12 predicted species, with a 95% credibility interval12–32). Interestingly, not a single OTU is shared amongthe communities at this cutoff level, indicating a stronglydifferentiated species composition of both communities,which together bring about 31 OTUs. The Chao1 collectorcurves shown in Figures 51.6D and 51.6E indicate thatthe sampling effort has not been sufficient to evaluatethe diversity at the two lower cutoff values since theyhave not stabilized [Hughes et al., 2001], but those forthe 0.03 and 0.05 cutoff values seem to have stabilized.Therefore, inferences at these cutoff levels are robust.

51.4 DISCUSSION

Primers4clades is currently the only publicly availableserver that integrates alternative primer-design strategies

with phylogenetic trees to (a) interactively target thesearch for oligonucleotide formulations to specificsequence clusters and (b) evaluate the phylogeneticinformation content of the new molecular markers.Together, these features are very useful to make aninformed choice among alternative, nonredundant primerpair formulations, taking into account both the thermo-dynamic properties of the primers and the phylogeneticinformation content of the expected amplicon sets. Theseattributes make of primers4clades a novel and powerfultool for the targeted design and informed selection ofPCR primers for lineage-targeted metagenomic studiesof microbial communities, as demonstrated herein forenvironmental mycobacteria.

The genus Mycobacterium currently has 147 validlydescribed species (see link #4 in Internet resources),most of which are poorly characterized ubiquitousenvironmental mycobacteria (EM) [Bland et al., 2005;Jacobs et al., 2009]. These are known as “nontubercu-lous” or “atypical” mycobacteria among clinicians [vanIngen et al., 2009]. The most “famous” mycobacterialspecies are M. tuberculosis and M. leprae, obligateparasites and the causative agents of tuberculosis (TB)and leprosy, respectively. M. avium, M. abscessus, M.chelonae, M. fortuitum, M. intracellulare, M. kansasii,M. mucogenicum, M. phocaicum, M. scrophulaceum, M.ulcerans , and M. vanbalenii are among the most com-mon nontuberculous or atypical mycobacterial speciescausing opportunistic infections in immunocompromisedhumans and have a broad environmental distribution[Falkinham, 2009; Primm et al., 2004]. Understandingthe diversity, abundance, and distribution of EM is ofprime interest from a biomedical standpoint becauseimmune responses following exposure to these bacteriamay influence susceptibility to tuberculosis and leprosyand may severely interfere with the proliferation andprotective effect of the attenuated BCG vaccine usedagainst TB [Black et al., 2001; Weir et al., 2006]. Theirfunctional role in ecosystems remains largely unknowndespite their environmental ubiquity.

The results presented in this study demonstrate thata good sampling effort can be achieved for the genusMycobacterium with the primers described in Table 51.1by sequencing as few as 50 clones per site using thelineage-targeted approach to gene-based metagenomicanalyses of microbial communities. Further sequencingand statistical analysis of these and additional libraries(>200 clones each) generated with the mentioned primersfrom soil [Sachman-Ruiz and Vinuesa, 2011] and water Q2samples [Sachman-Ruiz et al., 2009] demonstrated thatsequencing around 200 clones provides very strongstatistical power, allowing for the first time to make



40

(A) (B)

(E)(D)

(C)

unique

rarefaction curves for theSierra de Huautla dataset (n = 50)at the indicated clustering cutoffs

Chao’s richness estimator plotSierra de Huautla dataset (n = 50)at the indicated clustering cutoffs

Chao’s richness estimator plotReserva de los Tuxtlas (n = 50)

at the indicated clustering cutoffs

rarefaction curves for theReserva de los Tuxtlas dataset (n = 50)

at the indicated clustering cutoffs

rarefaction curves for theAmazonian Soils dataset (n = 98)at the indicated clustering cutoffs

0.010.030.05

30

20

10

0

0 10 20Number of sequences sampled

OT

Us

OT

Us

OT

Us

30

300

250

200

150

100

50

0

0 10 20Number of sequence sampled

30 40 50

40 50

20

15

10

5

0


OT

Us

30 40 50

2580

60

40

20

0


OT

Us

60 80 100

unique0.010.030.05

30

25

20

15

10

5

0

0 10 20Number of sequence sampled

30 40 50

unique0.010.030.05

unique0.010.030.05

unique0.010.030.05

Figure 51.6 Rarefaction (A–C) and collector curves for the nonparametric Chao1 richness estimator (C, D) at four clustering levels(0.0–0.05 P distance) for OTU definition. Panel C shows for comparative purposes the rarefaction curves for 98 SSU rRNA Amazonian soilclones generated with universal primers from the classical study by Borneman and Triplett [1997].

fine-grained inferences about the richness and structureof mycobacteria communities in diverse natural environ-ments. Furthermore, the latter studies have revealed themagnitude of the bias in diversity estimation introducedby classical Mycobacterium isolation procedures andhave also revealed that we have a strongly distortednotion of the diversity of this species-rich bacterial genusof great environmental and clinical relevance.

These results demonstrate the utility and strong sta-tistical power of the lineage-targeted approach to micro-bial community ecology and show that primers4clades[Contreras-Moreira et al., 2009] (link #5 in the Inter-net Resources) is a useful tool to develop primers forsuch gene-targeted metagenomic analyses of microbialdiversity. The development of the tool is now coupledto its recent implementation in a phylogenomics analy-sis pipeline to construct an interactive primer databasefor phylogenetic clades at different taxonomic and phylo-genetic depths. The graphical interface, analysis options,

and parameter evaluation procedures will be improved,extended, and refined in future versions, which will alsoinclude faster ML algorithms.

INTERNET RESOURCES

Link 1: CODEHOP structure (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC168931/figure/gkg524f1/)

Link 2: CODEHOP server (http://blocks.fhcrc.org/codehop.html)

Link 3: iCODEHOP server (https://icodehop.cphi.washington.edu/i-codehop-context/iCODEHOP/view/PrimerAnalysis)

Link 4: Mycobacterium taxonomy (http://www.bacterio.cict.fr/m/mycobacterium.html)

Link 5: Primers4clades web server (http://maya.ccg.unam.mx/primers4clades)

contrera

Nota adhesiva

Es lo mismo Chao1 que Chao's?


References 451

AcknowledgmentsRomualdo Zayas-Laguna and Vıctor del Moral from theInformation Technology Administration UNIT at CCG-UNAM are acknowledged for their technical support. Thiswork was supported by DGAPA/PAPIIT-UNAM [grantnumber IN201806-2], CONACyT-Mexico [grant numberP1-60071], and by Consejo Superior de InvestigacionesCientıficas [grant number 200720I038]. UNAM’s Ph.D.Program in Biomedical Sciences and CONACyT-Mexicoare acknowledged for the financial support offered to BSRfor his Ph.D. studies.

REFERENCES

Adekambi T, Drancourt M, Raoult D. 2009. The rpoB gene as atool for clinical microbiologists. Trends Microbiol . 17(1):37–45.

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, et al.1997. Gapped BLAST and PSI-BLAST: A new generation of proteindatabase search programs. Nucleic Acids Res . 25:3389–3402.

Anisimova M, Gascuel O. 2006. Approximate likelihood-ratio testfor branches: A fast, accurate, and powerful alternative. Syst. Biol .55(4):539–352.

Black GF, Dockrell HM, Crampin AC, Floyd S, Weir RE,et al. 2001. Patterns and implications of naturally acquired immuneresponses to environmental and tuberculous mycobacterial antigensin northern Malawi. J. Infect. Dis . 184(3):322–329.

Bland CS, Ireland JM, Lozano E, Alvarez ME, Primm TP. 2005.Mycobacterial ecology of the Rio Grande. Appl. Environ. Microbiol .71(10):5719–5727.

Borneman J, Triplett EW. 1997. Molecular microbial diversity insoils from eastern Amazonia: Evidence for unusual microorganismsand microbial population shifts associated with deforestation. Appl.Environ. Microbiol . 63(7):2647–2653.

Castiglioni S, Pomati F, Miller K, Burns BP, Zuccato E,et al. 2008. Novel homologs of the multiple resistance regu-lator marA in antibiotic-contaminated environments. Water Res .42(16):4271–4280.

Contreras-Moreira B, Sachman-Ruiz B, Figueroa-Palacios I,Vinuesa P. 2009. Primers4clades: A web server that uses phylo-genetic trees to design lineage-specific PCR primers for metage-nomic and diversity studies. Nucleic Acids Res . 37(Web Serverissue):W95–W100.

Edgar RC. 2004. MUSCLE: Multiple sequence alignment with highaccuracy and high throughput. Nucleic Acids Res . 32(5):1792–1797.

Edwards SV. 2009. Is a new and general theory of molecular system-atics emerging? Evolution 63(1):1–19.

Ewing B, Green P. 1998. Base-calling of automated sequencer tracesusing phred. II. Error probabilities. Genome Res . 8(3):186–194.

Ewing B, Hillier L, Wendl MC, Green P. 1998. Base-calling of auto-mated sequencer traces using phred. I. Accuracy assessment. GenomeRes . 8(3):175–185.

Falkinham JO, 3rd. 2009. Surrounded by mycobacteria: nontubercu-lous mycobacteria in the human environment. J. Appl. Microbiol .107(2):356–367.

Falkowski PG, Fenchel T, Delong EF. 2008. The micro-bial engines that drive Earth’s biogeochemical cycles. Science320(5879):1034– 1039.

Felsenstein J. 2004. PHYLIP (Phylogeny Inference Package), 3.6 ed.Distributed by the author. Seattle: Department of Genetics, Universityof Washington.

Frias-Lopez J, Shi Y, Tyson GW, Coleman ML, Schuster SC, et al.2008. Microbial community gene expression in ocean surface waters.Proc. Natl. Acad. Sci. USA 105(10):3805–3810.

Gevers D, Cohan FM, Lawrence JG, Spratt BG, Coenye T, et al.2005. Opinion: Reevaluating prokaryotic species. Nat. Rev. Micro-biol . 3(9):733–739.

Guindon S, Delsuc F, Dufayard JF, Gascuel O. 2009. Estimatingmaximum likelihood phylogenies with PhyML. Methods Mol. Biol .537:113–137.

Hughes JB, Hellmann JJ, Ricketts TH, Bohannan BJ. 2001. Count-ing the uncountable: Statistical approaches to estimating microbialdiversity. Appl. Environ. Microbiol . 67(10):4399–4406.

Hunt DE, David LA, Gevers D, Preheim SP, Alm EJ, et al. 2008.Resource partitioning and sympatric differentiation among closelyrelated bacterioplankton. Science 320(5879):1081– 1085.

Jacobs J, Rhodes M, Sturgis B, Wood B. 2009. Influence of envi-ronmental gradients on the abundance and distribution of Mycobac-terium spp. in a coastal lagoon estuary. Appl. Environ. Microbiol .75(23):7378–7384.

Jarman SN. 2004. Amplicon: Software for designing PCR primers onaligned DNA sequences. Bioinformatics 20(10):1644–1645.

Larkin MA, Blackshields G, Brown NP, Chenna R, McGettiganPA, et al. 2007. Clustal W and Clustal X version 2.0. Bioinformatics23(21):2947–2948.

Maiden MC. 2006. Multilocus sequence typing of bacteria. Annu. Rev.Microbiol . 60:561–588.

Manning SD, Motiwala AS, Springman AC, Qi W, Lacher DW,et al. 2008. Variation in virulence among clades of Escherichia coliO157:H7 associated with disease outbreaks. Proc. Natl. Acad. Sci.USA 105(12):4868–4873.

Nakamura Y, Gojobori T, Ikemura T. 2000. Codon usage tabulatedfrom international DNA sequence databases: Status for the year 2000.Nucleic Acids Res . 28(1):292.

Polz MF, Cavanaugh CM. 1998. Bias in template-to-product ratiosin multitemplate PCR. Appl. Environ. Microbiol . 64(10):3724–3730.

Primm TP, Lucero CA, Falkinham JO, 3rd. 2004. Health impacts ofenvironmental mycobacteria. Clin. Microbiol. Rev . 17(1):98–106.

Rose TM, Schultz ER, Henikoff JG, Pietrokovski S, McCal-lum CM, et al. 1998. Consensus-degenerate hybrid oligonucleotideprimers for amplification of distantly related sequences. Nucleic AcidsRes . 26(7):1628–1635.

Rose TM, Henikoff JG, Henikoff S. 2003. CODEHOP (COnsensus-DEgenerate Hybrid Oligonucleotide Primer) PCR primer design.Nucleic Acids Res . 31(13):3763–3766.

Sachman-Ruiz B, Vinuesa P. 2011. In preparation. Q3Sachman-Ruiz B, Castillo-Rodal AI, Lopez-Vidal Y, Martınez-

Romero E, Vinuesa P. 2009. Diversity of environmental mycobac-teria in Mexican rivers assessed by cultivation and metagenomicsapproaches. Poster abstract U-007. 109th General Meeting of theAmerican Society for Microbiology. May 17–21 2009. Philadelphia.

Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M,et al. 2009. Introducing mothur: open-source, platform-independent,community-supported software for describing and comparing micro-bial communities. Appl. Environ. Microbiol . 75(23):7537–7541.

Schmidt HA, Strimmer K, Vingron M, von Haeseler A. 2002.TREE-PUZZLE: maximum likelihood phylogenetic analysis usingquartets and parallel computing. Bioinformatics 18(3):502–504.

Smith CJ, Nedwell DB, Dong LF, Osborn AM. 2007. Diversityand abundance of nitrate reductase genes (narG and napA), nitritereductase genes (nirS and nrfA), and their transcripts in estuarinesediments. Appl. Environ. Microbiol . 73(11):3612–3622.

Smith GJ, Vijaykrishna D, Bahl J, Lycett SJ, Worobey M, et al.2009. Origins and evolutionary genomics of the 2009 swine-originH1N1 influenza A epidemic. Nature 459(7250):1122– 1125.

de Bruijn Vol. I c51.tex V1 - 02/08/2011 5:16 P.M.


Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, et al.2002. The Bioperl toolkit: Perl modules for the life sciences. GenomeRes . 12(10):1611–1618.

van Ingen J, Boeree MJ, Dekhuijzen PN, van Soolingen D. 2009.Environmental sources of rapid growing nontuberculous mycobacteriacausing disease in humans. Clin. Microbiol. Infect . 15(10):888–893.

Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D,et al. 2004. Environmental genome shotgun sequencing of the Sar-gasso Sea. Science 304(5667):66–74.

Vinuesa P. 2010. Multilocus sequence analysis and bacterial speciesphylogeny estimation. In Oren A, Papke RT, eds. Molecular Phy-logeny of Microorganisms . Norwich, UK: Caister Academic Press,pp. 000–000.Q4

Vinuesa P, Rojas-Jimenez K, Contreras-Moreira B, Mahna SK,Prasad BN, et al. 2008. Multilocus sequence analysis for assessmentof the biogeography and evolutionary genetics of four Bradyrhizobiumspecies that nodulate soybeans on the Asiatic continent. Appl. Environ.Microbiol . 74(22):6987–6996.

Weir RE, Black GF, Nazareth B, Floyd S, Stenson S, et al. 2006.The influence of previous exposure to environmental mycobacteria onthe interferon-gamma response to bacille Calmette–Guerin vaccina-tion in southern England and northern Malawi. Clin. Exp. Immunol .146(3):390–399.

Weisburg WG, Barns SM, Pelletie DA, Lane DJ. 1991. 16Sribosomal DNA amplification for phylogenetic study. J. Bacteriol .173:697–703.

Yutin N, Beja O. 2005. Putative novel photosynthetic reaction centreorganizations in marine aerobic anoxygenic photosynthetic bacteria:Insights from metagenomics and environmental genomics. Environ.Microbiol . 7(12):2027–2033.

Zehr JP, Jenkins BD, Short SM, Steward GF. 2003. Nitrogenasegene diversity and microbial community structure: A cross-systemcomparison. Environ. Microbiol . 5(7):539–554.

de Bruijn Vol. I c51.tex V1 - 02/08/2011 5:16 P.M.

Queries in Chapter 51

Q1. We have shortened the running head, since it exceeds the textwidth. Please confirm

Q2. Update in ref list

Q3. Update

Q4. Supply page range

Documents

Primers4clades, A Web Server to Design Lineage-Speciﬁc ...cimi.ccg.unam.mx/~vinuesa/Papers_PDFs/Sachman_primers4c...Primers4clades was mainly written in Perl and uses several BioPerl