Upload
others
View
7
Download
0
Embed Size (px)
Citation preview
FREQ-Seq: A Rapid, Cost-Effective, Sequencing-BasedMethod to Determine Allele Frequencies Directly fromMixed PopulationsLon M. Chubiz1., Ming-Chun Lee1.¤, Nigel F. Delaney1, Christopher J. Marx1,2*
1 Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, Massachusetts, United States of America, 2 Faculty of Arts and Sciences Center for
Systems Biology, Harvard University, Cambridge, Massachusetts, United States of America
Abstract
Understanding evolutionary dynamics within microbial populations requires the ability to accurately follow allelefrequencies through time. Here we present a rapid, cost-effective method (FREQ-Seq) that leverages Illumina next-generation sequencing for localized, quantitative allele frequency detection. Analogous to RNA-Seq, FREQ-Seq relies uponcounts from the .105 reads generated per locus per time-point to determine allele frequencies. Loci of interest are directlyamplified from a mixed population via two rounds of PCR using inexpensive, user-designed oligonucleotides and a bar-coded bridging primer system that can be regenerated in-house. The resulting bar-coded PCR products contain theadapters needed for Illumina sequencing, eliminating further library preparation. We demonstrate the utility of FREQ-Seq bydetermining the order and dynamics of beneficial alleles that arose as a microbial population, founded with an engineeredstrain of Methylobacterium, evolved to grow on methanol. Quantifying allele frequencies with minimal bias down to 1%abundance allowed effective analysis of SNPs, small in-dels and insertions of transposable elements. Our data reveal large-scale clonal interference during the early stages of adaptation and illustrate the utility of FREQ-Seq as a cost-effective toolfor tracking allele frequencies in populations.
Citation: Chubiz LM, Lee M-C, Delaney NF, Marx CJ (2012) FREQ-Seq: A Rapid, Cost-Effective, Sequencing-Based Method to Determine Allele Frequencies Directlyfrom Mixed Populations. PLoS ONE 7(10): e47959. doi:10.1371/journal.pone.0047959
Editor: Jurg Bahler, University College London, United Kingdom
Received July 31, 2012; Accepted September 17, 2012; Published October 31, 2012
Copyright: � 2012 Chubiz et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permitsunrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work has been supported by funding from National Science Foundation (DEB-0845893) and National Institutes of Health (GM078209). The fundershad no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected]
. These authors contributed equally to this work.
¤ Current address: Department of Biochemistry, The University of Hong Kong, Pokfulam, Hong Kong
Introduction
The textbook definition of evolution is the change in frequency
of alleles in a population over time [1]. While this extremely simple
definition provides a clear picture of the evolutionary process, the
ability to actually detect and track the frequency of known
mutations in populations had previously been limited to work on
phage [2], and remains surprisingly challenging for microbes.
Identifying evolved alleles by applying next-generation sequencing
technologies either to isolates [3,4,5] or mixed-population samples
[6] has revolutionized the discovery of polymorphisms that occur
during adaptation. However, this information does not often yield
quantitative allele frequency information. The classic method for
determining allele frequencies - examination of a large number of
isolates from a population [7,8,9] - is extremely time- and
resource-intensive. Several methods exist that allow accurate
determination of allelic frequencies directly from population
samples, but these either require substantial material and
personnel cost and/or have limited sensitivity and accuracy
[10,11,12]. For these reasons, a reliable, high-throughput pipeline
for allele frequency detection in natural and laboratory popula-
tions will allow a more extensive understanding of evolutionary
processes.
Considering the equipment present in most molecular biological
laboratories, there exist only a handful of techniques available to
researchers to determine the alleles present in isolates, or their
population frequencies directly from mixed samples. The simplest
and most time-honored of these methods is restriction fragment-
length polymorphism (RFLP) analysis. Although generally quite
accurate, this technique requires the polymorphisms under
investigation to change restriction endonuclease cleavage sites or
grossly change the sequence length [13,14]. This limits application
to a small subset of mutations. Additionally, a number of
quantitative PCR-based (qPCR) methods have shown promise
though their implementation requires significant optimization and
calibration [3,15]. Lastly, and arguably the most straight-forward
method, has been the use of Sanger sequencing to determine
allelic frequencies in mixed DNA samples [3,9,16]. The chief
drawbacks of this method are that accuracy and detection limits
are highly variable requiring a large number of replicate samples.
In recent years, a number of commercial systems have been
developed to quantitate allele frequencies [17,18,19,20,21]. The
Sequenom iPLEXH platform utilizes differential PCR extension
coupled with matrix assisted laser desorption/ionization – time of
flight (MALDI-TOF) mass spectrometry to infer relative allele
frequencies in a given pooled sample. Alternatively, the Fluidigm
PLOS ONE | www.plosone.org 1 October 2012 | Volume 7 | Issue 10 | e47959
Access ArrayTM and Raindance ThunderstormTM systems use
proprietary, dedicated microfluidic thermal cyclers for targeted
PCR enrichment of variable loci followed by deep-sequencing
using any number of next-generation sequencing platforms.
Additionally, use of the Access ArrayTM and ThunderstormTM
systems require synthesis and re-ordering of pairs of individual,
long oligonucleotides for each allele probed. Although these
methods have been successfully applied, the substantial equipment
and/or supplies costs pose barriers to most academic end-users.
In this work, we have developed a rapid, cost-effective strategy
for localized, Illumina-based allele ‘‘frequency sequencing’’, or
‘‘FREQ-Seq’’. The basic rationale of FREQ-Seq is analogous to
RNA-Seq: allele frequencies (rather than transcript levels) are
determined by counting DNA sequence reads. FREQ-Seq utilizes
two rounds of PCR that take advantage of the inclusion of a small
amount of a long, bridging primer that can be produced and re-
generated in-house at low cost in order to introduce sample-
specific bar-codes (Figure 1). Similar to Access ArrayTM and
ThunderstormTM, this method applies PCR enrichment and next-
generation sequencing platforms to estimate allele frequencies, but
in our case, without the expense of individual long oligonucleotides
for each locus, nor the need for additional instrumentation. The
resulting PCR products require no further library preparation for
sequencing, and are simply pooled together for analysis using
either Illumina HiSeq or Genome Analyzer platforms. The
resulting libraries, introduced at just 5% of an Illumina HiSeq
control lane, produced an average of 147,000 reads per sample.
With a custom, publicly available analysis software package we
have developed, users can easily extract and interpret experimen-
tal results.
We validated and demonstrated the utility of the FREQ-Seq
system by determining the frequency of four beneficial mutations
that arose during adaptation of an engineered Methylobacterium
extorquens AM1 strain that requires a heterologous pathway for
formaldehyde oxidation [22]. We examined both i) controlled
cellular ratios of each allele and ii) cryogenically stored population
time-point samples. This engineered strain had been evolved for
900 generations in the laboratory in a minimal medium containing
methanol as a sole carbon source [23]. Use of these allele
frequency data allowed for direct comparison of observed fitness
increases to the rapid rise and selective sweep of these mutations.
These data revealed a period of clonal interference that
temporarily reversed the rise of alleles that would later move to
fixation. For simple polymorphisms, like SNPs and small in-dels
we found calibration was unnecessary, whereas analysis of more
complex allele types such as .10 bp deletions benefit from
correction via control ratios and a simple model fit. Finally, we
demonstrate that a modified primer strategy allows FREQ-Seq to
be used to analyze new DNA junctions such as novel transposable
element insertions which have been repeatedly observed in this
[24] and other evolution experiments [13,14].
Materials and Methods
Strains, culturing and automated growth ratemeasurements
All strains used are derivatives of Methylobacterium extorquens AM1
with relevant genotypes listed in Table S4. Bacterial cultures were
routinely grown in ‘Hypho’ minimal medium exactly as described
by Chou et al. [24], supplemented with 3.5 mM succinate or
15 mM methanol, unless otherwise described, and was performed
at 30uC with aeration. For growth rate measurements, 5 mL frozen
population cultures were inoculated in 640 mL medium supple-
mented with 13.125 mM methanol and 0.4375 mM succinate
(ratio of carbon from M:S = 7:1) in 48-well micotiter plates
(Costar) and grown to saturation to acclimate to growth on
methanol. Following acclimation, saturated cultures were trans-
ferred with a 1:64 dilution in fresh Hypho medium with 20 mM
methanol and placed on a plate shaking tower (Caliper) at 30uC in
a humidified environmental room. Optical densities were obtained
every 2 hrs on a Wallac Victor 2 plate reader (Perkin-Elmer) until
cultures reached saturation using an automated measurement
system [25]. Growth rates were determined by fitting an
exponential growth model using custom analysis software, Curve
Fitter (Delaney et al., unpublished; http://www.evolvedmicrobe.
com/Software.html), with a minimum of 4 replicates.
Construction of a bar-coded adapter plasmid libraryThe entirety of the adapter library was assembled in a vector
backbone for ease of storage and routine amplification. To
generate an appropriate plasmid backbone, the pUC19 vector
(New England Biolabs) was PCR amplified using primers
TCGGTGGTCGCCGTATCATTTTAATTGCGTTGCGCT-
CACTG and AGAGTAAAACGACGGCCAGTTACG-
CATCTGTGCGGTATTTC, and Phusion DNA polymerase
(New England Biolabs, NEB) under manufacturer recommended
conditions, to generate a linear DNA fragment containing only the
replication origin and ampicillin resistance marker. The absence of
the lacZa multiple cloning site ensures no interference with
downstream PCR amplification using the M13f priming site. The
resulting PCR reaction was treated with 20 units of DpnI
restriction endonuclease (NEB) for 1 hr at 37uC to remove the
pUC19 plasmid template prior to purification of the linear DNA
product. The bridging primer was first synthesized and PAGE
purified (Integrated DNA Technologies, IDT) using the template
AATGATACGGCGACCACCGAGATCTACACTCTTTCCC-
TACACGACGCTCTTCCGATCTNNNNNNGTAAAAC-
GACGGCCAGT, where N signifies random nucleotides com-
prising the unique bar code. The subsequent 81 nt ssDNA
oligonucleotide was then converted into dsDNA by five cycles of
PCR using the primer ACTGGCCGTCGTTTTAC and Phusion
DNA polymerase under standard reaction conditions with a 5 s
extension time. Presence of dsDNA following PCR was verified by
10% Acylamide/1X TAE electrophoresis and SybrSafe staining.
The resulting DNA was then purified using phenol extraction and
ethanol precipitation.
To create the plasmid-borne bar-code library, 100 ng of the
linear pUC19 PCR product as well as an equimolar amount of the
81 bp dsDNA Illumina-M13f bar-code were mixed and assembled
into circular plasmid DNA using the one-step method developed
by Gibson and co-workers [26]. Approximately 20 ng of the
resulting assembled DNA was then transformed into NEB10b cells
using the 5 minute transformation protocol (NEB) to eliminate
sibling clones, followed by plating of 100 ml aliquots onto SOB
plates containing ampicillin (100 mg/mL) and X-Gal (80 mg/mL).
The resulting white, single colonies were inoculated into 1 mL of
SOB with ampicillin (100 mg/mL), grown overnight at 37uC in
deep, 96-well plates with shaking at 320 rpm, and stored at
280uC after the addition of DMSO to 8% final concentration.
Approximately, 192 clones were verified by Sanger sequencing
(MWG Operon) and a resulting set of 48 bar-coded adapters were
picked based on a minimum Hamming distance of 2.
The entire FREQ-Seq barcode library is publicly available
through Addgene.org and available individually or as a full kit of
48 plasmids (www.addgene.org/Christopher_Marx/).
FREQ-Seq: Cost-Effective Allele Frequency Assay
PLOS ONE | www.plosone.org 2 October 2012 | Volume 7 | Issue 10 | e47959
Figure 1. Illustration of the FREQ-Seq method. A.) The bridging primers can be produced in house using standard PCR and purification.Individual, uniquely bar-coded bridging primers are amplified from the FREQ-Seq plasmid library set using two primers ABC1 (AATGATACGGC-GACCAC) and ABC2 (ACTGGCCGTCGTTTTAC). The resulting bridging primers are stored as dsDNA at 220uC. B.) Construction of FREQ-Seq Illuminalibraries is conducted in two stages. A first round of PCR amplification is used to generate an allele-specific, representative fragment pool withproduct inserts approximately 250 bp in size. Each product at this stage contains an M13f sequence at the sequencing end and an Illumina B adaptersequence at the non-sequencing end. These products are then treated with exonuclease I or Ampure XP beads to remove remaining PCR primers. Ina second stage of PCR amplification, individual samples are barcoded and enriched using three primers; a small amount of a FREQ-Seq bridging
FREQ-Seq: Cost-Effective Allele Frequency Assay
PLOS ONE | www.plosone.org 3 October 2012 | Volume 7 | Issue 10 | e47959
Amplication and purification of the bar-coded Illumina-M13f bridging primers
Production of dsDNA Illumina-M13f bridging primers for
sample bar-coding was performed by PCR followed by agarose gel
extraction. To amplify a given adapter from the plasmid adapter
clone library, PCR reactions were formulated with primers ABC1
and ABC2 (Table S1) using Phusion DNA polymerase and cycled
under standard conditions with a 5 s extension time. The resulting
products were purified by 2% agarose gel electrophoresis and gel
extraction (Zymo ZR-96 kit) according to manufacturer’s instruc-
tion. Illumina-M13f bridging primers were then verified for
correct size and purity using an Agilent Bioanalyzer.
Allele frequency estimates as determined by flowcytometry
Allele frequency mixtures were generated by growing each
isogenic allele type found in Table S4 to late-log phase.
Subsequent mixtures were made using cell cultures diluted to
OD600 = 0.15. Ratios of cell types were measured using a BD
LSRII flow cytometer. Estimates of allele frequency were
generated using the FlowCore BioConductor package in R to
gate fluorescent and non-fluorescent cell types [27].
Allele specific library generation, sequencing andfrequency determination
Allele-specific, Illumina-compatible FREQ-Seq sequencing
libraries are generated via a two-step procedure outlined in
Figure 1B. A basic overview of the protocol is provided in BoxS1. Additional workflow examples may be found in the online
Text S1. In the initial step, the region containing genetic variation
(SNPs or deletions) is amplified by PCR using allele-specific
primers containing overhangs shown in Figure 1B and listed in
Table S1. These PCR products thereby contain the region of
interest flanked by the non-sequencing Illumina adapter sequence
on one end (Illumina ‘‘B’’) and a ‘universal adapter’ (M13f
sequence) on the sequencing end to enable subsequent bar-coding.
Notably, any polymorphisms must be positioned closely (e.g., ,5–
10 bp for 50-bp reads; 55–60 bp for 100-bp reads) to the end with
the universal adapter in order to enable detection. This is due to
the first 40–45 bases of a sequencing read being consumed by the
FREQ-Seq related barcodes and adapters (Figure 1A). Thus, for
each new locus, only two primers need to be synthesized as
standard DNA oligonucleotides. The second round of PCR is
unique in that it uses three primers. The outer primers are generic
to all reactions and utilize the Illumina ‘‘A’’ and ‘‘B’’ sequences to
amplify all products. Critical to our method, a small amount (1/10
normal concentration) of a long, bridging primer is doped into
each reaction. These 81-mers have the Illumina ‘‘A’’ sequence at
the 59-end, the M13f adapter sequence at the 39-end to amplify the
templates from the first round of PCR, and a 6-mer bar-code in
between that allows each amplicon pool to be identified in a
sample-specific manner (see Figure 1 and Tables S1 and S2).
This results in production of DNA fragment pools already
compatible with the Illumina single-end read flow cell, obviating
all subsequent library preparation steps (e.g., shearing, adapter
ligation, column selection, etc.). The resulting products are then
pooled and sequenced en masse on a Illumina HiSeq or Genome
Analyzer flow cell.
Following Illumina sequencing, data analysis is performed using
custom, open-source software (FREQout), available at http://
www.evolvedmicrobe.com/FreqSeq/index.html. The software in-
cludes a command-line executable and a program with a graphical
user interface. Both programs are designed to read and align
sequences contained in the Illumina FASTQ output file. User
defined alleles and bar-code sequences, as well as alignment and
other analysis parameters, are inputted to the software via an
XML formatted file. The software runs the analysis by assigning
each read to an allele and barcode group based on its sequence
and the quality of its alignment to the different possible alleles (see
also Text S1). Because the software checks for the presence of the
M13f sequence while parsing the reads, this step allows for FREQ-
Seq samples to be run in parallel with other samples on the same
Illumina flow cell. The data generated on the frequency of each
type, as well as statistics related to various quality and alignment
metrics, are then written to a CSV file. The graphical user
interface also includes interactive plots that allow users to view
these details immediately.
Results
Method validationIn order to determine the efficacy of FREQ-Seq, we sought to
address three primary concerns: the scalability of library prepa-
ration, sources of bias, accuracy in the measurements, and the
detection limits of the technique. To ensure scalability, we
designed our ‘bridging’ primer system, that introduces the
sample-specific barcode, in a manner that can be produced in-
house (Figure 1A and Text S1). This alleviates the significant
cost and high error rates of custom synthesis of .60 bp DNA
oligonucleotides. Specifically, we found that approximately 15% of
clones from the FREQ-Seq bridging primer plasmid library
contained frame-shift or mismatch mutations, apparently intro-
duced during oligonucleotide synthesis, and were thus culled.
Presently, we have experimentally confirmed 48 unique bar-coded
adapters all possessing sequence Hamming distances greater than
two. This criterion allows for greater fidelity in bar code
assignment during data processing. In addition, these 48 bar
codes, positioned to be the very first bases sequenced, contain no
significant nucleotide bias within the first four bases. This is critical
because biases in the first four bases of Illumina reads are known to
cause significant reductions in cluster identification resulting in
large numbers of abandoned reads by the Illumina analysis
software [28].
Since sequencing libraries require PCR amplification, we
examined the potential bias that could arise from differential
product amplification or detection. As controls, we assayed
mixtures containing a wide range of cell ratios – accurately
determined via flow cytometry – containing either the ancestral or
evolved version of an allele. Following 600 generations of
adaptation to methanol growth by a genetically engineered strain
of M. extorquens AM1, an isolate was chosen and sequenced to
determine the genetic basis of adaptation [22]. We began our
analysis of this population by interrogating three beneficial
mutations, pntABEVO, gshAEVO, and fghAEVO, which span the gamut
of a SNP and deletions of 2- and 11-bp, respectively (Figure 2A–C).
primer and two Illumina enrichment primers. Initially, amplification proceeds via the sample-specific bridging primer that has the M13f sequence atits 39 end. The resulting products can then be amplified to higher quantity with the generic, enrichment primers A and B. Following sample poolingand standard PCR product purification, the resulting products constitute a complete FREQ-Seq library.doi:10.1371/journal.pone.0047959.g001
FREQ-Seq: Cost-Effective Allele Frequency Assay
PLOS ONE | www.plosone.org 4 October 2012 | Volume 7 | Issue 10 | e47959
Comparison of the observed allele frequencies compared to
those expected from the cell ratios demonstrated that there was
remarkably little bias throughout the procedure (Figure 2). Only
the 11-bp deletion of the fghA polymorphism deviated to somewhat
from expectation (Figure 2C). . Because, fortuitously, each of the
allele types of pntAB, gshA, and fghA resulted in change in restriction
pattern, we used quantitative RFLP analysis of the same amplicon
pools and found a similarly weak bias (Figure S1). Fitting these
control data to a quadratic function, however, allowed us to
readily control for the minor bias of fghA, but it appears to be
largely unnecessary for simple SNPs or small in-dels.
The FREQ-Seq method can consistently produce allele
frequency estimates with low absolute errors. However, for very
high and very low allele frequency estimates (.99% or ,1%), the
relative error on the estimates can become large because the
frequency of the allele (the signal) starts to become equal to or less
than the frequency of sequencing errors (the noise). In principle,
one could obtain better estimates of very low or high frequency
alleles by making the estimate within a probabilistic framework
that accounts for the chance that a read is a sequencing error. This
would require a model for the likelihood of sequencing errors,
which would be best verified by running a sample that is entirely of
one type and seeing how often the other type appears. We
performed this assay by sequencing barcoded samples for both the
gshA and the pntAB loci that were at 100% frequency for only one
allele type, and did this with four independently prepared samples
for both alleles on four different dates. For both alleles, and for all
sequencing dates, the observed error rate was very low and the
median observed percentage of the type not supposed to be
present in the sample was 0.23%, indicating that this percentage is
approximately equal to the range in which signal and noise
become equally important to the direct allele frequency estimate.
However, interestingly, a comparison of the different dates shows
that the frequencies of incorrect sequencing reads in the samples
were statistically different for both allele types (logistic regression,
p-values,10215), and two samples from one date contained errors
that were greater than 1% at 1.41% and 1.13% respectively. We
interpreted this result to mean that the frequency of sequencing
errors can be influenced by subtle and distinct biases for different
sequencing runs, and so one must be careful before applying an
error model trained on one day’s data to sequencing data obtained
on another day. However, once data from many more sequencing
runs is obtained, we suspect one could create a reliable method to
more accurately assess uncertainty in low and high frequency
samples by using a hierarchical model that accounts for day to day
variation and that also includes read quality scores as a covariate.
FREQ-Seq can be used to monitor novel DNA junctionsA reoccurring mutational type in many evolution experiments is
the formation of new DNA junctions, either by deletion,
amplification or insertion sequence (IS) element/transposon
insertions [4,13,14,23,24,29,30,31]. To assess the ability of
FREQ-Seq to detect new DNA junctions we modified our primer
strategy to use three primers in the first reaction (two forward,
sequencing-end and one reverse found in Table S1, see FigureS2) to amplify a region adjacent to the icuAB locus on the M.
extorquens AM1 chromosome. This site was previously shown to
contain an IS element insertion that resulted in the increased
expression of icuAB (a cobalt transport system) under metal
limitation [24]. Of the three primers designed, the two forward,
sequencing-end primers were designed to differentially bind to the
icuABWT and icuABEVO (IS containing) genotypes, while the third
reverse primer annealed to a common site downstream of the IS
insertion site. Thus, each amplicon would only occur in the
presence of icuABWT or icuABEVO, respectively. Using the same
protocol developed for two locus primers, we found that FREQ-
Seq was capable of detecting the IS element insertion located
proximal to icuAB and allowed us to follow its rise to fixation
(Figure 3).
Figure 2. Calibration of FREQ-Seq data for three ancestral/evolved (WT/EVO) allele mixtures pntABEVO (A.), gshAEVO (B.), and fghAEVO
(C.). Strains bearing each evolved allele were compared to one with the wild-type version that also expressed mCherry to permit precisequantification of expected ratios by flow cytometry (Table S3). Presented in each column are sequence variation between the ancestral WT and EVOalleles, and subsequent quadratic fits of the data for each respective allele. Observed data are the frequencies of the evolved allele as determined byFREQ-Seq, whereas expected data are wild-type allele frequencies estimated using flow cytometry. Quadratic model fitting was performed in R.doi:10.1371/journal.pone.0047959.g002
FREQ-Seq: Cost-Effective Allele Frequency Assay
PLOS ONE | www.plosone.org 5 October 2012 | Volume 7 | Issue 10 | e47959
Determining the order and dynamics of beneficial allelesfrom evolved populations
Numerous evolution experiments have been performed using
microbial populations where temporal population samples have
been cryogenically stored. Wide-spread access to whole genome
resequencing has allowed many of the beneficial mutations that
have risen in such populations to be identified and characterized.
For the M. extorquens AM1 population we have used as a test case,
we had previously uncovered a generic pattern of epistasis between
the mutations that emerged that led to diminishing returns [23].
What was unknown, however, was the order and dynamics by
which these beneficial alleles emerged. Therefore, to elucidate the
adaptive trajectory and dynamics for this population we employed
FREQ-Seq to the three loci described above (Figure 2), as well as
icuAB. Interestingly, we discovered that the order of incorporation
of these alleles followed the ranking of their selective advantages:
gshAEVO, fghAEVO, icuABEVO and then pntAEVO (Figure 3). Despite
the large selective advantage of the alleles we examined, we
uncovered a substantial, temporary reversal in relative frequencies
between generations 108 and 210. This pattern is indicative of
clonal interference (i.e. the Hill-Robertson effect; [32]); whereby
multiple lineages accrue beneficial mutations and interfere with
each other’s path toward fixation [31,33,34,35,36].
Comparison of FREQ-Seq data to genotype data directlyfrom random isolates
The FREQ-Seq data from our test population suggested three
things: the order of mutations, probable early linkage of gshAEVO
and fghAEVO before they both dramatically rose in frequency, and
that the majority of the population at generation 150 had
unidentified mutations. As an independent method to test these
inferences, we obtained and independently genotyped 72 random
isolates from generation 150 (Figure 4 and Table S2). First
considering the known alleles from the eventual winning lineage
that were present at this time (gshAEVO and fghAEVO), isolates were
found with both gshAEVO and fghAEVO, or just gshAEVO, but not just
fghAEVO. This pattern and linkage data is consistent with the
FREQ-Seq inferences that the fghAEVO mutation arose later and on
the background of gshAEVO.
Turning to trying to uncover what other mutations were present
in the lineages that temporarily outcompeted those with gshAEVO
and fghAEVO, we considered previous work on parallelism across
the replicate populations [27]. This paper had found that all
populations contain mutations that, like the fghAEVO allele
described here, reduce expression of the introduced formaldehyde
oxidation pathway in this strain. The most common mutational
event across populations was IS-mediated co-integration of the
introduced, multi-copy plasmid expressing the foreign formalde-
hyde pathway with an endogenous, single copy plasmid in M.
Figure 3. Allele frequencies for four beneficial alleles in population F4 (A) with correlated changes in relative growth rate (B). AllelesgshAEVO (circles), fghAEVO (triangles), pntABEVO (squares), and icuABEVO (diamonds) were determined by FREQ-Seq and corresponding values werecorrected with calibrated data shown in Figure 2. Growth rates (B.) were normalized to the growth rate of the ancestral strain.doi:10.1371/journal.pone.0047959.g003
FREQ-Seq: Cost-Effective Allele Frequency Assay
PLOS ONE | www.plosone.org 6 October 2012 | Volume 7 | Issue 10 | e47959
extorquens AM1 [29]. We used PCR amplifications to determine
whether such an event occurred in these isolates (Table S3). Of
the 72 isolates, 43 of these had an trfA::ISMex25 allele (trfA
encodes the plasmid replication protein of this IncP replicon, [37]).
Interestingly, of these there were 2 that possessed both gshAEVO and
an trfA::ISMex25 allele. This homoplasy is most likely due to
repeated cointegrations on different backgrounds. Finally, there
were 14 isolates that lacked all of the known beneficial alleles, but
co-existed with the other genotypes at generation 150 when the
growth rate of the population had risen by 84.462.5%. Clearly
there were other, as yet unknown mutations that occurred and
temporarily rose in frequency.
Comparison of beneficial allele emergence withobserved increases in population fitness
With a quantitative picture of genotypic change at key loci
available via FREQ-Seq, we measured the change in growth rate
of the population to determine if this corroborated the finding of
clonal interference. We used growth rate as a proxy for fitness
because it has been previously demonstrated for these populations
that these two measures are linearly correlated [23]. As has been
seen for other experiments with M. extorquens AM1 [38], as well as
other model systems (e.g., [39]), the rate of fitness increase was
faster early in the experiment then later, but continued to rise
throughout. Importantly, the relative smoothness, rather than
stepwise nature of the observed dynamics was consistent with
theoretical models of clonal interference (and recurring ‘‘multiple
mutations’’; [40]). Furthermore, clear evidence for other lineages
that were ‘‘invisible’’ to our FREQ-Seq data comes from the fact
that the temporary peak of the gshAEVO allele at 34.7% of the
populations occurred at generation 108, when the relative growth
had already increased by 69.368.9%. Given that this was the
inflection point between rising and falling in frequency, this
indicates that the genotype containing gshAEVO had a fitness of
,70%. This means that the rest of the population was somewhat
less fit than this genotype prior to that time-point, and somewhat
higher afterwards. Despite these fast reversals in allele frequencies
during this interval, growth rate continued to rise relatively
continuously. Compared to the wavering trajectory for gshAEVO,
the trajectories of fghAEVO, and later pntABEVO and icuABEVO did
not exhibit wild reversals, yet were more complex than the simple
sinusoidal pattern expected were there to be periodic selection of
single mutations in the absence of clonal interference [41].
Discussion
Reported here is a method for quantitative determination of
allele frequencies called FREQ-Seq. FREQ-Seq utilizes standard
PCR and depends upon a novel bar-coding system to generate
locus-specific Illumina sequencing libraries. This capitalizes on
focusing the sequencing depth of next-generation sequencing
platforms upon just the key loci of interest across timepoints or
populations in order to accurately infer their frequency. A key
aspect of FREQ-Seq is its capacity for multiplexing a large
number of samples. For example, for academic laboratories
engaged in experimental evolution or environmental microbiolo-
gy, FREQ-Seq provides a low-cost, highly extendable method for
investigating genetic variation over large temporal and spatial
scales.
As a proof of principle we have demonstrated the utility of
FREQ-Seq in tracking the emergence of four beneficial alleles
from a previously characterized evolution experiment [23]. We
also observed that, while the order of allele emergence matched
the order which would have been the most optimal adaptive
trajectory amongst the epistatic combinations generated by Chou
and coworkers [23], a previously unknown period of clonal
interference existed early during adaptation. Based on genotyping
of clones during this period, one prominent source of interfering
lineages resulted from IS insertions into the replication machinery
(trfA) of the RK2-based plasmid harboring the exogenous
formaldehyde oxidation pathway. Similar IS insertions have
recently been found in many of the replicate populations from
this experiment, and provide 17.4–24.1% fitness increases due to
reducing the cost associated with exogenous pathway expression
[29]. Interestingly, while these IS mutational variants were the
eventual evolutionary ‘winners’ in other populations, they were
unable to outcompete the fghAEVO gshAEVO genotype despite being
simultaneously present in gshAEVO backgrounds. These results
underpin the importance of allele frequency data in revealing the
complex processes during adaptation. Furthermore, by simulta-
neously determining the population growth rate with high
resolution, we were able to both estimate the fitness of the
genotype containing the allele of interest and demonstrate that the
reversal in its frequency actually occurred during a period of
continued rise in fitness. As more studies move to characterizing
improvement accurately and with fine time resolution, the
synergistic role of FREQ-Seq estimates of allele frequencies will
become even more important to connect between phenotypic and
genotypic change.
This test case highlights advantages of the FREQ-Seq method.
First, our results show that a wide range of allelic types can be
detected ranging from SNPs and small in-dels to novel IS
junctions. Second, operating at approximately 5% of an Illumina
flow cell capacity we were able to achieve .105 reads per locus per
time-point. Third, we observed minimal bias in our control ratios.
This suggests that FREQ-Seq is fully capable of quantitatively
detecting these types of alleles with little or no calibration required,
particularly for SNPs or small in-dels. Where needed a simple
quadratic fit can be used to account for small, systematic bias. We
encourage potential users to consider a control mixture to account
for possible skews in observed frequencies for these types of alleles.
Fourth, the sensitivity available is quite reasonable. Our data
allowed us to quantify allele frequencies reliably from 1–99%.
Figure 4. Genotyping of 72 isolates from population F4 atgeneration 150. Genotypes were determined by PCR amplificationand RFLP analysis using enzymes HhaI and HpyAV for the gshA, andfghA alleles, respectively. trfA::ISMex25 alleles were assayed by standardPCR amplification.doi:10.1371/journal.pone.0047959.g004
FREQ-Seq: Cost-Effective Allele Frequency Assay
PLOS ONE | www.plosone.org 7 October 2012 | Volume 7 | Issue 10 | e47959
More sophisticated statistical error models or increased flow cell
capacity can likely increase the sensitivity of FREQ-Seq. Fifth, the
publicly available FREQout analysis package allows for research-
ers to easily recover data and receive a full report of allele
frequency types.
A critical component of any new method is the cost of
implementation, and the associated tradeoff in accuracy and
throughput, relative to existing techniques and technologies. With
this in mind, we explored the cost of FREQ-Seq implementation
compared to existing high-throughput platforms (Fluidigm Acces-
sArrayTM and Raindance ThunderstormTM) and sequencing-
based allele frequency detection methods (Sanger sequencing and
whole genome resquencing). As an initial cost comparison, we
chose to examine two commercial systems from Fluidigm and
Raindance that use a very similar strategy for PCR enrichment
and barcoding of chromosomal loci. FREQ-Seq and these two
platforms both require the use of locus specific oligonucleotide
primers containing long overhangs. Unlike FREQ-Seq, however,
both the Fluidigm and Raindance systems require long barcoded
primers, containing both Illumina adapter sequences, for every
locus of interest. While an unassuming difference, the latter
primers routinely exceed most standard oligonucleotide synthesis
lengths (currently 60 nucleotides at $0.35 US/base, IDT) resulting
in increased synthesis and purification costs (.$1 US/base).
Additionally, as these primers are locus specific, each locus must
have a new set of barcoded primers. To directly compare, each
Fluidigm or Raindance barcoded primer set would cost approx-
imately $3,120 US (48665 nt barcoded primers at approximately
$1 US/base) for a single locus whereas a FREQ-Seq locus primer
set of similar size costs $32 US (1637 nt primer +1654 nt primer
at $0.35 US/base) as all barcoding is accomplished with the
universal bridging primer in a locus-indendent manner. Finally,
comparing equipment costs (assuming equal amounts of PCR
reagents and user time are used) shows that the aforementioned
commercial technologies (not considering annual custom consum-
ables and instrument servicing costs) range between $70,000 to
$100,000 US. In contrast, FREQ-Seq relies only on a standard,
96-well PCR thermal cycler already present in or generally
available to most laboratories. Collectively, FREQ-Seq is two
orders of magnitude less expensive in oligonucleotide costs and
again obviates purchasing an expensive equipment specifically
designed for this purpose. Further, as FREQ-Seq is an open-
source platform, no reagents or equipment are proprietary,
thereby alleviating significant annual service and licensing costs
for small research laboratories and institutions.
Additionally, one may consider alternative strategies such as
traditional Sanger sequencing and whole genome resequencing to
infer allele frequencies in mixed populations. Using the analysis of
an experimentally evolved population demonstrated in Figure 3as a metric, Sanger sequencing of four loci for 22 time points
would result in a cost for a single set of replicates to be $440 US (5
$US/reaction, GeneWiz). However, to achieve any statistical
accuracy, multiple replicate samples for each allele and time point
would need to be run. Moreover, the accuracy generated by
Sanger sequencing is routinely difficult to achieve below 10%
allele abundance [3,9,16]. Likewise, employing the same logic to
whole genome resequencing of mixed population samples using
the Illumina HiSeq platform, approximately 12 samples can be
run per lane to produce an average coverage of 100–200 fold.
Conservatively, this experiment would result in the use of two full
lanes in an Illumina flow cell (costing approximately $2,400 US
using Harvard FAS Core pricing) and would require 22 barcoded
Illumina libraries to be generated (an additional cost of $1,200 US
with TruSeq preparations at $55 US/sample). An immediate
benefit of this strategy is that nearly all chromosomal mutations can
be seen simultaneously. The drawback is that 100–200 fold coverage
results in both substantial sampling error in estimating alleles of
intermediate frequency, and is insufficient to detect alleles at low
frequency, resulting in increased coverage being necessary (with
costs increasing linearly). In contrast, using only ,3% of a single
Illumina flow cell lane ($40 US), FREQ-Seq is able to achieve an
average of 100,000 fold coverage per allele, per time point. If we thus
consider the cost per 1,000 fold coverage, whole genome sequencing
would cost ,$700 US per time point, whereas FREQ-Seq costs
$0.005 US per locus of interest in the population. By focusing on loci
of interest, FREQ-Seq uses the five order of magnitude difference in
cost per coverage to achieve robust quantitative results and sub 1%
frequency detection. Taken together, these comparisons demon-
strate that FREQ-Seq is a cost-effective, quantitative strategy for
localized allele frequency determination.
In the present implementation of FREQ-Seq, we have described
a set of 48 uniquely bar-coded bridging primers to produce
localized Illumina sequencing libraries compatible with either
single-end or paired-end read flow cells. Although currently
designed for single-end reads, changing compatibility to paired-
end flow cells would only require simple modification to the
reverse locus primer to change its 59 overhang, and an identical
change in the primer ‘B’ for enrichment (Table S1). Though
employed here to explore the frequencies of known alleles in a
laboratory evolution experiment, we expect FREQ-Seq to be
useful in other diagnostic applications. Multi-way competitions
between bar-coded strains could be readily performed. Moving out
of the lab entirely, microbial species variation in environmental
samples is entirely feasible. For example, phylogenetic analyses of
16S rRNA from communities could occur with excellent
throughput and no subsequent library preparation. As a result,
FREQ-Seq should be a useful tool to many biomedical and
environmental microbiologists.
Supporting Information
Figure S1 Estimation of PCR-introduced bias in alleleamplification. Allele pntABEVO (circles), gshAEVO (squares), and
fghAEVO (triangles) were amplified using primer pairs AF3/AF4,
AF7/8, and AF9/10, respectively followed by bridging primer
(MLBC1) addition. Final FREQ-Seq products for pntABEVO,
gshAEVO, and fghAEVO were digested with HhaI, AluI, or HpyAV,
respectively. Digested DNA fragments were sized fractionated and
analyzed using a Bioanalyzer 2100 (Agilent) with DNA1000 chips.
Ratios were determined using the relative ratios of integrated peak
areas from the Bioanalyzer trace. Data were analyzed using the
Agilent 2100 Expert software package.
(TIF)
Figure S2 Schematic of the three primer, novel DNAjunction implementation of FREQ-Seq.(TIF)
Box S1 Abbreviated FREQ-Seq protocol.(TIF)
Table S1 FREQ-Seq primer sequences and allele-spe-cific primers used in this study.(DOCX)
Table S2 Sequences of the 48 barcodes contained in theFREQ-Seq kit. All sequences are the N bases in the sequence
AATGATACGGCGACCACCGAGATCTACACTCTTTCCC-
TACACGACGCTCTTCCGATCTNNNNNNGTAAAAC-
GACGGCCAGT.
(DOCX)
FREQ-Seq: Cost-Effective Allele Frequency Assay
PLOS ONE | www.plosone.org 8 October 2012 | Volume 7 | Issue 10 | e47959
Table S3 Genotypes of 72 isolates from population F4 atgeneration 150. An (X) indicates the presence of an evolved
allele (EVO) as determined by RFLP or differential PCR
amplication.
(DOCX)
Table S4 Bacterial strains used in this study and theirrelevant genotypes.
(DOCX)
Text S1
(DOCX)
Acknowledgments
We would like to thank Christian Daly, Jiangwen Zhang and Claire
Reardon for their technical assistance in Illumina sequencing. Additionally,
we thank Andres Benitez and Gerda Saxer for technical feedback on
FREQ-Seq implementation, Sean M. Carroll for helpful comments on
nomenclature, and Deepa Agashe and members of the Marx laboratory for
helpful comments on the manuscript.
Author Contributions
Conceived and designed the experiments: LMC ML CJM. Performed the
experiments: LMC ML. Analyzed the data: LMC ML NFD CJM.
Contributed reagents/materials/analysis tools: LMC ML NFD. Wrote the
paper: LMC ML NFD CJM.
References
1. Futuyma DJ (2009) Evolution: Sinauer Associates.
2. Wichman HA, Badgett MR, Scott LA, Boulianne CM, Bull JJ (1999) Different
trajectories of parallel evolution during viral adaptation. Science 285: 422–424.
3. Kvitek DJ, Sherlock G (2011) Reciprocal sign epistasis between frequentlyexperimentally evolved adaptive mutations causes a rugged fitness landscape.
PLoS Genet 7: e1002056.
4. Barrick JE, Yu DS, Yoon SH, Jeong H, Oh TK, et al. (2009) Genome evolutionand adaptation in a long-term experiment with Escherichia coli. Nature 461:
1243–1247.
5. Conrad TM, Joyce AR, Applebee MK, Barrett CL, Xie B, et al. (2009) Whole-genome resequencing of Escherichia coli K-12 MG1655 undergoing short-term
laboratory evolution in lactate minimal media reveals flexible selection of
adaptive mutations. Genome Biol 10: R118.
6. Barrick JE, Lenski RE (2009) Genome-wide mutational diversity in an evolving
population of Escherichia coli. Cold Spring Harb Symp Quant Biol 74: 119–129.
7. Woods R, Schneider D, Winkworth CL, Riley MA, Lenski RE (2006) Tests ofparallel molecular evolution in a long-term experiment with Escherichia coli. Proc
Natl Acad Sci U S A 103: 9107–9112.
8. Woods RJ, Barrick JE, Cooper TF, Shrestha U, Kauth MR, et al. (2011)Second-order selection for evolvability in a large Escherichia coli population.
Science 331: 1433–1436.
9. Pena MI, Davlieva M, Bennett MR, Olson JS, Shamoo Y (2010) Evolutionaryfates within a microbial population highlight an essential role for protein folding
during natural selection. Mol Syst Biol 6: 387.
10. Norton N, Williams NM, Williams HJ, Spurlock G, Kirov G, et al. (2002)Universal, robust, highly quantitative SNP allele frequency measurement in
DNA pools. Hum Genet 110: 471–478.
11. Monsion B, Duborjal H, Blanc S (2008) Quantitative Single-letter Sequencing: amethod for simultaneously monitoring numerous known allelic variants in single
DNA samples. BMC Genomics 9: 85.
12. Kirov G, Nikolov I, Georgieva L, Moskvina V, Owen MJ, et al. (2006) PooledDNA genotyping on Affymetrix SNP genotyping arrays. BMC Genomics 7: 27.
13. Papadopoulos D, Schneider D, Meier-Eiss J, Arber W, Lenski RE, et al. (1999)
Genomic evolution during a 10,000-generation experiment with bacteria. ProcNatl Acad Sci U S A 96: 3807–3812.
14. Schneider D, Duperchy E, Coursange E, Lenski RE, Blot M (2000) Long-term
experimental evolution in Escherichia coli. IX. Characterization of insertionsequence-mediated mutations and rearrangements. Genetics 156: 477–488.
15. Wilkening S, Hemminki K, Thirumaran RK, Bermejo JL, Bonn S, et al. (2005)
Determination of allele frequency in pooled DNA: comparison of three PCR-based methods. Biotechniques 39: 853–858.
16. Gresham D, Desai MM, Tucker CM, Jenq HT, Pai DA, et al. (2008) The
repertoire and dynamics of evolutionary adaptations to controlled nutrient-limited environments in yeast. PLoS Genet 4: e1000303.
17. Lee HH, Molla MN, Cantor CR, Collins JJ (2010) Bacterial charity work leads
to population-wide resistance. Nature 467: 82–85.
18. Herbon N, Werner M, Braig C, Gohlke H, Dutsch G, et al. (2003) High-
resolution SNP scan of chromosome 6p21 in pooled samples from patients with
complex diseases. Genomics 81: 510–518.
19. Werner M, Sych M, Herbon N, Illig T, Konig IR, et al. (2002) Large-scale
determination of SNP allele frequencies in DNA pools using MALDI-TOF mass
spectrometry. Hum Mutat 20: 57–64.
20. Forshew T, Murtaza M, Parkinson C, Gale D, Tsui DW, et al. (2012)
Noninvasive identification and monitoring of cancer mutations by targeted deep
sequencing of plasma DNA. Sci Transl Med 4: 136ra168.
21. Pekin D, Skhiri Y, Baret J-C, Le Corre D, Mazutis L, et al. (2011) Quantitative
and sensitive detection of rare mutations using droplet-based microfluidics. Lab
on a Chip 11: 2156–2166.22. Marx CJ, Chistoserdova L, Lidstrom ME (2003) Formaldehyde-detoxifying role
of the tetrahydromethanopterin-linked pathway in Methylobacterium extorquens
AM1. J Bacteriol 185: 7160–7168.
23. Chou HH, Chiu HC, Delaney NF, Segre D, Marx CJ (2011) Diminishingreturns epistasis among beneficial mutations decelerates adaptation. Science
332: 1190–1192.
24. Chou HH, Berthet J, Marx CJ (2009) Fast growth increases the selectiveadvantage of a mutation arising recurrently during evolution under metal
limitation. PLoS Genet 5: e1000652.25. Delaney NF, Rojas Echenique JI, Marx CJ (2012) Clarity: An open source
manager for laboratory automation. J Lab Autom doi:10.1177/
2211068212460237.26. Gibson DG, Young L, Chuang RY, Venter JC, Hutchison CA, 3rd, et al (2009)
Enzymatic assembly of DNA molecules up to several hundred kilobases. NatMethods 6: 343–345.
27. Hahne F, LeMeur N, Brinkman RR, Ellis B, Haaland P, et al. (2009) flowCore:a Bioconductor package for high throughput flow cytometry. BMC Bioinfor-
matics 10: 106.
28. Krueger F, Andrews SR, Osborne CS (2011) Large scale loss of data in low-diversity illumina sequencing libraries can be recovered by deferred cluster
calling. PLoS One 6: e16607.29. Chou HH, Marx CJ (2012) Optimization of Gene Expression through Divergent
Mutational Paths. Cell Reports 1: 133–140.
30. Lee MC, Marx CJ (2012) Repeated, selection-driven genome reduction ofaccessory genes in experimental populations. PLoS Genet 8: e1002651.
31. Kao KC, Sherlock G (2008) Molecular characterization of clonal interferenceduring adaptive evolution in asexual populations of Saccharomyces cerevisiae. Nat
Genet 40: 1499–1504.32. Hill WG, Robertson A (2007) The effect of linkage on limits to artificial
selection. Genet Res 89: 311–336.
33. Fogle CA, Nagle JL, Desai MM (2008) Clonal interference, multiple mutationsand adaptation in large asexual populations. Genetics 180: 2163–2173.
34. Notley-McRobb L, Ferenci T (1999) The generation of multiple co-existing mal-regulatory mutations through polygenic evolution in glucose-limited populations
of Escherichia coli. Environ Microbiol 1: 45–52.
35. Gerrish PJ, Lenski RE (1998) The fate of competing beneficial mutations in anasexual population. Genetica 102–103: 127–144.
36. Helling RB, Vargas CN, Adams J (1987) Evolution of Escherichia coli duringgrowth in a constant environment. Genetics 116: 349–358.
37. Thomas CM, Meyer R, Helinski DR (1980) Regions of broad-host-range
plasmid RK2 which are essential for replication and maintenance. J Bacteriol141: 213–222.
38. Lee MC, Chou HH, Marx CJ (2009) Asymmetric, bimodal trade-offs duringadaptation of Methylobacterium to distinct growth substrates. Evolution 63: 2816–
2830.39. Lenski RE, Travisano M (1994) Dynamics of adaptation and diversification: a
10,000-generation experiment with bacterial populations. Proc Natl Acad
Sci U S A 91: 6808–6814.40. Desai MM, Fisher DS, Murray AW (2007) The speed of evolution and
maintenance of variation in asexual populations. Curr Biol 17: 385–394.41. Lenski RE, Rose M R., Simpson, S C., and Tadler, S C. (1991) Long-term
experimental evolution in Escherichia coli. I. Adaptation and divergence during
2,000 generations. Am Nat 138: 1315–1341.
FREQ-Seq: Cost-Effective Allele Frequency Assay
PLOS ONE | www.plosone.org 9 October 2012 | Volume 7 | Issue 10 | e47959
Table S1. FREQ-Seq primer sequences and allele-specific primers used in this study.
Name Sequence Description FO GTAAAACGACGGCCAGT Forward/sequencing end overhang RO CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT Single-end read reverse overhang PRO AAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT Paired-end read reverse overhang ABC1 AATGATACGGCGACCAC Bar code amplification ABC2 ACTGGCCGTCGTTTTAC Bar code amplification AF1 AATGATACGGCGACCAC FREQ-Seq enrichment AF2 CAAGCAGAAGACGGCATAC FREQ-Seq enrichment AF3 FO+CAGATCTGAACTTCCCAGCA pntAB forward AF4 RO+GACCCCAGACCTATGAACTTC pntAB reverse AF7 FO+ACGCTGCAAGAGTGAACAAC gshA forward AF8 RO+TGAGATCGATCTGCTGGGTC gshA reverse AF9 FO+CTAGAGTTCCACGACTTGACAG fghA forward AF10 RO+CATTTTCATGCGTGCAGGTC fghA reverse AF11 FO+GATGCTGCGACCGAGATT icuAB forward AF12 FO+ATCCACTGCCCTCGTGAATA icuAB::ISMex4 forward AF13 RO+AGTTCCTCCAGCTCAACTGC icuAB reverse MLBC1 AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCT
TCCGATCTGAGAGAGTAAAACGACGGCCAGT Control bridge adapter
S11 GACTTCCGGCAAGCTATACG ISMex25 insertion forward S12 CCCGCAAGGAGGGTGAATG ISMex25 insertion reverse
Table S2. Sequences of the 48 barcodes contained in the FREQ-Seq kit. All sequences are the N bases in the sequence AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNNNNGTAAAACGACGGCCAGT.
Name Sequence Name Sequence BC01 AGCAAT BC25 TGACGA BC02 CCTGTT BC26 CAAATA BC03 GGGTTT BC27 GTTCAG BC04 GAAGGC BC28 CTTCAA BC05 ATCTCA BC29 GTTGGG BC06 ATGGAT BC30 GCTTAG BC07 ATGTCT BC31 TAGCCA BC08 CGTGAC BC32 TAACTT BC09 TTAGGT BC33 CGGATA BC10 GTGCAT BC34 CAGCAG BC11 AACTTT BC35 AAGTAG BC12 GGATCG BC36 GGGACG BC13 ATAAGG BC37 CCGTGG BC14 ATTGGT BC38 ATTGTA BC15 AGTGAG BC39 TTTAGA BC16 CCCACC BC40 CCACGA BC17 CGATGC BC41 TCATGG BC18 GATAGC BC42 GAACCA BC19 GTCAGA BC43 TCCTAA BC20 TTAAGC BC44 CAACGC BC21 AACCTG BC45 AGTGTT BC22 CTTTGC BC46 GGATTA BC23 TGGAGA BC47 TATATA BC24 AATTGT BC48 GTACAA
Table S3. Genotypes of 72 isolates from population F4 at generation 150. An (X) indicates the presence of an evolved allele (EVO) as determined by RFLP or differential PCR amplication.
F4, G150 isolate
trfA::ISMex25 fghAEVO gshAEVO
CM3317 CM3319 CM3320 CM3330 CM3338 CM3343 CM3344 CM3345 CM3350 CM3359 CM3365 CM3368 CM3370 CM3373 CM3307 X CM3312 X CM3318 X CM3336 X CM3355 X CM3305 X X CM3308 X X CM3327 X X CM3335 X X CM3339 X X CM3348 X X CM3352 X X CM3362 X X CM3363 X X CM3364 X X CM3322 X CM3354 X CM3303 X CM3304 X CM3306 X CM3309 X CM3310 X CM3311 X CM3313 X CM3314 X CM3315 X CM3316 X CM3321 X CM3323 X
CM3324 X CM3325 X CM3326 X CM3328 X CM3329 X CM3331 X CM3332 X CM3333 X CM3334 X CM3337 X CM3340 X CM3341 X CM3342 X CM3346 X CM3347 X CM3349 X CM3351 X CM3356 X CM3357 X CM3358 X CM3360 X CM3361 X CM3366 X CM3367 X CM3371 X CM3372 X CM3374 X CM3353 X X CM3369 X X
Table S4. Bacterial strains used in this study and their relevant genotypes.
Strain Relevant genotype or characteristics Reference NEB10β E. coli recA1 endA1 rpsL Φ80(lacZΔM15) NEB CM502 M. extorquens crtI502 [S2] CM701 M. extorquens ΔmptG / pCM410 [23] CM1145 M. extorquens ΔmptG crtI502 gshAEVO pntABEVO icuABEVO/ pCM410.1145 [fghAEVO] [23] CM1175 M. extorquens ΔkatA::[loxP-trrnB-PtacA-mCherry-tT7] [37] CM1290 M. extorquens ΔmptG crtI502 pntAEVO
[23] CM1298 M. extorquens ΔmptG crtI502 gshAEVO [23] CM3137a M. extorquens CM1290 mptGWT This work CM3138a M. extorquens CM1298 mptGWT This work CM3277 M. extorquens CM502 / pCM410.1145 [fghAEVO] This work CM3943 M. extorquens CM1175 / pCM410 This work
a. The wild-type mptG allele was restored into the respective strains using pCM436 following the sucrose counter-selection method described by C.J. Marx [S2]
Supplementary Text
Text S1: FREQ-Seq protocol
Primer design
To implement FREQ-Seq, users must design a single set of oligonucleotide primers for each experimental locus. Using the sequences provided in Table S1, primers for single-end read flow cells are designed as below:
Sequencing end
5’-GTAAAACGACGGCCAGT + ~20 bp primer sequence homologous to region adjacent to mutation
Non-sequencing end
5’- CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT + ~20 bp primer sequence homologous to region ~250 bp downstream of the mutation
Bridging primer synthesis
Synthesis of the FREQ-Seq bar-coded, bridging primers can be performed in house using a thermal cycler and standard PCR reagents with any polymerase that generates blunt-ended products, such as Pfu, DeepVentR™ or Phusion™. Example reaction conditions using Phusion™ polymerase are as follows:
DNA template: FREQ-Seq plasmid (~ 1-10 pg)
Oligo primers: ABC1 and ABC2 (0.5 µM each)
Cycling conditions:
98 °C – 60 s Cycle 30x: 98 °C – 20 s 52 °C – 15 s 72 °C – 5 s 4 °C – <3 hrs or freeze
We have tried three methods for purification of the dsDNA briding primers with success:
1. 2% agarose/1X TAE separation followed by gel extraction using a ZR-96 Zymo gel recovery kit. This provides pure bridge adapter preparations and is compatible with preparing large numbers adapter simultaneously.
2. 10% polyacrylamide/1X TAE separation followed by crush and soak (overnight) extraction, and EtOH precipitation of DNA products. This results in the purest bridge primer preparations, though not typically required.
3. Purification using Ampure XP™ magnetic beads. Best results are achieved if PCR reactions are treated with DpnI to eliminate plasmid contaminants prior to purification. The most rapid cleanup, but results in less pure bridge adapter preparations.
FREQ-Seq library preparation
The FREQ-Seq library preparation requires two stages of PCR amplification. Reactions can be performed with any blunt-end producing DNA polymerase.
Stage 1: Locus amplification
1. For microbial populations, cells can be spun down and resuspended in water or other suitable lysis buffer. Following resuspension, cells are boiled at 98-100 °C in a thermal cycler or appropriate water bath. Cell debris can then be removed by centrifugation. The supernatant contains a crude mixture of microbial DNA representative of the original population.
2. Desired loci are amplified in a 50 µL reaction using the locus specific FREQ-Seq primers with conditions:
a. 5-10 µL of crude DNA template from 1. b. 0.5 µM of each primer, 200 µM dNTP, and appropriate buffering
conditions c. PCR thermal cycling: 10-15X cycles using conditions optimized for the
locus specific FREQ-Seq primer set. 3. Primers from this first reaction are removed by:
a. Treatment with 5U of exonuclease I (ExoI) and 1U of DpnI for 30 minutes at 37 °C and 20 minutes at 80 °C to heat inactivate these enzymes. DpnI helps remove methylated chromosomal DNA templates. This reaction can be performed in the PCR reaction buffer. Fastest for large sample numbers.
b. Purification using Ampure XP™ magnetic beads, eluting in 27 µL of ddH2O.
Stage 2: Bar-coding and library purification
1. For amplified loci, samples can be bar-coded as follows: a. For samples treated as 1.3a (above), 25 µL of the ExoI/DpnI treated
reaction can be mixed with 25 µL of fresh PCR master mix at 1X concentration and enrichment primers at 2X concentrations (ex. 0.1 µM Primer A and Primer B).
b. For samples treated as 1.3b (above), 25 µL of the purified reaction can be mixed with 25 µL of fresh PCR master mix at 2X concentration and enrichment primers at 2X concentrations (ex. 0.1 µM Primer A and Primer B).
2. To each reaction, add 1/10th the molar equivalent of a single FREQ-Seq bridging primer to a single reaction (ex. for ~50 µL reaction add 10-20 ng of dsDNA FREQ-Seq primer).
3. PCR amplify the mixture under the following conditions: a. Cycling conditions: b. 98 °C – 60 s c. 98 °C – 20 s d. 53 °C – 20 s e. 72 °C – 30 s f. Cycle c-e 10-15X depending on the amplicon g. 4 °C – <3 hrs or freeze
4. The resulting reactions can be treated with QIAGEN Buffer PB (5M Guanidine-HCl (aq), 30% Isopropyl alcohol) to stop polymerase activity.
5. Pool samples according to the loci of interest, followed by purification using QIAGEN columns or generic silica gel equivalents. This is a complete FREQ-Seq library.
6. (Optional) To ensure FREQ-Seq product size and purity, products can be run on a 2% agarose/1X TAE gel followed by standard gel extraction.
7. Determine the concentration of each allele pool. 8. Dilute and mix all allele pools together to give a final concentration of 10-20 nM
depending on your sequencing facility requirements.
Text S2: Description of the FREQout Parsing Steps
Freq-Out is the companion program to this method which takes as input the FASTQ files from an Illumina run and produces as output the frequency of each type as different barcodes. We briefly described the steps in this process here, but more information and the source code is available at www.evolvedmicrobe.com/FreqSeq/index.html.
The FREQout Data Parsing Process
1. Specify Settings: Users specify the barcodes and allele types they are using in an XML file. FREQout uses the settings in this file to parse the FASTQ files, and for each read either assign it to a specific barcode and allele combination, or discard it if the match is poor or certain quality controls are not passed.
2. Determine Quality Filters: The most computationally intense portion of Freq-Out is the task of assigning reads that are not exact matches to known barcode and allele combinations. The program optionally allows users to avoid performing this step for low quality data. This not only allows for faster parsing but also can avoid poor quality matches. Unfortunately, defining low quality data is complicated because there are several variants of the way FASTQ files encode quality scores, although a user typically does not know which format any particular file is in. To circumvent this issue, FREQout instead estimates the distribution of the average quality score for reads in a FASTQ file by parsing a user specified number of reads from the beginning of the file. If the user sets a quality score percentile cut-off, FREQout estimates the average read quality score at this percentile using this empirical distribution, and does not attempt to assign any reads that are below this threshold if they are not an exact match. By default, FREQout will parse the first 5,000 reads to estimate the quality score, and not align any read with an average quality score below the bottom 2% of the quality scores in these initial reads.
3. Parse the Reads – FREQout next assigns, in parallel, each read to either a barcode/allele combination, or determines that the read cannot be assigned. For each read it performs the following steps.
a. Filters the read based on the presence of the M13f sequence that should be in every read produced by the FreqSeq method. However, this step is skipped if the user does not require this sequence. If they do (this is the default option), the read is immediately excluded if an exact match is not found unless they also specify the option to allow reads within a Hamming distance of one. If so, the read can still be accepted if it has an M13f sequence that is within a Hamming distance of 1 to the actual sequence.
b. Assigns the read to a barcode – If the read contains an exact match to a barcode it is simply assigned to that group. If not, and if the user has specified the option to allow inexact matches, the read is assigned to a barcode group if the 6 bp barcode sequence is within a Hamming distance of 1 to that barcode (this is the default option provided all barcodes are separated by a Hamming distance over two, which is true for all barcodes considered in this paper).
c. Assign the read to an allele type - If the read contains an exact match to an allele it is assigned to that group. If not, and if the user has set the file to allow for inexact matches the following steps are taken.
i. Determine if the read passes the quality threshold set in step 2, and also the user specified maximum percentage of “N” basepairs to be considered for inexact alignment.
ii. If the read passes (i), then a list of candidate allelic types is aligned against the sequence to determine if any are significant matches. First all 7 bp K-mers in the read (after trimming the barcode and M13f sequence) are queried against a dictionary of all 7 bp K-mers made from the available allelic types set by the XML file. For each allelic type that matches a K-mer, a Smith-Waterman alignment is done between the type and the sequence from the read. The parameters for the alignment are set in the XML file. If only one allelic type is matched, that allelic type is assigned. If more than two alignments are performed, the allelic type with the highest scoring alignment is assigned, provided that the difference between the highest scoring alignment and the next highest scoring alignment is greater than the Smith-Waterman match score plus the absolute value of the mismatch score (to ensuring the difference is at least equivalent to one SNP in an ungapped alignment).
iii. Update the list of the relevant barcode/allele group with the average quality score for that read.
4. Output the Results – FREQout next outputs the results to a comma separated value (csv) file. The output includes the number of reads that were parsed and excluded at each quality step, as well as the number that were assigned as exact or inexact matches and relevant quality scores for each group. Frequencies of each type for each barcode are derived from these quantities.
Supplemental Methods
Allele frequencies determined by restriction fragment length polymorphism. The FREQ-Seq libraries were first amplified by allele specific primers with M13f and Illumina B overhangs (15 cycles). The PCR products were purified using a QIAGEN PCR purification kit. Purified products were further PCR amplified in a second step using the universal primers (AF1 and AF2) with a small amount (0.02 µM) of barcode primer (MLBC1) (15 cycles). For small indel alleles (gshA and fghA), we applied reconditioning PCR to reduce the heteroduplex structure of amplified products in each step of PCR (Supplemental Reference 1). These additional PCR amplifications were run for only 2 cycles with excess primers, using the previous 15-cycle PCR product as the template. Endonuclease digestion reactions were carried out with specific enzymes for each allele and purified by QIAGEN MinElute kit. The alleles pntAB, gshA, and fghA were digested with enzymes AluI, HhaI, and HpyAV (New England Biolabs), respectively. The purified digests were then run on an Agilent 2100 Bioanalyzer using the DNA1000 kit. The frequency of each allele was estimated using the Bioanalyzer 2100 Expert software. Supplemental References 1. Thompson, J.R., Marcelino, L.A. and Polz, M.F. (2002) Heteroduplexes in mixed-
template amplifications: formation, consequence and elimination by 'reconditioning PCR'. Nucleic Acids Res, 30, 2083-2088.
2. Marx, C.J. (2008) Development of a broad-host-range sacB-based vector for unmarked allelic exchange. BMC Res Notes, 1, 1.