FREQ-Seq: A Rapid, Cost-Effective, Sequencing-Based Method ... · FREQ-Seq: A Rapid, Cost-Effective, Sequencing-Based Method to Determine Allele Frequencies Directly from Mixed Populations

FREQ-Seq: A Rapid, Cost-Effective, Sequencing-BasedMethod to Determine Allele Frequencies Directly fromMixed PopulationsLon M. Chubiz1., Ming-Chun Lee1.¤, Nigel F. Delaney1, Christopher J. Marx1,2*

1 Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, Massachusetts, United States of America, 2 Faculty of Arts and Sciences Center for

Systems Biology, Harvard University, Cambridge, Massachusetts, United States of America

Abstract

Understanding evolutionary dynamics within microbial populations requires the ability to accurately follow allelefrequencies through time. Here we present a rapid, cost-effective method (FREQ-Seq) that leverages Illumina next-generation sequencing for localized, quantitative allele frequency detection. Analogous to RNA-Seq, FREQ-Seq relies uponcounts from the .105 reads generated per locus per time-point to determine allele frequencies. Loci of interest are directlyamplified from a mixed population via two rounds of PCR using inexpensive, user-designed oligonucleotides and a bar-coded bridging primer system that can be regenerated in-house. The resulting bar-coded PCR products contain theadapters needed for Illumina sequencing, eliminating further library preparation. We demonstrate the utility of FREQ-Seq bydetermining the order and dynamics of beneficial alleles that arose as a microbial population, founded with an engineeredstrain of Methylobacterium, evolved to grow on methanol. Quantifying allele frequencies with minimal bias down to 1%abundance allowed effective analysis of SNPs, small in-dels and insertions of transposable elements. Our data reveal large-scale clonal interference during the early stages of adaptation and illustrate the utility of FREQ-Seq as a cost-effective toolfor tracking allele frequencies in populations.

Citation: Chubiz LM, Lee M-C, Delaney NF, Marx CJ (2012) FREQ-Seq: A Rapid, Cost-Effective, Sequencing-Based Method to Determine Allele Frequencies Directlyfrom Mixed Populations. PLoS ONE 7(10): e47959. doi:10.1371/journal.pone.0047959

Editor: Jurg Bahler, University College London, United Kingdom

Received July 31, 2012; Accepted September 17, 2012; Published October 31, 2012

Copyright: � 2012 Chubiz et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permitsunrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work has been supported by funding from National Science Foundation (DEB-0845893) and National Institutes of Health (GM078209). The fundershad no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: [email protected]

. These authors contributed equally to this work.

¤ Current address: Department of Biochemistry, The University of Hong Kong, Pokfulam, Hong Kong

Introduction

The textbook definition of evolution is the change in frequency

of alleles in a population over time [1]. While this extremely simple

definition provides a clear picture of the evolutionary process, the

ability to actually detect and track the frequency of known

mutations in populations had previously been limited to work on

phage [2], and remains surprisingly challenging for microbes.

Identifying evolved alleles by applying next-generation sequencing

technologies either to isolates [3,4,5] or mixed-population samples

[6] has revolutionized the discovery of polymorphisms that occur

during adaptation. However, this information does not often yield

quantitative allele frequency information. The classic method for

determining allele frequencies - examination of a large number of

isolates from a population [7,8,9] - is extremely time- and

resource-intensive. Several methods exist that allow accurate

determination of allelic frequencies directly from population

samples, but these either require substantial material and

personnel cost and/or have limited sensitivity and accuracy

[10,11,12]. For these reasons, a reliable, high-throughput pipeline

for allele frequency detection in natural and laboratory popula-

tions will allow a more extensive understanding of evolutionary

processes.

Considering the equipment present in most molecular biological

laboratories, there exist only a handful of techniques available to

researchers to determine the alleles present in isolates, or their

population frequencies directly from mixed samples. The simplest

and most time-honored of these methods is restriction fragment-

length polymorphism (RFLP) analysis. Although generally quite

accurate, this technique requires the polymorphisms under

investigation to change restriction endonuclease cleavage sites or

grossly change the sequence length [13,14]. This limits application

to a small subset of mutations. Additionally, a number of

quantitative PCR-based (qPCR) methods have shown promise

though their implementation requires significant optimization and

calibration [3,15]. Lastly, and arguably the most straight-forward

method, has been the use of Sanger sequencing to determine

allelic frequencies in mixed DNA samples [3,9,16]. The chief

drawbacks of this method are that accuracy and detection limits

are highly variable requiring a large number of replicate samples.

In recent years, a number of commercial systems have been

developed to quantitate allele frequencies [17,18,19,20,21]. The

Sequenom iPLEXH platform utilizes differential PCR extension

coupled with matrix assisted laser desorption/ionization – time of

flight (MALDI-TOF) mass spectrometry to infer relative allele

frequencies in a given pooled sample. Alternatively, the Fluidigm

PLOS ONE | www.plosone.org 1 October 2012 | Volume 7 | Issue 10 | e47959

Access ArrayTM and Raindance ThunderstormTM systems use

proprietary, dedicated microfluidic thermal cyclers for targeted

PCR enrichment of variable loci followed by deep-sequencing

using any number of next-generation sequencing platforms.

Additionally, use of the Access ArrayTM and ThunderstormTM

systems require synthesis and re-ordering of pairs of individual,

long oligonucleotides for each allele probed. Although these

methods have been successfully applied, the substantial equipment

and/or supplies costs pose barriers to most academic end-users.

In this work, we have developed a rapid, cost-effective strategy

for localized, Illumina-based allele ‘‘frequency sequencing’’, or

‘‘FREQ-Seq’’. The basic rationale of FREQ-Seq is analogous to

RNA-Seq: allele frequencies (rather than transcript levels) are

determined by counting DNA sequence reads. FREQ-Seq utilizes

two rounds of PCR that take advantage of the inclusion of a small

amount of a long, bridging primer that can be produced and re-

generated in-house at low cost in order to introduce sample-

specific bar-codes (Figure 1). Similar to Access ArrayTM and

ThunderstormTM, this method applies PCR enrichment and next-

generation sequencing platforms to estimate allele frequencies, but

in our case, without the expense of individual long oligonucleotides

for each locus, nor the need for additional instrumentation. The

resulting PCR products require no further library preparation for

sequencing, and are simply pooled together for analysis using

either Illumina HiSeq or Genome Analyzer platforms. The

resulting libraries, introduced at just 5% of an Illumina HiSeq

control lane, produced an average of 147,000 reads per sample.

With a custom, publicly available analysis software package we

have developed, users can easily extract and interpret experimen-

tal results.

We validated and demonstrated the utility of the FREQ-Seq

system by determining the frequency of four beneficial mutations

that arose during adaptation of an engineered Methylobacterium

extorquens AM1 strain that requires a heterologous pathway for

formaldehyde oxidation [22]. We examined both i) controlled

cellular ratios of each allele and ii) cryogenically stored population

time-point samples. This engineered strain had been evolved for

900 generations in the laboratory in a minimal medium containing

methanol as a sole carbon source [23]. Use of these allele

frequency data allowed for direct comparison of observed fitness

increases to the rapid rise and selective sweep of these mutations.

These data revealed a period of clonal interference that

temporarily reversed the rise of alleles that would later move to

fixation. For simple polymorphisms, like SNPs and small in-dels

we found calibration was unnecessary, whereas analysis of more

complex allele types such as .10 bp deletions benefit from

correction via control ratios and a simple model fit. Finally, we

demonstrate that a modified primer strategy allows FREQ-Seq to

be used to analyze new DNA junctions such as novel transposable

element insertions which have been repeatedly observed in this

[24] and other evolution experiments [13,14].

Materials and Methods

Strains, culturing and automated growth ratemeasurements

All strains used are derivatives of Methylobacterium extorquens AM1

with relevant genotypes listed in Table S4. Bacterial cultures were

routinely grown in ‘Hypho’ minimal medium exactly as described

by Chou et al. [24], supplemented with 3.5 mM succinate or

15 mM methanol, unless otherwise described, and was performed

at 30uC with aeration. For growth rate measurements, 5 mL frozen

population cultures were inoculated in 640 mL medium supple-

mented with 13.125 mM methanol and 0.4375 mM succinate

(ratio of carbon from M:S = 7:1) in 48-well micotiter plates

(Costar) and grown to saturation to acclimate to growth on

methanol. Following acclimation, saturated cultures were trans-

ferred with a 1:64 dilution in fresh Hypho medium with 20 mM

methanol and placed on a plate shaking tower (Caliper) at 30uC in

a humidified environmental room. Optical densities were obtained

every 2 hrs on a Wallac Victor 2 plate reader (Perkin-Elmer) until

cultures reached saturation using an automated measurement

system [25]. Growth rates were determined by fitting an

exponential growth model using custom analysis software, Curve

Fitter (Delaney et al., unpublished; http://www.evolvedmicrobe.

com/Software.html), with a minimum of 4 replicates.

Construction of a bar-coded adapter plasmid libraryThe entirety of the adapter library was assembled in a vector

backbone for ease of storage and routine amplification. To

generate an appropriate plasmid backbone, the pUC19 vector

(New England Biolabs) was PCR amplified using primers

TCGGTGGTCGCCGTATCATTTTAATTGCGTTGCGCT-

CACTG and AGAGTAAAACGACGGCCAGTTACG-

CATCTGTGCGGTATTTC, and Phusion DNA polymerase

(New England Biolabs, NEB) under manufacturer recommended

conditions, to generate a linear DNA fragment containing only the

replication origin and ampicillin resistance marker. The absence of

the lacZa multiple cloning site ensures no interference with

downstream PCR amplification using the M13f priming site. The

resulting PCR reaction was treated with 20 units of DpnI

restriction endonuclease (NEB) for 1 hr at 37uC to remove the

pUC19 plasmid template prior to purification of the linear DNA

product. The bridging primer was first synthesized and PAGE

purified (Integrated DNA Technologies, IDT) using the template

AATGATACGGCGACCACCGAGATCTACACTCTTTCCC-

TACACGACGCTCTTCCGATCTNNNNNNGTAAAAC-

GACGGCCAGT, where N signifies random nucleotides com-

prising the unique bar code. The subsequent 81 nt ssDNA

oligonucleotide was then converted into dsDNA by five cycles of

PCR using the primer ACTGGCCGTCGTTTTAC and Phusion

DNA polymerase under standard reaction conditions with a 5 s

extension time. Presence of dsDNA following PCR was verified by

10% Acylamide/1X TAE electrophoresis and SybrSafe staining.

The resulting DNA was then purified using phenol extraction and

ethanol precipitation.

To create the plasmid-borne bar-code library, 100 ng of the

linear pUC19 PCR product as well as an equimolar amount of the

81 bp dsDNA Illumina-M13f bar-code were mixed and assembled

into circular plasmid DNA using the one-step method developed

by Gibson and co-workers [26]. Approximately 20 ng of the

resulting assembled DNA was then transformed into NEB10b cells

using the 5 minute transformation protocol (NEB) to eliminate

sibling clones, followed by plating of 100 ml aliquots onto SOB

plates containing ampicillin (100 mg/mL) and X-Gal (80 mg/mL).

The resulting white, single colonies were inoculated into 1 mL of

SOB with ampicillin (100 mg/mL), grown overnight at 37uC in

deep, 96-well plates with shaking at 320 rpm, and stored at

280uC after the addition of DMSO to 8% final concentration.

Approximately, 192 clones were verified by Sanger sequencing

(MWG Operon) and a resulting set of 48 bar-coded adapters were

picked based on a minimum Hamming distance of 2.

The entire FREQ-Seq barcode library is publicly available

through Addgene.org and available individually or as a full kit of

48 plasmids (www.addgene.org/Christopher_Marx/).

FREQ-Seq: Cost-Effective Allele Frequency Assay


Figure 1. Illustration of the FREQ-Seq method. A.) The bridging primers can be produced in house using standard PCR and purification.Individual, uniquely bar-coded bridging primers are amplified from the FREQ-Seq plasmid library set using two primers ABC1 (AATGATACGGC-GACCAC) and ABC2 (ACTGGCCGTCGTTTTAC). The resulting bridging primers are stored as dsDNA at 220uC. B.) Construction of FREQ-Seq Illuminalibraries is conducted in two stages. A first round of PCR amplification is used to generate an allele-specific, representative fragment pool withproduct inserts approximately 250 bp in size. Each product at this stage contains an M13f sequence at the sequencing end and an Illumina B adaptersequence at the non-sequencing end. These products are then treated with exonuclease I or Ampure XP beads to remove remaining PCR primers. Ina second stage of PCR amplification, individual samples are barcoded and enriched using three primers; a small amount of a FREQ-Seq bridging



Amplication and purification of the bar-coded Illumina-M13f bridging primers

Production of dsDNA Illumina-M13f bridging primers for

sample bar-coding was performed by PCR followed by agarose gel

extraction. To amplify a given adapter from the plasmid adapter

clone library, PCR reactions were formulated with primers ABC1

and ABC2 (Table S1) using Phusion DNA polymerase and cycled

under standard conditions with a 5 s extension time. The resulting

products were purified by 2% agarose gel electrophoresis and gel

extraction (Zymo ZR-96 kit) according to manufacturer’s instruc-

tion. Illumina-M13f bridging primers were then verified for

correct size and purity using an Agilent Bioanalyzer.

Allele frequency estimates as determined by flowcytometry

Allele frequency mixtures were generated by growing each

isogenic allele type found in Table S4 to late-log phase.

Subsequent mixtures were made using cell cultures diluted to

OD600 = 0.15. Ratios of cell types were measured using a BD

LSRII flow cytometer. Estimates of allele frequency were

generated using the FlowCore BioConductor package in R to

gate fluorescent and non-fluorescent cell types [27].

Allele specific library generation, sequencing andfrequency determination

Allele-specific, Illumina-compatible FREQ-Seq sequencing

libraries are generated via a two-step procedure outlined in

Figure 1B. A basic overview of the protocol is provided in BoxS1. Additional workflow examples may be found in the online

Text S1. In the initial step, the region containing genetic variation

(SNPs or deletions) is amplified by PCR using allele-specific

primers containing overhangs shown in Figure 1B and listed in

Table S1. These PCR products thereby contain the region of

interest flanked by the non-sequencing Illumina adapter sequence

on one end (Illumina ‘‘B’’) and a ‘universal adapter’ (M13f

sequence) on the sequencing end to enable subsequent bar-coding.

Notably, any polymorphisms must be positioned closely (e.g., ,5–

10 bp for 50-bp reads; 55–60 bp for 100-bp reads) to the end with

the universal adapter in order to enable detection. This is due to

the first 40–45 bases of a sequencing read being consumed by the

FREQ-Seq related barcodes and adapters (Figure 1A). Thus, for

each new locus, only two primers need to be synthesized as

standard DNA oligonucleotides. The second round of PCR is

unique in that it uses three primers. The outer primers are generic

to all reactions and utilize the Illumina ‘‘A’’ and ‘‘B’’ sequences to

amplify all products. Critical to our method, a small amount (1/10

normal concentration) of a long, bridging primer is doped into

each reaction. These 81-mers have the Illumina ‘‘A’’ sequence at

the 59-end, the M13f adapter sequence at the 39-end to amplify the

templates from the first round of PCR, and a 6-mer bar-code in

between that allows each amplicon pool to be identified in a

sample-specific manner (see Figure 1 and Tables S1 and S2).

This results in production of DNA fragment pools already

compatible with the Illumina single-end read flow cell, obviating

all subsequent library preparation steps (e.g., shearing, adapter

ligation, column selection, etc.). The resulting products are then

pooled and sequenced en masse on a Illumina HiSeq or Genome

Analyzer flow cell.

Following Illumina sequencing, data analysis is performed using

custom, open-source software (FREQout), available at http://

www.evolvedmicrobe.com/FreqSeq/index.html. The software in-

cludes a command-line executable and a program with a graphical

user interface. Both programs are designed to read and align

sequences contained in the Illumina FASTQ output file. User

defined alleles and bar-code sequences, as well as alignment and

other analysis parameters, are inputted to the software via an

XML formatted file. The software runs the analysis by assigning

each read to an allele and barcode group based on its sequence

and the quality of its alignment to the different possible alleles (see

also Text S1). Because the software checks for the presence of the

M13f sequence while parsing the reads, this step allows for FREQ-

Seq samples to be run in parallel with other samples on the same

Illumina flow cell. The data generated on the frequency of each

type, as well as statistics related to various quality and alignment

metrics, are then written to a CSV file. The graphical user

interface also includes interactive plots that allow users to view

these details immediately.

Results

Method validationIn order to determine the efficacy of FREQ-Seq, we sought to

address three primary concerns: the scalability of library prepa-

ration, sources of bias, accuracy in the measurements, and the

detection limits of the technique. To ensure scalability, we

designed our ‘bridging’ primer system, that introduces the

sample-specific barcode, in a manner that can be produced in-

house (Figure 1A and Text S1). This alleviates the significant

cost and high error rates of custom synthesis of .60 bp DNA

oligonucleotides. Specifically, we found that approximately 15% of

clones from the FREQ-Seq bridging primer plasmid library

contained frame-shift or mismatch mutations, apparently intro-

duced during oligonucleotide synthesis, and were thus culled.

Presently, we have experimentally confirmed 48 unique bar-coded

adapters all possessing sequence Hamming distances greater than

two. This criterion allows for greater fidelity in bar code

assignment during data processing. In addition, these 48 bar

codes, positioned to be the very first bases sequenced, contain no

significant nucleotide bias within the first four bases. This is critical

because biases in the first four bases of Illumina reads are known to

cause significant reductions in cluster identification resulting in

large numbers of abandoned reads by the Illumina analysis

software [28].

Since sequencing libraries require PCR amplification, we

examined the potential bias that could arise from differential

product amplification or detection. As controls, we assayed

mixtures containing a wide range of cell ratios – accurately

determined via flow cytometry – containing either the ancestral or

evolved version of an allele. Following 600 generations of

adaptation to methanol growth by a genetically engineered strain

of M. extorquens AM1, an isolate was chosen and sequenced to

determine the genetic basis of adaptation [22]. We began our

analysis of this population by interrogating three beneficial

mutations, pntABEVO, gshAEVO, and fghAEVO, which span the gamut

of a SNP and deletions of 2- and 11-bp, respectively (Figure 2A–C).

primer and two Illumina enrichment primers. Initially, amplification proceeds via the sample-specific bridging primer that has the M13f sequence atits 39 end. The resulting products can then be amplified to higher quantity with the generic, enrichment primers A and B. Following sample poolingand standard PCR product purification, the resulting products constitute a complete FREQ-Seq library.doi:10.1371/journal.pone.0047959.g001



Comparison of the observed allele frequencies compared to

those expected from the cell ratios demonstrated that there was

remarkably little bias throughout the procedure (Figure 2). Only

the 11-bp deletion of the fghA polymorphism deviated to somewhat

from expectation (Figure 2C). . Because, fortuitously, each of the

allele types of pntAB, gshA, and fghA resulted in change in restriction

pattern, we used quantitative RFLP analysis of the same amplicon

pools and found a similarly weak bias (Figure S1). Fitting these

control data to a quadratic function, however, allowed us to

readily control for the minor bias of fghA, but it appears to be

largely unnecessary for simple SNPs or small in-dels.

The FREQ-Seq method can consistently produce allele

frequency estimates with low absolute errors. However, for very

high and very low allele frequency estimates (.99% or ,1%), the

relative error on the estimates can become large because the

frequency of the allele (the signal) starts to become equal to or less

than the frequency of sequencing errors (the noise). In principle,

one could obtain better estimates of very low or high frequency

alleles by making the estimate within a probabilistic framework

that accounts for the chance that a read is a sequencing error. This

would require a model for the likelihood of sequencing errors,

which would be best verified by running a sample that is entirely of

one type and seeing how often the other type appears. We

performed this assay by sequencing barcoded samples for both the

gshA and the pntAB loci that were at 100% frequency for only one

allele type, and did this with four independently prepared samples

for both alleles on four different dates. For both alleles, and for all

sequencing dates, the observed error rate was very low and the

median observed percentage of the type not supposed to be

present in the sample was 0.23%, indicating that this percentage is

approximately equal to the range in which signal and noise

become equally important to the direct allele frequency estimate.

However, interestingly, a comparison of the different dates shows

that the frequencies of incorrect sequencing reads in the samples

were statistically different for both allele types (logistic regression,

p-values,10215), and two samples from one date contained errors

that were greater than 1% at 1.41% and 1.13% respectively. We

interpreted this result to mean that the frequency of sequencing

errors can be influenced by subtle and distinct biases for different

sequencing runs, and so one must be careful before applying an

error model trained on one day’s data to sequencing data obtained

on another day. However, once data from many more sequencing

runs is obtained, we suspect one could create a reliable method to

more accurately assess uncertainty in low and high frequency

samples by using a hierarchical model that accounts for day to day

variation and that also includes read quality scores as a covariate.

FREQ-Seq can be used to monitor novel DNA junctionsA reoccurring mutational type in many evolution experiments is

the formation of new DNA junctions, either by deletion,

amplification or insertion sequence (IS) element/transposon

insertions [4,13,14,23,24,29,30,31]. To assess the ability of

FREQ-Seq to detect new DNA junctions we modified our primer

strategy to use three primers in the first reaction (two forward,

sequencing-end and one reverse found in Table S1, see FigureS2) to amplify a region adjacent to the icuAB locus on the M.

extorquens AM1 chromosome. This site was previously shown to

contain an IS element insertion that resulted in the increased

expression of icuAB (a cobalt transport system) under metal

limitation [24]. Of the three primers designed, the two forward,

sequencing-end primers were designed to differentially bind to the

icuABWT and icuABEVO (IS containing) genotypes, while the third

reverse primer annealed to a common site downstream of the IS

insertion site. Thus, each amplicon would only occur in the

presence of icuABWT or icuABEVO, respectively. Using the same

protocol developed for two locus primers, we found that FREQ-

Seq was capable of detecting the IS element insertion located

proximal to icuAB and allowed us to follow its rise to fixation

(Figure 3).

Figure 2. Calibration of FREQ-Seq data for three ancestral/evolved (WT/EVO) allele mixtures pntABEVO (A.), gshAEVO (B.), and fghAEVO

(C.). Strains bearing each evolved allele were compared to one with the wild-type version that also expressed mCherry to permit precisequantification of expected ratios by flow cytometry (Table S3). Presented in each column are sequence variation between the ancestral WT and EVOalleles, and subsequent quadratic fits of the data for each respective allele. Observed data are the frequencies of the evolved allele as determined byFREQ-Seq, whereas expected data are wild-type allele frequencies estimated using flow cytometry. Quadratic model fitting was performed in R.doi:10.1371/journal.pone.0047959.g002



Determining the order and dynamics of beneficial allelesfrom evolved populations

Numerous evolution experiments have been performed using

microbial populations where temporal population samples have

been cryogenically stored. Wide-spread access to whole genome

resequencing has allowed many of the beneficial mutations that

have risen in such populations to be identified and characterized.

For the M. extorquens AM1 population we have used as a test case,

we had previously uncovered a generic pattern of epistasis between

the mutations that emerged that led to diminishing returns [23].

What was unknown, however, was the order and dynamics by

which these beneficial alleles emerged. Therefore, to elucidate the

adaptive trajectory and dynamics for this population we employed

FREQ-Seq to the three loci described above (Figure 2), as well as

icuAB. Interestingly, we discovered that the order of incorporation

of these alleles followed the ranking of their selective advantages:

gshAEVO, fghAEVO, icuABEVO and then pntAEVO (Figure 3). Despite

the large selective advantage of the alleles we examined, we

uncovered a substantial, temporary reversal in relative frequencies

between generations 108 and 210. This pattern is indicative of

clonal interference (i.e. the Hill-Robertson effect; [32]); whereby

multiple lineages accrue beneficial mutations and interfere with

each other’s path toward fixation [31,33,34,35,36].

Comparison of FREQ-Seq data to genotype data directlyfrom random isolates

The FREQ-Seq data from our test population suggested three

things: the order of mutations, probable early linkage of gshAEVO

and fghAEVO before they both dramatically rose in frequency, and

that the majority of the population at generation 150 had

unidentified mutations. As an independent method to test these

inferences, we obtained and independently genotyped 72 random

isolates from generation 150 (Figure 4 and Table S2). First

considering the known alleles from the eventual winning lineage

that were present at this time (gshAEVO and fghAEVO), isolates were

found with both gshAEVO and fghAEVO, or just gshAEVO, but not just

fghAEVO. This pattern and linkage data is consistent with the

FREQ-Seq inferences that the fghAEVO mutation arose later and on

the background of gshAEVO.

Turning to trying to uncover what other mutations were present

in the lineages that temporarily outcompeted those with gshAEVO

and fghAEVO, we considered previous work on parallelism across

the replicate populations [27]. This paper had found that all

populations contain mutations that, like the fghAEVO allele

described here, reduce expression of the introduced formaldehyde

oxidation pathway in this strain. The most common mutational

event across populations was IS-mediated co-integration of the

introduced, multi-copy plasmid expressing the foreign formalde-

hyde pathway with an endogenous, single copy plasmid in M.

Figure 3. Allele frequencies for four beneficial alleles in population F4 (A) with correlated changes in relative growth rate (B). AllelesgshAEVO (circles), fghAEVO (triangles), pntABEVO (squares), and icuABEVO (diamonds) were determined by FREQ-Seq and corresponding values werecorrected with calibrated data shown in Figure 2. Growth rates (B.) were normalized to the growth rate of the ancestral strain.doi:10.1371/journal.pone.0047959.g003



extorquens AM1 [29]. We used PCR amplifications to determine

whether such an event occurred in these isolates (Table S3). Of

the 72 isolates, 43 of these had an trfA::ISMex25 allele (trfA

encodes the plasmid replication protein of this IncP replicon, [37]).

Interestingly, of these there were 2 that possessed both gshAEVO and

an trfA::ISMex25 allele. This homoplasy is most likely due to

repeated cointegrations on different backgrounds. Finally, there

were 14 isolates that lacked all of the known beneficial alleles, but

co-existed with the other genotypes at generation 150 when the

growth rate of the population had risen by 84.462.5%. Clearly

there were other, as yet unknown mutations that occurred and

temporarily rose in frequency.

Comparison of beneficial allele emergence withobserved increases in population fitness

With a quantitative picture of genotypic change at key loci

available via FREQ-Seq, we measured the change in growth rate

of the population to determine if this corroborated the finding of

clonal interference. We used growth rate as a proxy for fitness

because it has been previously demonstrated for these populations

that these two measures are linearly correlated [23]. As has been

seen for other experiments with M. extorquens AM1 [38], as well as

other model systems (e.g., [39]), the rate of fitness increase was

faster early in the experiment then later, but continued to rise

throughout. Importantly, the relative smoothness, rather than

stepwise nature of the observed dynamics was consistent with

theoretical models of clonal interference (and recurring ‘‘multiple

mutations’’; [40]). Furthermore, clear evidence for other lineages

that were ‘‘invisible’’ to our FREQ-Seq data comes from the fact

that the temporary peak of the gshAEVO allele at 34.7% of the

populations occurred at generation 108, when the relative growth

had already increased by 69.368.9%. Given that this was the

inflection point between rising and falling in frequency, this

indicates that the genotype containing gshAEVO had a fitness of

,70%. This means that the rest of the population was somewhat

less fit than this genotype prior to that time-point, and somewhat

higher afterwards. Despite these fast reversals in allele frequencies

during this interval, growth rate continued to rise relatively

continuously. Compared to the wavering trajectory for gshAEVO,

the trajectories of fghAEVO, and later pntABEVO and icuABEVO did

not exhibit wild reversals, yet were more complex than the simple

sinusoidal pattern expected were there to be periodic selection of

single mutations in the absence of clonal interference [41].

Discussion

Reported here is a method for quantitative determination of

allele frequencies called FREQ-Seq. FREQ-Seq utilizes standard

PCR and depends upon a novel bar-coding system to generate

locus-specific Illumina sequencing libraries. This capitalizes on

focusing the sequencing depth of next-generation sequencing

platforms upon just the key loci of interest across timepoints or

populations in order to accurately infer their frequency. A key

aspect of FREQ-Seq is its capacity for multiplexing a large

number of samples. For example, for academic laboratories

engaged in experimental evolution or environmental microbiolo-

gy, FREQ-Seq provides a low-cost, highly extendable method for

investigating genetic variation over large temporal and spatial

scales.

As a proof of principle we have demonstrated the utility of

FREQ-Seq in tracking the emergence of four beneficial alleles

from a previously characterized evolution experiment [23]. We

also observed that, while the order of allele emergence matched

the order which would have been the most optimal adaptive

trajectory amongst the epistatic combinations generated by Chou

and coworkers [23], a previously unknown period of clonal

interference existed early during adaptation. Based on genotyping

of clones during this period, one prominent source of interfering

lineages resulted from IS insertions into the replication machinery

(trfA) of the RK2-based plasmid harboring the exogenous

formaldehyde oxidation pathway. Similar IS insertions have

recently been found in many of the replicate populations from

this experiment, and provide 17.4–24.1% fitness increases due to

reducing the cost associated with exogenous pathway expression

[29]. Interestingly, while these IS mutational variants were the

eventual evolutionary ‘winners’ in other populations, they were

unable to outcompete the fghAEVO gshAEVO genotype despite being

simultaneously present in gshAEVO backgrounds. These results

underpin the importance of allele frequency data in revealing the

complex processes during adaptation. Furthermore, by simulta-

neously determining the population growth rate with high

resolution, we were able to both estimate the fitness of the

genotype containing the allele of interest and demonstrate that the

reversal in its frequency actually occurred during a period of

continued rise in fitness. As more studies move to characterizing

improvement accurately and with fine time resolution, the

synergistic role of FREQ-Seq estimates of allele frequencies will

become even more important to connect between phenotypic and

genotypic change.

This test case highlights advantages of the FREQ-Seq method.

First, our results show that a wide range of allelic types can be

detected ranging from SNPs and small in-dels to novel IS

junctions. Second, operating at approximately 5% of an Illumina

flow cell capacity we were able to achieve .105 reads per locus per

time-point. Third, we observed minimal bias in our control ratios.

This suggests that FREQ-Seq is fully capable of quantitatively

detecting these types of alleles with little or no calibration required,

particularly for SNPs or small in-dels. Where needed a simple

quadratic fit can be used to account for small, systematic bias. We

encourage potential users to consider a control mixture to account

for possible skews in observed frequencies for these types of alleles.

Fourth, the sensitivity available is quite reasonable. Our data

allowed us to quantify allele frequencies reliably from 1–99%.

Figure 4. Genotyping of 72 isolates from population F4 atgeneration 150. Genotypes were determined by PCR amplificationand RFLP analysis using enzymes HhaI and HpyAV for the gshA, andfghA alleles, respectively. trfA::ISMex25 alleles were assayed by standardPCR amplification.doi:10.1371/journal.pone.0047959.g004



More sophisticated statistical error models or increased flow cell

capacity can likely increase the sensitivity of FREQ-Seq. Fifth, the

publicly available FREQout analysis package allows for research-

ers to easily recover data and receive a full report of allele

frequency types.

A critical component of any new method is the cost of

implementation, and the associated tradeoff in accuracy and

throughput, relative to existing techniques and technologies. With

this in mind, we explored the cost of FREQ-Seq implementation

compared to existing high-throughput platforms (Fluidigm Acces-

sArrayTM and Raindance ThunderstormTM) and sequencing-

based allele frequency detection methods (Sanger sequencing and

whole genome resquencing). As an initial cost comparison, we

chose to examine two commercial systems from Fluidigm and

Raindance that use a very similar strategy for PCR enrichment

and barcoding of chromosomal loci. FREQ-Seq and these two

platforms both require the use of locus specific oligonucleotide

primers containing long overhangs. Unlike FREQ-Seq, however,

both the Fluidigm and Raindance systems require long barcoded

primers, containing both Illumina adapter sequences, for every

locus of interest. While an unassuming difference, the latter

primers routinely exceed most standard oligonucleotide synthesis

lengths (currently 60 nucleotides at $0.35 US/base, IDT) resulting

in increased synthesis and purification costs (.$1 US/base).

Additionally, as these primers are locus specific, each locus must

have a new set of barcoded primers. To directly compare, each

Fluidigm or Raindance barcoded primer set would cost approx-

imately $3,120 US (48665 nt barcoded primers at approximately

$1 US/base) for a single locus whereas a FREQ-Seq locus primer

set of similar size costs $32 US (1637 nt primer +1654 nt primer

at $0.35 US/base) as all barcoding is accomplished with the

universal bridging primer in a locus-indendent manner. Finally,

comparing equipment costs (assuming equal amounts of PCR

reagents and user time are used) shows that the aforementioned

commercial technologies (not considering annual custom consum-

ables and instrument servicing costs) range between $70,000 to

$100,000 US. In contrast, FREQ-Seq relies only on a standard,

96-well PCR thermal cycler already present in or generally

available to most laboratories. Collectively, FREQ-Seq is two

orders of magnitude less expensive in oligonucleotide costs and

again obviates purchasing an expensive equipment specifically

designed for this purpose. Further, as FREQ-Seq is an open-

source platform, no reagents or equipment are proprietary,

thereby alleviating significant annual service and licensing costs

for small research laboratories and institutions.

Additionally, one may consider alternative strategies such as

traditional Sanger sequencing and whole genome resequencing to

infer allele frequencies in mixed populations. Using the analysis of

an experimentally evolved population demonstrated in Figure 3as a metric, Sanger sequencing of four loci for 22 time points

would result in a cost for a single set of replicates to be $440 US (5

$US/reaction, GeneWiz). However, to achieve any statistical

accuracy, multiple replicate samples for each allele and time point

would need to be run. Moreover, the accuracy generated by

Sanger sequencing is routinely difficult to achieve below 10%

allele abundance [3,9,16]. Likewise, employing the same logic to

whole genome resequencing of mixed population samples using

the Illumina HiSeq platform, approximately 12 samples can be

run per lane to produce an average coverage of 100–200 fold.

Conservatively, this experiment would result in the use of two full

lanes in an Illumina flow cell (costing approximately $2,400 US

using Harvard FAS Core pricing) and would require 22 barcoded

Illumina libraries to be generated (an additional cost of $1,200 US

with TruSeq preparations at $55 US/sample). An immediate

benefit of this strategy is that nearly all chromosomal mutations can

be seen simultaneously. The drawback is that 100–200 fold coverage

results in both substantial sampling error in estimating alleles of

intermediate frequency, and is insufficient to detect alleles at low

frequency, resulting in increased coverage being necessary (with

costs increasing linearly). In contrast, using only ,3% of a single

Illumina flow cell lane ($40 US), FREQ-Seq is able to achieve an

average of 100,000 fold coverage per allele, per time point. If we thus

consider the cost per 1,000 fold coverage, whole genome sequencing

would cost ,$700 US per time point, whereas FREQ-Seq costs

$0.005 US per locus of interest in the population. By focusing on loci

of interest, FREQ-Seq uses the five order of magnitude difference in

cost per coverage to achieve robust quantitative results and sub 1%

frequency detection. Taken together, these comparisons demon-

strate that FREQ-Seq is a cost-effective, quantitative strategy for

localized allele frequency determination.

In the present implementation of FREQ-Seq, we have described

a set of 48 uniquely bar-coded bridging primers to produce

localized Illumina sequencing libraries compatible with either

single-end or paired-end read flow cells. Although currently

designed for single-end reads, changing compatibility to paired-

end flow cells would only require simple modification to the

reverse locus primer to change its 59 overhang, and an identical

change in the primer ‘B’ for enrichment (Table S1). Though

employed here to explore the frequencies of known alleles in a

laboratory evolution experiment, we expect FREQ-Seq to be

useful in other diagnostic applications. Multi-way competitions

between bar-coded strains could be readily performed. Moving out

of the lab entirely, microbial species variation in environmental

samples is entirely feasible. For example, phylogenetic analyses of

16S rRNA from communities could occur with excellent

throughput and no subsequent library preparation. As a result,

FREQ-Seq should be a useful tool to many biomedical and

environmental microbiologists.

Supporting Information

Figure S1 Estimation of PCR-introduced bias in alleleamplification. Allele pntABEVO (circles), gshAEVO (squares), and

fghAEVO (triangles) were amplified using primer pairs AF3/AF4,

AF7/8, and AF9/10, respectively followed by bridging primer

(MLBC1) addition. Final FREQ-Seq products for pntABEVO,

gshAEVO, and fghAEVO were digested with HhaI, AluI, or HpyAV,

respectively. Digested DNA fragments were sized fractionated and

analyzed using a Bioanalyzer 2100 (Agilent) with DNA1000 chips.

Ratios were determined using the relative ratios of integrated peak

areas from the Bioanalyzer trace. Data were analyzed using the

Agilent 2100 Expert software package.

(TIF)

Figure S2 Schematic of the three primer, novel DNAjunction implementation of FREQ-Seq.(TIF)

Box S1 Abbreviated FREQ-Seq protocol.(TIF)

Table S1 FREQ-Seq primer sequences and allele-spe-cific primers used in this study.(DOCX)

Table S2 Sequences of the 48 barcodes contained in theFREQ-Seq kit. All sequences are the N bases in the sequence

AATGATACGGCGACCACCGAGATCTACACTCTTTCCC-

TACACGACGCTCTTCCGATCTNNNNNNGTAAAAC-

GACGGCCAGT.

(DOCX)



Table S3 Genotypes of 72 isolates from population F4 atgeneration 150. An (X) indicates the presence of an evolved

allele (EVO) as determined by RFLP or differential PCR

amplication.

(DOCX)

Table S4 Bacterial strains used in this study and theirrelevant genotypes.

(DOCX)

Text S1

(DOCX)

Acknowledgments

We would like to thank Christian Daly, Jiangwen Zhang and Claire

Reardon for their technical assistance in Illumina sequencing. Additionally,

we thank Andres Benitez and Gerda Saxer for technical feedback on

FREQ-Seq implementation, Sean M. Carroll for helpful comments on

nomenclature, and Deepa Agashe and members of the Marx laboratory for

helpful comments on the manuscript.

Author Contributions

Conceived and designed the experiments: LMC ML CJM. Performed the

experiments: LMC ML. Analyzed the data: LMC ML NFD CJM.

Contributed reagents/materials/analysis tools: LMC ML NFD. Wrote the

paper: LMC ML NFD CJM.

References

1. Futuyma DJ (2009) Evolution: Sinauer Associates.

2. Wichman HA, Badgett MR, Scott LA, Boulianne CM, Bull JJ (1999) Different

trajectories of parallel evolution during viral adaptation. Science 285: 422–424.

3. Kvitek DJ, Sherlock G (2011) Reciprocal sign epistasis between frequentlyexperimentally evolved adaptive mutations causes a rugged fitness landscape.

PLoS Genet 7: e1002056.

4. Barrick JE, Yu DS, Yoon SH, Jeong H, Oh TK, et al. (2009) Genome evolutionand adaptation in a long-term experiment with Escherichia coli. Nature 461:

1243–1247.

5. Conrad TM, Joyce AR, Applebee MK, Barrett CL, Xie B, et al. (2009) Whole-genome resequencing of Escherichia coli K-12 MG1655 undergoing short-term

laboratory evolution in lactate minimal media reveals flexible selection of

adaptive mutations. Genome Biol 10: R118.

6. Barrick JE, Lenski RE (2009) Genome-wide mutational diversity in an evolving

population of Escherichia coli. Cold Spring Harb Symp Quant Biol 74: 119–129.

7. Woods R, Schneider D, Winkworth CL, Riley MA, Lenski RE (2006) Tests ofparallel molecular evolution in a long-term experiment with Escherichia coli. Proc

Natl Acad Sci U S A 103: 9107–9112.

8. Woods RJ, Barrick JE, Cooper TF, Shrestha U, Kauth MR, et al. (2011)Second-order selection for evolvability in a large Escherichia coli population.

Science 331: 1433–1436.

9. Pena MI, Davlieva M, Bennett MR, Olson JS, Shamoo Y (2010) Evolutionaryfates within a microbial population highlight an essential role for protein folding

during natural selection. Mol Syst Biol 6: 387.

10. Norton N, Williams NM, Williams HJ, Spurlock G, Kirov G, et al. (2002)Universal, robust, highly quantitative SNP allele frequency measurement in

DNA pools. Hum Genet 110: 471–478.

11. Monsion B, Duborjal H, Blanc S (2008) Quantitative Single-letter Sequencing: amethod for simultaneously monitoring numerous known allelic variants in single

DNA samples. BMC Genomics 9: 85.

12. Kirov G, Nikolov I, Georgieva L, Moskvina V, Owen MJ, et al. (2006) PooledDNA genotyping on Affymetrix SNP genotyping arrays. BMC Genomics 7: 27.

13. Papadopoulos D, Schneider D, Meier-Eiss J, Arber W, Lenski RE, et al. (1999)

Genomic evolution during a 10,000-generation experiment with bacteria. ProcNatl Acad Sci U S A 96: 3807–3812.

14. Schneider D, Duperchy E, Coursange E, Lenski RE, Blot M (2000) Long-term

experimental evolution in Escherichia coli. IX. Characterization of insertionsequence-mediated mutations and rearrangements. Genetics 156: 477–488.

15. Wilkening S, Hemminki K, Thirumaran RK, Bermejo JL, Bonn S, et al. (2005)

Determination of allele frequency in pooled DNA: comparison of three PCR-based methods. Biotechniques 39: 853–858.

16. Gresham D, Desai MM, Tucker CM, Jenq HT, Pai DA, et al. (2008) The

repertoire and dynamics of evolutionary adaptations to controlled nutrient-limited environments in yeast. PLoS Genet 4: e1000303.

17. Lee HH, Molla MN, Cantor CR, Collins JJ (2010) Bacterial charity work leads

to population-wide resistance. Nature 467: 82–85.

18. Herbon N, Werner M, Braig C, Gohlke H, Dutsch G, et al. (2003) High-

resolution SNP scan of chromosome 6p21 in pooled samples from patients with

complex diseases. Genomics 81: 510–518.

19. Werner M, Sych M, Herbon N, Illig T, Konig IR, et al. (2002) Large-scale

determination of SNP allele frequencies in DNA pools using MALDI-TOF mass

spectrometry. Hum Mutat 20: 57–64.

20. Forshew T, Murtaza M, Parkinson C, Gale D, Tsui DW, et al. (2012)

Noninvasive identification and monitoring of cancer mutations by targeted deep

sequencing of plasma DNA. Sci Transl Med 4: 136ra168.

21. Pekin D, Skhiri Y, Baret J-C, Le Corre D, Mazutis L, et al. (2011) Quantitative

and sensitive detection of rare mutations using droplet-based microfluidics. Lab

on a Chip 11: 2156–2166.22. Marx CJ, Chistoserdova L, Lidstrom ME (2003) Formaldehyde-detoxifying role

of the tetrahydromethanopterin-linked pathway in Methylobacterium extorquens

AM1. J Bacteriol 185: 7160–7168.

23. Chou HH, Chiu HC, Delaney NF, Segre D, Marx CJ (2011) Diminishingreturns epistasis among beneficial mutations decelerates adaptation. Science

332: 1190–1192.

24. Chou HH, Berthet J, Marx CJ (2009) Fast growth increases the selectiveadvantage of a mutation arising recurrently during evolution under metal

limitation. PLoS Genet 5: e1000652.25. Delaney NF, Rojas Echenique JI, Marx CJ (2012) Clarity: An open source

manager for laboratory automation. J Lab Autom doi:10.1177/

2211068212460237.26. Gibson DG, Young L, Chuang RY, Venter JC, Hutchison CA, 3rd, et al (2009)

Enzymatic assembly of DNA molecules up to several hundred kilobases. NatMethods 6: 343–345.

27. Hahne F, LeMeur N, Brinkman RR, Ellis B, Haaland P, et al. (2009) flowCore:a Bioconductor package for high throughput flow cytometry. BMC Bioinfor-

matics 10: 106.

28. Krueger F, Andrews SR, Osborne CS (2011) Large scale loss of data in low-diversity illumina sequencing libraries can be recovered by deferred cluster

calling. PLoS One 6: e16607.29. Chou HH, Marx CJ (2012) Optimization of Gene Expression through Divergent

Mutational Paths. Cell Reports 1: 133–140.

30. Lee MC, Marx CJ (2012) Repeated, selection-driven genome reduction ofaccessory genes in experimental populations. PLoS Genet 8: e1002651.

31. Kao KC, Sherlock G (2008) Molecular characterization of clonal interferenceduring adaptive evolution in asexual populations of Saccharomyces cerevisiae. Nat

Genet 40: 1499–1504.32. Hill WG, Robertson A (2007) The effect of linkage on limits to artificial

selection. Genet Res 89: 311–336.

33. Fogle CA, Nagle JL, Desai MM (2008) Clonal interference, multiple mutationsand adaptation in large asexual populations. Genetics 180: 2163–2173.

34. Notley-McRobb L, Ferenci T (1999) The generation of multiple co-existing mal-regulatory mutations through polygenic evolution in glucose-limited populations

of Escherichia coli. Environ Microbiol 1: 45–52.

35. Gerrish PJ, Lenski RE (1998) The fate of competing beneficial mutations in anasexual population. Genetica 102–103: 127–144.

36. Helling RB, Vargas CN, Adams J (1987) Evolution of Escherichia coli duringgrowth in a constant environment. Genetics 116: 349–358.

37. Thomas CM, Meyer R, Helinski DR (1980) Regions of broad-host-range

plasmid RK2 which are essential for replication and maintenance. J Bacteriol141: 213–222.

38. Lee MC, Chou HH, Marx CJ (2009) Asymmetric, bimodal trade-offs duringadaptation of Methylobacterium to distinct growth substrates. Evolution 63: 2816–

2830.39. Lenski RE, Travisano M (1994) Dynamics of adaptation and diversification: a

10,000-generation experiment with bacterial populations. Proc Natl Acad

Sci U S A 91: 6808–6814.40. Desai MM, Fisher DS, Murray AW (2007) The speed of evolution and

maintenance of variation in asexual populations. Curr Biol 17: 385–394.41. Lenski RE, Rose M R., Simpson, S C., and Tadler, S C. (1991) Long-term

experimental evolution in Escherichia coli. I. Adaptation and divergence during

2,000 generations. Am Nat 138: 1315–1341.



Table S1. FREQ-Seq primer sequences and allele-specific primers used in this study.

Name Sequence Description FO GTAAAACGACGGCCAGT Forward/sequencing end overhang RO CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT Single-end read reverse overhang PRO AAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT Paired-end read reverse overhang ABC1 AATGATACGGCGACCAC Bar code amplification ABC2 ACTGGCCGTCGTTTTAC Bar code amplification AF1 AATGATACGGCGACCAC FREQ-Seq enrichment AF2 CAAGCAGAAGACGGCATAC FREQ-Seq enrichment AF3 FO+CAGATCTGAACTTCCCAGCA pntAB forward AF4 RO+GACCCCAGACCTATGAACTTC pntAB reverse AF7 FO+ACGCTGCAAGAGTGAACAAC gshA forward AF8 RO+TGAGATCGATCTGCTGGGTC gshA reverse AF9 FO+CTAGAGTTCCACGACTTGACAG fghA forward AF10 RO+CATTTTCATGCGTGCAGGTC fghA reverse AF11 FO+GATGCTGCGACCGAGATT icuAB forward AF12 FO+ATCCACTGCCCTCGTGAATA icuAB::ISMex4 forward AF13 RO+AGTTCCTCCAGCTCAACTGC icuAB reverse MLBC1 AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCT

TCCGATCTGAGAGAGTAAAACGACGGCCAGT Control bridge adapter

S11 GACTTCCGGCAAGCTATACG ISMex25 insertion forward S12 CCCGCAAGGAGGGTGAATG ISMex25 insertion reverse

Table S2. Sequences of the 48 barcodes contained in the FREQ-Seq kit. All sequences are the N bases in the sequence AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNNNNGTAAAACGACGGCCAGT.

Name Sequence Name Sequence BC01 AGCAAT BC25 TGACGA BC02 CCTGTT BC26 CAAATA BC03 GGGTTT BC27 GTTCAG BC04 GAAGGC BC28 CTTCAA BC05 ATCTCA BC29 GTTGGG BC06 ATGGAT BC30 GCTTAG BC07 ATGTCT BC31 TAGCCA BC08 CGTGAC BC32 TAACTT BC09 TTAGGT BC33 CGGATA BC10 GTGCAT BC34 CAGCAG BC11 AACTTT BC35 AAGTAG BC12 GGATCG BC36 GGGACG BC13 ATAAGG BC37 CCGTGG BC14 ATTGGT BC38 ATTGTA BC15 AGTGAG BC39 TTTAGA BC16 CCCACC BC40 CCACGA BC17 CGATGC BC41 TCATGG BC18 GATAGC BC42 GAACCA BC19 GTCAGA BC43 TCCTAA BC20 TTAAGC BC44 CAACGC BC21 AACCTG BC45 AGTGTT BC22 CTTTGC BC46 GGATTA BC23 TGGAGA BC47 TATATA BC24 AATTGT BC48 GTACAA

Table S3. Genotypes of 72 isolates from population F4 at generation 150. An (X) indicates the presence of an evolved allele (EVO) as determined by RFLP or differential PCR amplication.

F4, G150 isolate

trfA::ISMex25 fghAEVO gshAEVO

CM3317 CM3319 CM3320 CM3330 CM3338 CM3343 CM3344 CM3345 CM3350 CM3359 CM3365 CM3368 CM3370 CM3373 CM3307 X CM3312 X CM3318 X CM3336 X CM3355 X CM3305 X X CM3308 X X CM3327 X X CM3335 X X CM3339 X X CM3348 X X CM3352 X X CM3362 X X CM3363 X X CM3364 X X CM3322 X CM3354 X CM3303 X CM3304 X CM3306 X CM3309 X CM3310 X CM3311 X CM3313 X CM3314 X CM3315 X CM3316 X CM3321 X CM3323 X

CM3324 X CM3325 X CM3326 X CM3328 X CM3329 X CM3331 X CM3332 X CM3333 X CM3334 X CM3337 X CM3340 X CM3341 X CM3342 X CM3346 X CM3347 X CM3349 X CM3351 X CM3356 X CM3357 X CM3358 X CM3360 X CM3361 X CM3366 X CM3367 X CM3371 X CM3372 X CM3374 X CM3353 X X CM3369 X X

Table S4. Bacterial strains used in this study and their relevant genotypes.

Strain Relevant genotype or characteristics Reference NEB10β E. coli recA1 endA1 rpsL Φ80(lacZΔM15) NEB CM502 M. extorquens crtI502 [S2] CM701 M. extorquens ΔmptG / pCM410 [23] CM1145 M. extorquens ΔmptG crtI502 gshAEVO pntABEVO icuABEVO/ pCM410.1145 [fghAEVO] [23] CM1175 M. extorquens ΔkatA::[loxP-trrnB-PtacA-mCherry-tT7] [37] CM1290 M. extorquens ΔmptG crtI502 pntAEVO

[23] CM1298 M. extorquens ΔmptG crtI502 gshAEVO [23] CM3137a M. extorquens CM1290 mptGWT This work CM3138a M. extorquens CM1298 mptGWT This work CM3277 M. extorquens CM502 / pCM410.1145 [fghAEVO] This work CM3943 M. extorquens CM1175 / pCM410 This work

a. The wild-type mptG allele was restored into the respective strains using pCM436 following the sucrose counter-selection method described by C.J. Marx [S2]

Supplementary Text

Text S1: FREQ-Seq protocol

Primer design

To implement FREQ-Seq, users must design a single set of oligonucleotide primers for each experimental locus. Using the sequences provided in Table S1, primers for single-end read flow cells are designed as below:

Sequencing end

5’-GTAAAACGACGGCCAGT + ~20 bp primer sequence homologous to region adjacent to mutation

Non-sequencing end

5’- CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT + ~20 bp primer sequence homologous to region ~250 bp downstream of the mutation

Bridging primer synthesis

Synthesis of the FREQ-Seq bar-coded, bridging primers can be performed in house using a thermal cycler and standard PCR reagents with any polymerase that generates blunt-ended products, such as Pfu, DeepVentR™ or Phusion™. Example reaction conditions using Phusion™ polymerase are as follows:

DNA template: FREQ-Seq plasmid (~ 1-10 pg)

Oligo primers: ABC1 and ABC2 (0.5 µM each)

Cycling conditions:

98 °C – 60 s Cycle 30x: 98 °C – 20 s 52 °C – 15 s 72 °C – 5 s 4 °C – <3 hrs or freeze

We have tried three methods for purification of the dsDNA briding primers with success:

1. 2% agarose/1X TAE separation followed by gel extraction using a ZR-96 Zymo gel recovery kit. This provides pure bridge adapter preparations and is compatible with preparing large numbers adapter simultaneously.

2. 10% polyacrylamide/1X TAE separation followed by crush and soak (overnight) extraction, and EtOH precipitation of DNA products. This results in the purest bridge primer preparations, though not typically required.

3. Purification using Ampure XP™ magnetic beads. Best results are achieved if PCR reactions are treated with DpnI to eliminate plasmid contaminants prior to purification. The most rapid cleanup, but results in less pure bridge adapter preparations.

FREQ-Seq library preparation

The FREQ-Seq library preparation requires two stages of PCR amplification. Reactions can be performed with any blunt-end producing DNA polymerase.

Stage 1: Locus amplification

1. For microbial populations, cells can be spun down and resuspended in water or other suitable lysis buffer. Following resuspension, cells are boiled at 98-100 °C in a thermal cycler or appropriate water bath. Cell debris can then be removed by centrifugation. The supernatant contains a crude mixture of microbial DNA representative of the original population.

2. Desired loci are amplified in a 50 µL reaction using the locus specific FREQ-Seq primers with conditions:

a. 5-10 µL of crude DNA template from 1. b. 0.5 µM of each primer, 200 µM dNTP, and appropriate buffering

conditions c. PCR thermal cycling: 10-15X cycles using conditions optimized for the

locus specific FREQ-Seq primer set. 3. Primers from this first reaction are removed by:

a. Treatment with 5U of exonuclease I (ExoI) and 1U of DpnI for 30 minutes at 37 °C and 20 minutes at 80 °C to heat inactivate these enzymes. DpnI helps remove methylated chromosomal DNA templates. This reaction can be performed in the PCR reaction buffer. Fastest for large sample numbers.

b. Purification using Ampure XP™ magnetic beads, eluting in 27 µL of ddH2O.

Stage 2: Bar-coding and library purification

1. For amplified loci, samples can be bar-coded as follows: a. For samples treated as 1.3a (above), 25 µL of the ExoI/DpnI treated

reaction can be mixed with 25 µL of fresh PCR master mix at 1X concentration and enrichment primers at 2X concentrations (ex. 0.1 µM Primer A and Primer B).

b. For samples treated as 1.3b (above), 25 µL of the purified reaction can be mixed with 25 µL of fresh PCR master mix at 2X concentration and enrichment primers at 2X concentrations (ex. 0.1 µM Primer A and Primer B).

2. To each reaction, add 1/10th the molar equivalent of a single FREQ-Seq bridging primer to a single reaction (ex. for ~50 µL reaction add 10-20 ng of dsDNA FREQ-Seq primer).

3. PCR amplify the mixture under the following conditions: a. Cycling conditions: b. 98 °C – 60 s c. 98 °C – 20 s d. 53 °C – 20 s e. 72 °C – 30 s f. Cycle c-e 10-15X depending on the amplicon g. 4 °C – <3 hrs or freeze

4. The resulting reactions can be treated with QIAGEN Buffer PB (5M Guanidine-HCl (aq), 30% Isopropyl alcohol) to stop polymerase activity.

5. Pool samples according to the loci of interest, followed by purification using QIAGEN columns or generic silica gel equivalents. This is a complete FREQ-Seq library.

6. (Optional) To ensure FREQ-Seq product size and purity, products can be run on a 2% agarose/1X TAE gel followed by standard gel extraction.

7. Determine the concentration of each allele pool. 8. Dilute and mix all allele pools together to give a final concentration of 10-20 nM

depending on your sequencing facility requirements.

Text S2: Description of the FREQout Parsing Steps

Freq-Out is the companion program to this method which takes as input the FASTQ files from an Illumina run and produces as output the frequency of each type as different barcodes. We briefly described the steps in this process here, but more information and the source code is available at www.evolvedmicrobe.com/FreqSeq/index.html.

The FREQout Data Parsing Process

1. Specify Settings: Users specify the barcodes and allele types they are using in an XML file. FREQout uses the settings in this file to parse the FASTQ files, and for each read either assign it to a specific barcode and allele combination, or discard it if the match is poor or certain quality controls are not passed.

2. Determine Quality Filters: The most computationally intense portion of Freq-Out is the task of assigning reads that are not exact matches to known barcode and allele combinations. The program optionally allows users to avoid performing this step for low quality data. This not only allows for faster parsing but also can avoid poor quality matches. Unfortunately, defining low quality data is complicated because there are several variants of the way FASTQ files encode quality scores, although a user typically does not know which format any particular file is in. To circumvent this issue, FREQout instead estimates the distribution of the average quality score for reads in a FASTQ file by parsing a user specified number of reads from the beginning of the file. If the user sets a quality score percentile cut-off, FREQout estimates the average read quality score at this percentile using this empirical distribution, and does not attempt to assign any reads that are below this threshold if they are not an exact match. By default, FREQout will parse the first 5,000 reads to estimate the quality score, and not align any read with an average quality score below the bottom 2% of the quality scores in these initial reads.

3. Parse the Reads – FREQout next assigns, in parallel, each read to either a barcode/allele combination, or determines that the read cannot be assigned. For each read it performs the following steps.

a. Filters the read based on the presence of the M13f sequence that should be in every read produced by the FreqSeq method. However, this step is skipped if the user does not require this sequence. If they do (this is the default option), the read is immediately excluded if an exact match is not found unless they also specify the option to allow reads within a Hamming distance of one. If so, the read can still be accepted if it has an M13f sequence that is within a Hamming distance of 1 to the actual sequence.

b. Assigns the read to a barcode – If the read contains an exact match to a barcode it is simply assigned to that group. If not, and if the user has specified the option to allow inexact matches, the read is assigned to a barcode group if the 6 bp barcode sequence is within a Hamming distance of 1 to that barcode (this is the default option provided all barcodes are separated by a Hamming distance over two, which is true for all barcodes considered in this paper).

c. Assign the read to an allele type - If the read contains an exact match to an allele it is assigned to that group. If not, and if the user has set the file to allow for inexact matches the following steps are taken.

i. Determine if the read passes the quality threshold set in step 2, and also the user specified maximum percentage of “N” basepairs to be considered for inexact alignment.

ii. If the read passes (i), then a list of candidate allelic types is aligned against the sequence to determine if any are significant matches. First all 7 bp K-mers in the read (after trimming the barcode and M13f sequence) are queried against a dictionary of all 7 bp K-mers made from the available allelic types set by the XML file. For each allelic type that matches a K-mer, a Smith-Waterman alignment is done between the type and the sequence from the read. The parameters for the alignment are set in the XML file. If only one allelic type is matched, that allelic type is assigned. If more than two alignments are performed, the allelic type with the highest scoring alignment is assigned, provided that the difference between the highest scoring alignment and the next highest scoring alignment is greater than the Smith-Waterman match score plus the absolute value of the mismatch score (to ensuring the difference is at least equivalent to one SNP in an ungapped alignment).

iii. Update the list of the relevant barcode/allele group with the average quality score for that read.

4. Output the Results – FREQout next outputs the results to a comma separated value (csv) file. The output includes the number of reads that were parsed and excluded at each quality step, as well as the number that were assigned as exact or inexact matches and relevant quality scores for each group. Frequencies of each type for each barcode are derived from these quantities.

Supplemental Methods

Allele frequencies determined by restriction fragment length polymorphism. The FREQ-Seq libraries were first amplified by allele specific primers with M13f and Illumina B overhangs (15 cycles). The PCR products were purified using a QIAGEN PCR purification kit. Purified products were further PCR amplified in a second step using the universal primers (AF1 and AF2) with a small amount (0.02 µM) of barcode primer (MLBC1) (15 cycles). For small indel alleles (gshA and fghA), we applied reconditioning PCR to reduce the heteroduplex structure of amplified products in each step of PCR (Supplemental Reference 1). These additional PCR amplifications were run for only 2 cycles with excess primers, using the previous 15-cycle PCR product as the template. Endonuclease digestion reactions were carried out with specific enzymes for each allele and purified by QIAGEN MinElute kit. The alleles pntAB, gshA, and fghA were digested with enzymes AluI, HhaI, and HpyAV (New England Biolabs), respectively. The purified digests were then run on an Agilent 2100 Bioanalyzer using the DNA1000 kit. The frequency of each allele was estimated using the Bioanalyzer 2100 Expert software. Supplemental References 1. Thompson, J.R., Marcelino, L.A. and Polz, M.F. (2002) Heteroduplexes in mixed-

template amplifications: formation, consequence and elimination by 'reconditioning PCR'. Nucleic Acids Res, 30, 2083-2088.

2. Marx, C.J. (2008) Development of a broad-host-range sacB-based vector for unmarked allelic exchange. BMC Res Notes, 1, 1.

Documents

FREQ-Seq: A Rapid, Cost-Effective, Sequencing-Based Method ... · FREQ-Seq: A Rapid, Cost-Effective, Sequencing-Based Method to Determine Allele Frequencies Directly from Mixed Populations