

Copyright © 2008 by the Genetics Society of America
DOI: 10.1534/genetics.108.087361

The Polymorphism Frequency Spectrum of Finitely Many Sites Under Selection

Michael M. Desai* and Joshua B. Plotkin†,1

*Lewis–Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey 08544 and †Department of Biology and Program in Applied Mathematics and Computation Science, University of Pennsylvania, Philadelphia, Pennsylvania 19104

Manuscript received January 21, 2008
Accepted for publication October 6, 2008

ABSTRACT

The distribution of genetic polymorphisms in a population contains information about evolutionary processes. The Poisson random field (PRF) model uses the polymorphism frequency spectrum to infer the mutation rate and the strength of directional selection. The PRF model relies on an infinite-sites approximation that is reasonable for most eukaryotic populations, but that becomes problematic when $\theta$ is large ($\theta \gtrsim 0.05$). Here, we show that at large mutation rates characteristic of microbes and viruses the infinite-sites approximation of the PRF model induces systematic biases that lead it to underestimate negative selection pressures and mutation rates and erroneously infer positive selection. We introduce two new methods that extend our ability to infer selection pressures and mutation rates at large $\theta$: a finite-site modification of the PRF model and a new technique based on diffusion theory. Our methods can be used to infer not only a "weighted average" of selection pressures acting on a gene sequence, but also the distribution of selection pressures across sites. We evaluate the accuracy of our methods, as well as that of the original PRF approach, by comparison with Wright–Fisher simulations.

Mutation rates and selective pressures are of central importance to evolution. The number and frequency distribution of genetic polymorphisms within a population carry information about these fundamental processes. Polymorphisms at higher frequencies reflect weaker selective pressures (or positive selection), and vice versa. Similarly, a larger number of polymorphisms indicates a higher mutation rate. Thus we can use the polymorphism frequency spectrum observed in genetic sequences sampled from a population to infer the mutation rate and the selection pressure acting on the sequence.

This intuition can be formalized into a rigorous method for estimating selection pressures and mutation rates by calculating the likelihood of sampled polymorphism data as a function of these parameters. The Poisson random field (PRF) model provides an important and widely used method of doing so. The PRF model assumes a panmictic population of constant size, free recombination, infinite sites, no dominance or epistasis, and equal selection pressures at all sites. Under these assumptions, Sawyer and Hartl (1992) showed that the distribution of frequencies of mutant lineages in a population forms a Poisson random field whose properties depend on the selection pressure and the mutation rate. Hartl et al. (1994) and Bustamante et al. (2001) developed a maximum-likelihood method of estimating these parameters from data on the polymorphism frequency spectrum. This method has been widely used to study, for example, purifying selection on synonymous (Hartl et al. 1994; Akashi and Schaeffer 1997; Akashi 1999) and nonsynonymous (Hartl et al. 1994; Akashi 1999) variation, and the evolution of base composition (Lercher et al. 2002; Galtier et al. 2006).

Closely related to these analyses of polymorphism data are methods that calculate, on the basis of the PRF model, the ratio of the expected number of polymorphisms within species to divergence between species for synonymous and nonsynonymous sites [using the idea behind the McDonald–Kreitman test (McDonald and Kreitman 1991)]. These methods discard some of the available data, as they depend only on the number of polymorphisms and not on their full frequency spectrum. However, they are also less sensitive to assumptions (Sawyer and Hartl 1992; Loewe et al. 2006). Such methods have been applied to estimate selection pressures on synonymous variation (Akashi 1995), on nonsynonymous mutations in mitochondrial genomes (Nachman 1998; Rand and Kann 1998; Weinreich and Rand 2000), and on nonsynonymous variation in a variety of nuclear genomes (Bustamante et al. 2002; Sawyer et al. 2003; Bartolome et al. 2005), including humans (Bustamante et al. 2005).

Recent theoretical work has focused on relaxing various assumptions of the original PRF method. These include allowing for dominance (Williamson et al. 2004), population subdivision (Wakeley 2003), changing population size (Williamson et al. 2005), and linkage between sites (Zhu and Bustamante 2005). Several methods for studying the properties of the distribution of selection pressures across sites based on the PRF model have also been developed, using the polymorphism frequency spectrum (Nielsen et al. 2005), the ratio of polymorphism to divergence (Sawyer et al. 2003; Loewe et al. 2006), or several of these methods in conjunction (Bustamante et al. 2003; Piganeau and Eyre-Walker 2003; Boyko et al. 2008).

1 Corresponding author: Department of Biology, University of Pennsylvania, Philadelphia, PA 19104. E-mail: [email protected]

Genetics 180: 2175–2191 (December 2008)

A fundamental assumption of the PRF approach is that mutation is irreversible (the infinite-sites assumption). Thus any particular site can be only transiently polymorphic, and there is no steady-state solution to the evolutionary dynamics at any given site. However, since the constant creation of new polymorphic sites is balanced by older polymorphisms fixing or going extinct, the frequency distribution of polymorphisms across sites currently polymorphic does reach an equilibrium. This approach has important advantages. Since it describes the frequency spectrum of new mutations ("derived alleles"), that is, the frequency spectrum relative to the last ancestral state, it provides a natural framework for analyzing polymorphism data when the ancestral state is known. This ancestral state can often be inferred from an appropriate outgroup.

Despite its advantages, the infinite-sites approximation can present problems. The approximation is reasonable for most data currently available from typical eukaryotic populations. However, in many biologically reasonable parameter regimes, particularly those relevant to bacterial and viral populations, more than one mutational event may contribute to polymorphism at a given site. In this article, we show that under these parameter regimes the infinite-sites assumption causes the PRF method to underestimate negative selection pressures and mutation rates by as much as an order of magnitude. In addition, the PRF method often infers that a gene is under strong positive selection when in fact the gene is experiencing weak negative selection. This problem arises both for inferences based on the polymorphism frequency spectrum and for inferences based on the ratio of within-species polymorphism to between-species divergence, but here we focus exclusively on the former.

In this article, we present two methods that relax the infinite-sites assumption of the PRF method, each with its own advantages and drawbacks. Rather than studying mutant lineages across a sequence, our methods focus on explicit models of the evolutionary dynamics at individual sites. We first present a modification of the PRF method that retains the essential framework, but calculates the frequency distribution of mutant lineages at each site rather than across the whole sequence. We next present an alternative method based on well-known diffusion equations in place of the PRF framework. This alternative framework avoids all of the finite-site biases of the PRF, but it cannot make use of knowledge about the ancestral state. Rather than describing a steady-state frequency distribution of derived nucleotides relative to this ancestral state, it describes the frequency distribution of all four nucleotides possible at each site, a fundamentally different steady state. Both of our methods allow us to estimate the selection pressure and the mutation rate from data on the polymorphism frequency spectrum. In addition, these methods also allow us to infer the distribution of selection pressures across sites.

To assess the accuracy of these methods, we generate polymorphism data from simulated Wright–Fisher populations with known selection pressures and mutation rates. By comparing inferences drawn from these simulated data sets, we demonstrate that our methods extend and improve upon the original PRF approach. Throughout this article, we focus on accurately inferring the sign and strength of negative selection, since the most troubling bias in the original PRF method is erroneous inference of positive selection when mutation rates are large. We focus primarily on situations when the ancestral state at each site cannot be reliably inferred, which is the typical situation when the mutation rate is large.

We emphasize that the effects of finite sites are of practical relevance only when the mutation rate is relatively large ($\theta$ per site $\geq 0.05$). As a result, the methods of inference developed here are not necessary for analyzing human or Drosophila population-genetic data. However, as we shall demonstrate, finite-site effects have significant practical implications when studying the population genetics of viruses, microbes, and some higher eukaryotes, such as sea squirts and starfish, that experience large mutation rates (Drake et al. 1998; Lynch and Conery 2003).

THEORY

The Poisson random field model of polymorphisms: We begin by outlining the PRF model of the site-frequency spectrum developed by Sawyer and Hartl (Sawyer and Hartl 1992; Hartl et al. 1994). This model assumes that mutations occur in a population of effective size $N$ at a Poisson rate $Nu$, where $u$ is the per-sequence mutation rate, and are all subject to selection of strength $s$. The fate of each mutant lineage is modeled by a diffusion approximation to the processes of selection and drift. When a new mutant lineage enters the population, it is assumed to arise at a site that has not previously experienced any mutations (the infinite-sites assumption). Each mutant lineage is assumed to be independent of all others (the free-recombination assumption).

Extending earlier work by Wright (1938) and Moran (1959), Sawyer and Hartl (1992) calculated a steady-state distribution of mutant lineage frequencies. They found that the number of lineages with frequency between $x$ and $x + dx$ is Poisson distributed with mean $f(x)\,dx$, where

$$f(x) = \theta_l \, \frac{1 - e^{-2\gamma(1-x)}}{1 - e^{-2\gamma}} \, \frac{1}{x(1-x)}. \tag{1}$$

Here $\gamma \equiv Ns$ is a measure of the strength of selection on the mutant lineages and $\theta_l \equiv 2Nu$ is twice the population per-sequence mutation rate. The function $f(x)$ is referred to as a Poisson random field. In other words, the number of mutant lineages with frequency between $x_1$ and $x_2$ is a Poisson random variable with mean $\int_{x_1}^{x_2} f(x)\,dx$. In addition, the number of mutant lineages with frequency in $[x_1, x_2]$ is independent (as a random variable) from the number of mutant lineages with frequency in $[y_1, y_2]$, provided these intervals do not intersect. Note that $f(x)$ is not integrable at $x = 0$. This divergence occurs because the steady state arises from a balance between new mutations constantly occurring and older lineages fixing or going extinct. Thus there is no finite, steady-state expression for the number of lineages that have fixed or gone extinct.
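As an illustration of Equation 1, the following minimal Python sketch evaluates the density $f(x)$. The function name `prf_density` and the small-$\gamma$ cutoff used for the neutral limit are assumptions made for this example; they do not come from the original analysis.

```python
import math

def prf_density(x, theta_l, gamma):
    """Mean density f(x) of mutant lineages at frequency x (Equation 1)."""
    if not 0.0 < x < 1.0:
        raise ValueError("x must lie strictly between 0 and 1")
    if abs(gamma) < 1e-12:
        # Neutral limit: the selection factor tends to (1 - x), so f(x) = theta_l / x.
        selection_factor = 1.0 - x
    else:
        # expm1(a) / expm1(b) equals (1 - e^a) / (1 - e^b) and is stable for small gamma.
        selection_factor = (math.expm1(-2.0 * gamma * (1.0 - x))
                            / math.expm1(-2.0 * gamma))
    return theta_l * selection_factor / (x * (1.0 - x))
```

In the neutral limit the density reduces to $\theta_l / x$, which makes the non-integrable divergence at $x = 0$ explicit.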

Hartl et al. (1994) and Bustamante et al. (2001) used Equation 1 as the basis for maximum-likelihood (ML) estimation of the mutation rate $\theta_l$ and selection pressure $\gamma$ from polymorphism data. They imagined sampling $n$ individuals from a population with this steady-state distribution of segregating mutant lineages. They made the infinite-sites assumption that all mutant lineages occur at different sites, consistent with the earlier assumption that each lineage is independent. Since the number of mutant lineages at frequency $x$ in the population is Poisson distributed with mean $f(x)\,dx$, the number of sampled sites containing $i$ mutant nucleotides (we refer to these as $i$-fold mutant sites) is Poisson distributed with mean

$$F(i) = \theta_l \int_0^1 \frac{1 - e^{-2\gamma(1-x)}}{1 - e^{-2\gamma}} \, \frac{1}{x(1-x)} \binom{n}{i} x^i (1-x)^{n-i} \, dx. \tag{2}$$

This equation leads immediately to a maximum-likelihood procedure for estimating $\gamma$ and $\theta_l$ (Bustamante et al. 2001). A set of sequences from $n$ sampled individuals within a population will contain some number, $y_i$, of $i$-fold mutant sites for $0 < i < n$. The set of values $y_1, y_2, \ldots, y_{n-1}$ is called the site-frequency spectrum of the observed data. The probability of a spectrum $\{y_i\}$, given $\gamma$ and $\theta_l$, is

$$L_u(\theta_l, \gamma) = \prod_{i=1}^{n-1} e^{-F(i, \gamma, \theta_l)} \, \frac{[F(i, \gamma, \theta_l)]^{y_i}}{y_i!}. \tag{3}$$

For an observed spectrum $\{y_i\}$ in a particular data set, one maximizes this likelihood over $\theta_l$ and $\gamma$ to estimate the mutation rate and selection pressure.
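Equations 2 and 3 can be sketched numerically as follows. This is a minimal illustration, not the implementation used in the cited work: the function names, the midpoint-rule integration, and the default number of integration steps are all assumptions chosen for simplicity. Very large $|\gamma|$ would overflow the exponentials and would need a more careful treatment.

```python
import math

def expected_count(i, n, theta_l, gamma, steps=2000):
    """Approximate F(i) of Equation 2 with the composite midpoint rule."""
    if not 0 < i < n:
        raise ValueError("i must satisfy 0 < i < n")
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h
        if abs(gamma) < 1e-12:
            sel = 1.0 - x  # neutral limit of the selection factor
        else:
            sel = (math.expm1(-2.0 * gamma * (1.0 - x))
                   / math.expm1(-2.0 * gamma))
        # x**(i-1) * (1-x)**(n-i-1) is the binomial kernel divided by x(1-x);
        # writing it this way keeps the integrand finite on the whole interval.
        total += sel * x ** (i - 1) * (1.0 - x) ** (n - i - 1)
    return theta_l * math.comb(n, i) * total * h

def log_likelihood(spectrum, theta_l, gamma):
    """Log of Equation 3; spectrum[i-1] holds y_i for i = 1..n-1."""
    n = len(spectrum) + 1
    ll = 0.0
    for i, y in enumerate(spectrum, start=1):
        F = expected_count(i, n, theta_l, gamma)
        ll += -F + y * math.log(F) - math.lgamma(y + 1)
    return ll
```

A useful sanity check is the neutral case $\gamma = 0$, where the integral in Equation 2 evaluates to $F(i) = \theta_l / i$. An ML estimate would then be obtained by passing `log_likelihood` to any numerical optimizer or evaluating it on a grid of $(\theta_l, \gamma)$ values.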

The likelihood expression above assumes we know which nucleotide is ancestral and which nucleotide is the mutant ("derived") at each polymorphic site. We refer to this situation as the unfolded case. When we do not have this information, we cannot distinguish between an $i$-fold mutant site and an $(n - i)$-fold mutant site. In this case, a data set will contain some number, $y_i$, of $i$-fold and/or $(n - i)$-fold mutant sites, where $i$ runs between 1 and the largest integer $\leq n/2$. We refer to this as the folded case. In this situation, the likelihood of a particular data set $L_f(\theta_l, \gamma)$ is given by the same expression as for $L_u$, but with the product running from 1 to $n/2$ and with $F(i)$ replaced by $F(i) + F(n - i)$ (except if $i = n/2$).
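The folding rule just described can be written as a short helper. This is a sketch; the function name `fold_expected` and the list layout (entry `i - 1` holds $F(i)$) are assumptions made for the example.

```python
def fold_expected(unfolded):
    """Fold the unfolded means F(1..n-1) into means for i = 1..floor(n/2).

    unfolded[i-1] holds F(i). Each folded entry is F(i) + F(n-i), except
    when i == n/2, where the two classes coincide and F(i) is used once.
    """
    n = len(unfolded) + 1
    folded = []
    for i in range(1, n // 2 + 1):
        if 2 * i == n:
            folded.append(unfolded[i - 1])
        else:
            folded.append(unfolded[i - 1] + unfolded[n - i - 1])
    return folded
```

Folding preserves the total expected number of polymorphic sites, so the sum of the folded means equals the sum of the unfolded ones.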

The infinite-sites approximation: The PRF model makes two key assumptions: that each site is independent of all the others and that two mutant lineages never segregate at the same site. The former assumption is equivalent to assuming free recombination between all segregating sites, and it has been investigated elsewhere (Akashi and Schaeffer 1997; Bustamante et al. 2001); we return to it in the discussion. The main focus of this article, however, is on the second assumption of the PRF, that there are an infinite number of sites, which has not been discussed much in the literature. This assumption can lead to problems that are most apparent when considering how the PRF method treats "multiply polymorphic" sites, those that exhibit more than two types of segregating nucleotides. Polymorphisms of this type are indeed observed in data analyzed by the PRF method (Hartl et al. 1994). We refer to the configuration of a particular site as $(a, b, c, d)$, where $a$, $b$, $c$, and $d$ are the numbers of sampled sequences that exhibit each of the four nucleotides. When we have unfolded data, $a$ is the frequency of the ancestral nucleotide and $b$, $c$, and $d$ are the frequencies of the three possible mutant nucleotides, in order of decreasing frequency. For folded data, $a$, $b$, $c$, and $d$ are the frequencies of all four possible nucleotides, again in order of decreasing frequency. In the original PRF analysis, a site with a (12, 1, 1, 0) configuration, for example, is treated identically to a site with a (12, 2, 0, 0) configuration. Such a treatment is incorrect: the former configuration can arise only from two mutant lineages, whereas the latter configuration could be caused by a single mutant lineage (presumably at relatively high frequency in the population). Yet the PRF analysis excludes the first possibility and treats both configurations as if they were (12, 2, 0, 0) sites (Hartl et al. 1994; Bustamante et al. 2001). Similarly, the PRF method treats (10, 2, 2, 0), (10, 3, 1, 0), and (10, 2, 1, 1) sites as if they were in a (10, 4, 0, 0) configuration, etc.
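The collapsing of multiply polymorphic sites described above can be made concrete with a small sketch for unfolded configurations. Both function names are hypothetical and serve only to illustrate the information that the original treatment discards.

```python
def collapse_as_original_prf(config):
    """Pool all mutant nucleotides into one class, as the original PRF does."""
    a, b, c, d = config
    return (a, b + c + d, 0, 0)

def min_mutant_lineages(config):
    """Smallest number of mutant lineages that can explain an unfolded site.

    Each distinct mutant nucleotide requires at least one lineage.
    """
    _, b, c, d = config
    return sum(1 for count in (b, c, d) if count > 0)
```

For example, (12, 1, 1, 0) and (12, 2, 0, 0) collapse to the same configuration, even though the first requires at least two mutant lineages and the second could arise from one.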

The infinite-sites approximation also affects sites that exhibit only two types of segregating nucleotides. When multiple lineages are segregating at the same site, an $i$-fold and a $k$-fold mutant lineage sampled at the same site can lead to an apparently $(i + k)$-fold mutant site, if the two lineages happen to be mutations to the same nucleotide. In other words, a (12, 2, 0, 0) site could reflect two low-frequency mutant lineages or one higher-frequency lineage, but the PRF method incorrectly assumes that only the latter is possible. This leads to systematic biases in the estimates of selection and mutation obtained by the PRF method: by disregarding the possibility that an apparently high-frequency mutant lineage is actually several lower-frequency mutant lineages, the PRF method underestimates the mutation rate $\theta$ and the strength of negative selection $|s|$.

Since a mutant lineage will survive on average $O(\ln[1/|s|])$ generations before fixing or going extinct (Desai and Fisher 2007), and mutations arise at rate $N\mu$ per site, the infinite-sites approximation will be valid only when $N\mu \ln[1/|s|] \ll 1$ (although for sufficiently small samples the condition is weaker, since we may never sample multiple lineages even though they are segregating at the same site). This condition is sometimes violated in real populations, particularly of viruses and microbes (e.g., Hartl et al. 1994).

We stress that in parameter regimes relevant to most eukaryotes, including humans and Drosophila, finite-sites biases are negligible. But in parameter regimes relevant to bacteria and viruses, sites with multiple segregating mutations have a disproportionate weight in estimates of selection and mutation, and thus they can lead to errors of an order of magnitude or more (Figure 1). In such circumstances, the PRF method also erroneously infers positive selection in many situations where selection is actually negative (Figure 1).

A per-site Poisson random field model of polymorphisms: In this section, we describe a method that extends the PRF framework to the case of finite sites and takes full advantage of the information provided by the frequencies of all possible configurations at a site. The basic idea behind this modified approach is to recast the PRF framework on a per-site basis. We describe the steady-state frequency distribution of mutant lineages at a given site. From this, we can calculate the probability that a sample of $n$ individuals will contain any configuration of mutants at that site. As in the original PRF method, we retain the assumption of free recombination, so that the DNA sequence is a collection of independent sites. Thus our per-site analysis leads directly to ML estimation of mutation rate and selection strength.

We begin by recasting the PRF expression for the steady-state distribution of mutant lineages to describe the frequencies of mutant lineages at a single site. At a given site, we have

$$f(x) = \theta_s \, \frac{1 - e^{-2\gamma(1-x)}}{1 - e^{-2\gamma}} \, \frac{1}{x(1-x)}, \tag{4}$$

where $\theta_s = 2N\mu$ is the per-site value of $\theta$ and $\mu$ is the per-site mutation rate. Using this formula to describe multiple lineages at a single site is somewhat peculiar, because this result assumes that all mutant lineages behave independently of one another. Clearly this is not strictly true, since the mutant lineages are segregating at the same site. However, provided two mutant lineages rarely achieve simultaneous high frequencies in the population, the assumption of independent mutant lineages is a good approximation. This assumption of noninteracting mutant lineages will often hold even when the other aspects of the infinite-sites approximation are violated.

Analogous to the original PRF method, at a single site the number of mutant lineages that are observed $i$ times in a sample of $n$ sequences ("$i$-fold mutant lineages") is Poisson distributed with mean

$$F(i) = \theta_s \int_0^1 \frac{1 - e^{-2\gamma(1-x)}}{1 - e^{-2\gamma}} \, \frac{1}{x(1-x)} \binom{n}{i} x^i (1-x)^{n-i} \, dx. \tag{5}$$

On the basis of this, we can calculate the probability of any particular polymorphism configuration at a site.

We begin by describing this calculation in the unfolded case. The probability that a site is monomorphic is just the probability that no $i$-fold mutant lineages are found at that site, for all $i$ between 1 and $n - 1$. This is

$$P_{\mathrm{mono}} = P_{(n,0,0,0)} = e^{-F(1)} e^{-F(2)} \cdots e^{-F(n-1)}. \tag{6}$$

We define $M \equiv P_{\mathrm{mono}}$ to denote this monomorphism probability. The probability that a site exhibits an $(n - 1, 1, 0, 0)$ configuration is the probability that a single 1-fold mutant lineage is sampled, and no 2-fold or higher lineages are found,

$$P_{(n-1,1,0,0)} = F(1)\,M, \tag{7}$$

where $M$ denotes the probability of monomorphism as above. The probability that a site exhibits an $(n - 2, 2, 0, 0)$ configuration is more complex. This configuration could arise from a single 2-fold sampled lineage (as assumed under the standard PRF method) or it could arise from two 1-fold sampled mutant lineages that happen to involve mutations to the same nucleotide. Hence its probability is

$$P_{(n-2,2,0,0)} = \frac{F(1)^2}{2!}\,M\,\frac{1}{3} + F(2)\,M, \tag{8}$$

where the factor of $\frac{1}{3}$ is the probability that two mutations result in the same nucleotide. This expression assumes that mutations between all possible nucleotides are equally likely; the obvious generalization applies when there are mutational biases, which we do not discuss further. Similarly the probability of an $(n - 2, 1, 1, 0)$ configuration is

$$P_{(n-2,1,1,0)} = \frac{F(1)^2}{2!}\,M\,\frac{2}{3}. \tag{9}$$
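Equations 6 through 9 translate directly into code once the means $F(i)$ are available. The sketch below takes them as a list; the function name and the dictionary keys are assumptions chosen for readability.

```python
import math

def simple_site_probabilities(F):
    """Probabilities of the simplest unfolded configurations (Equations 6-9).

    F[i-1] holds F(i) for i = 1..n-1, so the sample must have n >= 3.
    """
    M = math.exp(-sum(F))  # Equation 6: no sampled mutant lineage of any size
    F1, F2 = F[0], F[1]
    return {
        "(n,0,0,0)": M,
        "(n-1,1,0,0)": F1 * M,                           # Equation 7
        "(n-2,2,0,0)": (F1 ** 2 / 2) * M / 3 + F2 * M,   # Equation 8
        "(n-2,1,1,0)": (F1 ** 2 / 2) * M * 2 / 3,        # Equation 9
    }
```

The two doubly mutant configurations together account for every way of sampling exactly two mutant nucleotides, so their probabilities sum to $M\,[F(1)^2/2! + F(2)]$.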

The probabilities of more complex configurations can be calculated in a similar way. The probability of an $(n - 4, 4, 0, 0)$ configuration, for example, is the probability of four 1-fold lineages to the same nucleotide plus the probability of two 1-fold lineages and a 2-fold lineage, plus the probability of two 2-fold lineages, plus the probability of one 1-fold lineage and one 3-fold lineage, plus the probability of a single 4-fold lineage. We have

$$P_{(n-4,4,0,0)} = M \left[ \frac{F(1)^4}{4!}\,\frac{1}{3^3} + \frac{F(1)^2 F(2)}{2!}\,\frac{1}{3^2} + \frac{F(2)^2}{2!}\,\frac{1}{3} + F(1)F(3)\,\frac{1}{3} + F(4) \right]. \tag{10}$$

In general, the probability of a particular configuration is given by the sum of the probabilities of all possible partitions of $n$ that lead to that configuration.
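For unfolded configurations with a single mutant nucleotide, $(n - k, k, 0, 0)$, the sum over partitions can be sketched as below. This is an illustrative enumeration written for this example and is not the authors' program; the function names are assumptions. Each integer partition of $k$ corresponds to one set of lineage sizes, and every lineage after the first must hit the same nucleotide as the first, contributing a factor of $1/3$.

```python
import math
from collections import Counter

def partitions(k, largest=None):
    """Yield the integer partitions of k as non-increasing tuples."""
    if largest is None:
        largest = k
    if k == 0:
        yield ()
        return
    for first in range(min(k, largest), 0, -1):
        for rest in partitions(k - first, first):
            yield (first,) + rest

def prob_single_mutant_nucleotide(k, F):
    """P(n-k, k, 0, 0) in the unfolded case; F[j-1] holds F(j), with k <= len(F)."""
    M = math.exp(-sum(F))
    total = 0.0
    for parts in partitions(k):
        term = 1.0
        # Poisson weight: F(j)^m / m! for m sampled lineages of size j.
        for size, mult in Counter(parts).items():
            term *= F[size - 1] ** mult / math.factorial(mult)
        # All lineages after the first must mutate to the same nucleotide.
        term *= (1.0 / 3.0) ** (len(parts) - 1)
        total += term
    return M * total
```

With $k = 2$ this reproduces Equation 8, and with $k = 4$ it reproduces the five terms of Equation 10.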

In the folded case, these calculations become fairly complex. The probability of a (12, 2, 0, 0) folded configuration, for example, includes the probability of a single 12-fold sampled lineage, as well as two 6-fold sampled lineages, and so on. Thousands of terms may arise in the expression for the probability of a particular configuration, even for moderate values of $n$. We do not quote any of these expressions here, but rather we have developed a computer program to output symbolic expressions for the probabilities of all possible folded as well as unfolded configurations, for a given sample size $n$. The algorithm used in this program is described in the appendix, and the program is freely available on request.

These probabilities of site configurations form the basis of maximum-likelihood parameter estimation. The probability of a data set with $L$ total sites, including $y_{a,b,c,d}$ sites in configuration $(a, b, c, d)$, is given by

$$\frac{L!}{\prod y_{a,b,c,d}!} \prod P_{a,b,c,d}^{\,y_{a,b,c,d}}, \tag{11}$$

where the products are over all possible configurations. Given a particular data set, we maximize this probability over $\theta_s$ and $\gamma$ to find the ML estimate of these parameters. In the original PRF method, this maximization is particularly simple, because the ML estimate for $\theta$ can be expressed analytically in terms of the ML estimate for $\gamma$, leaving a one-dimensional numerical maximization procedure to estimate $\gamma$. In our per-site PRF method, however, a full two-dimensional numerical maximization is required to find the ML estimates of $\theta_s$ and $\gamma$. We have implemented a computer program to perform this maximization; it is freely available on request.
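The multinomial probability of Equation 11 is most conveniently evaluated on the log scale. The following sketch assumes the configuration probabilities have already been computed and are supplied as a dictionary; the function name and data layout are assumptions and do not describe the authors' program.

```python
import math

def site_log_likelihood(counts, probs):
    """Log of Equation 11.

    counts maps each configuration (a, b, c, d) to its number of sites y;
    probs maps the same configurations to their probabilities P.
    """
    L = sum(counts.values())
    ll = math.lgamma(L + 1)  # log L!
    for config, y in counts.items():
        ll += y * math.log(probs[config]) - math.lgamma(y + 1)
    return ll
```

In a full analysis, `probs` would be recomputed for each trial pair $(\theta_s, \gamma)$ and the resulting log likelihood passed to a two-dimensional numerical optimizer.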

Our per-site PRF method relaxes most of the consequences of the infinite-sites approximation inherent in the original PRF estimation procedure. We allow for the possibility that multiple mutant lineages contribute to the polymorphism observed at a single site. Thus we avoid the systematic underestimation of $\theta$ and $\gamma$, and incorrect inference of positive selection, that affect the traditional PRF when mutation rates are large (Figure 1). This new method also uses all of the data available in the sample, including the differences between $(n - 2, 2, 0, 0)$ and $(n - 2, 1, 1, 0)$ sites and the number of monomorphic sites (note the infinite-sites approximation means that the original PRF cannot predict the number of monomorphic sites and hence makes no use of these data). It does still retain one aspect of the infinite-sites approximation: it assumes that mutant lineages segregating at the same site do not interfere with each other. This is never strictly true, but it is a good approximation unless multiple mutant lineages reach high frequency at a given site at the same time. Note that because of this assumption, the probabilities of all possible configurations $(a, b, c, d)$ described above do not precisely sum to unity, because our approach allows a typically small but nonzero probability of multiple mutant processes adding to more than $n$ sampled individuals. The no-interference assumption is substantially weaker than the infinite-sites assumption, and thus our revised sampling method extends the applicability of the PRF framework.

A per-site diffusion model of polymorphisms: In this section, we describe a method that shifts fundamentally from the PRF framework and avoids all of the problems associated with the infinite-sites assumption. Rather than studying the distribution of the frequencies of mutant lineages, we focus on the evolutionary dynamics at each individual site, without keeping track of individual mutant lineages. We develop this into a maximum-likelihood estimation of $\gamma$ and $\theta$ from polymorphism data, which requires neither the infinite-sites nor the no-interference approximation described above. As in the original PRF method, we assume free recombination between sites.

At an individual site, we imagine that one nucleotide is preferred and the other three have the same fitness disadvantage $s$ ($s < 0$). We assume that mutations occur at rate $\mu$ and hence at rate $\mu/3$ between any two specific nucleotides (i.e., no mutational biases). These assumptions simplify the discussion, but are not essential. In fact, one advantage of this approach is that these assumptions can be easily relaxed with obvious generalizations (noted below).
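The single-site model just described can also be simulated directly. The sketch below is a minimal haploid Wright–Fisher simulation written for illustration only; it is not the simulation code used in this article. The function name, the haploid assumption, the fitness values $1$ and $1 + s$, and the order of events (selection, then mutation, then drift) are all assumptions of this example.

```python
import random

def wright_fisher_site(N, mu, s, generations, seed=0):
    """Simulate nucleotide counts at one site; index 0 is the preferred nucleotide."""
    rng = random.Random(seed)
    counts = [N, 0, 0, 0]
    fitness = [1.0, 1.0 + s, 1.0 + s, 1.0 + s]  # s < 0 penalizes indices 1..3
    for _ in range(generations):
        # Selection: weight each nucleotide by its count times its fitness.
        weights = [c * w for c, w in zip(counts, fitness)]
        total = sum(weights)
        freqs = [w / total for w in weights]
        # Mutation: rate mu per individual, i.e., mu/3 to each other nucleotide.
        mutated = [f * (1.0 - mu) + (1.0 - f) * mu / 3.0 for f in freqs]
        # Drift: multinomial resampling of N individuals.
        draws = rng.choices(range(4), weights=mutated, k=N)
        counts = [draws.count(j) for j in range(4)]
    return counts
```

Sampling $n$ individuals from the returned counts at many independent sites would yield configurations $(a, b, c, d)$ of the kind analyzed above.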

We can analyze the process of mutation, selection, and drift at a single site with a three-dimensional diffusion approximation and calculate the joint steady-state probability distribution of the frequencies of the four possible nucleotides at the site. This then leads naturally to the likelihood of any configuration of polymorphism data at the site as a function of γ and θ_s and hence to ML estimation of these parameters from data. Before exploring this fully, we first analyze a simplification of this model, in which we treat all three disfavored nucleotides as a single class. Such a treatment reduces to a standard one-dimensional diffusion process whose steady-state probability distribution describes the frequency of the preferred nucleotide vs. the sum of the frequencies of the disfavored ones; this treatment is essentially a steady-state version of Williamson et al. (2005). This simplified model is not likely to be particularly useful, because it requires us to know a priori the identity of the preferred nucleotide, which in practice is part of what we wish to infer. This simplification also discards some of the information in the data [e.g., not making use of the difference between (12, 1, 1, 0) and (12, 2, 0, 0) sites]. However, first studying this analytically and computationally simpler model makes the analysis of the full model clearer. We therefore begin by describing the one-dimensional method and then turn to the three-dimensional method.

The one-dimensional diffusion model: We begin by describing a simplified diffusion approach that calculates the frequency distribution of favored vs. disfavored nucleotides, similar in spirit to that introduced by Mustonen and Lässig (2007). As noted above, we for simplicity assume that one nucleotide is preferred and the other three nucleotides are disfavored. We denote the sum of the frequencies of the three disfavored alleles by x; the frequency of the preferred nucleotide is 1 − x.

We assume that mutation, selection, and random drift occur at each site according to standard Wright–Fisher dynamics. Thus the probability distribution of x can be described by the diffusion equation

$$
\frac{\partial}{\partial t} f(x,t) = \frac{1}{2}\frac{\partial^2}{\partial x^2}\left[v(x)\,f(x,t)\right] - \frac{\partial}{\partial x}\left[m(x)\,f(x,t)\right], \tag{12}
$$

where f(x, t) is the probability that the disfavored nucleotides have frequency x at time t and m(x) and v(x) are given by

$$
m(x) = s\,x(1-x) + \mu(1-x) - \frac{\mu}{3}\,x \tag{13}
$$

$$
v(x) = \frac{x(1-x)}{N}, \tag{14}
$$

where s is the selection coefficient against the disfavored nucleotides (s < 0) and μ is the per-site mutation rate per individual per generation. This diffusion equation is well known (Ewens 2004) and has the steady-state solution derived by a zero-flux boundary condition:

$$
f(x) = C\,x^{\theta_s-1}(1-x)^{\theta_s/3-1}e^{2\gamma x}. \tag{15}
$$

Here C is a (θ_s- and γ-dependent) normalization factor, and as before θ_s ≡ 2Nμ and γ ≡ Ns.
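As a quick numerical sanity check on Equations 12–15, the following sketch evaluates the probability flux m(x)f(x) − (1/2) d[v(x)f(x)]/dx for the unnormalized steady-state density and confirms that it vanishes to within finite-difference error. The function names, the step size h, and the parameter values (N = 1000, s = −0.002, μ = 0.0005, so that γ = −2 and θ_s = 1) are illustrative assumptions and do not come from the original analysis.

```python
import math

def drift(x, s, mu):
    """Mean change m(x) per generation, Eq. 13."""
    return s * x * (1 - x) + mu * (1 - x) - mu * x / 3

def variance(x, N):
    """Variance v(x) of the change per generation, Eq. 14."""
    return x * (1 - x) / N

def stationary_density(x, s, mu, N):
    """Unnormalized steady-state density of Eq. 15 (C is set to 1)."""
    theta_s = 2 * N * mu
    gamma = N * s
    return x ** (theta_s - 1) * (1 - x) ** (theta_s / 3 - 1) * math.exp(2 * gamma * x)

def probability_flux(x, s, mu, N, h=1e-6):
    """Flux m*f - (1/2) d(v*f)/dx, using a central difference (assumed step h)."""
    def vf(u):
        return variance(u, N) * stationary_density(u, s, mu, N)
    d_vf = (vf(x + h) - vf(x - h)) / (2 * h)
    return drift(x, s, mu) * stationary_density(x, s, mu, N) - 0.5 * d_vf

if __name__ == "__main__":
    N, s, mu = 1000, -0.002, 0.0005  # illustrative values only
    for x in (0.1, 0.3, 0.5, 0.7, 0.9):
        f = stationary_density(x, s, mu, N)
        # Flux relative to the scale mu*f; should be close to zero.
        print(x, probability_flux(x, s, mu, N) / (mu * f))
```

Because the flux is checked pointwise, the normalization constant C plays no role here.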

If the frequency of disfavored nucleotides at a site equals x, the probability that we find i such nucleotides in a sample of n individuals is $\binom{n}{i}x^i(1-x)^{n-i}$. Averaging over x, the overall probability that we sample i disfavored nucleotides at a given site is

$$
F(i) = \binom{n}{i}\int_0^1 C\,x^{\theta_s+i-1}(1-x)^{\theta_s/3+n-i-1}e^{2\gamma x}\,dx. \tag{16}
$$

This integral, including the calculation of the normalization factor C, can be solved analytically. We find

$$
F(i) = \binom{n}{i}\,\frac{\Gamma(n-i+\theta_s/3)\,\Gamma(i+\theta_s)\;{}_1F_1(i+\theta_s;\,n+4\theta_s/3;\,2\gamma)}{\Gamma(\theta_s/3)\,\Gamma(\theta_s)\;{}_1F_1(\theta_s;\,4\theta_s/3;\,2\gamma)}, \tag{17}
$$

where Γ is Euler's Gamma function and ₁F₁ is a hypergeometric function.
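To make the closed form concrete, here is a minimal sketch that evaluates F(i) using only the Python standard library. Two points are assumptions of this sketch rather than statements from the text. First, ₁F₁ is computed from its power series, with the Kummer transformation applied for negative arguments. Second, the Gamma-function prefactor is written as a ratio of Beta functions, which follows from carrying out the integral in Equation 16 directly; this agrees with Equation 17 if ₁F₁ there is read as the regularized function ₁F₁(a; b; z)/Γ(b). The function names are hypothetical.

```python
import math

def hyp1f1(a, b, z, tol=1e-15, max_terms=10000):
    """Confluent hypergeometric function 1F1(a; b; z) from its power series.

    For z < 0 the Kummer transformation 1F1(a;b;z) = e^z 1F1(b-a;b;-z)
    is applied first, so the summed terms are positive when b > a > 0.
    """
    if z < 0:
        return math.exp(z) * hyp1f1(b - a, b, -z, tol, max_terms)
    term, total = 1.0, 1.0
    for k in range(max_terms):
        term *= (a + k) / (b + k) * z / (k + 1)
        total += term
        if abs(term) < tol * abs(total):
            break
    return total

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def sample_prob(i, n, theta_s, gamma):
    """F(i): probability of sampling i disfavored nucleotides among n (Eq. 16)."""
    a, b = theta_s, theta_s / 3
    log_ratio = log_beta(i + a, n - i + b) - log_beta(a, b)
    hyp_ratio = hyp1f1(i + a, n + a + b, 2 * gamma) / hyp1f1(a, a + b, 2 * gamma)
    return math.comb(n, i) * math.exp(log_ratio) * hyp_ratio

if __name__ == "__main__":
    n, theta_s, gamma = 14, 0.1, -2.0  # illustrative values only
    spectrum = [sample_prob(i, n, theta_s, gamma) for i in range(n + 1)]
    print(spectrum, sum(spectrum))
```

Since the binomial sampling probabilities sum to one for every x, the values F(0), ..., F(n) should sum to one; this gives a convenient check on any implementation.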

The expression above leads immediately to a maximum-likelihood method for estimating γ and θ_s in the unfolded case. Note, however, that in contrast to the original PRF method, this model describes a steady state of the frequency distribution of preferred relative to unpreferred nucleotides, rather than derived vs. ancestral nucleotides. This is a very different sort of steady state, a point to which we return in more detail below. Since this model refers only to the preferred and unpreferred states at each site, it makes sense to apply it to unfolded data only if we assume that the ancestral state is preferred. In a sample of n sequences each of length L, we count the number of sites at which i disfavored nucleotides are sampled, y_i, for 0 ≤ i ≤ n. Since all sites are assumed independent, each with the polymorphism frequency distribution described above, the likelihood of the data given the parameters is

$$
\mathcal{L}_u(\theta_s,\gamma) = \frac{L!}{\prod_{i=0}^{n} y_i!}\,\prod_{i=0}^{n} F(i)^{y_i}. \tag{18}
$$

For any set of polymorphism data, it is straightforward to maximize this function numerically, producing ML estimates of θ_s and γ. As with the per-site PRF method, this procedure involves a two-dimensional maximization routine. Simulated data and fits confirm that this method produces accurate and unbiased estimates of mutation rate and selection strength (data not shown).
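The likelihood in Equation 18 is a multinomial likelihood over the counts y_0, ..., y_n, so its logarithm is simple to code. The sketch below assumes that some function `site_probs(theta_s, gamma)` (a hypothetical name) returns the list [F(0), ..., F(n)], and it replaces a proper two-dimensional optimizer with a plain grid search purely for illustration.

```python
import math

def log_likelihood(counts, probs):
    """Logarithm of Eq. 18 for site counts y_i and site probabilities F(i)."""
    L = sum(counts)
    out = math.lgamma(L + 1) - sum(math.lgamma(y + 1) for y in counts)
    for y, p in zip(counts, probs):
        if y > 0:  # categories with no observed sites contribute nothing
            out += y * math.log(p)
    return out

def maximize_on_grid(counts, site_probs, grid):
    """Return the (theta_s, gamma) pair on the grid with the highest likelihood.

    site_probs(theta_s, gamma) is assumed to return [F(0), ..., F(n)].
    """
    return max(grid, key=lambda point: log_likelihood(counts, site_probs(*point)))
```

In practice the grid search would be replaced by a numerical maximization routine; the grid version is shown only because it makes the structure of the two-dimensional problem explicit.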

When we do not know which nucleotide is preferred at each site, which will be typical in most microbial applications, we must use the "folded" version of the data. This presents difficulties. Imagine a site with an (a, b, c, d) polymorphism configuration. We might naively suppose that since any of the four nucleotides could be the preferred one, the probability of these data is simply F(b + c + d) + F(a + c + d) + F(a + b + d) + F(a + b + c). However, this is not the case. For example, if a is indeed the preferred nucleotide, then F(b + c + d) does not equal the probability that the three disfavored nucleotides will form a (b, c, d) configuration. Rather, it equals the sum of the probabilities that the three disfavored nucleotides will form a configuration (i, j, k), summed over all i, j, k triplets that sum to b + c + d. This is a serious problem, because the difference between F(b + c + d) and the probability of the data depends on b + c + d, and hence it is not identical for all four possible preferred nucleotides. Simply assuming that the most common nucleotide is the preferred one (Hartl et al. 1994) is a reasonable approach. But this approach will be inaccurate for sites where the most common nucleotide is not overwhelmingly so; when this situation describes a substantial fraction of sites, the method will fail. As a result, the one-dimensional diffusion framework does not allow for a rigorous ML estimate of parameters with folded data.

To perform rigorous ML fits to folded frequency data we must turn to a three-dimensional diffusion method.

The three-dimensional diffusion model: Rather than considering all disfavored nucleotides as a single class, we can instead keep track of the evolutionary dynamics of all four possible nucleotides at a site. To do so, we assume the standard four-allele Wright–Fisher dynamics, with mutation at rate μ/3 between any two particular nucleotides, and selection acting with strength s against the three disfavored nucleotides. The dynamics can then be described by a three-dimensional diffusion equation for the joint distribution of the frequencies of the three disfavored alleles x_1, x_2, and x_3, f(x_1, x_2, x_3, t) (where the preferred allele has frequency x_0 = 1 − x_1 − x_2 − x_3). We have

$$
\frac{\partial}{\partial t} f(x_1,x_2,x_3,t) = \frac{1}{2}\sum_{i=1}^{3}\sum_{j=1}^{3}\frac{\partial^2}{\partial x_i\,\partial x_j}\left[V_{ij}\,f\right] - \sum_{i=1}^{3}\frac{\partial}{\partial x_i}\left[M_i\,f\right], \tag{19}
$$

where

$$
M_i = s\,x_i\left(1-\sum_{j=1}^{3}x_j\right) + \frac{\mu(1-4x_i)}{3} \tag{20}
$$

$$
V_{ii} = \frac{x_i(1-x_i)}{N} \tag{21}
$$

$$
V_{ij} = -\frac{x_i x_j}{N}, \quad (i \neq j). \tag{22}
$$

This is a somewhat less well-known diffusion equation (Wright 1949; Watterson 1977); the steady-state solution is

$$
f(x_1,x_2,x_3) = C\left[x_1x_2x_3(1-x_1-x_2-x_3)\right]^{\zeta}e^{2\gamma(x_1+x_2+x_3)}, \tag{23}
$$

where C is a normalization factor, and we have defined ζ = θ_s/3 − 1.

Given the frequencies x_0, x_1, x_2, and x_3 of nucleotides in the population, the probability of sampling a site in an unordered configuration (n_0, n_1, n_2, n_3) in a sample of n individuals (adopting the convention that the first nucleotide listed is the preferred one) is just the multinomial probability

$$
\frac{n!}{n_0!\,n_1!\,n_2!\,n_3!}\,(1-x_1-x_2-x_3)^{n_0}\,x_1^{n_1}x_2^{n_2}x_3^{n_3}. \tag{24}
$$

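Averaging over f, we therefore find that the probability of sampling a site in an unordered (n_0, n_1, n_2, n_3) configuration is

\begin{align}
P_{n_0,n_1,n_2,n_3} &= C\,\frac{n!}{n_0!\,n_1!\,n_2!\,n_3!}\int_0^1\!\int_0^{1-x_1}\!\int_0^{1-x_1-x_2} \tag{25}\\
&\quad\times e^{2\gamma(x_1+x_2+x_3)}\,x_1^{\zeta+n_1}x_2^{\zeta+n_2}x_3^{\zeta+n_3}\left[1-x_1-x_2-x_3\right]^{\zeta+n_0}\,dx_3\,dx_2\,dx_1. \tag{26}
\end{align}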
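One limiting case of the integral above can be written down without numerical integration and is useful for testing code: at γ = 0 the steady-state density is a symmetric Dirichlet distribution with parameter ζ + 1 = θ_s/3, so the sampling probability reduces to the standard Dirichlet-multinomial formula. The sketch below covers only this neutral special case; it is offered as a reference value for checking an integration routine, not as a substitute for the general calculation. The function names are hypothetical.

```python
import math

def neutral_config_prob(ns, theta_s):
    """P_{n0,n1,n2,n3} at gamma = 0 (Dirichlet-multinomial special case)."""
    alpha = theta_s / 3  # Dirichlet parameter, equal to zeta + 1
    n = sum(ns)
    # Multinomial coefficient n! / (n0! n1! n2! n3!)
    log_p = math.lgamma(n + 1) - sum(math.lgamma(k + 1) for k in ns)
    # Ratio of Dirichlet normalizing constants after and before sampling
    log_p += math.lgamma(4 * alpha) - math.lgamma(n + 4 * alpha)
    log_p += sum(math.lgamma(k + alpha) - math.lgamma(alpha) for k in ns)
    return math.exp(log_p)

def all_configs(n):
    """All unordered configurations (n0, n1, n2, n3) with n0+n1+n2+n3 = n."""
    return [(a, b, c, n - a - b - c)
            for a in range(n + 1)
            for b in range(n + 1 - a)
            for c in range(n + 1 - a - b)]

if __name__ == "__main__":
    probs = [neutral_config_prob(c, 0.5) for c in all_configs(14)]
    print(len(probs), sum(probs))
```

For a sample of n = 14 there are 680 unordered configurations, and their neutral probabilities should sum to one.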

If we have unfolded data and wish to assume that the ancestral state is preferred, we can use unfolded ML inference. In this case, the probability of finding a site in an ordered unfolded configuration (a, b, c, d) (where by convention a is the number of individuals that have the preferred nucleotide and b ≥ c ≥ d) is

$$
P^{u}_{a,b,c,d} = \sum_{\{n_0,n_1,n_2,n_3\}} P_{n_0,n_1,n_2,n_3}, \tag{27}
$$

where the sum is over all unordered configurations (n_0, n_1, n_2, n_3) that give rise to the ordered unfolded configuration (a, b, c, d).

If on the other hand we do not have a reliable outgroup (or choose to ignore the information from an outgroup on the ancestral state), we can use folded ML inference. Here, the probability of sampling a site in the ordered configuration (a, b, c, d) is just the sum of the probabilities assuming that each of the four possible nucleotides is preferred. As before, we adopt the convention that ordered folded configurations are written as (a, b, c, d) with a ≥ b ≥ c ≥ d. The probability of a folded configuration $P^{f}_{a,b,c,d}$ is then

$$
P^{f}_{a,b,c,d} = \sum_{\{n_0,n_1,n_2,n_3\}} P_{n_0,n_1,n_2,n_3}, \tag{28}
$$

where in this case the sum is over all unordered configurations that give rise to the ordered folded configuration (a, b, c, d).
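The sums in Equations 27 and 28 can be illustrated by listing their terms explicitly. The sketch below rests on one interpretation, which is an assumption here: the unordered configurations that "give rise to" an ordered configuration are taken to be the distinct rearrangements of its counts, with the first entry held fixed in the unfolded case. The function names, and the callable `unordered_prob` standing in for $P_{n_0,n_1,n_2,n_3}$, are hypothetical.

```python
from itertools import permutations

def folded_members(config):
    """Distinct (n0, n1, n2, n3) tuples that fold to the ordered configuration."""
    return sorted(set(permutations(config)), reverse=True)

def unfolded_members(config):
    """Distinct tuples with n0 = a held fixed and (b, c, d) rearranged."""
    a, rest = config[0], config[1:]
    return sorted({(a,) + p for p in permutations(rest)}, reverse=True)

def config_prob(config, unordered_prob, folded=True):
    """Sum unordered_prob over the members of an ordered configuration."""
    members = folded_members(config) if folded else unfolded_members(config)
    return sum(unordered_prob(m) for m in members)

if __name__ == "__main__":
    print(folded_members((12, 1, 1, 0)))
    print(unfolded_members((12, 1, 1, 0)))
```

Under this reading, a folded (12, 1, 1, 0) site collects 12 terms, while the corresponding unfolded site collects 3.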

For either folded or unfolded data, given n samples of a sequence L sites long, with $y_{a,b,c,d}$ sites in an (a, b, c, d) polymorphism configuration, the likelihood of the data is

$$
\mathcal{L}(\theta_s,\gamma) = \frac{L!}{\prod y_{a,b,c,d}!}\,\prod\left[P_{a,b,c,d}\right]^{y_{a,b,c,d}}, \tag{29}
$$

where the products are taken over all possible configurations (a, b, c, d) and $P_{a,b,c,d}$ is the folded or unfolded probability defined above. We can numerically maximize this function to find the ML estimates of θ_s and γ. We have implemented a computer program to perform this maximization; it is freely available on request.

Variable selection pressures across sites: Both the original PRF method and the per-site methods we have proposed in this article assume that all sites experience the same selective pressure. In reality, we expect that there is some distribution of selective pressures across sites. Hartl et al. (1994) suggest that in this case the ML estimate of γ from their PRF method reflects a weighted average selection pressure across the sites, but the nature of this weighting is not well understood. Almost no weight is given to sites at which |γ| ≫ 1, because these sites will likely be monomorphic and hence ignored by the original PRF method. It is unclear how sites with different values of γ of order 1 will be weighted or how the presence of some effectively neutral sites (|γ| ≪ 1) will change the ML estimate. Piganeau and Eyre-Walker (2003), Nielsen et al. (2005), and Boyko et al. (2008) have addressed these questions by introducing procedures that allow for inference of some aspects of the distribution of selective coefficients across sites within the PRF framework. In this section we describe a similar generalization in our models.

The issue of variable γ across sites is of particular concern for the methods we have proposed, because these methods make use of the monomorphism data. If some number L_l of sites are effectively lethal (i.e., have |γ| ≫ 1), these sites will all be monomorphic, which will tend to depress our ML estimate of θ_s and increase our estimate of |γ|. Fortunately, our methods are able to use the monomorphism data to investigate the number of lethal sites, L_l, or more generally the full distribution of selection pressures across sites.

Since all of the methods we have proposed are defined at a per-site level, it is straightforward to assume that there are multiple different classes of sites with different values of γ. We can posit that there are k classes of sites. Each class is represented by L_j of L total sites and has its own value of γ, which we call γ_j. The probability that a site is in an (a, b, c, d) configuration is then

$$
P_{a,b,c,d} = \sum_{j=1}^{k}\frac{L_j}{L}\,P^{j}_{a,b,c,d}(\gamma_j,\theta_s), \tag{30}
$$

where $P^{j}_{a,b,c,d}(\gamma_j,\theta_s)$ is the probability that a site with parameters γ_j and θ_s is in the configuration (a, b, c, d). This expression is correct for both the per-site PRF and the diffusion approaches.

Given our new definition of $P_{a,b,c,d}$, we can construct the folded or unfolded likelihood of the overall polymorphism data set in exactly the same way as before. This likelihood function now depends on 2k + 1 parameters: θ_s, the γ_j, and the L_j. We can find ML estimates of all of these parameters, using a multidimensional numerical maximization of the likelihood function. By choosing k, we determine the resolution at which we measure the distribution of values of γ across sites. Naturally, the larger the k we choose, and hence the greater the resolution on γ, the more data we require to obtain accurate estimates of the individual L_j and γ_j.

Rather than estimating both the L_j and the γ_j, we could instead posit that there are several classes of mutations with different prespecified γ_j and estimate only the values of L_j. In other words, we ask what fraction of sites have different values of selective constraints. We describe here one particularly important example of a hybrid between these two procedures, with two classes of sites (k = 2). Rather than fitting an ML estimate of γ to both classes, we assume that one class of sites is unable to evolve: mutations at these sites are lethal (more precisely, they have |γ| ≫ 1). We wish to calculate the number of lethal sites and the average selective pressure on the remaining, nonlethal sites. Thus we have three parameters: the mutation rate θ_s, the number of lethal sites L_2, and the strength of selection γ acting at the other L_1 = L − L_2 sites. The probability that a site is monomorphic is given by

$$
P_{\mathrm{mono}} = \frac{L_2}{L} + \frac{L_1}{L}\,P^{1}_{\mathrm{mono}}. \tag{31}
$$

Here $P^{1}_{\mathrm{mono}}$ is the probability that a site with strength of selection γ and mutation rate θ_s will be monomorphic, as defined by either the per-site PRF or the per-site diffusion approach (whichever method we are using). The probability that a site is in a nonmonomorphic (a, b, c, d) configuration is

$$
P_{a,b,c,d} = \frac{L_1}{L}\,P^{1}_{a,b,c,d}. \tag{32}
$$

Now we can write the likelihood of the data in the usual way. This results in a three-dimensional ML problem. However, we can simplify the problem by first maximizing L_2 given γ and θ. We find that the ML estimate of L_2 is

$$
L_2 = \frac{L_{\mathrm{mono}} - L\,P^{1}_{\mathrm{mono}}}{1 - P^{1}_{\mathrm{mono}}}, \tag{33}
$$

where L_mono is the number of monomorphic sites in the data and $P^{1}_{\mathrm{mono}}$ is the probability a nonlethal (i.e., an L_1) site is monomorphic. Substituting this value for L_2, we are left with a two-dimensional maximization problem in γ and θ_s, similar to the original situation.
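A short worked example may help fix the meaning of Equations 31 and 33. The numbers below (L = 2000 sites, 1700 of them monomorphic, and a monomorphism probability of 0.7 at nonlethal sites) are invented for illustration, and the function names are hypothetical.

```python
def lethal_site_estimate(L, L_mono, p1_mono):
    """ML estimate of the number of lethal sites L2, Eq. 33."""
    return (L_mono - L * p1_mono) / (1 - p1_mono)

def mono_prob(L, L2, p1_mono):
    """Probability that a site is monomorphic under the two-class model, Eq. 31."""
    L1 = L - L2
    return L2 / L + (L1 / L) * p1_mono

if __name__ == "__main__":
    L, L_mono, p1_mono = 2000, 1700, 0.7  # illustrative values only
    L2 = lethal_site_estimate(L, L_mono, p1_mono)
    # Consistency check: the expected number of monomorphic sites, L * P_mono,
    # should reproduce the observed count L_mono.
    print(L2, L * mono_prob(L, L2, p1_mono))
```

With these inputs the estimate is L_2 = (1700 − 1400)/0.3 = 1000 lethal sites; the remaining 1000 sites contribute 700 expected monomorphic sites, which together with the lethal sites recovers the 1700 observed.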

It is worth exploring how this procedure for estimating the number of lethal sites utilizes the data. It turns out that this procedure is equivalent to ignoring the monomorphism data when finding the maximum-likelihood estimates of γ and θ_s for the nonlethal sites. The likelihood of the data ignoring monomorphic sites is

$$
\mathcal{L}(\theta_s,\gamma) = \frac{y_p!}{\prod y_{a,b,c,d}!}\,\prod\left[\frac{P^{1}_{a,b,c,d}}{1-P^{1}_{\mathrm{mono}}}\right]^{y_{a,b,c,d}}, \tag{34}
$$

where y_p is the total number of nonmonomorphic sites and the products are over all configurations of nonmonomorphic sites. After finding ML estimates of γ and θ from this monomorphism-ignoring likelihood function, the procedure then calculates the number of monomorphic sites that would be expected given γ and θ_s. This is $L_1P^{1}_{\mathrm{mono}} = (L - L_2)P^{1}_{\mathrm{mono}}$. We then estimate the number of lethal sites L_2 as the difference between the observed number of monomorphic sites L_mono and the number that would be predicted if all sites were of the L_1 variety, $L_2 = L_{\mathrm{mono}} - (L - L_2)P^{1}_{\mathrm{mono}}$. Rearranging this expression, we see that it is identical to Equation 33 above. And indeed, plugging Equations 31–33 into Equation 30 yields Equation 34.

Thus, this procedure ignores the monomorphism data when calculating γ and θ_s (at the nonlethal sites), and it instead uses the monomorphism data to infer one aspect of the distribution of γ across sites: specifically, the number of lethal sites. Since the original PRF method also ignores monomorphism data, we obtain this information on the distribution of γ for "free," relative to the power of the original method, simply by shifting to the per-site model. If desired, we can also posit that there are a number of sites L_3 that are effectively neutral (i.e., with |γ| ≪ 1) and estimate L_3. This would devote some part of the data describing polymorphisms at intermediate frequencies to estimating L_3. From this procedure we could estimate the number of effectively lethal sites, the number of effectively neutral ones, and the "weighted average" selection pressure acting on the remaining sites. If more resolution is desired, and enough data are available, we can increase the number of classes of sites and obtain ML estimates of the numbers of sites in each class and the selection pressure acting on each class.

METHODS

Simulations and fits: We used Wright–Fisher simulations to test the inferential accuracy of the PRF method as well as the accuracy of our two alternative methods. The Wright–Fisher model (or, more precisely, its diffusion limit) forms the basis of the PRF method, and it is therefore the appropriate simulation framework for testing the method.

All simulations assumed a constant population of N = 1000 haploid individuals. Each of L = 1000 sites, simulated independently, could assume one of four states: a, c, t, or g. One state is assigned fitness 1, and the other three states fitness 1 + s (where s < 0). Mutations occurred at rate μ per site. The allele frequencies evolved according to the standard Wright–Fisher Markov chain (Ewens 2004). Each simulation was run for at least 10/μ generations, to ensure relaxation to steady state. At the end of the simulation, n = 14 individuals were sampled from the population and the polymorphism frequency spectrum was recorded. We chose to consider samples of size n = 14 to facilitate comparison with Hartl et al. (1994). This choice does affect the relative accuracy of the techniques; the finite-sites biases in the PRF method increase with n (see Results). We have focused our fits on the case of folded frequency spectra, which will typically be the only reliable type of polymorphism data available when θ_s is large.
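The simulation scheme described above can be sketched for a single site as follows. This is a minimal illustration rather than the program used for the results: the order of events within a generation (selection, then mutation, then multinomial resampling), the monomorphic starting state, and sampling without replacement are all assumptions of the sketch, as are the function names.

```python
import random

def wright_fisher_step(counts, s, mu, rng):
    """Advance one site by one generation; counts[0] is the preferred state."""
    N = sum(counts)
    fitness = [1.0, 1.0 + s, 1.0 + s, 1.0 + s]
    weighted = [c * w for c, w in zip(counts, fitness)]
    total = sum(weighted)
    freq = [w / total for w in weighted]  # frequencies after selection
    # Each nucleotide mutates at rate mu, going to each other state at rate mu/3.
    mutated = [f * (1 - mu) + (1 - f) * mu / 3 for f in freq]
    # Multinomial resampling of N offspring (random genetic drift).
    offspring = rng.choices(range(4), weights=mutated, k=N)
    return [offspring.count(state) for state in range(4)]

def simulate_site(N, s, mu, generations, rng):
    """Run one site from a monomorphic preferred state (assumed start)."""
    counts = [N, 0, 0, 0]
    for _ in range(generations):
        counts = wright_fisher_step(counts, s, mu, rng)
    return counts

def sample_configuration(counts, n, rng):
    """Sample n individuals without replacement; return (n0, n1, n2, n3)."""
    population = [state for state, c in enumerate(counts) for _ in range(c)]
    drawn = rng.sample(population, n)
    return tuple(drawn.count(state) for state in range(4))

if __name__ == "__main__":
    rng = random.Random(0)
    # Small illustrative run; the text uses N = 1000 and at least 10/mu generations.
    final = simulate_site(N=200, s=-0.01, mu=0.001, generations=500, rng=rng)
    print(final, sample_configuration(final, 14, rng))
```

A full replicate would repeat this for L independent sites and tabulate the resulting configurations into a folded or unfolded frequency spectrum.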

We performed simulations over a wide range of parameter values relevant to viruses and microbes. We considered five different values of θ_s: 0.05, 0.1, 0.5, 1.0, 5.0. For each value of θ_s, we performed one simulation at each of 17 different values of γ, ranging from γ = −10.0 to γ = −0.1. For each set of simulation parameters (γ, θ_s), once the simulated folded polymorphism data had been generated, ML parameter estimates (γ̂, θ̂_s) were obtained by numerical maximization of the likelihood function, as specified by the original PRF model, the per-site PRF model, or the three-dimensional diffusion model. This maximization can be difficult in the diffusion case (see below). A 95% confidence interval for γ̂ was constructed according to Bustamante et al. (2001): the interval includes those values of γ within $0.5\,\chi^2_{1,0.95}$ log-likelihood units from γ̂ (note these confidence intervals rely on the assumption of no linkage; see Discussion). The estimated parameters shown in Figure 1 are somewhat "jagged," because the inference methods have been applied to a single draw of n = 14 sequences for each set of simulation parameters, as opposed to averaging over many such draws. Figure 1 also shows fits obtained from inference techniques applied to data generated under the infinite-sites model at θ_s = 0.01 (i.e., deviates drawn from a Poisson distribution with mean given by Equation 1).
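The confidence-interval rule can be illustrated on a grid of γ values. The sketch below assumes that the log-likelihood has already been evaluated at each grid value of γ (for example, maximized over θ_s at each γ), that the set of accepted values is contiguous, and that the grid is fine enough for the purpose; the function names are hypothetical. The χ² quantile with one degree of freedom is obtained as the square of a standard normal quantile, so that 0.5 χ²_{1,0.95} ≈ 1.92.

```python
from statistics import NormalDist

def chi2_1_quantile(p):
    """Quantile of the chi-square distribution with 1 degree of freedom."""
    return NormalDist().inv_cdf((1 + p) / 2) ** 2

def likelihood_interval(gammas, loglik, level=0.95):
    """Return (low, estimate, high) on a grid of gamma values.

    The interval contains the grid values whose log-likelihood lies within
    0.5 * chi2_{1,level} of the maximum.
    """
    best_ll = max(loglik)
    cutoff = best_ll - 0.5 * chi2_1_quantile(level)
    estimate = gammas[loglik.index(best_ll)]
    inside = [g for g, ll in zip(gammas, loglik) if ll >= cutoff]
    return min(inside), estimate, max(inside)

if __name__ == "__main__":
    # Toy quadratic log-likelihood centered at gamma = -2 (illustrative only).
    grid = [-5 + 0.01 * k for k in range(601)]
    ll = [-(g + 2.0) ** 2 / 0.5 for g in grid]
    print(likelihood_interval(grid, ll))
```

If the interval reaches the edge of the grid, the corresponding bound is not resolved by the grid and should be reported as missing rather than as the grid limit.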

As discussed above, the original PRF model disallows multiple mutant lineages at a site. Therefore, when a site sampled from the simulated data exhibited more than two types of segregating nucleotides, the frequencies of all unpreferred nucleotides were summed to represent the frequency of the "mutant" type for the purposes of fitting using the original PRF model, as suggested by Hartl et al. (1994). When fitting folded data using the PRF method, the most common nucleotide was assumed to be the ancestral type, as suggested by Hartl et al. (1994). This approach is not entirely accurate, as discussed above, but it is probably the best option available within the original PRF framework.

Numerical maximization procedure: In practice, the ML estimation procedure for the diffusion model is difficult to implement, because of the triple integral in the definition of $P_{n_0,n_1,n_2,n_3}$ (as well as that implicit in the definition of the normalization constant C). This integral cannot be solved exactly, and it is difficult to evaluate numerically because the integrand may diverge (though the integral itself converges) near the boundary of the simplex over which it is integrated. We adopt a hybrid method to simplify the evaluation of this triple integral. Near the boundary of the simplex, we Taylor expand the integrand and integrate it analytically. Away from the boundary, the integrand is well behaved and standard numerical integration has no difficulties.

This approach is most easily achieved by making the substitutions y = x_3/(1 − x_1 − x_2) and z = x_2/(1 − x_1), and writing x for x_1. On doing so, we can rewrite the integral as

$$
\begin{aligned}
P_{n_0,n_1,n_2,n_3} = {} & C\,\frac{n!}{n_0!\,n_1!\,n_2!\,n_3!}\int_0^1\!\int_0^1\!\int_0^1 \exp\left[2\gamma(x+y+z-xy-zy-xz+xyz)\right]\,x^{n_1+\zeta}z^{n_2+\zeta}y^{n_3+\zeta}\\
& \times (1-x)^{n_0+n_2+n_3+3\zeta+2}\,(1-z)^{n_0+n_3+2\zeta+1}\,(1-y)^{n_0+\zeta}\,dy\,dz\,dx.
\end{aligned} \tag{35}
$$

This expression is much easier to handle than our original expression, because the three integrals can be done in arbitrary order. We divide each of the three integrals into three pieces: one from 0 to δ, one from δ to 1 − δ, and one from 1 − δ to 1. Thus the triple integral is split into 27 total terms. For each of the integrals from 0 to δ or 1 − δ to 1, we Taylor expand the integrand in the integration variable, and we solve the integral analytically. All of the remaining integrals, from δ to 1 − δ, are done numerically. We must choose δ large enough that we can perform the numerical integrals quickly, but not so large that the Taylor expansions used for the analytical parts become invalid. For these Taylor expansions, we need δ ≪ 1/(2γ), δ ≪ 1, δ ≪ 1/(3ζ + 2), δ ≪ 1/(2ζ + 1), and δ ≪ 1/ζ. For the computational analysis described in this article, we choose whichever of these conditions is most restrictive and set δ to be one-tenth of the most restrictive requirement. We find that this choice of δ is sufficiently small to provide accuracy in the analytical parts of the integrals, but large enough to enable quick numerical integration on the interior of the simplex.
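The change of variables behind Equation 35 can be verified pointwise: the simplex integrand of Equation 26, multiplied by the Jacobian (1 − x)²(1 − z) of the substitution, should equal the cube integrand of Equation 35 at every interior point. The sketch below performs this check. It covers only the substitution, not the Taylor-expansion treatment near the boundary, and the function names are hypothetical.

```python
import math

def to_simplex(x, z, y):
    """Map cube coordinates (x, z, y) to simplex coordinates (x1, x2, x3)."""
    x1 = x
    x2 = z * (1 - x1)
    x3 = y * (1 - x1 - x2)
    return x1, x2, x3

def jacobian(x, z):
    """Jacobian factor of the substitution: dx1 dx2 dx3 = (1-x)^2 (1-z) dx dz dy."""
    return (1 - x) ** 2 * (1 - z)

def simplex_integrand(x1, x2, x3, ns, zeta, gamma):
    """Integrand of Eq. 26 (without the constant prefactor)."""
    n0, n1, n2, n3 = ns
    x0 = 1 - x1 - x2 - x3
    return (math.exp(2 * gamma * (x1 + x2 + x3))
            * x1 ** (zeta + n1) * x2 ** (zeta + n2) * x3 ** (zeta + n3)
            * x0 ** (zeta + n0))

def cube_integrand(x, z, y, ns, zeta, gamma):
    """Integrand of Eq. 35 (without the constant prefactor)."""
    n0, n1, n2, n3 = ns
    exponent = x + y + z - x * y - z * y - x * z + x * y * z
    return (math.exp(2 * gamma * exponent)
            * x ** (n1 + zeta) * z ** (n2 + zeta) * y ** (n3 + zeta)
            * (1 - x) ** (n0 + n2 + n3 + 3 * zeta + 2)
            * (1 - z) ** (n0 + n3 + 2 * zeta + 1)
            * (1 - y) ** (n0 + zeta))

if __name__ == "__main__":
    ns, zeta, gamma = (11, 2, 1, 0), 0.5 / 3 - 1, -2.0  # illustrative values only
    x, z, y = 0.1, 0.2, 0.3
    lhs = simplex_integrand(*to_simplex(x, z, y), ns, zeta, gamma) * jacobian(x, z)
    print(lhs, cube_integrand(x, z, y, ns, zeta, gamma))
```

The substitution maps the simplex onto the unit cube, which is what allows the three integrals to be split into pieces and evaluated in any order.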

We have written a computer program implementing this numerical integration and the resulting ML estimation for the diffusion model. This program and the simpler programs for ML estimation using the one-dimensional diffusion model and the per-site PRF are freely available on request.

RESULTS

Using data from the Wright–Fisher simulations described above, we tested the inferential accuracy of the original PRF method, our per-site PRF method, and the three-dimensional diffusion method. Figure 1 compares the accuracy of selection pressures estimated from folded data using each of these three methods. For large θ_s, the original PRF method systematically underestimates the strength of negative selection, by as much as a factor of 10. In addition, the PRF method often erroneously infers strong positive selection when in fact mutants are under negative selection. These problems are more severe when the mutation rate is large. The smallest mutation rate at which such problems occur (θ_s = 0.05) is four times smaller than the mutation rate estimated for bacterial genes (Hartl et al. 1994). The per-site version of the PRF method that we have developed improves upon the standard PRF method, but it too exhibits systematic biases, especially when selection is weak and the mutation rate large (Figure 1). The one-dimensional diffusion method provides accurate and unbiased estimates of γ over the full range of selective pressures and mutation rates (not shown). Like its one-dimensional counterpart, the three-dimensional diffusion method also provides accurate and unbiased estimates over the full range of simulated parameters (Figure 1). When selection is weak (i.e., |γ| < 1), however, the confidence intervals on diffusion-based estimates of γ are appreciably larger. This behavior makes perfect sense: when selection is nearly neutral and the ancestral state is unknown, the frequency distribution does not exhibit sufficient skew to deduce the preferred nucleotide. As a result, the diffusion-based estimator cannot distinguish between weak positive and weak negative selection in the absence of information on the preferred nucleotide. Thus, the confidence intervals obtained under the folded diffusion technique properly reflect our inability to estimate the selection pressure precisely when selection is weak.

As shown in Figure 1, when selection is weakly negative, the original PRF method erroneously infers positive selection. This problem occurs in the parameter regimes that have been estimated from biological data from bacterial and viral populations. For example, on the basis of n = 14 sampled sequences each 367 sites long, Hartl et al. (1994) estimated γ = −1.34 and θ_s = 0.183 for silent sites in a bacterial gene. If we simulate 367 Wright–Fisher sites under these parameters and sample n = 14 sequences, we find that the most likely parameters fitted using the original PRF method are (γ̂, θ̂_s) = (+18.45, 0.067). That is, in this example using estimated microbial parameter values, the PRF method is strongly biased.

Figure 2 shows the accuracy of estimated mutation rates using the original PRF method, the per-site PRF method, and the three-dimensional diffusion method. The original PRF method systematically underestimates the mutation rate. Estimates obtained using the per-site PRF method are an improvement, but still exhibit some biases. The diffusion-based method provides accurate and unbiased estimates of the mutation rate, across the full range of mutation rates and selection pressures (Figure 2).

In addition to the simulations described above, we also fitted data that had been generated under the infinite-sites model at a small mutation rate, θ_s = 0.01 (Figure 1). In this case, we generated a sampled polymorphism frequency spectrum using Poisson deviates with mean given by Equation 1, for n = 14 samples of L = 10,000 sites. When applied to such data, inferences based on the original PRF method must be unbiased. Figure 1 shows that inferences based on the modified per-site PRF method and the diffusion method are also accurate and unbiased when applied to such data. Thus, the new methods we have developed here perform as well as the original PRF method under the assumption of infinite sites and at small mutation rates.

The methods we have developed in this article also allow us to estimate the distribution of selection pressures across sites. In one simple case discussed above, we have presented a procedure for estimating the number of lethal sites and the selective pressure operating on the remaining, nonlethal sites in a gene. This procedure involves estimating γ and θ_s on the basis of polymorphic sites alone and thereafter estimating the proportion of observed monomorphic sites that are lethal. To assess the power and accuracy of this approach, Table 1 shows the error in the predicted number of monomorphic sites in each of our simulations, compared to the number of monomorphic sites actually observed. Across a large range of selective pressures and mutation rates, this approach typically estimates the number of (nonlethal) monomorphic sites within a few percent. As a result, for a gene of length L = 2000 sites, at one-half of which mutations are strongly deleterious (|s| ≫ 1/N), our procedure will accurately predict the number of lethal sites within a few percent, and it will accurately predict the selection pressure on the remaining, non-strongly deleterious sites.

Figure 1. Maximum-likelihood estimates of selection pressures obtained under the PRF method, the modified (per-site) PRF method, and the three-dimensional diffusion method. Selection pressures were estimated from the folded polymorphism frequencies among n = 14 sequences sampled from a simulated Wright–Fisher population. In each panel, the simulated selection pressure γ is shown on the x-axis and the estimated selection pressure γ̂ on the y-axis. Dashed lines indicate 95% confidence intervals around the ML estimates obtained from the χ²-distribution (see Methods). The line γ̂ = γ is shown in red, and the line γ̂ = 0 is shown in gray. The diffusion method and, to a lesser extent, the modified PRF method correct the biases inherent in the original PRF method. When selection is weakly negative, the diffusion method cannot reject positive selection on the basis of folded data, as indicated by the lack of the upper confidence interval for some of the fits. The simulated data in the top panels, θ = 0.01, were generated according to the infinite-sites model (see Methods).

All of the simulation results that we have presented thus far have been for the case n = 14. This choice affects the relative accuracy of the methods. As n increases, the probability of seeing a high-frequency polymorphism within the sample depends more strongly on the probability that a site has a high-frequency polymorphism in the population, f(x) for large x. This is precisely the part of the frequency distribution that is artificially suppressed by the infinite-sites assumption of the PRF method. Thus, as the sample size increases the biases in the PRF method should become more severe. We have tested this expectation by sampling n = 100 simulated Wright–Fisher sequences. In the case of θ_s = 0.1, for example, when fitting unfolded polymorphism frequency spectra measured relative to the (known) ancestral state at each site, the PRF method always rejects the true value of γ in favor of a weaker or even positive selection coefficient, throughout the range γ ∈ [−10, −0.5]. Inferential biases associated with large sample sizes and mutation rates will become increasingly important as the availability of sequence data improves, for example in metagenomic surveys of microbial populations.

DISCUSSION

The Poisson random field model (Sawyer and Hartl 1992) and the associated likelihood procedure for estimating parameters (Hartl et al. 1994) are perfectly valid when the assumptions underlying the method are met: namely, infinite sites, free recombination, and constant selective pressure across sites. Moreover, the assumption of infinite sites is perfectly reasonable for most eukaryotic populations. It will hold whenever Nμ ln[1/|s|] ≪ 1; as we have seen, a rough rule of thumb is that this tends to be true when θ_s < 0.05. We have shown, however, that when the mutation rate is relatively large the PRF method can lead to incorrect inferences. In practice, such concerns arise when studying populations of viruses, microbes, and those eukaryotes that experience large mutation rates (θ_s > 0.05). We have developed two new methods that relax or remove the infinite-sites assumption. These new methods also extend the types of inferences that can be drawn from polymorphism data to include inferences on the distribution of selective pressures across sites.

Figure 2. Maximum-likelihood estimates of the mutation rate, θ_s = 2Nμ, obtained under the PRF method (blue), the modified (per-site) PRF method (green), and the diffusion methods (red). Mutation rates were estimated from the folded polymorphism frequencies among n = 14 sequences sampled from a simulated Wright–Fisher population. The simulated mutation rate θ_s is shown on the x-axis and the estimated mutation rate θ̂_s on the y-axis. The line θ̂_s = θ_s is shown in black. For each value of θ_s, simulations and fits are shown for 17 different values of γ, ranging from γ = −10.0 to γ = −0.1. The PRF method systematically underestimates the mutation rate, especially when selection is weak. The diffusion method provides accurate and unbiased estimates of the mutation rate across the full range of parameters.

TABLE 1

Accuracy of the estimated number of monomorphic sites

Actual θ_s    Average error (%)    Median error (%)
0.05          7.6                  7.8
0.1           3.3                  1.8
0.5           2.3                  1.1
1.0           1.4                  0.5
5.0           0.1                  0.1

Accuracy of estimates for the expected number of monomorphic sites. We calculated maximum-likelihood estimates of γ and θ_s using the one-dimensional diffusion method applied to unfolded simulated data excluding monomorphic sites. From these values, we calculated the expected number of monomorphic sites. Shown are the differences between the observed and the expected number of monomorphic sites (in a simulated gene of length L = 1000 sites). For each value of θ_s, we show both the average and the median differences in numbers of monomorphic sites, across simulations with γ ranging from −0.1 to −10. These results imply that we can accurately estimate the number of sites that are monomorphic due to drift. Thus, if a gene contains a similar or larger number of lethal sites, we can also estimate the number of such lethal sites to within the above accuracy.

It may seem surprising that the infinite-sites approximation can lead to substantial errors in the mutation rates and selective strengths inferred by the PRF method. After all, sites at which multiple mutant lineages are sampled are presumably very rare (e.g., Hartl et al. 1994). Yet despite their rarity, these sites have a large impact on maximum-likelihood estimation of γ. Because γ enters the likelihood function in the factor e^{2γx}, changing γ has a much larger impact on the likelihood of sites with many mutant nucleotides than those with few. In other words, a single site with a high frequency of mutant nucleotides is very strong evidence for positive selection or low |γ|, whereas a site with a low frequency of mutant nucleotides is not very strong evidence for the opposite. Thus even a few sites at which multiple mutant lineages are sampled can cause large inaccuracies in the inferred γ when they are incorrectly assumed by the original PRF method to be the result of a single high-frequency mutant lineage. These inaccuracies in γ then force corresponding inaccuracies in the inferred θ_s.
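The asymmetry described above can be illustrated numerically. The following is a minimal sketch that looks only at the factor e^{2γx} in isolation and ignores every other term of the likelihood; the function name `factor_ratio` and the chosen values of γ and x are illustrative assumptions, not quantities taken from our analysis.

```python
import math

def factor_ratio(gamma_a, gamma_b, x):
    """Ratio of the factor exp(2*gamma*x) evaluated at gamma_a versus gamma_b.

    Only this single factor is considered; the remaining terms of the
    likelihood are deliberately left out of this illustration.
    """
    return math.exp(2.0 * (gamma_a - gamma_b) * x)

# Compare weak selection (gamma = -1) against strong selection (gamma = -5)
# for a low-frequency and a high-frequency mutant class (illustrative values).
for x in (0.1, 0.9):
    print(f"x = {x}: ratio = {factor_ratio(-1.0, -5.0, x):.1f}")
```

With these illustrative values, the factor favors the weaker selection coefficient by only about 2-fold at x = 0.1 but by more than 1000-fold at x = 0.9, which is why a handful of apparently high-frequency sites can dominate the estimate.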

Previous simulation studies have not observed these problems with the PRF method (Bustamante et al. 2001). However, such simulations themselves implicitly assumed infinite sites, and hence they cannot be used to test this aspect of the PRF method (to be fair, these simulations were done with typical eukaryotic populations in mind, where these problems are much less severe). In this article, by contrast, we have simulated a finite number of sites that evolve according to the Wright–Fisher model.

The role of linkage: While our focus has been on the infinite-sites approximation, both the original PRF and all of our methods make another key assumption: that each site is independent of all the others (i.e., free recombination). This is essentially an assumption of linkage equilibrium between all polymorphic sites. This potentially crucial assumption is likely to be violated in many real populations, particularly when the method is applied to estimate selective pressure on short stretches of DNA (e.g., a single gene) that are linked over long timescales. This is problematic for both the original PRF and our methods, but it may be particularly relevant in populations in which finite-sites issues are important, because recombination is presumably less common in bacteria than in eukaryotes and because higher θ_s implies a higher density of polymorphic sites, and hence tighter linkage. Ideally, we would address this assumption using theory that allows us to infer the strength of selection acting on a large number of sites with an arbitrary degree of linkage. However, this is a formidable challenge, and no such theory yet exists.

Despite this, models that assume free recombination are still useful for several reasons. First, when selection pressures are weak, sites segregate over long timescales, so recombination may be frequent enough even among intragenic sites (or in bacterial genomes) that segregating sites are unlinked over these timescales. Since both PRF and diffusion-type methods are primarily useful at estimating selection pressures when they are of order 1/N, the selection pressures our methods can resolve get smaller as θ_s increases. Thus while populations with large θ_s may tend to have lower recombination rates among polymorphic sites, the polymorphic sites we are sensitive to segregate over longer timescales, and recombination has longer to act. Whether this means that linkage can be neglected in any particular population is an empirical question, but fortunately it is usually straightforward to estimate linkage disequilibria directly from the data. Thus in any particular case we can estimate whether or not the assumption of free recombination is valid for the purposes of the original PRF or our modified methods, before applying either approach. When there is no linkage disequilibrium in the data, we can use our methods with confidence. When on the other hand it appears from the data that free recombination is not a reasonable assumption, a model that assumes no linkage may still be useful as a null model and a limiting case, and it can be compared with more complicated possibilities.
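One simple way to carry out such a check is sketched below. This is an illustration rather than a procedure we have applied: it assumes phased haploid sequences in which each biallelic site is coded 0 (one allele) or 1 (the other), and the function name `r_squared` is hypothetical. It computes the standard pairwise statistic r² = D²/[p_A(1 − p_A) p_B(1 − p_B)], where D = p_AB − p_A p_B.

```python
def r_squared(site_a, site_b):
    """Pairwise linkage disequilibrium r^2 between two biallelic sites.

    site_a, site_b: equal-length sequences of 0/1 allele codes, one entry
    per sampled haploid sequence (an assumed encoding for this sketch).
    """
    if len(site_a) != len(site_b) or len(site_a) == 0:
        raise ValueError("sites must be non-empty and of equal length")
    n = len(site_a)
    p_a = sum(site_a) / n                      # frequency of allele 1 at site A
    p_b = sum(site_b) / n                      # frequency of allele 1 at site B
    p_ab = sum(1 for a, b in zip(site_a, site_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                       # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic in the sample")
    return d * d / denom

# Two perfectly associated sites and two unassociated sites.
print(r_squared([0, 0, 1, 1], [0, 0, 1, 1]))
print(r_squared([0, 0, 1, 1], [0, 1, 0, 1]))
```

Values of r² near zero across most pairs of polymorphic sites would be consistent with the free-recombination assumption; with small samples, r² is biased upward, so any threshold should be chosen with the sample size in mind.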

Finally, it is important to note that linkage between segregating sites will not bias estimates of the selection pressure γ, provided γ is equal across sites (Akashi and Schaeffer 1997; Bustamante et al. 2001). Rather, the primary effect of linkage is to increase the variance in such estimates above the predictions of our method (Bustamante et al. 2001). Thus while the predictions of our models (or the original PRF) will tend to have smaller confidence intervals than they should, the estimated values of γ and θ_s should not be systematically biased. It is important to note, however, that when linkage is important we must be cautious about testing hypotheses using these methods; underestimates of the confidence intervals could, for example, lead us to erroneously reject a neutral null. Further, when selection pressures vary significantly across sites, the mean parameter estimates themselves could be biased; this more complex situation remains an important topic for future exploration.

Comparison between the PRF and diffusion methods: Our per-site version of the PRF model relaxes some aspects of the infinite-sites approximation, but it still assumes the mutant lineages do not interact. The diffusion approach removes the infinite-sites approximation altogether. Thus, the diffusion approach contains none of the biases that the infinite-sites approximation introduces into the traditional PRF and, to a lesser degree, the per-site PRF.

The diffusion method is also easily extendable to more complex evolutionary situations. For example, we can explore different selective costs for different nucleotides, or more than one preferred nucleotide. These possibilities lead to obvious modifications of the diffusion equations and their solutions and hence to the maximum-likelihood estimation. Mutational biases are more difficult to explore, as no simple steady-state solution to the three-dimensional diffusion equation exists for biased mutation rates, but computational results can still be obtained. It is also straightforward to investigate balancing selection or the effects of dominance: these lead to well-understood modifications to the diffusion equations and their steady-state solutions (Ewens 2004). In the PRF framework, by contrast, such generalizations are much more complex. In particular, balancing selection is impossible to analyze within the PRF framework, because it leads to mutant lineages reaching stable intermediate frequencies in the population. As a result, the generation of new mutations is not balanced by the extinction or fixation of older ones, and hence no steady-state distribution of lineage frequencies exists.


The methods we have developed also allow us to relax the assumption of constant γ across sites and instead infer aspects of the distribution of selection pressures (Piganeau and Eyre-Walker 2003, Nielsen et al. 2005, and Boyko et al. 2008 have recently analyzed similar extensions of the original PRF method). It is not yet clear how much data is required to provide adequate power for inferring this distribution to a given resolution. However, we can hope to gain a great deal of insight with only a few additional parameters: say, the number of sites that are neutral, lethal, and negatively selected and the weighted average selection pressure on the latter class. As we have shown, we can estimate the number of lethal sites with no reduction in power relative to the original PRF method, so this proposal would involve only one additional parameter. Additional classes of sites would involve more parameters. Eyre-Walker and Keightley (2007) have recently stressed that the distribution of fitness effects of deleterious mutations is likely to be complex and multimodal. Using our approach, we can choose how to focus the power in the data to investigate the aspects of this complex distribution that are most interesting in a given situation. The appropriate choice of resolution will depend on the context and quality of the data.

However, the diffusion approach is not without drawbacks. The one-dimensional version does not make use of all of the data available in the observed polymorphism spectrum and is likely to be of little use in practice because we typically do not know whether or not the ancestral nucleotide was preferred. The three-dimensional version uses all the data in the folded case, but cannot make use of outgroup data (the unfolded case). In practice, when we do not have an outgroup, the three-dimensional diffusion method provides the most accurate estimates of selection pressures across the full range of parameters. When an outgroup is available, the three-dimensional diffusion approach is still sometimes more accurate than the PRF, despite wasting information on the ancestral state, but only in parameter regimes typical of bacteria or viruses. In other cases the per-site PRF method is best.

The diffusion method also cannot naturally handle positive selection. The steady-state evolutionary dynamics at a site are always dominated by the preferred nucleotide (or nucleotides), with negative selection acting against polymorphisms for the disfavored nucleotides. At the level of an individual site, positive selection is a process that is intrinsically out of steady state: the spread of a favorable nucleotide before it becomes fixed. The PRF method handles positive selection by implicitly positing rather strange dynamics at individual sites. All mutant lineages are assumed to be positively selected; so if a mutant nucleotide fixes at a given site, mutations back to the ancestral nucleotide are again assumed to be positively selected (the analogous strange dynamics also apply to negative selection). While this assumption makes little sense at a per-site level, it allows the PRF model to obtain a steady state across sites, provided positive selection is ongoing and not saturated. We could modify our diffusion methods to mimic the PRF treatment of positive selection by changing the boundary conditions in our diffusion equations. Specifically, we would assume that probability flowing into x = 1 (i.e., fixation of a mutant nucleotide) is absorbed and moved to x = 0 (i.e., "reset" so that new mutations will again be favored). This diffusion equation can be solved exactly, and the solution used as a basis for inferring positive selection using the per-site diffusion methods we have developed.

Ideally, however, we want to infer positive selection in the context of a realistic and well-defined model of the dynamics at individual sites. Such an approach would necessarily involve full time-dependent solutions to the diffusion equations; Mustonen and Lassig (2007) suggest a method along these lines. We do not pursue this approach here, but see Chen et al. (2007) for a step in this direction. Regardless of the methodology, it will always be difficult to discriminate positive selection from negative selection on the basis of the polymorphism frequency spectrum alone, particularly when only folded data are available. Whether selection is positive or negative, mutant lineages drift nearly neutrally when their frequency is between 0 and 1/|γ|. Positively selected lineages then fix relatively quickly once their frequency becomes substantially larger than 1/|γ|, while negatively selected lineages rarely ever reach frequencies larger than 1/|γ|. Thus from the point of view of the polymorphism frequency spectrum, positive selection is similar to random drift on [0, 1/|γ|], with the upper bound a roughly absorbing boundary condition. Negative selection, on the other hand, is also similar to random drift on [0, 1/|γ|], but with the upper bound a roughly reflecting boundary condition. Although this is relatively crude (selection does in fact have some impact on low-frequency lineages, and the boundary conditions are not exactly absorbing or reflecting), it indicates that the polymorphism frequency spectrum is roughly similar for negative and positive selection at the same |γ|. Thus power to distinguish positive from negative selection based on the polymorphism frequency spectrum, especially with folded data, will always be relatively limited, regardless of the method used.

Different steady states: The PRF framework and the diffusion methods are based on fundamentally different steady states. The PRF assumes an ancestral state; the steady-state frequency distribution is that of derived nucleotides relative to this ancestral state. By contrast, the diffusion methods assume mutation back and forth between all possible nucleotides at each site. The steady-state frequency distribution is a full mutation-selection-drift balance of all four nucleotides at the site. It has "forgotten" the ancestral state, and it is obtained over a longer timescale than the PRF steady state.


The differences in the nature of the two steady states are most readily apparent when considering their behavior for small θ_s. This is easiest to see by comparing the PRF steady state to the one-dimensional diffusion result (assuming as usual that one nucleotide is preferred and the other three are unpreferred). For small θ_s, the population will typically be fixed for one of the four nucleotides. Occasionally mutations will occur, making the site temporarily polymorphic within the population. Usually these mutant lineages will drift at low frequency for a short while before going extinct, but occasionally they will fix and the population will be fixed for this new nucleotide. The PRF steady state describes the (transient) frequency spectrum of these occasional mutant lineages before they fix or go extinct. This spectrum is always measured relative to the ancestral nucleotide, which is the last mutation that fixed. On the other hand, the diffusion method describes the longer-term steady-state frequency distribution of the nucleotides. This includes some weight near x = 0 (the preferred nucleotide is fixed, or nearly so), as well as some weight near x = 1 (one of the unpreferred nucleotides is fixed, or nearly so). Because of the selective bias, the population will be fixed or nearly fixed for the preferred nucleotide a fraction 1/(1 + 3e^{2γ}) of the time (Ewens 2004), so 1/(1 + 3e^{2γ}) of the total weight is in the part of the distribution near x = 0. The remainder of the time, the population will be fixed or nearly fixed for one of the unpreferred nucleotides; this is the weight in the distribution near x = 1.

The discussion above helps to elucidate the relationship between the steady states in the PRF model vs. those in the diffusion model. A fraction 1/(1 + 3e^{2γ}) of the time the most recent fixation will have been for the preferred state. This is the ancestral state for the PRF method, and the frequency spectrum of the occasional unpreferred negatively selected mutants is given by the PRF steady state. The remainder of the time the most recent fixation will have been for one of the unpreferred states. Now this is the ancestral state for the PRF method, and the frequency spectrum of occasional mutants for the preferred nucleotide, which are under positive selection, is again given by the PRF steady state, but with the opposite sign of γ. Thus we expect that the diffusion method steady state is just the sum of these two PRF steady states, weighted by the frequency with which each type of nucleotide is the ancestral state, namely

$$f_{\mathrm{diff}}(\gamma, x) \approx p(\gamma)\, f_{\mathrm{PRF}}(\gamma, x) + [1 - p(\gamma)]\, f_{\mathrm{PRF}}(-\gamma, 1 - x),$$

where p(γ) denotes the chance that the most recent fixation was for the preferred state. This relationship between the two steady states is illustrated in Figure 3. It holds in the limit of small θ_s. For larger θ_s, the population is not always nearly fixed for one state. Mutant lineages occur more frequently and can segregate simultaneously (i.e., finite-sites effects). Since the diffusion approach accounts for this while the PRF does not, the relationship between the steady states breaks down.
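The small-θ_s relationship above can be written as a short numerical sketch. The weight p(γ) = 1/(1 + 3e^{2γ}) is the fraction quoted in the text. The function `prf_density` is an assumption made for illustration: it uses the standard Sawyer–Hartl form of the PRF density, θ(1 − e^{−2γ(1−x)})/[(1 − e^{−2γ}) x(1 − x)], whose normalization and scaling of γ may differ from the expressions used elsewhere in this article. All function names are hypothetical.

```python
import math

def p_preferred(gamma):
    """Fraction of time the most recent fixation was for the preferred state."""
    return 1.0 / (1.0 + 3.0 * math.exp(2.0 * gamma))

def prf_density(gamma, x, theta=1.0):
    """Assumed Sawyer-Hartl PRF density of mutant frequency x, for 0 < x < 1."""
    if gamma == 0.0:
        return theta / x  # neutral limit of the expression below
    num = -math.expm1(-2.0 * gamma * (1.0 - x))   # 1 - exp(-2*gamma*(1 - x))
    den = -math.expm1(-2.0 * gamma) * x * (1.0 - x)
    return theta * num / den

def diffusion_mixture(gamma, x, f_prf=prf_density):
    """Small-theta approximation: weighted sum of the two PRF steady states."""
    p = p_preferred(gamma)
    return p * f_prf(gamma, x) + (1.0 - p) * f_prf(-gamma, 1.0 - x)

# Illustrative evaluation at gamma = -0.5 for a few frequencies x.
for x in (0.05, 0.5, 0.95):
    print(x, diffusion_mixture(-0.5, x))
```

Passing a different callable as `f_prf` lets the same mixture be evaluated with whichever PRF spectrum is appropriate; the sketch makes no attempt to capture the finite-sites effects that break the relationship at larger θ_s.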

The difference between the diffusion and PRF steady states points to one additional bias in the PRF method that we have not yet discussed. Imagine a stretch of sequence that has been under constant purifying selection for an extremely long time. As we have just seen, we expect that at a fraction 3e^{2γ}/(1 + 3e^{2γ}) of the sites in this sequence, the most recent ancestor will be an unpreferred state. The PRF frequency spectrum at these sites will be characteristic of positive selection (mutations to the preferred nucleotides are beneficial). Although this will typically represent a minority of the sites, the PRF method weights sites under positive selection more strongly than those under negative selection when inferring a weighted average γ. Thus, the PRF method will be biased toward inferring positive selection, despite the fact that the sequence has experienced negative selection for an arbitrarily long time. This effect is present even for small θ_s, and it is most severe when selection is weak but nonzero.

Figure 3. Relationship between the PRF and diffusion frequency spectra for small θ_s. Shown at the top is the one-dimensional diffusion frequency spectrum for θ_s = 10^{−3}, γ = −0.5. Most of the time the population is fixed or nearly fixed for the preferred nucleotide (this is the weight near x = 0), while sometimes the population is fixed or nearly fixed for an unpreferred nucleotide (the weight near x = 1). When the population is nearly fixed for the preferred (unpreferred) nucleotide, mutations produce a PRF site frequency spectrum (SFS) as shown on the left (right) of the bottom graph. For this small value of θ_s, the diffusion frequency spectrum is roughly equal to the sum of these PRF frequency spectra.

Whether this effect represents a bias in the PRF method is a matter of taste. In some sense, the PRF is correctly identifying sites under positive selection: after all, the mutations to the preferred nucleotides are in fact beneficial. In another sense, however, the sequence has experienced constant purifying selection throughout its past. The fact that occasionally an unpreferred nucleotide fixes before reverting to the preferred state does not reflect any change in selection pressures. Thus, if we define positive selection as a pressure to innovate, or a shift in the selective landscape, then such a reversion does not reflect positive selection. We have explored this effect through simulation and found that in microbial parameter regimes it is a detectable but minor effect and is much less important in practice than the other biases we have studied. However, it would be straightforward to modify either the original or the per-site PRF method to correct for this (the diffusion methods already account for this effect). We simply change the expected spectrum to reflect the fact that a fraction 1/(1 + 3e^{2γ}) of sites will exhibit the spectrum characteristic of negative selection while a fraction 3e^{2γ}/(1 + 3e^{2γ}) will exhibit that characteristic of positive selection (with equal but opposite γ). This leads to a simple modification of the PRF spectrum, with no added parameters, which we can use for ML estimation as before.

Different timescales: Because the PRF frequency spectrum is measured relative to the last fixed state, it reaches steady state in at most of order N generations, the coalescent timescale. The diffusion method takes longer to reach steady state: in the neutral case, mutations fix every 1/μ generations, so it takes of order 1/μ generations to reach steady state. When selection is operating, the diffusion steady state can arise either more quickly or more slowly, depending on the initial state of the population. Note that the ratio of the PRF to the diffusion timescale is of order θ_s. Thus for the small θ_s relevant for most eukaryotic populations, there is a big gap between the two timescales, while for the larger θ_s relevant to bacteria and viruses the two timescales will be more similar.

These two timescales determine whether the presence of information from an outgroup is likely to improve our inferential power. Imagine we sample polymorphism from some population A and one sequence from an outgroup species. If the outgroup diverged from population A less than N generations ago, population A has not reached the PRF steady state in the time since the divergence. Thus the frequency spectrum of population A is not independent from the frequency spectrum in the outgroup. The outgroup does not provide much more information than simply sampling another individual from population A (though it does provide some additional information). This situation has recently been analyzed in the neutral case by Chen et al. (2007) and applied to human polymorphism data using a Neanderthal outgroup.

If instead the outgroup diverged from population A more than N but less than 1/μ generations ago, it can be used to infer the most recent ancestral state. Since the time of this ancestor, the PRF frequency spectrum will have reached equilibrium, and it can be used to estimate parameters accounting for this knowledge (though finite-sites issues can still matter when θ_s is large, and diffusion methods may be more accurate despite the information from the outgroup). But the closer the time since the outgroup diverged is to 1/μ generations, the more unreliable the inference of the ancestral state becomes, and ancestral state misidentification becomes more of a problem. This problem and possible ways to account for it have recently been examined by Hernandez et al. (2007).

If the outgroup diverged more than 1/μ generations ago, the population will have forgotten the ancestral state since divergence, and the outgroup will not be useful. Of course, in this case we would probably not even be able to align the outgroup sequence to the sequences from population A. In this case, only the folded frequency data are available, and the diffusion methods are most appropriate.
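The three outgroup regimes just described can be summarized in a small sketch. The thresholds N and 1/μ are order-of-magnitude guides rather than sharp cutoffs, and the function names and regime labels below are hypothetical, introduced only for illustration. The timescale ratio uses θ_s = 2Nμ, so that N/(1/μ) = θ_s/2.

```python
def timescale_ratio(pop_size, mu):
    """Ratio of the PRF timescale (~N) to the diffusion timescale (~1/mu)."""
    return pop_size * mu  # equals theta_s / 2 when theta_s = 2 * N * mu

def outgroup_regime(t_div, pop_size, mu):
    """Classify an outgroup by its divergence time t_div (in generations).

    The boundaries are rough: the text treats N and 1/mu as orders of
    magnitude. If pop_size >= 1/mu the intermediate window is empty.
    """
    if t_div < pop_size:
        return "too_recent"                  # PRF steady state not yet reached
    if t_div < 1.0 / mu:
        return "ancestral_state_inferable"   # unfolded data can be used
    return "ancestral_state_forgotten"       # only folded data are informative

# Illustrative values: N = 1e4 and mu = 1e-8 per site per generation.
for t in (1e3, 1e6, 1e9):
    print(t, outgroup_regime(t, pop_size=1e4, mu=1e-8))
```

For large θ_s the window between N and 1/μ narrows, which is one way to see why outgroup information becomes less useful in microbial and viral parameter regimes.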

We thank Daniel Fisher, Marc Feldman, and Warren Ewens for helpful comments on the manuscript. M.M.D. acknowledges support from Center grant P50GM071508 from the National Institute of General Medical Science to the Lewis–Sigler Institute. J.B.P. acknowledges support from the Burroughs Wellcome Fund, the James S. McDonnell Foundation, and the Defense Advanced Research Projects Agency Fun-Bio Program (HR0011-05-1-0057).

LITERATURE CITED

Akashi, H., 1995 Inferring weak selection from patterns of polymorphism and divergence at "silent" sites in Drosophila DNA. Genetics 139: 1067–1076.

Akashi, H., 1999 Inferring the fitness effects of DNA mutations from polymorphism and divergence data: statistical power to detect directional selection under stationarity and free recombination. Genetics 151: 221–238.

Akashi, H., and S. W. Schaeffer, 1997 Natural selection and the frequency distributions of "silent" DNA polymorphism in Drosophila. Genetics 146: 295–307.

Bartolome, C., X. Maside, S. Yi, A. L. Grant and B. Charlesworth, 2005 Patterns of selection on synonymous and nonsynonymous variants in Drosophila miranda. Genetics 169: 1495–1507.

Boyko, A. R., S. H. Williamson, A. R. Indap, J. D. Degenhardt, R. D. Hernandez et al., 2008 Assessing the evolutionary impact of amino acid mutations in the human genome. PLoS Genet. 4: e1000083.

Bustamante, C. D., J. Wakeley, S. Sawyer and D. L. Hartl, 2001 Directional selection and the site-frequency spectrum. Genetics 159: 1779–1788.

Bustamante, C. D., R. Nielsen, S. Sawyer, K. M. Olsen, M. D. Purugganan et al., 2002 The cost of inbreeding in Arabidopsis. Nature 416: 531–534.

Bustamante, C. D., R. Nielsen and D. L. Hartl, 2003 Maximum likelihood and Bayesian methods for estimating the distribution of selective effects among classes of mutations using DNA polymorphism data. Theor. Popul. Biol. 63: 91–103.

Bustamante, C. D., A. Fledel-Alon, S. Williamson, R. Nielsen, M. T. Hubisz et al., 2005 Natural selection on protein-coding genes in the human genome. Nature 437: 1153–1157.

Chen, H., R. E. Green, S. Paabo and M. Slatkin, 2007 The joint allele-frequency spectrum in closely related species. Genetics 177: 387–398.

Desai, M. M., and D. S. Fisher, 2007 Beneficial mutation-selection balance and the effect of linkage on positive selection. Genetics 176: 1759–1798.

Drake, J. W., B. Charlesworth, W. Charlesworth and J. F. Crow, 1998 Rates of spontaneous mutation. Genetics 148: 1667–1686.

Ewens, W. J., 2004 Mathematical Population Genetics: I. Theoretical Introduction. Springer, New York.

Eyre-Walker, A., and P. D. Keightley, 2007 The distribution of fitness effects of new mutations. Nat. Rev. Genet. 8: 610–618.

Galtier, N., E. Bazin and N. Bierne, 2006 Gc-biased segregation of noncoding polymorphisms in Drosophila. Genetics 172: 221–228.

Hartl, D. L., E. N. Moriyama and S. A. Sawyer, 1994 Selection intensity for codon bias. Genetics 138: 227–234.

Hernandez, R. D., S. H. Williamson and C. D. Bustamante, 2007 Context dependence, ancestral misidentification, and spurious signatures of natural selection. Mol. Biol. Evol. 24: 1792–1800.

Lercher, M. J., N. G. C. Smith, A. Eyre-Walker and L. Hurst, 2002 The evolution of isochores: evidence from SNP frequency distributions. Genetics 162: 1805–1810.

Loewe, L., B. Charlesworth, C. Bartolome and V. Noel, 2006 Estimating selection on nonsynonymous mutations. Genetics 172: 1079–1092.

Lynch, M., and J. S. Conery, 2003 The origins of genome complexity. Science 21: 1401–1404.

McDonald, J., and M. Kreitman, 1991 Adaptive protein evolution at the adh locus in Drosophila. Nature 351: 652–654.

Moran, P. A. P., 1959 The survival of a mutant gene under selection. ii. J. Aust. Math. Soc. 1: 485–491.

Mustonen, V., and M. Lassig, 2007 Adaptations to fluctuating selection in Drosophila. Proc. Natl. Acad. Sci. USA 104: 2277–2282.

Nachman, M. W., 1998 Deleterious mutations in animal mitochondrial DNA. Genetica 102/103: 61–69.

Nielsen, R., C. D. Bustamante, A. G. Clark, S. Glanowski, T. B. Sackton et al., 2005 A scan for positively selected genes in the genomes of humans and chimpanzees. PLoS Biol. 3: 976–985.

Piganeau, G., and A. Eyre-Walker, 2003 Estimating the distribution of fitness effects from DNA sequence data: implications for the molecular clock. Proc. Natl. Acad. Sci. USA 100: 10335–10340.

Rand, D. M., and L. M. Kann, 1998 Mutation and selection at silent and replacement sites in the evolution of animal mitochondrial DNA. Genetica 102/103: 393–407.

Sawyer, S., R. J. Kulathinal, C. D. Bustamante and D. L. Hartl, 2003 Bayesian analysis suggests that most amino acid replacements in Drosophila are driven by positive selection. J. Mol. Evol. 57: S154–S164.

Sawyer, S. A., and D. L. Hartl, 1992 Population-genetics of polymorphism and divergence. Genetics 132: 1161–1176.

Wakeley, J., 2003 Polymorphism and divergence for island-model species. Genetics 163: 411–420.

Watterson, G. A., 1977 Heterosis or neutrality? Genetics 85: 789–814.

Weinreich, D. M., and D. M. Rand, 2000 Contrasting patterns of nonneutral evolution in proteins encoded in nuclear and mitochondrial genomes. Genetics 156: 385–399.

Williamson, S., A. Fledel-Alon and C. D. Bustamante, 2004 Population genetics of polymorphism and divergence for diploid selection models with arbitrary dominance. Genetics 168: 468–475.

Williamson, S. H., R. Hernandez, A. Fledel-Alon, L. Zhu, R. Nielsen et al., 2005 Simultaneous inference of selection and population growth from patterns of variation in the human genome. Proc. Natl. Acad. Sci. USA 102: 7882–7887.

Wright, S., 1938 The distribution of gene frequencies under irreversible mutation. Proc. Natl. Acad. Sci. USA 24: 253–259.

Wright, S., 1949 Adaptation and selection, pp. 365–389 in Genetics, Paleontology, and Evolution, edited by G. L. Jepson, G. G. Simpson and E. Mayr. Princeton University Press, Princeton, NJ.

Zhu, L., and C. D. Bustamante, 2005 A composite-likelihood approach for detecting directional selection from DNA sequence data. Genetics 170: 1411–1421.

Communicating editor: M. K. Uyenoyama

APPENDIX: ALGORITHM TO COMPUTE THE LIKELIHOOD EXPRESSIONS UNDER THE PER-SITE PRF

To implement an inference procedure based on our per-site version of the PRF model we need to compute likelihood expressions for any possible unfolded configuration of nucleotides, an example of which is shown in Equation 10.

Our algorithm starts by initializing the expressions associated with each configuration to zero. We loop over all partitions of the sample size n written as the sum of nonzero integers, n = n_1 + n_2 + ... + n_z + n_w. In this partition, numbers n_1 through n_z represent the number of samples of distinct mutant lineages, and n_w represents the number of samples with the ancestral state. The sum of all such sample numbers equals the total sample size, n. Such a sample will correspond to a particular unfolded configuration (a, b, c, d) that depends upon the identity of the mutant nucleotides associated with the z mutant lineages. There are 3^z distinct possible assignments of mutant nucleotide types to the mutant lineages. We loop over each of these possible assignments, and for each we add a term of the form

$$\left(\frac{1}{3}\right)^{z} \frac{F(1)^{n_1}}{n_1!} \times \frac{F(2)^{n_2}}{n_2!} \times \cdots \times \frac{F(n_z)^{n_z}}{n_z!}$$

to the likelihood expression for the unfolded configuration (a, b, c, d) that arises from the assignment of nucleotide types. After looping over all possible numbers of mutant lineages and over all possible assignments of mutant nucleotide types to such lineages, we obtain complete expressions for the likelihoods of all unfolded configurations.

To compute likelihood expressions for folded configurations, we perform the exact same algorithm as above with a small modification. In this case, once we have assigned nucleotide types to each of the z mutant lineages, we determine the associated folded configuration by simply sorting the number of lineages (including the wild type) of each mutant nucleotide type in descending order.
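The enumeration described in this Appendix can be sketched as follows. This is an illustrative reading of the loop structure only, not our released implementation: the per-partition weight is supplied by the caller through a hypothetical callable `term(lineage_counts)`, so the sketch makes no assumption about the form of the term itself. It also assumes that mutant lineage counts are enumerated as unordered partitions and, following the text literally, that n_w is at least 1. The names `partitions` and `accumulate_terms` are hypothetical.

```python
from collections import defaultdict
from itertools import product

def partitions(total, max_part=None):
    """Yield partitions of `total` as non-increasing tuples of positive ints."""
    if max_part is None or max_part > total:
        max_part = total
    if total == 0:
        yield ()
        return
    for first in range(max_part, 0, -1):
        for rest in partitions(total - first, first):
            yield (first,) + rest

def accumulate_terms(n, term, folded=False):
    """Sum `term` over lineage partitions and nucleotide-type assignments.

    Returns a dict mapping each configuration (a, b, c, d) to its summed
    weight, where a is the ancestral count and b, c, d are the counts of
    the three mutant nucleotide types (sorted descending if folded).
    """
    totals = defaultdict(float)              # every expression starts at zero
    for n_w in range(1, n + 1):              # samples carrying the ancestral state
        for lineage_counts in partitions(n - n_w):
            z = len(lineage_counts)          # number of distinct mutant lineages
            # 3^z assignments of mutant nucleotide types to the z lineages
            for types in product(range(3), repeat=z):
                mutant = [0, 0, 0]
                for count, nucleotide in zip(lineage_counts, types):
                    mutant[nucleotide] += count
                config = (n_w, *mutant)
                if folded:
                    config = tuple(sorted(config, reverse=True))
                totals[config] += term(lineage_counts)
    return dict(totals)

# Counting contributions only (term = 1) for a sample of size n = 2.
print(accumulate_terms(2, lambda counts: 1.0))
print(accumulate_terms(2, lambda counts: 1.0, folded=True))
```

Separating the enumeration from the weight keeps the combinatorics testable on its own; the number of partitions grows quickly with n, so for larger samples the results would normally be computed once and cached.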

A computer implementation of this algorithm, for either folded or unfolded likelihood expressions, is freely available upon request.
