The ‘hitch-hiking’ effect April 4, 2016

Hitch hiking journalclub


Motivation

• You want to sequence some individuals and identify loci “subject to selection”, in some general, vague sense.

• To make any progress in terms of theory, you first have to formalize the question.

Define the question

• What is the effect of natural selection at “site A” on the change in allele frequency of “site B”?

[Diagram: site A (“Selected site”) and site B (“Neutral locus”), with r between them.]

In the absence of selection

$E[\Delta x] = 0$ (“HWE”)

$V[\Delta x] = \frac{x(1-x)}{2N}$ (“More drift in small populations”)

$E[H] = \frac{\theta}{1+\theta}$; $\theta = 4N_e\mu$ (“More variability in large populations”)

(See any introductory evolution/population genetics text)
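The two neutral expectations above are easy to evaluate directly. The following is a minimal sketch, not taken from the slides: the function names and the parameter values are chosen here purely for illustration.

```python
def drift_variance(x, N):
    """One-generation variance in allele frequency change, x(1 - x) / (2N)."""
    return x * (1.0 - x) / (2.0 * N)


def expected_heterozygosity(N_e, mu):
    """Equilibrium E[H] = theta / (1 + theta), with theta = 4 * N_e * mu."""
    theta = 4.0 * N_e * mu
    return theta / (1.0 + theta)


# Illustrative (hypothetical) values: same allele frequency, two population sizes.
small_pop = drift_variance(0.5, 100)
large_pop = drift_variance(0.5, 10_000)

# Same mutation rate, two effective population sizes.
low_diversity = expected_heterozygosity(1_000, 1e-6)
high_diversity = expected_heterozygosity(1_000_000, 1e-6)
```

With these inputs, the smaller population has the larger drift variance and the larger population has the higher expected heterozygosity, which is what the two slide annotations say in words.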

In the absence of selection

$E[\pi] = \sum_{i=1}^{n-1} 2\,\frac{i}{n}\,\frac{n-i}{n-1}\,f(i) = \theta$

Expected mean # differences b/w all pairs of sequences in a sample of size n. Tajima (1983) Genetics

$\hat{\theta} = \frac{S}{\sum_{i=1}^{n-1} 1/i}$

Watterson (1975) Theoretical Pop’n Biology

Expected # of mutations in sample of size n: $E[S] = \theta \sum_{i=1}^{n-1} \frac{1}{i}$

$f(i) = \frac{\theta}{i}$; $1 \le i < n$

Expected # of mutations where derived state occurs i times in a sample of size n. Tajima (1983), but see Hudson (2015) PLoS One for way easier derivation.

[Figure: plot over i = 1 to 5, vertical axis from 0 to 10; n = 6; θ = 10.]

Figure from Hudson, 1990 “Gene genealogies and the coalescent process”
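As a quick check on the neutral formulas for f(i), E[S], E[π] and Watterson's estimator, the sketch below evaluates them for the n = 6, θ = 10 case shown in the figure. The function names are illustrative and not from any library.

```python
def expected_sfs(n, theta):
    """f(i) = theta / i for i = 1, ..., n - 1."""
    return [theta / i for i in range(1, n)]


def expected_segregating_sites(n, theta):
    """E[S] = theta * sum_{i=1}^{n-1} 1/i."""
    return theta * sum(1.0 / i for i in range(1, n))


def expected_pi(n, theta):
    """Weight each class i by the chance that two sampled sequences differ there."""
    return sum(2.0 * i * (n - i) / (n * (n - 1)) * theta / i for i in range(1, n))


def watterson_theta(S, n):
    """Watterson's estimator: S divided by the harmonic sum."""
    return S / sum(1.0 / i for i in range(1, n))


sfs = expected_sfs(6, 10.0)              # counts for i = 1..5
S_expected = expected_segregating_sites(6, 10.0)
pi_expected = expected_pi(6, 10.0)
```

Both `expected_pi` and `watterson_theta(S_expected, 6)` recover θ under neutrality, which is why departures between the two estimators are informative about selection.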

How does selection change these predictions?

• “Classic sweep” - new mutation, beneficial upon origin. Genotype fitnesses are 1, 1+sh, 1+2s.

• “Soft sweep” - neutral or deleterious variant, becomes beneficial later. “Selection on standing variation.”

• Polygenic trait. This is quadratic selection based on deviations from an optimum. Will not cover in detail. I will make qualitative comments.
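To make the 1, 1+sh, 1+2s fitness scheme of the classic sweep concrete, here is a rough sketch of the deterministic change in frequency of the beneficial allele. It assumes random mating and an infinite population, neither of which is stated on the slide, and the function names and parameter values are hypothetical.

```python
def next_frequency(x, s, h):
    """Frequency of the beneficial allele after one generation of selection.

    Genotype fitnesses are 1 (aa), 1 + s*h (Aa), 1 + 2*s (AA);
    genotype frequencies are assumed to be in Hardy-Weinberg proportions.
    """
    w_het = 1.0 + s * h
    w_hom = 1.0 + 2.0 * s
    mean_w = (1.0 - x) ** 2 + 2.0 * x * (1.0 - x) * w_het + x * x * w_hom
    return (x * x * w_hom + x * (1.0 - x) * w_het) / mean_w


def generations_between(x0, x1, s, h, max_generations=10**6):
    """Count generations for the deterministic recursion to go from x0 to x1."""
    x, t = x0, 0
    while x < x1 and t < max_generations:
        x = next_frequency(x, s, h)
        t += 1
    return t


# Illustrative values: N = 10^4, additive effects (h = 1).
N = 10**4
slow = generations_between(1 / (2 * N), 1 - 1 / (2 * N), 0.01, 1.0)
fast = generations_between(1 / (2 * N), 1 - 1 / (2 * N), 0.05, 1.0)
```

Stronger selection shortens the sweep, which matters later because the duration of the sweep sets how much time linked sites have to recombine away.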


Classic sweeps: intuition

[Figure: two panels, r = 0 and r > 0.]

Heterozygosity (= diversity) is reduced at neutral locus due to “hitch-hiking”. Magnitude of effect will depend on ‘s’ and ‘r’.

Hitch-hiking effect of a gene, p. 29

Fig. 2. $4Q_\infty(1 - Q_\infty)$ is the final amount of heterozygosity at a locus, when initial frequencies of a, A are 0.5. The graph here, with $N = 10^6$ and $s = 0.01$, is calculated from (8). [Horizontal axis runs from -0.008 to 0.008.]

heterozygosity remained, the gene frequencies would return towards their equilibrium frequencies. But we are assuming a neutral polymorphism, for which (26) states that the expected heterozygosity at some future time will be a fixed fraction of what it is now.

Since, in (28), L depends on $R_0$, we can calculate the mean value of L given any initial distribution of frequencies at the polymorphic locus. If the initial distribution of frequencies is uniform over [0,1], let the initial frequencies of A, a at the polymorphic locus be 1 - x, x. The probability that, when B arises it is linked to a is x, and the resulting value of L is $100sx/\log(1/p_0)$, (taking a = 2). Hence

$$E(L) = 2\int_0^1 \frac{100\,s\,x}{\log(1/p_0)}\,x\,dx = \frac{200\,s}{3\log(1/p_0)}. \quad (29)$$

(29) is the expected equivalent length of chromosome, in map units, made homozygous by the substitution of a single favourable mutation.

Thus the effect is greatest if there is a large selective advantage s per locus, and if the population is small. For a selective advantage of 0.1, and a population of size $10^6$, the equivalent length of chromosome made homozygous would be 0.48 map units.

From Maynard-Smith & Haigh (1974)

(Remember this when reading Kim and Stephan)
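The closing number in the Maynard Smith & Haigh excerpt can be reproduced from equation (29). The sketch below assumes that log is the natural logarithm and that the initial frequency of the favourable mutation is p0 = 1/N. Neither is stated in the excerpt, but this choice recovers the quoted 0.48 map units.

```python
import math


def expected_homozygous_length(s, p0):
    """Equation (29): E(L) = 200 s / (3 ln(1 / p0)), in map units."""
    return 200.0 * s / (3.0 * math.log(1.0 / p0))


def expected_length_by_quadrature(s, p0, steps=100_000):
    """Midpoint-rule check of 2 * integral_0^1 [100 s x / ln(1/p0)] x dx."""
    width = 1.0 / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * width
        total += 100.0 * s * x / math.log(1.0 / p0) * x * width
    return 2.0 * total


N = 10**6            # population size from the excerpt
p0 = 1.0 / N         # assumption: one copy of the new mutation among N genes
print(round(expected_homozygous_length(0.1, p0), 2))  # 0.48
```

Because s enters linearly and the population size only through a logarithm, the selection coefficient dominates the width of the affected region.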

Quantifying the process

• Need trajectory of beneficial mutation from frequency 1/(2N) to 1-1/(2N).

• Need a way to simulate a coalescent on top of that.

• This is the “structured coalescent”, introduced by Dick Hudson and Norm Kaplan

• See recent Perspective by Barton in Genetics: http://www.genetics.org/content/202/3/865

Trajectories

240 STEPHAN, WIEHE, AND LENZ

where

$$t_\varepsilon = \inf\{t : X(t) = \varepsilon\},$$

$$t_{1-\varepsilon} = X^{-1}(1 - \varepsilon),$$

and x(t) satisfies the differential equation

$$\frac{dx(t)}{dt} = s\,x(t)\,(1 - x(t)), \qquad x(t_\varepsilon) = \varepsilon. \quad (2)$$

The solution of this differential equation is

$$x(t) = \frac{\varepsilon}{\varepsilon + (1 - \varepsilon)\,e^{-s(t - t_\varepsilon)}}. \quad (3a)$$

It is convenient to introduce a new variable $\tau = t - t_\varepsilon$. The time it takes for the X process to go from $\varepsilon$ to $1 - \varepsilon$ is

$$f = -2\ln(\varepsilon)/s. \quad (3b)$$

In order to describe the effect of the selected mutation on a linked neutral locus, Ohta and Kimura (1975) divide the population into two parts. One part consists of chromosomes carrying the advantageous mutation B, another one the disadvantageous allele b. Let $p_1$ be the frequency of allele A among chromosomes carrying the favorable mutation B, and $p_2$ the frequency of allele A among b-chromosomes. Note that these variables are different from the usual state space variables of two-locus, two-allele models. Furthermore, let $\phi(p_1, p_2, \tau)$ be the joint probability density function of $p_1$ and $p_2$ at time $\tau > 0$. Our goal is to compute the expectations of the frequencies $p_1$ and $p_2$ and their second-order moments $p_1^2$, $p_1 p_2$, and $p_2^2$ at time $\tau$ with respect to $\phi$. For an arbitrary polynomial, f, of $p_1$ and $p_2$ we define these expectations as

[equation not recovered]

Because the frequency of allele A can be expressed by $p_1$ and $p_2$ as

[equation not recovered] (5)

this may allow us to calculate the expected heterozygosity at time t. The general approach is to write down a set of ordinary differential equations


Deterministic. From Stephan et al. (1992, TPB)
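The deterministic trajectory of equations (2), (3a) and (3b) can be sketched in a few lines. Setting ε = 1/(2N) matches the “1/(2N) to 1-1/(2N)” framing on the earlier slide, but it is an assumption made here; the excerpt leaves ε as a generic small number. Function names are illustrative.

```python
import math


def logistic_frequency(tau, s, eps):
    """Equation (3a), written in terms of tau = t - t_eps."""
    return eps / (eps + (1.0 - eps) * math.exp(-s * tau))


def sojourn_time_exact(s, eps):
    """Time for the logistic curve to go from eps to 1 - eps."""
    return 2.0 * math.log((1.0 - eps) / eps) / s


def sojourn_time_approx(s, eps):
    """Equation (3b): -2 ln(eps) / s, valid when eps is small."""
    return -2.0 * math.log(eps) / s


# Illustrative (hypothetical) parameter values.
N, s = 10**4, 0.01
eps = 1.0 / (2 * N)
t_exact = sojourn_time_exact(s, eps)
t_approx = sojourn_time_approx(s, eps)
```

The exact and approximate sojourn times differ only by the factor ln(1 - ε), which is negligible for small ε.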

2315 SELECTION ON STANDING VARIATION

one of two absorbing states: zero (i.e., loss of A from the population) and one (i.e., fixation). The conditional diffusion process has the same $\sigma^2(x)$ as the usual diffusion, but the infinitesimal mean includes an additional term, which effectively gives the appropriate push toward the boundary on which we have conditioned.

Our approach also relies on the reversibility of the diffusion process (cf. Griffiths 2003). Specifically, we use the fact that the diffusion process looking backward in time from the present (i.e., toward the introduction of the allele) has the same distribution as a process forward in time conditional on absorption at zero. This conditional process $X^*_N(t)$ is the same as $X_N(t)$ but with $\mu_N(x)$ replaced by $\mu^*_N(x) = -x$ (Ewens 2004). Likewise, because we are only interested in beneficial alleles that eventually reach fixation, we consider the diffusion process conditional on the selected allele reaching a frequency of one. This conditional process $X^+_S(t)$ has an infinitesimal mean $\mu^+_S(x) = 2Nsx(1 - x)/\tanh(2Nsx)$ (Ewens 2004).

To generate a trajectory for allele A, we use a variable-sized jump random walk to approximate to the diffusion process. Given a current frequency x, at time intervals $\Delta t$, the frequency x jumps to either:

$$x \to x + \mu(x)\Delta t - \sqrt{x(1 - x)\Delta t} \quad \text{or} \quad (2a)$$

$$x \to x + \mu(x)\Delta t + \sqrt{x(1 - x)\Delta t} \quad (2b)$$

with equal probability. The term $\mu(x)$ is replaced by the conditional infinitesimal mean of the phase in question (i.e., neutral or selective). This process has the correct diffusion limit, that is, the correct infinitesimal mean and variance are obtained and all higher moments are zero, as the time interval $\Delta t \to 0$ (Karlin and Taylor 1981). Hence, for small $\Delta t$, it provides a good approximation to the diffusion process. We verified this for our choice of $\Delta t = 1/(4N)$ by comparison to analytical expectations and to alternative methods of simulation (results not shown).

There are two steps in our implementation: (1) simulation of the trajectory of a neutral allele from frequency f to loss, with $\mu(x) = \mu^*_N(x)$ in the jumps described above; by the reversibility property described above, we can flip this trajectory to model the allele A from introduction to frequency f; and (2) simulation of the trajectory of a selected allele from f to fixation, with $\mu(x) = \mu^+_S(x)$ in the jumps. We then concatenate the results of (1) and (2) to obtain one trajectory from introduction to fixation.

Our approach ensures that A is initially selected when at frequency f, without assuming that A is selected when it first reaches frequency f (which would be unrealistic). Moreover, it is computationally efficient, because we only generate trajectories where A eventually fixes in the population.

RESULTS

We are interested in contrasting two models: the standard selective sweep, in which an allele is favored from introduction to fixation, and a model of directional selection on standing variation. For the latter, we consider the following scenario: a neutral allele A arises and drifts in the population until time $t_s$, when it becomes favored; it eventually reaches fixation in the population at time T (see Methods for details). The frequency of allele A at time $t_s$, f, is the salient parameter in the comparison.

To characterize the effects of these two models on genetic variation, we simulate samples from a linked, neutrally evolving region using a structured coalescent approach. Specifically, we generate a trajectory of allele A from introduction to fixation, then condition on this particular realization of the genealogical process to generate an ancestral recombination graph for our sample (Fig. 1). The trajectory of allele A is modeled stochastically, using a new approach (see Methods). Under the standard sweep model, $f = 1/(2N)$, while under the model of directional selection on standing variation, $f \gg 1/(2N)$.

Effect of f on Diversity Levels

Irrespective of the value of f, mean diversity levels are most distorted near the selected site and tend toward their neutral expectation with increasing genetic distance. This is illustrated in Figure 2A, using parameters that may be applicable to humans (e.g., Frisse et al. 2001). We present three summaries of diversity: $\theta_W$ (Watterson 1975), $\theta_H$ (Fay and Wu 2000), and $\pi$ (Tajima 1989). Under a neutral equilibrium model, these statistics provide an unbiased estimate of $\theta$, the population mutation rate ($\theta = 4N\mu$, where $\mu$ is the mutation rate per generation per base pair). For these parameters, a standard sweep leads to a reduction in the mean levels of variation throughout the 100-kb region (relative to the neutral expectation of $\theta = 0.001$ per base pair).

A very similar picture is expected so long as $f \le 1/(2Ns)$ and selection is strong (Stephan et al. 1992). As an example, in Figure 2A the expected levels of variation are indistinguishable for $f = 1/(2N) = 5 \times 10^{-5}$ and $f = 1/(2Ns) = 10^{-3}$. As f increases, the substitution of a favored allele has a weaker effect on diversity at linked neutral sites (Innan and Kim 2004); for $f = 0.20$, the effect is hardly detectable. If the effective population size, N, is larger, the difference between the standard sweep and a model of directional selection on standing variation is more readily apparent. For instance, using parameters that may be realistic for D. melanogaster (e.g., Andolfatto and Przeworski 2000), a model with $f = 0.05$ leads to a very slight reduction in diversity levels relative to the expectation at neutral equilibrium (see Fig. 2B).

Effect of f on Allele Frequencies

This finding might suggest that directional selection on standing variation behaves like the standard sweep, but with a weaker footprint. This turns out not to be true. Figure 3 plots $\pi$ and $\theta_W$ as a function of the distance from the selected site for four simulated datasets generated under models of directional selection where $f = 0.05$ and where $f = 1/(2N)$, as well as under the neutral equilibrium. As can be seen, some cases of selection on standing variation resemble a standard sweep with a weaker footprint (example 3); others look like the neutral equilibrium case, with high diversity very close to the selected site (example 4); and yet others look unlike either the standard sweep or neutrality, with valleys of low diversity in some segments and neutral equilibrium levels in others (examples 1, 2).

Moreover, while the fixation of a new beneficial mutation

Stochastic. From Przeworski et al. (2005); originally in Coop & Griffiths (2004) TPB, but kinda hard to dig out there.

TL;DR - the stochastic approach is preferred. The deterministic one over-estimates time to fixation by approximately two-fold.
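The two-step recipe in the Przeworski et al. excerpt (a neutral phase simulated backward to loss and flipped, then a selected phase conditioned on fixation) can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' code. Time is taken to be in units of 2N generations, which is how Δt = 1/(4N) and the drift terms are read here, and clamping frequencies to [0, 1] is a safeguard added for this sketch. All function names are hypothetical.

```python
import math
import random


def mu_neutral_loss(x):
    """Drift of a neutral allele conditioned on eventual loss."""
    return -x


def mu_selected_fixation(x, alpha):
    """Drift of a selected allele conditioned on fixation; alpha = 2Ns."""
    return alpha * x * (1.0 - x) / math.tanh(alpha * x)


def jump_walk(x0, mu, dt, lower, upper, rng):
    """Apply jumps (2a)/(2b) until the frequency leaves (lower, upper)."""
    path = [x0]
    x = x0
    while lower < x < upper:
        jump = math.sqrt(x * (1.0 - x) * dt)
        sign = 1.0 if rng.random() < 0.5 else -1.0
        x = min(1.0, max(0.0, x + mu(x) * dt + sign * jump))
        path.append(x)
    return path


def standing_variation_trajectory(N, s, f, seed=1):
    """Concatenate a flipped neutral phase with a selected phase starting at f."""
    rng = random.Random(seed)
    dt = 1.0 / (4.0 * N)
    edge = 1.0 / (2.0 * N)
    # Step (1): neutral allele from f to loss, later reversed in time.
    neutral = jump_walk(f, mu_neutral_loss, dt, edge, 1.0, rng)
    # Step (2): selected allele from f to (near) fixation.
    alpha = 2.0 * N * s
    selected = jump_walk(f, lambda x: mu_selected_fixation(x, alpha),
                         dt, 0.0, 1.0 - edge, rng)
    return neutral[::-1] + selected[1:]


# Illustrative (hypothetical) parameters, small N so the example runs quickly.
trajectory = standing_variation_trajectory(N=1000, s=0.05, f=0.05)
```

Because both phases are conditioned on their end points, every simulated path is usable, which is the computational advantage the excerpt points out.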

Kaplan et al.

890 N. L. Kaplan, R. R. Hudson and C. H. Langley

selectively neutral mutations occur) is completely linked. In this case it is appropriate to think of the region as a single locus.

In view of (4), the distribution of S follows directly from the distribution of T, and so it suffices to study the distributional properties of T. For example,

$$E(S) = L\mu E(T)$$

and

$$\mathrm{Var}(S) = L\mu E(T) + (L\mu)^2\,\mathrm{Var}(T).$$

The formula for E(S) holds even if there is recombination between the L nucleotide sites. This is not the case, however, for Var(S) (HUDSON 1983).

Assuming an isolated selectively neutral locus, WATTERSON (1975) showed that T (measured in 2N generations) can be represented as

$$T = \sum_{j=2}^{n} j\,Y(j), \quad (5)$$

where the {Y(j)} are independent random variables, and for large N the distribution of Y(j) is approximately negative exponential with parameter $j(j - 1)/2$, $2 \le j \le n$. The time to the most recent common ancestor of the sample, $T_0$, can also be represented as

$$T_0 = \sum_{j=2}^{n} Y(j). \quad (6)$$

The occurrence of a selected substitution at some time in the past can affect the distribution of T and consequently the distribution of S. Our goal is to quantify this effect, and to do this we will use the results of HUDSON and KAPLAN (1988) on the coalescent process for a sample of genes at a selectively neutral locus that is linked to a locus at which selection is operating.

Each ancestor of each sampled gene (referred to as an ancestral gene) is linked to either a B allele or a b allele. We therefore define Q(t) = (i, j) if, in the t ancestral generation, t > 0, i of the ancestral genes of the sample are linked to a B allele and j to a b allele, $1 \le i + j \le n$. Since the B allele is fixed in the population at the time of sampling, it is necessarily the case that Q(0) = (n, 0). In Figure 1 a possible realization of the Q process for a sample of size 4 is described.

The Q process is a jump process and because the number of ancestral genes cannot increase, this process eventually reaches either of the two states (0, 1) or (1, 0), i.e., there is a single ancestor of the sample and it is linked to either a b allele or a B allele. The ancestral generation in which this first occurs, $T_0$, is the generation that has the most recent common ancestor of the sample.

[Figure 1 axes: X from $\varepsilon$ to $1 - \varepsilon$; time from past to present.]

FIGURE 1. A realization of the Q process of a sample of four genes at a selectively neutral region linked to another locus at which a selectively favored mutation (destined to fix at time $\tau_f$) arose at time $\tau$ in the past (time measured in 2N generations). The most recent time that the Q process changes value [(4, 0) to (3, 0)] is $T_1$, the occurrence of the most recent common ancestor of sampled genes 2 and 3. During the selective phase [$0 < X(t) < 1$, $\tau_f < t < \tau$] the ancestral gene of sampled gene 1 crosses over to a wild type or b bearing chromosome, and the ancestral genes of sampled genes 3 and 4 have a most recent common ancestor [$Q(\tau) = (1, 1)$] and so $Q(\tau+) = (0, 2)$. Finally, at time $\tau + T_2$, the most recent common ancestor of the sample occurs [$Q(\tau + T_2) = (0, 1)$].

HUDSON and KAPLAN showed that, when 2N is large and time is measured in units of 2N generations, the distribution of the Q process when conditioned on the ancestral frequency process, {X(t), t > 0}, of the selected allele B, can be approximated by the time inhomogeneous Markov jump process described below. From now on time will be measured in units of 2N generations unless otherwise specified.

If, for any t > 0, Q(t) = (i, j), then, according to HUDSON and KAPLAN (1988), the probability that the process jumps to a different state before $t + \Delta$ equals

$$h_{ij}(X(t))\,\Delta + O(\Delta^2)$$

as $\Delta$ approaches 0, where

[equation not recovered]

and R is the expected number of crossovers between the neutral region and the selected locus per genome, per 2N generations. Also, $\chi(x > 0) = 1$ if x > 0 and equals 0 otherwise. $\chi(x < 1)$ is defined similarly. Finally $\binom{i}{2}$ is set equal to 0 if i < 2.

The only states that the Q process can jump to from (i, j) are (i - 1, j), (i, j - 1), (i + 1, j - 1) and (i - 1, j + 1). The first two states represent common ancestor events and the latter two crossover events.

Pr(escape) ≈ r/s
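Equations (5) and (6) in the Kaplan et al. excerpt give the neutral baseline that the sweep results are compared against. The sketch below, with illustrative function names, computes E(T) and E(T0) from the exponential waiting times and checks E(T) by simulation. For a sample of two genes it recovers the neutral value of 2 (in units of 2N generations) that the paper uses as its reference.

```python
import random


def expected_tree_size(n):
    """E(T) from (5): sum over j of j * E[Y(j)], with E[Y(j)] = 2 / (j (j - 1))."""
    return sum(j * 2.0 / (j * (j - 1)) for j in range(2, n + 1))


def expected_tmrca(n):
    """E(T0) from (6): sum over j of E[Y(j)]."""
    return sum(2.0 / (j * (j - 1)) for j in range(2, n + 1))


def simulate_tree_size(n, rng):
    """One draw of T: Y(j) is exponential with rate j (j - 1) / 2."""
    return sum(j * rng.expovariate(j * (j - 1) / 2.0) for j in range(2, n + 1))


rng = random.Random(42)
draws = [simulate_tree_size(4, rng) for _ in range(20_000)]
simulated_mean = sum(draws) / len(draws)
```

A sweep shortens the tree near the selected site, so E(T) falling below this baseline is the quantity plotted in the Kaplan et al. figures.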

Kaplan et al.

896 N. L. Kaplan, R. R. Hudson and C. H. Langley

[Figure 3, panels a and b: E(T) plotted against $\tau$ on logarithmic axes.]

FIGURE 3. E(T), the expected size (measured in 2N generations) of the ancestral tree of a sample of two genes at a selectively neutral region that is linked to a selected locus (Equation 12), is plotted against $\tau$, the ancestral time of fixation of the selected substitution, a. For different values of R, the expected number of crossovers between the neutral region and the selected locus per genome per 2N generations ($\alpha = 10^4$), and b. For different values of selection, $\alpha$ (R = 10); (see text for explanation).

significantly from 2, its neutral value, if $\tau < 0.1$ and $R/\alpha < 0.01$. This means that the expected level of variation will be substantially reduced for all the sites within a physical distance of $(0.01)\alpha/C$ base pairs of a locus at which a selected substitution has recently occurred. For example, if $2N = 10^8$, s = [value missing] and c = [value missing] then the width of the affected region is only about 200 bp. But if s = [value missing] and c = [value illegible], then the expected variation is reduced in a region about 2000 bp wide.

In Table 2 the values of M (Equation 20), $\Lambda_{MAX}$ (Equation 22) and $I_{22}(M)$ are given for different values of $\alpha$ ($2N = 10^8$ and $\delta = 0.01$). The value of $M_f$ in (20) is chosen so that $1 - P_{22}(M_f)$ is within 1% of its limit, 1 - E(e - 7)E*(s). The quantities M and $I_{22}(M)$ increase more or less linearly with $\alpha$, and the ratio $I_{22}(M)/M \approx 0.24$, independent of the value of $\alpha$. The product $2M\Lambda_{MAX}$ varies between 0.8 and 84 as $\alpha$ varies between $10^3$ and $10^6$. All the quantities in Table 2 are insensitive to 2N so long as $\alpha < 10^{-2}(2N)$ (calculations not shown).

The major goal of this paper is to determine the consequence of hitchhiking resulting from recurring selected substitutions on standing, selectively neutral variation at the DNA level. In Figure 4 E(T), (Equation 1a), is plotted as a function of $\Lambda_r$ for different values of $\alpha$ with $2N = 10^8$. Since $I_{22}(M)$ is insensitive to 2N (for fixed $\alpha$), the same is true of E(T). Even for small values of $\Lambda_r$ the hitchhiking effect can reduce E(T) substantially from 2 (its expectation for an isolated, selectively neutral locus) for large values of $\alpha$, e.g., if $\alpha \ge 10^5$, and $0.0002 < \Lambda_r < \Lambda_{MAX}$, then $E(T) \le 0.7$. Since the expected number of polymorphic sites, E(S), is proportional to E(T), it is clear that the hitchhiking effect associated with the rapid fixation of selected mutants (or very rare alleles) can substantially reduce the expected number of polymorphic sites in a sample from that expected in the absence of selection.

The expected number of selected substitutions in a region of size 2M until the most recent common ancestor of the sample, $E(K_0)$, can be calculated from Equation 23. The values of $E(K_0)$ corresponding to $\Lambda_{MAX}$ range from 0.7 (for $\alpha = 10^3$) to 4.1 (for $\alpha = 10^6$). It is interesting to note that this latter value is near the maximum value of $E(K_0)$, $M/I_{22}(M)$, which is about 5 and independent of $\alpha$.

DISCUSSION

The analysis of the theoretical population genetics model of selectively neutral molecular variation under the forces of mutation and random genetic drift (KIMURA 1983; GILLESPIE 1987) has yielded many important and useful predictions. The ability of the theory to explain much of the observed variation within and between species has led many to accept the proposition that most molecular polymorphism within and divergence between species is of no phenotypic consequence to the fitness of the organisms. Two critical assumptions of the neutral theory are that selected variants are so rare that they comprise a minute portion of molecular genetic variation and that their dynamics have negligible effects on the dynamics of the preponderant, neutral variation. It is this second assumption that we have investigated.

MAYNARD SMITH and HAIGH (1974) studied the effect of a single selected substitution of a newly arising mutant (or rare variant) on a neutral polymorphism, and showed that the hitchhiking effect of a single selected substitution can substantially reduce heterozygosity at a linked selectively neutral polymorphic locus. The experimental data motivating their investigation was the mounting evidence of allozyme polymorphism in the early 1970s. Today more de-

E[T] = E[total time on tree]

Recent sweep: Var. reduced

Old sweep: Var. not reduced

Strong selection!!!!

α = 2Ns; R = 2Nr

τ = (Generations since fixation) / (2N)


s/r

• Routinely mis-quoted as “distance at which a sweep will affect variation”.

• Wrong! It is the distance at which a site has Pr(escape) close to 1.
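The contrast between the two distances can be made concrete with a small calculation. This is only a sketch under assumptions introduced here: a uniform per-base-pair recombination rate c (so r = c × distance), the slide's approximation Pr(escape) ≈ r/s capped at 1, and the R/α < 0.01 criterion from the Kaplan et al. excerpt for “substantially reduced” variation. The values of s and c are hypothetical.

```python
def escape_probability(distance_bp, s, c):
    """Slide approximation Pr(escape) ~ r / s, with r = c * distance, capped at 1."""
    return min(1.0, c * distance_bp / s)


def escape_distance(s, c):
    """Distance at which r / s reaches 1, i.e. escape is almost certain."""
    return s / c


def strongly_affected_distance(s, c, threshold=0.01):
    """Distance within which R / alpha = c * d / s stays below the threshold."""
    return threshold * s / c


# Hypothetical values: s = 0.01, c = 1e-8 crossovers per bp per generation.
s, c = 0.01, 1e-8
far = escape_distance(s, c)                # about 1,000,000 bp
near = strongly_affected_distance(s, c)    # about 10,000 bp
```

With these numbers the region of strongly reduced variation is two orders of magnitude narrower than s/r, which is the point of the second bullet.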

Hitchhiking Revisited 895

[Equation garbled in extraction; recoverable fragment: $2\Lambda_r M(1 - p)\,/\,(1 + 2\Lambda_r I_{22}(M))$.]

Since p is close to 1, $1 - p = 2\Lambda_r M E(\eta)$. Hence, $\Lambda_{MAX}$ satisfies

[equation not recovered]

where $\delta$ is a small positive number and $E'(\eta)$ is the estimate of $E(\eta)$.

The renewal argument used to derive (21) can also be used to obtain the expectation of $K_0$, so long as $\Lambda_r \le \Lambda_{MAX}$. Indeed,

$$E(K_0) = \cdots = \frac{2\Lambda_r M}{1 + 2\Lambda_r I_{22}(M)}$$

[intermediate steps not recovered].

The analysis up until now has assumed that the initial frequency of the favored allele is 1/2N as it would be in the case of a newly arising mutant that is destined to fix. If the favored allele is a rare existing variant that for some reason becomes selectively favored, then the analysis presented above still holds.

CALCULATIONS

For simplicity, we only examine a sample of size 2. Similar calculations can be made for larger samples. When considering the hitchhiking effect caused by the substitution of a single, rare, selected allele, the distribution of the Q process for a sample of size 2, only depends on $P_{22}(R)$, where R is the expected number of crossovers per genome per 2N generations between the selectively neutral region and the selected locus. In Figure 2, $P_{22}(R)$ is plotted as a function of R for different values of $\alpha$. To calculate $P_{22}(R)$ we used formula (14) with $\varepsilon = 10^2/2N$ and the population size, $2N = 10^8$. This value of $\varepsilon$ is small enough so that $P_{2,2,0}(\tau(\varepsilon))$ is negligible. Calculations not presented here show that the curves in Figure 2 are fairly insensitive to 2N so long as $2N \ge 100\alpha$.

Since $5/\alpha > 10^2/2N$, the curves in Figure 2 overestimate $P_{22}(R)$. In order to examine how much of an overestimation there is, a simulation was performed as described in the theory section. For each value of $\alpha$, $P_{22}(R)$ was evaluated for R = 10, R = 100 and R = 1000. For each set of parameter values 50 sample paths of the X process that reach $5/\alpha$ were simulated.

FIGURE 2. $P_{22}(R)$, the probability of escaping the hitchhiking effect for a sample of size 2, is plotted against R, the expected number of crossovers between the selectively neutral region and the selected locus per genome per 2N generations for various values of $\alpha$, where $2N = 10^8$ (see text for explanation).

In all cases the estimate of $P_{22}(R)$ differed from the value plotted in Figure 2 by less than 2%. For example, if $\alpha = 10^4$ and R = 100, then the plotted value of $P_{22}(R)$ is 0.158, while the estimate obtained from the simulation is 0.155. Thus the error caused by using the deterministic model for the frequency of the B allele when the frequency is small, is negligible.

As the amount of crossing over increases between the neutral region and the selected locus, the probability of no common ancestor occurring during the selected phase, $P_{22}(R)$, would be expected to increase. It is clear from Figure 2 that $P_{22}(R)$ is an increasing function of R. Increasing $\alpha$ decreases $\eta$, the time of the selective phase, and so one would expect that $P_{22}(R)$ decreases as a function of $\alpha$. This behavior is also seen in Figure 2. Another way to interpret this observation is that the larger $\alpha$ is, the larger the region of the genome that is affected by the selected substitution.

The curves in Figure 2 are all similar in shape and they appear to be equal distance from each other, suggesting that $P_{22}(R)$ may be a function of $R/\alpha$ and so independent of population size. Direct calculation however, shows that this is only approximately true.

The expectation, E(T), whose formula is given in (12), is plotted in Figure 3a for different values of R, and in Figure 3b for different values of $\alpha$ as a function of $\tau$ (measured in $2N = 10^8$ generations), the ancestral time when the selected mutation was introduced into the population (or when the rare allele becomes selectively favored). In Figure 3a, $\alpha = 10^4$ and in Figure 3b, R = 10. It is not difficult to show from Figure 2 and Equation (12) that E(T) is an increasing function of $\tau$ and R and decreasing function of $\alpha$. In particular, as is seen in Figure 3, a and b, E(T) will differ

896 N. L. Kaplan, R. R. Hudson and C. H. Langley

10'

10'

10'

10'

10'

10 '

4 3 2 1 10 10

1 10

a 10 10 10

I c

I I

1

2

J

4 l-

l o 4

3

a=10

10 -2 10 -l l o o 10'

FIGURE 3.--E(T), the expected size (measured in 2N genera- tions) of the ancestral tree of a sample of two genes at a selectively neutral region that is linked to a selected locus (Equation 12), is plotted against T, the ancestral time of fixation of the selected substitution, a. For different values of R, the expected number of crossovers between the neutral region and the selected locus per genome per 2N generations (a = lo4), and b. For different values of selection, a ( R = 10); (see text for explanation).

significantly from 2, its neutral value, if 7 < 0.1 and R / a < 0.01. This means that the expected level of variation will be substantially reduced for all the sites within a physical distance of (O.Ol)a/C base pairs of a locus at which a selected substitution has recently occurred. For example, if 2N = lo8, s = and c =

then the width of the affected region is only about 200 bp. But if s = and c = 1 0-', then the expected variation is reduced in a region about 2000 bp wide.

In Table 2 the values of $M$ (Equation 20), $\Lambda_{MAX}$ (Equation 22) and $Z_{22}(M)$ are given for different values of $\alpha$ ($2N = 10^8$ and $\delta = 0.01$). The value of $M$ in (20) is chosen so that $1 - P_{22}(M)$ is within 1% of its limit, 1 - E ( e - 7 ) E*(s). The quantities $M$ and $Z_{22}(M)$ increase more or less linearly with $\alpha$, and the ratio $Z_{22}(M)/M \approx 0.24$, independent of the value of $\alpha$. The product $2M\Lambda_{MAX}$ varies between 0.8 and 84 as $\alpha$ varies between $10^3$ and $10^{[\ldots]}$. All the quantities in Table 2 are insensitive to $2N$ so long as $\alpha < 10^{-[\ldots]}(2N)$ (calculations not shown).

The major goal of this paper is to determine the consequence of hitchhiking resulting from recurring selected substitutions on standing, selectively neutral variation at the DNA level. In Figure 4, $E(T)$ (Equation la) is plotted as a function of $\Lambda_r$ for different values of $\alpha$ with $2N = 10^8$. Since $Z_{22}(M)$ is insensitive to $2N$ (for fixed $\alpha$), the same is true of $E(T)$. Even for small values of $\Lambda_r$ the hitchhiking effect can reduce $E(T)$ substantially from 2 (its expectation for an isolated, selectively neutral locus) for large values of $\alpha$; e.g., if $\alpha \geq 10^5$ and $0.0002 < \Lambda_r < \Lambda_{MAX}$, then $E(T) \leq 0.7$. Since the expected number of polymorphic sites, $E(S)$, is proportional to $E(T)$, it is clear that the hitchhiking effect associated with the rapid fixation of selected mutants (or very rare alleles) can substantially reduce the expected number of polymorphic sites in a sample from that expected in the absence of selection.

The expected number of selected substitutions in a region of size $2M$ until the most recent common ancestor of the sample can be calculated from Equation 23. The values of this expectation corresponding to $\Lambda_{MAX}$ range from 0.7 (for $\alpha = 10^3$) to 4.1 (for $\alpha = 10^6$). It is interesting to note that this latter value is near the maximum value of the expectation, $M/Z_{22}(M)$, which is about 5 and independent of $\alpha$.

DISCUSSION

The analysis of the theoretical population genetics model of selectively neutral molecular variation under the forces of mutation and random genetic drift (KIMURA 1983; GILLESPIE 1987) has yielded many important and useful predictions. The ability of the theory to explain much of the observed variation within and between species has led many to accept the proposition that most molecular polymorphism within and divergence between species is of no phenotypic consequence to the fitness of the organisms. Two critical assumptions of the neutral theory are that selected variants are so rare that they comprise a minute portion of molecular genetic variation and that their dynamics have negligible effects on the dynamics of the preponderant, neutral variation. It is this second assumption that we have investigated.

MAYNARD SMITH and HAIGH (1974) studied the effect of a single selected substitution of a newly arising mutant (or rare variant) on a neutral polymorphism, and showed that the hitchhiking effect of a single selected substitution can substantially reduce heterozygosity at a linked selectively neutral polymorphic locus. The experimental data motivating their investigation was the mounting evidence of allozyme polymorphism in the early 1970s. Today more de-

(Remember this when reading Kim and Stephan)

Which SFS?? (Blue = no selection, for reference)

[Figure: bar charts of the site frequency spectrum; horizontal axis: frequency classes 1 to 9; vertical axis: 0 to 20]

Single recent, strong sweep. Fay and Wu.

Sweeps occurring at some rate. Braverman et al., Przeworski.

Hitchhiking Effect 789

over by these typical $\hat{\theta}$s gives $R_\theta$. Table 2 lists $R_\theta$ for several recent data sets from regions of the D. melanogaster genome where crossing over is reduced and thus where the hitchhiking effect is expected to operate (MARTÍN-CAMPOS et al. 1992; BEGUN and AQUADRO 1993, 1995; AGUADÉ et al. 1994). Included in this table are one example of a locus exhibiting a significant and negative Tajima's $D$, and the only four published surveys in which Tajima's $D$ could be calculated yet was not significantly different from zero.

Plotted in Figure 5 is average Tajima's $D$ against $R_{E(T)}$. Each data point (larger circles) in this figure is the average $D$ calculated from 1000 genealogies constructed for a different combination of $\Lambda_r$, which ranged from zero to $\Lambda_{MAX}$, and $\alpha$, which equaled $10^3$, $10^4$, $10^5$, $10^6$, or $10^7$. These are the same points used to draw Figure 3. Also plotted in this figure are the 95% confidence limits (smaller circles) for each average $D$. Figure 5 demonstrates that the average Tajima's $D$ decreases as hitchhiking reduces $E(T)$. The ability of hitchhiking to reduce $E(T)$ depends on $\alpha$ and/or $\Lambda_r$, per Figure 3. Thus moving left on the horizontal axis is the result of increasing $\alpha$ and/or $\Lambda_r$. Because the average $D$ decreases linearly if either $\alpha$ or $\Lambda_r$ is increased (in this range), $R_{E(T)}$ can be used as a general measure of the strength of the hitchhiking effect. This figure was produced to reflect the su($w^a$) data set that had $n = 50$ and $S = 17$, but it is typical of all parameters we examined (all permutations of $n = (10, 30, 50)$ and $S = (10, 30, 50)$). For the su($w^a$) locus, $R_\theta = 0.32$ (Table 2). If one takes this as an estimate of $R_{E(T)}$, then according to Figure 5 the expected value of $D$ is $-1.52$. This result is listed in Table 2, as are the average Tajima's $D$s simulated with values of $\alpha$ and $\Lambda_r$ chosen such that $R_{E(T)}$ approximately equaled the $R_\theta$ of four other recent data sets. In spite of these large, negative expected values of $D$ under the hitchhiking model, four loci exhibiting low $R_\theta$ had $D$s near zero.

FIGURE 4. The average value of Tajima's $D$ as a function of $\Lambda_r$. The parameters are $n = 50$, $S = 17$, $\alpha = 10^3$, $10^4$, $10^5$, $10^6$, or $10^7$, and $\Lambda_r$ ranges from zero to $\Lambda_{MAX}$, which varies depending on $\alpha$.

While the above analysis indicates that a large shift in $D$ is expected under hitchhiking strong enough to reduce $\theta$ by $R_\theta$, a more pertinent question concerns the probability of observing a statistically significantly negative $D$ given a particular reduction in $R_\theta$. Figure 6 depicts the outcome of a series of simulations with various hitchhiking parameters. Each circle was obtained with a unique combination of values of $\alpha$ and $\Lambda_r$, for a sample size of 50 with 17 segregating sites. The abscissa is the relative reduction in $E(T)$ and the ordinate is the proportion of realizations in which the observed $D$ was less than $D_{0.975}$. In general, the trend is toward more cases of $D \leq D_{0.975}$ as $R_{E(T)}$ decreases (i.e., as the strength of the hitchhiking effect increases).

According to Figure 6, when $R_{E(T)} = 0.32$ (corresponding to the $R_\theta = 0.32$ exhibited by the su($w^a$) data; see Table 2), 0.505 of the genealogies had $D \leq D_{0.975}$. This result is listed in Table 2 along with the results of similar analyses for four other data sets (MARTÍN-CAMPOS et al. 1992; BEGUN and AQUADRO 1993, 1995). Of the four loci in this table with nonsignificant Tajima's $D$s, su(s) had the highest proportion of $D \leq D_{0.975}$,

Fig. from Braverman et al. (1995) Genetics
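Since the excerpt above leans heavily on Tajima's D, here is a minimal sketch of how the statistic can be computed from an unfolded site frequency spectrum, using the standard constants from Tajima (1989). The function name and the input layout are assumptions made for illustration; this is not the code used by Braverman et al.

```python
import math


def tajimas_d(sfs):
    """Tajima's D from an unfolded site frequency spectrum.

    sfs[i - 1] is the number of sites at which the derived allele occurs
    i times in the sample, for i = 1 .. n - 1, so n = len(sfs) + 1.
    """
    n = len(sfs) + 1
    s = sum(sfs)  # number of segregating sites, S
    if n < 3 or s <= 0:
        raise ValueError("need at least 3 sequences and one segregating site")

    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))

    # Mean pairwise differences (pi) and Watterson's estimator (theta_W).
    pi = sum(2.0 * i * (n - i) * c for i, c in enumerate(sfs, start=1)) / (n * (n - 1))
    theta_w = s / a1

    # Variance constants from Tajima (1989).
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    return (pi - theta_w) / math.sqrt(e1 * s + e2 * s * (s - 1))


# n = 50 and S = 17 as in the su(w^a) example; the spectra are made up.
all_singletons = [17] + [0] * 48
print(tajimas_d(all_singletons))  # excess of rare alleles gives D < 0
```

An excess of rare variants pulls $\pi$ below $\theta_W$ and makes D negative, which is the signature the simulations above are tracking.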

“Pseudo-hitchhiking”

• Ignore the trajectory—assume fixation time close to zero. This means assuming very strong selection.

• Hitch-hiking events occur at rate rho, at which time all lineages coalesce (in absence of recombination).

“Pseudo-hitchhiking”

$E[\Delta x] = 0$

$V[\Delta x] = \rho\, x(1 - x)$

$\rho$ = rate of hitch-hiking events

$N_e = \dfrac{N}{1 + 2N\rho y^2} \to \dfrac{1}{\rho y^2}$ as $N \to \infty$

The extent to which “rho” (and y…) depend on N determines the extent to which variation is constant across species.

916 J. H. Gillespie

In fact, the nonhitchhiking alleles should all be lowered by a different random amount, reflecting the various effects of drift and recombination that occur during the hitchhiking event. In some cases, there may even be two separate hitchhiking alleles. It appears to be quite difficult to add this particular element of randomness, although further work may uncover a way.

Although further refinements of the pseudohitchhiking model will be forthcoming, the remainder of this article is concerned with the properties of the model as defined above.

THE COALESCENT

As a first step, consider the genealogy of $n$ alleles sampled from a pseudohitchhiking population with deterministic $y$ and $N = \infty$. In this case, the only way that a coalescence can occur is if there is a hitchhiking event. The probability of such an event in a particular generation is $\rho$. If there were an event, then a single copy of one of the alleles in the population increases its frequency to $y$. The probability that $i$ of the $n$ sampled alleles are descended from that fortunate allele is the binomial probability

$$\binom{n}{i} y^i (1 - y)^{n-i}.$$

A coalescence occurs when $i \geq 2$. Unlike the neutral case, a coalescence can involve more than two lineages, which is the root cause of $D < 0$. We can summarize these observations as follows:

The probability that a coalescence does not occur in a particular generation is

$$(1 - \rho) + \rho\left[(1 - y)^n + ny(1 - y)^{n-1}\right].$$

The probability that a coalescence does occur in a particular generation is

$$\rho \sum_{i=2}^{n} \binom{n}{i} y^i (1 - y)^{n-i}.$$

The probability that the coalescent shrinks from $n$ alleles to $n - i$ alleles in a particular generation is

$$\rho \binom{n}{i+1} y^{i+1} (1 - y)^{n-i-1}.$$

When $n = 2$, the probability of a coalescence is $\rho y^2$. Thus, the mean number of mutations separating these two alleles is

$$\frac{2u}{\rho y^2}. \tag{17}$$

This same result can be obtained by taking the limit of Equation 11 as $N \to \infty$:

$$\lim_{N \to \infty} \frac{4Nu}{1 + 2N\rho y^2} = \frac{2u}{\rho y^2}.$$

Figure 7. The average values of Tajima's $D$ for different sample sizes. The $E\{D(n)\}$ curves come from samples drawn from a direct simulation of the pseudohitchhiking model for a sample of size $n$. The $D(n)$ curves come from a direct simulation of the coalescent using Equation 18. In both cases, $\rho = 0.138$, $y = 0.3$, and $u = 5 \times 10^{-4}$.

The next increment in complexity involves the addition of genetic drift. In any particular generation, a coalescence may be due to the finiteness of the population or to hitchhiking. In the former case, the coalescent can only shrink from $n$ to $n - 1$, while in the latter case the size of the coalescent can shrink from $n$ to $n - i$, $i = 1 \ldots (n - 1)$. Thus, the probabilities of all possible transitions are

$$n \to \begin{cases} n & \text{w.p. } 1 - \rho - \dfrac{n(n-1)}{4N} + \rho\left[(1 - y)^n + ny(1 - y)^{n-1}\right] \\[1ex] n - 1 & \text{w.p. } \dfrac{n(n-1)}{4N} + \rho\dbinom{n}{2} y^2 (1 - y)^{n-2} \\[1ex] n - i & \text{w.p. } \rho\dbinom{n}{i+1} y^{i+1} (1 - y)^{n-i-1}, \quad i = 2 \ldots (n - 1). \end{cases} \tag{18}$$

These transition probabilities, plus the usual assumption that the times to successive coalescences are exponentially distributed, allow a complete probabilistic description of the coalescent for small $n$. However, for $n > 4$ the results are completely unwieldy. On the other hand, it is very easy to simulate the coalescent in the same manner that is done for neutral coalescents (Hudson 1990).

Figure 7 gives examples of the calculation of Tajima's $D$ using a direct simulation of the pseudohitchhiking model and using a coalescent simulation with the transition probabilities given above. The two approaches give identical answers, as they should. There are two interesting aspects to these results. The first is that Tajima's $D$ becomes more negative with increasing population size. The negativity comes from the fact that a coalescence can involve more than two lineages; the increasing magnitude comes from a decreasing role of genetic drift and, with it, a decreasing frequency of $n \to n - 1$ transitions.

The second interesting aspect of Figure 7 is the in-

Figure from Gillespie (2000)
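The per-generation transition probabilities described in the excerpt above are easy to tabulate, which is the first step toward simulating this coalescent. The sketch below is an illustration under stated assumptions: the function name is hypothetical, drift and hitchhiking are treated as mutually exclusive within a generation (as in the excerpt), and the parameter values in the usage lines are made up.

```python
from math import comb


def transition_probs(n, pop_size, rho, y):
    """Probability that n lineages become k lineages in one generation.

    Returns a dict {k: probability} for k = 1 .. n.
    pop_size may be float('inf') to switch genetic drift off.
    """
    if n < 2:
        raise ValueError("need at least two lineages")
    probs = {}
    # A hitchhiking event in which i + 1 sampled lineages descend from the
    # swept copy merges them into one ancestor, so n shrinks to n - i.
    for i in range(1, n):
        probs[n - i] = rho * comb(n, i + 1) * y ** (i + 1) * (1 - y) ** (n - i - 1)
    # Ordinary drift can only merge one pair of lineages: n -> n - 1.
    probs[n - 1] += n * (n - 1) / (4.0 * pop_size)
    # Whatever probability is left corresponds to no coalescence.
    probs[n] = 1.0 - sum(probs.values())
    return probs


# Hypothetical parameter values.
for k, p in sorted(transition_probs(5, 10_000, 0.001, 0.3).items(), reverse=True):
    print(k, p)
```

With `pop_size = float('inf')` and `n = 2` the only coalescence probability is $\rho y^2$, matching the two-allele result quoted in the text. Transitions that remove more than one lineage at once are what produce the multiple mergers behind negative Tajima's D.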

Soft sweeps: intuition

Beneficial mutation initially on > 1 genetic background, leading to the prediction that the reduction in diversity at linked, neutral sites will not be as extreme as for a “classic” sweep (now often called a “hard” sweep).

Soft sweep simulation

• Stochastic trajectories

• Neutral trajectory + selected “stitched” together.

• Vary “ts”, f, etc.

2314 MOLLY PRZEWORSKI ET AL.

FIG. 1. A possible genealogy for six chromosomes at a neutral locus linked to a site where a beneficial allele, A, has reached fixation. In this example, A has just fixed in the population (at time $T = 0$), so all lineages carry the favored allele. Going backwards in time, A is favored from $T$ to $t_s$, then neutrally evolving from $t_s$ (when it is at frequency $f$) to $t_m$. The trajectories for the selected and neutral phases are shown in black and gray, respectively. The coalescent genealogy for the six chromosomes is depicted with dashed lines, while recombination events between allelic classes are indicated with slanted arrows. Most coalescent events occur when allele A is at low frequency. Because A is neutrally evolving from $t_s$ to $t_m$, its sojourn time is longer than it would be under a standard sweep, thus providing more opportunity for recombination. Note that, in this example, the most recent common ancestor has not been reached by 2500 generations ago.

alleles: for $t$ between $t_s$ and $t_m$, a and A are selectively equivalent, while for $t$ between $T$ and $t_s$, they are not. During these phases, the lineages ancestral to the sample from the neutral region can be thought of as evolving in a structured population, where the allelic classes (a and A) define subpopulations and recombination between ancestral lineages of each class acts as migration (Hudson and Kaplan 1988; Barton 1998; Nordborg 2001).

Program

Using this analogy of a structured coalescent, we can simulate the genealogical history of a sample from the neutrally evolving region by generating the frequency of the selected allele through time (hereafter “the trajectory”), then generating an ancestral recombination graph conditional on this trajectory (see Fig. 1). This general approach was pioneered by Kaplan et al. (1989) and since used in other studies (e.g., Przeworski 2002; Ray et al. 2003; Coop and Griffiths 2004; Innan and Kim 2004).

We implement the approach by modifying the coalescent program described in Przeworski (2002). The only change pertains to the trajectory of allele A. In Przeworski (2002), allele A is favored from introduction to fixation and a deterministic approximation is used to model the trajectory. Here, A is initially neutral, then beneficial, and the trajectory of A is modeled stochastically (as described below). Thus, while we present results for a fixed time of fixation, $T$, the times $t_m$ and $t_s$ are random and will therefore vary from run to run.

We error-checked the program by writing an independent code (which uses a birth-death approximation to the diffusion process) and comparing the results. The latter is implemented as a version of the program SELSIM (Spencer and Coop 2004) and is available at http://pritch.bsd.uchicago.edu/software.html.

Simulating the Trajectory of the A Allele

The frequency of an allele A in the population can be modeled by a diffusion process $X(t)$ on $(0,1)$, with generator

$$L = \frac{1}{2}\sigma^2(x)\frac{\partial^2}{\partial x^2} + \mu(x)\frac{\partial}{\partial x}, \tag{1}$$

where $\sigma^2(x) = x(1 - x)$ is the infinitesimal variance and $\mu(x)$ the infinitesimal mean of the diffusion process (cf. Ewens 2004). In our model, there are two diffusion processes: a neutral one, $X_N(t)$, and a selected one, $X_S(t)$. These have infinitesimal means $\mu_N(x) = 0$ and $\mu_S(x) = 2Nsx(1 - x)$, respectively.

We consider these processes conditional on their reaching

Figure from Przeworski et al. (2005) Evolution
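To make the diffusion description above concrete, here is a rough forward-in-time sketch that draws a stochastic allele-frequency path with infinitesimal mean $2Ns\,x(1-x)$ and infinitesimal variance $x(1-x)$. It uses a plain Euler-Maruyama discretisation with rejection sampling to condition on fixation, which is an assumption made for illustration and not the scheme used by Przeworski et al.; the function names, step size and parameter values are all hypothetical.

```python
import math
import random


def simulate_frequency_path(x0, two_n_s, dt=1e-4, rng=None):
    """Euler-Maruyama path of the allele-frequency diffusion until absorption.

    Time is in the same scaled units as the diffusion.
    two_n_s = 0 gives the neutral process; two_n_s > 0 gives the selected one.
    """
    rng = rng or random.Random()
    x = x0
    path = [x]
    while 0.0 < x < 1.0:
        mean = two_n_s * x * (1.0 - x)          # infinitesimal mean
        sd = math.sqrt(x * (1.0 - x) * dt)      # sqrt(infinitesimal variance * dt)
        x = x + mean * dt + sd * rng.gauss(0.0, 1.0)
        x = min(1.0, max(0.0, x))               # clamp to the absorbing boundaries
        path.append(x)
    return path


def path_conditional_on_fixation(x0, two_n_s, dt=1e-4, rng=None):
    """Keep drawing paths until one ends in fixation (simple rejection)."""
    rng = rng or random.Random()
    while True:
        path = simulate_frequency_path(x0, two_n_s, dt, rng)
        if path[-1] == 1.0:
            return path


rng = random.Random(1)
selected = path_conditional_on_fixation(x0=0.05, two_n_s=200.0, rng=rng)
print(len(selected), selected[0], selected[-1])
```

A “stitched” soft-sweep trajectory of the kind described on the slides would join a neutral path that reaches frequency f to a selected path started at f; the sketch above only covers a single phase.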

2316 MOLLY PRZEWORSKI ET AL.

FIG. 2. Mean diversity levels as a function of distance from the selected site for different values of $f$, the frequency at which the allele is first favored. Diversity levels are summarized by the mean $\pi$ (dashed), $\theta_W$ (gray) and $\theta_H$ (black). Under the neutral equilibrium model, all three statistics are unbiased estimators of $\theta$, the population mutation rate. (A) Plausible parameters for humans. A total of $10^4$ simulations were run for 100 chromosomes, with $N = 10^4$, $s = 0.05$, and $\theta = \rho = 10^{-3}$ per base pair ($\rho = 4Nr$; see Methods for other parameter definitions). The time since the fixation of the beneficial allele is zero. Under the neutral equilibrium model, $E(\pi) = E(\theta_W) = E(\theta_H) = 1$ per kilobase. (B) Plausible parameters for Drosophila melanogaster. A total of $10^4$ simulations were run for 100 chromosomes, with $N = 10^6$, $s = 0.01$, $\theta = 0.01$ per base pair, and $\rho = 0.1$ per base pair. The time since the fixation of the beneficial allele is zero. Under the neutral equilibrium model, $E(\pi) = E(\theta_W) = E(\theta_H) = 1$ per 100 bp.

has more effect on $\pi$ than $\theta_W$ in all four examples, this is not the case when $f = 0.05$. We illustrate this difference between the two selection models in more detail by presenting five randomly generated examples of allele frequencies in a linked neutral region (Fig. 4). Close to the selected site, a standard sweep tends to produce an excess of rare and high-frequency alleles relative to the neutral equilibrium model (e.g., Maynard Smith and Haigh 1974; Kaplan et al. 1989; Simonsen et al. 1995; Fay and Wu 2000). As seen in examples 2 and 5, this also happens under a model where $f \gg 1/(2N)$. What distinguishes the model of directional selection on standing variation from the standard sweep model is the appreciable number of cases where there is a relative excess of intermediate-frequency alleles (examples 1 and 4 in Fig. 4). This suggests that for intermediate values of $f$, directional selection leads to a much larger variance in frequency spectra than expected under a standard sweep model.

To quantify this observation, we estimate the variance and central 95% probability interval of Tajima's $D$ (Tajima 1989), a commonly used summary of the folded allele frequency spectrum based on the (approximately) normalized difference between $\pi$ and $\theta_W$ (Table 1). Under a neutral equilibrium model, the mean of this statistic is roughly zero, while a negative (positive) value reflects an excess of rare (intermediate-frequency) alleles. We first consider the case where the beneficial allele has just reached fixation. As expected, the standard selective sweep leads to sharply reduced values of $D$. In contrast, for $f = 0.05$, the mean is only slightly reduced from zero, but both tails of the distribution of $D$ are greatly increased (Table 1). If the time since fixation is instead 2000 generations, or approximately 50,000 years in humans, then both models lead to more negative $D$-values. However, the variance in outcomes remains much larger under a model of directional selection on standing variation and

f=0.05: reduction in variation not nearly as pronounced

f=0.2: effect would be very hard to detect

“Hard” sweep: variation strongly reduced near selected site

$\pi$ = dashed; $\theta_W$ = grey; $\theta_H$ = black
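The three diversity summaries in the legend can all be computed from one unfolded site frequency spectrum. The sketch below uses the standard definitions ($\pi$ as mean pairwise differences, Watterson's $\theta_W$, and Fay and Wu's $\theta_H$); the function name and input layout are assumptions made for illustration, and the example spectrum is the neutral expectation rather than real data.

```python
def theta_estimators(sfs):
    """Return (pi, theta_W, theta_H) from an unfolded site frequency spectrum.

    sfs[i - 1] is the number of sites at which the derived allele occurs
    i times in a sample of n = len(sfs) + 1 chromosomes.
    """
    n = len(sfs) + 1
    if n < 2:
        raise ValueError("need at least two chromosomes")
    pairs = n * (n - 1) / 2.0
    a1 = sum(1.0 / i for i in range(1, n))

    # pi weights intermediate frequencies most heavily.
    pi = sum(i * (n - i) * c for i, c in enumerate(sfs, start=1)) / pairs
    # theta_W weights every segregating site equally.
    theta_w = sum(sfs) / a1
    # theta_H weights high-frequency derived alleles most heavily.
    theta_h = sum(i * i * c for i, c in enumerate(sfs, start=1)) / pairs
    return pi, theta_w, theta_h


# Expected neutral spectrum, E[xi_i] = theta / i, with theta = 5 and n = 10.
neutral = [5.0 / i for i in range(1, 10)]
print(theta_estimators(neutral))
```

Under the neutral expectation the three values agree, so departures among them (for example $\theta_H$ well above $\pi$ near a swept site) are what the curves in the figure are contrasting.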

Take-home

• For sweeps to be detectable, they should be:

• strong (relative to r)

• recent

• in regions of low recombination (relative to s)

• acting on variants that were rare when selection began

Quantitative traits

• Any one beneficial mutation is not guaranteed to fix.

• I.e., sweeps can “stall out”; see Chevin and Hospital (2008) Genetics.

• This is because the genetic background may move the mean trait value to the optimum before fixation occurs.

• But patterns of hitch-hiking should depend mostly on whether or not the mutation was rare at the onset of selection.

Considerations

• The demographic null model matters. The methods we read will have to “account for demography”.

• How much hitch-hiking (HH) does there need to be before this is impossible? (This is an open question.)