19
GENETICS | INVESTIGATION Admixture, Population Structure and F-statistics Benjamin M Peter 1 1 Department of Human Genetics, University of Chicago, Chicago IL USA ABSTRACT Many questions about human genetic history can be addressed by examining the patterns of shared genetic variation between sets of populations. A useful methodological framework for this purpose are F-statistics, that measure shared genetic drift between sets of two, three and four populations, and can be used to test simple and complex hypotheses about admixture between populations. This paper provides context from phylogenetic and population genetic theory. I review how F-statistics can be interpreted as branch lengths or paths, and derive new interpretations using coalescent theory. I further show that the admixture tests can be interpreted as testing general properties of phylogenies, allowing extension of some ideas applications to arbitrary phylogenetic trees. The new results are used to investigate the behavior of the statistics under different models of population structure, and show how population substructure complicates inference. The results lead to simplified estimators in many cases, and I recommend to replace F 3 with the average number of pairwise differences for estimating population divergence. KEYWORDS admixture; gene flow; phylogenetics; population genetics; phylogenetic network 1 2 F or humans, whole-genome genotype data is now available for 3 individuals from hundreds of populations (Lazaridis et al. 4 2014; Yunusbayev et al. 2015), opening up the possibility to ask 5 more detailed and complex questions about our history (Pickrell 6 and Reich 2014; Schraiber and Akey 2015), and stimulating the 7 development of new tools for the analysis of the joint history of 8 these populations (Reich et al. 2009; Patterson et al. 2012; Pickrell 9 and Pritchard 2012; Lipson et al. 2013; Ralph and Coop 2013; 10 Hellenthal et al. 2014). A simple and intuitive framework for this 11 purpose that has quickly gained in popularity are the F-statistics, 12 introduced by Reich et al. (2009), and summarized in Patterson 13 et al. (2012). In that framework, inference is based on “shared 14 genetic drift” between sets of populations, under the premise 15 that shared drift implies a shared evolutionary history. Tools 16 based on this framework have quickly become widely used in 17 the study of human genetic history, both for ancient and modern 18 DNA (Green et al. 2010; Reich et al. 2012; Lazaridis et al. 2014; 19 Haak et al. 2015; Allentoft et al. 2015). 20 Some care is required with terminology, as the F-statistics 21 sensu Reich et al. (2009) are distinct, but closely related to 22 Wright’s fixation indices (Wright 1931; Reich et al. 2009), which 23 are also often referred to as F-statistics. Furthermore, it is nec- 24 Copyright © 2016 by the Genetics Society of America doi: 10.1534/genetics.XXX.XXXXXX Manuscript compiled: Wednesday 3 rd February, 2016% 1 [email protected] essary to distinguish between statistics (quantities calculated 25 from data) and the underlying parameters (which are part of the 26 model)(Weir and Cockerham 1984). 27 In this paper, I will mostly discuss model parameters, and I 28 will therefore refer to them as drift indices. The term F-statistics 29 will be used when referring to the general framework introduced 30 by Reich et al. (2009), and Wright’s statistics will be referred to 31 as F ST or f . 32 Most applications of the F-statistic-framework can be phrased 33 in terms of the following six questions: 34 1. (Treeness test): Are populations related in a tree-like fash- 35 ion? (Reich et al. 2009) 36 2. (Admixture test): Is a particular population descended from 37 multiple ancestral populations? (Reich et al. 2009) 38 3. (Admixture proportions): What are the contributions from 39 different populations to a focal population (Green et al. 2010; 40 Haak et al. 2015). 41 4. (Number of founders): How many founder populations are 42 there for a certain region? (Reich et al. 2012; Lazaridis et al. 43 2014) 44 5. (Complex demography): How can mixtures and splits of 45 population explain demography? (Patterson et al. 2012; 46 Lipson et al. 2013) 47 Genetics, Vol. XXX, XXXX–XXXX February 2016 1 Genetics: Early Online, published on February 8, 2016 as 10.1534/genetics.115.183913 Copyright 2016.

Admixture, Population Structure and F-Statistics - Genetics · GENETICS | INVESTIGATION Admixture, Population Structure and F-statistics Benjamin M Peter1 1Department of Human Genetics,

  • Upload
    others

  • View
    27

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Admixture, Population Structure and F-Statistics - Genetics · GENETICS | INVESTIGATION Admixture, Population Structure and F-statistics Benjamin M Peter1 1Department of Human Genetics,

GENETICS | INVESTIGATION

Admixture, Population Structure and F-statisticsBenjamin M Peter1

1Department of Human Genetics, University of Chicago, Chicago IL USA

ABSTRACT Many questions about human genetic history can be addressed by examining the patterns of shared geneticvariation between sets of populations. A useful methodological framework for this purpose are F-statistics, that measureshared genetic drift between sets of two, three and four populations, and can be used to test simple and complexhypotheses about admixture between populations. This paper provides context from phylogenetic and populationgenetic theory. I review how F-statistics can be interpreted as branch lengths or paths, and derive new interpretationsusing coalescent theory. I further show that the admixture tests can be interpreted as testing general properties ofphylogenies, allowing extension of some ideas applications to arbitrary phylogenetic trees. The new results are usedto investigate the behavior of the statistics under different models of population structure, and show how populationsubstructure complicates inference. The results lead to simplified estimators in many cases, and I recommend to replaceF3 with the average number of pairwise differences for estimating population divergence.

KEYWORDS admixture; gene flow; phylogenetics; population genetics; phylogenetic network

1

2For humans, whole-genome genotype data is now available for 3

individuals from hundreds of populations (Lazaridis et al. 4

2014; Yunusbayev et al. 2015), opening up the possibility to ask 5

more detailed and complex questions about our history (Pickrell 6

and Reich 2014; Schraiber and Akey 2015), and stimulating the 7

development of new tools for the analysis of the joint history of 8

these populations (Reich et al. 2009; Patterson et al. 2012; Pickrell 9

and Pritchard 2012; Lipson et al. 2013; Ralph and Coop 2013; 10

Hellenthal et al. 2014). A simple and intuitive framework for this 11

purpose that has quickly gained in popularity are the F-statistics, 12

introduced by Reich et al. (2009), and summarized in Patterson 13

et al. (2012). In that framework, inference is based on “shared 14

genetic drift” between sets of populations, under the premise 15

that shared drift implies a shared evolutionary history. Tools 16

based on this framework have quickly become widely used in 17

the study of human genetic history, both for ancient and modern 18

DNA (Green et al. 2010; Reich et al. 2012; Lazaridis et al. 2014; 19

Haak et al. 2015; Allentoft et al. 2015). 20

Some care is required with terminology, as the F-statistics 21

sensu Reich et al. (2009) are distinct, but closely related to 22

Wright’s fixation indices (Wright 1931; Reich et al. 2009), which 23

are also often referred to as F-statistics. Furthermore, it is nec- 24

Copyright © 2016 by the Genetics Society of Americadoi: 10.1534/genetics.XXX.XXXXXXManuscript compiled: Wednesday 3rd February, 2016%[email protected]

essary to distinguish between statistics (quantities calculated 25

from data) and the underlying parameters (which are part of the 26

model)(Weir and Cockerham 1984). 27

In this paper, I will mostly discuss model parameters, and I 28

will therefore refer to them as drift indices. The term F-statistics 29

will be used when referring to the general framework introduced 30

by Reich et al. (2009), and Wright’s statistics will be referred to 31

as FST or f . 32

Most applications of the F-statistic-framework can be phrased 33

in terms of the following six questions: 34

1. (Treeness test): Are populations related in a tree-like fash- 35

ion? (Reich et al. 2009) 36

2. (Admixture test): Is a particular population descended from 37

multiple ancestral populations? (Reich et al. 2009) 38

3. (Admixture proportions): What are the contributions from 39

different populations to a focal population (Green et al. 2010; 40

Haak et al. 2015). 41

4. (Number of founders): How many founder populations are 42

there for a certain region? (Reich et al. 2012; Lazaridis et al. 43

2014) 44

5. (Complex demography): How can mixtures and splits of 45

population explain demography? (Patterson et al. 2012; 46

Lipson et al. 2013) 47

Genetics, Vol. XXX, XXXX–XXXX February 2016 1

Genetics: Early Online, published on February 8, 2016 as 10.1534/genetics.115.183913

Copyright 2016.

Page 2: Admixture, Population Structure and F-Statistics - Genetics · GENETICS | INVESTIGATION Admixture, Population Structure and F-statistics Benjamin M Peter1 1Department of Human Genetics,

A

B

Figure 1 Population phylogeny and admixture graph. A: Apopulation phylogeny with branches corresponding to F2

(green), F3 (yellow) and F(B)4 (blue). B: An admixture graph

extends population phylogenies by allowing gene flow (red,full line) and admixture events (red, dotted).

6. (Closest relative): What is the closest relative to a contem- 48

porary or ancient population (Raghavan et al. 2014) 49

The demographic models under which these questions are 50

addressed, and that motivated the drift indices, are called popula- 51

tion phylogenies and admixture graphs. The population phylogeny 52

(or population tree), is a model where populations are related in 53

a tree-like fashion (Figure 1A), and it frequently serves as the null 54

model for admixture tests. The branch lengths in the population 55

phylogeny correspond to how much genetic drift occurred, so 56

that a branch that is subtended by two different populations can 57

be interpreted as the “shared” genetic drift between these pop- 58

ulations. The alternative model is an admixture graph (Figure 59

1B), which extends the population phylogeny by allowing edges 60

that represent population mergers or a significant exchange of 61

migrants. 62

Under a population phylogeny, the three F-statistics pro- 63

posed by Reich et al. (2009), labelled F2, F3 and F4 have interpre- 64

tations as branch lengths (Fig. 1A) between two, three and four 65

taxa, respectively: Assume populations are labelled as P1, P2, . . . , 66

then 67

• F2(P1, P2) corresponds to the path on the phylogeny from 68

P1 to P2. 69

• F3(PX ; P1, P2) represents the length of the external branch 70

from PX to the (unique) internal vertex connecting all three 71

populations. Thus, the first parameter of F3 has an unique 72

role, whereas the other two can be switched arbitrarily. 73

• F(B)4 (P1, P2; P3, P4) represents the internal branch from the 74

internal vertex connecting P1 and P2 to the vertex connect- 75

ing P3 and P4 (blue). 76

If the arguments are permuted, some F-statistics will have no 77

corresponding internal branch. In particular, it can be shown 78

that in a population phylogeny, one F4-index will be zero, imply- 79

ing that the corresponding internal branch is missing. This is the 80

property that is used in the admixture test. For clarity, I add the 81

superscript F(B)4 if I need to emphasize the interpretation of F4 82

as a branch length, and F(T)4 to emphasize the interpretation as a 83

test-statistic. For details, see the subsection on F4 in the results 84

section. 85

In an admixture graph, there is no longer a single branch 86

length corresponding to each F-statistic, and interpretations are 87

more complex. However, F-statistics can still be thought of 88

as the proportion of genetic drift shared between populations 89

(Reich et al. 2009). The basic idea exploited in addressing all 90

six questions outlined above is that under a tree model, branch 91

lengths, and thus the drift indices, must satisfy some constraints 92

(Buneman 1971; Semple and Steel 2003; Reich et al. 2009). The 93

two most relevant constraints are that i) in a tree, all branches 94

have positive lengths (tested using the F3-admixture test) and ii) 95

in a tree with four leaves, there is at most one internal branch 96

(tested using the F4-admixture test). 97

The goal of this paper is to give a broad overview on the 98

theory, ideas and applications of F-statistics. Our starting point 99

is a brief review on how genetic drift is quantified in general, 100

and how it is measured using F2. I then propose an alterna- 101

tive definition of F2 that allows us to simplify applications, and 102

study them under a wide range of population structure models. 103

I then review some basic properties of distance-based phylo- 104

gentic trees, show how the admixture tests are interpreted in 105

this context, and evaluate their behavior. Many of the results 106

that are highlighted here are implicit in classical (Wahlund 1928; 107

Wright 1931; Cavalli-Sforza and Edwards 1967; Felsenstein 1973; 108

Cavalli-Sforza and Piazza 1975; Felsenstein 1981; Slatkin 1991; 109

Excoffier et al. 1992) and more recent work (Patterson et al. 2012; 110

Pickrell and Pritchard 2012; Lipson et al. 2013), but often not 111

explicitly stated, or given in a different context. 112

Results & Discussion 113

The next sections will discuss the F-statistics, introducing dif- 114

ferent interpretations and giving derivations for some useful 115

expressions. Longer derivations are deferred to the Methods 116

section. A graphical summary of the three interpretations of 117

the statistics is given in Figure 2, and the main formulas are 118

summarized in Table 1. 119

Throughout this paper, populations are labelled as 120

P1, P2, . . . , Pi, . . . . Often, PX will denote a an admixed popu- 121

lation. The allele frequency pi is defined as the proportion of 122

individuals in Pi that carry a particular allele at a biallelic locus, 123

and throughout this paper I assume that all individuals are hap- 124

loid. However, all results hold if instead of haploid individuals, 125

an arbitrary allele of a diploid individual is used. I focus on 126

genetic drift only, and ignore the effects of mutation, selection 127

and other evolutionary forces. 128

2 Benjamin M Peter et al

Page 3: Admixture, Population Structure and F-Statistics - Genetics · GENETICS | INVESTIGATION Admixture, Population Structure and F-statistics Benjamin M Peter1 1Department of Human Genetics,

F2(P1, P2) F3(PX ; P1, P2) F4(P1, P2, P3, P4)

Definition E[(p1 − p2)2] E(pX − p1)(pX − p2) E(p1 − p2)(p3 − p4)

F2 -12(

F2(P1, PX) + F2(P2, PX) −F2(P1, P2)

) 12(

F2(P1, P4) + F2(P2, P3) −F2(P1, P3)− F2(P2, P4)

)Coalescencetimes 2ET12 −ET11 −ET22 ET1X + ET2X −ET12 −ETXX ET14 + ET23 −ET13 −ET24

Variance Var(p1 − p2)Var(pX) + Cov(p1, p2) −Cov(p1, pX)−Cov(p2, pX)

Cov((p1 − p2), (p3 − p4))

Branch length 2Bc −Bd 2Bc −BdBc − Bd or as admixture test: B′d −Bd

Table 1 Summary of equations. A constant of proportionality is omitted for coalescence times and branch lengths. Derivations forF2 are given in the main text, F3 and F4 are a simple result of combining Equation 5 with 10b and 14b. Bc and Bd correspond to theaverage length of the internal branch in a gene genealogy concordant and discordant with the population assignment respectively(see section Gene tree branch lengths).

Figure 2 Interpretation of F-statistics. F-statistics can be interpreted i) as branch lengths in a population phylogeny (PanelsA,E,I,M), the overlap of paths in an admixture graph (Panels B,F,J,N, see also Figure S1), and in terms of the internal branches ofgene-genealogies (see Figures 4, S2 and S3). For gene trees consistent with the population tree, the internal branch contributes pos-itively (Panels C,G,K), and for discordant branches, internal branches contribute negatively (Panels D,H) or zero (Panel L). F4 hastwo possible interpretations; depending on how the arguments are permuted relative to the tree topology, it may either reflect the

length of the internal branch (third row, F(B)4 ), or a test statistic that is zero under a population phylogeny (last row, F(T)

4 ). For theadmixture test, the two possible gene trees contribute to the statistic with different sign, highlighting the similarity to the D-statisticGreen et al. (2010) and its expectation of zero in a symmetric model.

Admixture, Population Structure and F-statistics 3

Page 4: Admixture, Population Structure and F-Statistics - Genetics · GENETICS | INVESTIGATION Admixture, Population Structure and F-statistics Benjamin M Peter1 1Department of Human Genetics,

Measuring genetic drift – F2 129

The purpose of F2 is simply to measure how much genetic drift 130

occurred between two populations, i.e. to measure genetic dis- 131

similarity. For populations P1 and P2, F2 is defined as (Reich et al. 132

2009) 133

F2(P1, P2) = F2(p1, p2) = E(p1 − p2)2. (1)

The expectation is with respect to the evolutionary process, 134

but in practice F2 is estimated from hundreds of thousands of 135

loci across the genome (Patterson et al. 2012), which are assumed 136

to be non-independent replicates of the evolutionary history of 137

the populations. 138

Why is F2 a useful measure of genetic drift? As it is infeasible 139

to observe changes in allele frequency directly, the effect of drift 140

is assessed indirectly, through its impact on genetic diversity. 141

Most commonly, genetic drift is quantified in terms of i) the 142

variance in allele frequency, ii) heterozygosity, iii) probability 143

of identity by descent iv) correlation (or covariance) between 144

individuals and v) the probability of coalescence (two lineages 145

having a common ancestor). In the next sections I show how 146

F2 relates to these quantities in the the cases of a single popula- 147

tion changing through time and a pair of populations that are 148

partially isolated. 149

Single population I assume a single population, measured at 150

two time points (t0 and t), and label the two samples P0 and Pt. 151

Then F2(P0, Pt) can be interpreted in terms of the variances of 152

allele frequencies (Figure 3A, see Equation 21) 153

F2(P0, Pt) = Var(pt)−Var(p0) = Var(pt − p0), (2a)

the expected decrease in heterozygosity Ht, between the two 154

sample times (Figure 3B, see Equation 27): 155

F2(P0, Pt) =EH0 −EHt

2, (2b)

and in terms of the inbreeding coefficient f , which can be in- 156

terpreted as the probability of two individuals in Pt descending 157

from the same ancestor in P0, or, equivalently, the probability 158

that two samples from Pt coalesce before t0. (Figure 3C, see 159

Equation 26): 160

F2(P0, Pt) =12

f EH0, (2c)

Equations 2a-2c can be rearranged to make the connectionbetween other measures of genetic drift and F2 more explicit:

EHt = EH0 − 2F2(P0, Pt) (3a)

= EH0 − 2(Var(pt)−Var(p0)) (3b)

= EH0(1− f ) (3c)

Pairs of populations Equations 3b and 3c describing the decay 161

of heterozygosity are – of course – well known by population 162

geneticists, having been established by Wright (1931). In struc- 163

tured populations, very similar relationships exist when the 164

number of heterozygotes expected from the overall allele fre- 165

quency, Hobs is compared with the number of heterozygotes 166

present due to differences in allele frequencies between popula- 167

tions Hexp (Wahlund 1928; Wright 1931). 168

In fact, already Wahlund showed that for a population madeup of two subpopulations with equal proportions, the propor-tion of heterozygotes is reduced by

Hobs = Hexp − 2(p1 − p2)2

A

B

C

Figure 3 Measures of genetic drift in a single population. In-terpretations of F2 in terms of (A) the increase in allele fre-quency variance, (B) the decrease in heterozygosity and (C) f ,which can be interpreted as probability of coalescence of twolineages, or the probability that they are identical by descent.

from which it is straightforward (see section Two populations inMethods) to derive

F2(P1, P2) =EHexp −EHobs

2(4a)

= Var(p1 − p2) (4b)

=12

FSTEHobs. (4c)

This last equation served as the original motivation of F2 (Reich 169

et al. 2009), where it was first defined as a numerator of FST . 170

Justification for F2 The preceding arguments show how the us- 171

age of F2 for both single and structured populations can be justi- 172

fied by the similar effects on heterozygosity and allele frequency 173

variance F2 measures. However, what is the benefit of using F2 174

instead of the established inbreeding coefficient f and fixation 175

index FST? Recall that Wright motivated f and FST as correlation 176

coefficients between alleles (Wright 1921, 1931). Correlation co- 177

efficients have the advantage that they are easy to interpret, as, 178

e.g. FST = 0 implies panmixia and FST = 1 implies complete 179

divergence between subpopulations. In contrast, F2 depends 180

on allele frequencies and is highest for intermediate frequency 181

alleles. However, F2 has an interpretation as a covariance, making 182

it simpler and mathematically more convenient to work with. In 183

particular, variances and covariances are frequently partitioned 184

into components due to different effects using techniques such 185

as analysis of variance and analysis of covariance (e.g. Excoffier 186

et al. 1992). 187

F2 as branch length Reich et al. (2009) and Patterson et al. (2012) 188

proposed to partition “drift” (as previously established, char- 189

acterized by allele frequency variance, or decrease in heterozy- 190

gosity) between different populations into contribution on the 191

different branches of a population phylogeny. This model has 192

4 Benjamin M Peter et al

Page 5: Admixture, Population Structure and F-Statistics - Genetics · GENETICS | INVESTIGATION Admixture, Population Structure and F-statistics Benjamin M Peter1 1Department of Human Genetics,

been studied by Cavalli-Sforza and Edwards (1967) and Felsen- 193

stein (1973) in the context of a Brownian motion process. In this 194

model, drift on independent branches is assumed to be indepen- 195

dent, meaning that the variances can simply be added. This is 196

what is referred to as the additivity property of F2 (Patterson et al. 197

2012). 198

To illustrate the additivity property, consider two populationsP1 and P2 that split recently from a common ancestral populationP0 (Figure 2A). In this case, p1 and p2 are assumed to be inde-pendent conditional on p0, and therefore Cov(p1, p2) = Var(p0).Then, using 2a and 4b,

F2(P1, P2) = Var(p1 − p2) = Var(p1) + Var(p2)− 2Cov(p1, p2)

= Var(p1) + Var(p2)− 2Var(p0)

= F2(P1, P0) + F2(P2, P0).

Alternative proofs of this statement and more detailed reasoning 199

behind the additivity assumption can be found in Cavalli-Sforza 200

and Edwards (1967); Felsenstein (1973); Reich et al. (2009) or 201

Patterson et al. (2012). 202

Lineages are not independent in an admixture graph, and 203

so this approach cannot be used.Reich et al. (2009) approached 204

this by conditioning on the possible population trees that are 205

consistent with an admixture scenario. In particular, they pro- 206

posed a framework of counting the possible paths through the 207

graph (Reich et al. 2009; Patterson et al. 2012). An example of 208

this representation for F2 in a simple admixture graph is given 209

in Figure S1, with the result summarized in Figure 2B. Detailed 210

motivation behind this visualization approach is given in Ap- 211

pendix 2 of Patterson et al. (2012). In brief, the reasoning is as 212

follows: Recall that F2(P1, P2) = E(p1 − p2)(p1 − p2), and inter- 213

pret the two terms in parentheses as two paths between P1 and 214

P2, and F2 as the overlap of these two paths. In a population 215

phylogeny, there is only one possible path, and the two paths 216

are always the same, therefore F2 is the sum of the length of all 217

the branches connecting the two populations. However, if there 218

is admixture, as in Figure 2B, both paths choose independently 219

which admixture edge they follow. With probability α they will 220

go left, and with probability β = 1 − α they go right. Thus, 221

F2 can be interpreted by enumerating all possible choices for 222

the two paths, resulting in three possible combinations of paths 223

on the trees (Figure S1), and the branches included will differ 224

depending on which path is chosen, so that the final F2 is made 225

up average of the path overlap in the topologies, weighted by 226

the probabilities of the topologies. 227

However, one drawback of this approach is that it scales 228

quadratically with the number of admixture events, making cal- 229

culations cumbersome when the number of admixture events is 230

large. More importantly, this approach is restricted to panmictic 231

subpopulations, and cannot be used when the population model 232

cannot be represented as a weighted average of trees. 233

Gene tree interpretation For this reason, I propose to redefine 234

F2 using coalescent theory (Wakeley 2009). Instead of allele 235

frequencies on a fixed admixture graph, coalescent theory tracks 236

the ancestors of a sample of individuals, tracing their history 237

back to their most recent common ancestor. The resulting tree is 238

called a gene tree (or coalescent tree). Gene trees vary between 239

loci, and will often have a different topology from the population 240

phylogeny, but they are nevertheless highly informative about 241

a population’s history. Moreover, expected coalescence times 242

and expected branch lengths are easily calculated under a wide 243

array of neutral demographic models. 244

In a seminal paper, Slatkin (1991) showed how FST can beinterpreted in terms of the expected coalescence times of genetrees:

FST =ETB −ETW

ETB,

where ETB and ETW are the expected coalescence times of two 245

lineages sampled in two different and the same population, 246

respectively. 247

Unsurprisingly, given the close relationship between F2 and 248

FST , an analogous expression exists for F2(P1, P2): Define T12 as 249

the coalescence time for one lineage each sampled from popula- 250

tions P1 and P2, and T11, T22 as the expected coalescence times 251

for two samples from P1 and P2, respectively. Furthermore, let 252

θ = 4Nµ be the standard mutation parameter. Then F2 can be 253

written as 254

F2(P1, P2) = θ

(ET12 −

ET11 + ET222

). (5)

The derivation of this result is given in the section Derivation of 255

F2 for gene trees. 256

Unlike FST , the mutation parameter θ does not cancel. How- 257

ever, for most applications, the absolute magnitude of F2 is of 258

little interest, since only the sign of the statistics are used for 259

most tests. In other applications F-statistics with presumably 260

the same θ (Reich et al. 2009) are compared. In these cases, θ 261

can be regarded as a constant of proportionality, and will not 262

change the theoretical properties of the F-statistics. It will, how- 263

ever, influence statistical properties, as a larger θ implies more 264

mutations and hence more data. 265

For estimation, the average number of pairwise differences 266

πij is a commonly used estimator for θTij (Tajima 1983). Thus, a 267

natural estimator for F2 is 268

F2(P1, P2) = π12 −π11 + π22

2. (6)

It is straightforward to verify (Section Equivalence of estima- 269

tors) that this estimator of F2 is numerically equivalent to the 270

unbiased estimator proposed by Reich et al. (2009) in terms of 271

the sample allele frequency pi and the sample size ni: 272

F2(P1, P2) = ( p1 − p2)2 − p1(1− p1)

n1 − 1− p2(1− p2)

n2 − 1. (7)

However, it is worth pointing out that the modeling assump- 273

tions are different: The original definition only considered loci 274

that were segregating in an ancestral population; loci not seg- 275

regating there were discarded. Since ancestral populations are 276

usually unsampled, this is often replaced by ascertainment in an 277

outgroup (Patterson et al. 2012; Lipson et al. 2013). In contrast, 278

Equation 6 assumes that all markers are used, which is the more 279

natural interpretation for sequence data. 280

Gene tree branch lengths An important feature of Equation 5 is 281

that it only depends on the expected coalescence times between 282

pairs of lineages. Thus, the behavior of F2 can be fully charac- 283

terized by considering a sample of size four, with two random 284

individuals taken from each population. This is all that is needed 285

to study the joint distribution of T12, T11 and T22, and hence F2. 286

By linearity of expectation, larger samples can be accommodated 287

by summing the expectations over all possible quartets. 288

For a sample of size four with two pairs, there are only two 289

possible unrooted tree topologies. One, where the lineages 290

from the same population are more closely related to each other 291

Admixture, Population Structure and F-statistics 5

Page 6: Admixture, Population Structure and F-Statistics - Genetics · GENETICS | INVESTIGATION Admixture, Population Structure and F-statistics Benjamin M Peter1 1Department of Human Genetics,

C. Discordant genealogy

B. Concordant genealogy

A. Equation

Figure 4 Schematic explanation how F2 behaves conditionedon gene tree. Blue terms and branches correspond to positivecontributions, whereas red branches and terms are subtracted.Labels represent individuals randomly sampled from thatpopulation. External branches cancel out, so only the internalbranches have non-zero contribution to F2. In the concordantgenealogy (Panel B), the contribution is positive (with weight2), and in the discordant genealogy (Panel C), it is negative(with weight 1). The mutation rate as constant of proportional-ity is omitted.

(called concordant topology, T (2)c ) and one where lineages from 292

different populations coalesce first (which I will refer to as discor- 293

danat topology T (2)d ). The superscripts refers to the topologies 294

being for F2, and I will discard them in cases where no ambiguity 295

arises. 296

Conditioning on the topology yields

F2(P1, P2) =E[F2(P1, P2)|T

]=P(Tc)E[F2(P1, P2)|Tc] + P(Td)E[F2(P1, P2)|Td].

Figure 4 contains graphical representations for 297

E[F2(P1, P2)|Tc] (Panel B) and E[F2(P1, P2)|Td] (Panel C), 298

respectively. 299

In this representation, T12 corresponds to a path from a ran- 300

dom individual from P1 to a random individual from P2, and 301

T11 represents the path between the two samples from P1. 302

For Tc the internal branch is always included in T12, but never 303

in T11 or T22. External branches, on the other hand, are included 304

with 50% probability in T12 on any path through the tree. T11 305

and T22, on the other hand, only consist of external branches, 306

and the lengths of the external branches cancel. 307

On the other hand, for Td, the internal branch is always in- 308

cluded in T11 and T22, but only half the time in T12. Thus, they 309

contribute negatively to F2, but only with half the magnitude of 310

Tc. As for Tc, each T contains exactly two external branches, and 311

so they will cancel out. 312

An interesting way to represent F2 is therefore in terms of the 313

internal branches over all possible gene genealogies. Denote the 314

unconditional average length of the internal branch of Tc as Bc, 315

and the average length of the internal branch in Td as Bd. Then, 316

F2 can be written in terms of these branch lengths as 317

F2(P1, P2) = θ(2Bc − 1Bd

), (8)

resulting in the representation given in Figure 2C-D. 318

As a brief sanity check, consider the case of a population 319

without structure. In this case, the branch length is independent 320

of the topology and Td is twice as likely as Tc and hence Bd = 321

2Bc, from which it follows that F2 will be zero, as expected in a 322

randomly mating population 323

This argument can be transformed from branch lengths to ob- 324

served mutations by recalling that mutations occur on a branch 325

at a rate proportional to its length. F2 is increased by doubletons 326

that support the assignment of populations (i.e. the two lineages 327

from the same population have the same allele), but reduced by 328

doubletons shared by individuals from different populations. 329

All other mutations have a contribution of zero. 330

Testing treeness 331

Many applications consider tens or even hundreds of popula- 332

tions simultaneously (Patterson et al. 2012; Pickrell and Pritchard 333

2012; Haak et al. 2015; Yunusbayev et al. 2015), with the goal to in- 334

fer where and between which populations admixture occurred. 335

Using F-statistics, the approach is to interpret F2(P1, P2) as a 336

measure of dissimilarity between P1 and P2, as a large F2-value 337

implies that populations are highly diverged. Thus, the strat- 338

egy is to calculate all pairwise F2 indices between populations, 339

combine them into a dissimilarity matrix, and ask if that matrix is 340

consistent with a tree. 341

One way to approach this question is by using phylogenetic 342

theory: Many classical algorithms have been proposed that use 343

a measure of dissimilarity to generate a tree (Fitch et al. 1967; 344

Saitou and Nei 1987; Semple and Steel 2003; Felsenstein 2004), 345

and what properties a general dissimilarity matrix needs to have 346

in order to be consistent with a tree (Buneman 1971; Cavalli- 347

Sforza and Piazza 1975), in which case the matrix is also called a 348

tree metric (Semple and Steel 2003). Thus, testing for admixture 349

can be thought of as testing treeness. 350

For a dissimilarity matrix to be consistent with a tree, there 351

are two central properties it needs to satisfy: First, the length of 352

all branches has to be positive. This is strictly not necessary for 353

phylogenetic trees, and some algorithms may return negative 354

branch lengths (e.g. Saitou and Nei 1987); however, since in our 355

case branches have an interpretation of genetic drift, negative 356

genetic drift is biologically nonsensical, and therefore negative 357

branches should be interpreted as a violation of the modeling 358

assumptions, and hence treeness. 359

The second property of a tree metric important in the presentcontext is a bit more involved: A dissimilarity matrix (written interms of F2) is consistent with a tree if for any four populationsPi, Pj, Pk and Pl ,

F2(Pi, Pj) + F2(Pk, Pl) ≤

max(

F2(Pi, Pk) + F2(Pj,Pl), F2(Pi, Pl) + F2(Pj, Pk)

)(9)

that is, if the sums of pairs of distances are compared, two of 360

these sums will be the same, and no smaller than the third. 361

This theorem, due to Buneman (1971, 1974) is called the four- 362

point condition or sometimes, more modestly, the “fundamental 363

theorem of phylogenetics”. A proof can be found in chapter 7 of 364

Semple and Steel (2003). 365

Informally, this statement can be understood by noticing that 366

on a tree, two of the pairs of distances will include the internal 367

branch, whereas the third one will not, and therefore be shorter. 368

Thus, the four-point condition can be colloquially rephrased as 369

“any four taxa tree has at most one internal branch”. 370

Why are these properties useful? It turns out that the ad- 371

mixture tests based on F-statistics can be interpreted as tests 372

of these properties: The F3-test can be interpreted as a test for 373

6 Benjamin M Peter et al

Page 7: Admixture, Population Structure and F-Statistics - Genetics · GENETICS | INVESTIGATION Admixture, Population Structure and F-statistics Benjamin M Peter1 1Department of Human Genetics,

the positivity of a branch; and the F4 as a test of the four-point 374

condition. Thus, the working of the two test statistics can be 375

interpreted in terms of fundamental properties of phylogenetic 376

trees, with the immediate consequence that they may be applied 377

as treeness-tests for arbitrary dissimilarity matrices. 378

An early test of treeness, based on a likelihood ratio, was 379

proposed by Cavalli-Sforza and Piazza (1975): They compared 380

the likelihood of the observed F2-matrix matrix to that induced 381

by the best fitting tree (assuming Brownian motion), rejecting 382

the null hypothesis if the tree-likelihood is much lower than 383

that of the empirical matrix. In practice, however, finding the 384

best-fitting tree is a challenging problem, especially for large 385

trees Felsenstein (2004) and so the likelihood test proved to be 386

difficult to apply. From that perspective, the F3 and F4-tests 387

provide a convenient alternative: Since treeness implies that all 388

subsets of taxa are also trees, the ingenious idea of Reich et al. 389

(2009) was that rejection of treeness for subtrees of size three 390

(for F3) and four (for F4) is sufficient to reject treeness for the 391

entire tree. Furthermore, tests on these subsets also pinpoint the 392

populations involved in the non-tree-like history. 393

F3 394

In the previous section, I showed how F2 can be interpreted 395

as a branch length, as an overlap of paths, or in terms of gene 396

trees (Figure 2). Furthermore, I gave expressions in terms of 397

coalescence times, allele frequency variances and internal branch 398

lengths of gene trees. In this section, I give analogous results for 399

F3. 400

Reich et al. (2009) defined F3 as: 401

F3(PX ; P1, P2) = F3(pX ; p1, p2) = E(pX − p1)(pX − p2) (10a)

with the goal to test whether PX is admixed. Recalling the path 402

interpretation detailed in Patterson et al. (2012), F3 can be inter- 403

preted as the shared portion of the paths from PX to P1 with the 404

path from PX to P1. In a population phylogeny (Figure 2E) this 405

corresponds to the branch between PX and the internal node. 406

Equivalently, F3 can also be written in terms of F2 (Reich et al. 407

2009): 408

F3(PX ; P1, P2) =12

(F2(PX , P1) + F2(PX , P2)− F2(P1, P2)

).

(10b)If F2 in Equation 10b is generalized to an arbitrary tree metric, 409

Equation 10b is known as the Gromov product (Semple and 410

Steel 2003) in phylogenetics. The Gromov product is a com- 411

monly used operation in classical phylogenetic algorithms to 412

calculate the length of the portion of a branch shared between P1 413

and P2 (Fitch et al. 1967; Felsenstein 1973; Saitou and Nei 1987); 414

consistent with the notion that F3 is the length of an external 415

branch in a population phylogeny. 416

In an admixture graph, there is no longer a single external 417

branch; instead all possible trees have to be considered, and F3 418

is the (weighted) average of paths through the admixture graph 419

(Figure 2F). 420

Combining Equations 5 and 10b gives an expression of F3 in 421

terms of expected coalescence times: 422

F3(PX ; P1, P2) =θ

2(ET1X + ET2X −ET12 −ETXX) . (10c)

Similarly, an expression in terms of variances is obtained by 423

combining Equation 2a with 10b: 424

F3(PX ; P1, P2) =

Var(pX) + Cov(p1, p2)−Cov(p1, pX)−Cov(p2, pX).(10d)

which was also noted by Pickrell and Pritchard (2012). 425

Outgroup-F3 statistics A simple application of the interpreta- 426

tion of F3 as a shared branch length are the “outgroup”-F3- 427

statistics proposed by Raghavan et al. (2014). For an unknown 428

population PU , they wanted to find the most closely related pop- 429

ulation from a panel of k extant populations {Pi, i = 1, 2, . . . k}. 430

They did this by calculating F3(PO; PU , Pi), where PO is an out- 431

group population that was assumed widely diverged from PU 432

and all populations in the panel. This measures the shared drift 433

(or shared branch) of PU with the populations from the panel, 434

and high F3-values imply close relatedness. 435

However, using Equation 10c, the outgroup-F3-statistic can 436

be written as 437

F3(PO; PU , Pi) ∝ ETUO + ETiO −ETUi −ETOO. (11)

Out of these four terms, ETUO and ETOO do not depend on Pi. 438

Furthermore, if PO is truly an outgroup, then all ETiO should 439

be the same, as pairs of individuals from the panel population 440

and the outgroup can only coalesce once they are in the joint an- 441

cestral population. Therefore, only the term ETUi is expected to 442

vary between different panel populations, suggesting that using 443

the number of pairwise differences, πUi, is largely equivalent to 444

using F3(PO; PU , Pi). I confirm this in Figure 5A by calculating 445

outgroup-F3 and πiU for a set of increasingly divergent popu- 446

lations, with each population having its own size, sample size 447

and sequencing error probability. Linear regression confirms 448

the visual picture that πiU has a higher correlation with diver- 449

gence time (R2 = 0.90) than F3 (R2 = 0.73). Hence, the number 450

of pairwise differences may be a better metric for population 451

divergence than F3. 452

F3 admixture test However, F3 is motivated and primarily used 453

as an admixture test (Reich et al. 2009). In this context, the null 454

hypothesis is that F3 is non-negative, i.e. the null hypothesis 455

is that the data is generated from a phylogenetic tree that has 456

positive edge lengths. If this is not the case, the null hypothesis 457

is reject in favor of the more complex admixture graph. From 458

Figure 2F it may be seen that drift on the path on the internal 459

branches (red) contributes negatively to F3. If these branches are 460

long enough compared to the branch after the admixture event 461

(blue), then F3 will be negative. For the simplest scenario where 462

PX is admixed between P1 and P2, Reich et al. (2009) provided 463

a condition when this is the case (Equation 20 in Supplement 464

2 of Reich et al. 2009). However, since this condition involves 465

F-statistics with internal, unobserved populations, it cannot 466

be used in practical applications. A more useful condition is 467

obtained using Equation 10c. 468

In the simplest admixture model, an ancestral population 469

splits into P1 and P2 at time tr. At time t1, the populations mix to 470

form PX , such that with probability α, individuals in PX descend 471

from individuals from P1, and with probability (1− α), they 472

descend from P2 (See Fig. 7 for an illustration). In this case, 473

F3(PX ; P1, P2) is negative if 474

1(1− cx)

t1tr

< 2α(1− α), (12)

Admixture, Population Structure and F-statistics 7

Page 8: Admixture, Population Structure and F-Statistics - Genetics · GENETICS | INVESTIGATION Admixture, Population Structure and F-statistics Benjamin M Peter1 1Department of Human Genetics,

−0.0

02

0.0

00

0.0

02

0.0

04

0.0

06

0.0

08

Figure 5 Simulation results. A: Outgroup-F3 statistics (yellow) and πiU(white) for a panel of populations with linearly increasingdivergence time. Both statistics are scaled to have the same range, with the first divergence between the most closely related popu-lations set to zero. F3 is inverted, so that it increases with distance. B: Simulated (boxplots) and predicted(blue) F3-statistics under asimple admixture model. C: Comparison of F4-ratio (yellow triangles, Equation 18) and ratio of differences(black circles, (Equation20).

where cx is the probability two individuals sampled in PX have acommon ancestor before t1. For a randomly mating populationwith changing size N(t),

cx = 1− exp(−∫ t1

0

1N(s)

ds)

.

Thus, the power F3 to detect admixture is large 475

1. if the admixture proportion is close to fifty percent 476

2. if the ratio between the times of the original split to the time 477

of secondary contact is large, and 478

3. if the probability of coalescence before the admixture event 479

in PX is small, i.e. the size of PX is large. 480

A more general condition for negativity of F3 is obtained 481

by considering the internal branches of the possible gene tree 482

topologies, analogously to that given for F2 in the section Gene 483

tree branch lengths. Since Equation 10c includes ETXX , only two 484

individuals from PX are needed, and one each from P1 and P2 485

to study the joint distribution of all terms in (10c). The minimal 486

case therefore contains again just four samples (Figure S2). 487

Furthermore, P1 and P2 are exchangeable, and thus there are 488

again just two distinct gene genealogies, a concordant one T (3)c 489

where the two lineages from PX are most closely related, and a 490

discordant genealogy T (3)d where the lineages from PX merge 491

first with the other two lineages. A similar argument as that 492

for F2 shows (presented in Figure S2) that F3 can be written as a 493

function of just the internal branches in the topologies: 494

F3(PX ; P1, P2) = θ(2Bc −Bd

), (13)

where Bc and Bd are the lengths of the internal branches in T (3)c 495

and T (3)d , respectively, and similar to F2, concordant branches 496

have twice the weight of discordant ones. Again, the case of 497

all individuals coming from a single populations serves as a 498

sanity check: In this case Td is twice as likely as Tc, and all 499

branches are expected to have the same length, resulting in F3 500

being zero. However, for F3 to be negative, notice that Bd needs 501

to be more than two times longer than Bc. Since mutations 502

are proportional to Bd and Bc, F3 can be interpreted as a test 503

whether mutations that agree with the population tree are more 504

than twice as common than mutations that disagree with it. 505

I performed a small simulation study to test the accuracy of 506

Equation 12. Parameters were chosen such that F3 has a negative 507

expectation for α > 0.05, and I find that the predicted F3 fit very 508

well with the simulations (Figure 5B). 509

F4 510

The second admixture statistic, F4, is defined as (Reich et al. 511

2009): 512

F4(P1, P2; P3, P4) = F4(p1, p2; p3, p4) = E[(p1 − p2)(p3 − p4)].(14a)

Similarly to F3, F4 can be written as a linear combination of F2: 513

F4(P1, P2; P3, P4) =

12

(F2(P1, P4) + F2(P2, P3)− F2(P1, P3)− F2(P2, P4)

).

(14b)

which leads to 514

F4(P1, P2; P3, P4) =θ

2

(ET14 + ET23 −ET13 − BET24

). (14c)

8 Benjamin M Peter et al

Page 9: Admixture, Population Structure and F-Statistics - Genetics · GENETICS | INVESTIGATION Admixture, Population Structure and F-statistics Benjamin M Peter1 1Department of Human Genetics,

As four populations are involved, there are 4! = 24 possi-ble ways of arranging the populations in Equation 14a. How-ever, there are four possible permutations of arguments that willlead to identical values, leaving only six unique F4-values forany four populations. Furthermore, these six values come inpairs that have the same absolute value, and a different sign,(i. e. F4(P1, P2; P3, P4) = −F4(P1, P2; P4, P3)), leaving only threeunique absolute values, which correspond to the tree possibletree topologies. Out of these three, one F4 can be written as thesum of the other two, leaving just two independent possibilities:

F4(P1, P2; P3, P4) + F4(P1, P3; P2, P4) = F4(P1, P4; P2, P3).

As for F3, Equation 14b can be generalized by replacing F2 515

with an arbitrary tree metric. In this case, Equation 14b is known 516

as a tree split (Buneman 1971), as it measures the length of the 517

overlap of the branch lengths between the two pairs. As there 518

are two independent F4-indices for a fixed tree, there are two 519

different interpretations for the F4-indices. Consider the tree 520

from Fig. 1A: F4(P1, P2; P3, P4) can be interpreted as the overlap 521

between the paths from P1 to P2 and from P3 to P4. However, 522

these paths do not overlap in Fig. 1A, and therefore F4 = 0. 523

This is how F4 is used as a test statistic. On the other hand, 524

F4(P1, P3; P2, P4) measures the overlap between the paths from 525

P1 to P3 and from P2 to P4), which is the internal branch in Fig. 526

1A, and will be positive. 527

It is cumbersome that the interpretation of F4 depends onthe ordering of its arguments. In order to make our intentionclear, instead of switching the arguments around for the twointerpretations, I introduce the superscripts (T) (for test) and(B) (for branch length):

F(T)4 (P1, P2; P3, P4) = F4(P1, P2; P3, P4) (15a)

F(B)4 (P1, P2; P3, P4) = F4(P1, P3; P2, P4) (15b)

Tree splits, and hence F4, are closely related to the four-point 528

condition (Buneman 1971, 1974), which, informally, states that a 529

(sub-)tree with four populations will have at most one internal 530

branch. Thus, if data is consistent with a tree, F(B)4 will be the 531

length of that branch, and F(T)4 will be zero. In Figure 2, the third 532

row (Panels I-L) correspond to the internal branch, and the last 533

row (Panels M-P) to the “zero”-branch. 534

Thus, in the context of testing for admixture, testing that F4 535

is zero is equivalent to checking whether there is in fact only 536

a single internal branch. If that is not the case, the population 537

phylogeny is rejected. 538

Gene trees Evaluating F4 in terms of gene trees and their inter- 539

nal branches, there are three different gene tree topologies that 540

have to be considered, whose interpretation depends depending 541

on whether the branch length or test-statistic interpretation are 542

considered. 543

For the branch length (F(B)4 ), the gene tree corresponding to 544

the population tree has a positive contribution to F4, and the 545

other two possible trees have a zero and negative contribution, 546

respectively (Figure S3). Since the gene tree corresponding to 547

the population tree is expected to be most frequent, F4 will be 548

positive, and can be written as 549

F(B)4 = θ(Bc −Bd). (16)

This equation is slightly different than those for F2 and F3, where 550

the coefficient for the discordant genealogy was half that for the 551

concordant genealogy. Note, however, that F4 only includes one 552

of the two discordant genealogies. Under a tree, both discordant 553

genealogies are equally likely (Durand et al. 2011), and thus the 554

expectation of F4 will be the same. 555

In contrast, for the admixture test statistic (F(T)4 ), the contribu- 556

tion of the concordant genealogy will be zero, and the discordant 557

genealogies will contribute with coefficients −1 and +1, respec- 558

tively. Under the population phylogeny, these two gene trees 559

will be equally likely (Wakeley 2009), and thus the expectation 560

of F4 as a test statistic 561

F(T)4 = θ(Bd −B′d) (17)

is zero under the null hypothesis. Furthermore, the statistic isclosely related to the ABBA-BABA or D-statistic also used totest for admixture (Green et al. 2010; Durand et al. 2011), whichincludes a normalization term and conditions on alleles beingderived. In our notation the expectation of D is ,

E[D] =B′d −Bd

B′d + Bd

thus, if F(T)4 has an expectation of zero, so will D. 562

F4 as a branch length 563

Rank test Two major applications of F4 use its interpretation 564

as a branch length. First, the rank of a matrix of all F4-statistics 565

is used to obtain a lower bound on the number of admixture 566

events required to explain data (Reich et al. 2012). The principal 567

idea of this approach is that the number of internal branches 568

in a genealogy is bounded to be at most n− 3 in an unrooted 569

tree. Since each F4 is a sum of the length of tree branches, all F4- 570

indices should be sums of n− 3 branches, or n− 3 independent 571

components. This implies that the rank of the matrix (see e.g. 572

Section 4 in McCullagh 2009) is at most n − 3, if the data is 573

consistent with a tree. However, admixture events may increase 574

the rank of the matrix, as they add additional internal branches 575

(Reich et al. 2012). Therefore, if the rank of the matrix is r, the 576

number of admixture events is at least r− n + 3. 577

One issue is that the full F4-matrix has size (n2) × (n

2), and 578

may thus become rather large. Furthermore, in many cases 579

only admixture events in a certain part of the phylogeny are 580

of interest. To estimate the minimum number of admixture 581

events on a particular branch of the phylogeny, Reich et al. (2012) 582

proposed to find two sets of test populations S1 and S2, and 583

two reference populations for each set R1 and R2 that are pre- 584

sumed unadmixed (see Figure 6A). Assuming a phylogeny, all 585

F(B)4 (S1, R1; S2, R2) will measure the length of the same branch, 586

and all F(T)4 (S1, R1; S2, R2) should be zero. Since each admixture 587

event introduces at most one additional branch, the rank of the 588

resulting matrix will increase by at most one, and the rank of 589

either the matrix of all F(T)4 , or the matrix of all F(B)

4 may reveal 590

the number of branches of that form. 591

Admixture proportion The second application is by comparing 592

branches between closely related populations to obtain an esti- 593

mate of mixture proportion, or how much two focal populations 594

correspond to an admixed population. (Green et al. 2010): 595

α =F4(PO, PI ; PX , P1)

F4(PO, PI ; P2, P1)(18)

Here, PX , is the population whose admixture proportion is esti- 596

mated, P1 and P2 are the potential contributors, where I assume 597

Admixture, Population Structure and F-statistics 9

Page 10: Admixture, Population Structure and F-Statistics - Genetics · GENETICS | INVESTIGATION Admixture, Population Structure and F-statistics Benjamin M Peter1 1Department of Human Genetics,

B CA

Figure 6 Applications of F4: A: Visualization of rank test to estimate the number of admixture events. F4(S1, R1, S2, R2) measuresa branch absent from the phylogeny and should be zero for all populations from S1 and S2. B: Model underlying admixture ratioestimate Green et al. (2010). PX splits up, and the mean coalescence time of PX with PI gives the admixture proportion. C: If themodel is violated, α measures where on the internal branch in the underlying genealogy PX (on average) merges

that they contribute with proportions α and 1− α, respectively. 598

and PO, PI are reference populations with no direct contribution 599

to PX (see Figure 6B). PI has to be more closely related to one of 600

P1 or P2 than the other, and PO is an outgroup. 601

The canonical way (Patterson et al. 2012) to interpret this ratio 602

is as follows : the denominator is the branch length from the 603

common ancestor population from PI and P1 to the common 604

ancestor of PI with P2. (Figure 6C, yellow line), The numerator 605

has a similar interpretation as an internal branch (red dotted 606

line). In an admixture scenarios, (Figure 6B), this is not unique, 607

and is replaced by a linear combination of lineages merging 608

at the common ancestor of PI and P1 (with probability α), and 609

lineages merging at the common ancestor of PI with P2 (with 610

probability 1− α). 611

Thus, a more general interpretation is that α measures how 612

much closer the common ancestor of PX and PI is to the common 613

ancestor of PI and P1 and the common ancestor of PI and P2, 614

indicated by the gray dotted line in Figure 6B. This quantity is de- 615

fined also when the assumptions underlying the admixture test 616

are violated, and if the assumptions are not carefully checked, 617

might lead to misinterpretations of the data. In particular, α is 618

well-defined in cases where no admixture occurred, or in cases 619

where either of P1 and P2 did not experience any admixture. 620

Furthermore, it is evident from Figure 6 that if all populations 621

are sampled at the same time, ETOX = ETO1 = ETO2 = ETOI , 622

and therefore, 623

α =ETI1 −ETIXETI1 −ETI2

. (19)

Thus, 624

α =πI1 − πIXπI1 − πI2

(20)

is another estimator for α that can be used even if no outgroup is 625

available. I compare Equations 18 and 20 for varying admixture 626

proportions in Figure 5C using the mean absolute error in the 627

admixture proportion. Both estimators perform very well, but 628

(20) performs slightly better in cases where the admixture pro- 629

portion is low. However, in most cases this minor improvement 630

possibly does not negate the drawback that Equation 20 is only 631

applicable when populations are sampled at the same time. 632

It is an area of recent development how these estimates can be 633

extended to more populations. A simple approach is to assume 634

a fixed series of admixture events, in which case admixture 635

proportions for each event can be extracted from a series of 636

F4-ratios (Lazaridis et al. 2014, , SI 13). A more sophisticated 637

approach estimates mixture weights using the rank of the F4- 638

matrix, as discussed in the Rank test section (Haak et al. 2015, SI 639

10). Then, it is possible to estimate mixture proportions, using 640

a model similar to that introduced in the program structure 641

(Pritchard et al. 2000), by obtaining a low-rank approximation 642

for the F4-matrix. 643

Population structure models 644

For practical purposes, it is useful to know how the admixture 645

tests perform under demographic models different from popula- 646

tion phylogenies and admixture graphs, and in which cases the 647

assumptions made for the tests are problematic. In other words, 648

under which demographic models is population structure dis- 649

tinguishable from a tree? Equation 5 enables the derivation of 650

expectations for F3 and F4 under a wide variety of models of 651

population structure (Figure 7). The simplest case is that of a 652

single panmictic population. In that case, all F-statistics have 653

an expectation of zero, consistent with the assumption that no 654

structure and therefore no population phylogeny exists. Under 655

island models, F4 is also zero, and F3 is inversely proportional 656

to the migration rate. Results are similar under a hierarchical 657

island model, except that the number of demes has a small ef- 658

fect. This corresponds to a population phylogeny that is star-like 659

and has no internal branches, which is explained by the strong 660

symmetry of the island model. Thus, looking at different F3 and 661

F4-statistics may be a simple heuristic to see if data is broadly 662

consistent with an island model; if F3-values vary a lot between 663

populations, or if F4 is substantially different from zero, an is- 664

land model might be a poor choice. When looking at a finite 665

stepping stone model, F3 and F4 are both non-zero, highlighting 666

that F4 (and the ABBA-BABA-D-statistic) is susceptible to mi- 667

gration between any pair of populations. Thus, for applications, 668

F4 should only be used as an admixture test if there is good 669

evidence that gene flow between some pairs of the populations 670

was severely restricted. A hierarchical stepping stone model, 671

where demes are combined into populations, is the only case I 672

studied (besides the admixture graph) where F3 can be negative. 673

This effect indicates that admixture and population structure 674

models may be the two sides of the same coin: admixture is a 675

10 Benjamin M Peter et al

Page 11: Admixture, Population Structure and F-Statistics - Genetics · GENETICS | INVESTIGATION Admixture, Population Structure and F-statistics Benjamin M Peter1 1Department of Human Genetics,

(temporary) reduction in gene flow between individuals from 676

the same population. Finally, for a simple serial founder model 677

without migration, I find that F3 measures the time between 678

subsequent founder events. 679

Estimation and Testing 680

In this paper, I focused almost exclusively on the theoretical 681

properties of the F-statistics, glancing over the statistical prob- 682

lems of how they are estimated. Many procedures are imple- 683

mented in the software package ADMIXTOOLS and described in 684

the accompanying paper (Patterson et al. 2012). Alternatively, 685

the software package treemix (Pickrell and Pritchard 2012) con- 686

tains lightwight alternatives for calculating F3 and F4 statistics. 687

Both use a block-jackknife approach to estimate standard errors, 688

taking linkage between markers into account. 689

Conclusions 690

There are three main ways to interpret F-statistics: In the sim- 691

plest case, they represent branches in a population phylogeny. In 692

the case of an admixture graph, the idea of shared drift in terms 693

of paths is most convenient. Finally, the expressions in terms 694

of coalescence times and the lengths of the internal branches of 695

gene genealogies are useful for more complex scenarios. This 696

last interpretation makes the connection to the ABBA-BABA- 697

statistic explicit, and allows the investigation of the behavior of 698

the F-statistics under arbitrary demographic models. 699

If drift indices exist for two, three and four populations, 700

should there be corresponding quantities for five or more popu- 701

lations (e.g. Pease and Hahn 2015)? Two of the interpretations 702

speak against this possibility: First, a population phylogeny can 703

be fully characterized by internal and external branches, and it 704

is not clear how a five-population statistic could be written as a 705

meaningful branch length. Second, all F-statistics can be written 706

in terms of four-individual trees, but this is not possible for five 707

samples. This seems to suggest that there may not exist a five- 708

population statistic as general as the three F-statistics I discussed 709

here, but they will still be useful for questions pertaining to a 710

specific demographic model. 711

A well-known drawback of F3 is that it may have a positive 712

expectation under some admixture scenarios (Patterson et al. 713

2012). Here, I showed that F3 is positive if and only if the branch 714

supporting the population tree is longer than the two branches 715

discordant with the population tree. Note that this is (possibly) 716

distinct from the probabilities of tree topologies, although the 717

average branch length of the internal branch in a topology, and 718

the probability of that topology are frequently strongly corre- 719

lated. Thus, negative F3-values indicate that individuals from 720

the admixed population are likely to coalesce with individuals 721

from the two other populations, before they coalesce with other 722

individuals from their own population! 723

Overall, when F3 is applicable, it is remarkably robust to 724

population structure, requiring rather strong substructure to 725

yield false-positives. Thus, it is a very striking finding that in 726

many applications to humans, negative F3-values are commonly 727

found (Patterson et al. 2012), indicating that for most human 728

populations, the majority of markers support a discordant gene 729

tree, which suggests that population structure and admixture are 730

widespread, and that population phylogenies are poorly suited 731

to describe human evolution. 732

Ancient population structure was proposed as possible con- 733

founder for the D and F4-statistics (Green et al. 2010). Here, I 734

show that non-symmetric population structure such as in step- 735

ping stone models can lead to non-zero F4-values, showing that 736

both ancestral and persisting population structure may result in 737

false-positives when assumptions are violated. 738

Furthermore, I showed that F2 can be seen as a special case 739

of a tree-metric, and that using F-statistics is equivalent to using 740

phylogenetic theory to test hypotheses about simple phyloge- 741

netic networks (Huson et al. 2010). From this perspective, it 742

is worth re-raising the issue pointed out by Felsenstein (1973), 743

how and when allele-frequency data should be transformed 744

for within-species phylogenetic inference.While F2 has become 745

a de facto standard, different transformations of allele frequen- 746

cies might be useful in some cases, as both F3 and F4 can be 747

interpreted as tests for treeness for arbitrary tree metrics. 748

This relationship provides ample opportunities for interac- 749

tion between these currently diverged fields: Theory (Huson 750

and Bryant 2006; Huson et al. 2010) and algorithms for finding 751

phylogenetic networks such as Neighbor-Net (Bryant and Moul- 752

ton 2004) may provide a useful alternative to tools specifically 753

developed for allele frequencies and F-statistics (Patterson et al. 754

2012; Pickrell and Pritchard 2012; Lipson et al. 2013), particularly 755

in complex cases. On the other hand, the tests and different 756

interpretations described here may be useful to test for treeness 757

in other phylogenetic applications, and the complex history of 758

humans may provide motivation to further develop the theory 759

of phylogenetic networks, and stress its usefulness for within- 760

species demographic analyses. 761

Acknowledgements 762

I would like to thank Heejung Shim, Rasmus Nielsen, John 763

Novembre and all members of the Novembre lab for helpful 764

comments and discussions. I am further grateful for comments 765

from Nick Patterson and an anonymous reviewer. BMP is sup- 766

ported by a Swiss NSF early postdoc mobility fellowship. Ad- 767

ditional funding for this work was provided by NIH grant R01 768

HG007089 to John Novembre. 769

Methods 770

Equivalence of drift interpretations 771

First, I show that F2 can be interpreted as the difference in vari- 772

ance of allele frequencies (Figure 3A): 773

As in the Results section, let Pi denote a population with 774

allele frequency pi and sample size ni, and let us assume P0 775

represents the population at time t0, and Pt the population at 776

time t, with t0 < t. 777

F2(Pt, P0) = E[(pt − p0)2]

= Var(pt − p0) + E[(pt − p0)]2

= Var(pt) + Var(p0)− 2Cov(p0, pt)

= Var(pt) + Var(p0)− 2Cov(p0, p0 + (pt − p0))

= Var(pt)−Var(p0) (21)

Here, I used E[pt − p0] = Cov(p0, pt − p0) = 0 to obtain lines 778

three and five. It is worth noting that this result holds for any 779

model of genetic drift where the expected allele frequency is the 780

current allele frequency and increments are independent. For 781

example, this interpretation of F2 holds also if genetic drift is 782

modeled as a Brownian motion (Cavalli-Sforza and Edwards 783

1967). 784

Admixture, Population Structure and F-statistics 11

Page 12: Admixture, Population Structure and F-Statistics - Genetics · GENETICS | INVESTIGATION Admixture, Population Structure and F-statistics Benjamin M Peter1 1Department of Human Genetics,

Figure 7 Expectations for F3 and F4 under select models. The constant factor θ/2 is omitted. Derivations can be found in the sectionDerivation of F under select models

12 Benjamin M Peter et al

Page 13: Admixture, Population Structure and F-Statistics - Genetics · GENETICS | INVESTIGATION Admixture, Population Structure and F-statistics Benjamin M Peter1 1Department of Human Genetics,

A heterozygosity model The interpretation of F2 in terms of the 785

decay in heterozygosity and identity by descent can be derived 786

elegantly using the duality between the diffusion process and 787

the coalescent: Let again t0 < t. 788

Then, 789

Ept

[pnt

t |p0, nt]= En0

[pn0

0 |p0, nt]

. (22)

This equation is due to Tavaré (1984), who also provided the 790

following intuition: Given nt individuals are sampled at time 791

t. Let E denote the event that all individuals carry allele x, 792

conditional on allele x having frequency p0 at time t0. There 793

are two components to this: First, the frequency will change 794

between t0 and t, and then all nt sampled individuals need to 795

carry x. 796

In a diffusion framework, 797

P(E) =∫ 1

0ynt P(pt = y|p0, nt)dy = E[pnt

t |p0, nt]. (23)

On the other hand, one may argue using the coalescent: For 798

E to occur, all nt samples need to carry the x allele. At time t0, 799

they had n0 ancestral lineages, who all carry x with probability 800

p0. Therefore, 801

P(E) =n0

∑i=1

pi0P (n0 = i|p0, nt) = E

[pn0

0 |p0, nt]

. (24)

Equating (23) and (24) yields Equation 22. 802

In the present case, the only relevant cases are nt = 1, 2, since:

E[

p1t |p0; nt = 1

]= p0 (25a)

E[

p2t |p0; nt = 2

]= p0 f + p2

0(1− f ), (25b)

where f is the probability that two lineage sampled at time t 803

coalesce before time t0. 804

This yields an expression for F2 by conditioning on the allelefrequency p0,

E[(p0 − pt)

2|p0

]= E

[p2

0|p0

]−E [2pt p0|p0] + E

[p2

t |p0

]= p2

0 − 2p20 + p0 f + p2

0(1− f )= f p0(1− p0)

=12

f H0.

Where H0 = 2p0(1− p0) is the heterozygosity. Integrating over 805

P(p0) yields: 806

F2(P0, Pt) =12

f EH0 (26)

and it can be seen that F2 increases as a function of f (Figure 3C). 807

This equation can also be interpreted in terms of probabilities of 808

identity by descent: f is the probability that two individuals are 809

identical by descent in Pt given their ancestors were not identical 810

by descent in P0 (Wright 1931), and EH0 is the probability two 811

individuals are not identical in P0. 812

Furthermore, EHt = (1− f )EH0 (Equation 3.4 in Wakeley 813

(2009)) and therefore 814

EH0 −EHt = EH0(1− (1− f )

)= 2F2(Pt, P0). (27)

which shows that F2 measures the decay of heterozygosity (Fig- 815

ure 3A). A similar argument was used by Lipson et al. (2013) to 816

estimate ancestral heterozygosities and to linearize F2. 817

Two populations F2 in terms of the difference in expected and 818

observed heterozygosity follows directly from the result from 819

Wahlund (1928), which was obtained by considering the geno- 820

types of all possible matings in the two subpopulations (Ta- 821

ble 3 in Wahlund 1928), and the variance case follows di- 822

rectly because Var(p1− p2) = E(p1− p2)2−

[E(p1− p2)

]2, but 823

E(p1 − p2) = 0. Lastly, F2 relates to FST via the definition of F2 824

as a variance in the definition of FST : FST =2(p1−p2)2

EHexp . 825

Covariance interpretation To see how F2 can be interpreted as acovariance between two individuals from the same population,define Xi and Xj as indicator variables that two individuals fromthe same population sample have the A allele, which has fre-quency p1 in one, and p2 in the other population. If individualsare equally likely to come from either population,

EXi = EXj =12

p1 +12

p2

EXiXj =12

p21 +

12

p22

Cov(Xi, Xj) =EXiXj −EXiEXj

=14(p1 − p2)

2 =14

F2(P1, P2) (28)

Derivation of F2 for gene trees 826

To derive equation (5), I start by considering F2 for two sam- 827

ples of size one, express F2 for arbitrary sample sizes in terms 828

of individual-level F2, and obtain a sample-size independent 829

expression by letting the sample size n go to infinity. 830

In this framework, I assume that mutation is rare such that 831

there is at most one mutation at any locus. In a sample of size 832

two, let Ii be an indicator random variable that individual i has 833

a particular allele. For two individuals, F2(I1, I2) = 1 implies 834

I1 = I2, whereas F2(I1, I2) = 0 implies I1 6= I2. Thus, F2(I1, I2) 835

is another indicator random variable with parameter equal to 836

the probability that a mutation happened on the tree branch 837

between I1 and I2. 838

Now, instead of a single individual I1, consider a sampleof n1 individuals: P1 = {I1,1 I1,2, . . . , I1,n1}. The sample allelefrequency is p1 = n−1

1 ∑i I1,i. And the sample-F2 is

F2 ( p1, I2) =F2

(1

n1

n1

∑i=1

I1,i, I2

)= E

(1

n1

n1

∑i=1

I1,i − I2

)2

=E

[1

n21

∑ I21,i +

2n2

1∑ I1,i I1,j −

2n1

∑ I1,i I2 + I22

]

=E

[1

n1∑ I2

1,i −2

n1∑ I1,i I2 +

n1n1

I22

]+ E

[2

n21

∑ I1,i I1,j −n1 − 1

n21

∑ I21,i

]The first three terms can be grouped into n1 terms of the form 839

F2(I1,i, I2), and the last two terms can be grouped into (n12 ) terms 840

of the form F2(I1,i, I1,j), one for each possible pair of samples in 841

P1. 842

Therefore, 843

F2 ( p1, I2) =1

n1∑

iF2(I1,i, I2)−

1n2

1∑i<j

F2(I1,i, I1,j) (29)

where the second sum is over all pairs in P1. This equation is 844

equivalent to Equation 22 in Felsenstein (1973). 845

Admixture, Population Structure and F-statistics 13

Page 14: Admixture, Population Structure and F-Statistics - Genetics · GENETICS | INVESTIGATION Admixture, Population Structure and F-statistics Benjamin M Peter1 1Department of Human Genetics,

As F2( p1, p2) = F2( p2, p1), I can switch the labels, and obtain 846

the same expression for a second population P2 = {I2,i, i = 847

0, . . . , n2} Taking the average over all I2,j yields 848

F2 ( p1, p2) =1

n1∑

iF2(I1,i, I2,j)

− 1n2

1∑i<j

F2(I1,i, I1,j)−1

n22

∑i<j

F2(I2,i, I2,j).(30)

Thus, I can write F2 between the two populations as the average 849

number of differences between individuals from different pop- 850

ulations, minus some terms including differences within each 851

sample. 852

Equation 30 is quite general, making no assumptions on 853

where samples are placed on a tree. In a coalescence framework, 854

it is useful to make the assumptions that all individuals from the 855

same population have the same branch length distribution, i.e. 856

F2(Ix1,i, Iy1,j) = F2(Ix2,i, Iy2,j) = for all pairs of samples (x1, x2) 857

and (y1, y2) from populations Pi and Pj. Secondly, I assume that 858

all samples correspond to the leaves of the tree, so that I can esti- 859

mate branch lengths in terms of the time to a common ancestor 860

Tij. Finally, I assume that mutations occur at a constant rate of 861

θ/2 on each branch. Taken together, these assumptions imply 862

that F2(Ii,k, Ij,l) = θETij for all individuals from populations 863

Pi, Pj. this simplifies (30) to 864

F2( p1, p2) = θ×(

ET12−12

(1− 1

n1

)ET11−

12

(1− 1

n2

)ET22

)(31)

which, for the cases of n = 1, 2 was also derived by Petkova 865

et al. (2014). In some applications, F2 might be calculated only 866

for segregating site in a large sample. As the expected number 867

of segregating sites is θ2 Ttot (with Ttot denoting the total tree 868

length), taking the limit where θ → 0 is meaningful (Slatkin 869

1991; Petkova et al. 2014): 870

F2( p1, p2) =2

Ttot

×(

ET12 −12

(1− 1

n1

)ET11 −

12

(1− 1

n2

)ET22

).

(32)

In either of these equations, 2Ttot

or θ act as a constant of pro- 871

portionality that is the same for all statistics calculated from the 872

same data. Since interest is focused on the relative magnitude 873

of F2, or whether a sum of F2-values is different from zero, this 874

constant has no impact on inference. 875

Furthermore, a population-level quantity is obtained by tak- 876

ing the limit when the number of individuals n1 and n2 go to 877

infinity: 878

F2(P1, P2) = limn1,n2→∞

F2( p1, p2). (33)

This yields Equation 5. Using θ as the constant of proportionality 879

yields 880

E[2π12 − π11 − π22] = 2F2(P1, P2), (34)

leading to the estimator given in Equation 6. 881

Equivalence of estimators The estimator in Eq. 34 is equivalent 882

to that given by Reich et al. (2009): 883

F2(P1, P2) =π12 − π11/2− π22/2

=[p1(1− p2) + p2(1− p1)

]− p1(1− p1)

n1n1 − 1

− p2(1− p2)n2

n2 − 1

= p1

(1− 1− 1

n1

)+ p2

(1− 1− 1

n2

)− 2p1 p2

+ p21

(1− 1

n1 − 1

)− p2

2

(1− 1

n2 − 1

)=( p1 − p2)

2 − p1(1− p1)

n1 − 1− p2(1− q2)

n2 − 1

which is Equation 10 in the Appendix of Reich et al. (2009). 884

Four-point-condition and F4 885

Using F2 as a tree-metric, the four-point condition (Buneman 886

1971) can be written as 887

F2(P1, P2) + F2(P3, P4) ≤

min[

F2(P1, P3) + F2(P2, P4), F2(P1, P4) + F2(P2, P3)

] (35)

for any permutations of the samples. This implies that two of 888

the sums need to be the same, and larger than the third. The 889

claim is that if the four-point condition holds, at least one of 890

the F4-values will be zero, and the others will have the same 891

absolute value. 892

Without loss of generality, assume that

F2(P1, P2) + F2(P3, P4) ≤ F2(P1, P3) + F2(P2, P4)

F2(P1, P3) + F2(P2, P4) = F2(P1, P4) + F2(P2, P3)

Simply plugging this into the three possible F4-equations yields

F4(P1, P2; P3, P4) = 0

F4(P1, P3; P2, P4) = kF4(P1, P4; P2, P3) = −k

where k = F2(P1, P3) + F2(P2, P4)− F2(P1, P2) + F2(P3, P4). 893

It is worth noting that the converse is false. If

F2(P1, P2) + F2(P3, P4) > F2(P1, P3) + F2(P2, P4)

F2(P1, P3) + F2(P2, P4) = F2(P1, P4) + F2(P2, P3)

the four-point condition is violated, but F4(P1, P2; P3, P4) is still 894

zero, and the other two F4-values have the same magnitude. 895

Derivation of F under select models 896

Here, I use Equation 5 together with Equations 10b and 14b to 897

derive expectations for F3 and F4 under some simple models. 898

Panmixia In a randomly mating population (with arbitrary 899

population size changes), P1 and P2 are taken from the same 900

pool of individuals and therefore T12 = T11 = T22, EF2 = EF3 = 901

EF4 = 0. 902

14 Benjamin M Peter et al

Page 15: Admixture, Population Structure and F-Statistics - Genetics · GENETICS | INVESTIGATION Admixture, Population Structure and F-statistics Benjamin M Peter1 1Department of Human Genetics,

Island models A (finite) island model has D subpopulations of 903

size 1 each. Migration occurs at rate M between subpopulations. 904

It can be shown (Strobeck 1987) that ET11 = ET22 = D. ET12 905

satisfies the recursion 906

ET12 =1

(D− 1)M+

D− 2D− 1

ET12 +1

D− 1ET11. (36)

with solution ET12 = 1 + M−1. This results in the equation in 907

Figure 7. The derivation of coalescence times for the hierarchical 908

island models is marginally more complicated, but similar. It is 909

given in Slatkin and Voelm (1991). 910

Admixture models These are the model for which the F-statistics were originally developed. Many details, applications,and the origin of the path representation are found in Pattersonet al. (2012). For simplicity, I look at the simplest possible treewith four populations, where PX is admixed from P1 and P2 withcontributions α and β = (1− α), respectively. I assume that allpopulations have the same size, and that this size is one. Then,

F3(PX ; P1, P2) ∝ET1X + ET2X −ET12 −ETXX

=(αt1 + βtr + 1) + (αtr + βt1 + 1)− tr − 1

− α21 − (1− α)21− 2α(1− α)[(1− cx)tr + 1]

=t1 − 2α(1− α)(1− cx)tr (37)

Here, cx is the probability that the two lineages from PX coalesce 911

before the admixture event. 912

Thus, F3 is negative if 913

t1(1− cx)tr

< 2α(1− α), (38)

which is more likely if α is large, the admixture is recent and the 914

overall coalescent is far in the past. 915

For F4, omitting the within-population coalescence time of 1:

F4(P1PX ; P2, P3) =ET12 + ET3X −ET13 −ET2X

=tr + αtr + βt23 − tr − αtr − βt2X

=β(t2 − t1)

Stepping-stone models For the stepping stone models, I haveto solve the recursions of the Markov chains describing thelocation of all lineages in a sample of size 2. For the standardstepping stone model, I assumed there were four demes, all ofwhich exchange migrants at rate M. This results in a MarkovChain with the following five states: i) lineages in same deme ii)lineages in demes 1 and 2, iii) lineages in demes 1 and 3, lineagesin demes 1 and 4 and v) lineages in demes 2 and 3. Note thatthe symmetry of this system allows collapsing some states. Thetransition matrix for this system is

1 0 0 0 0

2M 1− 3M M 0 0

0 M 1− 3M M M

0 0 2M 1− 2M 0

2M 0 2M 0 1− 4M

(39)

Once lineages are in the same deme, the system terminates 916

as the time to coalescence time is independent of the deme in 917

isotropic migration models (Strobeck 1987), and cancels from 918

the F-statistics. 919

Therefore, the vector v of the expected time until two lineages 920

are in the same deme is found using standard Markov Chain 921

theory by solving v = (I − T)−1)1, where T is the transition 922

matrix involving only the transitive states in the Markov chain 923

(all but the first state), and 1 is a vector of ones. 924

Finding the expected coalescence time involves solving a 925

system of 5 equations. The terms involved in calculating the 926

F-statistics (Table 1) are the entries in v corresponding to these 927

states. 928

The hierarchical case is similar, except there are 6 demes and10 equations. Representing states as lineages being in demes(same), (1,2), (1,3), (1,4), (1,5), (1,6), (2,3), (2,4), (2,5), (3,4).

1 0 0 0 0 0 0 0 0 0

2M 1− 3M M 0 0 0 0 0 0 0

0 M 1− 3M M 0 0 M 0 0 0

0 0 M 1− 3M M 0 0 M 0 0

0 0 0 M 1− 3M M 0 0 M 0

0 0 0 0 2M 1− 2M 0 0 0 0

2M 0 M 0 0 0 1− 4M M 0 0

0 0 0 M 0 0 M 1− 4M M M

0 0 0 0 2M 0 0 2M 1− 4M 0

2M 0 0 0 0 0 0 2M 0 1− 4M

As in the non-hierarchical case, solving this system yields all 929

pairwise coalescence times. Then, all I have to do is average the 930

coalescence times over all possibilities. E.g. 931

ET1X =v2 + v3 + v6 + v7

4. (40)

For F4, I assume that demes 1 and 2 are in P1, demes 3 and 4 in 932

PX and demes 5 and 6 correspond to P2 and P3, respectively. 933

Range expansion model I use a serial founder model with nomigration (Peter and Slatkin 2015), where I assume that theexpansion is recent enough such that the effect of migrationafter the expansion finished can be ignored. Under that model, Iassume that samples P1 and P2 are taken from demes D1 and D2,with D1 closer to the origin of the expansion, and populationswith high ids even further away from the expansion origin. ThenET12 = t1 + ET11, where Et1 is the time required for a lineagesampled further away in the expansion to end up in D1. (Notethat t1 only depends on the deme that is closer to the origin).Thus, for three demes,

F3(P2; P1, P3) ∝ ET12 −ET13 + ET23 −ET22

∝ ET11 + t1 −ET11 − t1 + ET22 + t2 −ET22

∝ t2.

and

F(T)4 (P1, P2; P3, P4) ∝ ET13 −ET14 + ET24 −ET23

∝ ET11 + t1 −ET11 − t1 + ET22 + t2 −ET22 − t2

= 0.

More interesting is

F(B)4 (P1, P2; P3, P4) ∝ ET12 −ET14 + ET34 −ET23

∝ ET11 + t1 −ET11 − t1 + ET33 + t3 −ET22 − t2

∝ ET33 + t3 −ET22 − t2.

Simulations 934

Simulations were performed using ms (Hudson 2002). Specific 935

commands used are 936

ms 466 100 -t 100 -r 10 100000 -I 12 22 6 61 49 937

57 33 43 34 40 84 13 24 -en 0 2 7.2 -en 0 3 .2 -en 0 938

Admixture, Population Structure and F-statistics 15

Page 16: Admixture, Population Structure and F-Statistics - Genetics · GENETICS | INVESTIGATION Admixture, Population Structure and F-statistics Benjamin M Peter1 1Department of Human Genetics,

4 .4 -en 0 5 .2 -en 0 6 4.4 -en 0 7 3.2 -en 0 8 4.8 939

-en 0 9 0.2 -en 0 10 3.2 -en 0 11 0.2 -en 0 12 0.7 940

-ej 0.01 2 1 -ej 0.02 3 1 -ej 0.04 4 1 -ej 0.06 5 1 941

-ej 0.08 6 1 -ej 0.10 7 1 -ej 0.12 8 1 -ej 0.14 9 1 942

-ej 0.16 10 1 -ej 0.18 11 1 -ej 0.3 12 1 943

for the outgroup-F3-statistic (Figure 5A). Sample sizes and 944

population sizes were picked randomly, but kept the same over 945

all one hundred replicates. Additionally, I randomly assigned 946

each population an error rate uniformly between zero and 0.05. 947

Errors were introduced by adding additional singletons and 948

flipping alleles at that rate. 949

For Figure 5B, the command was 950

ms 301 100 -t 10 -I 4 100 100 100 1 -es 0.001 2 951

$ALPHA -ej 0.03 2 1 -ej 0.03 5 3 -ej 0.3 3 1 -ej 952

0.31 4 1 953

with the admixture proportion $ALPHA set to increments of 954

0.025 from 0 to 0.5, with 200 data sets generated per $ALPHA. 955

Lastly, data for Figure 5C was simulated using 956

ms 501 100 -t 50 -r 50 10000 -I 6 100 100 100 100 957

100 1 -es 0.001 3 $ALPHA -ej 0.03 3 2 -ej 0.03 7 4 958

-ej 0.1 2 1 -ej 0.2 4 1 -ej 0.3 5 1 -ej 0.31 6 1 959

Here, the admixture proportion $ALPHA was varied in incre- 960

ments of 0.1 from 0 to 1, again with 200 data sets generated per 961

$ALPHA. 962

F3 and F4-statistics were calculated using the implementation 963

from Pickrell and Pritchard (2012). 964

Literature Cited 965

Allentoft, M. E., M. Sikora, K.-G. Sjögren, S. Rasmussen, 966

M. Rasmussen, J. Stenderup, P. B. Damgaard, H. Schroeder, 967

T. Ahlström, L. Vinner, A.-S. Malaspinas, A. Margaryan, 968

T. Higham, D. Chivall, N. Lynnerup, L. Harvig, J. Baron, P. D. 969

Casa, P. Dabrowski, P. R. Duffy, A. V. Ebel, A. Epimakhov, 970

K. Frei, M. Furmanek, T. Gralak, A. Gromov, S. Gronkiewicz, 971

G. Grupe, T. Hajdu, R. Jarysz, V. Khartanovich, A. Khokhlov, 972

V. Kiss, J. Kolár, A. Kriiska, I. Lasak, C. Longhi, G. McG- 973

lynn, A. Merkevicius, I. Merkyte, M. Metspalu, R. Mkrtchyan, 974

V. Moiseyev, L. Paja, G. Pálfi, D. Pokutta, Ł. Pospieszny, T. D. 975

Price, L. Saag, M. Sablin, N. Shishlina, V. Smrcka, V. I. Soenov, 976

V. Szeverényi, G. Tóth, S. V. Trifanova, L. Varul, M. Vicze, 977

L. Yepiskoposyan, V. Zhitenev, L. Orlando, T. Sicheritz-Pontén, 978

S. Brunak, R. Nielsen, K. Kristiansen, and E. Willerslev, 2015 979

Population genomics of Bronze Age Eurasia. Nature 522: 167– 980

172. 981

Bryant, D. and V. Moulton, 2004 Neighbor-Net: An Agglomera- 982

tive Method for the Construction of Phylogenetic Networks. 983

Molecular Biology and Evolution 21: 255–265. 984

Buneman, P., 1971 The recovery of trees from measures of dis- 985

similarity. Mathematics in the Archaeological and Historical 986

Sciences . 987

Buneman, P., 1974 A note on the metric properties of trees. Jour- 988

nal of Combinatorial Theory, Series B 17: 48–50. 989

Cavalli-Sforza, L. L. and A. W. F. Edwards, 1967 Phylogenetic 990

Analysis: Models and Estimation Procedures. Evolution 21: 991

550–570, ArticleType: research-article / Full publication date: 992

Sep., 1967 / Copyright © 1967 Society for the Study of Evolu- 993

tion. 994

Cavalli-Sforza, L. L. and A. Piazza, 1975 Analysis of evolution: 995

Evolutionary rates, independence and treeness. Theoretical 996

Population Biology 8: 127–165. 997

Durand, E., N. Patterson, D. Reich, and M. Slatkin, 2011 Testing 998

for ancient admixture between closely related populations. 999

Molecular Biology and Evolution 28: 2239–2252. 1000

Excoffier, L., P. E. Smouse, and J. M. Quattro, 1992 Analysis of 1001

Molecular Variance Inferred From Metric Distances Among 1002

DNA Haplotypes: Application to Human Mitochondrial DNA 1003

Restriction Data. Genetics 131: 479–491. 1004

Felsenstein, J., 1973 Maximum-likelihood estimation of evolu- 1005

tionary trees from continuous characters. American Journal of 1006

Human Genetics 25: 471–492. 1007

Felsenstein, J., 1981 Evolutionary Trees From Gene Frequencies 1008

and Quantitative Characters: Finding Maximum Likelihood 1009

Estimates. Evolution 35: 1229–1242. 1010

Felsenstein, J., 2004 Inferring phytogenies. Sunderland, Mas- 1011

sachusetts: Sinauer Associates . 1012

Fitch, W. M., E. Margoliash, and others, 1967 Construction of 1013

phylogenetic trees. Science 155: 279–284. 1014

Green, R., J. Krause, A. Briggs, T. Maricic, U. Stenzel, M. Kircher, 1015

N. Patterson, H. Li, W. Zhai, M. Fritz, et al., 2010 A draft 1016

sequence of the Neandertal genome. science 328: 710. 1017

Haak, W., I. Lazaridis, N. Patterson, N. Rohland, S. Mallick, 1018

B. Llamas, G. Brandt, S. Nordenfelt, E. Harney, K. Steward- 1019

son, Q. Fu, A. Mittnik, E. Bánffy, C. Economou, M. Francken, 1020

S. Friederich, R. G. Pena, F. Hallgren, V. Khartanovich, 1021

A. Khokhlov, M. Kunst, P. Kuznetsov, H. Meller, O. Mochalov, 1022

V. Moiseyev, N. Nicklisch, S. L. Pichler, R. Risch, M. A. 1023

Rojo Guerra, C. Roth, A. Szécsényi-Nagy, J. Wahl, M. Meyer, 1024

J. Krause, D. Brown, D. Anthony, A. Cooper, K. W. Alt, and 1025

D. Reich, 2015 Massive migration from the steppe was a source 1026

for Indo-European languages in Europe. Nature 522: 207–211. 1027

Hellenthal, G., G. B. J. Busby, G. Band, J. F. Wilson, C. Capelli, 1028

D. Falush, and S. Myers, 2014 A Genetic Atlas of Human 1029

Admixture History. Science 343: 747–751. 1030

Hudson, R. R., 2002 Generating samples under a Wright-Fisher 1031

neutral model of genetic variation. Bioinformatics (Oxford, 1032

England) 18: 337–338. 1033

Huson, D. H. and D. Bryant, 2006 Application of phylogenetic 1034

networks in evolutionary studies. Molecular biology and evo- 1035

lution 23: 254–267. 1036

Huson, D. H., R. Rupp, and C. Scornavacca, 2010 Phylogenetic 1037

networks: concepts, algorithms and applications. Cambridge Uni- 1038

versity Press. 1039

Lazaridis, I., N. Patterson, A. Mittnik, G. Renaud, S. Mallick, 1040

K. Kirsanow, P. H. Sudmant, J. G. Schraiber, S. Castellano, 1041

M. Lipson, and others, 2014 Ancient human genomes suggest 1042

three ancestral populations for present-day Europeans. Nature 1043

513: 409–413. 1044

Lipson, M., P.-R. Loh, A. Levin, D. Reich, N. Patterson, and 1045

B. Berger, 2013 Efficient Moment-Based Inference of Admix- 1046

ture Parameters and Sources of Gene Flow. Molecular Biology 1047

and Evolution 30: 1788–1802. 1048

McCullagh, P., 2009 Marginal likelihood for distance matrices. 1049

Statistica Sinica 19: 631. 1050

Patterson, N. J., P. Moorjani, Y. Luo, S. Mallick, N. Rohland, 1051

Y. Zhan, T. Genschoreck, T. Webster, and D. Reich, 2012 1052

Ancient Admixture in Human History. Genetics p. genet- 1053

ics.112.145037. 1054

Pease, J. B. and M. W. Hahn, 2015 Detection and Polarization of 1055

Introgression in a Five-Taxon Phylogeny. Systematic Biology 1056

64: 651–662. 1057

Peter, B. M. and M. Slatkin, 2015 The effective founder effect in 1058

a spatially expanding population. Evolution 69: 721–734. 1059

16 Benjamin M Peter et al

Page 17: Admixture, Population Structure and F-Statistics - Genetics · GENETICS | INVESTIGATION Admixture, Population Structure and F-statistics Benjamin M Peter1 1Department of Human Genetics,

Petkova, D., J. Novembre, and M. Stephens, 2014 Visualizing spa- 1060

tial population structure with estimated effective migration 1061

surfaces. bioRxiv p. 011809. 1062

Pickrell, J. K. and J. K. Pritchard, 2012 Inference of Population 1063

Splits and Mixtures from Genome-Wide Allele Frequency 1064

Data. PLoS Genet 8: e1002967. 1065

Pickrell, J. K. and D. Reich, 2014 Toward a new history and 1066

geography of human genes informed by ancient DNA. Trends 1067

in Genetics 30: 377–389. 1068

Pritchard, J. K., M. Stephens, and P. Donnelly, 2000 Inference of 1069

population structure using multilocus genotype data. Genetics 1070

155: 945–959. 1071

Raghavan, M., P. Skoglund, K. E. Graf, M. Metspalu, A. Albrecht- 1072

sen, I. Moltke, S. Rasmussen, T. W. Stafford Jr, L. Orlando, 1073

E. Metspalu, and others, 2014 Upper Palaeolithic Siberian 1074

genome reveals dual ancestry of Native Americans. Nature 1075

505: 87–91. 1076

Ralph, P. and G. Coop, 2013 The Geography of Recent Genetic 1077

Ancestry across Europe. PLoS Biol 11: e1001555. 1078

Reich, D., N. Patterson, D. Campbell, A. Tandon, S. Mazieres, 1079

N. Ray, M. V. Parra, W. Rojas, C. Duque, N. Mesa, L. F. 1080

García, O. Triana, S. Blair, A. Maestre, J. C. Dib, C. M. 1081

Bravi, G. Bailliet, D. Corach, T. Hünemeier, M. C. Bor- 1082

tolini, F. M. Salzano, M. L. Petzl-Erler, V. Acuña-Alonzo, 1083

C. Aguilar-Salinas, S. Canizales-Quinteros, T. Tusié-Luna, 1084

L. Riba, M. Rodríguez-Cruz, M. Lopez-Alarcón, R. Coral- 1085

Vazquez, T. Canto-Cetina, I. Silva-Zolezzi, J. C. Fernandez- 1086

Lopez, A. V. Contreras, G. Jimenez-Sanchez, M. J. Gómez- 1087

Vázquez, J. Molina, Á. Carracedo, A. Salas, C. Gallo, G. Poletti, 1088

D. B. Witonsky, G. Alkorta-Aranburu, R. I. Sukernik, L. Os- 1089

ipova, S. A. Fedorova, R. Vasquez, M. Villena, C. Moreau, 1090

R. Barrantes, D. Pauls, L. Excoffier, G. Bedoya, F. Rothham- 1091

mer, J.-M. Dugoujon, G. Larrouy, W. Klitz, D. Labuda, J. Kidd, 1092

K. Kidd, A. Di Rienzo, N. B. Freimer, A. L. Price, and A. Ruiz- 1093

Linares, 2012 Reconstructing Native American population 1094

history. Nature 488: 370–374. 1095

Reich, D., K. Thangaraj, N. Patterson, A. L. Price, and L. Singh, 1096

2009 Reconstructing Indian population history. Nature 461: 1097

489–494. 1098

Saitou, N. and M. Nei, 1987 The neighbor-joining method: a 1099

new method for reconstructing phylogenetic trees. Molecular 1100

Biology and Evolution 4: 406–425. 1101

Schraiber, J. G. and J. M. Akey, 2015 Methods and models for 1102

unravelling human evolutionary history. Nature Reviews Ge- 1103

netics . 1104

Semple, C. and M. A. Steel, 2003 Phylogenetics. Oxford University 1105

Press. 1106

Slatkin, M., 1991 Inbreeding coefficients and coalescence times. 1107

Genetic Research 58: 167–175. 1108

Slatkin, M. and L. Voelm, 1991 FST in a Hierarchical Island 1109

Model. Genetics 127: 627–629. 1110

Strobeck, C., 1987 Average number of nucleotide differences in 1111

a sample from a single subpopulation: a test for population 1112

subdivision. Genetics 117: 149–153. 1113

Tajima, F., 1983 Evolutionary Relationship of Dna Sequences in 1114

Finite Populations. Genetics 105: 437–460. 1115

Tavaré, S., 1984 Line-of-descent and genealogical processes, and 1116

their applications in population genetics models. Theoretical 1117

Population Biology 26: 119–164. 1118

Wahlund, S., 1928 Zusammensetzung Von Populationen 1119

Und Korrelationserscheinungen Vom Standpunkt Der 1120

Vererbungslehre Aus Betrachtet. Hereditas 11: 65–106. 1121

Wakeley, J., 2009 Coalescent theory: an introduction. Roberts & Co. 1122

Publishers. 1123

Weir, B. S. and C. C. Cockerham, 1984 Estimating F-Statistics for 1124

the Analysis of Population Structure. Evolution 38: 1358–1370, 1125

ArticleType: research-article / Full publication date: Nov., 1126

1984 / Copyright © 1984 Society for the Study of Evolution. 1127

Wright, S., 1921 Systems of mating. Genetics 6: 111–178. 1128

Wright, S., 1931 Evolution in Mendelian populations. Genetics 1129

16: 97–159. 1130

Yunusbayev, B., M. Metspalu, E. Metspalu, A. Valeev, S. Litvi- 1131

nov, R. Valiev, V. Akhmetova, E. Balanovska, O. Balanovsky, 1132

S. Turdikulova, D. Dalimova, P. Nymadawa, A. Bahmanimehr, 1133

H. Sahakyan, K. Tambets, S. Fedorova, N. Barashkov, I. Khidiy- 1134

atova, E. Mihailov, R. Khusainova, L. Damba, M. Derenko, 1135

B. Malyarchuk, L. Osipova, M. Voevoda, L. Yepiskoposyan, 1136

T. Kivisild, E. Khusnutdinova, and R. Villems, 2015 The Ge- 1137

netic Legacy of the Expansion of Turkic-Speaking Nomads 1138

across Eurasia. PLoS Genet 11: e1005068. 1139

Admixture, Population Structure and F-statistics 17

Page 18: Admixture, Population Structure and F-Statistics - Genetics · GENETICS | INVESTIGATION Admixture, Population Structure and F-statistics Benjamin M Peter1 1Department of Human Genetics,

Figure S1 Path interpretation of F2: F2 is interpreted as the covariance of two possible paths from P1 to P2, which I color green andblue, respectively. Only branches that are taken by both paths contribute to the covariance. With probability α, a path takes the leftadmixture edge, and with probability β = 1− α, the right one. I then condition on which admixture edge the paths follow: In thefirst tree on the right-hand side, both paths take the right admixture edge (with probability α2, in the second and third tree they takedifferent or the right path, respectively. The result is summarized as the weighted sum of branches in the left-hand side tree. For amore detailed explanation, see Patterson et al. (2012).

C. Discordant genealogy

B. Concordant genealogy

A. Equation

Figure S2 Schematic explanation how F3 behaves conditioned on gene tree. Blue terms and branches correspond to positive con-tributions, whereas red branches and terms are subtracted. Labels represent individuals randomly sampled from that population.The external branches cancel out, so only the internal branches have non-zero contribution to F3. In the concordant genealogy(Panel B), the contribution is positive (with weight 2), and in the discordant genealogy (Panel C), it is negative (with weight 1). Themutation rate as constant of proportionality is omitted.

18 Benjamin M Peter et al

Page 19: Admixture, Population Structure and F-Statistics - Genetics · GENETICS | INVESTIGATION Admixture, Population Structure and F-statistics Benjamin M Peter1 1Department of Human Genetics,

C. Discordant genealogy (BABA)

B. Concordant genealogy

A. Equation

D. Discordant genealogy (ABBA)

Figure S3 Schematic explanation how F4 behaves conditioned on gene tree. Blue terms and branches correspond to positive con-tributions, whereas red branches and terms are subtracted. Labels represent individuals randomly sampled from that population.All branches cancel out in the concordant genealogy (Panel B), and that the two discordant genealogies contribute with weight +2and -2, respectively The mutation rate as constant of proportionality is omitted.

Admixture, Population Structure and F-statistics 19