43
Maximum Likelihood Pedigree Reconstruction using Integer Linear Programming James Cussens 1 , Mark Bartlett 1 , Elinor M. Jones 2 , and Nuala A. Sheehan 2 1 Department of Computer Science, University of York, York, North Yorkshire, YO10 5GH, United Kingdom 2 Department of Health Sciences, University of Leicester, Leicester, Leicestershire, LE1 6TP, United Kingdom Running Title: Pedigree Reconstruction using ILP Corresponding Author: Dr Nuala Sheehan, Department of Health Sciences, University of Leicester, Adrian Building, University Road, Leicester, LE1 7RH Tel: 00 44 116 229 7271 Email: [email protected] 1

Maximum Likelihood Pedigree Reconstruction using Integer Linear

Embed Size (px)

Citation preview

Maximum Likelihood Pedigree Reconstruction using

Integer Linear Programming

James Cussens1, Mark Bartlett1, Elinor M. Jones2, and Nuala A. Sheehan2

1Department of Computer Science, University of York, York, North

Yorkshire, YO10 5GH, United Kingdom

2Department of Health Sciences, University of Leicester, Leicester,

Leicestershire, LE1 6TP, United Kingdom

Running Title: Pedigree Reconstruction using ILP

Corresponding Author: Dr Nuala Sheehan,

Department of Health Sciences,

University of Leicester,

Adrian Building, University Road,

Leicester, LE1 7RH

Tel: 00 44 116 229 7271

Email: [email protected]

1

Abstract

Large population biobanks of unrelated individuals have been highly successful in

detecting common genetic variants affecting diseases of public health concern. However,

they lack the statistical power to detect more modest gene-gene and gene-environment

interaction effects or the effects of rare variants for which related individuals are ideally

required. In reality, most large population studies will undoubtedly contain sets of

undeclared relatives, or pedigrees. Although a crude measure of relatedness might

sometimes suffice, having a good estimate of the true pedigree would be much more

informative if this could be obtained efficiently. Relatives are more likely to share

longer haplotypes around disease susceptibility loci and are hence biologically more

informative for rare variants than unrelated cases and controls. Distant relatives are

arguably more useful for detecting variants with small effects since they are less likely to

share masking environmental effects. Moreover, the identification of relatives enables

appropriate adjustments of statistical analyses that typically assume unrelatedness.

We propose to exploit an integer linear programming (ILP) optimisation approach

to pedigree learning, which is adapted to find valid pedigrees by imposing appropriate

constraints. Our method is not restricted to small pedigrees and is guaranteed to return

a maximum likelihood pedigree. With additional constraints, we can also search for

multiple high probability pedigrees and thus account for the inherent uncertainty in

any particular pedigree reconstruction. The true pedigree is found very quickly by

comparison with other methods when all individuals are observed. Extensions to more

complex problems seem feasible.

Keywords: Constraint based optimisation, Genetic marker data, Bayesian networks,

Model uncertainty

2

1 Introduction

Identification of relatives amongst a group of individuals from genetic marker data is relevant

to areas as diverse as evolution and conservation research, epidemiological and genealogical

research, and forensic science identification problems [Blouin, 2003; Jones and Ardren, 2003;

Meagher and Thompson, 1987; Olaisen et al., 1997]. In medical research, large population

biobank studies of unrelated individuals have been set up worldwide to investigate the ge-

netic risk factors underlying the common complex diseases of major public health concern. It

is now well understood that they do not have sufficient statistical power to detect the effects

of rare genes and the more modest effects of gene-gene and gene-environment interactions.

Sets of relatives are more likely to share longer haplotypes around susceptibility loci and are

hence biologically more informative for rare variants than unrelated ‘cases’ and ‘controls’. In

fact, distant, rather than close, relatives are more likely to be useful for detecting genes with

small effects as the genetic effects are less likely to be masked by the shared environment

effects typically arising in near relatives. In the absence of exact information, general pair-

wise relatedness, rather than specific relationships, can be estimated from population sample

data using approaches based on identity by descent (IBD) methods [McPeek and Sun, 2000;

Sieberts et al., 2002; Leutenegger et al., 2003; Stankovich et al., 2005; Browning and Brown-

ing, 2010; Powell et al., 2010; Thompson, 2008b]. Such methods are particularly efficient at

detecting misspecified relationships: assessing the true relationship is more difficult. How-

ever, large population studies undoubtedly contain pedigrees (i.e. family trees) [Thompson,

2008a] so if it were possible to reconstruct (extended) pedigrees easily from population sam-

ple genetic data, we would have far richer information and sets of distant relatives could be

readily proposed.

Pedigree information is also relevant to standard statistical association testing where fail-

ure to take account of existing relationships can yield overly high false positive results [New-

man et al., 2001; Stankovich et al., 2005; Choi et al., 2009; Thornton and McPeek, 2010].

Likewise, some ‘true’ associations could be missed: a susceptibility gene for which different

variants are segregating in different groups of relatives would never be detected in an asso-

ciation analysis on unrelated individuals. Moreover, even in linkage analyses where pedigree

3

data are required, undeclared relatedness amongst the founders of a particular family and

between founders of different families can heavily bias the results if appropriate adjustments

are not made [Genin and Clerget-Darpoux, 1996; Sheehan and Egeland, 2008; Thompson,

2008a].

It is sometimes convenient to consider the pedigree identification problem as one of de-

termining the most likely pedigree from a set of possible alternatives. In theory, estimating

the most likely pedigree for a given set of individuals from genetic marker data simply

requires consideration of all possible relationships amongst them and computing the like-

lihood for each [Cannings and Thompson, 1981; Thompson, 1975]. However, due to the

potentially enormous number of possible pedigrees, simple enumeration rapidly becomes im-

practical [Egeland et al., 2000; Sheehan and Egeland, 2007]. When general features of the

pedigree are of more interest than precise estimation of the entire set of relationships, ‘se-

quential’ (i.e. greedy) algorithms can be used which produce a single high likelihood (but

not necessarily maximum likelihood) reconstruction by starting from a position where all

individuals are assumed to be unrelated and gradually accepting sibships on the basis of

the increase attained in log-likelihood [Thompson, 1976a]. In order to give good results,

the above approach requires information on age (or at least an age ordering), sex, average

sibship size, typical generation gap and age disparity among parents. Other approaches

to pedigree reconstruction which produce a high likelihood pedigree include simulated an-

nealing [Almudevar, 2003; Riester et al., 2009] and Markov chain Monte Carlo (MCMC)

methods [Almudevar, 2007].

In contrast, the method presented in this paper will yield a guaranteed maximum like-

lihood pedigree, rather than just a high-probability pedigree. This has also been achieved

through Bayesian network learning using dynamic programming by Cowell [2009]. The ap-

proach we propose in this paper uses constraint based methods which have been used for

Mendelian consistency checking [Sanchez et al., 2008] but not to our knowledge for pedigree

reconstruction. More specifically, we use integer linear programming (ILP) to search for

maximum likelihood pedigrees. For clarity and direct comparison with Cowell [2009], we

have chosen to cast the problem as one of Bayesian network learning noting that while this

is not necessary for the situation presented here, such a representation will undoubtedly be

4

useful for more complex applications.

There is no guarantee that the most probable pedigree is the true pedigree [Thompson,

1976b]. This is due to the probabilistic nature of genetic inheritance whereby marker data

on unrelated individuals may indicate that they are, in fact, related and where the observed

data could assign higher likelihood to an incorrect relationship than the true one. Failure

to acknowledge such inevitable uncertainty could lead to spuriously confident inferences and

so, in addition to finding the most likely pedigree, we will also show how our approach can

be used to search for many high likelihood pedigrees.

The structure of the paper is as follows. In Section 2, we outline how a pedigree can be

represented as a Bayesian network and present the likelihood function to be optimised. We

then explain how the pedigree reconstruction problem can be encoded as an ILP optimisation

following the argument presented in Cussens [2010]. Results for this encoding are given for

a straightforward situation in Section 3 and compared with competing methods. The paper

concludes in Section 4 with a discussion of the issues and challenges for future work, including

possible extensions to more complex problems.

2 Material and Methods

2.1 Pedigrees and Bayesian networks

A pedigree can easily be represented as a directed acyclic graph (DAG) [Lange and Elston,

1975], where each node corresponds to a pedigree member and a directed connection (i.e.

arrow, or directed edge) between two individuals represents a parent-child relationship. We

note that this defines a special class of DAG with well defined structural constraints: each

node in the graph is known to have exactly two parent nodes (although either may be latent)

and these parents must be of opposite sexes. A node with no parents is a pedigree founder.

Figure 1(a) shows a simple pedigree in conventional ‘family tree’ form, with males represented

as squares and females as circles. The corresponding DAG is shown in Figure 1(b) and the

marriage node graph [Thomas, 1985; Cannings et al., 1978] in Figure 1(c).

[Figure 1 about here.]

5

A Bayesian network (BN) is a directed acyclic graph whose nodes, v ∈ V , represent

random variables. There are several ways of expressing a pedigree as a BN [Lauritzen

and Sheehan, 2003] but one of the simplest is to consider the graph in Figure 1(b) where

the nodes no longer represent the pedigree members themselves but random variables that

assign genotypes to these individuals. These random variables have a joint distribution that

depends on the pedigree structure and the mode of inheritance for the relevant genetic locus

which can be factorised into a product of conditional distributions for each node given its

parent nodes.

We will introduce our reconstruction approach with a focus, as other authors have done,

on the most straightforward case [Almudevar, 2003; Cowell, 2009; Thompson, 1976a]. Specif-

ically, we assume that the sample of individuals is complete by which we mean that each

observed individual has fully observed genetic data and each unobserved parent is assumed

to be unrelated to any other individual in the sample and is hence an unobserved pedigree

founder. We will also adopt the common assumption of random union of gametes indicating

that founder genes are independent of each other with the consequent implication of Hardy-

Weinberg proportions for founder genotypes. Finally, we will assume that genetic markers

at the different loci under consideration are segregating independently. While perfectly ad-

equate for the independent marker data that we consider here, we note that the above BN

representation of a pedigree will not suffice for more complex genetic models involving linked

loci, for example, where more general graphical model representations are more appropriate.

We note that it also implicitly assumes simple Mendelian inheritance of genes in offspring

from parents [Lauritzen and Sheehan, 2003].

For a given pedigree graph, G, let V be the set of individuals and identify v ∈ V with a

random variable Gv assigning a genotype at a single locus to v. Let fv and mv be the labels

of the nodes representing the father and mother of v, respectively. Then

τ(gv | gfv, gmv

) ≡ P (Gv = gv |Gfv= gfv

, Gmv= gmv

)

denotes the probability that v has genotype gv given the parental genotype combination

of gfvand gmv

. In the event where only one parent, e.g. mv, is observed, we assume that

6

the unobserved parent is a founder and sum over all possible genotypes for that parent.

Specifically,

τ(gv | gmv) =

gfv

τ(gv | gfv, gmv

)p(gfv)

where p(gfv) — the marginal probability that a founder individual fv has genotype gfv

is derived from the population allele frequencies and the assumption of Hardy-Weinberg

proportions. This is equivalent to assuming that the gene inherited from the unobserved

parent is randomly drawn from the population allele frequency distribution. In general, we

can express these probabilities as

τ(v, Pa(v,G))

where Pa(v,G) denotes the parent set of v in G which will comprise 0, 1 or 2 nodes and

where τ(v, ∅) ≡ p(gv) represents the marginal probability that v has genotype gv. Note

that these probabilities are just the usual Mendelian transmission probabilities of 0, 1

4, 1

2

or 1 for non-founding pedigree members with two observed parents. Since a pedigree is

completely specified by parent-child relationships, the probability of any single configuration

of genotypes on a pedigree, and hence the likelihood under the assumption of a complete

sample and Mendelian inheritance, decomposes into a product of conditional probabilities

and can be written as

L(G) =∏

v∈V

τ(v, Pa(v,G)). (1)

Note that this is precisely the form of the standard decomposition of a pedigree likelihood

into triplet products under the same assumptions. (See Thompson [2000], for example.)

The aim of a maximum likelihood pedigree reconstruction is to find G such that L(G) is

maximised. Given that the factorisation of the joint distribution in (1) defines a Bayesian

network [Lauritzen, 1996], this is equivalent to searching for an optimal Bayesian network

where the search is constrained to find networks that are valid pedigrees. It is often more

convenient to work with the log-likelihood so our optimisation problem in that case is to find

7

G to maximise

l(G) = log L(G) =∑

v∈V

log τ(v, Pa(v,G)). (2)

Since our multilocus genotype is based on independently segregating markers, by as-

sumption, the joint likelihood of interest is simply a product of likelihoods (Equation (1))

calculated for each locus separately and the log-likelihood is a sum of the relevant expres-

sions (Equation (2)). For convenience, we will thus focus on a single locus to describe our

approach.

2.2 An ILP formulation for likelihood maximisation

An integer linear program (ILP) is defined by:

1. a set of variables X, representing unknown quantities, some of which will be restricted

to take only integer values,

2. a linear objective function of the form∑

x∈X cxx where the coefficients cx are fixed

constants, and

3. linear equations and inequalities putting joint constraints of the form∑

x∈X axx ≤ b

on the values the variables can take where the values ax and b are fixed constants.

The ILP optimisation problem is to find an assignment of values to the variables which

maximises the objective function while respecting all constraints. This is an NP-hard prob-

lem in the general case [Karp, 1972] which means there is no known polynomial time algo-

rithm to solve it in the general case. It is however still possible to solve the problem exactly,

thus guaranteeing that the optimal pedigree is found. One simple algorithm that does so

is to enumerate all potential solutions and examine them in turn. This would of course be

impractically slow for any reasonably sized problem.

Decades of research (both academic and commercial) have gone into creating highly opti-

mised ILP solvers based on the simplex algorithm which achieve good temporal performance

in practice [Wolsey, 1998]. Given increasingly large problems, these solvers will eventually

8

run out of computer memory or take too long to be of practical use. However, provided that

the solvers are given sufficient time and resources, it can be guaranteed that they will always

find the optimal solution. The typically good performance of these solvers means that they

present an attractive option for maximum likelihood pedigree reconstruction provided the

latter can be efficiently coded as an ILP problem.

Such an encoding is indeed possible but is far from obvious and so it will be described in

some detail here for the pedigree likelihood shown in Equation (1). For illustrative purposes

we will describe how our encoding applies to a very simple pedigree reconstruction problem

involving n = 3 individuals, v1, v2 and v3, under the assumption that marker data are

available for all three but there is no additional information concerning their sex, age or

relatedness.

The first step is to define variables which allow a numeric encoding of any possible

pedigree. From the simple observation that a pedigree specifies the parents of an individual,

indicator functions of the form I(W → v)(G) are created for each possible parentage for each

individual. Specifically,

I(W → v)(G) =

1 if v has parents W in G

0 otherwise(3)

where W ⊆ V \ {v} and |W | ≤ 2. In our tiny example there are 12 such variables:

I({} → v1) I({v2} → v1) I({v3} → v1) I({v2, v3} → v1)

I({} → v2) I({v1} → v2) I({v3} → v2) I({v1, v3} → v2)

I({} → v3) I({v1} → v3) I({v2} → v3) I({v1, v2} → v3)

where, for example, I({v2} → v1) = 1 indicates that v2 is the only parent of v1 in V .

[Figure 2 about here.]

Since each pedigree member has only one set of parents, any possible pedigree G deter-

mines a joint instantiation of the binary variables, I(W → v) in Equation (3) by setting

exactly |V | = n of them to 1 and all the others to 0. For example, if we consider the

pedigree for our toy example in Figure 2(a), we have I({} → v1) = 1 since v1 has no par-

9

ents (i.e. is a founder), I({v1} → v2) = 1 since v2 has one parent v1 in the pedigree and

I({v2} → v3) = 1 to represent that v3 has v2 as its only parent. Analogously, I({} → v1) = 1,

I({v1, v3} → v2) = 1 and I({} → v3) = 1 for Figure 2(b) and I({} → v1) = 1, I({} → v2) = 1

and I({v2} → v3) = 1 for Figure 2(c). This approach allows any pedigree to be mapped to

a set of variables of length n + n(n− 1) + 1

2n(n− 1)(n− 2) each with a value of zero or one.

However, the mapping is not one-to-one as most such sets of variables do not correspond to a

valid pedigree; there is no pedigree that corresponds to all variables being ones, for example.

Noting that I(W → v)(G) only takes the value 1 when W = Pa(v,G), the pedigree

log-likelihood shown in Equation (2) can now be rewritten in terms of these binary variables

as

l(G) = log L(G) =∑

v,W

log τ(v,W ) [I(W → v)(G)] . (4)

The maximum likelihood pedigree reconstruction problem is thus reformulated as a prob-

lem of finding an instantiation of I(W → v) to maximise the quantity shown in Equation (4)

subject to the constraint that this instantiation represents a valid pedigree, G. Since all

the variables are integer-valued and the objective function is linear in these variables, this

satisfies the definition of an integer linear programming problem as long as the constraints

on the I(W → v) can also be expressed as linear equations and inequalities. We show below

that such an expression is indeed possible.

Constraints are required to rule out sets of variable assignments that do not correspond

to a valid pedigree. There are three properties that must be met for a set of assignments

to encode a valid pedigree. First, each individual must have exactly one set of parents.

Second, no individual can ever be his own ancestor. Third, it must be possible to assign sex

consistently to each member of the pedigree. We will now show how each of these properties

can be expressed in the form of linear constraints.

Ensuring one set of parents The most basic constraint, as noted above, is that each in-

dividual, v, has exactly one parent set. This can be expressed concisely by n linear equations

of the form

10

∀v :∑

W

I(W → v) = 1 (5)

For our simple example, these three constraints are as follows:

I({} → v1) + I({v2} → v1) + I({v3} → v1) + I({v2, v3} → v1) = 1

I({} → v2) + I({v1} → v2) + I({v3} → v2) + I({v1, v3} → v2) = 1

I({} → v3) + I({v1} → v3) + I({v2} → v3) + I({v1, v2} → v3) = 1.

To understand why these constraints work, note that for a valid pedigree, exactly one of

the variables in each equation will be equal to 1 (the variable encoding the correct parent

set) and the remaining variables will each be equal to 0. If a variable were to have no parent

set, then the equation for that variable would have its left hand side equal to 0 and hence the

constraint would be violated. Similarly, if a variable were assigned two or more parent sets,

the left hand side would sum to more than 1 and again the constraint would be violated.

Ruling out cycles While an instantiation of the I(W → v) satisfying Equation (5) will

represent a graph where no node can have more than one parent set, these constraints are

not enough to rule out graphs with (directed) cycles i.e. where an individual node can be its

own ancestor or descendant. In our example, setting I({v3} → v1) = 1, I({v1} → v2) = 1

and I({v2} → v3) = 1 would clearly satisfy the constraints in Equation (5) but does not

result in a valid pedigree.

One way to rule out directed cycles, and the approach we use in this paper, is to assign

generation numbers to the individuals. A founder member has generation number zero and

every other individual must have a generation number that is greater than that of each of

its parents. In the absence of any prior knowledge on the maximum generation number, m,

it is possible to use m = n − 1.

For each individual v in the problem, let us introduce a new variable, gen(v), to represent

the generation number of that individual. If gen(v) denotes the generation number assigned

to individual v and u is a parent of v, then we must find assignments to these variables such

11

that gen(v) ≥ gen(u) + 1. Consider the following n(n − 1) constraints:

∀u, v : gen(v) − gen(u) ≥ −m + (m + 1)∑

W :u∈W

I(W → v) (6)

for two distinct individuals labelled u and v. Note that if u is not a parent of v, the

summation∑

W :u∈W I(W → v) on the RHS is 0 and the constraint is vacuously satisfied.

If u is a parent of v, the summation — and hence the entire RHS — is 1 effecting the

desired constraint, gen(v)−gen(u) ≥ 1, on a parent-child relationship. Furthermore, for any

ancestor w of v, it follows that gen(w) < gen(v). A directed cycle necessarily implies that

at least two individuals are their own ancestors leading to an obvious inconsistency. The

constraints in Equation (6) are hence sufficient to rule out such cycles.

The form given in Equation (6) does not correspond to that required for an ILP problem,

but after rearrangement we can write:

∀u, v : (m + 1)∑

W :u∈W

I(W → v) − gen(v) + gen(u) ≤ m (7)

which is in the desired form.

Returning to the example with n = 3 individuals, the 6 new constraints obtained are:

3 [I({v2} → v1) + I({v2, v3} → v1)] − gen(v1) + gen(v2) ≤ 2

3 [I({v1} → v2) + I({v1, v3} → v2)] − gen(v2) + gen(v1) ≤ 2

3 [I({v3} → v1) + I({v2, v3} → v1)] − gen(v1) + gen(v3) ≤ 2

3 [I({v1} → v3) + I({v1, v2} → v3)] − gen(v3) + gen(v1) ≤ 2

3 [I({v3} → v2) + I({v1, v3} → v2)] − gen(v2) + gen(v3) ≤ 2

3 [I({v2} → v3) + I({v1, v2} → v3)] − gen(v3) + gen(v2) ≤ 2

Ensuring sex consistency

[Figure 3 about here.]

The constraints given by (5) together with those in (7) are sufficient to ensure that only

variable assignments encoding directed acyclic graphs are allowed. However, not all such

12

graphs are valid pedigrees. The simple example in Figure 3 (taken from Cowell [2009]) shows

a DAG that is not a valid pedigree. The problem arises because of the fact that each of the

parent pairs (1, 2), (1, 3) and (2, 3) must be of opposing sexes and these clearly cannot be

consistently assigned.

In order to ensure that a sex can be consistently assigned to all individuals in the pedigree,

a further binary variable If (u) is created for each individual u such that If (u) = 1 if and

only if u is a female. Constraint (8) states that if an individual v has two parents, then

at most one is female and constraint (9) states that in this situation at least one parent is

female. Note that in both cases, if I({u,w} → v) = 0 (i.e. v does not have {u,w} as parents)

then the constraints are vacuously satisfied.

∀u, v, w : I({u,w} → v) + If (u) + If (w) ≤ 2 (8)

∀u, v, w : I({u,w} → v) − If (u) − If (w) ≤ 0. (9)

With these constraints in place, the maximum (log) likelihood pedigree reconstruction

problem can now be reformulated as follows from Equation (4):

Maximise∑

v,W log τ(v,W )I(W → v)

subject to

∀v :∑

W I(W → v) = 1

∀u, v : (m + 1)∑

W :u∈W I(W → v) − gen(v) + gen(u) ≤ m

∀u, v, w : I({u,w} → v) + If (u) + If (w) ≤ 2

∀u, v, w : I({u,w} → v) − If (u) − If (w) ≤ 0

where

|W | ≤ 2

∀v,W : I(W → v) ∈ {0, 1} ∀v : If (v) ∈ {0, 1} ∀v : gen(v) ∈ [0,m].

(10)

Any assignment to the variables that meets these constraints corresponds to a valid

13

pedigree. A particular assignment to the variables that also maximises the objective function

is guaranteed to be a maximum likelihood pedigree.

Obviously, for the tiny running example of n = 3 individuals just considered, one could

easily construct and score each of the 25 possible pedigrees and avoid the procedure de-

scribed above. However, that would not scale to larger numbers of individuals. For the ILP

formulation with n individuals, there are only 1

2n3 − 1

2n2 + 3n variables and n3 − 2n2 + 2n

constraints so, in principle, the method should extend to reasonably large problems.

Note that, for any given situation, the number of variables and constraints to be consid-

ered may in fact be lower. We consider a genetic model in which there is no mutation; this

means that τ(v,W ) = 0 for many combinations of parent set W and child u. A valid pedigree

will hence not contain such relationships and the value of the corresponding I(W → v) has

to be 0. Once these variables are fixed at 0, some of the constraints featuring these variables

may be provably satisfied by any valid assignment of values to the remaining variables. Such

redundant constraints can then be removed.

2.3 Additions and improvements to the ILP formulation

Having formulated the problem of finding the most likely pedigree as an ILP problem, we

introduce some slight modifications that improve the solving performance and allow for the

k most likely pedigrees to be found.

At least one founder Preliminary experiments revealed that explicitly adding the implied

constraint that there has to be at least one pedigree founder greatly increases the performance

of the algorithm. This is given as

v

I({} → v) ≥ 1. (11)

This is always trivially true since pedigrees are acyclic and so a maximum likelihood

pedigree will still be returned.

Finding the kth most likely pedigree As already noted, finding a single maximum

likelihood reconstruction does not take account of model uncertainty, and for this reason,

14

we wish to find multiple high likelihood pedigrees. Specifically, we want to find the k most

likely pedigrees, for some arbitrary value k. This can be done by repeatedly solving the ILP

problem given in (10), adding one additional constraint each time to prevent a previously

found solution from being returned again. In this way, repeated solutions of the problem

yield valid pedigrees in order of descending likelihood.

The additional constraint added each time is as follows:

∀ W, v s.t. I(W → v)(G) = 1 :∑

W,v

I(W → v) < n (12)

where G is the previously found pedigree. The constraint simply states that at least one

parent set in the new pedigree has to differ from the previous one. For the kth most likely

pedigree, there will be k−1 such constraints, one for each of the previously found pedigrees.

2.4 Marker data and pedigree structures

In order to assess how well the method performs, the results reported in Section 3 are based

on simulated datasets about which there is no uncertainty. To generate each dataset, genetic

profiles were created for each individual of a given pedigree based on a given distribution of

population allele frequencies under the assumptions stated in Section 2.1. We simulated 100

datasets for each combination of pedigree structure and set of allele frequencies considered.

Allele Frequencies We simulated data from two sets of allele frequencies, both based on a

typical forensic set of microsatellite markers. We used the original (not rounded) Caucasian

allele frequencies for the 15 autosomal short tandem repeat loci (the 13 CODIS core loci plus

D19S433 and D2S1338) corresponding to Table 1 of Butler et al. [2003] as an example of

a realistic distribution. These allele frequencies were previously used for the same purpose

in Cowell [2009]. For our second set of frequencies, we considered the same 15 marker set

but assigned equal probabilities to the listed alleles for each marker. This is what we will

refer to as our ‘uniform frequency distribution’. In both cases, we thus have 15 markers with

allele numbers ranging between 7 and 15.

15

Pedigrees

[Figure 4 about here.]

[Figure 5 about here.]

[Figure 6 about here.]

Several pedigree structures are considered. For direct comparison with the results re-

ported in Cowell [2009], we used the two small pedigrees of 20 individuals considered in

that paper and reproduced in Figure 4. As an example of a larger structure, we took the

59-member pedigree of Almudevar [2003], reproduced here in Figure 5, as a reasonable repre-

sentation of the complexity to be expected in a typical human pedigree. In order to consider

a pedigree complexity effect, we also considered the cyclic half-sib, cyclic first cousin and

quadruple second cousin regular mating structures of Wright [1921], each with 64 individuals

i.e. 7 generations descended from 8 founders. The half-sib structure is the most inbred of

the three. However, although there are no matings between close relatives in the quadru-

ple second cousin pedigree shown in Figure 6, it requires only four generations before all

individuals are related to all eight founders.

3 Results

Initial reported results using integer linear programming for maximum likelihood pedigree

reconstruction are encouraging [Cussens, 2010]. Here, we will perform a more detailed ex-

ploration where we aim to assess how well our algorithm performs on a range of pedigrees.

We will consider the time needed by this approach to find the most likely pedigree and also

the time needed to find the kth most likely pedigree.

Further to this, we examine the suitability of any approach which searches for maximum

likelihood pedigrees by examining the similarity between the real pedigree and the maximum

likelihood one. This is developed into considering the distribution of likelihood over the

k most likely pedigrees. Finally, we compare our approach to those presented by Cowell

[2009], Riester et al. [2009] and Almudevar [2003].

16

All reported results were produced on a normal desktop computer (2.8GHz quad-core

running Windows 7) with the Gurobi ILP solver, for which a free academic licence is available.

3.1 Solution Times

For the four larger pedigrees (the 59 member one shown in Figure 5 and the three regular

mating structures studied), we first considered the time taken to find a most likely pedigree

for the 100 data sets simulated from both sets of allele frequencies. Side by side boxplots of

these time distributions are shown in Figure 7.

[Figure 7 about here.]

The results show that a maximum likelihood solution was usually found very quickly with

the most extreme case taking about 1.3 seconds when uniform allele frequencies were used.

Note that, despite the fact that we have differing numbers of alleles at each locus, uniform

allele frequencies clearly make the reconstruction problem easier. Typical solving times were

always slower and much more variable with realistic frequencies on all four pedigrees. The

cyclic half-sib pedigree proved to be the most challenging structure for the algorithm, in

terms of solving time, with an extreme value of 13 seconds when marker data were simulated

from the realistic allele frequency distribution. The pedigree in Figure 5 and the quadruple

second cousin structure in Figure 6 were the easiest with the maximum likelihood solution

always found within a second.

[Figure 8 about here.]

As noted earlier, a single maximum likelihood reconstruction is not always particularly

useful so we now consider the 100 most likely pedigrees. Figure 8 shows the median times

taken between finding the (k − 1)th and the kth most likely pedigree for all four structures

and both sets of allele frequencies. The overall pattern is that the maximum likelihood

pedigree (i.e. the first of the 100 pedigrees) is found most quickly and solving times tend to

increase for each additional pedigree sought. Solving times are slower for the realistic allele

frequencies than for the uniform ones and are more variable for the pedigree of Figure 5 and

the half-sib structure than for the other two regular structures. The cyclic half-sib structure

is markedly more difficult to solve with increasing number of pedigrees found.

17

3.2 Accuracy of the reconstructed pedigree

Assessing the value of a particular reconstruction depends to some extent on what it is

required for. Often, as would be the case for demographic research or ecological studies that

aim to design a breeding program, it is more important that the estimated pedigree reflect

general characteristics of the true structure, such as correct distribution of sibship sizes,

correct levels of multiple marriages and overall connectedness of the sampled individuals,

rather than be accurate in every detail [Thompson, 1986]. There are many such criteria

that may be important and these will differ between applications. We therefore focus on

a general property for assessing the quality of the reconstruction, while noting that this

may not be the best criterion in any given situation. Comparing the number of erroneous

parent-offspring links (or incorrect graph edges) is an obvious statistic and is commonly

used in BN learning. However, it should be noted that although a maximum likelihood

pedigree is often a reasonable approximation to the true pedigree, whether it is or not has

no bearing on the speed and exactness of a method for finding it. Our goal is to find high

likelihood pedigrees quickly so it could be argued that error rates such as the mean number

of parental misspecifications are not the appropriate statistics for evaluating the success of

a likelihood approach. Nevertheless, they can be useful — particularly when several high

likelihood pedigrees are considered.

[Table 1 about here.]

Table 1 shows the sample distribution of incorrect parent-offspring links among the 100

maximum likelihood pedigrees for each of the 8 combinations of allele frequency and pedigree

structure. Specifically, we define a link to be incorrect if it is either missing or incorrectly

attributed. The maximum number of such errors is twice the number of individuals. As can

be seen, the maximum likelihood pedigrees had considerably fewer incorrect links when the

data were simulated under uniform allele frequencies than for the case with realistic frequen-

cies. The reconstruction of the pedigree in Figure 5, which is a more realistic representation

of a human pedigree structure, was most often correct or close to correct. Quadruple second

cousins were the most accurate of the regular mating systems we considered by this mea-

sure. The cyclic half-sib maximum likelihood reconstruction was the least likely to be the

18

true structure with as many as 8 misspecified parent-offspring links even for the uniform

allele frequency scenario. Accuracy of reconstruction appears to deteriorate with increasing

levels of inter-relatedness. From the form of the optimisation function in (4), the algorithm

is choosing the ‘best’ parentage for each individual. The more inter-related the individuals,

the more genetically consistent choices there are for parent pairs or triplets. This is consis-

tent with the observation in Thompson [1986] that high exclusion probabilities for incorrect

parents are required in order to obtain an accurate reconstruction.

[Table 2 about here.]

It is reasonable to expect that while the true pedigree may not always have maximum

likelihood, the inferred relationships should include many of the true parent-child links and

so the true pedigree should often feature among the most likely ones. Table 2 shows where

the true pedigree was found among the 100 most likely pedigrees for all 100 simulated data

sets for each scenario considered. For the uniform allele frequency distribution, the maximum

likelihood pedigree was actually the true pedigree 95% of the time for the pedigree of Figure 5.

The story was quite different for the realistic frequencies where the true pedigree did not

appear in the top 100 reconstructions for 2 of the runs. There was a similar trend for the

regular mating structures with the cyclic half-sib pedigree being the hardest to get right: the

true structure was the first one found in 2 runs but 76 had failed to find it among the first

100 for the realistic allele frequency distribution. These results highlight the importance

of being able to generate multiple high probability pedigrees rather than rely on a single

reconstruction.

3.3 Distribution of likelihoods

[Figure 9 about here.]

In order to get an idea of what the search space of pedigree likelihoods looks like, we

investigated the relative distribution of the likelihoods of the top 100 pedigrees. Because

likelihoods vary considerably with different combinations of simulated data and pedigree

structure, we divided the likelihood of the kth most likely pedigree in each of the 100 runs

19

by the value of the likelihood for the first pedigree found in that run. Examples of how

variable these are for some particular data sets are given in Figure 9 for the first cousin

64-member pedigree. As can be seen, the likelihoods tend to drop sharply soon after the top

pedigree is found, plateau out and then drop again several times to a tail of low likelihood

pedigrees. Median values across the 100 simulated datasets with interquartile range error

bars at selected intervals are displayed in Figure 10 for all four pedigree structures using the

uniform (a) and realistic (b) allele frequencies. The y-axis has been plotted on a log scale to

stretch out the lower values because of the sharp drop in likelihood at the outset.

[Figure 10 about here.]

Likelihoods drop more quickly for the uniform allele frequencies, again reflecting the fact

that maximum likelihood pedigree reconstruction is easier with uniform frequencies than

with more realistic ones. For our set of realistic allele frequencies, the likelihoods of the

reconstructed cyclic half-sib pedigrees are not surprisingly the slowest to drop and are by far

the least variable indicating the difficulty of the optimisation problem.

3.4 Comparison with other approaches

There are three other notable approaches to maximum likelihood pedigree reconstruction [Al-

mudevar, 2003; Cowell, 2009; Riester et al., 2009] against which we will now compare our

approach.

The method described by Almudevar [2003] relies on simulated annealing to search for a

maximum likelihood pedigree. However, any method such as this that uses a heuristic search

cannot be guaranteed to find the actual maximum likelihood pedigree. The results shown in

Figure 10 illustrate the particular problem with this in the context of pedigree reconstruction;

the most likely pedigree can be orders of magnitude more likely than the next few most likely.

Methods that do not necessarily find the most likely pedigree may instead find pedigrees

significantly less likely than the best. Calculation times reported in Almudevar [2003] are

of the order of minutes, compared to sub-second times for reconstructing the same pedigree

using our method. It should also be noted that the evaluation presented in Almudevar [2003]

20

is conducted under the assumption of uniform allele frequency. Our results show that this

makes the pedigree reconstruction problem easier than it would be using realistic frequencies.

As for our method, the dynamic programming approach of Cowell [2009] guarantees

a maximum likelihood pedigree. However, that approach is restricted to small pedigrees

due to storage requirements (i.e. up to about 30 individuals). For direct comparison, the

dynamic programming approach to maximum likelihood pedigree reconstruction of Cowell

[2009] and the ILP algorithm of Section 2.2 were both run on 100 simulated data sets for

both sets of allele frequencies on the two pedigrees in Figure 4. Both algorithms were quick

(less that 1 second) to find a single maximum likelihood pedigree. Solving times for the

dynamic programming method were all consistently between 0.45 to 0.5 seconds while the

ILP approach was noticeably faster with 90% of the runs succeeding in less than 0.25 seconds.

The dynamic programming approach occasionally found a sex-inconsistent pedigree but this

never happened with the ILP algorithm as the search was constrained to rule out such

structures. Solving time was a little longer for the more inbred pedigree of Figure 4(b) and

this became more apparent when many high probability pedigrees were sought with the ILP

algorithm. In terms of the number of correctly inferred parent-offspring relationships, the

maximum likelihood pedigree was always very close to the true pedigree structure with ≤ 3

different links 95% of the time. The true pedigree almost always featured among the 10 most

likely pedigrees when these could be found and, not unexpectedly, the likelihood distribution

dropped off dramatically after the top 10 pedigrees, or so. For cyclic half-sib structures of

100 individuals, we found that the algorithm took considerably longer to find a maximum

likelihood pedigree. For realistic allele frequencies, 75% of the runs found a solution in less

than 100 seconds but some took as long as 4 hours. This is not too surprising, given that this

was by far the hardest structure we considered, and we would expect much better results for

less complex structures.

Riester et al. [2009] present a software tool for pedigree reconstruction. At the core of the

tool, a solution is found using either the method of Almudevar [2003] or that of Cowell [2009].

Hence, the tool suffers from the disadvantages discussed above: either the pedigree cannot

be guaranteed to be the most likely or the number of individuals that can be considered is

seriously limited. However, it does incorporate several features that we have not addressed.

21

The likelihoods of parentage triples are modified in a principled way to admit the possibility

of genotyping errors and to compensate for a finite number of mating individuals in the

population. These likelihoods are also modified during the search process based on the

current beliefs about population allele frequencies, rather than assuming such frequencies

are known accurately a priori. The tool also allows for potential missing data for some

individuals by repeatedly estimating values for such data during the search process using

Gibbs sampling.

To the best of our knowledge, our ILP algorithm is the only approach to maximum

likelihood pedigree reconstruction that can deliver the k most likely pedigrees and can thus

address the issue of model uncertainty.

4 Discussion

4.1 Extensions to more complex problems

The method we have presented here performs well under the assumptions of a complete sam-

ple, independently segregating genetic markers and Hardy-Weinberg proportions for founder

genotypes. In particular, when complete marker data are observed for all individuals, an

offspring genotype is independent of all non-descendant genotypes conditional on the geno-

types of its parents. This is generally known as the (directed local) Markov property for

Bayesian networks and results in the decomposition of the (log) likelihood function in Equa-

tions (1) and (4) which is the core of our ILP formulation. As noted in Lauritzen and Sheehan

[2003], Mendelian inheritance of offspring genotypes from parental types is also implicit in

this decomposition. Contrary to the statement in Cowell [2009], the desired decomposition

breaks down when markers are linked and siblings are no longer conditionally independent

given their parents. This can be easily seen for the case of two linked marker loci with a

recombination fraction r and alleles A1, A2, A3 and B1, B2, B3, respectively. Consider two

offspring from a doubly heterozygous (A1A2, B1B2) and doubly homozygous (A3A3, B3B3)

mating where parental phase is unknown. We are interested in the probability of the second

offspring, I2, having genotype (A1A3, B1B3). If the first offspring, I1 is also (A1A3, B1B3),

22

P (I2 = (A1A3, B1B3)|I1 = (A1A3, B1B3)) =1

4(1 − r)2 +

1

4r2

i.e. both offspring inherited both alleles from the same chromosome (no recombination) or

they both inherited them from different chromosomes. If I1 is (A1A3, B2B3), we have that

P (I2 = (A1A3, B1B3)|I1 = (A1A3, B2B3)) =1

2r(1 − r)

i.e. there was a recombination for one offspring and not for the other. In order for the Markov

property to hold, both these probabilities should be equal. This is clearly true only when

r = 1

2i.e. when the loci are unlinked.

Complete data will generally not be available when constructing pedigrees for linkage

analysis or for identifying distant relatives for homozygosity mapping from human population

studies. The same is true for wildlife management and forensic science applications. In these

cases, unobserved individuals that form ‘missing links’ between the individuals of interest

have to be accounted for and the likelihood in Equation (1) has to be modified by summing

over all possible combinations of genotypes on the unobserved individuals [Thompson, 2000].

The more unobserved individuals, the larger the sum and the more difficult the likelihood

computation. In order to extend our approach to include individuals with wholly or partially

missing marker data, the missing information could be represented using additional variables

so that the log-likelihood remains linear. The number of additional variables required should

be modest if there is no linkage and the approach should be feasible given the observed

efficiency of the algorithm. An alternative method is suggested by Riester et al. [2009]

whereby missing data are estimated during the reconstruction using Gibbs sampling.

Existing likelihood approaches to pedigree construction are all based on independently

segregating genetic markers. Likelihood calculations are hence simplified, as was the case

in this paper, since only one marker has to be considered at any one time. There are some

relationships, however, that have the same likelihood and are hence indistinguishable for

any number of unlinked markers, but are distinguishable with linked markers [Thompson,

1975; Egeland and Sheehan, 2008]. Given the ever increasing availability of dense sets of

SNPs from genetic association studies, reconstruction algorithms should aim to make use of

23

linked markers. Correlated markers naturally lead to a reduction in discriminatory power

when compared with (the same number of) unlinked markers so it is to be expected that

many more markers will be required. Preliminary work for distinguishing between particular

alternatives for pairwise relationships has shown that huge numbers of markers, as are now

readily available for genetic association studies, can be used to distinguish between quite

distant relationships [Skare et al., 2009]. In order to adapt our method for handling linked

markers, we could adopt a similar strategy to that described above for missing data by repre-

senting the unknown phase information with additional variables. This restores the desired

decomposition of the likelihood function and the optimisation procedure now returns a most

probable combination of pedigree structure and haplotypes. Note that the observed (un-

ordered) marker data put tight constraints on the possible haplotypes and these constraints

are easy to express in the ILP framework. It is no surprise that ILP has been successfully

used for inferring haplotypes in the case where the true pedigree is known [Brown and Har-

rower, 2006]. Likelihood calculations will be much more intensive for linked markers and we

may realistically need to consider approximations.

Finally, in addition to genetic marker data, prior population demographic information on

breeding patterns, average numbers of offspring and cultural prejudices, for example, as well

as specific knowledge about particular relationships may often be available in practice. It is

important to include such information when reconstructing pedigrees since structures that

fit the marker data the best may be unreasonable in other respects, such as by representing

improbably high levels of interrelatedness, for the population in question [Egeland et al.,

2000]. More importantly, a formal Bayesian approach permits a principled way of quantifying

model uncertainty using Bayesian model averaging [Kass and Raftery, 1995; Hoeting et al.,

1999].

Many existing approaches [Hadfield et al., 2006; Neff et al., 2001; Riester et al., 2009;

Thompson, 1976a; Thompson and Meagher, 1987] do incorporate some prior information,

but often in an informal way and at an interim stage of the reconstruction, for example

when the likelihood approach is favouring what is clearly an undesirable structure [Sheehan

and Egeland, 2007]. If sensible priors could be assigned to pedigrees [Egeland et al., 2000;

Sheehan and Egeland, 2007], the likelihood of any candidate pedigree L(G), based on the

24

observed marker data, can be combined with a prior probability π(G) to derive the posterior

probability p(G | data) ∝ π(G)L(G) using Bayes’ theorem. The pedigree reconstruction prob-

lem then reduces to a search for a pedigree with maximal posterior probability, often called

MAP estimation. This is equivalent to likelihood estimation in the case of a uniform prior

distribution over all possible pedigrees.

Extending our approach to deal with priors should be quite simple. If the prior encodes

‘hard’ information (e.g. an individual is female, or a particular individual is the parent of

another), this can be added to the ILP formulation as additional constraints. If the prior

encodes ‘soft’ information (e.g. a probability distribution over sibship sizes), it can be shown

that this can be accurately incorporated into the problem through the addition of extra log

prior terms to the objective function, provided that the prior can be encoded as a linear

function.

4.2 Conclusions

Family relationships, and hence pedigrees, are central to all gene mapping studies [Day-

Williams et al., 2011]. Establishing pedigree information through interviews and historical

records is time consuming and so it is appealing to be able to reconstruct pedigrees from

genetic marker data using available large population studies, especially since such studies

have a reasonable chance of representing extended pedigree information which is otherwise

difficult to obtain [Browning and Browning, 2010]. The ILP approach to pedigree reconstruc-

tion that we have presented in this paper has been shown to out-perform other approaches

in the standard setting where all individuals are observed, segregation is Mendelian, markers

are independent and founder genotypes are in Hardy-Weinberg equilibrium. It also extends

to quite large pedigrees, guarantees a maximum likelihood solution and can deal with un-

certainty in the reconstruction by providing the k most likely pedigrees. Most importantly,

we have a framework that seems more amenable to adaptation to harder, and hence more

interesting, reconstruction problems. Our algorithm usually finds a maximum likelihood

pedigree efficiently but is sensitive to the allele frequency distribution and the complexity of

the pedigree structure. We would warn that the use of uniform allele frequencies, as com-

monly arises in simulation studies, can be misleading as they make the problem deceptively

25

easier.

It can be argued that likelihood approaches are not necessarily the most useful for pedi-

gree estimation, especially when general features or specific sections of the pedigree are of

more interest that the overall structure. Comparison of estimates is much less straightfor-

ward in these situations, although realising an appropriate one is generally simpler. For small

populations, multidimensional scaling of pairwise kinship coefficients has been suggested as

one way of representing general population structure [Thompson, 1986]. Where the overall

structure is of interest, however, we have verified that maximising the likelihood is a good

approach. Unless the structure is extremly complex, the maximum likelihood pedigree is

typically quite similar to the true structure with regard to the number of parent-offspring

links that are correctly assigned and usually features quite early in the list of most likely

pedigrees. We agree with the comment in Thompson [1986] that a single maximum likeli-

hood estimate may not be particularly helpful but useful inferences should be possible from

several high probability pedigrees. For example, reasonable probability estimates of particu-

lar pedigree features can be acquired from the proportion of high probability structures that

contain them.

Pedigree reconstruction has also been proposed using a two-step approach that involves

estimation of pairwise kinship coefficients followed by clustering of individuals into pedigrees

using graph learning methods [Cowell and Mostad, 2003; Day-Williams et al., 2011]. This ap-

proach is also sensitive to allele frequencies and the choice of cut-off value used to determine

the graph edges. For quantitative trait locus (QTL) linkage analysis, pedigree information is

only incorporated via local and global pairwise kinship coefficients so estimating these quan-

tities from marker data using dynamic programming and a method of moments approach,

respectively, can obviate the need for standard pedigrees in such analyses [Day-Williams

et al., 2011]. In fact, it has also been noted by Choi et al. [2009] that estimated kinship

coefficients can often be more accurate than pedigree-based quantities because they exploit

the observed sharing of genes between two individuals rather than the expected sharing im-

plied by the pedigree structure. The ILP approach presented here also uses the observed

sharing but has the advantage of considering all related individuals jointly in the likelihood

calculation. We would thus expect it to have an advantage over approaches based on pair-

26

wise estimates and, indeed, in the simple standard scenario we considered here, we do not

think that one could do much better. However, we need to extend the method to deal with

unobserved individuals and linked markers before a sensible comparison can be made. Also,

when extended pedigrees are of interest, genotyping error and/or mutation rates should also

be included. These are much harder reconstruction problems but ILP would seem to provide

the right framework in which to address them.

Acknowledgements

The authors would like to thank Robert Cowell for giving access to his code.

The authors acknowledge support from the Medical Research Council (Project Grant G1002312),

the Leverhulme Trust (Research Fellowship RF/9/RFG/2009/0062) and the BioSHaRE-EU

project (HEALTH-F4-2010-261433) funded by the European Commission under the Seventh

Framework Program (FP7/2007-2013).

References

Almudevar, A. (2003). A simulated annealing algorithm for maximum likelihood pedigree

reconstruction. Theoretical Population Biology, 63:63–75.

Almudevar, A. (2007). A graphical approach to relatedness inference. Theoretical Population

Biology, 71:213–229.

Blouin, M. S. (2003). DNA-based methods for pedigree reconstruction and kinship analysis

in natural populations. Trends in Ecology and Evolution, 18:503–511.

Brown, D. and Harrower, I. (2006). Integer programming approaches to haplotype inference

by pure parsimony. IEEE/ACM Transactions in Computational Biology and Bioinformat-

ics, 3(2):141–154.

Browning, S. R. and Browning, B. L. (2010). High-resolution detection of identity by descent

in unrelated individuals. American Journal of Human Genetics, 86:526–539.

27

Butler, J. M., Schoske, R., Vallone, P. M., Redman, J. W., and Kline, M. C. (2003). Allele

frequencies for 15 autosomal STR loci on U.S. Caucasian, African American, and Hispanic

populations. Journal of Forensic Sciences, 48(4).

Cannings, C. and Thompson, E. A. (1981). Genealogical and Genetic Structure. Cambridge

University Press.

Cannings, C., Thompson, E. A., and Skolnick, M. H. (1978). Probability functions on

complex pedigrees. Advances in Applied Probability, 10:26–61.

Choi, Y., Wijsman, E. M., and Weir, B. S. (2009). Case-control testing in the presence of

unknown relaionships. Genetic Epidemiology, 33:668–678.

Cowell, R. G. (2009). Efficient maximum likelihood pedigree reconstruction. Theoretical

Population Biology, 76(4):285–291.

Cowell, R. G. and Mostad, P. (2003). A clustering algorithm using DNA marker information

for sub-pedigee reconstruction. Journal of Forensic Sciences, 48:1239–1248.

Cussens, J. (2010). Maximum likelihood pedigree reconstruction using integer programming.

In Workshop on Constraint Based Methods for Bioinformatics (WCB-10), Edinburgh.

Day-Williams, A. G., Blangero, J., Dyer, T. D., Lange, K., and Sobel, E. M. (2011). Linkage

analysis without defined pedigrees. Genetic Epidemiology, 35(5):360–370.

Egeland, T., Mostad, P. F., Mevag, B., and Stenersen, M. (2000). Beyond traditional

paternity and identification cases: selecting the most probable pedigree. Forensic Science

International, 110:47–59.

Egeland, T. and Sheehan, N. (2008). On identification problems requiring linked markers.

Forensic Science International: Genetics, 2:219–225.

Genin, E. and Clerget-Darpoux, F. (1996). Consanguinity and the sib-pair method: an

approach using identity by descent between and within individuals. American Journal of

Human Genetics, 59:1149–1162.

28

Hadfield, J. D., Richardson, D. S., and Burke, T. (2006). Towards unbiased parentage

assignment: combining, genetic, behavioural and spatial data in a Bayesian framework.

Molecular Ecology, 15:3715–3730.

Hoeting, J. A., Madigan, D., Raftery, A. E., and Volinsky, C. T. (1999). Bayesian model

averaging: a tutorial. Statistical Science, 14(4):382–401.

Jones, A. G. and Ardren, W. R. (2003). Methods of parentage analysis in natural populations.

Molecular Ecology, 12:2511–2522.

Karp, R. (1972). Reducibility among combinatorial problems. In Miller, R. and Thatcher,

J., editors, Complexity of Computer Computations, pages 85–103. Plenum Press.

Kass, R. E. and Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical

Association, 90(430):773–795.

Lange, K. and Elston, R. C. (1975). Extensions to pedigree analysis. I. Likelihood calculations

for simple and complex pedigrees. Human Heredity, 25:95–105.

Lauritzen, S. L. (1996). Graphical Models. Clarendon Press, Oxford, United Kingdom.

Lauritzen, S. L. and Sheehan, N. (2003). Graphical models for genetic analyses. Statistical

Science, 18:489–514.

Leutenegger, A. L., Prum, B., Genin, E., Verny, C., Lemainque, A., Clerget-Darpoux, F.,

and Thompson, E. A. (2003). Estimation of the inbreeding coefficient through use of

genomic data. American Journal of Human Genetics, 73:516–523.

McPeek, M. S. and Sun, L. (2000). Statistical test for detection of misspecified relationships

by use of genome-screen data. American Journal of Human Genetics, 66:1076–1094.

Meagher, T. R. and Thompson, E. A. (1987). Analysis of parentage for naturally established

seedlings of Chamaelirium luteum (liliaceae). Ecology, 68:803–812.

Neff, B. D., Repka, J., and Gross, M. R. (2001). A Bayesian framework for parentage

analysis: the value of genetic and other biological data. Theoretical Population Biology,

59:315–331.

29

Newman, D. L., Abney, M., McPeek, M. S., Ober, C., and Cox, N. J. (2001). The importance

of genealogy in determining genetic associations with complex traits. American Journal

of Human Genetics, 69:1146–1148.

Olaisen, B., Stenersen, M., and Mevag, B. (1997). Identification by DNA analysis of the

victims of the August 1996 Spitsbergen civil aircraft disaster. Nature Genetics, 15:402–405.

Powell, J. E., Visscher, P. M., and Goddard, M. E. (2010). Reconciling the analysis of IBD

and IBS in complex trait studies. Nature Reviews:Genetics, 11:800–805.

Riester, M., Stadler, P. F., and Klemm, K. (2009). FRANz: reconstruction of wild multi-

generation pedigrees. Bioinformatics, 25:2134–2139.

Sanchez, M., Givry, S. d., and Schiex, T. (2008). Mendelian error detection in complex

pedigrees using weighted constraint satisfaction techniques. Constraints, 13:130–154.

Sheehan, N. A. and Egeland, T. (2007). Structured incorporation of prior information in

relationship identification problems. Annals of Human Genetics, 71:501–518.

Sheehan, N. A. and Egeland, T. (2008). Adjusting for founder relatedness in a linkage

analysis using prior information. Human Heredity, 65(4):221–231.

Sieberts, S. K., Wijsman, E. M., and Thompson, E. A. (2002). Relationship inference from

trios of individuals in the presence of typing error. American Journal of Human Genetics,

70:170–180.

Skare, O., Sheehan, N., and Egeland, T. (2009). Identification of distant family relationships.

Bioinformatics, 25:2376–2382.

Stankovich, J., Bahlo, M., Rubio, J. P., Wilkinson, C. R., Thomson, R., Banks, A., Ring, M.,

Foote, S. J., and Speed, T. P. (2005). Identifying nineteenth century links from genotypes.

Human Genetics, 117:188–199.

Thomas, A. (1985). Data structures, methods of approximation and optimal computation for

pedigree analysis. PhD thesis, Cambridge University.

30

Thompson, E. A. (1975). The estimation of pairwise relationships. Annals of Human Ge-

netics, 39:173–188.

Thompson, E. A. (1976a). Inference of genealogical structure. Social Science Information,

15:477–526.

Thompson, E. A. (1976b). A paradox of genealogical inference. Advances in Applied Proba-

bility, 8:648–650.

Thompson, E. A. (1986). Pedigree Analysis in Human Genetics. The Johns Hopkins Uni-

versity Press, Baltimore.

Thompson, E. A. (2000). Statistical Inference from Genetic Data on Pedigrees, volume 6 of

NSF-CBMS Regional Conference Series in Probability and Statistics. Institute of Mathe-

matical Statistics and the American Statistical Association, Beachwood, Ohio, USA.

Thompson, E. A. (2008a). Analysis of data on related individuals through inference of iden-

tity by descent. Technical Report 539, Department of Statistics, University of Washington.

Thompson, E. A. (2008b). The IBD process along four chromosomes. Theoretical Population

Biology, 73:369–373.

Thompson, E. A. and Meagher, T. R. (1987). Parent and sib likelihoods in genealogy

reconstruction. Biometrics, 43:585–600.

Thornton, T. and McPeek, M. S. (2010). ROADTRIPS: Case-control association testing with

partially or completely unknown population and pedigree structure. American Journal of

Human Genetics, 86:172–184.

Wolsey, L. A. (1998). Integer Programming. Wiley-Interscience Series in Discrete Mathe-

matics and Optimization. Wiley.

Wright, S. (1921). Systems of mating. Genetics, 6:111–178.

31

(a) (b) (c)

Figure 1: A simple pedigree in standard form (a); with corresponding representations as a

directed acyclic graph (b); and marriage node graph (c).

32

(a) (b) (c)

v1v1v1

v2

v2 v2

v3

v3

v3

Figure 2: Three of the twenty-five possible pedigrees for 3 individuals, where individuals

other than v1, v2 and v3 are not represented

33

1 2 3

4 5 6

Figure 3: A simple DAG where no node has more than two parent nodes but which is not

a pedigree as sex cannot be consistently assigned. Since 2 and 3 have a child together, they

must be of opposite sex. However as they each have a child with 1, they must have the same

sex. It is clearly impossible to assign sexes that meet these contradictory constraints.

34

1 2

2 4 5 6

16

7 8 9 10 1211

201918171513 14

(a)

1 2 3 4 5

6 7 8 9 19

11 12 13 14 15

16 17 18 19 20

(b)

Figure 4: The two pedigrees of 20 individuals corresponding to Figures 2 and 3 in Cowell

[2009] relabelled and drawn as marriage node graphs. Note that the pedigree in (a) has one

first cousin marriage and two sets of half-siblings (offspring of 7 and 12). The pedigree in

(b) is more inbred.

35

1 6 8 9

12 13 14 15 16 17 18 21 22 23 24

25 26 30

35 36 37 38 40 42 43 44 47

48 50 51 53 54 55 56 57 59

2 73 4 5

10 11 19 20

27 28 29 31 32 33 34

39 41 45 46

49 52 58

Figure 5: A 59-member pedigree featuring some first cousin marriages and half-siblings taken

from Almudevar [2003].

36

1 2 3 4 5 6 7 8

71 72 73 74 75 76 77 78

61 62 63 64 65 66 67 68

51 52 53 54 55 56 57 58

4847464544434241

31 32 33 34 35 36 37 38

2827262524232221

11 12 13 14 15 16 17 18

Figure 6: A 64-member pedigree constructed as a quadruple second cousin regular mating

structure as described in Wright [1921].

37

0

0.5

1

1.5

2

2.5

3

HS U HS R FC U FC R SC U SC R A U A R

Tim

e (s

ec)

Figure 7: Boxplots (showing median, interquartile range, minimum and maximum) of the

time taken to find a most likely pedigree in 100 simulated data sets for the half-sibs (HS),

first cousins (FC), quadruple second cousins (SC) and Almudevar (A) pedigrees for both

uniform (U) and realistic (R) allele frequencies. Note that the maximum value for the half-

sib structure with realistic frequencies is outside the displayed range at 13 seconds.

38

0

0.5

1

1.5

2

2.5

3

3.5

0 20 40 60 80 100

Tim

e (s

ec)

kth most likely pedigree

Half SibsFirst Cousins

Second CousinsAlmudevar

(a)

0

0.5

1

1.5

2

2.5

3

3.5

0 20 40 60 80 100

Tim

e (s

ec)

kth most likely pedigree

Half SibsFirst Cousins

Second CousinsAlmudevar

(b)

Figure 8: Time taken to find the kth most likely pedigree after finding the (k − 1)th. The

lines show median solving times (out of 100 simulated data sets) for each of the four larger

pedigree structures for both uniform, (a), and realistic, (b), allele frequencies. Interquartile

range error bars are shown at 4 selected intervals to give an indication of the variability of

these solving times.

39

1e-008

1e-007

1e-006

1e-005

0.0001

0.001

0.01

0.1

1

0 10 20 30 40 50 60 70 80 90 100

Nor

mal

ised

like

lihoo

d

kth most likely pedigree

Figure 9: Likelihoods for each of the top 100 pedigrees, scaled by the highest observed

likelihood, for 5 datasets simulated on the cyclic first cousin regular mating structure from

the uniform allele frequency distribution. Note the logarithmic scale on the y axis.

40

1e-008

1e-007

1e-006

1e-005

0.0001

0.001

0.01

0.1

1

0 10 20 30 40 50 60 70 80 90 100

Nor

mal

ised

like

lihoo

d

kth most likely pedigree

Half SibsFirst CousinsSecond CousinsAlmudevar

(a)

1e-008

1e-007

1e-006

1e-005

0.0001

0.001

0.01

0.1

1

0 10 20 30 40 50 60 70 80 90 100

Nor

mal

ised

like

lihoo

d

kth most likely pedigree

Half SibsFirst CousinsSecond CousinsAlmudevar

(b)

Figure 10: Median values (log scale) of the scaled likelihoods of Figure 9 across all 100

simulated datasets for uniform (a) and realistic (b) allele frequencies on all four pedigrees.

Interquartile range error bars are shown at selected intervals.

41

Data set Number of edges incorrectPedigree Alleles 0 1 2 3 4 5 6 7 8 >8Half Sibs Uniform 21 27 19 13 10 4 2 3 1 0Half Sibs Realistic 2 3 13 8 17 12 11 10 4 20First Cousins Uniform 59 32 4 3 2 0 0 0 0 0First Cousins Realistic 19 26 18 12 10 4 8 0 2 1Second Cousins Uniform 77 17 5 0 1 0 0 0 0 0Second Cousins Realistic 32 22 26 9 3 4 1 0 1 2Almudevar Uniform 95 4 1 0 0 0 0 0 0 0Almudevar Realistic 46 31 18 3 1 0 1 0 0 0

Table 1: Number of edges that differ between the true pedigree and the most likely pedi-

gree found for each of the 100 simulated data sets in each scenario. Each incorrect edge

corresponds to a mistake in the assignment of a single parent-child relationship.

42

Data set Position of real pedigreePedigree Alleles 1 ≤5 ≤10 ≤50 ≤100 >100Half Sibs Uniform 21 48 63 83 87 13Half Sibs Realistic 2 6 8 18 24 76First Cousins Uniform 59 91 97 99 99 1First Cousins Realistic 19 47 54 67 72 28Second Cousins Uniform 77 100 100 100 100 0Second Cousins Realistic 32 60 66 85 89 11Almudevar Uniform 95 100 100 100 100 0Almudevar Realistic 46 84 92 98 98 2

Table 2: Cumulative scores indicating where the true pedigree was found in the top 100

(most likely) pedigrees for the 100 simulated data sets for each scenario considered. The last

column indicates the number of simulations where the true pedigree did not feature among

the top 100.

43