BMI 731- Winter 2004 Haplotype reconstruction

BMI 731- Winter 2004

Haplotype reconstruction

Catalin Barbacioru

Department of Biomedical Informatics

Ohio State University

The problem

We start with a collection of genotypes in the form of allelic determinations at tightly linked single nucleotide polymorphisms (SNPs) for each of a set of n individuals. For example, we might describe 3 SNPs as follows:

Name SNP alleles (major, minor)

SNP1 T, A SNP2 A, G SNP3 C, G

An individual might have genotype AT at SNP1, AA at SNP2, and CG at SNP3, which we will denote by AT//AA//CG. Possible haplotype pairs for this person are AAC/TAG and AAG/TAC, and without further information, we can’t distinguish between these two pairs.

The problem, cont.

What can be done? With information on the individual’s parents, we can usually infer the haplotypes, the only problem being that the parents may not be fully informative.

For example, if the maternal and paternal genotypes were TA//AA//CC and TT//AA//CG respectively, at SNPs 1, 2 and 3, and the individual is AT//AA//CG, then it would be clear that the haplotypes were AAC/TAG (why?). On the other hand, if the parents both had genotypes AT//AA//CG, then we wouldn’t be able to determine unique haplotypes for the individual.

Even in the first case, we might have to make an assumption about the frequency of recombination: what is it?

Our problem here is to determine haplotypes, or make good guesses at them without parental genotypes.

Origin of the problem

Why do we want to determine haplotypes for individuals at tightly linked SNP loci?

a)Haplotypes are more powerful discriminators between cases and controls in disease association studies. Why?

b) With haplotypes we can conduct evolutionary studies.

c) Use of haplotypes in disease association studies reduces the number of tests to be carried out, and hence the penalty for multiple testing. Is this the same point as a)?

Two aspects of the problem

With a random sample of multilocus genotypes at a set of SNPs, we can attempt a) to estimate the frequencies of all possible haplotypes, and b) to infer the haplotypes of all individuals.

The first step on this problem was taken by A Clark in 1990. He gave what we might call a parsimony solution to b) above.

It goes like this. With a reasonable sample size, we might expect to have some individuals homozygous at every locus, e.g. TT//AA//CC, or heterozygous at just one locus, e.g. TT//AA//CG. With the individuals of former type, we have unambiguously identified one (TAC), and of the latter type two (TAC and TAG) haplotypes present in the population. The algorithm begins by finding all homozygotes and single SNP heterozygotes and tallying the resulting known haplotypes.

Now proceed as follows. For each known haplotype, look at all remaining unresolved cases, and ask whether the known haplotype can be made from some combination of ambiguous sites from an unresolved case.

For example, if we have identified TAC as a known haplotype from a TT//AA//CC homozygote, and we have an individual AT//AA//CG still unresolved, then we infer that s/he is TAC/AAG, and we have have “resolved” this person’s haplotype and added a putative haplotype to our list. Similarly, a TT//AA//CG individual gives us both TAC and TAG as known haplotypes, and both of these go into the initial list.

This chain of inferences is continued until either all haplotypes have been recovered, or until no more new haplotypes can be found in this way.

A Clark’s algorithm, cont.

Clark’s algorithm with SNPs in practice

This method should work in principle, but there are three

problems that might arise in practice: a) there may be no

homozygotes or single SNP heterozygotes in the sample, and

so the chain might never get started; b) there may be many

unresolved haplotypes left at the end; and c) haplotypes

might be erroneously inferred if a crossover product of two

actual haplotypes is identical to another true haplotype.

The frequency of these problems will depend on average

heterozygosity of the SNPs, number of loci, their recombination

rates and the sample size. Clark (1990) did some calculations

and simulations which led him to believe the algorithm would

perform well, even with relatively small sample sizes. And it did.

The EM algorithm solution to problem a)(SNPem)

We now describe an EM algorithm to infer haplotype frequencies in a population on the basis of a random sample and the assumption of random mating for haplotypes. Escoffier and Slatkin (1995) call the phase unknown multilocus genotypes, e.g. TT//AA//CG, phenotypes, and keep the term genotype for the corresponding haplotype pair TAC/TAG. Others use the term diplotype for a pair of haplotypes, but neither of these has caught on.

The observed data in a random sample of n individuals will be multilocus genotype frequencies, and the natural model is multinomial. The number c of haplotype pairs leading to a given phenotype will depend on the number s of heterozygous SNPs, and will be 2s-1. E.g if our genotype is TT//AA//CG, then we can recover the haplotypes unambiguously (c=1), but for our original case, AT//AA//CG, there were c = 2 possible haplotype pairs.

Under the assumption of random mating, the probability of a given genotype is just the sum of 2s-1 squares or products of haplotype probabilities, e.g.

P=pr(AT//AA//CG)

= pr(AAC/TAG) + pr(AAG/TAC)

= 2pr(AAC)pr(TAG) + 2pr(AAG)pr(TAC).

The EM algorithm, cont.

The probability of a sample of n individuals conditioned by the phenotype frequencies P1, …, Pm (i.e. the likelihood of the data given the parameters) is given by the multinomial probability,

mnm

n

mm PP

nn

nPPsampleP ...

!!...

!),...,|( 1

11

1

where m different phenotypes are observed with counts n1,…,nm. If we h possible haplotypes with (unknown) frequencies p1,…,ph, then the likelihood of the haplotype frequencies given the phenotypic counts is

m

j

nc

iilikh

jj

hhPappL1 1

1 )(),...,(

where a is a constant incorporating the multinomial coefficient, hp,r are the possible haplotypes and cj is the number of genotypes leading to the jth phenotype. In principle, we want to estimate the haplotype frequencies that maximizethe likelihood function, by solving a set of h-1 equations involving first partialderivatives of the logarithm of the likelihood, ∂logL/∂pi.


The EM algorithm is an iterative method to compute successive sets of haplotype frequencies p1,…,ph, starting with initial values p1

(0),…,ph(0). These

initial values are used as if they are the true frequencies to estimate genotype frequencies P(hkhl)(0) (the expectation step). These expected genotype frequencies are used in turn to estimate haplotype frequencies at the next generation p1

(1),…,ph(1) (the maximization step) and so on, until

convergence is reached.

Possible initial haplotype frequencies can be evaluated assuming that all haplotypes are equally frequent in the sample, or that haplotype frequencies are equal to the product of single-locus allele frequencies, or that initial haplotypes frequencies are chosen random, or that for each phenotype, the possible genotypes are equally likely, so

P(genotype hkhl in phenotype j)(0) = Pj(hkhl)(0)

= 1/cj


The expectation step at the gth iteration consists of using the haplotype frequencies in the previous iteration to calculate the probability of resolving each genotype given the phenotype:

)(

)(21

)(

)()(21 )( g

lk

g

hhP

hhPgj hhP

where the sum is taken over all genotypes (hk,hl) leading to the jth phenotype, and P(h1,h2)(g) = 2p1

(g)p2(g) if h1 and h2 are different and (p1

(g))2 otherwise.Haplotype frequencies are then computed in the maximization step:

)(

1 1

)1( )(2

1 glk

m

j

c

ijit

gt hhPp

j

where δit is equal to the number of times haplotype t is present in genotypei = (hk,hl) (0,1, or 2).

The EM algorithm: performance with SNPs

The EM algorithm: performance with SNPs

Previous slide:

Influence of sample size on haplotype frequency estimates. A and B, Haplotype frequencies at the three steps of the simulation procedure. Generating frequencies (Gk [line]), sample frequencies (Sk [triangles]), and resulting haplotype frequency estimates from the EM algorithm (Ek [unblackened circles]) for a five-locus system with equally frequent population haplotype frequencies, with sample size set to N = 50 (A) and N = 500 (B) are shown. C, Average MSE and 95% CI for batches of 500 data sets of each sample size for five-locus haplotypes generated under the N(1/k,2) model. Unbroken line denotes comparisons of EM estimates to sample values (SE); dotted line, EM estimates to generating parameters (GE).

A Gibbs sampler approach to haplotype reconstruction, Stephens et al (2001) (PHASE)

Our notation is as follows: G = (G1,…, Gn) denotes the observed multilocus

genotype frequencies, H=(H1,…,Hn) will be the corresponding unknown

haplotype pairs, while F=(F1, …, FM) will denote the M unknown

population haplotype frequencies. In these terms, the EM algorithm sought

that F which maximized pr(G|F).

The Gibbs sampler uses the following three steps, starting from some

initial haplotype reconstruction H(0):

i) Choose an individual i, uniformly and at random from all ambiguous individuals;

ii) Sample Hi(t+1) from pr(Hi | g, H-i

(t)), where H-i is the set of haplotypes excluding individual i;

iii) Set Hj(t+1) = Hj

(t) for j=1,…,i-1,i+1, …,n.

General theory tells us that this produces a Markov chain with the desired stationary distribution. The question is: what is pr(Hi | g, H-i

(t))?

The Stephens et al Gibbs sampler, some details

For any haplotype pair Hi = (hi1,hi2) consistent with genotype Gi,

write

pr(Hi | G, H-i) pr(Hi | H-i) pr(hi1|H-I)pr(hi2 | H-I, hi1) (*)

where pr(.|H) is the conditional distribution of a future sampled haplotype, given a set H of previously sampled haplotypes. There are a couple of choices here, one assuming that the type of a mutant offspring is h with probability h, independent of the type of the parent. In that case, for a constant-size random mating population, pr(h|H) = (rh + h)/(r + ), where rh is the number of haplotypes of type h in H, r is the total number of haplotypes in H, and is the scaled mutation rate.

The Stephens et al Gibbs sampler, more details

In principle we can substitute this formula into step ii) of the generic Gibbs algorithm, and off we go. In practice, the number of possible values of Hi is too large: 2s-1, where s is the number of loci at which individual i heterozygous. However, if we take h =1/M, where M is the total number of different possible haplotypes that could be observed in the population, we can exploit the fact

Comparison

Comparison

Previous Figure Comparison of accuracy of PHASE method (solid line) versus EM (dotted line) and Clark's method (dashed line) for short sequence data (530 segregating sites). Top row, mean error rate (defined in text) for haplotype reconstruction. Bottom row, mean discrepancy (defined in text) for estimation of haplotype frequencies. We simulated data sets of 2n haplotypes, randomly paired to form n genotypes, under an infinite-sites model, with = 4 and different assumptions about the local recombination rate R (R and are defined in the note to table 1), using a coalescent-based program kindly provided by R. R. Hudson.For each combination of parameters considered, we generated 100 independent data sets and discarded those data sets for which the total number of possible haplotypes was >105 (the limit of our implementation of the EM algorithm), which typically left >90 data sets on which to compare the methods. Each point thus represents an average over 90100 simulated data sets. Horizontal lines above and below each point show approximate 95% confidence intervals for this average (±2 standard errors). The results for Clark's algorithm for R = 40 are omitted, as we had difficulty getting the algorithm to consistently provide a unique haplotype reconstruction for these data.

The Stephens et al Gibbs sampler, yet more details

Stephens et al (2001) offer a second algorithm, based on slightly different population genetic modelling. It takes the form

pr(h | H) = ∑{E} ∑ {s ≥0} [r/r][/(r+ )]s[r/(r+)](Ps)h ,

where r is the number of haplotypes of type in the set H, r is the total number of haplotypes in H, and is the scaled mutation rate. Here E is the set of types for a general mutation model, and P a reversible mutation matrix. Informally, this corresponds to the next sampled haplotype h being obtained by applying a random number of mutations s to a randomly chosen existing haplotype , where s is sampled from a geometric distribution. This related to sampling from what is known as the coalescent, see Stephens et al (2001) for details and references.

The algorithm uses the above expression in (*) a few slides back. There are several issues that need to be dealt with here: estimation of , dealing with the fact that the dimension of P is M, the number of possible haplotypes, etc.

An alternative Gibbs sampler, Nui et al (2002) These authors are from the Bayesian school, who assume Dirichlet priors with parameters = (1, …, M), for the haplotype

frequencies F = (f1, …, fM). As with the EM approach, our model

is multinomial, being a product over individuals in the sample, of phase unknown multilocus genotype probabilities, which in turn are sums of products of pairs of haplotype probabilities. The problem, as always, is in the large number of haplotypes which are compatible with a given genotype. More fully, thinking of the haplotypes H as “missing”, we can write

pr(G,H | F) ∏{i=1,..n} fhi1fhi2 ∏j fjj - 1.

The steps are familiar: conditional on F, sample hi1 and hi2 for individual i according to (**) below; then sample F given Hupdated .

(**) pr(hi1=g, hi2 = h|F, Gi) = fgfh/∑{g’h’=Gi} fg’fh’ .

Here g’h’=Gi means that g’ and h’ combine to give Gi .

Alternative Gibbs sampler, cont.The advantage of this approach is that it is independent of complex and potentially dangerous population genetic modelling assumptions. Furthermore, tricks we have met before and some new ones can be used to improve the efficiency of the algorithm.

Predictive updating. Here, as elsewhere, we integrate out the parameters F to improve the Gibbs sampler. In this case

pr(G,H) [|+N(H)|]/ [+N(H)],

where we are using the abbreviated gamma function notation once more, and where N(H) is the vector of haplotype counts. Now we can use a different, easier sampler: pick an individual i at random or in a certain order, and update his/her haplotype hi by sampling from

pr(hi = (g,h)|H-i,G) (ng + g)(nh + h),

where ng and nh are counts of g and h in H-i, respectively.

Alternative sampler, cont: Ligation.Handling a large number of loci and hence haplotypes still presents a challenge. It is natural to adopt a divide and conquer approach: solve the problem for small blocks of loci, and then piece together the solutions. Niu et al suggest partitioning L loci into blocks of about 8 loci. They offer two strategies: progressive ligation and hierarchical ligation. In both cases, the first step is to carry out block level haplotype reconstruction using the sampler just described. They then record the B most probable (partial) haplotypes for each block, between 40 and 50 in their examples, and proceed to join them.

Progressive ligation begins at one end and pieces consecutive pairs together, using the Gibbs sampler restricted to longer haplotypes which are among the B2 combinations from the two most probable sets of B partial haplotypes. This process is then continued until all the blocks are joined.

Hierarchical ligation is analogous, but working across the whole length, see Figure on next slide. It is worth pointing out that these strategies not only deal with the large state space, they also help the Markov Chain to converge more rapidly.

Schematic depicting ligation. Here there are L loci, initially insegments of K, while is the highest level of the pyramidal hierarchy.

Alternative sampler completed

Prior annealing. A nice trick used in Niu et al to enable the Gibbs sampler to more move freely in haplotype space is to use high pseudo-counts at the beginning of the iterations, and progressively reduce them at a fixed rate as the sampler continues. Their formula for T iterations is

(t) = (0) + t((T) - (0) )/T.

Missing marker data. The absence of both alleles of an SNP marker is common owing to PCR dropouts. Another concern is the “one-allele” problem, in which only one allele is unscored. The Gibbs sampler adapts nicely to having multiple categories of errors and dealing with these in the sampling.

ReferencesA G Clark. Inference of haplotypes from PCR-amplified samples of diploid populations. Mol. Biol. Evol. 7: 111-122. 1990

L Escoffier and M Slatkin. Maximum likelihood estimation of molecular haplotype frequencies in a diploid population. Mol. Biol. Evol. 12: 921-927, 1995.

D Fallin and NJ Schork. Accuracy of haplotype frequency estimation for biallelic loci, via the expectation maximization algorithm, for unphased diploid genotype data. Am J Hum, Genet. 67:947-959, 2000.

M Stephens, NJ Smith and P Donnelly. A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet. 68: 978-989, 2001.

T Niu, ZS Qin, X. Xu and J Liu. Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am. J. Hum. Genet. 70:157-169, 2002.

T Speed, Statistics 246 (notes), Berkeley 2002

Documents

BMI 731- Winter 2004 Haplotype reconstruction