Combinatorial Algorithms for Haplotype Inference

Pure Parsimony

Dan Gusfield

SNP Data

• A SNP is a Single Nucleotide Polymorphism - a site in the genome where two different nucleotides appear with sufficient frequency in the population (say each with 5% frequency or more).

• SNP maps have been compiled with a density of about 1 site per 1000 nucleotides.

• SNP data is what is mostly collected in populations - it is much cheaper to collect than full sequence data, and focuses on variation in the population, which is what is of interest.

Genotypes and Haplotypes

Each individual has two “copies” of each chromosome. At each site, each chromosome has one of two alleles (states), denoted by 0 and 1 (motivated by SNPs).

Two haplotypes per individual:

0 1 1 1 0 0 1 1 0
1 1 0 1 0 0 1 0 0

Merge the haplotypes to get the genotype for the individual (a site where the two haplotypes differ is written as 2):

2 1 2 1 0 0 1 2 0
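The merge of two haplotypes into a genotype can be sketched in a few lines of Python. The function name `merge` and the encoding of haplotypes as strings of 0s and 1s are assumptions made for illustration only.

```python
def merge(h1: str, h2: str) -> str:
    """Merge two haplotypes into a genotype.

    Sites where the alleles agree keep that allele; sites where they
    differ are recorded as '2'.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same number of sites")
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


# The two haplotypes shown on the slide.
print(merge("011100110", "110100100"))  # 212100120
```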

Haplotype Map Project: HAPMAP

• NIH-led project ($100M) to find common haplotypes in the human population.

• Used to try to associate genetically influenced diseases with specific haplotypes, either to find causal haplotypes or to find the region near causal mutations.

• Haplotyping individuals is expensive.

Haplotyping Problem

• Biological Problem: For disease association studies, haplotype data is more valuable than genotype data, but haplotype data is hard to collect. Genotype data is easy to collect.

• Computational Problem: Given a set of n genotypes, determine the original set of n haplotype pairs that generated the n genotypes. This is hopeless without a genetic model.

The Pure Parsimony Objective

For a set of genotypes, find a smallest set H of haplotypes such that each genotype can be explained by a pair of haplotypes in H.

For each genotype G in the input, assign a pair of haplotypes in H to explain G.

The Pure Parsimony Objective reflects simple genetic models of how haplotypes evolve in a population.

Example of Parsimony

Genotype   Explaining haplotype pair
02120      00100, 01110
22110      01110, 10110
20120      00100, 10110

3 distinct haplotypes: the set H has size 3.
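The example can be checked mechanically. The following is a minimal sketch, assuming haplotypes and genotypes are encoded as strings; the helper names `merge` and `explained_by` are hypothetical.

```python
from itertools import combinations_with_replacement


def merge(h1, h2):
    """Genotype obtained by merging two haplotypes ('2' where they differ)."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


def explained_by(genotype, haplotypes):
    """Return a pair from `haplotypes` whose merge is `genotype`, or None."""
    for h1, h2 in combinations_with_replacement(sorted(haplotypes), 2):
        if merge(h1, h2) == genotype:
            return (h1, h2)
    return None


genotypes = ["02120", "22110", "20120"]
H = {"00100", "01110", "10110"}
for g in genotypes:
    print(g, explained_by(g, H))
```

Every genotype in this example contains at least one 2, so each needs a pair of two different haplotypes. Two haplotypes form only one such pair, so no set of size 2 can explain three different genotypes, and size 3 is optimal here.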

Pure Parsimony is NP-hard

Earl Hubbell (Affymetrix) showed that Pure Parsimony is NP-hard.

However, for a range of parameters of current interest (50 sites and 50 genotypes), a True Parsimony solution can be computed efficiently using Integer Linear Programming and two speed-up tricks.

For larger parameters (100 sites and 50 genotypes), a near-parsimony solution can be found efficiently.

Why I did this work

I wanted to answer two questions:

First, can a pure parsimony solution be computed efficiently for a range of problem sizes of current interest in biology?

Second, how accurate is the pure parsimony solution, compared to the correct solution (in simulations and in the available real data), and compared to solutions given by other existing computational methods such as PHASE?

Accuracy is measured by the number of genotypes whose originating pair of haplotypes is returned in the solution.
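The accuracy measure can be sketched as a simple count. The function name `accuracy` and the toy data below are hypothetical; each solution is assumed to be a list holding one haplotype pair per genotype, in the same genotype order.

```python
def accuracy(true_pairs, inferred_pairs):
    """Count genotypes whose inferred pair equals the true pair.

    The order of the two haplotypes within a pair is ignored.
    """
    return sum(
        sorted(t) == sorted(p) for t, p in zip(true_pairs, inferred_pairs)
    )


# Hypothetical toy data: the first genotype is resolved correctly,
# the second is explained by a different (wrong) pair.
true_pairs = [("00100", "01110"), ("01110", "10110")]
inferred_pairs = [("01110", "00100"), ("00110", "11110")]
print(accuracy(true_pairs, inferred_pairs))  # 1
```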

The Conceptual Integer Programming Formulation

For each genotype (individual) j, create one integer programming variable Yij for each pair of haplotypes whose merge creates genotype j. If j has k 2’s, then this creates 2^(k-1) Y variables.

Create one integer programming variable Xq for each distinct haplotype q that appears in one of the pairs for a Y variable.
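Enumerating the pairs behind the Y variables can be sketched as follows. This is an illustration under the assumption that genotypes are strings over 0, 1, 2; the function name `explaining_pairs` is hypothetical. Fixing the first ambiguous site of one haplotype to 0 produces each unordered pair once, which gives the 2^(k-1) count.

```python
from itertools import product


def explaining_pairs(genotype: str):
    """List the unordered haplotype pairs whose merge is `genotype`."""
    het = [i for i, s in enumerate(genotype) if s == "2"]
    if not het:
        # No ambiguous sites: the only explanation is two identical copies.
        return [(genotype, genotype)]
    pairs = []
    # The first ambiguous site of h1 is fixed to '0', so every unordered
    # pair is generated exactly once: 2^(k-1) pairs for k ambiguous sites.
    for bits in product("01", repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for i, b in zip(het, ("0",) + bits):
            h1[i] = b
            h2[i] = "1" if b == "0" else "0"
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


print(explaining_pairs("02120"))
# [('00100', '01110'), ('00110', '01100')]
```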

Conceptual IP

For each genotype, create an equality that says that exactly one of its Y variables must be set to 1.

For each variable Yij, whose two haplotypes are given variables Xq and Xq’, include an inequality that says that if variable Yij is set to 1, then both variables Xq and Xq’ must be set to 1.

Then the objective function is to minimize the sum of the X variables.

Example

Genotype 02120 creates a Y variable Y1 for the pair
  00100 (X1)
  01110 (X2)
and a Y variable Y2 for the pair
  01100 (X3)
  00110 (X4)

Include the following (in)equalities in the IP:

  Y1 + Y2 = 1
  Y1 - X1 <= 0
  Y1 - X2 <= 0
  Y2 - X3 <= 0
  Y2 - X4 <= 0

The objective function will include the subexpression X1 + X2 + X3 + X4, but any X variable is included exactly once no matter how many Y variables it is associated with.
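For tiny inputs, the optimum of the conceptual IP can be found without an ILP solver. The sketch below is not the IP itself: it swaps in exhaustive search, choosing one pair per genotype (one Y variable set to 1) and counting the distinct haplotypes used (X variables set to 1). The function names are hypothetical, and the running time is exponential, so this is only a way to check small examples.

```python
from itertools import product


def explaining_pairs(genotype):
    """Unordered haplotype pairs whose merge is `genotype`."""
    het = [i for i, s in enumerate(genotype) if s == "2"]
    if not het:
        return [(genotype, genotype)]
    pairs = []
    for bits in product("01", repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for i, b in zip(het, ("0",) + bits):
            h1[i], h2[i] = b, "1" if b == "0" else "0"
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


def pure_parsimony(genotypes):
    """Exhaustive search for a smallest explaining haplotype set."""
    choices = [explaining_pairs(g) for g in genotypes]
    best = None
    for assignment in product(*choices):  # one pair per genotype
        used = {h for pair in assignment for h in pair}
        if best is None or len(used) < len(best[0]):
            best = (used, assignment)
    return best


H, assignment = pure_parsimony(["02120", "22110", "20120"])
print(len(H), sorted(H))  # 3 ['00100', '01110', '10110']
```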

Efficiency Tricks

Ignore any Y variable and its two X variables if those X variables are associated with no other Y variable. The resulting IP is much smaller, and can be used to find the optimal solution to the conceptual IP.

Also, we need not enumerate all X pairs for a given genotype, but can efficiently recognize the pairs we need.
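The variable-elimination trick can be illustrated on a toy input. This is a sketch under stated assumptions: the genotypes are hypothetical, the names `explaining_pairs` and `reduced_pairs` are invented for illustration, and pairs are enumerated in full for clarity rather than recognized directly. How the optimum of the conceptual IP is recovered from the reduced one is not shown.

```python
from collections import Counter
from itertools import product


def explaining_pairs(genotype):
    """Unordered haplotype pairs whose merge is `genotype`."""
    het = [i for i, s in enumerate(genotype) if s == "2"]
    if not het:
        return [(genotype, genotype)]
    pairs = []
    for bits in product("01", repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for i, b in zip(het, ("0",) + bits):
            h1[i], h2[i] = b, "1" if b == "0" else "0"
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


def reduced_pairs(genotypes):
    """Keep a pair only if one of its haplotypes helps another genotype."""
    all_pairs = {g: explaining_pairs(g) for g in set(genotypes)}
    # For each haplotype, count the distinct genotypes it can help explain.
    uses = Counter(
        h for pairs in all_pairs.values() for pair in pairs for h in set(pair)
    )
    return {
        g: [p for p in pairs if uses[p[0]] > 1 or uses[p[1]] > 1]
        for g, pairs in all_pairs.items()
    }


reduced = reduced_pairs(["2220", "0120", "1210"])  # hypothetical genotypes
print(reduced["2220"])
# [('0000', '1110'), ('0100', '1010'), ('0110', '1000')]
```

Here the pair (0010, 1100) is dropped for genotype 2220, because neither haplotype appears in an explanation of any other genotype.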

Avoiding Enumeration of unneeded haplotypes

For each pair of genotypes G1, G2, it is easy to find all the haplotypes that appear in an explanation for G1 and in an explanation for G2.

Example:

G1: 0 2 1 1 0 2 0 2
G2: 0 1 1 1 2 2 0 2
    0 1 1 1 0 V 0 V

and then generate all combinations of 0s and 1s over the V sites.

So the time is O(m x # haps in both explanation sets).
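The pairwise step can be sketched as follows, assuming genotypes are strings over 0, 1, 2. The function name `common_haplotypes` is hypothetical, and the two 8-site genotypes are read off the example above. A site is left free (V) only when both genotypes have a 2 there; otherwise the haplotype must carry the allele that a 0 or 1 forces, and two conflicting fixed alleles mean there is no common haplotype.

```python
from itertools import product


def common_haplotypes(g1: str, g2: str):
    """Haplotypes that appear in an explanation of both g1 and g2."""
    template = []
    for a, b in zip(g1, g2):
        if a == "2" and b == "2":
            template.append("V")   # free site
        elif a == "2":
            template.append(b)     # forced by g2
        elif b == "2" or a == b:
            template.append(a)     # forced by g1 (or by both)
        else:
            return []              # conflicting fixed alleles
    free = [i for i, s in enumerate(template) if s == "V"]
    result = []
    for bits in product("01", repeat=len(free)):
        h = template[:]
        for i, bit in zip(free, bits):
            h[i] = bit
        result.append("".join(h))
    return result


print(common_haplotypes("02110202", "01112202"))
# ['01110000', '01110001', '01110100', '01110101']
```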

The APOE Data: A case where the haplotypes were molecularly determined

There are 17 distinct haplotypes in the real data.

The IP finds a True Parsimony Solution with 15 distinct haplotypes.

PHASE and HAPLOTYPER each use 15 haplotypes also.

Over 10,000 executions of Clark’s method, the fewest haplotypes it used in any solution was 20.

This data has 9 sites and 47 genotypes, each with at least two ambiguous sites.

Recombination

Recombination is a process whereby a prefix of one sequence is concatenated to a suffix of another sequence to create a third sequence.

Ex. ABCDEFG and TUVWXYZ could recombine to create ABCWXYZ.

DNA sequences evolve by mutations of different types, but also by recombinations.
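The prefix/suffix definition of recombination amounts to a one-line string operation. A minimal sketch, where the function name `recombine` and the parameter `cut` (the number of characters taken from the first sequence) are assumptions for illustration:

```python
def recombine(a: str, b: str, cut: int) -> str:
    """Join the first `cut` characters of `a` to the rest of `b`."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not 0 <= cut <= len(a):
        raise ValueError("cut must lie within the sequence")
    return a[:cut] + b[cut:]


print(recombine("ABCDEFG", "TUVWXYZ", 3))  # ABCWXYZ
```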

Recombination Helps Efficiency

As the level of recombination increases, the efficiency of the IP increases, because the variable elimination trick becomes more effective, reducing the size of the IP. The reason is that recombination makes the underlying haplotypes in the population more varied, and also increases the number of haplotypes in the population. Hence, each haplotype is less likely to be part of a potential explanation of any given genotype.

Recombination Hurts Accuracy

For almost the same reason that recombination helps efficiency, it hurts accuracy. As recombination increases, the number of haplotypes that can be part of the explanation of more than one genotype in the data decreases. That helps efficiency, but it reduces the level of structure and dependency among the potential explanations, and hence the parsimony criterion is less effective.

How Fast? How Good?

Depends on the level of recombination in the underlying data. Pure Parsimony can be computed in seconds to minutes for most cases with 50 genotypes and up to 60 sites, faster as the level of recombination increases.

As the level of recombination increases, the accuracy of the Pure Parsimony Solution falls, but remains within 5% of the quality of PHASE (for comparison).

Accuracy

For 10 sites and moderate recombination, the Pure Parsimony solutions have the same accuracy as PHASE and HAPLOTYPER solutions. As the number of sites and the level of recombination increase, PHASE and HAPLOTYPER tend to be more accurate than the Pure Parsimony solution, but the gap is moderate.

A Hybrid Approach for Large Data Sets

We are interested in handling 100 genotypes and 150 sites.

This is too large for the IP approach, but we can use a hybrid approach based on Clark’s Method and an IP version of it.

Generic Clark Method

Basic Step: Given a known haplotype H (original homozygote or single-site heterozygote, or previously inferred) and an unresolved genotype G, if G can be explained by H and another vector H’, then call H’ a known haplotype, available for additional inferences.

Example:

H   0 1 0 0 1
G   2 1 0 2 2
------------------
H’  1 1 0 1 0

G is “resolved” by H and H’.

In a single run, repeat the basic step until stuck, resolving as many genotypes as possible in the data.

Clark (1990): Randomize choices, and do the computations many times to find an execution (run) that explains the most genotypes.
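The basic step and a single randomized run can be sketched as below. This is an illustration only, not Clark’s software: the function names `resolve` and `clark_run` are hypothetical, genotypes are assumed to be distinct strings over 0, 1, 2, and the only randomized decision here is which (known haplotype, genotype) combination to apply next.

```python
import random


def resolve(h: str, g: str):
    """Return H' such that h and H' merge to g, or None if h cannot explain g."""
    partner = []
    for a, s in zip(h, g):
        if s == "2":
            partner.append("1" if a == "0" else "0")  # must differ from h
        elif a == s:
            partner.append(a)                         # must agree with g
        else:
            return None
    return "".join(partner)


def clark_run(genotypes, rng):
    """One run: apply the basic step until no genotype can be resolved."""
    known, resolved, unresolved = set(), {}, []
    for g in genotypes:
        if g.count("2") <= 1:
            # Homozygotes and single-site heterozygotes are unambiguous.
            h1, h2 = g.replace("2", "0"), g.replace("2", "1")
            known.update((h1, h2))
            resolved[g] = (h1, h2)
        else:
            unresolved.append(g)
    while True:
        options = [
            (h, g)
            for g in unresolved
            for h in sorted(known)
            if resolve(h, g) is not None
        ]
        if not options:
            break  # stuck, or everything is resolved
        h, g = rng.choice(options)
        partner = resolve(h, g)
        known.add(partner)  # H' becomes available for later steps
        resolved[g] = (h, partner)
        unresolved.remove(g)
    return resolved, unresolved


print(resolve("01001", "21022"))  # 11010
print(clark_run(["01001", "21022"], random.Random(0)))
```

Repeating `clark_run` with different random seeds and keeping the run that leaves the fewest unresolved genotypes corresponds to the many-executions strategy.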

Many variations of Clark

• Variations based on which parts are randomized.

• We closely examine eight variations on a real data set. Variation 1 randomizes every decision - probably more than Clark originally intended.

• Truth in advertising - we implemented our own Clark versions - did not actually use Clark’s software.

Clark/Parsimony Hybrid

Find an execution of Clark’s method that
a) maximizes the number of genotypes resolved
b) minimizes the number of distinct haplotypes used

We can do this by mixing the Digraph View of Clark’s method (Gusfield 2001) with the parsimony criterion, and truly find an execution of Clark’s method that minimizes the number of distinct haplotypes used.

On datasets where we can compute True Parsimony, this hybrid does only a bit worse than True Parsimony.

The hybrid targets data with low recombination and many (more than 60) sites.

Other uses of IP

On datasets where we know the solution, find the best that a Clark method can ever do. IP can find the best possible execution.

On the APOE data, Clark’s method can get all 47 correct! In fact, in a huge number of ways. (But the best we found by actually running Clark’s method was 42 correct.) This kind of test is not possible for statistical methods.
