BIBE 05 1 Haplotype Phasing using Semidefinite Programming Parag Namjoshi CSEE Department University of Maryland Baltimore County Joint work with Konstantinos Kalpakis

BIBE 05 1

Haplotype Phasing using Semidefinite Programming

Parag Namjoshi

CSEE Department

University of Maryland Baltimore County

Joint work with Konstantinos Kalpakis

Outline

Biology Review
Motivation
Previous work
Our contribution
Experimental results
Conclusions

Biology Review

living systems are composed of cells
the code for the creation of the cells is packed in a molecule called DNA
DNA is built from four nucleotide bases, Adenine (A), Cytosine (C), Guanine (G), and Thymine (T), arranged as complementary strands of a double helix
DNA strand = string of A, C, G, and T’s

Chromosomes

the genome is arranged as a set of distinct chromosomes
mammals are diploid
humans have 22 pairs of autosomes plus the X and Y sex chromosomes
chromosomes occur in homologous pairs
  one homologous chromosome is inherited from each parent
  homologous chromosomes contain the same genes in the same order (up to mutations)

Single Nucleotide Polymorphisms

Single Nucleotide Polymorphism (SNP) = mutation of a single base
evidence suggests that in humans 90% of variation is due to SNPs
DNA has long conserved regions punctuated by SNPs
there is about one SNP per 1000 bases
most SNPs are bi-allelic
  at any given locus, only two of the four possible nucleotides are present in 95% of the population
the restriction (projection) of a DNA strand to its SNP sites is a haplotype

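What are Genotypes?

the genotype of a diploid organism is the conflation of its two inherited haplotypes

Haplotype (from mother): T C A G A C
Haplotype (from father): T G A C T C
Genotype (child):        TT {C,G} AA {C,G} {A,T} CC
Genotype, short form:    T {C,G} A {C,G} {A,T} C

sites where the two haplotypes agree (TT, AA, CC) are homozygous; sites where they differ ({C,G}, {A,T}) are heterozygous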

Genotype & Haplotype Std. Representation

genotypes and haplotypes can be represented as 0/1/2 vectors
independently for each site
  identify each of the two letters that appear at the site with 0 or 1
  replace each homozygous site with 0 or 1 using the mapping above
  replace each heterozygous site with 2

Haplotype (mother): T/1 C/0 A/1 G/0 A/1 C/0
Haplotype (father): T/1 G/1 A/1 C/1 T/0 C/0
Genotype (child):   1 2 1 2 2 0
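The encoding above can be sketched in a few lines of Python. This is only an illustration: the function name `conflate` is an assumption, and the two haplotypes are the 0/1 codes read off the mother and father rows of the slide.

```python
def conflate(h1, h2):
    """Combine two 0/1 haplotypes into a 0/1/2 genotype (2 = heterozygous site)."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same number of sites")
    # Equal alleles keep their code; differing alleles become 2.
    return [a if a == b else 2 for a, b in zip(h1, h2)]


mother = [1, 0, 1, 0, 1, 0]  # T/1 C/0 A/1 G/0 A/1 C/0
father = [1, 1, 1, 1, 0, 0]  # T/1 G/1 A/1 C/1 T/0 C/0
child = conflate(mother, father)
print(child)  # [1, 2, 1, 2, 2, 0]
```

The result matches the genotype row shown on the slide.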

Haplotypes vs. Genotypes

large scale polymorphism studies, such as Linkage Disequilibrium studies, need haplotype information
however, experimentally
  it is expensive to segregate the haplotypes of individuals
  it is easier to observe the genotypes of those individuals
can we find haplotypes from the genotypes computationally?
a genotype with h heterozygous sites can be explained (phased) by 2^(h-1) different haplotype pairs
how do you choose among them?
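To make the 2^(h-1) count concrete, here is a small Python sketch that enumerates every unordered haplotype pair explaining a 0/1/2 genotype. The function name `phasings` is an assumption made for illustration. The first heterozygous site is pinned to (0, 1) so that a pair and its mirror image are not counted twice; a fully homozygous genotype (h = 0) has exactly one explanation.

```python
from itertools import product


def phasings(genotype):
    """Yield each unordered pair (h1, h2) of 0/1 haplotypes that conflates to `genotype`."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    if not het:
        # Homozygous genotype: both haplotypes equal the genotype itself.
        yield list(genotype), list(genotype)
        return
    first, rest = het[0], het[1:]
    for bits in product((0, 1), repeat=len(rest)):
        h1, h2 = list(genotype), list(genotype)
        # Pin the first heterozygous site to break the (h1, h2) / (h2, h1) symmetry.
        h1[first], h2[first] = 0, 1
        for i, b in zip(rest, bits):
            h1[i], h2[i] = b, 1 - b
        yield h1, h2


pairs = list(phasings([1, 2, 1, 2, 2, 0]))
print(len(pairs))  # 4, i.e. 2 ** (3 - 1)
```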

Haplotype Phasing with Parsimony

in population haplotyping, given genotypes from different individuals, we want to find a set of haplotypes which resolves all the genotypes
  recall that there can be many such solutions
  experimental evidence suggests that the number of such haplotypes is small
HPP: Haplotype Phasing Problem with Pure Parsimony
  given a set of genotypes, find a minimum size set of haplotypes which conflate to produce the given genotypes
other criteria for choosing among possible sets of haplotypes are perfect phylogeny, minimum total pairwise distance, minimum diameter, etc.
we focus on the HPP problem
  Lancia, Pinotti, and Rizzi proved that HPP is NP-complete as well as APX-hard

Clark’s Rule

Clark (1990) describes a greedy inference rule to find a small set of haplotypes resolving a set of genotypes
starting with a set of haplotypes H that resolves all the homozygous genotypes, do the following for each unresolved genotype g
  if there is a pair (h, h’) that resolves g with h ∈ H, then add h’ to H, else stop
the solution obtained is sensitive to the order in which genotypes are resolved
Clark’s rule may terminate with some genotypes unresolved (orphans)
  the rule can be modified to include a pair of haplotypes that resolve an orphan genotype, and continue as before
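A minimal Python sketch of a Clark-style greedy pass follows, using 0/1/2 genotypes. It is one reading of the rule, not Clark's original code: the names `complement` and `clark` are hypothetical, the scan repeats until a full pass resolves nothing new (a simple way to realize "else stop"), and candidates are tried in sorted order only to make the run deterministic.

```python
def complement(g, h):
    """Return h2 such that h and h2 conflate to g, or None if h is incompatible with g."""
    out = []
    for gi, hi in zip(g, h):
        if gi == 2:
            out.append(1 - hi)   # heterozygous site: partner carries the other allele
        elif gi == hi:
            out.append(hi)       # homozygous site: partner carries the same allele
        else:
            return None          # h contradicts a homozygous site of g
    return tuple(out)


def clark(genotypes):
    """Greedy inference: returns (haplotype set, {genotype index: pair}, orphan indices)."""
    H = {tuple(g) for g in genotypes if 2 not in g}
    resolved = {i: (tuple(g), tuple(g)) for i, g in enumerate(genotypes) if 2 not in g}
    progress = True
    while progress:
        progress = False
        for i, g in enumerate(genotypes):
            if i in resolved:
                continue
            for h in sorted(H):
                h2 = complement(g, h)
                if h2 is not None:
                    H.add(h2)
                    resolved[i] = (h, h2)
                    progress = True
                    break
    orphans = [i for i in range(len(genotypes)) if i not in resolved]
    return H, resolved, orphans


H, resolved, orphans = clark([[0, 0, 1], [2, 0, 1], [2, 2, 1]])
print(sorted(H), orphans)
```

Changing the order of the input genotypes, or the order in which candidates h are tried, can change the resulting set, which is the order sensitivity noted on the slide.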

Gusfield’s TIP

Gusfield (1999) introduces the TIP approach
  enumerate all distinct haplotypes that can be used to resolve any single heterozygous genotype
  solve an Integer Linear Program (IP) to select a minimum size set of haplotypes from the enumerated haplotypes that explains the genotypes
TIP uses O(2^L n) variables and constraints, where L is the maximum number of heterozygous loci of any genotype
Gusfield describes a number of important improvements to the basic approach above that improve performance

Harrower-Brown IP

Harrower and Brown give an alternate 0-1 IP for the HPP problem (HB-IP)
  explain the n genotypes with 2n haplotypes (not necessarily distinct)
  the number of distinct haplotypes used is minimized
  the number of variables and constraints is polynomial in n, m

The QIP approach - Outline

arithmetic representation of genotypes
semidefinite programming (SDP)
Quadratic Integer Program (QIP) for HPP
a semidefinite programming based heuristic to solve QIP
experimental results
concluding remarks

Arithmetic Representation of Genotypes

represent each genotype g as a vector δ with
  each homozygous locus taking value 0 or 2 iff it was 0 or 1 in g, respectively
  each heterozygous locus taking value 1
conflation can now be replaced by addition
  if haplotypes h1 and h2 explain genotype δ, then δ = h1 + h2
we call δ an arithmetic genotype

Example:
  g  = 0 1 2
  δ  = 0 2 1
  h1 = 0 1 0
  h2 = 0 1 1
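The change of representation is a fixed relabeling of the three symbols, as this short Python sketch shows. The function name `to_arithmetic` is an assumption; the vectors are the ones from the slide's example.

```python
def to_arithmetic(g):
    """Map a 0/1/2 genotype (2 = heterozygous) to its arithmetic form (1 = heterozygous)."""
    table = {0: 0, 1: 2, 2: 1}
    return [table[x] for x in g]


g = [0, 1, 2]
h1 = [0, 1, 0]
h2 = [0, 1, 1]
delta = to_arithmetic(g)
# Conflation has become plain vector addition.
assert delta == [a + b for a, b in zip(h1, h2)]
print(delta)  # [0, 2, 1]
```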

Arithmetic Genotypes

let Δ be an n x m matrix with the arithmetic genotypes as rows
let H be a k x m matrix with haplotypes as rows
if the haplotypes in H resolve Δ, then Δ = S H, where S is an n x k 0-1-2 matrix
  the row of S for a homozygous genotype has a single 2
  all other rows have exactly two 1s
we call S a selector matrix
  the ith row of S “selects” two haplotypes (rows of H) to explain the ith genotype
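A small numeric instance of Δ = S H, written in plain Python, may help. The matrices below are made up for illustration (k = 3 haplotypes, m = 3 loci, n = 3 genotypes), and `is_selector` and `matmul` are hypothetical helper names.

```python
def is_selector(S):
    """Each row has entries in {0, 1, 2} summing to 2: a single 2, or exactly two 1s."""
    return all(set(row) <= {0, 1, 2} and sum(row) == 2 for row in S)


def matmul(A, B):
    """Matrix product of two lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


H = [[0, 1, 0],   # haplotype 0
     [0, 1, 1],   # haplotype 1
     [1, 1, 1]]   # haplotype 2
S = [[1, 1, 0],   # genotype 0 = haplotype 0 + haplotype 1
     [0, 0, 2],   # genotype 1 is homozygous: haplotype 2 twice
     [1, 0, 1]]   # genotype 2 = haplotype 0 + haplotype 2
assert is_selector(S)
delta = matmul(S, H)
print(delta)  # [[0, 2, 1], [2, 2, 2], [1, 2, 1]]
```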

The k-HPP Problem

the k-HPP problem
  given an n x m matrix Δ representing a set of n distinct genotypes, each with m loci
  find an n x k 0-1-2 selector matrix S and a k x m 0-1 haplotype matrix H such that
    Δ = S H
    S has as few non-zero columns as possible
    all row-sums of S are 2
HPP is equivalent to k-HPP with k = 2n
lower bounds for HPP
  a well known lower bound depends only on n (its formula is missing from this text)
  Lemma: rank(Δ) is a lower bound for HPP
    consider an optimal solution S, H
    since Δ = S H, we know that rank(Δ) ≤ min(rank(S), rank(H)); rank(H) is at most the number of distinct rows of H, and thus H must have at least rank(Δ) distinct rows (haplotypes)
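The rank bound is cheap to evaluate. The sketch below computes the rank of an integer matrix by exact Gaussian elimination over the rationals, using only the standard library. The function name `rank` and the example matrix are assumptions for illustration; the matrix is a small hand-made set of arithmetic genotypes.

```python
from fractions import Fraction


def rank(M):
    """Rank of a matrix (list of rows) via exact Gauss-Jordan elimination."""
    rows = [[Fraction(x) for x in r] for r in M]
    ncols = len(rows[0]) if rows else 0
    rk = 0
    for c in range(ncols):
        # Find a row at or below position rk with a non-zero entry in column c.
        pivot = next((r for r in range(rk, len(rows)) if rows[r][c] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for r in range(len(rows)):
            if r != rk and rows[r][c] != 0:
                f = rows[r][c] / rows[rk][c]
                rows[r] = [x - f * y for x, y in zip(rows[r], rows[rk])]
        rk += 1
        if rk == len(rows):
            break
    return rk


delta = [[0, 2, 1],
         [2, 2, 2],
         [1, 2, 1]]
print(rank(delta))  # 3: any resolution of these genotypes needs at least 3 distinct haplotypes
```

For this example the bound is tight, since the three haplotypes 010, 011, and 111 resolve all three genotypes.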

Finding H given Δ and S

given Δ and H, finding an S is easy
given Δ and S, find an H by solving a 2-SAT problem
  if genotype i is resolved by haplotypes t and l, then for each locus j add the following clauses
    if δ_{i,j} = 0, add two clauses (¬h_{t,j}) ∧ (¬h_{l,j})
    if δ_{i,j} = 2, add two clauses (h_{t,j}) ∧ (h_{l,j})
    if δ_{i,j} = 1, add clauses (h_{t,j} ∨ h_{l,j}) ∧ (¬h_{t,j} ∨ ¬h_{l,j})
      exactly one of h_{t,j}, h_{l,j} must be 1
the 2-SAT problem has km variables and 2nm clauses
  it can be solved in (almost) linear time
  any satisfying assignment gives a resolution of the genotypes
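The clause construction above can be sketched in Python with a textbook 2-SAT solver (implication graph plus Kosaraju's strongly connected components). This is an illustrative sketch, not the authors' code. The names `solve_2sat` and `find_haplotypes` are hypothetical, and the selector matrix is passed as a list `pairs`, where `pairs[i] = (t, l)` gives the two haplotype indices chosen by row i of S (t = l for a homozygous genotype). A literal is encoded as 2v for variable v and 2v + 1 for its negation.

```python
def solve_2sat(num_vars, clauses):
    """Return a satisfying list of booleans, or None. Clauses are pairs of literals."""
    n = 2 * num_vars
    adj = [[] for _ in range(n)]
    radj = [[] for _ in range(n)]
    for a, b in clauses:
        # (a or b) is equivalent to (not a -> b) and (not b -> a).
        for u, v in ((a ^ 1, b), (b ^ 1, a)):
            adj[u].append(v)
            radj[v].append(u)
    order, seen = [], [False] * n
    for s in range(n):                      # pass 1: DFS finish order on the graph
        if seen[s]:
            continue
        seen[s] = True
        stack = [(s, 0)]
        while stack:
            u, i = stack.pop()
            if i < len(adj[u]):
                stack.append((u, i + 1))
                v = adj[u][i]
                if not seen[v]:
                    seen[v] = True
                    stack.append((v, 0))
            else:
                order.append(u)
    comp, c = [-1] * n, 0
    for s in reversed(order):               # pass 2: components on the reversed graph
        if comp[s] != -1:
            continue
        comp[s] = c
        stack = [s]
        while stack:
            u = stack.pop()
            for v in radj[u]:
                if comp[v] == -1:
                    comp[v] = c
                    stack.append(v)
        c += 1
    if any(comp[2 * v] == comp[2 * v + 1] for v in range(num_vars)):
        return None
    # Components are numbered in topological order; a literal is true if it comes later.
    return [comp[2 * v] > comp[2 * v + 1] for v in range(num_vars)]


def find_haplotypes(delta, pairs, k):
    """Recover a k x m 0/1 haplotype matrix from arithmetic genotypes and selections."""
    m = len(delta[0])
    clauses = []
    for i, (t, l) in enumerate(pairs):
        for j in range(m):
            a, b = 2 * (t * m + j), 2 * (l * m + j)
            if delta[i][j] == 0:
                clauses += [(a ^ 1, a ^ 1), (b ^ 1, b ^ 1)]   # both alleles are 0
            elif delta[i][j] == 2:
                clauses += [(a, a), (b, b)]                   # both alleles are 1
            else:
                clauses += [(a, b), (a ^ 1, b ^ 1)]           # exactly one allele is 1
    sol = solve_2sat(k * m, clauses)
    if sol is None:
        return None
    return [[int(sol[t * m + j]) for j in range(m)] for t in range(k)]


delta = [[0, 2, 1], [2, 2, 2], [1, 2, 1]]
pairs = [(0, 1), (2, 2), (0, 2)]
print(find_haplotypes(delta, pairs, k=3))  # [[0, 1, 0], [0, 1, 1], [1, 1, 1]]
```

In this example the haplotypes are forced: genotype 1 fixes haplotype 2, genotype 2 then fixes haplotype 0, and genotype 0 fixes haplotype 1. In general, haplotype loci that no clause constrains receive an arbitrary value.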

Quadratic, Vector, and Semi-definite Programs

Quadratic Integer Program
  optimize a quadratic objective function subject to quadratic constraints on integer variables
  strict, when each term has total degree 0 or 2
Vector program
  optimize a linear objective function of inner products of vector variables subject to linear constraints on inner products of those variables
  strict quadratic programs lead to vector programs (products of variables are mapped to inner products of the corresponding vectors)
Semidefinite program (SDP)
  optimize a linear objective function of the elements of a matrix X subject to
    linear constraints on the elements of X
    X being a positive semi-definite matrix
  vector programs lead to SDP (X is the matrix of all vector inner products)
SDP programs can be solved in polynomial time with small numerical errors, thus solving vector programs, thus solving relaxations of strict Quadratic Integer programs
construct an approximate solution to a quadratic integer program from a solution of its relaxation, obtained via SDP

Quadratic Integer Program for the k-HPP

[Objective function and “Subject to:” constraints: the equations on this slide are missing from this text.]

QIP Heuristic: SDP+Rounding+Backtracking

recursively solve k-HPP
  using SDP, compute vectors for the variables of the QIP
  for each selector variable S_{i,j}, compute P[S_{i,j}] = probability that a random hyperplane separates the vectors of the S_{i,j} and z variables (à la MAX-CUT)
  round to 1 the S_{i,j}* with the highest P[S_{i,j}]
  residual k-HPP = the k-HPP problem with the rounded S_{i,j}’s fixed to their rounded values
  if the residual k-HPP is infeasible
    round S_{i,j}* to 0 instead
    if the new residual k-HPP is still infeasible, backtrack by returning infeasible
  recursively solve the residual k-HPP
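The separation probability used in the rounding step has a closed form. For two non-zero vectors u and v, a uniformly random hyperplane through the origin separates them with probability θ/π, where θ is the angle between them; this is the standard fact behind random-hyperplane rounding for MAX-CUT. The Python sketch below evaluates it and picks the variable with the highest value. The vectors are made-up stand-ins for an SDP solution, and the names `separation_probability`, `z`, and `selector_vectors` are hypothetical.

```python
import math


def separation_probability(u, v):
    """Probability that a random hyperplane through the origin separates u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos_theta) / math.pi


# Hypothetical unit vectors: z is the reference vector, the rest belong to selector variables.
z = [1.0, 0.0]
selector_vectors = {
    (0, 0): [1.0, 0.0],    # same direction as z
    (0, 1): [0.0, 1.0],    # orthogonal to z
    (1, 0): [-1.0, 0.0],   # opposite to z
}
probs = {key: separation_probability(vec, z) for key, vec in selector_vectors.items()}
best = max(probs, key=probs.get)
print(best, probs[best])  # (1, 0) 1.0
```

In the heuristic described on the slide, the variable selected this way would be rounded to 1 and the residual problem solved recursively.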

Experiments

we experiment with three approaches for the HPP problem
  Clark’s rule
  LP relaxation of Gusfield’s TIP scheme with simple rounding
  the QIP heuristic for k-HPP with k = 2n
the MATLAB package SDPT3 (version 3.02) is used to solve the SDP relaxation of the problem
all experiments are done with single-CPU MATLAB on a dual Xeon 2.4 GHz desktop with 1 GB memory

Experimental Datasets

we use synthetic datasets A and B
  each with 20 instances for each triplet (n, m, k) = (5, 5, 5), (8, 8, 8), (10, 10, 10), and (15, 15, 15) (and, for B, for each recombination level ρ = 0, 16, and 40)
generate instances of the HPP problem as follows
  randomly mate k haplotypes with m loci to produce n genotypes
generation of haplotypes for dataset A
  each locus of the k haplotypes takes value 0/1 with probability ½, independent of other loci and other genotypes
generation of haplotypes for dataset B
  use Hudson’s program to generate haplotypes with these parameters
    diploid population of size 10^6
    mutation rate = 1.5 × 10^-6
    recombination levels ρ = 0, 16, and 40, corresponding to crossover probabilities 0, 4 × 10^-6, and 10^-5

Experimental Results

QIP Extensions

QIP can be extended to handle many variants of the basic k-HPP problem, such as
  partial genotypes
    some loci in some genotypes are unknown
  shared haplotypes
    prior knowledge of shared haplotypes
  allowing for erroneous genotypes and loci editing
  allowing for outlier genotypes

Concluding Remarks

developed an arithmetic formulation for the HPP problem
  provides a new lower bound
  yields a simple quadratic IP (QIP)
  QIP can be extended to handle many variants, incorporate prior information, etc.
SDP relaxation of the QIP can be solved in polynomial time
  SDP+rounding+backtracking gives the QIP heuristic
experimentally
  demonstrate competitiveness of the QIP heuristic vs Clark’s rule and Gusfield’s TIP relaxation
  show that the rank of the genotype matrix is a tighter lower bound than the well known bound that depends only on n
future work
  analysis of the worst-case performance ratio of the QIP heuristic
  devise algorithms that scale better

Thank You !

Questions ?