Upload
ross-booth
View
224
Download
0
Embed Size (px)
Citation preview
BIBE 05 1
Haplotype Phasing using Semidefinite Programming
Parag NamjoshiCSEE Department
University of Maryland Baltimore County
Joint work with Konstantinos Kalpakis
BIBE 05 2
Outline Biology Review Motivation Previous work Our contribution Experimental results Conclusions
BIBE 05 3
Biology Review living systems are composed of cells
the code for the creation of the cells is packed in a molecule called DNA.
DNA consists of four nucleic acids Adenine, Cytosine, Guanine, and Thymine arranged as complementary strands of a double helix.
DNA strand = string of A,C,G, & T’s.
BIBE 05 4
Chromosomes the genome is arranged as set of
distinct chromosomes.
mammals are diploids humans have 22 + x and y chromosomes. chromosomes occur in homologous pairs one homologous chromosome is inherited from each parent
homologous chromosomes contain the same genes in the same order (up to mutations)
BIBE 05 5
Single Nucleotide Polymorphisms.
Single Nucleotide Polymorphism (SNP) = mutation of a single base.
evidence suggests that in humans 90% of variation is due to SNPs DNA has long conserved regions punctuated by SNPs
there is one SNP in approximately 1000 bases most SNPS are bi-allelic
at any given locus, only two of the four possible nucleotides are present in 95% of the population
the restriction (projection) of a DNA strand to SNP sites is a haplotype
BIBE 05 6
What are Genotypes? the genotype of diploid organisms is the conflation of the
inherited haplotypes
T C A G A C
T G A C T C
TT {C,G} AA {C,G} {A,T} CC
T {C,G} A {C,G} {A,T} C
Mother
Father
Child
Haplotypes
Genotype
Homozygous Heterozygous
BIBE 05 7
Genotype & Haplotype Std. Representation
genotypes and haplotypes can be represented as a 0,1,2 vectors independently for each site
identify each one of the two letters that appear in it with 0 or 1
replace each homozygous site with 0/1 using the mapping above
replace heterozygous sites with 2
T/1 C/0 A/1 G/0 A/1 C/0
T/1 G/1 A/1 C/1 T/0 C/0
Mother
Father
Child
Haplotypes
Genotype1 2 1 2 2 0
BIBE 05 8
Haplotypes vs. Genotypes large scale polymorphism studies such as Linkage
Disequilibrium need haplotype information however, experimentally
it is expensive to segregate the haplotypes of the individuals
it is easier to observe the genotypes of those individuals can we find haplotypes from the genotypes
computationally? a genotype with h heterozygous sites can be
explained (phased) by 2h-1 different haplotype pairs
how do you choose among them?
BIBE 05 9
Haplotype Phasing with Parsimony in Population haplotyping, given genotypes from different
individuals we want to find a set of haplotypes which resolve all the genotypes
Recall that there can be many such solutions Experimental evidence suggests that the number of such haplotypes is
small
HPP: Haplotype Phasing Problem with Pure Parsimony Given a set of genotypes, find a minimum size set of haplotypes which
conflate to produce the given genotypes
other criteria for choosing among possible sets of haplotypes are perfect phylogeny, minimum total pairwise distance, minimum
diameter, etc
we focus on HPP problem Lancia, Pinotti, and Rizzi proved that the HPP is NP–complete as well as
APX–hard
BIBE 05 10
Clark’s Rule Clark (1990) describes a greedy inference rule to find a
small set of haplotypes resolving a set of genotypes Starting with a set of haplotypes H that resolves all the
homozygous genotypes, do the following for each unresolved genotype g
if there is a pair (h, h’) that resolves g with h in H, then add h’ to H, else stop
the solution obtained is sensitive to the order in which genotypes are resolved
Clark’s rule may terminate with some genotypes unresolved (orphans) The rule can be modified to include a pair of haplotypes that
resolve an orphan genotype, and continue as before
h H
BIBE 05 11
Gusfield’s TIP Gusfield (1999) introduces the TIP approach
enumerate all distinct haplotypes that can be used to resolve any single heterozygous genotype
solve an Integer linear Program (IP) to select a minimum size set haplotypes from the enumerated haplotypes that explains the genotypes
TIP uses O(2L n) variables and constraints, where L is the maximum number of heterozygous loci of any genotype
Gusfield describes a number of important improvements to the basic approach above that improve performance
BIBE 05 12
Harrower-Brown IP Harrower and Brown give an alternate 0-1 IP for
the HPP problem (HB-IP) explain the n genotypes with 2n haplotypes (not
necessarily distinct) the number of distinct haplotypes used are minimized the number of variables and constraints is polynomial in
n, m
BIBE 05 13
The QIP approach - Outline arithmetic representation of genotypes semidefinite programming (SDP) Quadratic Integer Program (QIP) for HPP
a semidefinite programming based heuristic to solve QIP experimental results concluding remarks
BIBE 05 14
Arithmetic Representation of Genotypes represent each genotype g as a vector δ with
each homozygous locus takes value 0 or 2 iff it was 0 or 1 in g
each heterozygous locus takes value 1 conflation can now be replaced by addition
if haplotypes h1 and h2 explain genotype δ, then δ = h1 + h2
we call δ an arithmetic genotype
g = 0 1 2
h1= 0 1 0
h2= 0 1 1
δ = 0 2 1
h1= 0 1 0
h2= 0 1 1
g δ
BIBE 05 15
Arithmetic Genotypes let Δ be n x m matrix with the arithmetic
genotypes as rows let H be k x m matrix with haplotypes as rows if haplotypes in H resolve Δ, then
Δ = S H where S is a n x k 0-1-2 matrix
the row of S for a homozygous genotype has a single 2 all other rows have exactly two 1s
we call S a selector matrix ith row of S “selects” two haplotypes (rows of H) to explain
ith genotype
BIBE 05 16
The k-HPP Problem the k-HPP problem
Given nxm matrix Δ representing a set of n distinct genotypes each with m loci
Find an nxk 0-1-2 selector matrix S and a kxm 0-1 haplotype matrix H such that
Δ = S H S has as few non-zero columns as possible all row-sums of S are 2
HPP is equivalent to k-HPP with k=2n
lower Bounds for HPP is a well known lower bound Lemma: rank(Δ) is a lower bound for HPP
Consider an optimal solution S, H Since Δ = S H, we know that rank(Δ) = min(rank(S), rank(H)), and thus H
must have at least rank(Δ) distinct rows (haplotypes)
n
n
BIBE 05 17
Finding H given Δ and S given Δ and H to find an S is easy
given Δ and S find an H by solving a 2-SAT problem If genotype i is resolved by haplotypes t and l, then for each
locus j, add following clauses If δi,j = 0, add two clauses (¬ht,j) ^ (¬hl,j) If δi,j = 2, add two clauses (ht,j) ^ (hl,j) If δi,j = 1, add clauses (ht,j V hl,j ) ^ (¬ht,j V ¬hl,j)
Only one of the ht,j ,hl,j must both be 1
2-SAT problem has km variables and 2nm clauses can be solved in (almost) linear time any satisfying assignment gives a resolution of the genotypes
BIBE 05 18
Quadratic, Vector, and Semi-definite Programs Quadratic Integer Program
Optimize a quadratic objective function subject to quadratic constraints on integer variables
Strict, when each term has total degree 0 or 2 Vector program
optimize a linear objective function of inner products of vector variables subject to linear constraints on inner products of those variables
Strict quadratic programs lead to vector programs (products of variables are mapped to inner products of corresponding vectors)
SDP program optimize a linear objective function of the elements of a matrix X subject to
linear constraints on the elements of X X being a positive semi-definite matrix
Vector programs lead to SDP (X is the matrix of all vector inner products) SDP programs can be solved in polynomial-time with small numerical
errors, thus solving vector programs, thus solving relaxations of strict Quadratic Integer programs
construct an approximate solution to a quadratic integer program from a solution of its relaxation, obtained via SDP
BIBE 05 19
Quadratic Integer Program for the k-HPP
Subject to:
BIBE 05 20
QIP Heuristic: SDP+Rounding+Backtracking recursively solve k-HPP
using SDP compute vectors for the variables of QIP for each selector variable Si,j, compute
P[Si,j]=probability that a random hyperplane separates the vectors of Si,j and z variables (ala MAX-CUT)
round to 1 the Si,j* with the highest P[Si,j] residual k-HPP=k-HPP problem with the rounded Si,j’s
fixed to their rounded value if the residual k-HPP is infeasible
round Si,j* to 0 instead if the new residual k-HPP is still infeasible
backtrack by returning infeasible recursively solve the residual k-HPP
BIBE 05 21
Experiments we experiment with three approaches for the HPP
problem Clark’s rule LP relaxation of Gusfield’s TIP scheme with simple
rounding the QIP heuristic for k–HPP with k = 2n
The MATLAB package SDPT 3.02 is used to solve the SDP relaxation of the problem
all experiments are done on a single CPU MATLAB on a Dual Xeon 2.4 Ghz desktop with 1GB memory
BIBE 05 22
Experimental Datasets we use synthetic datasets A and B
each with 20 instances for each triplet (n, m, k) = (5, 5, 5), (8, 8, 8), (10, 10, 10), and (15, 15, 15) (and for B, recombination levels ρ = 0, 16 and 40)
generate instances of the HPP problem as follows randomly mate k haplotypes with m loci to produce n
genotypes generation of haplotypes for dataset A
each locus of k haplotypes takes value 0/1 with probability ½ independent of other loci and other genotypes
generation of haplotypes for dataset B Use Hudson’s program to generate haplotypes with these
parameters diploid population of size 106
mutation rate = 1.5 × 10-6
recombination levels ρ = 0, 16 and 40 corresponding to crossover probabilities 0, 4 × 10-6, and 10-5
BIBE 05 23
Experimental Results
BIBE 05 24
QIP Extensions QIP can be extended to handle many variants of
basic k-HPP problem, such as partial Genotypes
Some loci in some genotypes are unknown shared haplotypes
Prior knowledge of shared haplotypes allowing for erroneous genotypes and loci editing allowing for outlier genotypes
BIBE 05 25
Concluding Remarks developed arithmetic formulation for the HPP problem
provides new lower bound yields simple quadratic IP (QIP) QIP can be extended to handle many variants, incorporate prior
information etc SDP relaxation of QIP that can be solved in polynomial time
SDP+rounding+backtracking gives QIP heuristic experimentally
Demonstrate competitiveness of QIP heuristic vs Clark’s rule and Gusfield’s TIP relaxation
Show that rank of the genotypes is a tighter lower bound than future work
Analysis of worst-case performance ratio of the QIP heuristic Devise algorithms that scale better
n
BIBE 05 26
Thank You !
Questions ?