Constraint Programming and Biology: Haplotype Inference

Agostino Dovier
Dept. Math and Computer Science (DIMI), Univ. of Udine, Italy

ACP Summer School in Constraint Programming
Wrocław, September 2012

Haplotype Inference Introduction

DNA and Genome in a nutshell

DNA (DeoxyriboNucleic Acid) is characterized by a string of nucleotides: A, C, G, and T (Adenine, Cytosine, Guanine, Thymine).
Given a sequence s ∈ {A,C,G,T}*, the complementary sequence s̄ is deterministically obtained by substituting A ↔ T and C ↔ G.
s and s̄ fold together, forming the famous double helix.
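The substitution rule can be sketched in a few lines of Python. This is only an illustration of the A ↔ T, C ↔ G mapping stated above; the function name `complement` is an assumption of this sketch, not something taken from the course material.

```python
# Pairing rule from the slide: A <-> T and C <-> G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(s: str) -> str:
    """Return the complementary sequence of s, position by position."""
    if set(s) - set("ACGT"):
        raise ValueError("sequence must contain only A, C, G, T")
    return s.translate(COMPLEMENT)


if __name__ == "__main__":
    s = "GATCTGTACTGAGT"
    print(s)
    print(complement(s))
```

Applying `complement` twice gives back the original string, which matches the statement that s̄ is deterministically obtained from s.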


DNA strings are huge (10^6 to 10^10 nucleotides).
Differences between the DNAs of two members of the same species are limited (e.g., 1 in 1000 for humans).
Some fragments of the DNA encode proteins (we will come back to that later). Let us say for now that they are very important parts, called genes.
In the human DNA it is estimated that there are 23000 (maybe fewer) protein-coding genes.
Differences of some nucleotides in the same gene characterize a property of an individual w.r.t. another.
The set of all genes of an individual is called the genome.


Haplotype Inference

Genes are packaged in bundles called chromosomes. (Chromosomes are therefore regions of DNA.)
In diploid organisms (like humans) we have homologous chromosome pairs (23 in humans), one chromosome of each pair coming from the DNA of the father, the other from the DNA of the mother.
A haplotype is a DNA sequence that has been inherited from one parent.


Each person inherits two haplotypes (from the mother and from the father) for most regions of the genome.

· · · G A T C T G T A C T G A G T · · ·
· · · G A T C T G T A C T G A A T · · ·

In some (typical) points, the bases can be different. If this is the case, we say that there is a Single Nucleotide Polymorphism (SNP).

Changes are always C ↔ T and A ↔ G.


A good introduction is available at http://csiflabs.cs.ucdavis.edu/~gusfield/gusfieldorzack.pdf

The Haplotype Inference problem(s) is (are) introduced to investigate genetic variations in a population.
Some particular points of the DNA where mutations are typically concentrated are selected (SNPs).
These lists of points are analyzed.


Haplotype Inference: Single Nucleotide Polymorphism (SNP)

Each person has two haplotypes (from the mother and from the father) for most regions of the genome:

G A A T C T T C G T A C T G A G T
G A A T C T T C G T A C T G A A T

Let us focus on the SNPs:

A C T G
A C T A
0 0 1 2

We know that at a location (site/locus) there is a SNP.
We know whether the SNP is C ↔ T or A ↔ G.
We assign a SNP a 0-2 value in the following way:

C,C ↦ 0   T,T ↦ 1   C,T ↦ 2   T,C ↦ 2
A,A ↦ 0   G,G ↦ 1   A,G ↦ 2   G,A ↦ 2
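The 0-2 assignment above is a small lookup table. The following Python sketch encodes the bases observed at the SNP sites of the two haplotypes; the names `encode_site` and `encode_snps` are hypothetical and only serve this illustration.

```python
def encode_site(a: str, b: str) -> int:
    """Encode the two bases found at one SNP site as 0, 1 or 2."""
    if a == b:
        if a in ("C", "A"):
            return 0
        if a in ("T", "G"):
            return 1
        raise ValueError(f"unknown base: {a!r}")
    # Heterozygous site: only C/T and A/G are allowed by the slide.
    if {a, b} in ({"C", "T"}, {"A", "G"}):
        return 2
    raise ValueError(f"unexpected SNP pair: {a},{b}")


def encode_snps(sites1: str, sites2: str) -> list[int]:
    """Encode the SNP sites of two haplotypes, given as equal-length strings."""
    if len(sites1) != len(sites2):
        raise ValueError("both haplotypes must list the same SNP sites")
    return [encode_site(a, b) for a, b in zip(sites1, sites2)]


if __name__ == "__main__":
    print(encode_snps("ACTG", "ACTA"))  # the example of the slide
```

On the slide's example (sites A C T G and A C T A) this yields the values 0 0 1 2.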


A string of {0,1,2}* is called a genotype.
A string of {0,1}* is called a haplotype.
Two equal-length haplotypes generate a unique genotype. E.g., 0010, 0101 ⇒ 0222.
If we have a genotype, we can only conjecture the haplotypes that generated it (observe that, e.g., 0110, 0001 ⇒ 0222 as well).
Biological experiments allow us to know genotypes!
Investigating sets of genotypes for a population, we can understand the relationships between SNPs and physical features as well as medical information.
Since genotypes arise through evolution, it is reasonable to look for minimal sets of haplotypes explaining the known genotypes.
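The direction "two haplotypes generate a unique genotype" is easy to make concrete. A minimal Python sketch, assuming haplotypes and genotypes are written as strings over 0/1 and 0/1/2 (the name `genotype_of` is hypothetical):

```python
def genotype_of(h1: str, h2: str) -> str:
    """Genotype generated by two equal-length haplotypes.

    Equal values are copied, different values become 2.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    if set(h1 + h2) - {"0", "1"}:
        raise ValueError("haplotypes are strings over 0 and 1")
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


if __name__ == "__main__":
    # Two different pairs generate the same genotype, so the
    # inverse direction is not unique.
    print(genotype_of("0010", "0101"))
    print(genotype_of("0110", "0001"))
```

Both calls return 0222, which is exactly why the haplotypes behind a genotype can only be conjectured.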


Haplotype Inference Definition


Let $\mathcal{H} = \{0,1\}^*$ be the set of haplotypes and $\mathcal{G} = \{0,1,2\}^*$ be the set of genotypes.
Given $h_1, h_2 \in \mathcal{H}$ and $g \in \mathcal{G}$, $\{h_1, h_2\}$ explains $g$ if and only if $|h_1| = |h_2| = |g|$ (let's say $n = |g|$) and $\forall i \in [1..n]$:

$$g[i] \le 1 \rightarrow h_1[i] = h_2[i] = g[i]$$
$$g[i] = 2 \rightarrow h_1[i] \neq h_2[i]$$

A set of haplotypes $H \subseteq \mathcal{H}$ explains a set of genotypes $G \subseteq \mathcal{G}$ if for all $g \in G$ there are $h_1, h_2 \in H$ such that $\{h_1, h_2\}$ explains $g$.
Given a set of genotypes $G \subseteq \mathcal{G}$ and an integer $k$, the haplotype inference problem (HIP) by pure parsimony is the problem of finding a set $H \subseteq \mathcal{H}$ that explains $G$ and such that $|H| = k$ (decision version).
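The two definitions translate almost literally into code. The sketch below is a plain Python check, not a constraint model; vectors are tuples of integers, and the names `explains` and `set_explains` are assumptions of this illustration.

```python
from itertools import combinations_with_replacement


def explains(h1, h2, g) -> bool:
    """True iff {h1, h2} explains g according to the definition."""
    if not (len(h1) == len(h2) == len(g)):
        return False
    for a, b, c in zip(h1, h2, g):
        if c <= 1 and not (a == b == c):
            return False
        if c == 2 and a == b:
            return False
    return True


def set_explains(H, G) -> bool:
    """True iff every genotype in G is explained by some pair from H.

    A pair may use the same haplotype twice (needed for genotypes
    that contain no 2).
    """
    pairs = list(combinations_with_replacement(list(H), 2))
    return all(any(explains(h1, h2, g) for h1, h2 in pairs) for g in G)


if __name__ == "__main__":
    G = [(0, 1, 2), (0, 2, 1)]
    H = [(0, 1, 0), (0, 1, 1), (0, 0, 1)]
    print(set_explains(H, G))      # all three haplotypes
    print(set_explains(H[:2], G))  # the second genotype is not explained
```

With the three haplotypes the check succeeds; dropping 001 leaves the genotype 021 unexplained.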


Haplotype Inference: The ILP modeling

0, 1, 2 are arbitrary values.
Let us swap the roles of 1 and 2 in genotypes, namely 1 is used for a mismatch (and 2 stands for the old 1).
Given $h_1, h_2 \in \mathcal{H}$ and $g \in \mathcal{G}$, $\{h_1, h_2\}$ explains $g$ if and only if $|h_1| = |h_2| = |g| = n$ and $\forall i \in [1..n]$:

$$g[i] = 0 \rightarrow h_1[i] = h_2[i] = 0$$
$$g[i] = 2 \rightarrow h_1[i] = h_2[i] = 1$$
$$g[i] = 1 \rightarrow \{h_1[i], h_2[i]\} = \{0,1\}$$

Therefore $g[i] = h_1[i] + h_2[i]$.
This simplifies an ILP encoding. Experimentally it does not speed up the CP models, so just forget it in this school.
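To see why the swapped coding turns "explains" into a sum, one can recode a genotype and compare it with the position-wise sum of two haplotypes. A small Python sketch under that reading; `recode` and `explains_by_sum` are hypothetical names.

```python
# Swap the roles of 1 and 2: after recoding, 1 marks a mismatch.
SWAP = {0: 0, 1: 2, 2: 1}


def recode(g):
    """Translate a genotype from the original coding to the ILP coding."""
    return [SWAP[v] for v in g]


def explains_by_sum(h1, h2, g_ilp) -> bool:
    """In the ILP coding, {h1, h2} explains g iff g[i] = h1[i] + h2[i]."""
    return len(h1) == len(h2) == len(g_ilp) and all(
        a + b == c for a, b, c in zip(h1, h2, g_ilp)
    )


if __name__ == "__main__":
    g = [0, 2, 2, 2]        # genotype 0222 in the original coding
    g_ilp = recode(g)       # mismatches are now written as 1
    print(g_ilp)
    print(explains_by_sum([0, 0, 1, 0], [0, 1, 0, 1], g_ilp))
```

The linear equality is what an ILP solver can use directly, since each position contributes one equation over 0-1 variables.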


Haplotype Inference by Pure Parsimony

Use of such a parsimony criterion is consistent with the fact that the number of distinct haplotypes observed in most natural populations is vastly smaller than the number of possible haplotypes; this is expected given the plausible assumptions that the mutation rate at each site is small and recombinations rate are low.

[Gusfield and Orzack, 2006]


Haplotype Inference Complexity

Haplotype Inference by Pure Parsimony: NP-completeness (sketch, see [LPR04])


Vertex cover of cardinality 2


         1  2  3  4  5
1        0  1  1  1  1  0
2        1  0  1  1  1  0
3        1  1  0  1  1  0
4        1  1  1  0  1  0
5        1  1  1  1  0  0

(1,2)    2  2  1  1  1  2
(1,3)    2  1  2  1  1  2
(1,4)    2  1  1  2  1  2
(1,5)    2  1  1  1  2  2
(2,5)    1  2  1  1  2  2
(3,5)    1  1  2  1  2  2
(4,5)    1  1  1  2  2  2


Vertex cover of cardinality k for the graph G = ⟨N,E⟩ ⇒ explaining set of haplotypes H with |H| = |N| + k.
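The forward direction of this statement can be checked on the five-node example. The Python sketch below rebuilds the node rows and edge rows of the matrix and, for a given vertex cover, adds one extra haplotype per cover node (the node row with its last bit flipped). Two things are assumptions of this sketch rather than statements of the slides: that the node rows are also part of the genotype set, and the helper names. It does not prove the converse direction, for which see [LPR04].

```python
from itertools import combinations_with_replacement


def node_row(i, n, last=0):
    """Row of node i: 0 at position i, 1 elsewhere, plus a final flag."""
    return tuple(0 if p == i else 1 for p in range(1, n + 1)) + (last,)


def edge_row(i, j, n):
    """Genotype of edge (i, j): 2 at positions i and j and in the last column."""
    return tuple(2 if p in (i, j) else 1 for p in range(1, n + 1)) + (2,)


def explains(h1, h2, g):
    return all((a == b == c) if c <= 1 else (a != b)
               for a, b, c in zip(h1, h2, g))


def set_explains(H, G):
    pairs = list(combinations_with_replacement(H, 2))
    return all(any(explains(a, b, g) for a, b in pairs) for g in G)


def haplotypes_from_cover(n, cover):
    """|N| node haplotypes plus one flipped copy per node in the cover."""
    H = [node_row(i, n) for i in range(1, n + 1)]
    H += [node_row(c, n, last=1) for c in sorted(cover)]
    return H


if __name__ == "__main__":
    n = 5
    edges = [(1, 2), (1, 3), (1, 4), (1, 5), (2, 5), (3, 5), (4, 5)]
    G = [node_row(i, n) for i in range(1, n + 1)]
    G += [edge_row(i, j, n) for i, j in edges]
    H = haplotypes_from_cover(n, {1, 5})
    print(len(H), set_explains(H, G))
```

For the cover {1, 5} the set has 5 + 2 = 7 haplotypes and explains every row; with {1} alone the edge (2,5) stays unexplained.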


Haplotype Inference Encoding

Haplotype Inference: 1st CP encoding

Let us focus on the decisional version: Is there an explanation for $G$ with $k$ haplotypes?
Generate $k$ vectors of 0-1 FD variables $H_1, \ldots, H_k$ of length $n$.
Add a lexicographical constraint on $H_1, \ldots, H_k$.
Build a constraint of the form:

$$\forall G_i \in G \;\; \exists H_{i_1} \, \exists H_{i_2} \text{ s.t. } \langle H_{i_1}, H_{i_2} \rangle \text{ explain } G_i$$

Basically, for each $i, i_1, i_2$ we have a flag $F^i_{i_1,i_2}$ that is true iff

$$\bigwedge_{j=1}^{n} \Big( \big(G_i[j] \le 1 \rightarrow H_{i_1}[j] = H_{i_2}[j] = G_i[j]\big) \wedge \big(G_i[j] = 2 \rightarrow H_{i_1}[j] \neq H_{i_2}[j]\big) \Big)$$

Then for all $i \in [1..|G|]$:

$$\sum_{i_1,i_2} F^i_{i_1,i_2} \ge 1$$
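Before writing the CLP(FD) model, it can help to see the same structure as a naive generate-and-test in Python. This is not the course's clp_direct.pl: it enumerates instead of propagating, and the name `haplo_decision_bruteforce` is hypothetical. The lexicographical constraint is read here as strictly increasing order, so the k vectors are distinct.

```python
from itertools import combinations, combinations_with_replacement, product


def explains(h1, h2, g):
    return all((a == b == c) if c <= 1 else (a != b)
               for a, b, c in zip(h1, h2, g))


def haplo_decision_bruteforce(G, k):
    """Return k distinct haplotypes explaining G, or None if there are none.

    Only usable for tiny n and k: there are C(2^n, k) candidate sets.
    """
    n = len(G[0])
    # product() yields the 2^n vectors in lexicographic order, so each
    # combination is already sorted: permutations of one set are skipped.
    candidates = list(product((0, 1), repeat=n))
    for H in combinations(candidates, k):
        pairs = list(combinations_with_replacement(H, 2))
        # any(...) plays the role of "sum of the flags F >= 1" for G_i.
        if all(any(explains(a, b, g) for a, b in pairs) for g in G):
            return list(H)
    return None


if __name__ == "__main__":
    G = [[0, 1, 2], [0, 2, 1]]
    print(haplo_decision_bruteforce(G, 3))
    print(haplo_decision_bruteforce(G, 2))
```

On the two genotypes 012 and 021 there is a solution with three haplotypes and none with two.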


Haplotype Inference: 2nd CP encoding

Let us focus on the decisional version: Is there an explanation for $G$ with $k$ haplotypes?
Generate $m = 2|G|$ vectors of 0-1 FD variables $H_1, \ldots, H_m$ of length $n$.
Add a lexicographical constraint on pairs $(H_1,H_2), (H_3,H_4), \ldots, (H_{m-1},H_m)$ (we can have repetitions now!).
Build a constraint of the form:

$$(\forall G_i \in G)\; \big(\langle H_{2i-1}, H_{2i} \rangle \text{ explain } G_i\big)$$

Namely, again,

$$\bigwedge_{j=1}^{n} \Big( \big(G_i[j] \le 1 \rightarrow H_{2i-1}[j] = H_{2i}[j] = G_i[j]\big) \wedge \big(G_i[j] = 2 \rightarrow H_{2i-1}[j] \neq H_{2i}[j]\big) \Big)$$

We need to state (using constraints!) that $|\{H_1, \ldots, H_m\}| = k$.
This is a good constraint exercise.
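In this encoding every genotype owns its pair of haplotypes, so the only freedom is how each position with value 2 is split between the two. A brute-force Python sketch of that view follows; it is not clp_second.pl, and the names `explaining_pairs` and `min_distinct_haplotypes` are hypothetical. It reports the smallest number of distinct vectors over all choices rather than testing one fixed k.

```python
from itertools import product


def explaining_pairs(g):
    """Yield every pair (h1, h2) with h1 <= h2 that explains g.

    A genotype with t positions equal to 2 has 2^(t-1) such pairs
    (exactly one pair when t = 0).
    """
    het = [j for j, v in enumerate(g) if v == 2]
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(g), list(g)
        for j, b in zip(het, bits):
            h1[j], h2[j] = b, 1 - b
        h1, h2 = tuple(h1), tuple(h2)
        if h1 <= h2:  # ordering inside the pair removes mirrored copies
            yield h1, h2


def min_distinct_haplotypes(G):
    """Try one pair per genotype and keep the choice with fewest distinct vectors."""
    best = None
    for choice in product(*(list(explaining_pairs(g)) for g in G)):
        distinct = {h for pair in choice for h in pair}
        if best is None or len(distinct) < len(best):
            best = distinct
    return best


if __name__ == "__main__":
    print(len(list(explaining_pairs((0, 2, 2, 2)))))
    print(sorted(min_distinct_haplotypes([(0, 1, 2), (0, 2, 1)])))
```

The pair ordering `h1 <= h2` mirrors the lexicographical constraint on pairs; repetitions across different pairs are allowed and are what the counting constraint has to detect.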


For $a,b \in [1..m]$ we set $F_{a,b} \leftrightarrow \bigwedge_{i=1}^{n} H_a[i] = H_b[i]$. Namely, $F_{a,b}$ is a Boolean variable that is true iff $H_a$ and $H_b$ will be equal in the solution.
Then define $M_a \leftrightarrow \bigvee_{b=a+1}^{m} F_{a,b}$. $M_a$ is again a Boolean variable that is true if and only if there is another vector in $H_{a+1}, H_{a+2}, \ldots, H_m$ equal to $H_a$.
The size of $H$ can therefore be expressed as $\sum_{a=1}^{m} (1 - M_a)$ (viewing Boolean truth values as 0/1).
What is the best encoding? Try codes clp_direct.pl and clp_second.pl (0, 1, and 2 have the original meaning here).
Search space: $O(2^{nk})$, with $k < 2|G|$ (1), $O(2^{2n|G|})$ (2).
Call goals like :- input(1,Gs),haplo_decision(Gs,5). or :- haplo_decision([[0,1,2],[0,2,1]],3).
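The counting trick with the flags F and M can be tried on ground vectors. The Python sketch below evaluates the same formulas on fixed values instead of FD variables; `count_distinct` is a hypothetical name, and the sum runs over all m vectors.

```python
def count_distinct(vectors):
    """Number of distinct vectors, computed as the sum over a of (1 - M_a)."""
    m = len(vectors)
    # F[a][b] = 1 iff vectors a and b agree on every position.
    F = [[int(len(vectors[a]) == len(vectors[b])
              and all(x == y for x, y in zip(vectors[a], vectors[b])))
          for b in range(m)] for a in range(m)]
    # M[a] = 1 iff some later vector (index b > a) equals vector a.
    M = [int(any(F[a][b] for b in range(a + 1, m))) for a in range(m)]
    return sum(1 - M[a] for a in range(m))


if __name__ == "__main__":
    H = [(0, 1, 0), (0, 1, 1), (0, 0, 1), (0, 1, 1)]
    print(count_distinct(H), len(set(H)))
```

Each group of equal vectors contributes exactly once, through its last member, which is the only one with M_a false.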


Haplotype Inference References

Haplotype Inference: Some References

Gusfield and Orzack. Haplotype Inference (survey, and ILP formulations). In CRC Handbook on Bioinformatics, 2006.
[LPR04] Lancia, Pinotti, Rizzi. Haplotyping Populations by Pure Parsimony: Complexity of Exact and Approximation Algorithms. INFORMS Journal on Computing 16(4):348–359, 2004.
Graça, Marques-Silva, Lynce, Oliveira. Several works on SAT-based and specialized 0-1 ILP for Haplotype Inference (e.g., WCB 08, WCB 09).
Di Gaspero, Roli. Stochastic local search for large-scale instances of the haplotype inference problem by pure parsimony. J. Algorithms 63(1-3):55–69, 2008 (also in WCB 08).
Erdem, Erdem, Türe. HAPLO-ASP: Haplotype Inference Using Answer Set Programming. LPNMR 2009: 573–578.
James Cussens. Maximum likelihood pedigree reconstruction using integer programming. WCB 10.
