Optimization Problems for Polymorphisms of Single Nucleotides

Page 1: Optimization Problems for  Polymorphisms of Single Nucleotides

Optimization Problems for

Polymorphisms of Single Nucleotides

Page 7: Optimization Problems for  Polymorphisms of Single Nucleotides

Polymorphisms

A polymorphism is a feature
- common to everybody
- not identical in everybody
- with just a few possible variants (alleles)

E.g. think of eye-color

Or blood-type for a feature not visible from outside

Page 10: Optimization Problems for  Polymorphisms of Single Nucleotides

At DNA level, a polymorphism is a sequence of nucleotides varying in a population.

The shortest possible sequence has only 1 nucleotide, hence Single Nucleotide Polymorphism (SNP)

atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac

Page 11: Optimization Problems for  Polymorphisms of Single Nucleotides

At DNA level, a polymorphism is a sequence of nucleotides varying in a population.

The shortest possible sequence has only 1 nucleotide, hence Single Nucleotide Polymorphism (SNP)

atcggattagttagggcacaggacggac atcggattagttagggcacaggacgtac
atcggcttagttagggcacaggacgtac atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacgtac atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacgtac atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac

Page 12: Optimization Problems for  Polymorphisms of Single Nucleotides

- SNPs are the predominant form of human variation

atcggattagttagggcacaggacggac atcggattagttagggcacaggacgtac
atcggcttagttagggcacaggacgtac atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacgtac atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacgtac atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac

- Used in drug design, disease studies, forensics, evolutionary studies...

- On average one every 1,000 bases

Page 13: Optimization Problems for  Polymorphisms of Single Nucleotides

- Multimillion-dollar SNP Consortium project

atcggattagttagggcacaggacggac atcggattagttagggcacaggacgtac
atcggcttagttagggcacaggacgtac atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacgtac atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacgtac atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac

- Goal: associate SNPs (or groups of SNPs) with genetic diseases

- 1st step: build maps of several thousand SNPs

Page 18: Optimization Problems for  Polymorphisms of Single Nucleotides

atcggattagttagggcacaggacggac atcggattagttagggcacaggacgtac
atcggcttagttagggcacaggacgtac atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacgtac atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacgtac atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac

HOMOZYGOUS: same allele on both chromosomes

HETEROZYGOUS: different alleles

HAPLOTYPE: chromosome content at SNP sites

Page 19: Optimization Problems for  Polymorphisms of Single Nucleotides

atcggattagttagggcacaggacggac atcggattagttagggcacaggacgtac
atcggcttagttagggcacaggacgtac atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacgtac atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacgtac atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac atcggattagttagggcacaggacggac

HOMOZYGOUS: same allele on both chromosomes

HETEROZYGOUS: different alleles

HAPLOTYPE: chromosome content at SNP sites

Page 20: Optimization Problems for  Polymorphisms of Single Nucleotides

ag at

ct ag

ct cg

at at

ag cg

ag cg

ag ag

HOMOZYGOUS: same allele on both chromosomes

HETEROZYGOUS: different alleles

HAPLOTYPE: chromosome content at SNP sites
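The step from the full sequences to the two-letter haplotypes above can be sketched in a few lines. This is only an illustration: the four 28-base variants are copied from the slides, while the names `A_G`, `A_T`, `C_G`, `C_T`, `snp_sites` and `haplotype` are hypothetical helpers introduced here.

```python
# The four chromosome variants that appear on the slides (28 bases each).
A_G = "atcggattagttagggcacaggacggac"
A_T = "atcggattagttagggcacaggacgtac"
C_G = "atcggcttagttagggcacaggacggac"
C_T = "atcggcttagttagggcacaggacgtac"

# Seven individuals, two chromosome copies each, in the order of the slides.
individuals = [
    (A_G, A_T), (C_T, A_G), (C_T, C_G), (A_T, A_T),
    (A_G, C_G), (A_G, C_G), (A_G, A_G),
]

def snp_sites(chromosomes):
    """Return the 0-based positions where the sequences are not all identical."""
    return [i for i, column in enumerate(zip(*chromosomes)) if len(set(column)) > 1]

def haplotype(chromosome, sites):
    """Keep only the bases found at the SNP sites."""
    return "".join(chromosome[i] for i in sites)

all_copies = [copy for pair in individuals for copy in pair]
sites = snp_sites(all_copies)
haplotype_pairs = [(haplotype(a, sites), haplotype(b, sites)) for a, b in individuals]

print(sites)            # positions of the two SNPs
print(haplotype_pairs)  # one pair of haplotypes per individual
```

Only two positions vary across the fourteen chromosome copies, so each haplotype is a two-letter string.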

Page 21: Optimization Problems for  Polymorphisms of Single Nucleotides

ag at

ct ag

ct cg

at at

ag cg

ag cg

ag ag

HOMOZYGOUS: same allele on both chromosomes

HETEROZYGOUS: different alleles

HAPLOTYPE: chromosome content at SNP sites

GENOTYPE: “union” of 2 haplotypes

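OaE
EE
OcE
OaOt
EOg
EOg
OaOg

(one genotype per individual, in the order of the haplotype pairs above; Ox = homozygous with allele x, E = heterozygous)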

Page 22: Optimization Problems for  Polymorphisms of Single Nucleotides

ag at

ct ag

ct cg

at at

ag cg

ag cg

ag ag

OaE
EE
OcE
OaOt
EOg
EOg
OaOg

CHANGE OF SYMBOLS: each SNP has only two values in a population (bio). Call them 1 and O. Also, call * the fact that a site is heterozygous

HAPLOTYPE: string over 1,O

GENOTYPE: string over 1,O,*

Page 23: Optimization Problems for  Polymorphisms of Single Nucleotides

1o 11

o1 1o

o1 oo

11 11

1o oo

1o oo

1o 1o

1*
**
o*
11
*o
*o
1o

CHANGE OF SYMBOLS: each SNP has only two values in a population (bio). Call them 1 and O. Also, call * the fact that a site is heterozygous

HAPLOTYPE: string over 1,O

GENOTYPE: string over 1,O,*
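As a small illustration of the "union" of two haplotypes, the sketch below derives a genotype string from a pair of haplotype strings, writing * at every heterozygous site. The function name `genotype` is a hypothetical helper, and the lowercase `o` follows the symbols used on this slide.

```python
def genotype(h1, h2):
    """Combine two haplotypes: keep the symbol where they agree, '*' where they differ."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNP sites")
    return "".join(a if a == b else "*" for a, b in zip(h1, h2))

# Haplotype pairs of the seven individuals, as recoded on the slide.
pairs = [
    ("1o", "11"), ("o1", "1o"), ("o1", "oo"), ("11", "11"),
    ("1o", "oo"), ("1o", "oo"), ("1o", "1o"),
]
genotypes = [genotype(a, b) for a, b in pairs]
print(genotypes)
```

Going the other way is the hard part: a genotype with k heterozygous sites is compatible with 2^(k-1) distinct pairs of haplotypes, which is why the population problem needs an objective such as parsimony.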

Page 24: Optimization Problems for  Polymorphisms of Single Nucleotides

THE HAPLOTYPING PROBLEM

Single Individual: Given genomic data of one individual, determine 2 haplotypes (one per chromosome)

Population: Given genomic data of k individuals, determine (at most) 2k haplotypes (one per chromosome/individual)

For the individual problem, the input is erroneous haplotype data, from sequencing

For the population problem, the input is ambiguous genotype data, from screening

The objective is led by Occam’s razor: find the minimum explanation of the observed data under a given hypothesis (a.k.a. parsimony principle)

Page 25: Optimization Problems for  Polymorphisms of Single Nucleotides

Theory and Results

Single individual

- Polynomial algorithms for gapless haplotyping (L, Bafna, Istrail, Lippert, Schwartz 01 & Bafna, L, Istrail, Rizzi 02)

- Polynomial algorithms for bounded-length gapped haplotyping (BLIR 02)

- NP-hardness for general gapped haplotyping (LBILS 01)

Population

- APX-hardness (Gusfield 00)

- Reduction to graph-theoretic model and I.P. approach (Gusfield 01)

- New formulations and disease detection (L, Ravi, Rizzi 02)

- Exact algorithms for min-size solution (L, Serafini 2011)

- Heuristics (Tininini, L, Bertolazzi 2010)

Page 26: Optimization Problems for  Polymorphisms of Single Nucleotides

The Single-Individual Haplotyping problem

Page 27: Optimization Problems for  Polymorphisms of Single Nucleotides

TGAGCCTAG GATTT GCCTAG CTATCTTATAGATA GAGATTTCTAGAAATC ACTGATAGAGATTTC TCCTAAAGAT CGCATAGATA

fragmentation

sequencing

assembly

Shotgun Assembly of a Chromosome [Weber and Myers, 1997]

ACTGCAGCCTAGAGATTCTCAGATATTTCTAGGCGTATCTATCTT
ACTGCAGCCTAGAGATTCTCAGATATTTCTAGGCGTATCTATCTT
ACTGCAGCCTAGAGATTCTCAGATATTTCTAGGCGTATCTATCTT

ACTGCAGCCTAGAGATTCTCAGATATTTCTAGGCGTATCTATCTT

Page 28: Optimization Problems for  Polymorphisms of Single Nucleotides

MAIN ERROR SOURCES

- Sequencing errors:

ACTGCCTGGCCAATGGAACGGACAAG CTGGCCAAT CATTGGAAC AATGGAACGGA

- Contaminants

Page 29: Optimization Problems for  Polymorphisms of Single Nucleotides

Given errors, the data may be inconsistent with exactly 2 haplotypes

Hence, the assembler is unable to build 2 chromosomes

PROBLEM: Find and remove the errors so that the data becomes consistent with exactly 2 haplotypes

Page 30: Optimization Problems for  Polymorphisms of Single Nucleotides

The data: a SNP matrix

[Figure: aligned fragments, with the alleles at the SNP sites encoded as 1 or O]

ACTGAAAGCGA ACTAGAGACAGCATGACTGATAGC GTAGAGTCAACTG TCGACTAGA CATGACTGA CGATCCATCG TCAGCACTGAAA ATCGATC AGCATG

1 1 O O O 1 1 1 1 1 O

Page 33: Optimization Problems for  Polymorphisms of Single Nucleotides

SNPs 1,..,n

Fragments 1,..,m

      1 2 3 4 5 6 7 8 9
  1   - - - O 1 1 O O -
  2   - O - O 1 - - - 1
  3   1 1 O 1 1 - - - -
  4   O O 1 - - - - O -
  5   - - - - - - - 1 O
  6   - - - - O O O 1 -

Fragment conflict: two fragments that can’t be on the same haplotype

[Figure: Fragment Conflict Graph GF(M) on nodes 1-6]

We have 2 haplotypes iff GF is BIPARTITE
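The fragment conflict graph of the example matrix can be built and tested for bipartiteness with a short sketch. Rows are written as strings over `1`, `O` and `-`; the function names `conflict`, `conflict_graph` and `two_coloring` are hypothetical helpers, and the conflict rule assumed here is that two fragments disagree on at least one SNP that both of them cover.

```python
from collections import deque
from itertools import combinations

# The 6 x 9 SNP matrix from the slide, one string per fragment.
M = [
    "---O11OO-",
    "-O-O1---1",
    "11O11----",
    "OO1----O-",
    "-------1O",
    "----OOO1-",
]

def conflict(r, s):
    """Two fragments conflict if they differ at a SNP covered by both."""
    return any(a != "-" and b != "-" and a != b for a, b in zip(r, s))

def conflict_graph(rows):
    """Edges (u, v), 1-based with u < v, between conflicting fragments."""
    return {
        (i + 1, j + 1)
        for (i, r), (j, s) in combinations(enumerate(rows), 2)
        if conflict(r, s)
    }

def two_coloring(n, edges):
    """Return a 2-coloring of nodes 1..n, or None if the graph has an odd cycle."""
    adj = {v: [] for v in range(1, n + 1)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    color = {}
    for start in adj:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return None
    return color

edges = conflict_graph(M)
print(sorted(edges))
print(two_coloring(len(M), edges))
```

On this matrix fragments 1, 3 and 6 conflict pairwise, so the graph contains a triangle and no 2-coloring exists: the data is not consistent with exactly 2 haplotypes.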

Page 35: Optimization Problems for  Polymorphisms of Single Nucleotides

SNPs 1,..,n

Fragments 1,..,m

      1 2 3 4 5 6 7 8 9
  1   - - - O 1 1 O O -
  2   - O - O 1 - - - 1
  3   1 1 O 1 1 - - - -
  4   O O 1 - - - - O -
  5   - - - - - - - 1 O
  6   - - - - O O O 1 -

PROBLEM (Fragment Removal): make GF Bipartite

[Figure: fragment conflict graph GF(M) on nodes 1-6]

With fragment 6 removed, the remaining fragments split into two groups, each merging into one haplotype:

      1 2 3 4 5 6 7 8 9
  1   - - - O 1 1 O O -
  2   - O - O 1 - - - 1
  4   O O 1 - - - - O -
      O O 1 O 1 1 O O 1

  3   1 1 O 1 1 - - - -
  5   - - - - - - - 1 O
      1 1 O 1 1 - - 1 O
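For a matrix this small, minimum fragment removal can be checked by brute force. The sketch below is not the method of the talk, only an exhaustive search over subsets of fragments, exponential in general; the names `split_into_two`, `minimum_removals`, `merge` and `haplotypes_after_removal` are hypothetical helpers.

```python
from itertools import combinations

# Fragment id -> row of the SNP matrix from the slide.
M = {
    1: "---O11OO-",
    2: "-O-O1---1",
    3: "11O11----",
    4: "OO1----O-",
    5: "-------1O",
    6: "----OOO1-",
}

def conflict(r, s):
    """Two fragments conflict if they differ at a SNP covered by both."""
    return any(a != "-" and b != "-" and a != b for a, b in zip(r, s))

def split_into_two(rows):
    """2-color the conflict graph; return the two groups, or None if impossible."""
    color = {}
    for start in sorted(rows):
        if start in color:
            continue
        color[start] = 0
        stack = [start]
        while stack:
            u = stack.pop()
            for v in rows:
                if v != u and conflict(rows[u], rows[v]):
                    if v not in color:
                        color[v] = 1 - color[u]
                        stack.append(v)
                    elif color[v] == color[u]:
                        return None
    ids = sorted(rows)
    return [i for i in ids if color[i] == 0], [i for i in ids if color[i] == 1]

def minimum_removals(rows):
    """All smallest sets of fragments whose removal makes the graph bipartite."""
    ids = sorted(rows)
    for size in range(len(ids) + 1):
        found = [
            set(c) for c in combinations(ids, size)
            if split_into_two({i: rows[i] for i in ids if i not in c}) is not None
        ]
        if found:
            return found

def merge(fragments):
    """Merge mutually compatible fragments into one haplotype string."""
    merged = []
    for column in zip(*fragments):
        symbols = {c for c in column if c != "-"}
        merged.append(symbols.pop() if symbols else "-")
    return "".join(merged)

def haplotypes_after_removal(rows, removed):
    kept = {i: r for i, r in rows.items() if i not in removed}
    groups = split_into_two(kept)
    return tuple(merge([kept[i] for i in group]) for group in groups)

print(minimum_removals(M))
print(haplotypes_after_removal(M, {6}))
```

On the example, one removal is enough and there are two ways to do it (fragment 3 or fragment 6); removing fragment 6 reproduces the two haplotypes shown on the slide.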

Page 36: Optimization Problems for  Polymorphisms of Single Nucleotides

Removing fewest fragments is equivalent to maximum induced bipartite subgraph

NP-complete [Yannakakis, 1978a, 1978b; Lewis, 1978]

O(|V|(log log |V|/log |V|)^2)-approximable [Halldórsson, 1999]

not O(|V|^ε)-approximable for some ε [Lund and Yannakakis, 1993]

Are there cases of M for which GF(M) is easier?

YES: the gapless M

---O11OO---O1OO1--- gap

---O11OO1O1O1OO1--- gapless

---O11--1O----O1--- 2 gaps
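A gap is a maximal run of holes strictly inside a fragment; the holes before its first SNP and after its last SNP do not count. A minimal sketch for counting gaps in a row written as a string (the helper name `count_gaps` is hypothetical):

```python
import re

def count_gaps(fragment):
    """Count maximal runs of '-' between the first and last covered SNP."""
    core = fragment.strip("-")  # drop leading and trailing holes
    return len(re.findall(r"-+", core))

examples = [
    "---O11OO---O1OO1---",
    "---O11OO1O1O1OO1---",
    "---O11--1O----O1---",
]
print([count_gaps(row) for row in examples])
```

A matrix is gapless when every row has zero gaps.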

Page 38: Optimization Problems for  Polymorphisms of Single Nucleotides

Why gaps?

Sequencing errors (don’t call with low confidence)

---OO11?11--- ===> ---OO11-11---

Celera’s mate pairs

attcgttgtagtggtagcctaaatgtcggtagaccttga

attcgttgtagtggtagcctaaatgtcggtagaccttga

Page 39: Optimization Problems for  Polymorphisms of Single Nucleotides

THEOREM

For a gapless M, the Min Fragment Removal Problem is Polynomial

NOTE: Does not need to be gapless. Enough if it can be sorted to become such (Consecutive Ones Property, Booth and Lueker, 1976)

Page 42: Optimization Problems for  Polymorphisms of Single Nucleotides

An O(nm + n^3) D.P. algorithm

  1   - O O 1 1 O O - -
  2   - - 1 O 1 1 O - -
  3   - - - 1 1 O - - -
  4   - - - - O O 1 O -
  5   - - - - - 1 O 1 O

LFT(i), RGT(i): leftmost and rightmost SNP covered by fragment i

sort according to LFT

D(i; h,k) := min cost to solve up to row i, with h, k not removed and put in different haplotypes, and maximizing RGT(k), RGT(h)

D(i; h,k) = D(i-1; h,k)       if i, k compatible and RGT(i) <= RGT(k), or i, h compatible and RGT(i) <= RGT(h)

D(i; h,k) = 1 + D(i-1; h,k)   otherwise

OPT is min over h,k of D(n; h,k) and can be found in time O(nm + n^3)
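The ingredients of the recurrence (LFT, RGT, the sort by LFT and the compatibility test) can be sketched as follows. This covers only the preprocessing, not the dynamic program itself, and the helper names `lft`, `rgt` and `compatible` are hypothetical; positions are 1-based as on the slide.

```python
# The gapless 5 x 9 example matrix from the slide, one string per fragment.
M = [
    "-OO11OO--",
    "--1O11O--",
    "---11O---",
    "----OO1O-",
    "-----1O1O",
]

def lft(row):
    """1-based index of the leftmost SNP covered by the fragment."""
    return min(i for i, c in enumerate(row, start=1) if c != "-")

def rgt(row):
    """1-based index of the rightmost SNP covered by the fragment."""
    return max(i for i, c in enumerate(row, start=1) if c != "-")

def compatible(r, s):
    """Fragments are compatible if they agree on every SNP covered by both."""
    return all(a == b for a, b in zip(r, s) if a != "-" and b != "-")

# Process fragments by increasing LFT, as the D.P. requires.
order = sorted(range(1, len(M) + 1), key=lambda i: lft(M[i - 1]))

print(order)
print([(lft(row), rgt(row)) for row in M])
print(compatible(M[0], M[2]), compatible(M[0], M[1]))
```

In the first case of the recurrence, fragment i is compatible with k (or h) and does not extend further right, so in a gapless matrix it is entirely covered by that fragment and can be kept at no cost.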

Page 45: Optimization Problems for  Polymorphisms of Single Nucleotides

WITH GAPS…..

Th: NP-Hard if 2 gaps per fragment

proof: (simple) use the fact that for every G there is M s.t. G = GF(M), and reduce from Max Bipartite Induced Subgraph on 3-regular graphs (each row has at most 3 non-gap entries, hence at most 2 gaps)

Th: NP-Hard even if 1 gap per fragment

proof: technical; reduction from MAX2SAT

But gaps must be long for the problem to be difficult.

We have an O(2^(2L) mn + 2^(3L) n^3) D.P. for MFR on a matrix with total gap length L

Page 52: Optimization Problems for  Polymorphisms of Single Nucleotides

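What about MFR with gaps? Why not ILP...

min ∑_f x_f

s.t. ∑_{f ∈ C} x_f ≥ 1   for all odd cycles C

     x ∈ {0,1}^n

[Figure: an odd cycle on fragments 1-5 with fractional values 1/2, 1/3, 1/4, 1/2, 0, followed by two further copies of the cycle; label 5/12 5/12]

Randomized rounding heuristic: round and repeat. Worked well at Celera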

Page 53: Optimization Problems for  Polymorphisms of Single Nucleotides

Fragment removal is good for getting rid of contaminants.

However, we may want to keep all fragments and correct errors otherwise

A dual point of view is to disregard some SNPs and keep the largest subset sufficient to reconstruct the haplotypes

All fragments get assigned to one of the two haplotypes. We describe the min SNP removal problem: remove the fewest columns from M so that the fragment graph becomes bipartite.

Page 55: Optimization Problems for  Polymorphisms of Single Nucleotides

SNP conflicts

OK

- - - O 1 1 O O -
- O 1 O 1 - - - 1
1 1 O 1 1 - - - -
O O 1 - - - O O -
- - - - - - 1 1 O
- - - - O O O 1 -

Page 58: Optimization Problems for  Polymorphisms of Single Nucleotides

SNP conflicts

CONFLICT !

- - - O 1 1 O O -
- O 1 O 1 - - - 1
1 1 O 1 1 - - - -
O O 1 - - - O O -
- - - - - - 1 1 O
- - - - O O O 1 -

Page 60: Optimization Problems for  Polymorphisms of Single Nucleotides

SNP conflicts

- - - O 1 1 O O -
- O 1 O 1 - - - 1
1 1 O 1 1 - - - -
O O 1 - - - O O -
- - - - - - 1 1 O
- - - - O O O 1 -

SNP conflict graph GS(M)
- 1 node for each SNP (column)
- edge between conflicting SNPs
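The slides show SNP conflicts only pictorially, so the sketch below assumes the definition usually given in the literature: two SNPs are in conflict if some pair of fragments covers both of them and agrees on one while disagreeing on the other. The helper names `snps_conflict` and `snp_conflict_graph` are hypothetical.

```python
from itertools import combinations

# The 6 x 9 matrix from the slide, one string per fragment.
M = [
    "---O11OO-",
    "-O1O1---1",
    "11O11----",
    "OO1---OO-",
    "------11O",
    "----OOO1-",
]

def snps_conflict(rows, i, j):
    """True if two rows covering columns i and j agree on one and differ on the other."""
    pairs = [(r[i], r[j]) for r in rows if r[i] != "-" and r[j] != "-"]
    return any((a == c) != (b == d) for (a, b), (c, d) in combinations(pairs, 2))

def snp_conflict_graph(rows):
    """Edges (i, j), 1-based with i < j, between conflicting SNP columns."""
    n = len(rows[0])
    return {
        (i + 1, j + 1)
        for i, j in combinations(range(n), 2)
        if snps_conflict(rows, i, j)
    }

edges = snp_conflict_graph(M)
print(sorted(edges))
```

For instance, SNPs 4 and 5 conflict because fragments 1 and 3 read O 1 and 1 1 there, while SNPs 1 and 2 do not, since the only fragments covering both read 1 1 and O O.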

Page 62: Optimization Problems for  Polymorphisms of Single Nucleotides

SNP conflicts

[Figure: SNP conflict graph GS(M) on nodes 1-9]

1 2 3 4 5 6 7 8 9
- - - O 1 1 O O -
- O 1 O 1 - - - 1
1 1 O 1 1 - - - -
O O 1 - - - O O -
- - - - - - 1 1 O
- - - - O O O 1 -

Page 64: Optimization Problems for  Polymorphisms of Single Nucleotides

THEOREM 1

For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set

THEOREM 2

For a gapless M, GS(M) is a perfect graph

COROLLARY

For a gapless M, the min SNP removal problem is polynomial

Page 65: Optimization Problems for  Polymorphisms of Single Nucleotides

THEOREM 1

For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set

PROOF (sketch): by minimal counterexample

--OO11OO---------
----OO1OO1O11O---
--------11O1O111-
----11OO1O11O----
-------1OOO1-----
------11111O-----
--11O11O1OO------

Assume M gapless, GS(M) an independent set, but GF(M) not bipartite.

Take an odd cycle in GF

Page 66: Optimization Problems for  Polymorphisms of Single Nucleotides

THEOREM 1

For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set

PROOF (sketch): by minimal counterexample

--O?1???---------
----O????????O---
--------??O??1??-
----??????1??----
-------???O?-----
------????1?-----
--1???????O------

There is a generic structure of hor-vert cycle

Page 67: Optimization Problems for  Polymorphisms of Single Nucleotides

THEOREM 1

For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set

PROOF (sketch): by minimal counterexample

--O?1???---------
----O????????O---
--------??O??1??-
----??????1??----
-------???O?-----
------????1?-----
--1???????O------

“vertical lines”

There cannot be only one vertical line in an odd cycle

We merge the rightmost and the next one to reduce their number by 1

Hence, there cannot be a minimal (in number of vertical lines) counterexample

Page 68: Optimization Problems for  Polymorphisms of Single Nucleotides

THEOREM 1

For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set

PROOF (sketch): by minimal counterexample

--O?1???---------
----O????????O---
--------??O??1??-
----??????1??----
-------???O?-----
------????1?-----
--1???????O------

“vertical lines”

Must be 1

Page 69: Optimization Problems for  Polymorphisms of Single Nucleotides

THEOREM 1

For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set

PROOF (sketch): by minimal counterexample

--O?1???---------
----O?????1??O---
--------??O??1??-
----??????1??----
-------???O?-----
------????1?-----
--1???????O------

“vertical lines”

Must be 1

Merge the rightmost lines

Page 70: Optimization Problems for  Polymorphisms of Single Nucleotides

THEOREM 1

For a gapless M, GF(M) is bipartite if and only if GS(M) is an independent set

PROOF (sketch): by minimal counterexample

--O?1???---------
----O?????1------
--------??O------
----??????1------
-------???O------
------????1------
--1???????O------

“vertical lines”

Merge the rightmost lines

Still a counterexample!

Page 71: Optimization Problems for  Polymorphisms of Single Nucleotides

Note: Theorem not true if there are gaps

M

      1 2 3
  1   O - O
  2   - O 1
  3   1 1 -

[Figure: GF(M) and GS(M), each drawn on nodes 1, 2, 3]
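The counterexample can be checked mechanically. The sketch below assumes the usual definitions (fragments conflict if they differ on a shared SNP; SNPs conflict if two fragments covering both agree on one and differ on the other); all function names are hypothetical helpers.

```python
from itertools import combinations

# The 3 x 3 matrix with gaps from the slide.
M = ["O-O", "-O1", "11-"]

def fragments_conflict(r, s):
    """Two fragments conflict if they differ at a SNP covered by both."""
    return any(a != "-" and b != "-" and a != b for a, b in zip(r, s))

def snps_conflict(rows, i, j):
    """Two SNPs conflict if two rows covering both agree on one and differ on the other."""
    pairs = [(r[i], r[j]) for r in rows if r[i] != "-" and r[j] != "-"]
    return any((a == c) != (b == d) for (a, b), (c, d) in combinations(pairs, 2))

def is_bipartite(nodes, edges):
    """2-color the graph; False as soon as an edge joins equal colors."""
    color = {}
    for start in nodes:
        if start in color:
            continue
        color[start] = 0
        stack = [start]
        while stack:
            u = stack.pop()
            for a, b in edges:
                if u not in (a, b):
                    continue
                v = b if a == u else a
                if v not in color:
                    color[v] = 1 - color[u]
                    stack.append(v)
                elif color[v] == color[u]:
                    return False
    return True

fragment_edges = {
    (i + 1, j + 1)
    for i, j in combinations(range(len(M)), 2)
    if fragments_conflict(M[i], M[j])
}
snp_edges = {
    (i + 1, j + 1)
    for i, j in combinations(range(len(M[0])), 2)
    if snps_conflict(M, i, j)
}

print(sorted(fragment_edges), is_bipartite([1, 2, 3], fragment_edges))
print(sorted(snp_edges))
```

No two columns are covered together by more than one fragment, so GS(M) has no edges, yet the three fragments conflict pairwise and GF(M) is a triangle.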

Page 72: Optimization Problems for  Polymorphisms of Single Nucleotides

THEOREM 2

For a gapless M, GS(M) is a perfect graph

PROOF: GS(M) is the complement of a comparability graph A

Comparability graphs are perfect

Comparability graphs: undirected graphs whose edges can be oriented to form a partial order

Page 73: Optimization Problems for  Polymorphisms of Single Nucleotides

LEMMA: If i<j<k and (i,k) is a SNP conflict then either (i,j) or (j,k) is also a SNP conflict

[Figure: two fragments covering SNPs i < j < k, with the entries of column j left open (?)]

- 1 O O ? 1 O 1 -
- O 1 O ? 1 1 1 -

Entries of column j equal (e.g. O O): j conflicts with i

Entries of column j different (e.g. O 1): j conflicts with k

I.e. if (i,j) is not a conflict and (j,k) is not a conflict, then (i,k) is not a conflict either

So the graph A with an edge (u,v) whenever u < v and u is not in conflict with v is a comparability graph, and GS is the complement of A

NOTE: max independent set on perfect graphs is in P (Grötschel, Lovász, Schrijver, 84)

Page 74: Optimization Problems for  Polymorphisms of Single Nucleotides

THEOREM: The min SNP removal is NP-hard if there can be gaps (Reduction from MAXCUT)

Again, gaps must be long for problem to be difficult.

We have an O(mn^(2L+1) + n^(2L+2)) D.P. for MSR on a matrix with total gap length L

Hence gapless MSR is polynomial (max stable set on perfect graph).

There are better D.P. algorithms, running in O(mn + m^2)

What if there are gaps?