Sorting by Cuts, Joins and Whole Chromosome Duplications Ron Zeira and Ron Shamir Combinatorial Pattern Matching 2015 30.6.15




Genome rearrangements


Motivation I: evolution

Human genome project


Motivation II: cancer

NCI, 2001

Normal karyotype

MCF-7 breast cancer cell line


Definitions: gene

• A gene is an oriented segment.

• A gene has two extremities: head and tail.

• Positive: tail→head; Negative: head→tail.


Definitions: chromosome

• A chromosome is a series of consecutive genes.

• Two consecutive extremities form an adjacency.

• A telomere is an extremity that is not part of an adjacency.

• A circular chromosome has no telomeres; a linear chromosome has two telomeres.


Definitions: genome

• A genome is a set of chromosomes.

• Equivalently, a genome is a set of adjacencies.

• An ordinary genome has exactly one copy of each gene; otherwise the genome is duplicated.

$\Pi = \{\{a_h, b_h\}, \{b_t, c_h\}, \{d_h, f_h\}, \{f_t, e_t\}\}$
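The set-of-adjacencies view can be sketched in a few lines of Python. This is only an illustration: the type names and the helpers `adjacency` and `telomeres` are hypothetical, and the example genome assumes the slide's formula reads {a_h, b_h}, {b_t, c_h}, {d_h, f_h}, {f_t, e_t}.

```python
from typing import FrozenSet, List, Set, Tuple

Extremity = Tuple[str, str]        # (gene name, "h" for head or "t" for tail)
Adjacency = FrozenSet[Extremity]   # unordered pair of extremities
Genome = Set[Adjacency]            # a genome as a set of adjacencies


def adjacency(x: str, y: str) -> Adjacency:
    """Build an adjacency from shorthand such as 'ah' or 'bt' (hypothetical helper)."""
    return frozenset({(x[:-1], x[-1]), (y[:-1], y[-1])})


def telomeres(genome: Genome, genes: List[str]) -> Set[Extremity]:
    """Return the extremities of the given genes that are in no adjacency."""
    used = {ext for adj in genome for ext in adj}
    return {(g, end) for g in genes for end in "ht"} - used


# Assumed reading of the slide's example genome.
PI: Genome = {
    adjacency("ah", "bh"),
    adjacency("bt", "ch"),
    adjacency("dh", "fh"),
    adjacency("ft", "et"),
}

print(sorted(telomeres(PI, list("abcdef"))))
# [('a', 't'), ('c', 't'), ('d', 't'), ('e', 'h')]
```

Four telomeres means two linear chromosomes, which matches the rule that each linear chromosome has exactly two telomeres.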


GR distance problem

• Distance d_op(Π, Σ): the minimal number of operations needed to transform genome Π into genome Σ.

• Operations:
– Reversals
– Translocations
– Transpositions
– Others…


The SCJ model

• SCJ – Single Cut or Join (Feijão, Meidanis 11):
– Cut an adjacency into 2 telomeres.
– Join 2 telomeres into an adjacency.

• Simple and practical model.

• Reflects evolutionary distance (Biller et al. 13).

[Figure: a cut and a join operation]
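As a rough sketch, cut and join act on a genome stored as a set of adjacencies as follows. The function names are illustrative, not from the talk. The closed form in `scj_distance` is the one given by Feijão and Meidanis for ordinary genomes over the same gene set; it does not appear on the slide.

```python
from typing import FrozenSet, Set, Tuple

Extremity = Tuple[str, str]        # (gene name, "h" or "t")
Adjacency = FrozenSet[Extremity]
Genome = Set[Adjacency]


def cut(genome: Genome, adj: Adjacency) -> Genome:
    """Cut: remove one adjacency; its two extremities become telomeres."""
    if adj not in genome:
        raise ValueError("cannot cut an adjacency that is not in the genome")
    return genome - {adj}


def join(genome: Genome, x: Extremity, y: Extremity) -> Genome:
    """Join: glue two distinct telomeres into a new adjacency."""
    used = {ext for a in genome for ext in a}
    if x == y or x in used or y in used:
        raise ValueError("join needs two distinct telomeres")
    return genome | {frozenset({x, y})}


def scj_distance(source: Genome, target: Genome) -> int:
    """One cut per adjacency only in source, one join per adjacency only in target."""
    return len(source - target) + len(target - source)


# Chromosome (a b c) versus chromosome (a c b), all genes positive.
A: Genome = {frozenset({("a", "h"), ("b", "t")}), frozenset({("b", "h"), ("c", "t")})}
B: Genome = {frozenset({("a", "h"), ("c", "t")}), frozenset({("c", "h"), ("b", "t")})}

print(scj_distance(A, B))  # 4
```

Here two cuts followed by two joins turn A into B, matching the distance of 4.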


Models with multiple gene copies

• Most models with multiple gene copies are NP-hard.

• Not many models allow duplications or deletions.

• Many normal and cancer genomes have multiple gene copies.


The SCJD model

• A duplication takes a linear chromosome and produces an additional copy of it.

• An SCJD operation is a cut, a join, or a duplication.

abc → abc, abc
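A minimal sketch of the duplication operation, assuming a genome is stored as a list of chromosomes rather than as adjacencies. The `Chromosome` type and the `duplicate` function are hypothetical names chosen for this illustration.

```python
from typing import List, NamedTuple, Tuple


class Chromosome(NamedTuple):
    genes: Tuple[str, ...]   # genes in order, e.g. ("a", "b", "c")
    circular: bool = False   # linear unless stated otherwise


def duplicate(genome: List[Chromosome], index: int) -> List[Chromosome]:
    """Return a new genome with an extra copy of the linear chromosome at `index`."""
    chrom = genome[index]
    if chrom.circular:
        # The model defines duplication on linear chromosomes only.
        raise ValueError("only linear chromosomes can be duplicated")
    return genome[: index + 1] + [chrom] + genome[index + 1 :]


genome = [Chromosome(("a", "b", "c"))]
print(duplicate(genome, 0))
```

Applied to the single chromosome abc, this yields two copies of abc, as in the slide's example.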


The SCJD distance

• The minimal number of SCJD operations that transform an ordinary genome into a duplicated genome.


Results outline

• Characterize optimal solution structure.

• Give a distance optimization function.

• Solve the optimization problem.

• Study the number of duplications in an optimal scenario.


SCJD optimal scenario structure

• Theorem: There exists an optimal SCJD sorting scenario consisting, in this order, of:
– SCJ operations on single-copy genes.
– Duplications.
– SCJ operations acting on duplicated genes.

$\Gamma \xrightarrow{\text{SCJs}} \Gamma' \xrightarrow{\text{duplications}} 2\Gamma' \xrightarrow{\text{SCJs}} \Delta$


Proof outline

• An SCJ operation acts on extremities of either two duplicated genes or two unduplicated genes.

• Moving SCJ operations on unduplicated genes earlier in the scenario keeps it a valid sorting scenario.

• Then move duplications earlier for as long as the scenario stays valid.


Corollary: SCJD distance

• Write the distance as a function of Γ’.

• Find Γ’ that minimizes the distance.

η – higher score for adjacencies in Γ and Δ


Distance optimization solution

• The following genome maximizes H:

$\Gamma' = \{\alpha \mid \eta(\alpha) > 0\}$

• If Γ’ is not linear, remove an adjacency with η=1 from each circular chromosome in Γ’ to obtain Γ’’.

• Theorem: SCJD distance is computable in linear time.


Controlling the number of duplications

• Duplications are more “radical” events than cuts or joins.

• Lemma: Our algorithm gives an optimal sorting scenario with a maximum number of duplications.


Optimal solutions can have different numbers of duplications


Minimizing duplications is hard

• Theorem: Finding an optimal SCJD sorting scenario with a minimum number of duplications is NP-hard.

• Reduction from the Hamiltonian path problem on directed graphs in which every vertex has in-degree and out-degree 2.


Proof outline

• For a 2-digraph G and two vertices x, y, there is an Eulerian path P: x → y.

• Create a duplicated genome Σ from P and an empty genome Π.

• Add auxiliary genes and k copies of Σ, Π.

• There is a Hamiltonian path x → y in G iff there is an optimal sorting scenario with k duplications.


Summary

• Genome rearrangements are important.

• Problems with multiple gene copies are hard.

• SCJD – allows SCJ and duplications:
– Linear algorithm for the SCJD distance.
– Study the number of duplications in an optimal solution.

• We hope to generalize the model and apply it to cancer data.


Thank You!