

Sorting by Cuts, Joins and Whole Chromosome Duplications

Ron Zeira and Ron Shamir

Combinatorial Pattern Matching 2015

30.6.15

Genome rearrangements

Motivation I: evolution

Human genome project

Motivation II: cancer

NCI, 2001

Normal karyotype; MCF-7 breast cancer cell line

Definitions: gene

• A gene is an oriented segment.

• A gene has two extremities: head and tail.

• Positive: tail→head; negative: head→tail.

Definitions: chromosome

• A chromosome is a series of consecutive genes.

• Two consecutive extremities form an adjacency.

• A telomere is an extremity that is not part of an adjacency.

• A circular chromosome has no telomeres. A linear chromosome has 2 telomeres.

Definitions: genome

• A genome is a set of chromosomes.

• Equivalently, a genome is a set of adjacencies.

• An ordinary genome has one copy of each gene. Otherwise the genome is duplicated.

$\Pi = \{\{a^h, b^h\}, \{b^t, c^h\}, \{d^h, f^h\}, \{f^t, e^t\}\}$
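The adjacency-set view of a genome can be sketched in a few lines of Python. This is only an illustration: the string encoding of extremities (for example "ah" for the head of gene a), the helper names `adjacency` and `telomeres`, and the reading of the example genome Π are assumptions made here, not notation from the slides.

```python
from itertools import chain

def adjacency(x, y):
    """An adjacency is an unordered pair of extremities, e.g. 'ah' and 'bt'."""
    return frozenset((x, y))

def telomeres(genome, genes):
    """Extremities of the given genes that appear in no adjacency."""
    all_extremities = {g + end for g in genes for end in "ht"}
    used = set(chain.from_iterable(genome))
    return all_extremities - used

# Assumed reading of the example genome: 'ah' is the head of gene a, 'bt' the tail of b.
PI = {
    adjacency("ah", "bh"),
    adjacency("bt", "ch"),
    adjacency("dh", "fh"),
    adjacency("ft", "et"),
}

# Four telomeres remain, so this genome consists of two linear chromosomes.
print(sorted(telomeres(PI, "abcdef")))  # ['at', 'ct', 'dt', 'eh']
```

Storing adjacencies as `frozenset` pairs makes them unordered and hashable, so a whole genome can be an ordinary Python `set`.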

GR distance problem

• Distance d_op(Π, Σ): the minimum number of operations needed to transform genome Π into genome Σ.

• Operations:
– Reversals
– Translocations
– Transpositions
– Others…

The SCJ model

• SCJ – Single Cut or Join (Feijão, Meidanis 11):
– Cut an adjacency into 2 telomeres.
– Join 2 telomeres into an adjacency.

• Simple and practical model.

• Reflects evolutionary distance (Biller et al. 13).

[Figure: cut and join operations]
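A minimal sketch of the two SCJ operations on a genome stored as a set of adjacencies. The function names and the string encoding of extremities ("ah" for the head of gene a) are assumptions for illustration. The distance function uses the standard SCJ result for two single-copy genomes over the same genes: every adjacency present in only one of the genomes costs one operation.

```python
def cut(genome, adj):
    """Cut: remove one adjacency, turning its two extremities into telomeres."""
    if adj not in genome:
        raise ValueError("adjacency not present in genome")
    return genome - {adj}

def join(genome, x, y):
    """Join: merge two distinct telomeres x and y into a new adjacency."""
    used = set().union(*genome)  # extremities already inside some adjacency
    if x == y or x in used or y in used:
        raise ValueError("both extremities must be distinct telomeres")
    return genome | {frozenset((x, y))}

def scj_distance(pi, sigma):
    """SCJ distance between two single-copy genomes over the same genes."""
    return len(pi - sigma) + len(sigma - pi)

# Hypothetical example: chromosome (a, b) becomes (a, -b).
pi = {frozenset(("ah", "bt"))}
sigma = {frozenset(("ah", "bh"))}

step1 = cut(pi, frozenset(("ah", "bt")))  # one cut
step2 = join(step1, "ah", "bh")           # one join
print(step2 == sigma, scj_distance(pi, sigma))  # True 2
```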

Models with multiple gene copies

• Most models with multiple gene copies are NP-hard.

• Not many models allow duplications or deletions.

• Many normal and cancer genomes have multiple gene copies.

The SCJD model

• A duplication takes a linear chromosome and produces an additional copy of it.

• An SCJD operation is a cut, a join, or a duplication.

abc → abc, abc
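The duplication operation can be sketched as follows. The representation is an assumption chosen for readability: a genome is a list of chromosomes, and each chromosome is a pair of a gene list and a flag saying whether it is circular. The function name `duplicate` is likewise hypothetical.

```python
def duplicate(genome, index):
    """Return a new genome with an extra copy of the linear chromosome at `index`."""
    chromosome, is_circular = genome[index]
    if is_circular:
        raise ValueError("only linear chromosomes can be duplicated")
    # Copy the gene list so the two chromosomes can later be edited independently.
    return genome + [(list(chromosome), False)]

genome = [(["a", "b", "c"], False)]
print(duplicate(genome, 0))
# [(['a', 'b', 'c'], False), (['a', 'b', 'c'], False)]
```

A list is used instead of a set because a duplicated genome holds several identical chromosomes, which a set would collapse into one.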

The SCJD distance

• The minimal number of SCJD operations that transform an ordinary genome into a duplicated genome.

Results outline

• Characterize optimal solution structure.

• Give a distance optimization function.

• Solve the optimization problem.

• Study the number of duplications in an optimal scenario.

SCJD optimal scenario structure

• Theorem: There exists an optimal SCJD sorting scenario, consisting, in this order, of
– SCJ operations on single-copy genes.
– Duplications.
– SCJ operations acting on duplicated genes.

$\Gamma \xrightarrow{\text{SCJs}} \Gamma' \xrightarrow{\text{duplications}} 2\Gamma' \xrightarrow{\text{SCJs}} \Delta$
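One way to read the three-phase structure as a formula is to add up the cost of each phase and minimize over the intermediate genome. The notation is assumed here rather than taken from the slides: d_SCJ counts the SCJ operations in a phase, and N(Γ') counts the chromosomes of Γ', on the assumption that each is linear and duplicated exactly once to reach 2Γ'.

```latex
d_{\mathrm{SCJD}}(\Gamma, \Delta)
  = \min_{\Gamma'} \Big[\, d_{\mathrm{SCJ}}(\Gamma, \Gamma')
  + N(\Gamma')
  + d_{\mathrm{SCJ}}(2\Gamma', \Delta) \,\Big]
```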

Proof outline

• An SCJ operation acts on extremities of either 2 duplicated genes or 2 unduplicated genes.

• Preempting SCJ operations on unduplicated genes (moving them earlier in the scenario) keeps the sorting scenario valid.

• Preempt duplications as long as the scenario remains valid.

Corollary: SCJD distance

• Write the distance as a function of Γ’.

• Find Γ’ that minimizes the distance.

η – higher score for adjacencies in Γ and Δ

Distance optimization solution

• The following genome maximizes H:

$\Gamma' = \{\alpha \mid \eta(\alpha) > 0\}$

• If Γ’ is not linear, remove an adjacency with η=1 from each circular chromosome in Γ’ to obtain Γ’’.

• Theorem: SCJD distance is computable in linear time.

Controlling the number of duplications

• Duplications are more “radical” events than cuts or joins.

• Lemma: Our algorithm gives an optimal sorting scenario with a maximum number of duplications.

Optimal solutions can have different numbers of duplications

Minimizing duplications is hard

• Theorem: Finding an optimal SCJD sorting scenario with a minimum number of duplications is NP-hard.

• Reduction from the Hamiltonian path problem on directed graphs with in- and out-degree 2.

Proof outline

• For a 2-digraph G and two vertices x, y, there is an Eulerian path P: x→y.

• Create a duplicated genome Σ from P and an empty genome Π.

• Add auxiliary genes and k copies of Σ, Π.

• There is a Hamiltonian path x→y in G iff there is an optimal sorting scenario with k duplications.

Summary

• Genome rearrangements are important.

• Problems with multiple gene copies are hard.

• SCJD – allows SCJ and duplications:
– Linear-time algorithm for the SCJD distance.
– Study the number of duplications in an optimal solution.

• We hope to generalize the model and apply it to cancer data.

Thank You!
