Sorting by Cuts, Joins and Whole Chromosome Duplications
Ron Zeira and Ron Shamir
Combinatorial Pattern Matching 2015
30.6.15
Genome rearrangements
Motivation I: evolution
Human genome project
Motivation II: cancer
NCI, 2001
Normal karyotype; MCF-7 breast cancer cell line
Definitions: gene
• A gene is an oriented segment.
• A gene has two extremities: head and tail.
• Positive: tail → head; Negative: head → tail.
Definitions: chromosome
• A chromosome is a series of consecutive genes.
• 2 consecutive extremities form an adjacency.
• A telomere is an extremity that is not part of an adjacency.
• A circular chromosome has no telomeres. A linear chromosome has 2 telomeres.
Definitions: genome
• A genome is a set of chromosomes.
• Equivalently, a genome is a set of adjacencies.
• An ordinary genome has one copy of each gene; otherwise the genome is duplicated.
$\Pi = \{\{a_h, b_h\}, \{b_t, c_h\}, \{d_h, f_h\}, \{f_t, e_t\}\}$
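The definitions above can be turned into a small data structure. The following is a minimal sketch, assuming genes are given as (name, sign) pairs and extremities are written as strings such as `a_h` and `a_t`; the function names are illustrative and not taken from the talk.

```python
def extremities(gene, sign):
    """Return the (left, right) extremities of a gene as read along the chromosome.

    Positive genes read tail -> head, negative genes read head -> tail.
    """
    left, right = ("t", "h") if sign > 0 else ("h", "t")
    return f"{gene}_{left}", f"{gene}_{right}"


def chromosome_adjacencies(chromosome, circular=False):
    """Adjacencies of one chromosome given as a list of (gene, sign) pairs."""
    ends = [extremities(gene, sign) for gene, sign in chromosome]
    adjs = {frozenset((ends[i][1], ends[i + 1][0])) for i in range(len(ends) - 1)}
    if circular and ends:
        # A circular chromosome also joins its last gene back to its first.
        adjs.add(frozenset((ends[-1][1], ends[0][0])))
    return adjs


def genome_adjacencies(chromosomes):
    """A genome as the set of adjacencies of its (linear) chromosomes."""
    return set().union(*(chromosome_adjacencies(c) for c in chromosomes))


# Two hypothetical linear chromosomes: (a, -b, -c) and (d, -f, e).
genome = genome_adjacencies([
    [("a", +1), ("b", -1), ("c", -1)],
    [("d", +1), ("f", -1), ("e", +1)],
])
```

Each adjacency is a `frozenset` of two extremities, so the order in which the two extremities are written does not matter, matching the set notation used on the slide.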
GR distance problem
• Distance d_op(Π, Σ) – minimal number of operations between genomes Π and Σ.
• Operations:
  – Reversals
  – Translocations
  – Transpositions
  – Others…
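As an illustration of one such operation, here is a minimal sketch of a reversal, assuming a chromosome is encoded as a list of signed integers (a common convention, not something the slides specify). The function name and index convention are illustrative.

```python
def reversal(chromosome, i, j):
    """Reverse the segment chromosome[i..j] (inclusive) and flip each gene's sign."""
    segment = [-gene for gene in reversed(chromosome[i:j + 1])]
    return chromosome[:i] + segment + chromosome[j + 1:]


# Reversing genes 2 and 3 inside (1, 2, 3, 4).
after = reversal([1, 2, 3, 4], 1, 2)
```

Applying the same reversal twice restores the original chromosome, which is a quick sanity check for any implementation.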
The SCJ model
• SCJ – Single Cut or Join (Feijão, Meidanis 11):
  – Cut an adjacency to 2 telomeres.
  – Join 2 telomeres to an adjacency.
• Simple and practical model.
• Reflects evolutionary distance (Biller et al. 13).
[Figure: a cut splits an adjacency into 2 telomeres; a join merges 2 telomeres into an adjacency]
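The two SCJ operations are easy to express on the adjacency-set view of a genome. The sketch below assumes a genome is a Python set of `frozenset` adjacencies over extremity strings. For two ordinary genomes on the same genes, the SCJ distance is the size of the symmetric difference of their adjacency sets (the result of Feijão and Meidanis); the function names are illustrative.

```python
def cut(genome, adjacency):
    """Cut one adjacency, turning its two extremities into telomeres."""
    if adjacency not in genome:
        raise ValueError("adjacency is not present in the genome")
    return genome - {adjacency}


def join(genome, x, y):
    """Join two telomeres x and y into a new adjacency."""
    used = set().union(*genome)  # extremities already inside some adjacency
    if x in used or y in used:
        raise ValueError("both extremities must be telomeres")
    return genome | {frozenset((x, y))}


def scj_distance(pi, sigma):
    """SCJ distance between two ordinary genomes given as adjacency sets."""
    return len(pi - sigma) + len(sigma - pi)


pi = {frozenset(("a_h", "b_t"))}
sigma = {frozenset(("a_h", "c_t"))}
steps = scj_distance(pi, sigma)  # one cut plus one join
```

Adjacencies only in the source genome must be cut and adjacencies only in the target must be joined, which is why the distance reduces to counting set differences.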
Models with multiple gene copies
• Most models with multiple gene copies are NP-hard.
• Not many models allow duplications or deletions.
• Many normal and cancer genomes have multiple gene copies.
The SCJD model
• A duplication takes a linear chromosome and produces an additional copy of it.
• An SCJD operation is a cut, a join, or a duplication.
$abc \rightarrow abc,\ abc$
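A whole chromosome duplication can be sketched as follows, assuming a genome is a list of linear chromosomes and each chromosome is a tuple of (gene, sign) pairs. Since a duplicated genome may contain the same adjacency twice, adjacencies are counted with a `Counter`. All names here are illustrative.

```python
from collections import Counter


def duplicate(genome, index):
    """Return a new genome with one extra copy of the chromosome at `index`."""
    return genome + [genome[index]]


def adjacency_multiset(genome):
    """Count adjacencies of linear chromosomes given as tuples of (gene, sign)."""
    counts = Counter()
    for chromosome in genome:
        ends = [
            (f"{g}_t", f"{g}_h") if s > 0 else (f"{g}_h", f"{g}_t")
            for g, s in chromosome
        ]
        for (_, right), (left, _) in zip(ends, ends[1:]):
            counts[frozenset((right, left))] += 1
    return counts


abc = (("a", +1), ("b", +1), ("c", +1))
before = [abc]
after = duplicate(before, 0)  # abc -> abc, abc
```

After the duplication every adjacency of the copied chromosome appears twice in the multiset, while the input list is left unchanged.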
The SCJD distance
• The minimal number of SCJD operations that transform an ordinary genome into a duplicated genome.
Results outline
• Characterize optimal solution structure.
• Give a distance optimization function.
• Solve the optimization problem.
• Study the number of duplications in optimal scenario.
SCJD optimal scenario structure
• Theorem: There exists an optimal SCJD sorting scenario, consisting, in this order, of
  – SCJ operations on single-copy genes.
  – Duplications.
  – SCJ operations acting on duplicated genes.
$\Gamma \xrightarrow{\text{SCJs}} \Gamma' \xrightarrow{\text{duplications}} 2\Gamma' \xrightarrow{\text{SCJs}} \Delta$
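The three-phase structure suggests a simple way to price a scenario once the intermediate genome is fixed. The sketch below writes Γ for the ordinary genome, Γ' for the genome just before the duplications, and Δ for the duplicated target (naming assumed here). It rests on assumptions the slides do not state: each SCJ phase costs the size of the (multi)set symmetric difference, Γ' is linear, and every chromosome of Γ' is duplicated exactly once.

```python
from collections import Counter


def three_phase_cost(gamma, gamma_prime, delta, n_genes):
    """Cost of SCJs, then duplications, then SCJs, for a fixed linear gamma_prime.

    gamma, gamma_prime: sets of adjacencies (ordinary genomes).
    delta: Counter of adjacencies (duplicated genome).
    n_genes: number of distinct genes.
    """
    # Phase 1 (assumption): SCJ distance between ordinary genomes.
    phase1 = len(gamma ^ gamma_prime)
    # Phase 2: a linear genome on n genes with k adjacencies has n - k chromosomes.
    phase2 = n_genes - len(gamma_prime)
    # Phase 3 (assumption): multiset symmetric difference between 2*gamma_prime and delta.
    doubled = Counter({adj: 2 for adj in gamma_prime})
    phase3 = sum(((doubled - delta) + (delta - doubled)).values())
    return phase1 + phase2 + phase3


ab = frozenset(("a_h", "b_t"))
gamma = {ab}                # one chromosome: a b
delta = Counter({ab: 2})    # two copies of the chromosome a b
cost_keep = three_phase_cost(gamma, {ab}, delta, n_genes=2)
cost_empty = three_phase_cost(gamma, set(), delta, n_genes=2)
```

In this toy case keeping the adjacency in Γ' needs a single duplication, whereas cutting it first forces extra duplications and joins, so the choice of Γ' drives the total cost.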
Proof outline
• An SCJ operation acts on extremities of either 2 duplicated genes or 2 unduplicated genes.
• Moving SCJ operations on unduplicated genes earlier in the scenario keeps it a valid sorting scenario.
• Move duplications earlier as long as the scenario stays valid.
Corollary: SCJD distance
• Write the distance as a function of Γ'.
• Find Γ' that minimizes the distance.
η – higher score for adj. in Γ and Δ
Distance optimization solution
• The following genome maximizes H:
$\Gamma' = \{\alpha \mid \eta(\alpha) > 0\}$
• If Γ' is not linear, remove an adjacency with η=1 from each circular chromosome in Γ' to obtain Γ''.
• Theorem: SCJD distance is computable in linear time.
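The selection and linearization steps can be sketched as below. This is a sketch under stated assumptions, not the authors' implementation: the scores η are taken as a given dictionary (their formula is not reproduced here), the selection keeps adjacencies with a positive score, extremities are (gene, end) tuples, and every circular chromosome is assumed to contain an adjacency with η=1, as the slide indicates.

```python
def other_end(extremity):
    """The opposite extremity of the same gene."""
    gene, end = extremity
    return (gene, "t" if end == "h" else "h")


def select_positive(eta):
    """Keep every adjacency whose score is positive (assumed selection rule)."""
    return {adj for adj, score in eta.items() if score > 0}


def linearize(adjs, eta, genes):
    """Remove one eta == 1 adjacency from each circular chromosome."""
    partner = {}
    for adj in adjs:
        x, y = tuple(adj)
        partner[x], partner[y] = y, x
    seen = set()
    # Walk every linear chromosome, starting from a telomere.
    for gene in genes:
        for end in "ht":
            cur = (gene, end)
            if cur in partner or gene in seen:
                continue
            while True:
                seen.add(cur[0])
                far = other_end(cur)
                if far not in partner:
                    break
                cur = partner[far]
    result = set(adjs)
    # Genes not reached from any telomere lie on circular chromosomes.
    for gene in genes:
        if gene in seen:
            continue
        cycle, cur = [], (gene, "t")
        while cur[0] not in seen:
            seen.add(cur[0])
            far = other_end(cur)
            nxt = partner[far]
            cycle.append(frozenset((far, nxt)))
            cur = nxt
        # Raises StopIteration if the cycle has no eta == 1 adjacency.
        result.discard(next(a for a in cycle if eta[a] == 1))
    return result


# Hypothetical scores: genes a, b form a circle; genes c, d form a line.
a1 = frozenset((("a", "h"), ("b", "t")))
a2 = frozenset((("b", "h"), ("a", "t")))
a3 = frozenset((("c", "h"), ("d", "t")))
a4 = frozenset((("d", "h"), ("c", "t")))
eta = {a1: 2, a2: 1, a3: 1, a4: 0}
linear_genome = linearize(select_positive(eta), eta, ["a", "b", "c", "d"])
```

Every extremity is visited a constant number of times, so this sketch runs in time linear in the number of genes, in keeping with the linear-time claim of the theorem.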
Controlling the number of duplications
• Duplications are more “radical” events than cut or join.
• Lemma: Our algorithm gives an optimal sorting scenario with the maximum number of duplications.
Optimal solutions can have different numbers of duplications
Minimizing duplications is hard
• Theorem: Finding an optimal SCJD sorting scenario with a minimum number of duplications is NP-hard.
• Reduction from the Hamiltonian path problem on directed graphs in which every vertex has in-degree and out-degree 2.
Proof outline
• For a 2-digraph G and two vertices x, y, there is an Eulerian path P: x → y.
• Create a duplicated genome Σ from P and an empty genome Π.
• Add auxiliary genes and k copies of Σ, Π.
• There is a Hamiltonian path x → y in G iff there is an optimal sorting scenario with k duplications.
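To make the source problem of the reduction concrete, here is a brute-force sketch of the Hamiltonian path question on a 2-digraph. It does not implement the reduction itself, runs in exponential time, and is only meant for tiny examples. The graph encoding (vertex list plus a list of directed edge pairs) and the function names are assumptions.

```python
from itertools import permutations


def is_2_digraph(vertices, edges):
    """Check that every vertex has in-degree 2 and out-degree 2."""
    out_deg = {v: 0 for v in vertices}
    in_deg = {v: 0 for v in vertices}
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
    return all(out_deg[v] == 2 and in_deg[v] == 2 for v in vertices)


def has_hamiltonian_path(vertices, edges, x, y):
    """Brute force: is there a path from x to y visiting every vertex once?

    Assumes x != y.
    """
    edge_set = set(edges)
    inner = [v for v in vertices if v not in (x, y)]
    for order in permutations(inner):
        path = [x, *order, y]
        if all((u, v) in edge_set for u, v in zip(path, path[1:])):
            return True
    return False


# Complete digraph on 3 vertices: every vertex has in/out degree 2.
vertices = [0, 1, 2]
edges = [(0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1)]
found = has_hamiltonian_path(vertices, edges, 0, 2)
```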
Summary
• Genome rearrangements are important.
• Problems with multiple gene copies are hard.
• SCJD – allows SCJ and duplications:
  – Linear-time algorithm for the SCJD distance.
  – Study of the number of duplications in an optimal solution.
• We hope to generalize the model and apply it to cancer data.
Thank You!