Genome Rearrangement: SORTING BY REVERSALS. Ankur Jain, Hoda Mokhtar. CS290I – SPRING 2003

Page 1: Genome Rearrangement SORTING BY REVERSALS

Genome Rearrangement: SORTING BY REVERSALS

Ankur Jain

Hoda Mokhtar

CS290I – SPRING 2003

Page 2: Genome Rearrangement SORTING BY REVERSALS

Comparative Genomics

The practice of analyzing and comparing the genetic material of different species for the purpose of studying evolution, the function of genes and inherited diseases.

Chromosome breakage and mistakes in repair, along with a number of other processes, give rise to changes in gene order. 

These have important consequences for the evolution of species. 

Page 3: Genome Rearrangement SORTING BY REVERSALS

Problem Definition

During biological evolution, inter- and intra-chromosomal exchanges of chromosomal fragments disrupt the order of genes on a chromosome.

The genome rearrangements approach uses combinatorial optimization techniques to infer a sequence of rearrangement events that accounts for the differences among the genomes.

Page 4: Genome Rearrangement SORTING BY REVERSALS

Outline

• Problem definition
• Genome Comparison
• Possible chromosomal changes
• Sorting by reversals:
    - Previous work
    - Definitions
    - Duality Theorem
• Our technique:
    - Bit Vector Method
    - Experimental results:
        - Synthetic datasets
        - Real datasets
    - Breakpoints Technique
• Conclusions and Future work

Page 5: Genome Rearrangement SORTING BY REVERSALS

Genome Comparison

• In the late 1980s, a remarkable and novel pattern of evolutionary change was discovered in plant organelles.

• Jeffrey Palmer and his colleagues compared the mitochondrial genomes of cabbage and turnip, which are very closely related. The molecules are almost identical in gene sequence but differ dramatically in gene order. {Sridhar, Pevzner 1995}

• This discovery and many other studies showed that genome rearrangements represent a common mode of molecular evolution.

Page 6: Genome Rearrangement SORTING BY REVERSALS

Cabbage and Turnip

[Figure: cabbage and turnip gene orders, showing gene orientation]

Page 7: Genome Rearrangement SORTING BY REVERSALS

Single Chromosome Operations

• Reversal: A section of a chromosome is excised, reversed in orientation, and re-inserted. (abc1c2c3c4de -> ab-c4-c3-c2-c1de)

• Transposition: A section of a chromosome is excised and inserted at a new position in the chromosome, without changing orientation. (abcd -> cdab)

• Inverted transposition: Exactly like transposition, except that the transposed segment changes orientation. (abcd -> -d-cab)

• Gene duplication: A section of a chromosome is duplicated, so that multiple copies exist of every gene in that section. (abc -> abcb, abc -> abbc)

• Gene loss: A section of a chromosome is excised and lost. (abc -> ac)
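The first three operations above can be sketched on a list of signed integers, one integer per gene (a = 1, b = 2, and so on). This is a minimal illustration only: the function names and the half-open index convention are assumptions made for this sketch, and an inversion is assumed to reverse the order of the segment as well as flip every sign, as in the reversal example.

```python
def reversal(genes, start, end):
    """Reverse genes[start:end] and flip the sign of every gene in it."""
    return genes[:start] + [-g for g in reversed(genes[start:end])] + genes[end:]


def transposition(genes, start, end, dest):
    """Cut genes[start:end] and re-insert it before index dest of the rest."""
    segment = genes[start:end]
    rest = genes[:start] + genes[end:]
    return rest[:dest] + segment + rest[dest:]


def inverted_transposition(genes, start, end, dest):
    """Like transposition, but the moved segment is also inverted."""
    segment = [-g for g in reversed(genes[start:end])]
    rest = genes[:start] + genes[end:]
    return rest[:dest] + segment + rest[dest:]


abcd = [1, 2, 3, 4]
print(transposition(abcd, 2, 4, 0))           # cdab  -> [3, 4, 1, 2]
print(inverted_transposition(abcd, 2, 4, 0))  # [-4, -3, 1, 2]
print(reversal([1, 2, 3, 4, 5], 1, 4))        # [1, -4, -3, -2, 5]
```

Plain lists keep the sketch short; each call returns a new list and leaves its input unchanged.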

Page 8: Genome Rearrangement SORTING BY REVERSALS

Operations on 2 Chromosomes

• Translocation: The end of one chromosome is broken and attached to the end of another chromosome.

• Fusion: two chromosomes merge.

• Fission: one chromosome splits up into two chromosomes.

Page 9: Genome Rearrangement SORTING BY REVERSALS

Genomic Sorting Problem

• Given genomes $\Pi$ and $\Gamma$, the genomic sorting problem is to find a series of reversals $\rho_1, \ldots, \rho_t$ where $\Pi \cdot \rho_1 \cdots \rho_t = \Gamma$ and $t$ is minimal.

• We call $t$ the genomic distance between $\Pi$ and $\Gamma$.

Page 10: Genome Rearrangement SORTING BY REVERSALS

Sorting by Reversals

• Genome rearrangements can be modelled as the combinatorial problem of sorting by reversals.

Reversal (break and invert):

A T G C C T G T A C T A

A T G A T G T C C C T A

Page 11: Genome Rearrangement SORTING BY REVERSALS

Sorting by Reversals (Cont.)

Minimum Sorting by Reversals: Given a permutation $\pi$, what is the shortest sequence $(\rho_1 \rho_2 \ldots \rho_t)$ of reversals that sorts $\pi$?

NP-hard. {Caprara ‘97}

Minimum Signed Sorting by Reversals: Given a signed permutation $\pi$, what is the shortest sequence $(\rho_1 \rho_2 \ldots \rho_t)$ of reversals that sorts $\pi$?

Solvable in polynomial time.

Page 12: Genome Rearrangement SORTING BY REVERSALS

Sorting of Signed Permutations

• Transforming cabbage into turnip, {Hannenhalli, S., and Pevzner, P. ‘95} - Polynomial algorithm for sorting signed permutations by reversals.

• A Very Elementary Presentation of the Hannenhalli-Pevzner Theory, {A. Bergeron’95} – Polynomial algorithm for sorting signed permutations, efficiently implemented using bit vectors.

• Experiments in Computing Sequences of Reversals, {A. Bergeron and F. Strasbourg’95} – Polynomial algorithm for sorting signed permutations.

• Fast Sorting by Reversal, {Berman, P., Hannenhalli, S. ‘96} - Exploits a few combinatorial properties of the cycle graph of a permutation to provide a polynomial algorithm.

• A Faster and Simpler Algorithm for Sorting Signed Permutations by Reversals, {Kaplan, H., Shamir, R., and Tarjan, R. ‘99} – $O(n^2)$ using hurdles, cycles and fortresses.

• A Linear-Time Algorithm for Computing Inversion Distance between Signed Permutations with an Experimental Study, {Moret, and Yan’ 00} - Computes reversal distance (without actually sorting) in O(n) time. Computes the connected components using a stack rather than Union-Find. {Hannenhalli-Pevzner ’96} (GRAPPA program)

Page 13: Genome Rearrangement SORTING BY REVERSALS

Outline

• Problem definition
• Genome Comparison
• Possible chromosomal changes
• Sorting by reversals:
    - Previous work
    - Definitions
• Our technique:
    - Bit Vector Method
    - Experimental results:
        - Synthetic datasets
        - Real datasets
    - Breakpoints Technique
• Conclusions and Future work

Page 14: Genome Rearrangement SORTING BY REVERSALS

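What is a Permutation?

Permutation ($\pi$): an ordered arrangement of the set {1, 2, …, n}

Signed permutation ($\pi$): a permutation whose elements are oriented (each carries a sign); a reversal switches element orientation.

A reversal $\rho(i,j)$, $i \le j$, acts on $\pi = (\pi_1, \ldots, \pi_n)$ as

$$\pi \cdot \rho(i,j) = (\pi_1, \ldots, \pi_{i-1}, \pi_j, \pi_{j-1}, \ldots, \pi_i, \pi_{j+1}, \ldots, \pi_n)$$

and, in the signed case, also flips the sign of every element in the reversed segment.

Example (the reversal is named here by the first and last elements of the reversed segment):

{+3 -4 +7 -6 +1 -5 +2} · ρ(7, -5) = {+3 -4 +5 -1 +6 -7 +2}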

Page 15: Genome Rearrangement SORTING BY REVERSALS

Breakpoint

Let $i \sim j$ if $|i - j| = 1$. Extend a permutation $\pi = \pi_1 \pi_2 \ldots \pi_n$ by adding $\pi_0 = 0$ and $\pi_{n+1} = n + 1$.

We call a pair of elements $(\pi_i, \pi_{i+1})$, $0 \le i \le n$, of $\pi$ an adjacency if $\pi_i \sim \pi_{i+1}$, and a breakpoint if $\pi_i \not\sim \pi_{i+1}$.

Page 16: Genome Rearrangement SORTING BY REVERSALS

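What is a breakpoint graph?

The breakpoint graph $G(\pi)$ of a permutation $\pi$ is an edge-colored graph with n + 2 vertices $\{\pi_0, \pi_1, \ldots, \pi_n, \pi_{n+1}\} = \{0, 1, \ldots, n, n+1\}$ (2n + 2 vertices in the signed case, where each signed element is replaced by two vertices).

We join vertices $\pi_i$ and $\pi_{i+1}$ by a black edge.

We join vertices $\pi_i$ and $\pi_j$ by a gray edge if $\pi_i \sim \pi_j$.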

Page 17: Genome Rearrangement SORTING BY REVERSALS

Breakpoint graph – signed case

Straight edges – every other pair of consecutive elements

Curved edges - every other pair of consecutive integers

Every connected component of the graph is a cycle
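Because every vertex of the signed breakpoint graph touches exactly one straight edge and one curved edge, its cycles can be counted by walking the edges alternately. The sketch below assumes the usual encoding in which element +x becomes the pair (2x-1, 2x), element -x becomes (2x, 2x-1), and the sequence is framed by 0 and 2n+1; the function names are illustrative and not taken from the slides.

```python
def unsigned_image(perm):
    """Encode a signed permutation of 1..n as an unsigned one of 0..2n+1."""
    seq = [0]
    for x in perm:
        if x > 0:
            seq += [2 * x - 1, 2 * x]
        else:
            seq += [-2 * x, -2 * x - 1]
    seq.append(2 * len(perm) + 1)
    return seq


def count_cycles(perm):
    """Count the alternating cycles of the signed breakpoint graph."""
    seq = unsigned_image(perm)
    pos = {value: index for index, value in enumerate(seq)}
    seen = set()
    cycles = 0
    # Straight edges join positions (2k, 2k+1); curved edges join values (2k, 2k+1).
    for start in range(0, len(seq), 2):
        if start in seen:
            continue
        cycles += 1
        p = start
        while p not in seen:
            q = p ^ 1                # other end of the straight edge at position p
            seen.update((p, q))
            p = pos[seq[q] ^ 1]      # follow the curved edge leaving that vertex
    return cycles


perm = [3, 1, 6, 5, -2, 4]
c = count_cycles(perm)
print(c, len(perm) + 1 - c)  # 2 cycles, so at least 6 + 1 - 2 = 5 reversals
```

The identity permutation on n elements gives n + 1 cycles, which is why n + 1 - c measures how far a permutation is from being sorted.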

Page 18: Genome Rearrangement SORTING BY REVERSALS

Correlation between the breakpoints and reversal distance

• A correlation exists between the reversal distance and the number of breakpoints

• Sorting by reversals corresponds to eliminating breakpoints

• Every reversal can eliminate at most 2 breakpoints

$d(\pi) \ge b(\pi)/2$ {Shamir, 95}
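A small sketch of the breakpoint count and the resulting lower bound, using the unsigned definition (a pair of neighbours is an adjacency when the values differ by exactly 1, after framing the permutation with 0 and n + 1). The function names are hypothetical.

```python
def breakpoints(perm):
    """Number of breakpoints b of an unsigned permutation of 1..n."""
    extended = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for a, b in zip(extended, extended[1:]) if abs(a - b) != 1)


def breakpoint_lower_bound(perm):
    """Each reversal removes at most 2 breakpoints, so d >= ceil(b / 2)."""
    return (breakpoints(perm) + 1) // 2


perm = [3, 1, 6, 5, 2, 4]
print(breakpoints(perm))             # 6
print(breakpoint_lower_bound(perm))  # 3
```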

Page 19: Genome Rearrangement SORTING BY REVERSALS

Outline

• Problem definition
• Genome Comparison
• Possible chromosomal changes
• Sorting by reversals:
    - Previous work
    - Definitions
    - Duality Theorem (Hurdles !!)
• Our technique:
    - Vector-Method
    - Experimental results:
        - Synthetic datasets
        - Real datasets
    - Breakpoints Technique
• Conclusions and Future work

Page 20: Genome Rearrangement SORTING BY REVERSALS

Hurdle

Hurdle - an unoriented component whose elements are consecutive

Simple hurdle - a hurdle whose deletion decreases the number of hurdles

Super hurdles - hurdles that are not simple

Page 21: Genome Rearrangement SORTING BY REVERSALS

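Duality Theorem for Sorting Signed Permutations

Hannenhalli and Pevzner, 1995.

For every signed permutation $\pi$, with $c(\pi)$ the number of cycles in the breakpoint graph and $h(\pi)$ the number of hurdles:

$$d(\pi) = \begin{cases} n + 1 - c(\pi) + h(\pi) + 1 & \text{if } \pi \text{ is a fortress} \\ n + 1 - c(\pi) + h(\pi) & \text{otherwise} \end{cases}$$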
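As a worked instance of the formula, the sketch below simply evaluates it. The cycle count, hurdle count and fortress flag are assumed to be given; computing them is the hard part and is not attempted here. The function name is illustrative.

```python
def reversal_distance(n, cycles, hurdles, fortress=False):
    """Evaluate d = n + 1 - c + h, plus 1 when the permutation is a fortress."""
    return n + 1 - cycles + hurdles + (1 if fortress else 0)


# A permutation on n = 6 elements with c = 2 cycles and no hurdles.
print(reversal_distance(6, 2, 0))  # 5
```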

Page 22: Genome Rearrangement SORTING BY REVERSALS

Safe reversal

For an arbitrary reversal $\rho$: $\Delta(c - h) \le 1$

Reversal $\rho$ is safe if $\Delta(c - h) = 1$

Example (figure labels): c = 3, h = 1; c = 5, h = 2

Page 23: Genome Rearrangement SORTING BY REVERSALS

Outline

• Problem definition
• Genome Comparison
• Possible chromosomal changes
• Sorting by reversals:
    - Previous work
    - Definitions
    - Duality Theorem (Hurdles !!)
• Our technique:
    - Bit Vector Method
    - Experimental results:
        - Synthetic datasets
        - Real datasets
    - Breakpoints Technique
• Conclusions and Future work

Page 24: Genome Rearrangement SORTING BY REVERSALS

Our Approach

• Finding hurdles and fortresses in a graph is difficult and expensive {Kaplan, H., Shamir, R., and Tarjan, R. ‘99.}

• Use oriented sort to remove the oriented components in a graph and then apply the breakpoint approach to perform the remaining reversals

• We used the bit-vector approach to perform the oriented sort

Page 25: Genome Rearrangement SORTING BY REVERSALS

Oriented Sort

Among the several candidates, choose a safe reversal, that is, a reversal that decreases the reversal distance.

Theorem: The reversal that maximizes the number of oriented vertices is safe {A. Bergeron’95}

Page 26: Genome Rearrangement SORTING BY REVERSALS

Basic Sorting – oriented pair

• An oriented pair $(\pi_i, \pi_j)$ is a pair of consecutive integers, that is $|\pi_i| - |\pi_j| = \pm 1$, with opposite signs

Example:

(0 3 1 6 5 -2 -4 7)

Oriented pairs are: (1, -2), (3, -2), (3, -4), (5, -4)
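A direct way to list oriented pairs is to test every pair of elements against the definition: absolute values differing by 1, and opposite signs. The sketch below does this on the framed permutation (0 3 1 6 5 -2 -4 7), treating the frame element 0 as positive; the function name is an assumption for illustration.

```python
def oriented_pairs(perm):
    """Return all (a, b), a left of b, with | |a| - |b| | == 1 and opposite signs."""
    pairs = []
    for i, a in enumerate(perm):
        for b in perm[i + 1:]:
            if abs(abs(a) - abs(b)) == 1 and (a < 0) != (b < 0):
                pairs.append((a, b))
    return pairs


print(oriented_pairs([0, 3, 1, 6, 5, -2, -4, 7]))
# [(3, -2), (3, -4), (1, -2), (5, -4)]
```

The quadratic scan is fine for illustration; a position lookup table would find the same pairs in linear time.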

Page 27: Genome Rearrangement SORTING BY REVERSALS

Reversal score

The number of oriented pairs in the permutation that results from a reversal

Example: ( 0 3 1 6 5 -2 4 7 )

Pair (3, -2): ( 0 -5 -6 -1 -3 -2 4 7 ), Score 4

Pair (1, -2): ( 0 3 1 2 -5 -6 4 7 ), Score 2

Page 28: Genome Rearrangement SORTING BY REVERSALS

Algorithm

• As long as $\pi$ has an oriented pair, choose the oriented reversal that has maximal score. Each line shows the current permutation and the oriented pair chosen for the next reversal:

( 0 3 1 6 5 -2 4 7 ) (3, -2)

( 0 -5 -6 -1 -3 -2 4 7 ) (-3, 4)

( 0 -5 -6 -1 2 3 4 7 ) (-1, 2)

( 0 -5 -6 1 2 3 4 7 ) (-6, 7)

( 0 -5 -4 -3 -2 -1 6 7 ) (-5, 6)

( 0 1 2 3 4 5 6 7 )
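The greedy loop can be sketched as follows. This is a plain list-based illustration, not the bit-vector implementation: it assumes that an oriented pair at positions i < j induces the reversal of positions i..j-1 when the two values sum to +1 and of positions i+1..j when they sum to -1, and it breaks score ties by taking the first candidate found, so intermediate steps may differ from the trace above even though the number of reversals is the same. All names are hypothetical.

```python
def oriented_pair_positions(p):
    """Index pairs (i, j), i < j, holding consecutive integers of opposite sign."""
    return [(i, j)
            for i in range(len(p))
            for j in range(i + 1, len(p))
            if abs(abs(p[i]) - abs(p[j])) == 1 and (p[i] < 0) != (p[j] < 0)]


def apply_oriented_reversal(p, i, j):
    """Apply the reversal induced by the oriented pair at positions i < j."""
    lo, hi = (i, j - 1) if p[i] + p[j] == 1 else (i + 1, j)
    return p[:lo] + [-x for x in reversed(p[lo:hi + 1])] + p[hi + 1:]


def score(p):
    """Score of a permutation: its number of oriented pairs."""
    return len(oriented_pair_positions(p))


def oriented_sort(perm):
    """Greedily apply maximal-score oriented reversals; return every step."""
    p = [0] + list(perm) + [len(perm) + 1]   # frame with 0 and n + 1
    steps = [p]
    while True:
        candidates = [apply_oriented_reversal(p, i, j)
                      for i, j in oriented_pair_positions(p)]
        if not candidates:
            return steps                      # sorted, or stuck without oriented pairs
        p = max(candidates, key=score)
        steps.append(p)


steps = oriented_sort([3, 1, 6, 5, -2, 4])
for step in steps:
    print(step)
print(len(steps) - 1, "reversals")            # 5 reversals
```

If the loop stops on a permutation that is not the identity, no oriented pair is left, and a different technique has to finish the sort.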

Page 29: Genome Rearrangement SORTING BY REVERSALS

Oriented edge

Let $(\pi_i, \pi_j)$ be a gray edge incident to black edges $(\pi_k, \pi_i)$ and $(\pi_j, \pi_l)$. Then $(\pi_i, \pi_j)$ is oriented if and only if $i - k = j - l$.

Example: Edge 20-21 is oriented (contains an odd number, 3, of vertices). i = 20, j = 21, k = 22, l = 23; i - k = -2 = j - l.

[Figure labels: Bergeron, Pevzner]

Page 30: Genome Rearrangement SORTING BY REVERSALS

Oriented reversals

Reversals that create consecutive integers are always induced by oriented pairs. Such reversals are called oriented reversals.

• The reversal induced by an oriented pair $(\pi_i, \pi_j)$ is:

$\rho(i, j-1)$ if $\pi_i + \pi_j = +1$, and

$\rho(i+1, j)$ if $\pi_i + \pi_j = -1$

Example: The pair (1, -2) induces the reversal:

(0 3 1 6 5 –2 4 7)

(0 3 1 2 –5 –6 4 7)

Page 31: Genome Rearrangement SORTING BY REVERSALS

Interleaving Graph

Two components are adjacent if they overlap but neither of them contains the other.

Page 32: Genome Rearrangement SORTING BY REVERSALS

Constructing the Bit Matrix

Consider the sequence P = 3 1 6 5 –2 4 7

Represent each element $P_i = \pm x$ by

2x-1, 2x if $P_i$ is positive, and

2x, 2x-1 if $P_i$ is negative,

then frame the result with 0 and 2n+1:

3 1 6 5 -2 4 7 -> 0 5 6 1 2 11 12 9 10 4 3 7 8 13 14 15

[Figure: Bit Matrix with Parity Scores]
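One plausible reading of the bit matrix, sketched here under stated assumptions: vertex k stands for the arc joining values 2k and 2k+1 in the encoded sequence, row k has a 1 in column m when the two arcs overlap without one containing the other, and the parity of a vertex is its row sum modulo 2. Whether this matches the exact matrix on the slide is an assumption; the function names are illustrative.

```python
def unsigned_image(perm):
    """Encode +x as (2x-1, 2x) and -x as (2x, 2x-1), framed by 0 and 2n+1."""
    seq = [0]
    for x in perm:
        seq += [2 * x - 1, 2 * x] if x > 0 else [-2 * x, -2 * x - 1]
    seq.append(2 * len(perm) + 1)
    return seq


def overlap_matrix(perm):
    """Bit matrix M with M[k][m] = 1 when arcs k and m properly overlap."""
    seq = unsigned_image(perm)
    pos = {value: index for index, value in enumerate(seq)}
    size = len(perm) + 1
    # Arc k spans the positions of values 2k and 2k+1.
    arcs = [tuple(sorted((pos[2 * k], pos[2 * k + 1]))) for k in range(size)]
    matrix = [[0] * size for _ in range(size)]
    for k, (a1, a2) in enumerate(arcs):
        for m, (b1, b2) in enumerate(arcs):
            if a1 < b1 < a2 < b2 or b1 < a1 < b2 < a2:
                matrix[k][m] = 1
    return matrix


def parities(matrix):
    """Row sums modulo 2; a 1 marks a vertex of odd degree."""
    return [sum(row) % 2 for row in matrix]


P = [3, 1, 6, 5, -2, 4, 7]
print(unsigned_image(P))            # 0 5 6 1 2 11 12 9 10 4 3 7 8 13 14 15
print(parities(overlap_matrix(P)))  # [0, 1, 1, 0, 0, 0, 0, 0]
```

In this example the two vertices with parity 1 are the arcs joining 2-3 and 4-5, which correspond to the oriented pairs (1, -2) and (3, -2).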

Page 33: Genome Rearrangement SORTING BY REVERSALS

The Algorithm

Step 1. Select the vertex $v_i$ with the maximum score and perform the corresponding operation, repeating until the parity of all the vertices is zero.

Step 2. If the sequence is not completely sorted, apply the breakpoint technique to complete the sorting.

Page 34: Genome Rearrangement SORTING BY REVERSALS

Outline

• Problem definition
• Genome Comparison
• Possible chromosomal changes
• Sorting by reversals:
    - Previous work
    - Definitions
    - Duality Theorem (Hurdles !!)
• Our technique:
    - Bit Vector Method
    - Experimental results:
        - Synthetic datasets
        - Real datasets
    - Breakpoints Technique
• Conclusions and Future work

Page 35: Genome Rearrangement SORTING BY REVERSALS

Experimental Settings

1- Synthetic Datasets: generated random signed permutations of different lengths and evolution rates using the GRIMM permutation generation module

2- Real Datasets:

- GRAPPA test sets for different species of “Campanulaceae” (flowering plants)

- MGR (multiple genome rearrangement) human-mouse gene order data

- Genome.org data for herpes viruses that affect humans
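For readers who want comparable synthetic inputs, the sketch below generates a signed permutation by applying k random reversals to the identity. This is not the GRIMM generator, and reading "evolution rate k" as "number of random reversals applied" is an assumption of this sketch; the function name and seed parameter are hypothetical.

```python
import random


def random_signed_permutation(n, k, seed=None):
    """Apply k uniformly random signed reversals to the identity on 1..n."""
    rng = random.Random(seed)
    perm = list(range(1, n + 1))
    for _ in range(k):
        i = rng.randint(0, n - 1)
        j = rng.randint(i, n - 1)
        # Reverse the segment i..j and flip the sign of each element in it.
        perm[i:j + 1] = [-x for x in reversed(perm[i:j + 1])]
    return perm


sample = random_signed_permutation(50, 20, seed=1)
print(len(sample), sorted(abs(x) for x in sample) == list(range(1, 51)))
```

A permutation built this way can always be sorted in at most k reversals, which gives a simple sanity check for any sorting routine run on it.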

Page 36: Genome Rearrangement SORTING BY REVERSALS

Experiment 1 - Synthetic

1- Generated files of random permutations of different lengths (50, 100, 200, 400, 800, 1600), each file with 50 permutations.

2- We computed the number of correctly sorted permutations.

3- Evolution rate varies: 20, 30, 40

Performance: number of permutations sorted (No. Sorted) by Sequence Length

Sequence Length   50   100   200   400   800   1600
k=20              39    36    37    32    32     28
k=30              42    37    40    31    32     30
k=40              42    33    37    38    34     34

Page 37: Genome Rearrangement SORTING BY REVERSALS

Experiment 2 - Synthetic

1- Generated files of random permutations of different lengths (50, 100, 200, 400, 800, 1600), each file with 50 permutations.

2- We computed the time needed to obtain the correctly sorted permutations.

3- Evolution rate varies: 20, 30, 40

[Figure: Time (sec) versus Sequence Length (50 to 1600) for k = 20, 30, 40]

Page 38: Genome Rearrangement SORTING BY REVERSALS

Experiment 3 - Synthetic

1- Generated files of random permutations of length 1000

2- We computed the time needed to obtain the correctly sorted permutations.

3- Evolution rate varies in increments of 100.

Observation: A saturation state is reached as the evolution rate approaches 1000.

[Figure: Evolution Rate vs Time; time (sec) versus k for Length = 1000]

Page 39: Genome Rearrangement SORTING BY REVERSALS

Experiment 1 - Real

Considered Herpes simplex virus (HSV), Epstein-Barr virus (EBV), and Cytomegalovirus (CMV) gene orders (Hannenhalli et al. 1995) as well as the identity gene order (A)

Observations: Our reversal results matched those obtained in the optimal evolutionary scenario recovered by MGR-MEDIAN.

Page 40: Genome Rearrangement SORTING BY REVERSALS

Experiment 2 - Real

1- Considered Campanulaceae species

2- Obtained reversals for Cyananthus (11 reversals), Triodanis (13 reversals), and Symphyandra (12 reversals) versus Tobacco, but failed to sort Platycodon, Legousia and Codonopsis

Observation: The ones we sorted were sorted with the same number of reversals as GRIMM

Page 41: Genome Rearrangement SORTING BY REVERSALS

Experiment 3 - Real

1- Considered Human-Mouse gene order from MGR

Mouse genome (the human genome is the identity):

12 13 14 15 -9 -8 -7 -6 47 48 -46 -45 -44 -11 -10 -58 -57 -56 92 93 -95 -94 -21 -20 -5 -4 -3 -2 -1 34 35 41 42 43 36 37 38 -64 -63 61 62 65 66 67 68 90 91 -55 -54 51 52 53 39 40 -60 -59 -77 -76 -19 -18 16 17 -97 -96 -75 -74 -73 24 25 78 79 -83 -82 -81 -80 84 85 86 87 -28 -27 -26 22 23 98 99 69 70 -72 -71 -33 -32 -31 -30 -29 88 89 -50 -49 -105 -104 106 107 108 114 115 -117 -116 -103 -102 109 110 111 112 113 -101 -100 118 119 120 121 122 123

40 reversals

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124

3 reversals

Identity

GRIMM sorts the permutation in 41 reversals

Page 42: Genome Rearrangement SORTING BY REVERSALS

Conclusions

We implemented a technique that integrates the bit-matrix oriented sorting technique with the greedy breakpoint reversal technique.

The technique was tested on both real and synthetic data and sorted the signed permutations in a fair share of the test cases.

We think this integration can yield good results while remaining simple and relatively fast.

However, the oriented sort algorithm fails to sort permutations that have hurdles; in those cases we have to apply the breakpoint approach.

Page 43: Genome Rearrangement SORTING BY REVERSALS

Future Work

We think the technique we implemented can provide good results, and that further experiments can strengthen this claim.

We started implementing the algorithm proposed in {Kaplan, H., Shamir, R., and Tarjan, R. ’99} but did not complete the implementation. Having that technique implemented under the same conditions as ours would provide a good source of comparative results and give better confidence in what we propose.

Applying the technique to different datasets, including exon order rather than gene order

Considering different species, computing their reversal distances, and using them to confirm phylogenetic trees

Page 44: Genome Rearrangement SORTING BY REVERSALS

Oriented Pairs

• An oriented pair $(\pi_i, \pi_j)$ is a pair of consecutive integers, that is $|\pi_i| - |\pi_j| = \pm 1$, with opposite signs

Example:

$\pi$ = (0 3 1 6 5 –2 4 7)

Oriented pairs are (1, -2) and (3, -2)

Page 45: Genome Rearrangement SORTING BY REVERSALS

Reversal Distance Estimation

$d(\pi) \ge b(\pi)/2$

This reversal distance estimate is very inaccurate.

Bafna and Pevzner, 1996 showed that another hidden parameter, ”hurdles”, estimates reversal distance with much greater accuracy:

$d(\pi) \ge b(\pi) - c(\pi) + h(\pi)$

Page 46: Genome Rearrangement SORTING BY REVERSALS

Proper reversal

Given an arbitrary reversal $\rho$, denote $\Delta c = c(\pi\rho) - c(\pi)$ (the increase in the size of the cycle decomposition).

Then for every permutation $\pi$ and reversal $\rho$, $\Delta c \le 1$.

We call a reversal proper if $\Delta c = 1$

Page 47: Genome Rearrangement SORTING BY REVERSALS

Oriented pairs

• Oriented pairs are useful because they indicate reversals that create consecutive elements of the permutation.

Example:

The pair (1, -2) induces the reversal:

(0 3 1 6 5 –2 4 7)

(0 3 1 2 –5 –6 4 7)