6
On the Distribution of the Number of Cycles in the Breakpoint Graph of a Random Signed Permutation Simona Grusea Abstract—We use the finite Markov chain embedding technique to obtain the distribution of the number of cycles in the breakpoint graph of a random uniform signed permutation. This further gives a very good approximation of the distribution of the reversal distance between two random genomes. Index Terms—Markov processes, probabilistic algorithms, distribution functions, biology and genetics. Ç 1 INTRODUCTION 1.1 The Biological Context COMPARING genomic organization between different species may help to decipher the evolutionary history of species and also to better understand the biology of contemporary species. Orthologous genes are two genes, in two different species, that descend from the same gene at the ancestor of the two species, as a result of a speciation event. They tend, in general, to have similar functions. Therefore, finding a group of orthologous genes in close proximity in the genomes of two different species may represent a sign for evolutionary or functional relationships between these genes. For this to be the case, the observed orthologous gene clusters have to be significant, i.e., very improbable to have appeared by chance. During the evolutionary time, the gene order in one genome can be affected by various genome rearrangement events, such as inversions, translocations, transpositions, and chromosomic fis- sions and fusions. Hence, in the absence of certain constraints due to functional selective pressures, the gene order is rapidly randomized. This is one reason why, in general, the null hypothesis taken in the significance tests for gene clusters is the hypothesis of random gene order. In the “genomic comparison” literature, various definitions for gene clusters exist, and also different statistical tests for detecting gene clusters which are significant from the point of view of the proximity of the orthologs (see [3], [4], [6], [7], [13], [15], [19], [20], [25], [34]). On the other hand, one might want to take into account also the order of the orthologs in these gene clusters, considering that the clusters in which the gene order is exceptionally conserved are even more biologically significant. Sankoff and Haque [26] propose three adjacency disruption measures for comparing the order of the orthologs which are in common between two clusters in two genomes. They investigate in more detail the “maximum adjacency disruption” criterion, giving analytic formulas for some values of its distribution under random gene order and also simulation results. Grusea [17] proposes three measures based on the mathematical transposition distance between permutations, for assessing the exceptionality of the gene order in conserved genomic regions found by the reference region approach, and obtains analytic expressions for the distribution of these distances in the case of a uniform random permutation. In [12], Corteel et al. analyze the distribution of the number of common intervals in the case of a uniform random permutation and also study some generalized common intervals, in which gaps of a certain size are permitted. In the “genome rearrangements” literature, several more biologically relevant distances have been studied, which take into account one or a combination of different types of genomic events: reversals, translocations, chromosomic fissions and fusions, biolo- gical transpositions, and block interchanges—see [22] for a review. The problem with using these distances as test statistics comes from the fact that their distributions for a random permutation are very difficult to obtain and there are very few results on this subject. Recently, Doignon and Labarre [14] and Bona and Flynn [8] have found, in two different ways, the distribution of the number of (edge-disjoint) alternating cycles in the bicolored breakpoint graph of a random unsigned permutation, which can be used to deduce the exact distribution of the “block interchange” distance of Christie [11]. However, for signed permutations, corresponding to the case when gene orientation is also known, the exact distribution of the number of cycles in the breakpoint graph is still unknown. Sankoff and Haque [27] use a constructive approach to obtain asymptotic estimates for the distribution of the number of cycles in the breakpoint graph of two random signed permutations. Xu et al. [32] and Xu [31] use a similar approach to study the case of multichromosomal genomes. The comparison of two genomes induces a decomposition of the genomes into synteny blocks (or conserved segments), chromo- somic segments containing orthologous genes in the same or reverse order in the two genomes (see [10], [23]). The genomes could then be seen as permutations (unsigned or signed) of the set of synteny blocks. Some authors extend the notion of synteny block, allowing for some microrearrangements inside the synteny blocks (see [24]). In the present work, we are interested in finding the exact distribution of the number of alternating cycles in the breakpoint graph of a random signed permutation. The knowledge of this distribution provides a very good approximation for the distribu- tion of the reversal distance for a random signed permutation. This further allows us to use the reversal distance as a test statistic for assessing the exceptionality of the gene order in conserved genomic regions or of the order of the synteny blocks between two genomes. We use the finite Markov chain embedding technique of Fu and Koutras [16] to obtain the distribution of the number of cycles in the breakpoint graph of a random signed permutation via a product of transition probability matrices of a certain finite Markov chain. 1.2 The Breakpoint Graph and the Reversal Distance We let S n denote the permutation group of order n. For a permutation % 2 S n , we will use the notation % ¼½%ð1Þ; ... ;%ðnÞ.A signed permutation of n elements is a permutation % ¼½%ð1Þ; ... ; %ðnÞ in which the elements %ðiÞ;i ¼ 1; ... ;n have a sign, either þ or . In other words, %ðiÞ 2 f1; ... ; ng, for i ¼ 1; ... ;n and fj%ð1Þj; ... ; j%ðnÞjg ¼ f1; ... ;ng. We denote by B n the set of all the signed permutations of n elements. The reversal of the interval ði; jÞ in the signed permutation % reverses the subsequence %ðiÞ; ... ;%ðjÞ while changing their signs, and, hence, produces the signed permutation % 0 ¼½%ð1Þ; ... ;%ði 1Þ; %ðjÞ; %ðj 1Þ; ... ; %ði þ 1Þ; %ðiÞ; %ðj þ 1Þ; ... ;%ðnÞ: For % 2 B n , we let d rev ð%;IdÞ denote its reversal distance, i.e., the minimum number of reversals needed to transform % into the identity permutation Id ¼ ½þ1; ... ; þn. Bafna and Pevzner [2] introduced the concept of breakpoint graph of a permutation and noticed important links between the cycle decomposition of this graph and the reversal distance. The breakpoint graph of a signed permutation is defined as follows: IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. 5, SEPTEMBER/OCTOBER 2011 1411 . The author is with LATP - UMR CNRS 6632, Equipe EBM, Universite´de Provence, and the De´partement GMM, Institut de Mathe´matiques de Toulouse, INSA de Toulouse, Universite´ de Toulouse, 135 avenue de Rangueil, Toulouse 31077, cedex 4, France. E-mail: [email protected]. Manuscript received 24 Mar. 2010; revised 8 July 2010; accepted 26 Aug. 2010; published online 23 Nov. 2010. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCBB-2010-03-0080. Digital Object Identifier no. 10.1109/TCBB.2010.123. 1545-5963/11/$26.00 ß 2011 IEEE Published by the IEEE CS, CI, and EMB Societies & the ACM

On the Distribution of the Number of Cycles in the Breakpoint Graph of a Random Signed Permutation

  • Upload
    s

  • View
    217

  • Download
    3

Embed Size (px)

Citation preview

Page 1: On the Distribution of the Number of Cycles in the Breakpoint Graph of a Random Signed Permutation

On the Distribution of the Number of Cyclesin the Breakpoint Graph of a Random

Signed Permutation

Simona Grusea

Abstract—We use the finite Markov chain embedding technique to obtain the

distribution of the number of cycles in the breakpoint graph of a random uniform

signed permutation. This further gives a very good approximation of the

distribution of the reversal distance between two random genomes.

Index Terms—Markov processes, probabilistic algorithms, distribution functions,

biology and genetics.

Ç

1 INTRODUCTION

1.1 The Biological Context

COMPARING genomic organization between different species mayhelp to decipher the evolutionary history of species and also tobetter understand the biology of contemporary species.

Orthologous genes are two genes, in two different species, thatdescend from the same gene at the ancestor of the two species, as aresult of a speciation event. They tend, in general, to have similarfunctions. Therefore, finding a group of orthologous genes in closeproximity in the genomes of two different species may represent asign for evolutionary or functional relationships between thesegenes. For this to be the case, the observed orthologous geneclusters have to be significant, i.e., very improbable to haveappeared by chance.

During the evolutionary time, the gene order in one genomecan be affected by various genome rearrangement events, such asinversions, translocations, transpositions, and chromosomic fis-sions and fusions. Hence, in the absence of certain constraints dueto functional selective pressures, the gene order is rapidlyrandomized. This is one reason why, in general, the nullhypothesis taken in the significance tests for gene clusters is thehypothesis of random gene order.

In the “genomic comparison” literature, various definitions forgene clusters exist, and also different statistical tests for detectinggene clusters which are significant from the point of view of theproximity of the orthologs (see [3], [4], [6], [7], [13], [15], [19], [20],[25], [34]). On the other hand, one might want to take into accountalso the order of the orthologs in these gene clusters, consideringthat the clusters in which the gene order is exceptionally conservedare even more biologically significant.

Sankoff and Haque [26] propose three adjacency disruptionmeasures for comparing the order of the orthologs which are incommon between two clusters in two genomes. They investigatein more detail the “maximum adjacency disruption” criterion,giving analytic formulas for some values of its distribution underrandom gene order and also simulation results. Grusea [17]proposes three measures based on the mathematical transpositiondistance between permutations, for assessing the exceptionality ofthe gene order in conserved genomic regions found by thereference region approach, and obtains analytic expressions for the

distribution of these distances in the case of a uniform randompermutation. In [12], Corteel et al. analyze the distribution of thenumber of common intervals in the case of a uniform randompermutation and also study some generalized common intervals,in which gaps of a certain size are permitted.

In the “genome rearrangements” literature, several morebiologically relevant distances have been studied, which take intoaccount one or a combination of different types of genomic events:reversals, translocations, chromosomic fissions and fusions, biolo-gical transpositions, and block interchanges—see [22] for a review.The problem with using these distances as test statistics comes fromthe fact that their distributions for a random permutation are verydifficult to obtain and there are very few results on this subject.

Recently, Doignon and Labarre [14] and Bona and Flynn [8]have found, in two different ways, the distribution of the number of(edge-disjoint) alternating cycles in the bicolored breakpoint graphof a random unsigned permutation, which can be used to deduce theexact distribution of the “block interchange” distance of Christie[11]. However, for signed permutations, corresponding to the casewhen gene orientation is also known, the exact distribution of thenumber of cycles in the breakpoint graph is still unknown.

Sankoff and Haque [27] use a constructive approach to obtainasymptotic estimates for the distribution of the number of cycles inthe breakpoint graph of two random signed permutations. Xu et al.[32] and Xu [31] use a similar approach to study the case ofmultichromosomal genomes.

The comparison of two genomes induces a decomposition ofthe genomes into synteny blocks (or conserved segments), chromo-somic segments containing orthologous genes in the same orreverse order in the two genomes (see [10], [23]). The genomescould then be seen as permutations (unsigned or signed) of the setof synteny blocks. Some authors extend the notion of syntenyblock, allowing for some microrearrangements inside the syntenyblocks (see [24]).

In the present work, we are interested in finding the exactdistribution of the number of alternating cycles in the breakpointgraph of a random signed permutation. The knowledge of thisdistribution provides a very good approximation for the distribu-tion of the reversal distance for a random signed permutation. Thisfurther allows us to use the reversal distance as a test statistic forassessing the exceptionality of the gene order in conserved genomicregions or of the order of the synteny blocks between two genomes.

We use the finite Markov chain embedding technique of Fu andKoutras [16] to obtain the distribution of the number of cycles in thebreakpoint graph of a random signed permutation via a product oftransition probability matrices of a certain finite Markov chain.

1.2 The Breakpoint Graph and the Reversal Distance

We let Sn denote the permutation group of order n. For apermutation � 2 Sn, we will use the notation � ¼ ½�ð1Þ; . . . ; �ðnÞ�. Asigned permutation of n elements is a permutation � ¼ ½�ð1Þ; . . . ;�ðnÞ� in which the elements �ðiÞ; i ¼ 1; . . . ; n have a sign, either þor �. In other words, �ðiÞ 2 f�1; . . . ;�ng, for i ¼ 1; . . . ; n andfj�ð1Þj; . . . ; j�ðnÞjg ¼ f1; . . . ; ng. We denote by Bn the set of all thesigned permutations of n elements.

The reversal of the interval ði; jÞ in the signed permutation �

reverses the subsequence �ðiÞ; . . . ; �ðjÞ while changing their signs,and, hence, produces the signed permutation

�0 ¼ ½�ð1Þ; . . . ; �ði� 1Þ;��ðjÞ;��ðj� 1Þ; . . . ;��ðiþ 1Þ;��ðiÞ;�ðjþ 1Þ; . . . ; �ðnÞ�:

For � 2 Bn, we let drevð�; IdÞ denote its reversal distance, i.e., theminimum number of reversals needed to transform � into theidentity permutation Id ¼ ½þ1; . . . ;þn�.

Bafna and Pevzner [2] introduced the concept of breakpointgraph of a permutation and noticed important links between thecycle decomposition of this graph and the reversal distance. Thebreakpoint graph of a signed permutation is defined as follows:

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. 5, SEPTEMBER/OCTOBER 2011 1411

. The author is with LATP - UMR CNRS 6632, Equipe EBM, Universite deProvence, and the Departement GMM, Institut de Mathematiquesde Toulouse, INSA de Toulouse, Universite de Toulouse, 135 avenue deRangueil, Toulouse 31077, cedex 4, France.E-mail: [email protected].

Manuscript received 24 Mar. 2010; revised 8 July 2010; accepted 26 Aug.2010; published online 23 Nov. 2010.For information on obtaining reprints of this article, please send e-mail to:[email protected], and reference IEEECS Log Number TCBB-2010-03-0080.Digital Object Identifier no. 10.1109/TCBB.2010.123.

1545-5963/11/$26.00 � 2011 IEEE Published by the IEEE CS, CI, and EMB Societies & the ACM

Page 2: On the Distribution of the Number of Cycles in the Breakpoint Graph of a Random Signed Permutation

Given a signed permutation � in Bn, we first transform it into anunsigned permutation �0 2 S2n by replacing the positive elementsþi by the pair ð2i� 1; 2iÞ and the negative elements �i by the pairð2i; 2i� 1Þ. For instance, the signed permutation � ¼ ½þ3;�4;�2;þ1;þ5� is transformed into �0 ¼ ½5; 6; 8; 7; 4; 3; 1; 2; 9; 10�. Wethen extend �0 by adding two more elements: one at the beginning,which we will denote S (for Start), and the other at the end, whichwe will denote T (for Terminus).

The breakpoint graph of the signed permutation � 2 Bn is thegraph Gð�Þ ¼ ðV ;B [ CÞ, having the set of vertices V ¼fS; 1; 2; . . . ; 2n; Tg and the edge set partitioned into two subsets:the set B of solid edges, corresponding to the adjacencies in thepermutation �, and the set C of dashed edges, corresponding to theadjacencies in the identity permutation Id.

More precisely, if for every element a of the permutation � wedenote, respectively, by aL and aR the left and right elements in thepair associated with a in �0, then we will have a solid edge betweenaR and bL if a and b are consecutive in �. We have also solid edgesbetween S and ð�1ÞL and between ð�nÞR and T . We have dashededges between the vertices 2i� 1 and 2i, for every i ¼ 1; . . . ; n,between S and 1, and between 2n and T .

Note that each vertex in Gð�Þ is of degree 2, having exactly onesolid edge and one dashed edge incident to it. Consequently, thebreakpoint graph decomposes uniquely into disjoint alternatingcycles, i.e., cycles in which the solid edges and the dashed edgesalternate. For a given cycle, we call its length the number of solidedges, or equivalently, the number of dashed edges it contains. Inthe example from Fig. 1, the breakpoint graph of the permutation� ¼ ½þ3;�4;�2;þ1;þ5� decomposes into two alternating cycles:one of length 1 and the other of length 5.

For a signed permutation � 2 Bn, we will denote cð�Þ thenumber of alternating cycles in the breakpoint graph Gð�Þ. Thereason for introducing cð�Þ in [2] was that it is easily seen to givethe following lower bound, for every � 2 Bn:

drevð�; IdÞ � nþ 1� cð�Þ: ð1Þ

Hannenhalli and Pevzner [18] proved that for every signedpermutation � 2 Bn, we have the exact formula

drevð�; IdÞ ¼ nþ 1� cð�Þ þ hð�Þ þ fð�Þ; ð2Þ

where hð�Þ is the number of hurdles in Gð�Þ and fð�Þ is 1 if � isa fortress and 0 otherwise (see [18] for the definitions of hurdleand fortress).

The problem of computing the reversal distance for signedpermutations (without giving the optimal sequence of reversals)can be solved in linear time (see Bader et al. [1]). The problembecomes more complicated if one wants also the optimal sequenceor reversals. The most efficient sorting algorithms at presentare the one by Tannier et al. [30], that run in Oðn3=2

ffiffiffiffiffiffiffiffiffiffilognp

Þ, and thealgorithm by Swenson et al. [29] that runs in Oðn lognÞ time foralmost all permutations.

Caprara [9] showed that genomes containing hurdles are veryrare. For example, less than 1 percent of the genomes with eightgenes contain hurdles and only one in 105 genomes with 100 genes.Swenson et al. [28] proved that the probability that a random

signed permutation on n elements contains a hurdle is Oðn�2Þ andthe probability that it contains a fortress is Oðn�15Þ. It was alsoshown that the bound (1) approximates the reversal distanceextremely well for both simulated (see Kececioglu and Sankoff[21]) and biological data (see Bafna and Pevzner [2]). Kececiogluand Sankoff [21] observed that the average difference between thisbound and the exact distance is less than one for a randompermutation.

One can, therefore, use the bound ð1Þ as a very goodapproximation for the reversal distance.

Moreover, in the case of unichromosomal genomes, the boundð1Þ agrees exactly with the double-cut-and-join (DCJ) distanceintroduced by Yancopoulos et al. [33]. For more details on thisdistance, see also [5].

The goal of the present work is to find the distribution of cð�Þfor a random (uniform) signed permutation �.

2 THE DISTRIBUTION OF cð�Þ2.1 The Finite Markov Chain Embedding Technique

For obtaining the distribution of cð�Þ for a random signedpermutation �, we use the finite Markov chain embeddingtechnique introduced by Fu and Koutras [16].

Let Xn (n a nonnegative integer) be a nonnegative integerrandom variable. As in [16, Definition 2.1], we call Xn finite Markov

chain embeddable if

1. there exists a (possibly inhomogeneous) finite Markovchain fYt : 1 � t � ng with values in a finite state spaceE ¼ fa1; . . . ; amg,

2. there exists a finite partition fCx; x ¼ 0; 1; . . . ; ‘g of E, and3. for every x ¼ 0; 1; . . . ; ‘, we have

IPðXn ¼ xÞ ¼ IPðYn 2 CxÞ:

The distribution of Xn can in this case be obtained via aproduct of transition matrices of the Markov chain ðYtÞ1�t�n.Indeed, if we define, for every 2 � t � n, the transition matrixPt :¼ ðPtðy; zÞÞy;z2E by

Ptðy; zÞ ¼ IPðYt ¼ zjYt�1 ¼ yÞ; 8y; z 2 E;

then, by [16, Theorem 2.1], we have

IPðXn ¼ xÞ ¼ �1P2 � � �PnXi:ai2Cx

ei; ð3Þ

where

�1 ¼ ðIPðY1 ¼ a1Þ; . . . ; IPðY1 ¼ amÞÞ

is the row vector of the initial probability of the Markov chain and,for each i ¼ 1; . . . ;m, ei is the column vector having 1 at theith coordinate and 0 elsewhere.

2.2 The Construction of the Markov Chain

Let n be a fixed positive integer. We start with �1 being a randomuniform signed permutation with one element; hence, �1 ¼ ½þ1�with probability 1=2 and �1 ¼ ½�1� with probability 1=2. For everyt ¼ 2; . . . ; n, we let �t represent the random signed permutation oft elements which is obtained from �t�1 by inserting at random theelement t uniformly into one of the t possible positions, with the“þ” sign with probability 1=2 and the “�” sign with probability1=2, the sign being independent of the position.

Note that ð�tÞ1�t�n is an inhomogeneous Markov chain withinitial distribution

IPð�1 ¼ ½1�Þ ¼ IPð�1 ¼ ½�1�Þ ¼ 1=2

1412 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. 5, SEPTEMBER/OCTOBER 2011

Fig. 1. The breakpoint graph of the permutation � ¼ ½þ3;�4;�2;þ1;þ5�.

Page 3: On the Distribution of the Number of Cycles in the Breakpoint Graph of a Random Signed Permutation

and the following transition probability matrices: for every2 � t � n,

Mtð�; �þ;iÞ ¼ IPð�t ¼ �þ;ij�t�1 ¼ �Þ ¼1

2t;

Mtð�; ��;iÞ ¼ IPð�t ¼ ��;ij�t�1 ¼ �Þ ¼1

2t;

where

�þ;i :¼ ½�1; . . . ; �i�1; t; �i; . . . ; �t�1�;��;i :¼ ½�1; . . . ; �i�1;�t; �i; . . . ; �t�1�;

and Mtð�; �0Þ ¼ 0 for every other �0 2 Bt.It is easy to see that for every t ¼ 1; . . . ; n, �t is a random signed

permutation of t elements, uniformly chosen among the2tt! elements of Bt. In our case, the random variable of interest isXn :¼ cð�nÞ, which we will show to be finite Markov chainembeddable. We construct a finite Markov chain ðYtÞ1�t�nverifying conditions ðiÞ, ðiiÞ, and ðiiiÞ as follows.

For every t ¼ 1; . . . ; n, we denote by Kj;t; j ¼ 1; . . . ; nþ 1 therandom variables representing the number of cycles of length j inthe breakpoint graph of the permutation �t. We also denote by Ltthe length of the cycle in Gð�tÞ which contains the terminal pointT . For every t ¼ 1; . . . ; n, we obviously have Kj;t ¼ 0 for j ¼tþ 2; . . . ; nþ 1 and

Xtþ1

j¼1

jKj;t ¼ tþ 1;Xtþ1

j¼1

Kj;t ¼ cð�tÞ:

We let

Yt :¼ ðLt;K1;t; . . . ;Knþ1;tÞ; t ¼ 1; . . . ; n:

We call Yt the type of the permutation �t. For example, thepermutation � ¼ ½þ3;�4;�2;þ1;þ5� from Fig. 1 is of typeð1; 1; 0; 0; 0; 1; 0Þ. Note that for every 1 � t � n, Yt takes values inthe finite set

Et ¼(ð‘; k1; . . . ; ktþ1; 0; . . . ; 0|fflfflfflffl{zfflfflfflffl}

n�t

Þ : ‘ 2 f1; . . . ; tþ 1g;

Xtþ1

j¼1

jkj ¼ tþ 1; k‘ � 1

):

Let us denote k!

:¼ ðk1; . . . ; knþ1Þ. We have

IPðcð�nÞ ¼ xÞ ¼ IPðYn 2 CxÞ;

where, for every x ¼ 1; 2; . . . ; nþ 1,

Cx ¼ ð‘; k!Þ :Xnþ1

j¼1

kj ¼ x;Xnþ1

j¼1

jkj ¼ nþ 1; k‘ � 1

( ):

We will show that ðYtÞ1�t�n is an inhomogeneous Markovchain. The initial distribution of Y1 is

IPðY1 ¼ ð1; 2; 0; 0; . . . ; 0ÞÞ ¼ IPðY1 ¼ ð2; 0; 1; 0; . . . ; 0ÞÞ ¼ 1=2;

the case Y1 ¼ ð1; 2; 0; 0; . . . ; 0Þ corresponding to �1 ¼ ½þ1� andthe case Y1 ¼ ð2; 0; 1; 0; . . . ; 0Þ to �1 ¼ ½�1�.

For 2 � t < n, write Yt�1 ¼ ð‘; k!Þ. Note that necessarily k‘ � 1.

We have the following result:

Proposition 1. ðYtÞ1�t�n is an inhomogeneous Markov chain of initial

distribution

IPðY1 ¼ ð1; 2; 0; 0; . . . ; 0ÞÞ ¼ IPðY1 ¼ ð2; 0; 1; 0; . . . ; 0ÞÞ ¼ 1=2

and the following transition probabilities.

If Yt�1 ¼ ð‘; k!Þ, with k‘ � 1, then the possible transitions are to

Yt ¼ ð‘0; k0!Þ, where

1. ‘0 ¼ ‘þ 1 and k0!¼ k!� e0‘ þ e0‘þ1, with probability ‘=ð2tÞ;

2. ‘0 ¼ j, with 1 � j � ‘, and k0!¼ k!� e0‘ þ e0j þ e0‘þ1�j, with

probability 1=ð2tÞ;3. ‘0 ¼ ‘þ xþ 1, with 1 � x � t� ‘; x 6¼ ‘ and k0

!¼ k!�

e0‘ � e0x þ e0‘þxþ1, with probability xkx=t;

4. ‘0 ¼ 2‘þ 1 and k0!¼ k!� 2e0‘ þ e02‘þ1, with probability

‘ðk‘ � 1Þ=t,where for each i, e0i is the row vector having 1 at the ith coordinate and

0 elsewhere.

Proof. For a permutation � of type ð‘; k!Þ, we will show that IPðYt ¼ð‘0; k0!Þj�t�1 ¼ �Þ depends only on ‘0; k0

!; ‘; k!

. This easily implies

that ðYtÞ1�t�n is a Markov chain.Suppose now that �t�1 ¼ �, with � being of type ð‘; k!Þ. In

Fig. 2, we have the disjoint cycle decomposition of thebreakpoint graph of the permutation �5 ¼ � ¼ ½þ4;�2;�1;þ5;þ3�. In this case, we have ‘ ¼ 5 and k

!¼ ð1; 0; 0; 0; 1; 0Þ.We will investigate the changes produced in the breakpoint

graph when inserting the new element �t, at random, into oneof the t possible positions of the permutation �, with the “þ”sign with probability 1=2 and the “�” sign with probability 1=2.

The modifications concerning the dashed edges are simple.Disregarding the sign of �t, the dashed edge between 2ðt� 1Þand T is deleted and replaced by a dashed edge between 2ðt�1Þ and 2t� 1, and then another dashed edge is added between2t and T (see, for example, Fig. 3).

Concerning the solid edges: we choose at random a solidedge among the t solid edges in the breakpoint graph of �, wedelete it and then add two other solid edges to connect the twoextremities of the deleted edge to 2t� 1 and 2t, respectively, inone of the two possible ways. The choice of the solid edge to bedeleted corresponds to the choice of the position in thepermutation � where �t is inserted. The way in which weconnect the two extremities of the deleted edge to 2t� 1 and 2t,respectively, corresponds to the sign of the element t.

More precisely, if we choose to insert the element �t in theposition i, where 2 � i � t� 1, then we will delete the solid edgebetween ð�ði� 1ÞÞR and ð�ðiÞÞL. If we insertþt, then we will addtwo solid edges between ð�ði� 1ÞÞR and 2t� 1 and between 2tand ð�ðiÞÞL. If we insert �t, then we will add two solid edgesbetween ð�ði� 1ÞÞR and 2t and between 2t� 1 and ð�ðiÞÞL.

If we choose to insert the element �t in the position 1, i.e., atthe beginning of the permutation �, then we will delete the

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. 5, SEPTEMBER/OCTOBER 2011 1413

Fig. 2. The disjoint cycle decomposition of Gð�Þ, for � ¼ ½þ4;�2;�1;þ5;þ3�.

Page 4: On the Distribution of the Number of Cycles in the Breakpoint Graph of a Random Signed Permutation

solid edge between S and ð�ð1ÞÞL. If we insert þt, we add twosolid edges between S and 2t� 1 and between 2t and ð�ð1ÞÞL,and if we insert �t, we add two solid edges between S and 2t,and between 2t� 1 and ð�ð1ÞÞL.

If we choose to insert �t in the position t, i.e., at the end ofthe permutation �, then we will delete the solid edge betweenð�ðtÞÞR and T . If we insert þt, we add two solid edges betweenð�ðt� 1ÞÞR and 2t� 1 and between 2t and T , and if we insert�t, we add two solid edges between ð�ðt� 1ÞÞR and 2t andbetween 2t� 1 and T .

Cases 1 and 2 in the statement correspond to the deletion ofa solid edge from the cycle containing T , and cases 3 and 4correspond to the deletion of a solid edge belonging to a cyclenot containing T .

If we delete a solid edge belonging to the cycle of size ‘which

contains T , then we have two possible situations, depending on

the deleted solid edge and on the permutation �. One situation is

that, when we insert þt, the cycle containing T grows to the

length ‘þ 1 (see Fig. 3), and when we insert �t, it splits into two

smaller cycles, of sizes which sum to ‘þ 1 (see Fig. 4). The other

possible situation is the converse, i.e., when we insert �t, the

cycle containing T becomes of size ‘þ 1 (see Fig. 5), and when

we insert þt, it splits into two smaller cycles, of sizes which sum

to ‘þ 1 (see Fig. 6).

The event that the cycle containing T becomes of size ‘þ 1occurs with probability ‘=ð2tÞ, because we have ‘ possible solidedges to choose in the cycle containing T . In the case when the

1414 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. 5, SEPTEMBER/OCTOBER 2011

Fig. 5. For �6 ¼ ½�6;þ4;�2;�1;þ5;þ3�, the cycle containing T grows to the

length ‘0 ¼ ‘þ 1 ¼ 6. The element �6 is inserted into �5 ¼ ½þ4;�2;�1;þ5;þ3� inposition i ¼ 1, corresponding to the deletion of the solid edges S-7.

Fig. 3. For �6 ¼ ½þ4;�2;�1;þ6;þ5;þ3�, the cycle containing T grows to thelength ‘0 ¼ ‘þ 1 ¼ 6. The element þ6 is inserted into �5 ¼ ½þ4;�2;�1;þ5;þ3� inposition i ¼ 4, corresponding to the deletion of the solid edges 1-9.

Fig. 4. For �6 ¼ ½þ4;�2;�1;�6;þ5;þ3�, the cycle containing T splits into two and‘0 ¼ 3. The element �6 is inserted into �5 ¼ ½þ4;�2;�1;þ5;þ3� in position i ¼ 4,corresponding to the deletion of the solid edges 1-9.

Fig. 6. For �6 ¼ ½þ6;þ4;�2;�1;þ5;þ3�, the cycle containing T splits into two and‘0 ¼ 2. The element þ6 is inserted into �5 ¼ ½þ4;�2;�1;þ5;þ3� in position i ¼ 1,corresponding to the deletion of the solid edges S-7.

Page 5: On the Distribution of the Number of Cycles in the Breakpoint Graph of a Random Signed Permutation

cycle containing T splits into two cycles, the new size j of thecycle which will contain T is chosen at random, uniformlybetween 1 and ‘. The size of the second cycle is then simply‘þ 1� j. Each size j corresponds to a specific choice for thedeleted solid edge; hence, the event that the cycle containingT splits into two cycles and the size of the new cycle which willcontain T becomes j occurs with probability 1=ð2tÞ.

If we delete a solid edge from a cycle not containing T , thendisregarding the sign of t, this cycle will merge with the onecontaining T . If the cycle from which we have deleted a solidedge was of size x, then in the breakpoint graph of �t the cyclecontaining T will be of size ‘þ xþ 1.

In case 3, x represents the length of the cycle not containingT from which we choose a solid edge to be deleted. If x 6¼ ‘, theprobability that this event occurs equals xkx=t, because we havekx cycles of length x that we can choose, and each of themcontains x solid edges.

Case 4 corresponds to the case x ¼ ‘, when we have onlyk‘ � 1 possibilities to choose a cycle of size ‘ not containing T .tu

Proposition 1 describes the entries of the transition prob-

ability matrix Pt of the inhomogeneous Markov chain ðYtÞt. As

described in (3), we can, therefore, obtain the distribution of

cð�nÞ via the product of n transition matrices of this Markov

chain.

2.3 Numerical Results

We have implemented an iterative procedure which, for a given

n, computes numerically the distribution of Yn and then the

distribution of cð�nÞ. At each step t ¼ 1; . . . ; n� 1, we compute

the distribution of Ytþ1 from the distribution of Yt, using the

transition probabilities described in Proposition 1. The complexity

of our algorithm is Oðn2 � pðnþ 1ÞÞ, where p is the partition

function, i.e., for every positive integer m, pðmÞ is the number of

integer partitions of m. An asymptotic expression for pðmÞ is

given by

pðmÞ expð�ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffið2mÞ=3

4mffiffiffi3p ; as m!1:

In Tables 1 and 2, we give the distribution of cð�Þ for a

random uniform signed permutation � of 20 and 30 elements,

respectively. In the two tables, pk denotes the probability that

cð�Þ takes the value k. For the values of k not appearing in the

tables, the corresponding probabilities are negligible. On a

Pentium 4 processor, 3.1 Mhz, 512 Mb, the computation time

was 13 s for n ¼ 20, 300 s for n ¼ 30, 4� 103 s for n ¼ 40, and

4� 104 s for n ¼ 50.

3 CONCLUDING REMARKS

In this paper, we have obtained the distribution of the number ofalternating cycles in the breakpoint graph of a random signedpermutation, in the form of a product of transition probabilitymatrices of a certain finite Markov chain, using the finite Markovchain embedding technique. A drawback of our method is the factthat our Markov chain is inhomogeneous and of large dimension,which induces a high computational complexity.

A plan for a future work is to find a closed analytic formula forthe exact distribution of the number of cycles in the breakpointgraph of a random signed permutation.

ACKNOWLEDGMENTS

The author would like to thank Etienne Pardoux, the thesisadvisor, for all his support during this work, and Pierre Pontarotti,the second thesis advisor, for helpful biological discussions. Theauthor would also like to thank Anthony Labarre for explaining hisresults on the number of cycles in the breakpoint graph of arandom unsigned permutation. This work was partially supportedby the ANR MAEV under contract ANR-06-BLAN-0113.

REFERENCES

[1] D.A. Bader, B.M.E. Moret, and M. Yan, “A Linear-Time Algorithm forComputing Inversion Distance between Signed Permutations with anExperimental Study,” J. Computational Biology, vol. 8, no. 5, pp. 483-491, Oct.2001.

[2] V. Bafna and P. Pevzner, “Genome Rearrangements and Sorting byReversals,” SIAM J. Computing, vol. 25, no. 2, pp. 272-289, Feb. 1996,doi:10.1137/S0097539793250627.

[3] A. Bergeron, C. Chauve, and Y. Gingras, “Formal Models of GeneClusters,” Bioinformatics Algorithms: Techniques and Applications, I. Mandoiuand A. Zelikovsky, eds., Wiley Series of Bioinformatics, 2008.

[4] A. Bergeron, S. Corteel, and M. Raffinot, “The Algorithmic of Gene Teams,”Proc. Int’l Workshop Algorithms in Bioinformatics, pp. 464-476, Jan. 2002,doi:10.1007/3-540-45784-4_36.

[5] A. Bergeron, J. Mixtacki, and J. Stoye, “A Unifying View of GenomeRearrangements,” Proc. Int’l Workshop Algorithms in Bioinformatics, vol. 4175,pp. 163-173, Sept. 2006, doi:10.1007/11851561_16.

[6] G. Blin and J. Stoye, “Finding Nested Common Intervals Efficiently,” Proc.Int’l Workshop Comparative Genomics, pp. 59-69, Sept. 2009, doi:10.1007/978-3-642-04744-2_6.

[7] S. Bocker, K. Jahn, J. Mixtacki, and J. Stoye, “Computation of Median GeneClusters,” J. Computational Biology, vol. 16, no. 8, pp. 1085-1099, Aug. 2009,doi:10.1089/cmb.2009.0098.

[8] M. Bona and R. Flynn, “The Average Number of Block InterchangesNeeded to Sort a Permutation and a Recent Result of Stanley,” InformationProcessing Letters, vol. 109, no. 16, pp. 927-931, July 2009, doi:10.1016/j.ipl.2009.04.019.

[9] A. Caprara, “On the Tightness of the Alternating-Cycle Lower Bound forSorting by Reversals,” J. Combinatorial Optimization, vol. 3, nos. 2/3, pp. 149-182, July 1999, doi:10.1023/A:1009838309166.

[10] V. Choi, C. Zheng, Q. Zhu, and D. Sankoff, “Algorithms for the Extractionof Synteny Blocks from Comparative Maps,” Proc. Int’l Workshop Algorithmsin Bioinformatics, pp. 277-288, Aug. 2007, doi:10.1007/978-3-540-74126-8_26.

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. 5, SEPTEMBER/OCTOBER 2011 1415

TABLE 1The Distribution of cð�Þ for a Random Signed Permutation � 2 B20

TABLE 2The Distribution of cð�Þ for a Random Signed Permutation � 2 B30

Page 6: On the Distribution of the Number of Cycles in the Breakpoint Graph of a Random Signed Permutation

[11] D.A. Christie, “Sorting Permutations by Block-Interchanges,” InformationProcessing Letters, vol. 60, no. 4, pp. 165-169, Nov. 1996, doi:10.1016/S0020-0190(96)00155-X.

[12] S. Corteel, G. Louchard, and R. Pemantle, “Common Intervals inPermutations,” Discrete Math. and Theoretical Computer Science, vol. 8,no. 1, pp. 189-216, 2006.

[13] E. Danchin and P. Pontarotti, “Statistical Evidence for a More than 800-Million-Year-Old Evolutionarily Conserved Genomic Region in OurGenome,” J. Molecular Evolution, vol. 59, no. 5, pp. 587-597, Nov. 2004.

[14] J-P. Doignon and A. Labarre, “On Hultman Numbers,” J. Integer Sequences,vol. 10, article no. 07.6.2, 2007.

[15] D. Durand and D. Sankoff, “Tests for Gene Clustering,” J. ComputationalBiology, vol. 10, nos. 3/4, pp. 453-482, June 2003, doi:10.1089/10665270360688129.

[16] J.C. Fu and M.V. Koutras, “Distribution Theory of Runs: A Markov ChainApproach,” J. Am. Statistical Assoc., vol. 89, no. 427, pp. 1050-1058, Sept.1994.

[17] S. Grusea, “Measures for the Exceptionality of Gene Order in ConservedGenomic Regions,” Advances in Applied Math., vol. 45, no. 3, pp. 359-372,Sept. 2010, doi:10.1016/j.aam.2010.02.002.

[18] S. Hannenhalli and P. Pevzner, “Transforming Cabbage into Turnip:Polynomial Algorithm for Sorting Signed Permutations by Reversals,”J. ACM, vol. 46, no. 1, pp. 1-27, Jan. 1999.

[19] R. Hoberman and D. Durand, “The Incompatible Desiderata of GeneCluster Properties,” Proc. Int’l Workshop Comparative Genomics, pp. 73-87,Sept. 2005, doi:10.1007/11554714.

[20] R. Hoberman, D. Sankoff, and D. Durand, “The Statistical Analysis ofSpatially Clustered Genes under the Maximum Gap Criterion,”J. Computational Biology, vol. 12, no. 8, pp. 1083-1102, Oct. 2005,doi:10.1089/cmb.2005.12.1083.

[21] J. Kececioglu and D. Sankoff, “Exact and Approximation Algorithms forSorting by Reversals, with Application to Genome Rearrangement,”Algorithmica, vol. 13, nos. 1/2, pp. 180-210, Feb. 1995, doi:10.1007/BF01188586.

[22] Z. Li, L. Wang, and K. Zhang, “Algorithmic Approaches for GenomeRearrangement: A Review,” IEEE Trans. Systems, Man and Cybernetics, PartC, vol. 36, no. 5, pp. 636-648, Sept. 2006.

[23] J. Nadeau and B. Taylor, “Lengths of Chromosomal Segments Conservedsince Divergence of Man and Mouse,” Proc. Nat’l Academy of Sciences USA,vol. 81, pp. 814-818, 1984.

[24] P. Pevzner and G. Tesler, “Genome Rearrangements in MammalianEvolution: Lessons from Human and Mouse Genomes,” Genome Research,vol. 13, no. 1, pp. 37-45, Jan. 2003.

[25] N. Raghupathy and D. Durand, “Gene Cluster Statistics with GeneFamilies,” Molecular Biology and Evolution, vol. 26, no. 5, pp. 957-968, Jan.2009, doi:10.1093/molbev/msp002.

[26] D. Sankoff and L. Haque, “Power Boosts for Cluster Tests,” Proc. Int’lWorkshop Comparative Genomics, pp. 121-130, Dec. 2005, doi:10.1007/11554714_11.

[27] D. Sankoff and L. Haque, “The Distribution of Genomic Distance betweenRandom Genomes,” J. Computational Biology, vol. 13, no. 5, pp. 1005-1012,June 2006, doi:10.1089/cmb.2006.13.1005.

[28] K.M. Swenson, Y. Lin, V. Rajan, and B.M.E. Moret, “Hurdles Hardly Haveto Be Heeded,” Proc. Int’l Workshop Comparative Genomics (RECOMB-CG’08), pp. 239-249, 2008, doi:10.1007/978-3-540-87989-3_18.

[29] K.M. Swenson, V. Rajan, Y. Lin, and B.M.E. Moret, “Sorting SignedPermutations by Inversions in Oðn lognÞ Time,” Proc. Ann. Int’l Conf.Research in Computational Molecular Biology, pp. 386-399, May 2009,doi:10.1007/978-3-642-02008-7_28.

[30] E. Tannier, A. Bergeron, and M-F. Sagot, “Advances on Sorting byReversals,” Discrete Applied Math., vol. 155, nos. 6/7, pp. 881-888, Apr.2007, doi:10.1016/j.dam.2005.02.033.

[31] W. Xu, “The Distance between Randomly Constructed Genomes,” Proc.Fifth Asia-Pacific Bioinformatics Conf., pp. 227-236, Oct. 2006, doi:10.1142/9781860947995_0025.

[32] W. Xu, C. Zheng, and D. Sankoff, “Paths and Cycles in Breakpoint Graph ofRandom Multichromosomal Genomes,” J. Computational Biology, vol. 14,no. 4, pp. 423-435, May 2007, doi:10.1089/cmb.2007.A004.

[33] S. Yancopoulos, O. Attie, and R. Friedberg, “Efficient Sorting of GenomicPermutations by Translocation, Inversion and Block Interchange,” Bioinfor-matics, vol. 21, no. 16, pp. 3340-3346, Aug. 2005, doi:10.1093/bioinfor-matics/bti535.

[34] Q. Zhu, Z. Adam, V. Choi, and D. Sankoff, “Generalized GeneAdjacencies, Graph Bandwidth, and Clusters in Yeast Evolution,”IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 6, no. 2,pp. 213-220, Apr.-June 2009.

. For more information on this or any other computing topic, please visit ourDigital Library at www.computer.org/publications/dlib.

1416 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. 5, SEPTEMBER/OCTOBER 2011