1 Multiple Genome Alignment: Chaining Algorithms Revisited Graduiertenkolleg Bioinformatik Universität…

1

Multiple Genome Alignment:

Chaining Algorithms Revisited

Graduiertenkolleg Bioinformatik

Universität Bielefeld

Mohamed I. Abouelhoda, University of Bielefeld

Joint work with

Enno Ohlebusch, University of Ulm

Germany, 2003

2

Comparative Genomics

The practice of analysing the genomic material of a species by comparing it with the genomic material of other species.

Why is this important?

The next logical step after the high throughput sequencing projects.

Deducing the mechanism and history of genome evolution.

Discovering genes and regulatory elements.

Identifying exons in eukaryotic genes.

Revealing the role of non-coding conserved sequences.

3

The Strategy

Closely related organisms: no (or few) genome rearrangements.

Finding conserved regions (genes and regulatory elements).

Detecting mutations.

Detecting unique genes (e.g., pathogenic genes in bacterial genomes).

Strategy: Global Sequence Alignment

Conserved regions are more significant in a multiple genome alignment.

[Figure labels: Mutation, Pathogenic Gene]

4

Multiple Genome Alignment is Difficult

Given k genomes, each of average length N (N is very large: megabases), standard dynamic programming takes $O(N^k)$ time, which is impractical even for k=2.

Heuristic algorithms are therefore employed:

Genomes    Program    Year   Authors
Two        PipMaker   2000   Schwartz et al.
Two        DIALIGN    2000   Morgenstern
Two        MUMmer     2002   Delcher et al.
Two        CHAOS      2002   Brudno and Morgenstern
Two        OWEN       2002   Roytberg et al.
Two        AVID       2003   Bray et al.
Multiple   MGA        2002   Höhl et al.

These tools use an anchor-based alignment method.

5

MGA

MGA uses a strategy composed of three steps:

1. Computation of fragments (maximal multiple exact matches).

2. Computation of an optimal chain of colinear non-overlapping fragments.

3. Detailed alignment of the regions between the fragments of the optimal chain.

[Figure: First Genome G1, Second Genome G2, Third Genome G3]

7





2. Computation of an optimal chain of colinear non-overlapping fragments (anchors).

[Figure: anchors across First Genome G1, Second Genome G2, Third Genome G3]

8

The Chaining Problem

Given n weighted fragments from k genomes, find the chain C of colinear non-overlapping fragments such that its total score is maximum over all other chains.

$score(C) = \sum_i [ f_{i+1}.weight - g(f_{i+1}, f_i) ]$

where $g(f_{i+1}, f_i)$ is the gap cost of connecting $f_{i+1}$ to $f_i$.

[Figure: First Genome G1, Second Genome G2, Third Genome G3]

10

Previous Work

A graph-based solution takes $O(n^2)$ time.

Geometric-based algorithms are subquadratic (sparse dynamic programming):

1. Zhang et al. (1994) used space division with a kd-tree (no complexity analysis was given).

2. Myers and Miller (1995) used orthogonal range search with a range tree, yielding a complexity of $O(n \log^k n)$ time and $O(n \log^{k-1} n)$ space.

"But the result is a time bound higher by a logarithmic factor than what one would expect." (David Eppstein)

Sobel-Martinez, 1986; Wilbur-Lipman, 1983; Eppstein-Giancarlo, 1992

11

Previous Work

For two genomes, the complexities, $O(n \log^2 n)$ time and $O(n \log n)$ space, are also higher than those of known 2-dim. chaining algorithms.

"We thought hard to reduce this discrepancy but have been unable to do so and the reasons appear to be fundamental! To improve upon our result appears to be a difficult open problem!" (Myers & Miller)

Here, it is improved by almost two log factors in time and one log factor in space:

For k=2: $O(n \log n)$ time and $O(n)$ space.

For k>2: $O(n \log^{k-2} n \log\log n)$ time and $O(n \log^{k-2} n)$ space.

12

The Problem Revisited

• Any kind of fragment can be used (fragments can also contain mismatches, insertions/deletions).

• A fragment $f_i$ is represented as a hyper-rectangle in a k-dimensional space.

• A fragment $f_i$ is identified with its start and end points: $start(f_i)$ and $end(f_i)$.

• We add two imaginary fragments O and t with weight zero.

• Any two fragments $f_i$ and $f_{i+1}$ in the chain must be colinear and non-overlapping:

$f_i \ll f_{i+1}$: $end(f_i).x_r < start(f_{i+1}).x_r$ for all r, $0 < r \le k$
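The colinearity relation above can be sketched in a few lines. The `Fragment` type and the `precedes` name are illustrative assumptions, not taken from the talk; coordinates are assumed to be integer positions, one per genome.

```python
from typing import NamedTuple, Tuple


class Fragment(NamedTuple):
    start: Tuple[int, ...]  # start point, one coordinate per genome
    end: Tuple[int, ...]    # end point, one coordinate per genome
    weight: float


def precedes(f: Fragment, g: Fragment) -> bool:
    """f << g: end(f).x_r < start(g).x_r in every dimension r."""
    return all(e < s for e, s in zip(f.end, g.start))


a = Fragment((1, 1, 1), (3, 3, 3), 3.0)
b = Fragment((6, 7, 5), (8, 9, 7), 3.0)
c = Fragment((2, 8, 4), (4, 10, 6), 3.0)

print(precedes(a, b))  # True: a ends before b starts in all three genomes
print(precedes(a, c))  # False: in the first genome, 3 is not smaller than 2
```

Because the relation is a conjunction over all k dimensions, two fragments that cross in a single genome are never chained together.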

13

The Solution

The score of a chain C is

$score(C) = \sum_i [ f_i.weight - g(f_i, f_{i-1}) ]$

An optimal chain is a chain of maximum score.

The maximum score can be computed by the recurrence

$f_j.score = f_j.weight + \max\{ f_i.score - g(f_i, f_j) : f_i \ll f_j \}$

where

$f_i \ll f_j$: $end(f_i).x_r < start(f_j).x_r$ for all r, $0 < r \le k$

$g(f_i, f_j)$ is the gap cost of connecting $f_i$ to $f_j$

Sparse Dynamic Programming

A graph-based solution takes $O(n^2)$ time.
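A minimal sketch of the quadratic baseline, evaluating the recurrence directly over all pairs of fragments. The tuple layout, the function names and the default zero gap cost are assumptions made for illustration; as a simplification, a chain may start at any fragment without paying a gap cost to the imaginary origin O.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[int, ...]
Frag = Tuple[Point, Point, float]  # (start point, end point, weight)


def precedes(f: Frag, g: Frag) -> bool:
    """f << g: end(f) is strictly smaller than start(g) in every dimension."""
    return all(e < s for e, s in zip(f[1], g[0]))


def zero_gap(f: Frag, g: Frag) -> float:
    return 0.0


def chain_quadratic(frags: Sequence[Frag],
                    gap: Callable[[Frag, Frag], float] = zero_gap
                    ) -> Tuple[float, List[int]]:
    """Return (best score, indices of the fragments on an optimal chain)."""
    n = len(frags)
    if n == 0:
        return 0.0, []
    # Assuming start <= end, every predecessor of g starts strictly before g
    # in x_1, so processing by start.x_1 sees all predecessors first.
    order = sorted(range(n), key=lambda i: frags[i][0][0])
    score = [0.0] * n
    prev = [-1] * n
    for pos, j in enumerate(order):
        score[j] = frags[j][2]  # chain consisting of f_j alone
        for i in order[:pos]:
            if precedes(frags[i], frags[j]):
                cand = frags[j][2] + score[i] - gap(frags[i], frags[j])
                if cand > score[j]:
                    score[j], prev[j] = cand, i
    best = max(range(n), key=score.__getitem__)
    chain = []
    j = best
    while j != -1:  # follow the predecessor links back to the chain start
        chain.append(j)
        j = prev[j]
    return score[best], chain[::-1]


frags = [((1, 1), (3, 3), 3.0), ((4, 5), (6, 7), 3.0),
         ((5, 2), (8, 4), 4.0), ((9, 8), (10, 9), 2.0)]
print(chain_quadratic(frags))  # (8.0, [0, 1, 3])
```

The inner loop over all earlier fragments is what makes this quadratic; the geometric approach replaces it with a single range maximum query.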

14

Geometric-based Solution

The recurrence

$f_j.score = f_j.weight + \max\{ f_i.score - g(f_i, f_j) : f_i \ll f_j \}$

can be written as

$f_j.score = f_j.weight + RMQ\{O, start(f_j)\}$

RMQ (Range Maximum Query)

Retrieves the fragment $f_i$ whose end point lies in the hyper-rectangle bounded by O and $start(f_j)$ such that $f_i.score - g(f_i, f_j)$ is maximum.

RMQ is applied using the start and end points.

15

Overview of the Algorithm

The algorithm uses techniques from computational geometry:

1. Line-sweep algorithm.

2. The algorithm works on the start and end points of the fragments.

3. RMQ using a semi-dynamic data structure: the range tree.

4. Proper inclusion of the gap costs into the fragment weight.

The recurrence is

$f_j.score = f_j.weight + RMQ\{O, start(f_j)\}$

If the gap cost is zero, an RMQ returns the end point of the fragment $f_i$ such that $f_i.score = \sum_{0 \le r \le i} f_r.weight$ is maximum.

16

The Algorithm without Gap Costs

The recurrence is

$f_j.score = f_j.weight + RMQ\{O, start(f_j)\}$

Line-sweep algorithm

1. Sort the start and end points of the fragments w.r.t. $x_1$.

2. If a start point of a fragment, say $f_j$, is scanned, apply the $RMQ(O, (start(f_j).x_2, \ldots, start(f_j).x_k))$ to the set of active end points and update the score of the end point of fragment $f_j$.

3. Otherwise, add the end point to the set of active end points (already scanned end points).

The first step reduces the dimension of the RMQ to k-1.
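The line sweep can be sketched for k=2 and zero gap cost as follows. This is only an illustration under stated assumptions: a Fenwick tree (binary indexed tree) over the ranks of the end-point $x_2$ coordinates stands in for the range tree with priority queues, and the names `PrefixMax` and `chain_line_sweep` are hypothetical. With the sweep over $x_1$, the remaining RMQ is one-dimensional.

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple

Frag = Tuple[Tuple[int, int], Tuple[int, int], float]  # (start, end, weight)


class PrefixMax:
    """Semi-dynamic 1-D structure: activate a point, query a prefix maximum."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.tree = [(float("-inf"), -1)] * (size + 1)

    def activate(self, pos: int, score: float, idx: int) -> None:
        i = pos + 1
        while i <= self.size:
            if self.tree[i][0] < score:
                self.tree[i] = (score, idx)
            i += i & -i

    def rmq(self, count: int) -> Tuple[float, int]:
        """Best (score, fragment index) among the first `count` ranks."""
        best = (float("-inf"), -1)
        i = count
        while i > 0:
            if self.tree[i][0] > best[0]:
                best = self.tree[i]
            i -= i & -i
        return best


def chain_line_sweep(frags: Sequence[Frag]) -> Tuple[float, List[int]]:
    n = len(frags)
    if n == 0:
        return 0.0, []
    ys = sorted({f[1][1] for f in frags})  # distinct x_2 values of end points
    active = PrefixMax(len(ys))
    score = [0.0] * n
    prev = [-1] * n
    # kind 0 = start point, 1 = end point; on equal x_1 the start points come
    # first, which enforces the strict inequality end(f_i).x_1 < start(f_j).x_1.
    events = [(f[0][0], 0, j) for j, f in enumerate(frags)]
    events += [(f[1][0], 1, j) for j, f in enumerate(frags)]
    events.sort()
    for _, kind, j in events:
        if kind == 0:
            # Only end points with x_2 strictly below start(f_j).x_2 qualify.
            best, i = active.rmq(bisect_left(ys, frags[j][0][1]))
            score[j] = frags[j][2] + max(best, 0.0)
            prev[j] = i if best > 0 else -1
        else:
            active.activate(bisect_left(ys, frags[j][1][1]), score[j], j)
    last = max(range(n), key=score.__getitem__)
    chain = []
    j = last
    while j != -1:
        chain.append(j)
        j = prev[j]
    return score[last], chain[::-1]


frags = [((1, 1), (3, 3), 3.0), ((4, 5), (6, 7), 3.0),
         ((5, 2), (8, 4), 4.0), ((9, 8), (10, 9), 2.0)]
print(chain_line_sweep(frags))  # (8.0, [0, 1, 3])
```

Each of the 2n events costs one logarithmic-time activation or query, so this sketch runs in $O(n \log n)$ time and $O(n)$ space for k=2.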

If the gap cost is zero, an RMQ returns the end point of the fragment $f_i$ such that $f_i.score = \sum_{0 \le r \le i} f_r.weight$ is maximum.

17

The Complexity of the Algorithm

The complexity of the algorithm depends on the complexity of the RMQ in d = k-1 dimensions.

The required data structure D to manipulate the set of points

• is a semi-dynamic data structure over all end points in d = k-1 dimensions

• supports the operations:

1. Activate an end point.

2. Perform an RMQ.

D is implemented as a range tree

1. supported by fractional cascading (Willard, 1985).

2. enhanced with priority queues (van Emde Boas, 1977; Johnson, 1982).

For n fragments and dimension d, the RMQ and activation take:

$O(n \log^{d-1} n \log\log n)$ time and $O(n \log^{d-1} n)$ space

Since d = k-1 > 1, the complexity of the algorithm is

$O(n \log^{k-2} n \log\log n)$ time and $O(n \log^{k-2} n)$ space

For k=2, the total complexity is $O(n \log n)$ time and $O(n)$ space.

18

Including Gap Costs

The gap cost should be included in the RMQ; otherwise the algorithm would be quadratic.

Recall the recurrence

$f_j.score = f_j.weight + RMQ\{O, start(f_j)\}$

How to define the gap costs?

How to include the gap costs without affecting the complexity?

[Figure: dot plot with two fragments; axis sequence ACCYYACC]

19

Types of Gap Costs

The gap costs g can be described geometrically:

L1: $g_1(f', f) = d_1(start(f), end(f')) = \sum_{i=1}^{k} |start(f).x_i - end(f').x_i|$

L∞: $g_\infty(f', f) = d_\infty(start(f), end(f')) = \max_{1 \le i \le k} |start(f).x_i - end(f').x_i|$

Two sequences:

L1                     L∞
ACC YYY _ _ ACC        ACC YYY ACC
ACC _ _ _ XX ACC       ACC _ XX ACC
$g_1(f', f) = 5$       $g_\infty(f', f) = 3$

Three sequences:

L1                          L∞
ACC XX _ _ _ _ _ ACC        ACC _ XX ACC
ACC _ _ YYY _ _ ACC         ACC YYY ACC
ACC _ _ _ _ _ ZZ ACC        ACC _ ZZ ACC
$g_1(f', f) = 7$            $g_\infty(f', f) = 3$

The L∞ and the sum-of-pairs gap cost follow the same idea as the L1.

[Figure: dot plot of the fragments; axes x and y, axis sequence ACCXXACC]
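The two gap costs can be sketched as plain distance functions. The function names are illustrative, and the coordinates below are an assumption: they are chosen so that each coordinate difference equals the number of unaligned characters between the two fragments (2 for XX or ZZ, 3 for YYY), which reproduces the values on the slide.

```python
from typing import Sequence


def gap_l1(end_prev: Sequence[int], start_next: Sequence[int]) -> int:
    """L1 gap cost: sum of the coordinate differences."""
    return sum(abs(s - e) for s, e in zip(start_next, end_prev))


def gap_linf(end_prev: Sequence[int], start_next: Sequence[int]) -> int:
    """L-infinity gap cost: largest coordinate difference."""
    return max(abs(s - e) for s, e in zip(start_next, end_prev))


# Two sequences: XX (2 characters) and YYY (3 characters) lie between the ACCs.
print(gap_l1((3, 3), (5, 6)), gap_linf((3, 3), (5, 6)))  # 5 3

# Three sequences: XX, YYY and ZZ lie between the ACCs.
print(gap_l1((3, 3, 3), (5, 6, 5)), gap_linf((3, 3, 3), (5, 6, 5)))  # 7 3
```

The L1 cost charges every unaligned character as an indel, while the L∞ cost lets characters in different sequences share alignment columns.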


21

Including Gap Costs in L1

We define the geometric cost of a fragment f as follows:

$gc(f) = d_1(t, end(f))$

where $d_1(t, end(f))$ is the distance in the L1 metric between t and end(f).

$f_1.score - g(f_1, f) > f_2.score - g(f_2, f)$

iff

$f_1.score - gc(f_1) > f_2.score - gc(f_2)$

gc(f) is a constant that can be precomputed and attached to the fragment's weight.

For k>2, the complexity is $O(n \log^{k-2} n \log\log n)$ time and $O(n \log^{k-2} n)$ space.

For k=2, the complexity is $O(n \log n)$ time and $O(n)$ space.

[Figure: fragments $f_1$ and $f_2$ preceding f]
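One way to see why the equivalence holds, assuming t lies beyond every fragment in every coordinate: for any predecessor f' of f, the L1 distance from t to end(f') splits into the distance from t to start(f) plus the distance from start(f) to end(f'). Hence $g_1(f', f) = gc(f') - d_1(t, start(f))$, and the subtracted term is the same for all candidates f'. The sketch below checks this identity on random integer points; the function names are hypothetical.

```python
import random
from typing import Sequence


def d1(p: Sequence[int], q: Sequence[int]) -> int:
    """Distance between two points in the L1 metric."""
    return sum(abs(a - b) for a, b in zip(p, q))


def check_l1_identity(trials: int = 1000, k: int = 3, seed: int = 0) -> bool:
    """Check g_1(f', f) == gc(f') - d_1(t, start(f)) for random f' << f << t."""
    rng = random.Random(seed)
    for _ in range(trials):
        start_f = tuple(rng.randint(50, 100) for _ in range(k))
        t = tuple(rng.randint(101, 200) for _ in range(k))
        # end(f') is strictly smaller than start(f) in every coordinate
        end_prev = tuple(rng.randint(0, s - 1) for s in start_f)
        gap = d1(start_f, end_prev)   # g_1(f', f)
        gc = d1(t, end_prev)          # geometric cost of f'
        if gap != gc - d1(t, start_f):
            return False
    return True


print(check_l1_identity())  # True
```

Since the offset $d_1(t, start(f))$ does not depend on f', ranking candidates by score minus gap cost and by score minus geometric cost gives the same winner.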

22

Gap Costs in L∞

$g_\infty(f', f) = d_\infty(start(f), end(f')) = \max_{1 \le i \le k} |start(f).x_i - end(f').x_i|$

$g_\infty(f', f) = \max\{ \Delta_x(f', f), \Delta_y(f', f) \}$

In the octant O1, $g_\infty(f', f) = \Delta_x(f', f)$

In the octant O2, $g_\infty(f', f) = \Delta_y(f', f)$

The geometric cost of a fragment f is then:

$gc(f) = d_\infty(t, end(f))$

$f_1.score - g(f_1, f) > f_3.score - g(f_3, f)$

iff

$f_1.score - gc(f_1) > f_3.score - gc(f_3)$

gc(f) is a constant that can be precomputed and attached to the fragment's weight.

RMQ must be performed in every octant and the point of maximum score is chosen.

[Figure: fragments $f_1$, $f_2$, $f_3$ and the octants O1 and O2 relative to f]

23

RMQ on Octants

Because the RMQ requires an orthogonal range, we use the octant-to-quadrant transformation:

For the octant O1: $T_1: (x_1, x_2) \mapsto (x_1 - x_2, x_2)$

For the octant O2: $T_2: (x_1, x_2) \mapsto (x_1, x_2 - x_1)$

[Figure: points s, p, q in the octants O1 and O2, before and after the transformation]

The total complexity of the algorithm depends on the space division.

For k>2, the complexity is $O(k!\, n \log^{k-2} n \log\log n)$ time and $O(k!\, n \log^{k-2} n)$ space.

For k=2, the complexity is $O(2 n \log n)$ time and $O(2n)$ space.
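A small check of the octant-to-quadrant idea for k=2. It assumes that O1 is the octant in which the $x_1$ difference dominates and O2 the one in which the $x_2$ difference dominates, with closed boundaries; the function names are hypothetical.

```python
from itertools import product
from typing import Tuple

Point = Tuple[int, int]


def t1(p: Point) -> Point:
    """T_1: (x1, x2) -> (x1 - x2, x2)."""
    return (p[0] - p[1], p[1])


def t2(p: Point) -> Point:
    """T_2: (x1, x2) -> (x1, x2 - x1)."""
    return (p[0], p[1] - p[0])


def in_octant1(p: Point, q: Point) -> bool:
    """p lies below-left of q and the x1 difference dominates."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    return 0 <= dy <= dx


def in_octant2(p: Point, q: Point) -> bool:
    """p lies below-left of q and the x2 difference dominates."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    return 0 <= dx <= dy


def in_quadrant(p: Point, q: Point) -> bool:
    """Orthogonal range: p is at most q in both coordinates."""
    return p[0] <= q[0] and p[1] <= q[1]


grid = list(product(range(-4, 5), repeat=2))
ok1 = all(in_octant1(p, q) == in_quadrant(t1(p), t1(q)) for p in grid for q in grid)
ok2 = all(in_octant2(p, q) == in_quadrant(t2(p), t2(q)) for p in grid for q in grid)
print(ok1, ok2)  # True True
```

After the transformation, an octant query around a point becomes an ordinary quadrant query, so the same orthogonal RMQ structure can be reused once per octant.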

24

Example: 4 Strains Staphylococcus

1. Staphylococcus aureus N315 NC_002745 (2853924)

2. Staphylococcus aureus Mu50 NC_002758 (2919236)

3. Staphylococcus aureus MW2 NC_003923 (2860842)

4. Staphylococcus epidermidis NC_004461 (2535068)

Fragments (min. len. 15) of genomes 1-2: 200,000 fragments

25

Example: 4 Strains Staphylococcus

[Figure: pairwise plots of genomes 1-2, 1-3, 1-4, 3-4, 2-4, 2-3]

26

Example: 3 Strains E. coli

1: E. coli O157:H7 EDL933 (5608027 bp)

2: E. coli O157:H7 (5577057 bp)

3: E. coli K12 (4705567 bp)

Fragments (min. len. 30) of genomes 1-2: 60,000 fragments

[Figure: pairwise plots of genomes 1-2, 1-3, 2-3]

27

Conclusions

Our algorithm solves an open problem w.r.t. the chaining of n fragments of k genomes.

We reduced the time complexity by almost two log factors and the space complexity by one log factor.

The sum-of-pairs gap cost is addressed in the paper.

Other data structures than the range tree can be used, e.g., the kd-tree. For k>2, it takes $O((k-1)\, n^{2 - \frac{1}{k-1}})$ time and $O(n)$ space.

Other research topic: comparison of distantly related organisms (Abouelhoda-Ohlebusch, WABI 2003).

Many genome rearrangements.

Local sequence alignment.

Detecting rearranged segments.

Revealing the mechanisms of rearrangements.

Better identification of the exons of eukaryotic genes.

28

Example: Local Chains

Fragments min. len. 12

[Figure: M. genitalium vs. M. pneumoniae]

Abouelhoda-Ohlebusch, WABI 2003

29

Example: Local Chains

Fragments min. len. 15

[Figure: E. coli vs. V. cholerae]

30

Acknowledgement

Enno Ohlebusch, Ulm University

Stefan Kurtz, Hamburg University

Robert Giegerich, Bielefeld University

Janina Scholz, Bielefeld University

Thanks for your attention

This work was funded by the Graduiertenkolleg Bioinformatik, Universität Bielefeld, Germany.
