
Multiple Genome Alignment:

Chaining Algorithms Revisited

Graduiertenkolleg Bioinformatik

Universität Bielefeld

Mohamed I. Abouelhoda
University of Bielefeld

Joint work with

Enno Ohlebusch
University of Ulm

Germany
2003

Comparative Genomics

The practice of analysing the genomic material of a species by comparing it with the genomic material of other species.

Why is this important?

The next logical step after the high throughput sequencing projects.

Deducing the mechanism and history of genome evolution.

Discovering genes and regulatory elements.

Identifying exons in eukaryotic genes.

Revealing the role of non-coding conserved sequences.

The Strategy

Closely related organisms: no (or few) genome rearrangements.

Strategy: global sequence alignment.

Finding conserved regions (genes and regulatory elements).

Detecting mutations.

Detecting unique genes (e.g., pathogenic genes in bacterial genomes).

Conserved regions are more significant in a multiple genome alignment.

[Figure: aligned genomes with a mutation and a pathogenic gene marked.]

Multiple Genome Alignment is Difficult

Given k genomes, each of average length N (N is very large: megabases).

Standard dynamic programming takes O(N^k) time, which is impractical even for k = 2.

Heuristic algorithms are therefore employed:

Program    Year   Authors

Two genomes
PipMaker   2000   Schwartz et al.
DIALIGN    2000   Morgenstern
MUMmer     2002   Delcher et al.
CHAOS      2002   Brudno and Morgenstern
OWEN       2002   Roytberg et al.
AVID       2003   Bray et al.

Multiple genomes
MGA        2002   Höhl et al.

These tools use an anchor-based alignment method.

MGA

MGA uses a strategy composed of three steps:

1. Computation of fragments (maximal multiple exact matches).

2. Computation of an optimal chain of colinear non-overlapping fragments (anchors).

3. Detailed alignment of the regions between the fragments of the optimal chain.

[Figure: fragments in the genomes G1, G2, and G3; the fragments of the optimal chain are the anchors.]

The Chaining Problem

Given n weighted fragments from k genomes, find the chain C of colinear non-overlapping fragments such that its total score is maximum over all chains.

$\mathrm{score}(C) = \sum_i \left[ f_{i+1}.\mathrm{weight} - g(f_{i+1}, f_i) \right]$

where $g(f_{i+1}, f_i)$ is the gap cost of connecting $f_{i+1}$ to $f_i$.

[Figure: fragments in the genomes G1, G2, and G3.]

Previous Work

A graph-based solution takes O(n^2) time.

A geometric-based algorithm is subquadratic (sparse dynamic programming):

1. Zhang et al. (1994) used space division with a kd-tree (no complexity analysis was given).

2. Myers and Miller (1995) used orthogonal range search with a range tree, yielding a complexity of O(n log^k n) time and O(n log^{k-1} n) space.

"But the result is a time bound higher by a logarithmic factor than what one would expect." (David Eppstein)

Sobel-Martinez, 1986; Wilbur-Lipman, 1983; Eppstein-Giancarlo, 1992

Previous Work

For two genomes, the complexities (O(n log^2 n) time and O(n log n) space) are also higher than those of known 2-dimensional chaining algorithms.

"We thought hard to reduce this discrepancy but have been unable to do so and the reasons appear to be fundamental! To improve upon our result appears to be a difficult open problem!" (Myers & Miller)

Here, it is improved by almost two log factors in time and one log factor in space:

For k = 2: O(n log n) time and O(n) space.

For k > 2: O(n log^{k-2} n log log n) time and O(n log^{k-2} n) space.

The Problem Revisited

• Any kind of fragment can be used (fragments can also contain mismatches and insertions/deletions).

• A fragment f_i is represented as a hyper-rectangle in a k-dimensional space.

• A fragment f_i is identified with its start and end points: start(f_i) and end(f_i).

• We add two imaginary fragments O and t with weight zero.

• Any two fragments f_i and f_{i+1} in the chain must be colinear and non-overlapping:

$f_i \ll f_{i+1}$: $\mathrm{end}(f_i).x_r < \mathrm{start}(f_{i+1}).x_r$ for all $r$, $0 < r \le k$
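As a minimal sketch of the representation described on this slide, a fragment can be stored as a pair of k-dimensional points plus a weight, and the relation f_i << f_{i+1} becomes a coordinatewise comparison. The names `Fragment`, `precedes`, and `is_chain` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Fragment:
    """A fragment as a hyper-rectangle: one coordinate per genome."""
    start: Tuple[int, ...]
    end: Tuple[int, ...]
    weight: float = 0.0


def precedes(f: Fragment, g: Fragment) -> bool:
    """True if f << g, i.e. end(f).x_r < start(g).x_r in every dimension r."""
    return all(e < s for e, s in zip(f.end, g.start))


def is_chain(fragments: Sequence[Fragment]) -> bool:
    """True if consecutive fragments are colinear and non-overlapping."""
    return all(precedes(a, b) for a, b in zip(fragments, fragments[1:]))
```

For example, with three genomes a fragment ending at (3, 3, 3) precedes one starting at (5, 6, 4), but not one starting at (4, 2, 9), because the second coordinate violates the strict inequality.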

The Solution

An optimal chain is a chain of maximum score.

The score of a chain C is

$\mathrm{score}(C) = \sum_i \left[ f_i.\mathrm{weight} - g(f_i, f_{i-1}) \right]$

The maximum score can be computed by the recurrence

$f_j.\mathrm{score} = f_j.\mathrm{weight} + \max\{ f_i.\mathrm{score} - g(f_i, f_j) : f_i \ll f_j \}$

where

$f_i \ll f_j$: $\mathrm{end}(f_i).x_r < \mathrm{start}(f_j).x_r$ for all $r$, $0 < r \le k$

$g(f_i, f_j)$ is the gap cost of connecting $f_i$ to $f_j$.

A graph-based solution takes O(n^2) time; sparse dynamic programming is subquadratic.
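The recurrence can be evaluated directly by comparing every pair of fragments, which gives the quadratic baseline mentioned on this slide. The following is a sketch only: the names `chain_quadratic`, `zero_gap`, and `l1_gap` are hypothetical, and it assumes that a chain may start at any fragment at no cost (that is, connecting to the imaginary origin O is free).

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Fragment:
    start: Tuple[int, ...]
    end: Tuple[int, ...]
    weight: float


def precedes(f: Fragment, g: Fragment) -> bool:
    """f << g: end(f).x_r < start(g).x_r in every dimension r."""
    return all(e < s for e, s in zip(f.end, g.start))


def zero_gap(f: Fragment, g: Fragment) -> float:
    return 0.0


def l1_gap(f: Fragment, g: Fragment) -> float:
    """L1 distance between end(f) and start(g); assumes f << g."""
    return float(sum(s - e for e, s in zip(f.end, g.start)))


def chain_quadratic(
    frags: Sequence[Fragment],
    gap: Callable[[Fragment, Fragment], float] = zero_gap,
) -> Tuple[float, List[Fragment]]:
    """O(n^2) evaluation of f_j.score = f_j.weight + max{f_i.score - g(f_i, f_j)}."""
    # If f_i << f_j then start(f_i).x_1 < start(f_j).x_1, so this order is safe.
    order = sorted(frags, key=lambda f: f.start[0])
    score = [f.weight for f in order]      # assumption: starting a chain is free
    pred: List[int] = [-1] * len(order)
    for j, fj in enumerate(order):
        for i in range(j):
            fi = order[i]
            if precedes(fi, fj):
                cand = fj.weight + score[i] - gap(fi, fj)
                if cand > score[j]:
                    score[j], pred[j] = cand, i
    if not order:
        return 0.0, []
    j = max(range(len(order)), key=score.__getitem__)
    best = score[j]
    chain: List[Fragment] = []
    while j != -1:                         # follow predecessor links backwards
        chain.append(order[j])
        j = pred[j]
    chain.reverse()
    return best, chain
```

Passing a different `gap` function changes which chain wins: with `zero_gap` the longest compatible chain is preferred, while `l1_gap` penalises fragments that are far apart.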

Geometric-based Solution

The recurrence

$f_j.\mathrm{score} = f_j.\mathrm{weight} + \max\{ f_i.\mathrm{score} - g(f_i, f_j) : f_i \ll f_j \}$

can be written as

$f_j.\mathrm{score} = f_j.\mathrm{weight} + \mathrm{RMQ}(O, \mathrm{start}(f_j))$

RMQ (Range Maximum Query): retrieves the fragment f_i whose end point lies in the hyper-rectangle bounded by O and start(f_j) such that f_i.score - g(f_i, f_j) is maximum.

RMQ is applied using the start and end points.

Overview of the Algorithm

The recurrence is

$f_j.\mathrm{score} = f_j.\mathrm{weight} + \mathrm{RMQ}(O, \mathrm{start}(f_j))$

The algorithm uses techniques from computational geometry:

1. Line-sweep algorithm.

2. The algorithm works on the start and end points of the fragments.

3. RMQ using a semi-dynamic data structure: the range tree.

4. Proper inclusion of the gap costs into the fragment weight.

If the gap cost is zero, an RMQ returns the end point of the fragment $f_i$ such that $f_i.\mathrm{score} = \sum_{0 \le r \le i} f_r.\mathrm{weight}$ is maximum.

The Algorithm without Gap Costs

The recurrence is

$f_j.\mathrm{score} = f_j.\mathrm{weight} + \mathrm{RMQ}(O, \mathrm{start}(f_j))$

Line-sweep algorithm

1. Sort the start and end points of the fragments w.r.t. x_1.

2. If a start point of a fragment, say f_j, is scanned, apply RMQ(O, (start(f_j).x_2, …, start(f_j).x_k)) to the set of active end points and update the score of the end point of fragment f_j.

3. Otherwise, add the end point to the set of active end points (already scanned end points).

The first step reduces the dimension of the RMQ to k-1.

If the gap cost is zero, an RMQ returns the end point of the fragment $f_i$ such that $f_i.\mathrm{score} = \sum_{0 \le r \le i} f_r.\mathrm{weight}$ is maximum.
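For k = 2 and zero gap cost, the line sweep above can be sketched in a few lines. This is an illustration under stated assumptions, not the data structure used in the talk: the one-dimensional RMQ is answered with a Fenwick tree holding prefix maxima over the ranks of the end points' x_2 coordinates (instead of a range tree), all weights are assumed positive, and the names `PrefixMax` and `chain_line_sweep` are hypothetical.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Fragment:
    start: Tuple[int, int]
    end: Tuple[int, int]
    weight: float


class PrefixMax:
    """Fenwick tree storing (score, fragment index) with prefix-maximum queries."""

    def __init__(self, n: int) -> None:
        self.tree = [(0.0, -1)] * (n + 1)   # (0.0, -1) stands for the origin O

    def activate(self, pos: int, value: Tuple[float, int]) -> None:
        i = pos + 1
        while i < len(self.tree):
            if value > self.tree[i]:
                self.tree[i] = value
            i += i & -i

    def query(self, count: int) -> Tuple[float, int]:
        """Maximum over the first `count` ranks (ranks 0 .. count-1)."""
        best = (0.0, -1)
        i = count
        while i > 0:
            if self.tree[i] > best:
                best = self.tree[i]
            i -= i & -i
        return best


def chain_line_sweep(frags: Sequence[Fragment]) -> Tuple[float, List[Fragment]]:
    ys = sorted({f.end[1] for f in frags})
    rank = {y: r for r, y in enumerate(ys)}
    # kind 0 = start point, kind 1 = end point; at equal x_1 the start points
    # come first, which enforces the strict inequality end(f_i).x_1 < start(f_j).x_1.
    events = sorted(
        [(f.start[0], 0, j) for j, f in enumerate(frags)]
        + [(f.end[0], 1, j) for j, f in enumerate(frags)]
    )
    active = PrefixMax(len(ys))
    score = [0.0] * len(frags)
    pred = [-1] * len(frags)
    for _, kind, j in events:
        f = frags[j]
        if kind == 0:
            # RMQ over active end points with x_2 strictly below start(f).x_2
            best, i = active.query(bisect_left(ys, f.start[1]))
            score[j], pred[j] = f.weight + best, i
        else:
            active.activate(rank[f.end[1]], (score[j], j))
    if not frags:
        return 0.0, []
    j = max(range(len(frags)), key=score.__getitem__)
    best_score = score[j]
    chain: List[Fragment] = []
    while j != -1:
        chain.append(frags[j])
        j = pred[j]
    chain.reverse()
    return best_score, chain
```

Each fragment causes one query and one activation, each costing O(log n), so the sketch runs in O(n log n) time and O(n) space, in line with the k = 2 bound quoted in the talk.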

The Complexity of the Algorithm

The complexity of the algorithm depends on the complexity of the RMQ in d = k-1 dimensions.

The required data structure D to manipulate the set of points

• is a semi-dynamic data structure over all end points in d = k-1 dimensions
• supports the operations:
  1. Activate an end point.
  2. Perform an RMQ.

D is implemented as a range tree

1. supported by fractional cascading (Willard, 1985).

2. enhanced with priority queues (van Emde Boas, 1977; Johnson, 1982).

For n fragments and dimension d, the RMQ and activation take:

O(n log^{d-1} n log log n) time and O(n log^{d-1} n) space

Since d = k-1 > 1, the complexity of the algorithm is

O(n log^{k-2} n log log n) time and O(n log^{k-2} n) space

For k = 2, the total complexity is O(n log n) time and O(n) space.

Including Gap Costs

Recall the recurrence

$f_j.\mathrm{score} = f_j.\mathrm{weight} + \max\{ f_i.\mathrm{score} - g(f_i, f_j) : f_i \ll f_j \}$

$f_j.\mathrm{score} = f_j.\mathrm{weight} + \mathrm{RMQ}(O, \mathrm{start}(f_j))$

The gap cost should be included in the RMQ, otherwise the algorithm would be quadratic.

How to define the gap costs?

How to include the gap costs without affecting the complexity?

[Figure: two sequences plotted against each other, with the two ACC fragments marked.]

Types of Gap Costs

The gap costs g can be described geometrically:

L1: $g_1(f', f) = d_1(\mathrm{start}(f), \mathrm{end}(f')) = \sum_{i=1}^{k} |\mathrm{start}(f).x_i - \mathrm{end}(f').x_i|$

L∞: $g_\infty(f', f) = d_\infty(\mathrm{start}(f), \mathrm{end}(f')) = \max_{1 \le i \le k} |\mathrm{start}(f).x_i - \mathrm{end}(f').x_i|$

Example with two sequences, ACCXXACC and ACCYYYACC:

L1, $g_1(f', f) = 5$:
ACC XX___ ACC
ACC __YYY ACC

L∞, $g_\infty(f', f) = 3$:
ACC _XX ACC
ACC YYY ACC

Example with a third sequence, ACCZZACC:

L1, $g_1(f', f) = 7$:
ACC XX_____ ACC
ACC __YYY__ ACC
ACC _____ZZ ACC

L∞, $g_\infty(f', f) = 3$:
ACC _XX ACC
ACC YYY ACC
ACC _ZZ ACC

The L∞ and the sum-of-pairs gap cost follow the same idea as the L1 gap cost.

[Figure: the two ACC fragments in the x/y plane.]
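The two gap costs can be checked against the examples on this slide with a short sketch. The function names `gap_l1` and `gap_linf` are hypothetical, and the coordinates are an assumption: positions are 0-based and the end of a fragment is exclusive, so that the distance equals the number of characters between the two ACC fragments.

```python
from typing import Sequence


def gap_l1(end_prev: Sequence[int], start_next: Sequence[int]) -> int:
    """L1 gap cost: sum of the coordinate differences."""
    return sum(abs(s - e) for e, s in zip(end_prev, start_next))


def gap_linf(end_prev: Sequence[int], start_next: Sequence[int]) -> int:
    """L-infinity gap cost: largest coordinate difference."""
    return max(abs(s - e) for e, s in zip(end_prev, start_next))


# ACCXXACC vs ACCYYYACC: the first ACC ends at offset 3 in both sequences,
# the second ACC starts at offset 5 and 6 respectively.
two_l1 = gap_l1((3, 3), (5, 6))
two_linf = gap_linf((3, 3), (5, 6))

# Adding ACCZZACC as a third sequence (second ACC starts at offset 5).
three_l1 = gap_l1((3, 3, 3), (5, 6, 5))
three_linf = gap_linf((3, 3, 3), (5, 6, 5))
```

Under these assumptions the values reproduce the slide: 5 and 3 for two sequences, 7 and 3 for three.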

Including Gap Costs in L1

We define the geometric cost of a fragment f as follows:

$gc(f) = d_1(t, \mathrm{end}(f))$

where $d_1(t, \mathrm{end}(f))$ is the distance in the L1 metric between t and end(f).

$f_1.\mathrm{score} - g(f_1, f) > f_2.\mathrm{score} - g(f_2, f)$

iff

$f_1.\mathrm{score} - gc(f_1) > f_2.\mathrm{score} - gc(f_2)$

gc(f) is a constant that can be precomputed and attached to the fragment's weight.

For k > 2, the complexity is O(n log^{k-2} n log log n) time and O(n log^{k-2} n) space.

For k = 2, the complexity is O(n log n) time and O(n) space.

[Figure: the fragments f_1 and f_2 preceding the fragment f.]
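One way to see why the equivalence holds is the following short derivation. It is a sketch under two assumptions: the candidate fragment f' satisfies f' << f, and t dominates every point coordinatewise, so all absolute values can be dropped. Here g_1 denotes the L1 gap cost and f' stands for either f_1 or f_2.

```latex
\begin{align*}
gc(f') &= d_1(t, \mathrm{end}(f'))
        = \sum_{r=1}^{k} \bigl( t.x_r - \mathrm{end}(f').x_r \bigr) \\
       &= \sum_{r=1}^{k} \bigl( t.x_r - \mathrm{start}(f).x_r \bigr)
        + \sum_{r=1}^{k} \bigl( \mathrm{start}(f).x_r - \mathrm{end}(f').x_r \bigr) \\
       &= d_1(t, \mathrm{start}(f)) + g_1(f', f)
\end{align*}
```

Hence f'.score - g_1(f', f) = f'.score - gc(f') + d_1(t, start(f)). The last term depends only on f, so it is the same for every candidate f' and does not change which candidate is maximal.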

Gap Costs in L∞

$g_\infty(f', f) = d_\infty(\mathrm{start}(f), \mathrm{end}(f')) = \max_{i} |\mathrm{start}(f).x_i - \mathrm{end}(f').x_i|$

$g_\infty(f', f) = \max\{ \Delta_x(f', f), \Delta_y(f', f) \}$

where $\Delta_x(f', f)$ and $\Delta_y(f', f)$ are the coordinate differences between start(f) and end(f') in x and y.

In the octant O1, $g_\infty(f', f) = \Delta_x(f', f)$.

In the octant O2, $g_\infty(f', f) = \Delta_y(f', f)$.

The geometric cost of a fragment f is then:

$gc(f) = d_\infty(t, \mathrm{end}(f))$

$f_1.\mathrm{score} - g(f_1, f) > f_3.\mathrm{score} - g(f_3, f)$

iff

$f_1.\mathrm{score} - gc(f_1) > f_3.\mathrm{score} - gc(f_3)$

gc(f) is a constant that can be precomputed and attached to the fragment's weight.

RMQ must be performed in every octant and the point of maximum score is chosen.

[Figure: the fragments f_1, f_2, and f_3 in the octants O1 and O2.]

RMQ on Octants

Because the RMQ requires an orthogonal range, we use the octant-to-quadrant transformation:

For the octant O1: $T_1: (x_1, x_2) \mapsto (x_1 - x_2, x_2)$

For the octant O2: $T_2: (x_1, x_2) \mapsto (x_1, x_2 - x_1)$

The total complexity of the algorithm depends on the space division.

For k > 2, the complexity is O(k! n log^{k-2} n log log n) time and O(k! n log^{k-2} n) space.

For k = 2, the complexity is O(2 n log n) time and O(2 n) space.

[Figure: the octants O1 and O2 with the points s, p, and q, before and after the transformation.]

Example: 4 Strains of Staphylococcus

1. Staphylococcus aureus N315 NC_002745 (2853924 bp)
2. Staphylococcus aureus Mu50 NC_002758 (2919236 bp)
3. Staphylococcus aureus MW2 NC_003923 (2860842 bp)
4. Staphylococcus epidermidis NC_004461 (2535068 bp)

[Figure: fragments of minimum length 15 between genomes 1 and 2; 200,000 fragments.]

[Figure: panels for the genome pairs 1-2, 1-3, 1-4, 3-4, 2-4, and 2-3 of the four Staphylococcus genomes.]

Example: 3 Strains of E. coli

1. E. coli O157:H7 EDL 993 (5608027 bp)
2. E. coli O157:H7 (5577057 bp)
3. E. coli K12 (4705567 bp)

[Figure: panels for the genome pairs 1-2, 1-3, and 2-3; fragments of minimum length 30 between genomes 1 and 2: 60,000 fragments.]

Conclusions

Our algorithm solves an open problem w.r.t. the chaining of n fragments of k genomes.

We reduced the time complexity by almost two log factors and the space complexity by one log factor.

Other data structures than the range tree can be used, e.g., the kd-tree. For k > 2 it takes $O((k-1)\, n^{2 - \frac{1}{k-1}})$ time and O(n) space.

The sum-of-pairs gap cost is addressed in the paper.

Other research topic: Comparison of distantly related organisms.

Many genome rearrangements.

Local sequence alignment.

Detecting rearranged segments.

Revealing the mechanisms of rearrangements.

Better identification of the exons of eukaryotic genes.

Abouelhoda-Ohlebusch, WABI 2003

Example Local Chains

[Figure: local chains; axes labelled M. genitalium and M. pneumoniae. Fragments of minimum length 12.]

Abouelhoda-Ohlebusch, WABI 2003

Example Local Chains

[Figure: local chains; axes labelled E. coli and V. cholerae. Fragments of minimum length 15.]

Acknowledgement

Enno Ohlebusch, Ulm University

Stefan Kurtz, Hamburg University

Robert Giegerich, Bielefeld University

Janina Scholz, Bielefeld University

Thanks for your attention

This work was funded by the Graduiertenkolleg Bioinformatik, Universität Bielefeld, Germany.