
Evolution of mitochondrial genomes and reconstruction of phylogenetic relationships

DISSERTATION

submitted to the Faculty of Biosciences, Pharmacy and Psychology of the University of Leipzig

for the academic degree of

DOCTOR RERUM NATURALIUM (Dr. rer. nat.)

presented by

Dipl. Biol. Guido Fritzsch

born on 10 September 1975 in Leipzig

Leipzig, 23 January 2009

Abstract

Nowadays, molecular markers are an essential part of the reconstruction of phylogenetic relationships between different organisms. They provide an opportunity to extend the information of morphological data and, beyond that, to resolve important questions such as the splittings during the Cambrian radiation.

During the last few years, the amount of available molecular data has increased exponentially. A good example of such a marker is the mitochondrial genome, which includes phylogenetic information on different taxonomic levels. Bioinformatics, as an interdisciplinary field of research, is developing a wide range of tools for biologists to analyse molecular data extensively and carefully. This work is a further step towards finding algorithms and developing new programs that give biologists basic tools for the careful analysis of these data and for the reconstruction of phylogenetic relationships. My work follows two different approaches. The first approach deals with the quality of data sets, which is affected by multiple substitutions, point mutations, wobbling third positions in protein coding genes, and/or simply variable parts of sequences, all of which lead to alignment positions whose character information can no longer be interpreted in ‘the correct way’. The second approach uses the information of mitochondrial gene order. This order includes comprehensive information on very old splittings, for example the deep phylogeny of the Metazoa.

In this dissertation I present the development and implementation of three novel algorithms: one for the interpretation of the quality of large sequence alignments and two for dealing with mitochondrial gene order information.


Bibliographical Data

Guido Fritzsch

Evolution of mitochondrial genomes and reconstruction of phylogenetic relationships

University of Leipzig, dissertation, 100 pages, 138 references, 25 figures, 3 tables

Abstract

This study includes various phylogenetic reconstructions of the relationships of different species. The focus lies on mitochondrial genomes and their manifold information. Three novel approaches were developed that make it possible to validate, investigate, analyse, and use this information.

Abbreviations

ACC         Accession Number
ATP         Adenosine triphosphate
bp          base pair
COX I-III   genes for cytochrome c oxidase subunits I-III
cyt b       gene for cytochrome b
D-Loop      structure within the mitochondrial control region
de novo     beginning again
DNA         Deoxyribonucleic acid
mRNA        Messenger Ribonucleic acid
tRNA        Transfer Ribonucleic acid
kb          kilo base pair
ML          Maximum Likelihood
MP          Maximum Parsimony
mt          mitochondrial
mt DNA      mitochondrial Deoxyribonucleic acid
mt genome   mitochondrial genome
NCBI        National Center for Biotechnology Information
ND1-6       genes for NADH dehydrogenase subunits 1-6
NJ          Neighbor Joining
OH          origin of replication of the heavy strand of mitochondrial DNA
OL          origin of replication of the light strand of mitochondrial DNA
rRNA        ribosomal Ribonucleic acid
SET         Serial Endosymbiotic Theory
sp.         species
TIM         the inner membrane complex


Contents

1 The Whisper of the Leaves

2 Mitochondria
2.1 History
2.2 The Origin of Mitochondria
2.2.1 The Serial Endosymbiosis Theory
2.2.2 The Episome Theory
2.2.3 The Hydrogen Hypothesis
2.3 The Mitochondrial DNA
2.3.1 The Function
2.3.2 The Genome
2.3.3 The Replication
2.3.4 The Structure
2.3.5 Mitochondrial Inheritance

3 Noisy
3.1 Misleading Sites
3.2 Trees, Metrics, and Weighted Split Systems
3.3 Noise Detection Using Circular Split Systems
3.4 Computational Results
3.5 Conclusion

4 Gene Order Rearrangements
4.1 Breakpoint Distance
4.2 Inversion Distance
4.3 Parsimony Approaches
4.3.1 Encoding Methods
4.3.2 Direct Optimization
4.4 The circal algorithm
4.4.1 Cyclic alignments
4.4.2 Encoding of Mitochondrial Genomes
4.4.3 Scoring Model
4.4.4 Multiple Cyclic Alignments
4.4.5 Implementation
4.4.6 Tree Reconstruction
4.4.7 Consensus Gene Arrangements
4.4.8 Ancestral Genome Organization
4.4.9 Mitochondrial Genomes
4.4.10 Chloroplast Genomes
4.5 The CREx Algorithm
4.5.1 Basic Definitions
4.5.2 Methods
4.5.3 The CREx Algorithm
4.5.4 The Implementation of the CREx Algorithm
4.5.5 Real World Example
4.5.6 Current Developments

5 Summary

Bibliography

A Appendix

CHAPTER 1

The Whisper of the Leaves

Molecular phylogeny, also known as molecular systematics, is a sub-discipline of molecular biology. It deals with the inference of evolutionary relationships among taxa and organisms, within populations, and among other inherited biological entities, such as genes. Such relationships between molecular markers or structures can also be used to study the properties of taxa, including intrinsic traits, ecological interactions, and geographic distributions.

In recent publications, molecular characters are often used as evolutionary markers. These manifold characters, such as nuclear or mitochondrial DNA or biochemical pathways, are good indicators for phylogenetic classification.

‘Speciation events’ are the processes of interest in molecular systematics. The traces left behind by such events are genetic differences between organisms, which can be analysed either directly at the level of nucleic acids or indirectly through the visible modification of morphological structures. The only information that exists and can be used consists of the apomorphies present in recent organisms, fossils, genomes of organisms, etc., and of similarities that indicate some properties of evolutionary processes. However, evolution does not only result in evolutionary novelties that can be interpreted as apomorphies. Real data often include novelties with high similarity to characters of distantly related organisms. At the molecular level, characters can easily suggest a close relationship to characters of distinct groups.

From the phylogenetic perspective, such characters are considered ‘noise’. These characters, often summarized as Homoplasy, contain manifold information, but in the majority of cases the information is modified, misleading, or simply a false signal. In biology, the term Homoplasy describes similar structures that lack common ancestry, that is, structures that cannot have been inherited from a shared ancestor. Homoplasies, especially analogies or convergences, can lead to wrong phylogenetic reconstructions if they are not recognized. The reason for this mistake is that groups are united on the basis of independently acquired characteristics.

A Homoplasy can be (Wägele, 2005):

• a real homology in an incorrect phylogenetic tree, in which the character occurs as an apparent analogy

• an apparent reversal, which is a real plesiomorphy on the wrong topology

• a real analogy, convergence, or parallelism that evolved through independent events and is mapped on a correct phylogenetic tree. Due to its lack of complex structure it cannot be distinguished from a homology and has been coded as a homology.

• an analogy that originated from back mutations (reversals) and is recorded in the correct phylogenetic tree, and which cannot be distinguished from a homology.

• however, an incorrect hypothesis of analogy, which is based on an erroneous interpretation of a homology, will not have the distribution of a homoplasy, because each single character will be coded with a different number.

At the molecular level, homoplastic characters include multiple substitutions (backward and parallel). These sites are most often informative sites whose information cannot be used in the correct way, such as the third codon position of protein coding genes.
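The special role of the third codon position can be illustrated with a minimal sketch that counts the differences between two aligned coding sequences separately for each codon position. The function name and the two short sequences are hypothetical and chosen only for illustration; the codons in the example belong to four-fold degenerate families (CTN, GGN, ACN), so the substitutions shown do not change the encoded amino acids.

```python
def differences_by_codon_position(seq_a, seq_b):
    """Count mismatches between two aligned, gap-free coding sequences,
    separately for codon positions 1, 2, and 3."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must have equal length, a multiple of 3")
    counts = [0, 0, 0]
    for index, (base_a, base_b) in enumerate(zip(seq_a, seq_b)):
        if base_a != base_b:
            counts[index % 3] += 1  # index % 3 == 2 is the third codon position
    return counts


# Hypothetical example: Leu-Gly-Thr in both sequences, but with
# different bases at every third codon position.
seq_a = "CTTGGAACC"
seq_b = "CTCGGGACA"
print(differences_by_codon_position(seq_a, seq_b))  # [0, 0, 3]
```

In this toy case all observed variation sits at third positions, which is the kind of site that saturates quickly over large evolutionary distances.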

The task of biologists is to identify phylogenetically relevant data (Homology, Apomorphy, and/or Plesiomorphy) and to separate them from random similarities, such as homoplastic characters. Note in this context that Homology, Apomorphy, and Plesiomorphy are important concepts in cladistics. These terms are hypotheses, which explain a real fact in the best way. In practice, such hypotheses are often wrong and may even support incompatible groups.

In many cases, large evolutionary distances imply a large number of homoplastic sites. As most protein-coding genes show dramatic variation in substitution rates that are correlated across the sequence, this often leads to a patchwork pattern of phylogenetically informative and effectively randomized regions. Furthermore, in highly variable regions, alignment errors accumulate, sometimes resulting in misleading signals in phylogenetic reconstruction.



In a parsimonious sense, homoplastic sites cause additional steps in tree reconstruction, which correspond to additional hypotheses (ad hoc hypotheses). The aim is therefore to minimize the number of ad hoc hypotheses as well.
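The additional steps caused by a homoplastic site can be made concrete with a minimal sketch of the Fitch small-parsimony count on a fixed rooted binary tree. The tree, the taxon names A to D, and the two site patterns are hypothetical and serve only as an illustration; this is not one of the programs developed in this work.

```python
def fitch_steps(tree, states):
    """Return the minimum number of state changes that one site requires
    on a rooted binary tree given as nested tuples of leaf names."""
    steps = 0

    def visit(node):
        nonlocal steps
        if isinstance(node, str):            # leaf: its observed state
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:                           # children agree: no change needed
            return common
        steps += 1                           # children disagree: one more step
        return left | right

    visit(tree)
    return steps


tree = (("A", "B"), ("C", "D"))
compatible = {"A": "G", "B": "G", "C": "T", "D": "T"}   # fits the tree
homoplastic = {"A": "G", "B": "T", "C": "G", "D": "T"}  # conflicts with it

print(fitch_steps(tree, compatible))   # 1
print(fitch_steps(tree, homoplastic))  # 2
```

A site with two states needs at least one change on any tree. The second pattern needs two changes on this tree, and the one extra step corresponds to one ad hoc hypothesis of parallel or backward substitution.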

Given the overwhelming amount of molecular data, the exponential increase of new data (currently approx. 30 new mitochondrial genomes per month are deposited in NCBI GenBank), and the complexity of the information they contain, it is indispensable to include computational, bioinformatical, and mathematical techniques in phylogenetic analysis and reconstruction. One main reason for the enormous accumulation of sequences for countless genes and species in several gene banks is the accelerated development of molecular methods, which creates the basis for modern phylogenetic analyses. Molecular phylogeny thus has a vast intersection with bioinformatics, which is needed to develop tools to detect, extract, and assay phylogenetic information in molecular data. Interdisciplinary interactions open further possibilities for a more realistic and careful handling of molecular data. The inclusion of fast algorithms and the combination of techniques from different disciplines, for example, allow the handling of huge data sets and the analysis of complex molecular data, such as the third codon position of protein coding genes or sequence positions with multiple substitutions (homoplastic sites).

At present, a considerable number of important, so-called ‘standard methods’ exist, which form the basis for a reasonable analysis or reconstruction. These methods are implemented in numerous programs, which are essential to, e.g., align sequences, sample evolutionary rates, generate likelihoods, test evolutionary models, or reconstruct phylogenetic relationships. Depending on the question asked about these relationships, the results of a molecular phylogenetic analysis are most often visualized in the form of phylogenetic trees (cladograms, dendrograms, phylograms, tree graphs) or network graphs. With the advent of completely sequenced genomes these approaches are complemented by genome-wide comparisons of gene contents (Fitz-Gibbon and House, 1999; Snel et al., 1999), gene orders (Boore and Brown, 1998; Coenye and Vandamme, 2003), or composition measures (Qi et al., 2004).

During the last few years, I was able to gain experience in this field. My dissertation includes several studies based on such standard methods, for example different alignment tools, analysis programs, and reconstruction methods such as Neighbor Joining, Maximum Parsimony, Maximum Likelihood, or Bayesian analysis. These studies allowed me to become familiar with the strengths and weaknesses of phylogenetic analyses. This work resulted in a number of publications, three of which I present in the following.



Analysis of Andes frogs (Phrynopus, Leptodactylidae, Anura) phylogeny based on 12S and 16S mitochondrial rDNA sequences (Lehr et al., 2005)

South American leptodactylid frogs of the genus Phrynopus occur in cloud-forest, paramo, subparamo and puna habitats (1000 - 4400 m elevation) from Colombia to Bolivia. In 2005, there were 34 described species; however, many additional species new to science have been reported from Colombia, Peru, and Bolivia. The phylogeny of the species-diverse Phrynopus is unknown and the position of the genus within Leptodactylidae is poorly understood. We presented the results of a phylogenetic study based on 12S and 16S mitochondrial rDNA (see Figure 1.1). Fifteen species of Phrynopus from Bolivia to Ecuador are included, along with several other genera of Leptodactylidae and representatives of other frog families. Our results indicate that Phrynopus is phylogenetically nested within Eleutherodactylus, whereas Phyllonastes is phylogenetically nested within Phrynopus. Based on the recovered phylogeny, we transfer Phrynopus simonsii to Eleutherodactylus, and show that Phrynopus carpish needs to be removed from Phrynopus.

From terrestrial to aquatic habitats and back again - molecular insights into the evolution and phylogeny of Hydrophiloidea (Coleoptera) using multigene analyses (Bernhard et al., 2006)

The phylogenetic relationships within Hydrophiloidea have been a matter of controversial discussion for many years, and the supposedly repeated changes between aquatic and terrestrial lifestyles are not well understood. In order to address these issues we used an extensive molecular data set comprising sequences from six nuclear and mitochondrial genes. The analyses of the entire data set resulted in largely congruent tree topologies concerning the main branches (see Figure 1.2), independent of the analytical procedures. However, only Bayesian analyses yielded sufficiently high posterior probabilities, whereas bootstrap support values for most nodes were generally low. Our results are only partially congruent with hypotheses based on morphological analyses. Spercheidae were placed as the sister group of the remaining hydrophiloid subgroups. Hydrophiloidea excluding Spercheidae split into two clades: the ‘helophorid lineage’ comprising the small groups Epimetopidae, Hydrochidae, Georissidae, and Helophoridae, and the largest family, Hydrophilidae. Within Hydrophilidae, Hydrophilinae do not form a monophylum. The predominantly terrestrial Sphaeridiinae were placed as a subordinate clade within this subfamily. Furthermore, our data suggest a single origin of the aquatic lifestyle in Hydrophiloidea, with numerous secondary changes to terrestrial habits and tertiary changes to aquatic habitats within Sphaeridiinae.



Figure 1.1: Maximum likelihood tree topology for the pruned dataset with Pleurodema marmoratum (Leptodactylidae) as outgroup. Numbers at the nodes separated by a slash are, from left to right: bootstrap values for ML (TBR swapping algorithm), bootstrap values of 10,000 replicates for NJ, posterior probabilities for MrBayes, and bootstrap values for MP (10,000 replicates). A dash indicates that the branch was not found in the ML topology. Numbers in parentheses are the numbers of individuals of the species used, if more than one. Syntopic Phrynopus are connected with an arrow-headed line (Lehr et al., 2005).



Figure 1.2: Bayesian tree of the whole data set (SSU rDNA, LSU rDNA, 16S rDNA, COI, COII). Four species of the Histeroidea (Histeridae and Sphaeritidae) were chosen as outgroups. The number at each node refers to posterior probabilities. Habitats of adults (first letter) and larvae (second letter) are given in brackets. The assumed transitions in lifestyle are marked with arrows. A = aquatic, S = semiaquatic, T = terrestrial (Bernhard et al., 2006).



The complete mitochondrial genome of the Green Lizard Lacerta viridis viridis (Reptilia: Lacertidae) and its phylogenetic position within squamate reptiles (Boehme et al., 2007)

For the first time, the complete mitochondrial genome was sequenced for a member of Lacertidae. Lacerta viridis viridis was sequenced in order to compare the phylogenetic relationships of this family to other reptilian lineages. Using long polymerase chain reaction (long PCR) we characterized a mitochondrial genome 17,156 bp long, showing a typical vertebrate pattern with 13 protein coding genes, 22 transfer RNAs (tRNA), two ribosomal RNAs (rRNA), and one major noncoding region. The noncoding region of L. v. viridis was characterized by a conspicuous 35 bp tandem repeat at its 5’ terminus. A phylogenetic study including all currently available squamate mitochondrial sequences revealed the position of Lacertidae within a monophyletic squamate group. We obtained a close relationship of Lacertidae to Scincidae, Iguanidae, Varanidae, Anguidae, and Cordylidae. Although the internal relationships within this group showed only weak resolution and low bootstrap support, the revealed relationships were more congruent with morphological studies than with recent molecular analyses (Townsend and Larson, 2002; Townsend et al., 2004; Lee, 2005; Vidal and Hedges, 2005; Kumazawa, 2007).

In an ideal case, all methods applied to a data set of informative characters should result in one tree topology. However, data sets most often include conflicts, that is, contradictory characteristics that cannot be explained by inheritance. These conflicts in molecular data are summarized as homoplastic signals (mentioned above), because it is difficult to distinguish between random sites, convergences, and predisposition.

A considerable part of the literature in molecular phylogeny is based upon the analysis of nucleic acid and/or amino acid sequences of individual genes or groups of genes. Such sequence-based methods for phylogenetic reconstruction are notoriously plagued by two effects: homoplastic sites and/or alignment errors. Of course, it is no problem to align sequences by hand or by eye if they are short, similar, and/or based on constant regions. However, this becomes very difficult or impossible for large and/or variable sequences (i.e. large evolutionary distances).

So the ‘Achilles' heel’ of a careful exploration is the data set in its earliest phase. For example, the reconstruction of deep metazoan phylogeny is a challenge. The selection of suitable markers to resolve patterns of divergence is often one of the hardest steps. Mitochondrial genomes are an example of such markers. They represent an interesting system, because mitochondria have small complete genomes, which include several genes with different evolutionary rates and various degrees of conservation. Furthermore, the gene order of these compact genomes yields a particularly fruitful data set for phylogenetic reconstruction (Watterson et al., 1982; Sankoff et al., 1992). Changes of the gene order are due to gene rearrangements and can differ from species to species. In the literature, several genome rearrangement events are described, such as inversions (Dobzhansky and Sturtevant, 1938), transpositions, reverse transpositions, and so-called tandem duplication random loss (TDRL) events (Moritz and Brown, 1986, 1987; Moritz et al., 1987).
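The four rearrangement events named above can be sketched on a gene order written as a list of signed integers, where the sign stands for the strand. This is a simplified illustration under stated assumptions, not the implementation developed in this work: the function names and index conventions are hypothetical, the circular genome is treated as a linear list, and the TDRL sketch duplicates the whole gene order rather than an arbitrary contiguous segment.

```python
def inversion(order, i, j):
    """Reverse the segment order[i:j] and flip the strand (sign) of its genes."""
    return order[:i] + [-gene for gene in reversed(order[i:j])] + order[j:]


def transposition(order, i, j, k):
    """Move the segment order[i:j] behind the adjacent segment order[j:k]."""
    return order[:i] + order[j:k] + order[i:j] + order[k:]


def reverse_transposition(order, i, j, k):
    """Transposition in which the moved segment is additionally inverted."""
    moved = [-gene for gene in reversed(order[i:j])]
    return order[:i] + order[j:k] + moved + order[k:]


def tdrl(order, keep_in_first_copy):
    """Tandem duplication random loss: duplicate the gene order in tandem,
    then keep each gene either in the first or in the second copy."""
    first = [gene for gene in order if gene in keep_in_first_copy]
    second = [gene for gene in order if gene not in keep_in_first_copy]
    return first + second


genome = [1, 2, 3, 4, 5]
print(inversion(genome, 1, 3))                 # [1, -3, -2, 4, 5]
print(transposition(genome, 1, 3, 5))          # [1, 4, 5, 2, 3]
print(reverse_transposition(genome, 1, 3, 5))  # [1, 4, 5, -3, -2]
print(tdrl(genome, {1, 3, 5}))                 # [1, 3, 5, 2, 4]
```

Note that a single TDRL can interleave genes in a way that would require several inversions or transpositions, which is one reason why tools restricted to those operations can give misleading scenarios.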

This dissertation is based on the assumption that molecular data include more information than has been used up to now, and it follows two main approaches.

The first approach deals with homoplastic sites, which constitute a fundamental problem in phylogenetic reconstruction. One question arises: Is it possible to detect and to quantify informative positions in difficult parts by the comparison of molecular data? The reproducibility of the information content of alignment positions is of particular interest in this case.

The second approach looks at the information content of mitochondrial gene order and at the possibilities of using it for phylogenetic reconstruction. Current tools are affected by two problems. First, it is hardly possible to compare two or more gene order sequences of different lengths. Secondly, there are no tools that include all known rearrangement scenarios, especially the tandem duplication random loss event. Therefore, one aim of this study is to reconstruct phylogenetic relationships using only the gene order information.


CHAPTER 2

Mitochondria

Mitochondria are small, rod-shaped organelles surrounded by two highly specialized concentric membranes. They resemble bacteria in their overall size and shape. The mitochondrial inner membrane is folded inward and forms the sites of aerobic respiration, generally the major energy production center in eukaryotes. Animal mitochondrial genomes are normally circular, about 16 kb in length, and encode thirteen proteins, two ribosomal RNAs (rRNA), and twenty-two transfer RNAs (tRNA), all of which are essential because they are necessary for oxidative phosphorylation and the production of cellular ATP by the mitochondria.

2.1 History

In the years between 1850 and 1880 several cytologists independently observed granular or threadlike components of the cytoplasm that we now recognize as mitochondria. Historically, this structure has been known by more than a dozen names, such as blepharoblasts, chondriokonts, chondriomites, chondrioplasts, chondriosomes, chondriospheres, fila, fuchsinophilic granules, Körner, Fadenkörper, mitogel, parabasal bodies, plasmasomes, plastochondria, plastosomes, vermicules, sarcosomes, interstitial bodies, bioblasts, and so on.

In 1854 the Swiss physiologist and histologist Rudolf Albert Ritter von Kölliker (1817-1905) was the first to observe small threadlike structures (granules) in the cytoplasm of striated muscle cells of insects, and he reached the conclusion that the cells had a membrane. These granules, which were later called sarcosomes by Retzius in 1890, were at first thought to be present only in muscle, but today we recognize the sarcosomes as the mitochondria of muscle cells. Some years later Flemming (1882) characterized filaments in the cytoplasm of other cell types.

In 1890 Altmann described a method of staining these structures with fuchsin that made it possible to demonstrate their occurrence in nearly all types of cells. In the same year he proposed that these granules were autonomous, elemental living units which form bacteria-like colonies in the cytoplasm of the host cell. In 1912 and 1913, B.F. Kingsbury and O. Warburg found that these granular, insoluble subcellular structures were associated with respiration, and this function challenged the theory of their role in genetics (Scheffler, 1999).

Because of their similarity to bacteria and their presumed capability of independent existence, Altmann called them ‘elementary living particles or bioblasts’. It was Benda (1897) who coined the term mitochondria, which is derived from the Greek mitos = Faden (German) or thread (English) and chondros = Korn (German) or grain (English). Benda made valuable observations on their form and distribution in preparations stained with alizarin and crystal violet.

In 1914 Lewis, and seven years later Strangeways and Canti, observed that these rod-shaped organelles were highly plastic structures which continuously executed slow sinuous movements and sometimes underwent marked changes of shape. The different staining reactions observed over the years gave rise to much conjecture about the probable chemical nature and function of the mitochondria. However, more information had to await their isolation from cells in a quantity sufficient for chemical analysis. In 1934 Bensley and Hoerr presented the first attempts with some success. They separated mitochondria by differential centrifugation of cell homogenates. The method was further perfected by Claude in the early 1940s and by Hogeboom, Schneider, and Palade in 1948.

In the following years many researchers, such as Kennedy and Lehninger in 1949 or Green and his associates, carried out intensive studies of isolated mitochondria and demonstrated that they are the principal site of the oxidative reactions by which the energy in foodstuff is made available for cell metabolism. The development of improved methods of fixation and thin sectioning for electron microscopy in the following years allowed a more detailed view of the structure of mitochondria. Based on these methods, Palade and Sjöstrand independently described the basic structural plan of the internal membranes in 1953 and coined the term cristae. In the next years, the focus shifted to analyses of the biochemical functions and pathways of the inner membrane, the cristae, and the matrix, e.g. Racker (1976).

In 1963 the first definite identification of DNA in mitochondria was made (Nass and Nass, 1963). Later in the 1970s, the complete mitochondrial DNA of a mammal was successfully sequenced in Cambridge (Anderson et al., 1982). Almost simultaneously, Attardi (1981) and his coworkers at the California Institute of Technology determined the nature of the transcripts derived from this genome and set the stage for the identification of all the genes encoded by mammalian mtDNA.

Mitochondria occupy a central position in the understanding of the cell, the ‘basic unit of life’. The study of mitochondria has provided not only a detailed view but also fundamental insights covering the entire spectrum from biophysics to cell biology and genetics.

Today, one of several frontiers is the integration of mitochondria into the cell and their distribution in the cytosol by means of their interaction with the cytoskeleton, especially in various highly differentiated cells. The overwhelming part of the literature in molecular phylogeny is based upon the analysis of nucleic acid and/or amino acid sequences of individual genes or groups of genes. With the advent of completely sequenced genomes these approaches are complemented by genome-wide comparisons of gene contents (Fitz-Gibbon and House, 1999; Snel et al., 1999), gene orders (Boore and Brown, 1998; Coenye and Vandamme, 2003), or composition measures (Qi et al., 2004).

2.2 The Origin of Mitochondria

2.2.1 The Serial Endosymbiosis Theory

At the 51st ‘Versammlung Deutscher Naturforscher und Ärzte’ (Assembly of German Naturalists and Physicians) in Kassel in 1878, Heinrich Anton de Bary (1831-1888) suggested, based on his work with lichens, using the term Symbiosis for a close relationship between two species. In 1883 the German botanist Andreas Franz Wilhelm Schimper (1856-1901) observed that the division of chloroplasts in green plants closely resembled that of free-living cyanobacteria and proposed (in a footnote) that green plants had arisen from a symbiotic union of two organisms (Schimper, 1883). In 1890 R. Altmann (Altmann, 1890) noted that the ‘granular bodies’ (mitochondria) in the cytoplasm of plant and animal cells display the staining properties of free-living microbes. On the basis of his work, Schimper may be regarded as the trailblazer of the Endosymbiotic Theory, which was first articulated by the Russian botanist Konstantin Sergejewitsch Mereschkowski (1855-1921) in his early work of 1905, ‘Über Natur und Ursprung der Chromatophoren im Pflanzenreiche’ (Mereschkowsky, 1905).

In the 1920s Ivan Emanuel Wallin (1883-1969) extended the idea of an endosymbiotic origin to mitochondria (Wallin, 1923). All these theories were initially dismissed or ignored over the following years. Their resurrection in the 1960s was based essentially on more detailed electron microscopic comparisons between cyanobacteria and chloroplasts (Ris and Singh, 1961), in combination with the discovery that plastids and mitochondria contain their own DNA (Stocking and Gifford, 1959). The so-called Serial Endosymbiotic Theory (SET) was fleshed out and popularized by Lynn Margulis (Margulis, 1970, 1981). In her 1981 work ‘Symbiosis in Cell Evolution’ she argued that eukaryotic cells originated as communities of interacting entities, including endosymbiotic spirochaetes that developed into eukaryotic flagella and cilia. This last idea has not received much acceptance, since flagella lack DNA and do not show ultrastructural similarities to prokaryotes.

This theory raises several problems. For example, neither mitochondria nor plastids can survive outside the cell, having lost many genes essential for survival. This objection is easily accounted for by considering the long timespan over which mitochondria and plastids have coexisted with their hosts; genes and systems that were no longer necessary were simply deleted or, in many cases, transferred into the host genome instead. In fact, these transfers constitute an important way for the host cell to regulate plastid or mitochondrial activity.

2.2.2 The Episome Theory

In 1972, R.A. Raff and H.R. Mahler presented a body of evidence and proposed that mitochondria developed from proto-mitochondria, which derived from the proto-eukaryote inner membrane and which contained genes for (mt) ribosomal components, tRNAs, and several elements of the respiratory chain. Borst (1972) formulated an Episome Theory and supposed that the DNA of mitochondria left the nuclear DNA by a kind of amplification and became enclosed within a membrane containing the respiratory chain (Bhamrah and Juneja, 2002).

The Episome Theory leads to several problems. No assumption is made as to whether the hypothetical episome was composed of prokaryote-like or eukaryote-like genetic material. Another problem is that the genes postulated to be present on the episome were probably located at a number of different sites in the proto-eukaryote genome, which would imply multiple successive insertions and deletions.

The main experimental approach used in biochemical studies claiming to shed light on the origin of mitochondria is the study of similarities among homologous components of bacteria and mitochondria. However, analysis of these results shows that it is impossible to decide between the Episome Theory and the Endosymbiosis Theories in this way (Reijnders, 1975).


2.2.3 The Hydrogen Hypothesis

In 1998, William Martin and Miklos Muller published the Hydrogen Hypothesis in Nature (Martin and Muller, 1998). They presented a new hypothesis for the origin of eukaryotic cells, based on the comparative biochemistry of energy metabolism.

In contrast to the Serial Endosymbiosis Theory, which postulated the phagocytosis of an α-proteobacterium or a cyanobacterium by a first primitive cell, the hydrogen hypothesis proposes a different mode of symbiosis and includes possible metabolic pathways. In this hypothesis, the authors argued that the host (a methanogenic archaebacterium which used hydrogen and carbon dioxide, producing methane) and a facultatively anaerobic eubacterium (the possible future mitochondrion, which produced hydrogen and carbon dioxide as byproducts of anaerobic respiration) started a symbiotic relationship based on the host’s hydrogen dependence (anaerobic syntrophy).

If correct, this hypothesis would imply that eukaryotes are chimeras with both archaebacterial and eubacterial ancestry and that eukaryotes appeared later in evolution than prokaryotes. Furthermore, the hydrogen hypothesis predicts that no primitively mitochondrion-lacking eukaryotes ever existed.

2.3 The Mitochondrial DNA

2.3.1 The Function

Mitochondrial genes are involved in at least five basic processes: respiration and/or oxidative phosphorylation and translation, and occasionally also transcription, RNA maturation and protein import. The leading roles of mitochondria are the production of ATP and the regulation of cellular metabolism (Voet et al., 2006). The central set of reactions involved in ATP production is collectively known as the citric acid cycle. However, the mitochondrion has many other metabolic tasks, such as:

• Regulation of the membrane potential (Voet et al., 2006)

• Apoptosis (programmed cell death) (Green, 1998)

• Glutamate-mediated excitotoxic neuronal injury (Scanlon and Reynolds, 1998)

• Cellular proliferation regulation (McBride et al., 2006)

• Regulation of cellular metabolism (McBride et al., 2006)

• Certain heme synthesis reactions (Oh-hama, 1997)


• Steroid synthesis (Rossier, 2006)

In general, the number of mitochondria and the complexity of their internal structure vary with the energy requirements of the specific functions carried out by the cell. In cells that are relatively inactive, the mitochondria tend to be few and their internal structure simple. On the other hand, cells engaged in active transport, in the synthesis of fat from carbohydrate, or in the conversion of chemical energy to mechanical work usually have large numbers of mitochondria that contain a profusion of cristae.

2.3.2 The Genome

The genetic material of mitochondria, called mitochondrial DNA or the mitochondrial genome, is similar in structure to prokaryotic genetic material. The mitochondrial chromosome is, with rare exceptions (Fukuhara et al., 1993), a circular DNA molecule, which is much smaller than prokaryotic chromosomes and, in contrast to them, exists as several copies.

Apart from some exceptions, such as among the gymnosperms, in which some families inherit mitochondria or chloroplasts paternally, the mitochondria of a sexually reproducing species are inherited maternally. The human mitochondrial genome consists of 16,571 base pairs, which encode only 13 proteins, 22 tRNAs, and 2 rRNAs (Anderson et al., 1981). In humans, and probably in metazoans in general, 100 to 10,000 separate copies of mitochondrial DNA are usually present per cell (egg and sperm cells are exceptions).

The size of known mitochondrial DNA in most eukaryotic phyla ranges from 11 to 60 kbp; however, there are some unusual exceptions. Among those organisms whose mt DNA has been completely sequenced, two extreme outliers are known. The first is the apicomplexan protist Plasmodium sp. with a minuscule mitochondrial genome of 6 kbp (Feagin, 1992), and the second is rice (Oryza sativa), whose mt DNA at 490 kbp (Notsu et al., 2002) is about 80 times larger than that of Plasmodium sp. In mammals, each circular mitochondrial DNA molecule consists of 15,000 to 17,000 base pairs, which encode the same 37 genes: 13 for proteins (polypeptides), 22 for transfer RNA (tRNA) and one each for the small and large subunits of ribosomal RNA (rRNA).

This pattern is also seen among most metazoans, although in some cases one or more of the 37 genes is absent and the mt DNA size range is greater. As of June 2008, approximately 1,189 completely sequenced mitochondrial genomes of Metazoa were available in NCBI GenBank (Maglott et al., 2005). The shortest genome sequenced, Paraspadella gotoi (Acc: NC_006083; Helfenbein et al., 2004), counts 11,423 bp and the longest, Trichoplax adhaerens (Acc: NC_008151; Dellaporta et al., 2006), counts 43,079 bp. The average over all lengths is 16,652 bp.

Figure 2.1: Comparison of the compactness of (a) Homo sapiens (NC_001807) (Anderson et al., 1981) and (b) Saccharomyces cerevisiae (NC_001224)

Compared to the nuclear genome, the mitochondrial genome possesses some very interesting

features:

• Except in ciliates, all genes are carried on a single circular DNA molecule.

• The genetic material is not bounded by a nuclear envelope.

• The DNA is not packed into chromatin.

• The genome contains little non-coding DNA (’junk’ DNA, or introns).

• Some codons do not follow the universal rules in translation. Instead they resemble those of purple

non-sulfur bacteria.

• Some bases are considered to be part of two different genes: both as the last base of one gene and

as the first base of the next gene.

Mitochondrial genes are transcribed as multigenic transcripts, which are cleaved and polyadenylated to yield mature mRNAs. The proteins that are necessary for mitochondrial function are not encoded by the mitochondrial genome alone. Most of them are encoded by genes in the cell nucleus and imported into the mitochondrion (Anderson et al., 1981). The exact number of genes encoded by the nucleus and by the mitochondrial genome differs between species.

2.3.3 The Replication

The replication of mitochondrial DNA is self-regulated in response to the energy demand of the cell and consequently not linked to the cell cycle. At cell division, mitochondria are distributed to the daughter cells essentially randomly during the division of the cytoplasm. Mitochondria divide by binary fission, similar to bacterial cell division; unlike bacteria, however, mitochondria can also fuse with other mitochondria (Chan, 2006; Hermann et al., 1998).

Mitochondria replicate much like bacterial cells. The regulation of plasmids differs considerably

from the regulation of chromosomal replication. However, the machinery involved in the repli-

cation of plasmids is similar to that of chromosomal replication. D-loop replication is a process

by which chloroplasts and mitochondria replicate their genetic material.

In many organisms, one strand of the DNA in the mitochondrion or plastid comprises heavier nucleotides (relatively more purines: adenine and guanine). This strand is called the H (heavy) strand, in contrast to the L (light) strand, which comprises lighter nucleotides (pyrimidines: thymine and cytosine). Replication begins with replication of the heavy strand starting at the D-loop (also known as the control region), which also includes the replication origin. This origin opens, and the heavy strand is replicated in one direction. After heavy strand replication has continued for some time, a new light strand is also synthesized by the opening of another origin of replication. When diagrammed, the resulting structure looks like the letter D. The D-loop region does not code for any genes; it is free to vary, with only a few selective limitations on size and heavy/light strand factors.

2.3.4 The Structure

As described before, mitochondria are small, rod-shaped organelles surrounded by two highly specialized concentric membranes, an inner membrane and an outer membrane, composed of phospholipid bilayers and proteins (Alberts et al., 1994). However, the two membranes have different properties.

Based on this double-membraned organization, there are five distinct compartments within the mitochondrion: the outer mitochondrial membrane, the intermembrane space (the space between the outer and inner membranes), the inner mitochondrial membrane, the cristae space (formed by infoldings of the inner membrane), and the matrix (the space within the inner membrane).

Figure 2.2: (a) Mitochondria, minute sausage-shaped structures found in the hyaloplasm (clear cytoplasm) of the cell, are responsible for energy production. Mitochondria contain enzymes that help to convert food material into adenosine triphosphate (ATP), which can be used directly by the cell as an energy source. Microsoft® Encarta® Encyclopedia 2001. © 1993-2000 Microsoft Corporation. All rights reserved. (b) Mitochondria. Courtesy of Dr. Henry Jakubowski

The Outer Mitochondrial Membrane

The outer mitochondrial membrane, which encloses the entire organelle, has a protein-to-phospholipid ratio similar to that of the eukaryotic plasma membrane (about 1:1 by weight). It contains numerous integral proteins called porins. These porins form a relatively large internal channel (about 2-3 nm) which is permeable to molecules of 5000 daltons or less.

Larger molecules can cross the membrane by active transport if they have a signaling sequence at the N-terminus that binds to a large multisubunit protein called translocase of the outer membrane. Disruption of the outer membrane permits proteins in the intermembrane space to leak into the cytosol, leading to certain cell death (Chipuk et al., 2006). Furthermore, the outer membrane contains enzymes involved in such diverse activities as the elongation of fatty acids, the oxidation of epinephrine (adrenaline), and the degradation of tryptophan.



The Intermembrane Space

The space between the outer membrane and the inner membrane is called the intermembrane space. As the outer membrane is freely permeable to small molecules, the concentrations of small molecules such as ions and sugars in the intermembrane space are the same as in the cytosol (Alberts et al., 1994). In contrast, the composition of large proteins differs between the intermembrane space and the cytosol. For example, one protein that is localized to the intermembrane space in this way is cytochrome c (Chipuk et al., 2006).

The Inner Mitochondrial Membrane

The inner mitochondrial membrane is folded inward and forms internal compartments known as cristae. These compartments allow greater space for proteins such as cytochromes to function properly and efficiently. Furthermore, the inner mitochondrial membrane includes transport proteins that move metabolites across this membrane in a highly controlled manner. The electron transport chain is also located on the inner membrane of the mitochondria.

The inner mitochondrial membrane contains proteins with four types of functions (Alberts et al.,

1994):

• performance of the redox reactions of oxidative phosphorylation

• ATP synthase, which generates ATP in the matrix

• specific transport proteins that regulate metabolite passage

• protein import machinery.

The inner membrane of the mitochondria contains more than 100 different polypeptides and has a very high protein-to-phospholipid ratio (more than 3:1 by weight, which is about 1 protein for 15 phospholipids). It is home to around 1/5 of the total protein in a mitochondrion (Alberts et al., 1994).

In contrast to the outer membrane, the inner membrane lacks porins and is highly impermeable to all molecules. To enter or exit the matrix, almost all ions and molecules require special membrane transporters.

Another interesting fact is that the inner membrane of mitochondria is similar in lipid composition to the membrane of prokaryotes, which lends support to theories such as the endosymbiotic theory (see above).



The Cristae

The inward folding of the inner mitochondrial membrane gives rise to numerous compartments called cristae. These expand the surface area of the inner mitochondrial membrane and increase its capacity to produce ATP. They are not simple random folds but rather invaginations of the inner membrane, which can affect overall chemiosmotic function (Mannella, 2006). For example, the surface area of the inner membrane, including cristae, in typical liver mitochondria is about five times larger than that of the outer membrane. Mitochondria of cells that have a greater demand for ATP, such as muscle cells, contain more cristae than typical liver mitochondria (Alberts et al., 1994).

The Matrix

The space enclosed by the inner membrane is called the matrix. It contains about 2/3 of the total protein in a mitochondrion (Alberts et al., 1994). The matrix has a highly concentrated mixture of hundreds of enzymes, special mitochondrial ribosomes, tRNA, and several copies of the mitochondrial DNA genome. The major functions of the enzymes include the oxidation of pyruvate and fatty acids, and the citric acid cycle (Alberts et al., 1994). The matrix is therefore important in the production of ATP with the aid of the ATP synthase.

2.3.5 Mitochondrial Inheritance

In most multicellular organisms, mitochondrial DNA is inherited from the mother (maternal inheritance). One reason lies in the excess of mitochondrial DNA molecules in the egg, which contains 100,000 to 1,000,000 of them, in contrast to the sperm with only 100 to 1,000 mitochondrial DNA molecules. Further reasons are the degradation of sperm mt DNA in the fertilized egg and, at least in a few organisms, the failure of sperm mitochondrial DNA to enter the egg. In mammals, 99.99% of mitochondrial DNA (mt DNA) is inherited from the mother. Whatever the mechanism is, this single-parent (uniparental) pattern of mitochondrial DNA inheritance is found in most animals, most plants and in fungi as well.


CHAPTER 3

Noisy

3.1 Misleading Sites

As mentioned above, homoplastic sites are frequent and pose a problem in phylogenetic reconstruction. A good data basis is therefore important, in particular a high-quality alignment. The columns of such an alignment can be classified based on their character structure (see Figure 3.1).

In the parsimony framework, one distinguishes between parsimony informative sites and parsimony non-informative sites. The parsimony non-informative sites include sites with constant (i.e. equal) nucleotides and sites whose variant nucleotides are all unique (singletons). In most cases of phylogenetic reconstruction, singleton sites merely produce a branch extension. A site is informative only when there are at least two different kinds of nucleotides at the site, each of which is represented in at least two of the sequences under study. These informative positions are the sites that are used to reconstruct phylogenetic trees. Note that Maximum Parsimony belongs to a class of character-based tree estimation methods which use a matrix of discrete phylogenetic characters to infer one or more optimal phylogenetic trees for a set of taxa. Other methods, like Maximum Likelihood, include non-informative sites in the estimation too.
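The classification of alignment columns described above can be illustrated with a minimal Python sketch. The function name `classify_site` and the decision to ignore the characters `-`, `?` and `N` are assumptions made for this illustration only; they are not taken from any program discussed in this thesis.

```python
from collections import Counter


def classify_site(column, ignore="-?N"):
    """Classify one alignment column (a string with one character per taxon).

    Returns 'constant', 'singleton' or 'informative'.
    Characters listed in `ignore` (an assumption of this sketch) are skipped.
    """
    counts = Counter(c for c in column.upper() if c not in ignore)
    if len(counts) <= 1:
        # All remaining taxa share the same state.
        return "constant"
    # Number of states that occur in at least two sequences.
    shared = sum(1 for k in counts.values() if k >= 2)
    # Informative: at least two states, each present in at least two sequences.
    return "informative" if shared >= 2 else "singleton"


if __name__ == "__main__":
    for col in ("AAAA", "AAAC", "AACC", "AACG"):
        print(col, classify_site(col))
```

In this sketch every variable column that is not parsimony informative is reported as a singleton site, which matches the two-way distinction made in the text.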

Apart from that, it is essential to classify the information content of informative sites in order to minimize the number of homoplastic and/or randomized sites for a stable reconstruction.

The method I present in this thesis allows the identification of phylogenetically uninformative homoplastic columns in a multiple sequence alignment, based on assessing the distribution of character states along a cyclic ordering of taxa. Removal of these columns improves the performance of phylogenetic reconstruction algorithms as measured by various indices of tree quality. In particular, I obtain more stable trees due to the exclusion of alternative splits that arise solely from randomized characters. The basic idea was conceived during a conversation with Prof. Andreas Dress about a very difficult dataset that includes a lot of conflicting positions, which was later published (Bernhard et al., 2006).

Figure 3.1: Differentiation between parsimony informative sites and parsimony non-informative sites.

3.2 Trees, Metrics, and Weighted Split Systems

Let $X$ denote a finite set of $n$ taxa. A split $S = A|\bar{A} = \bar{A}|A$ is a bipartition of the set $X$ of taxa, i.e. a partition of $X$ into two disjoint, non-empty subsets $A$ and $\bar{A}$. Two such splits $A_1|\bar{A}_1$ and $A_2|\bar{A}_2$ of $X$ are called compatible if at least one of the four intersections $A_1 \cap A_2$, $A_1 \cap \bar{A}_2$, $\bar{A}_1 \cap A_2$ and $\bar{A}_1 \cap \bar{A}_2$ is empty. A split system is compatible if every pair of splits is compatible.
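The compatibility condition can be checked directly from its definition. The following Python sketch represents a split by one of its two sides (the other side is the complement in the taxon set); the function names are chosen for this illustration only.

```python
from itertools import combinations


def are_compatible(a1, a2, taxa):
    """True if the splits a1|complement and a2|complement are compatible."""
    taxa = frozenset(taxa)
    a1, a2 = frozenset(a1), frozenset(a2)
    b1, b2 = taxa - a1, taxa - a2  # the complementary sides
    # Compatible if at least one of the four intersections is empty.
    return any(not (p & q) for p in (a1, b1) for q in (a2, b2))


def is_compatible_system(splits, taxa):
    """True if every pair of splits in the system is compatible."""
    return all(are_compatible(s, t, taxa) for s, t in combinations(splits, 2))


if __name__ == "__main__":
    taxa = "ABCDE"
    print(are_compatible({"A", "B"}, {"D", "E"}, taxa))  # fits on one tree
    print(are_compatible({"A", "B"}, {"A", "C"}, taxa))  # conflicting splits
```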

It is a well-known result that compatible split systems on $X$ are in 1:1 correspondence with the so-called $X$-trees (Buneman, 1971), i.e. finite trees $T = (V, E)$ with vertex set $V$ and edge set $E$ endowed with a map from $X$ into $V$ whose image contains (at least) all vertices of degree less than 3.


More specifically, this correspondence is given by:

(i) associating to any edge $e \in E$ of such a tree $T$ the bipartition $S_e$ of $X$ into those two subsets of $X$ that are mapped into the (exactly) two distinct connected components of the graph obtained from $T$ by deleting the edge $e$,

(ii) associating to $T$ the collection $\mathcal{S}(T) := \{S_e : e \in E\}$ of all such splits.

Associating a positive weight $\alpha_S$ to any such split $S = A|\bar{A}$ (e.g. the length of the edge $e$ in case every edge in the tree is endowed with some predefined positive length and $S = S_e$ holds), one can define the associated metric $d$ on $X$ by associating to any two taxa $x, y$ in $X$ the term

\[ d(x, y) := \sum_{S \in \mathcal{S}(T)} \alpha_S \, \delta_S(x, y) \tag{3.1} \]

where one puts, for any split $S = A|\bar{A} \in \mathcal{S}(T)$ and all $x, y \in X$, $\delta_S(x, y) := 0$ if $x, y \in A$ or $x, y \in \bar{A}$ holds, and $\delta_S(x, y) := 1$ otherwise (i.e. if $x$ and $y$ are separated by the split $S$), implying that $d(x, y)$ is the total length of the unique path from (the image of) $x$ to (the image of) $y$ relative to the given family of split weights $(\alpha_S)_{S \in \mathcal{S}(T)}$.
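As a small worked example of Equation 3.1, the following Python sketch sums the weights of all splits that separate two taxa. The tree, its edge lengths and the function name `split_metric` are hypothetical and serve only to illustrate the formula.

```python
def split_metric(x, y, weighted_splits):
    """Distance d(x, y) of Equation 3.1.

    `weighted_splits` is an iterable of (side, weight) pairs, where `side`
    is one side A of a split A|complement and `weight` its positive weight.
    A split contributes its weight exactly when it separates x and y.
    """
    return sum(w for side, w in weighted_splits if (x in side) != (y in side))


if __name__ == "__main__":
    # Hypothetical tree ((A,B),(C,D)): four pendant edges and one inner edge.
    splits = [
        (frozenset("A"), 1.0),
        (frozenset("B"), 2.0),
        (frozenset("C"), 1.5),
        (frozenset("D"), 0.5),
        (frozenset("AB"), 3.0),  # inner edge separating {A,B} from {C,D}
    ]
    print(split_metric("A", "B", splits))
    print(split_metric("A", "C", splits))
```

For the pair A, C the separating splits are the pendant edges of A and C and the inner edge, so the distance equals the path length through the tree.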

The goal is to detect homoplasy without determining a tree; thus it is necessary to admit more general split systems. Circular split systems, which I introduce in the following, are a good option.

3.3 Noise Detection Using Circular Split Systems

A split system $\mathcal{S}$ is circular if the points in $X$ (i.e. the taxa) can be arranged on a circle such that each split $S \in \mathcal{S}$ is induced by a division of that circle into two arcs by deleting two of its (unlabeled) points. In this case, the circular ordering is said to represent the split system.
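Whether a given circular ordering represents a split can be tested by walking once around the circle: a split is induced by cutting the circle into two arcs exactly when the membership in one side changes twice along the cycle. The Python sketch below uses this observation; the function names are illustrative assumptions.

```python
def is_induced_by_ordering(side, ordering):
    """True if the split side|complement corresponds to two arcs of the cycle."""
    side = set(side)
    n = len(ordering)
    # Count the positions where consecutive taxa lie on different sides.
    changes = sum(
        (ordering[i] in side) != (ordering[(i + 1) % n] in side)
        for i in range(n)
    )
    return changes == 2


def represents(ordering, splits):
    """True if the circular ordering represents every split of the system."""
    return all(is_induced_by_ordering(s, ordering) for s in splits)


if __name__ == "__main__":
    ordering = ["A", "B", "C", "D"]
    print(is_induced_by_ordering({"A", "B"}, ordering))  # contiguous arc
    print(is_induced_by_ordering({"A", "C"}, ordering))  # not an arc
```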

It is easy to verify that compatible split systems are circular (actually, every planar drawing of an $X$-tree provides such a circular ordering), and that circular split systems are weakly compatible, i.e. at least one of $A_1 \cap A_2 \cap A_3$, $A_1 \cap \bar{A}_2 \cap \bar{A}_3$, $\bar{A}_1 \cap A_2 \cap \bar{A}_3$ or $\bar{A}_1 \cap \bar{A}_2 \cap A_3$ is empty for any three splits $A_1|\bar{A}_1$, $A_2|\bar{A}_2$, $A_3|\bar{A}_3$ in a circular split system, cf. Bandelt and Dress (1992). Any distance constructed from a weighted circular split system is called a ’circular’ or Kalmanson metric (see Figure 3.2).
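The weak compatibility condition can also be checked mechanically. The sketch below assumes that the condition must hold regardless of which side of each split is labelled $A_i$, so it tests both families of four triple intersections (those with an even and those with an odd number of complemented sides). The function name is an assumption of this illustration.

```python
from itertools import product


def weakly_compatible(a1, a2, a3, taxa):
    """True if the three splits (given by one side each) are weakly compatible."""
    taxa = frozenset(taxa)
    # For each split keep both sides: index 0 = given side, 1 = complement.
    sides = [(frozenset(a), taxa - frozenset(a)) for a in (a1, a2, a3)]
    for parity in (0, 1):
        intersections = [
            sides[0][i] & sides[1][j] & sides[2][k]
            for i, j, k in product((0, 1), repeat=3)
            if (i + j + k) % 2 == parity
        ]
        # If all four intersections of one family are non-empty, the
        # condition is violated for some labelling of the sides.
        if all(intersections):
            return False
    return True


if __name__ == "__main__":
    taxa = "ABCD"
    # The three splits AB|CD, AC|BD, AD|BC cannot all be circular.
    print(weakly_compatible("AB", "AC", "AD", taxa))
    # Three splits taken from the circular ordering A, B, C, D.
    print(weakly_compatible("AB", "BC", "A", taxa))
```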

It has been observed that phylogenetic distance data are often circular or at most mildly non-circular (Huson, 1998; Bandelt and Dress, 1992; Wetzel, 1995). Starting from a suitable distance measure, we can construct a circular split system from an alignment without significantly predetermining later tree constructions, since the circular split system still represents essentially unfiltered data.

Figure 3.2: Summary of possible splits in a split system for 4 taxa (ABCD), and their possible relationship.

Circular split systems can be obtained in various ways. The computationally most straightforward approach is the Neighbor-Net algorithm (Bryant and Moulton, 2004), which starts from a distance matrix. It computes the circular splits using an agglomerative procedure.

An alternative approach starts from weighted quartets. To this end, one first computes a weight for each quartet, i.e. each pair of two pairs of taxa, $\{\{a, b\}, \{c, d\}\}$. This quartet weight is interpreted as the support for the hypothesis that $\{a, b\}$ and $\{c, d\}$ are separated by an edge in the correct phylogenetic tree. Quartet weights can be obtained in various ways. In the quartet-mapping approach (Nieselt-Struwe and von Haeseler, 2001), for example, one starts with an alignment of four sequences and defines the weight of a given quartet to be the fraction of alignment sites (columns) in which $a = b \neq c = d$. One may modify this score by adding $1/2$ for every additional column in which $a = b \neq c, d$ or $c = d \neq a, b$ holds. Quartet weights can also be derived directly from distances (although, in this case, it seems preferable to use the faster Neighbor-Net approach). A more sophisticated weighting scheme uses “expected branch lengths”, i.e. the product of the posterior likelihood and the maximum likelihood branch length of the interior edge of the corresponding quartet tree.
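The simple site-counting weight described above can be sketched in a few lines of Python. This is an illustration only and not the implementation of the quartet-mapping software; in particular, dividing the half-credit variant by the alignment length is an assumption of this sketch, and the function name is hypothetical.

```python
def quartet_weight(a, b, c, d, half_credit=False):
    """Site-counting weight for the quartet {{a, b}, {c, d}}.

    a, b, c, d are four aligned sequences of equal length.
    A column with a = b != c = d counts 1. If `half_credit` is set, a column
    with a = b != c, d or c = d != a, b additionally counts 1/2.
    The total is divided by the alignment length (assumption of this sketch).
    """
    if not (len(a) == len(b) == len(c) == len(d)) or len(a) == 0:
        raise ValueError("need four non-empty sequences of equal length")
    score = 0.0
    for w, x, y, z in zip(a, b, c, d):
        if w == x and y == z and w != y:
            score += 1.0
        elif half_credit and (
            (w == x and y != w and z != w) or (y == z and w != y and x != y)
        ):
            score += 0.5
    return score / len(a)


if __name__ == "__main__":
    # Columns: AACC (full support), AAAA, AACG (half support), ACGT.
    seqs = ("AAAA", "AAAC", "CACG", "CAGT")
    print(quartet_weight(*seqs))
    print(quartet_weight(*seqs, half_credit=True))
```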

The quartet $\{\{a, b\}, \{c, d\}\}$ is said to be realized by a cyclic ordering of $X$ if the straight line connecting $a$ and $b$ and the straight line connecting $c$ and $d$ do not intersect in the interior of the circle. There is a circular split system represented by a given cyclic ordering that contains a split that separates $a$ and $b$ from $c$ and $d$ if and only if $\{\{a, b\}, \{c, d\}\}$ is realized by that cyclic ordering. Hence, to ensure that as much quartet information as possible is represented, QNet (Grunewald, 2006) tries to find a cyclic ordering so that the sum of the weights of all realized quartets is maximal.
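The notion of a realized quartet and the quantity that QNet tries to maximize can be made concrete with a short Python sketch. It only evaluates a given cyclic ordering and does not reproduce QNet's agglomerative search; the function names and the example weights are hypothetical.

```python
def is_realized(quartet, ordering):
    """True if the chords a-b and c-d do not cross on the given cycle."""
    (a, b), (c, d) = quartet
    pos = {taxon: i for i, taxon in enumerate(ordering)}
    n = len(ordering)

    def rel(t):
        # Position of t when walking around the cycle starting at a.
        return (pos[t] - pos[a]) % n

    pb = rel(b)
    # The chords do not cross iff c and d lie on the same arc between a and b.
    return (rel(c) < pb) == (rel(d) < pb)


def realized_weight(ordering, weights):
    """Sum of the weights of all quartets realized by the ordering."""
    return sum(w for q, w in weights.items() if is_realized(q, ordering))


if __name__ == "__main__":
    weights = {
        (("A", "B"), ("C", "D")): 3,
        (("A", "C"), ("B", "D")): 5,
        (("A", "D"), ("B", "C")): 1,
    }
    for ordering in ("ABCD", "ACBD"):
        print(ordering, realized_weight(ordering, weights))
```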

Both Neighbor-Net and QNet use the same agglomeration process to construct a cyclic ordering. While Neighbor-Net tries to place taxa with a small distance close to each other, QNet tries to construct a cyclic ordering that maximizes the sum of the weights of the quartets it realizes. Hence, both methods construct cyclic orderings with the property that groups of phylogenetically closely related taxa tend to assemble along an arc of the cycle. Neighbor-Net and QNet are both consistent, i.e. if the distances or quartet weights correspond to a circular split system, they find a cyclic ordering that represents that split system (Bryant and Moulton, 2007; Grunewald et al., 2007).

For this purpose, the important property of the circular split systems computed by Neighbor-Net and QNet is that phylogenetically more closely related taxa are preferentially placed closer together in this cyclic ordering, since they are separated by fewer splits with positive weights. Thus, if a character $\chi = \chi_i$ (defined by some alignment site $i$ in a given alignment) is phylogenetically “useful”, its character states will appear “clustered” along the cyclic ordering, independent of the details of the branching order in individual subtrees. In contrast, if a character is completely randomized, one will observe that character states are randomly arranged along the cycle.

The amount of clustering can be easily quantified by the number $\nu = \nu(C, \chi)$ of adjacent distinct character states along the cycle $C$. We have $\nu = 0$ for constant sites and $\nu \geq 2$ for all non-constant sites. This number has to be compared with the numbers expected for a random distribution of character values along the cycle, given the overall distribution of the character values of $\chi$. To this end, we use a shuffling procedure, i.e. we randomly generate a cyclic ordering $C'$ of the same character states and compute the fraction $q = q(C, \chi)$ of randomized samples with $\nu(C', \chi) > \nu(C, \chi)$. The frequency $p = 1 - q$ thus estimates the probability that the character $\chi$ is randomized. Hence we can interpret $q$ as a reliability measure for the phylogenetic information contained in the alignment site (relative to $C$). Note that we obtain $q = 0$ for constant and singleton sites, which are phylogenetically uninformative, and $q \approx 0.5$ for effectively randomized sites. Sites with $q < 0.5$ are “worse” than random and contradict the given cyclic ordering, while support for the ordering is found in sites with $q > 0.5$.
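The break-point count $\nu$ and the shuffling estimate of $q$ can be sketched in Python as follows. This is a minimal illustration and not the code of noisy: the function names, the default number of shuffles and the use of Python's `random` module with a fixed seed are assumptions of this sketch.

```python
import random


def breakpoints(states):
    """Number of adjacent distinct character states along the cycle."""
    n = len(states)
    return sum(states[i] != states[(i + 1) % n] for i in range(n))


def reliability(states, n_shuffles=1000, rng=None):
    """Estimate q: fraction of random arrangements with more break points.

    `states` lists the character states of one alignment site in the
    order given by the cyclic ordering C of the taxa.
    """
    rng = rng or random.Random(0)  # fixed seed for reproducibility (assumption)
    observed = breakpoints(states)
    shuffled = list(states)
    worse = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if breakpoints(shuffled) > observed:
            worse += 1
    return worse / n_shuffles


if __name__ == "__main__":
    for site in ("AAAAACCCCC", "ACACACACAC", "AAAAAAAAAC", "AAAAAAAAAA"):
        print(site, breakpoints(site), reliability(site))
```

A site whose states are clustered along the cycle obtains a high score, whereas an alternating site, a singleton site and a constant site all obtain $q = 0$, because no rearrangement can produce more break points than observed.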


The program noisy executes the following steps:

1. Compute the cyclic ordering $C$ from the input data using either QNet or NeighborNet.

2. For each character $\chi$:

• Compute the number $\nu(C, \chi)$ of break points.

• Compute $N$ random cyclic orderings $C'$.

• For each cyclic ordering compute $\nu(C', \chi)$.

• Compute the fraction $q(C, \chi)$ of random orderings with $\nu(C', \chi) > \nu(C, \chi)$.

3. If $q(C, \chi)$ falls below a given threshold $q_{\text{cutoff}}$, then remove the character $\chi$.
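Steps 2 and 3 can be sketched in Python as shown below. The sketch is not the ISO C++ implementation of noisy: it assumes that the cyclic ordering from step 1 is already available as a list of sequence indices, it follows the later statement that sites with $q < q_{\text{cutoff}}$ are removed from the modified alignment, and it treats all columns uniformly, so constant and singleton sites (with $q = 0$) are dropped as well. All names and default values are hypothetical.

```python
import random


def count_changes(states):
    """Break points: adjacent distinct states along the cycle."""
    n = len(states)
    return sum(states[i] != states[(i + 1) % n] for i in range(n))


def filter_alignment(sequences, ordering, cutoff=0.8, n_shuffles=500, seed=0):
    """Return (kept column indices, reduced sequences).

    sequences : list of equally long strings, one per taxon
    ordering  : cyclic ordering of the taxa as indices into `sequences`
                (assumed to come from QNet or NeighborNet)
    """
    rng = random.Random(seed)
    kept = []
    for col in range(len(sequences[0])):
        # Character states of this site, read along the cyclic ordering.
        states = [sequences[t][col] for t in ordering]
        observed = count_changes(states)
        hits = 0
        for _ in range(n_shuffles):
            rng.shuffle(states)  # random rearrangement of the same states
            hits += count_changes(states) > observed
        q = hits / n_shuffles
        if q >= cutoff:  # sites with q below the cutoff are removed
            kept.append(col)
    reduced = ["".join(seq[c] for c in kept) for seq in sequences]
    return kept, reduced


if __name__ == "__main__":
    # Ten toy sequences: column 0 clustered, column 1 constant, column 2 alternating.
    seqs = [("A" if i < 5 else "C") + "G" + ("A" if i % 2 == 0 else "C")
            for i in range(10)]
    print(filter_alignment(seqs, list(range(10))))
```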

The program noisy is implemented in ISO C++ and the source code is available for download from http://www.bioinf.uni-leipzig.de/Software/noisy/ .

In a first phase, a cyclic ordering of the taxa set is computed. For this purpose, noisy includes the corresponding subset of routines from David Bryant and Vincent Moulton’s NeighborNet (Bryant and Moulton, 2004) and from the QNet (Grunewald, 2006) packages. Subsequently, a reliability score $q$ is calculated for each character. The number of character-state alterations is counted and compared to the counts observed in random shufflings. The uniform pseudo-random number generator Mersenne Twister (Matsumoto, 1998) is used to generate the random shufflings.

In order to assess whether the cyclic orderings obtained using QNet and NeighborNet reduce the fraction of uninterpretable variation, the following randomization experiment was performed. Given an alignment, all possible cyclic orderings were generated and, for each of them, the fraction $r$ of sites with $q \leq 0.8$ among all variable sites in the alignment was computed. As shown in Fig. 3.3, QNet and NeighborNet nearly minimize the fraction of “noisy” alignment sites for the 10 squamate mitochondrial genomes. The program noisy exports a PostScript file visualizing the quality of the sites of the reordered input alignment (see Fig. 3.5), records their reliability scores as xy-data, and writes a modified alignment for further analysis from which sites with reliability $q < q_{\text{cutoff}}$ are removed. Figure 3.4 gives an overview of the scores $q$ in relation to the structure of alignment positions, and Figure 3.5 shows typical examples of the distribution of alignment sites with low and high reliability scores $q$.
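Generating all cyclic orderings for such an experiment is feasible only for small taxon sets, since $n$ taxa admit $(n-1)!/2$ distinct cyclic orderings when rotations and reflections are identified (181,440 for $n = 10$). The Python sketch below enumerates them; identifying reflections is an assumption that is justified here because the break-point count along a cycle does not depend on the direction of travel. The function name is hypothetical.

```python
from itertools import permutations


def cyclic_orderings(taxa):
    """Yield each cyclic ordering of `taxa` once (up to rotation and reflection).

    The taxa must be distinct and mutually comparable (e.g. strings).
    """
    taxa = list(taxa)
    first, rest = taxa[0], taxa[1:]
    # Fixing the first taxon removes rotations.
    for perm in permutations(rest):
        # Keeping only perm[0] < perm[-1] removes mirror images.
        if len(perm) < 2 or perm[0] < perm[-1]:
            yield (first,) + perm


if __name__ == "__main__":
    for ordering in cyclic_orderings("ABCD"):
        print("".join(ordering))
```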

[Histogram. Horizontal axis: fraction of noisy positions (0.64 to 0.74). Vertical axis: number of circular orderings (0 to 20000). Marked orderings: NeighborNet, ClustalW, QNet/QM.]

Figure 3.3: Distribution of the fraction of randomized characters ($q(C, \chi) \leq 0.8$) among the variable characters in a set of 10 complete mitochondrial genomes as a function of the cyclic ordering. The cyclic orderings computed by NeighborNet or QNet indeed essentially minimize the fraction of putative randomized alignment sites. At least in this example, QNet with quartet-mapping-derived quartet weights performs best. “ClustalW” refers to the circular ordering implicitly constructed by ClustalW from its guide tree, which determines the order in which sequences and profiles are combined to yield the final alignment.


Figure 3.4: Classification scheme showing how noisy handles the information content of characters at different sites

3.4 Computational Results

As an example of the effect of removing noisy sites, I consider a data set of combined 28S rRNA, 16S rRNA, and mitochondrial COI sequences of spatangoid sea urchins that was reported to have a high level of homoplasy (Stockley et al., 2005). The raw sequence alignments lead to significantly different phylogenetic trees for different methods and disagree substantially with morphology-based results. As reported in the original paper, manual removal of homoplastic sites improved the trees considerably. The application of noisy with a cutoff $q = 0.8$ leads to comparable results without human intervention and produces a Maximum Parsimony (MP) tree that is consistent with the results obtained by Bayesian and Maximum Likelihood (ML) methods (in contrast to the manual procedure). The MP trees for the complete and the noisy-reduced alignments are presented in Figure 3.6.

In order to assess the influences of the removal of unreliablesites from real and simulated align-

ments on phylogenetic reconstruction, I consider theqcutoff-dependency of the most used com-

mon indices assessing tree quality. Phylogenies were computed using maximum parsimony and

neighbor joining (Kimura 2-parameter model) as implemented in PAUP* 4.0b10 (Swofford,

2002). Scaled log-likelihood score (i.e. the log-likelihood divided by the length of the align-

ment), homoplasy index (HI) (Kluge and Farris, 1969), rescaled consistency index (RC) (Farris,

1989), and average bootstrap support (over all internal vertices) were used to assess the tree sta-

bility while topological changes were described by split distance1. The data sets are described

1sdist , www.daimi.au.dk/ ˜ mailund/split-dist.html

28

CHAPTER 3. NOISY 3.4. COMPUTATIONAL RESULTS

50 100

150

200

250

300

350

400

450

500

550

600

650

100

200

300

400

500

600

700

800

900

1000

1100

1200

1300

1400

1500

1600

1700

1800

1900

2000

Figure 3.5: Distribution of homoplastic sites for the mitochondrialatp6 gene of squamata (top) and for18S RNA of Coleoptera from an analysis of (Korte et al., 2004)(bottom). In terms of quality, the two datasets are very different. While the majority of sites inatp6 are parsimony informative and approximatelyone third of the sites have a reliability score aboveqcutoff = 0.8, this is clearly not the case for the dataset by Korte et al. (2004) where most of the sites are constantor unreliable. The color codes for the sitesof the alignment are as follows:� site with missing data;� constant site;� singleton site;� parsimonyinformative site (at least two different character states occur in at least two taxa). The black bar below thealignment displays the sites withq ≥ qcutoff (upper half) andq < qcutoff (lower half). Their position isdisplayed below each alignment in nucleotide positions.

29


[Figure 3.6: two MP tree drawings with an inset table of length, PI-sites, HI, RI, and RC for the raw, noisy, and Stockley alignments.]

Figure 3.6: MP trees of spatangoid sea urchins from combined 28S rRNA, 16S rRNA, and mitochondrial COI sequences (Stockley et al., 2005). Left: tree from the original data; right: tree from a reduced alignment with cutoff $q = 0.8$. The latter tree fits very well with the Bayesian and ML results reported in Stockley et al. (2005), which were obtained from manually reduced alignments. In particular, the reduced MP tree correctly shows Brissopsis and Allobrissus as sister groups and correctly identifies the large monophyletic clade consisting of the Linopneustes/Metalia and Lovenia/Spatangus groups to the exclusion of Meoma and Archeopneustes. The included table compares the stability indices (HI = homoplasy index, RC = rescaled consistency index, RI = retention index) between the complete alignment, Stockley's manually improved alignment, and the noisy-reduced alignment.



Table 3.1: Randomized sites (at $q_{\mathrm{cutoff}} = 0.8$) in the 13 individual protein-coding genes of the 31 currently available complete mitochondrial genomes of Squamata. The last column gives the fraction of randomized variable sites.

Gene   length  singletons  q ≥ 0.8  random (%)
atp6      684          42      405       34.65
atp8      171           7      108       32.75
cox1     1536          88     1008       28.65
cox2      672          34      443       29.02
cox3      786          45      516       28.63
cytb     1131          74      676       33.69
nd1       942          44      589       32.80
nd2      1032          63      626       33.24
nd3       345          11      222       32.46
nd4      1371          65      831       34.65
nd4l      288          16      183       30.90
nd5      1803         103     1040       36.61
nd6       540          25      373       26.30
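The last column of Table 3.1 can be recomputed from the other three. The sketch below assumes, judging from the numbers in the table rather than from an explicit statement in the text, that a site counts as randomized when it is neither a singleton nor has $q \ge 0.8$; the function name is purely illustrative.

```python
# Rows of Table 3.1: gene -> (alignment length, singletons, sites with q >= 0.8)
TABLE_3_1 = {
    "atp6": (684, 42, 405),
    "atp8": (171, 7, 108),
    "cox1": (1536, 88, 1008),
    "cox2": (672, 34, 443),
    "cox3": (786, 45, 516),
    "cytb": (1131, 74, 676),
    "nd1": (942, 44, 589),
    "nd2": (1032, 63, 626),
    "nd3": (345, 11, 222),
    "nd4": (1371, 65, 831),
    "nd4l": (288, 16, 183),
    "nd5": (1803, 103, 1040),
    "nd6": (540, 25, 373),
}


def randomized_percent(length, singletons, reliable):
    """Percentage of sites that are neither singletons nor reliable.

    This definition is an assumption inferred from the table values.
    """
    return 100.0 * (length - singletons - reliable) / length


for gene, row in TABLE_3_1.items():
    print(f"{gene:5s} {randomized_percent(*row):6.2f}")
```

Under this assumption the computed percentages agree with the tabulated ones to two decimals, which also makes the range of 26% to 37% quoted in the text easy to check.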

briefly in the caption of Tab. 3.2; they are available for download at http://www.bioinf.uni-leipzig.de/Publications/SUPPLEMENTS/06-013/.

Figure 3.7 summarizes the results for alignments of mitochondrial protein-coding genes. The other data sets, omitted here, show the same qualitative behaviour. Table 3.1 shows that the fraction of effectively randomized sites varies considerably (from 26% to 37%) between different proteins, even in the relatively benign case of mitochondrial genomes (Simon et al., 1994). As expected, the homoplasy index decreases significantly while the rescaled consistency index increases with increasing values of $q_{\mathrm{cutoff}}$. Similarly, the scaled log-likelihood values increase with the fraction of excluded randomized alignment sites. Note that while the tree-quality indices improve consistently, indicating that the reconstructions become more stable, the absolute values of the quality indices nevertheless depend strongly on the size and quality of the input alignments.

Figure 3.7: Dependency of tree-quality indices on the cutoff value $q_{\mathrm{cutoff}}$ for data set Sample 6. The stability of the trees is measured by the scaled log-likelihood $(\ln L)/n$, the homoplasy index (HI) (Kluge and Farris, 1969), and the rescaled consistency index (RC) (Farris, 1989) as computed by PAUP* 4.0b10. Data sets are alignments (supplied in the electronic supplement) of individual mitochondrial protein-coding genes. They vary in size (from about 170 to 1800 nt) and randomization.

Table 3.2: Comparison between original and reduced alignments

Data set    raw   cutoff q = 0.8  improvement (%)
Sample 1  46.64            67.71             45.1
Sample 2  70.89            71.36              0.6
Sample 3  75.03            76.89              2.5
Sample 4  82.28            84.67              2.9
Sample 5  84.79            86.25              1.7
Sample 6  90.28            89.83             -0.5

In the reduced alignments, sites with $q < 0.8$ are removed. Average bootstrap-support values (1000 replicates) are computed for neighbor-joining trees. Real data sets: Sample 1: complete coding sequence of the mitochondrial gene cox1 from 17 aquatic adephagan beetles (G. Fritzsch, unpublished); Sample 2: 48 nuclear 18S RNA genes from basal deuterostomes (Cameron et al., 2000), available from http://chuma.cas.usf.edu/~garey/alignments/alignment.html; Sample 3: combined data set of 12S rRNA and the amino-acid coding gene ND1 from 41 Andean frogs (Lehr et al., 2005); Sample 4: complete set of protein-coding mitochondrial genes from 46 selected arthropods (see electronic supplement); Sample 5: 12S, 16S, and ND1 sequences of 30 jewel beetles (Bernhard et al., 2005); Sample 6: complete set of protein-coding mitochondrial genes from all 31 currently available Squamata (see electronic supplement).

Hillis and Huelsenbeck (1992) suggested another method to estimate the phylogenetic information content of an alignment: they determined the skewness test statistic $g_1$ of the corresponding tree-length distribution. I analyzed the data with the random-tree option implemented in PAUP* 4.0b10. For each data matrix, I generated 100,000 trees at random from all possible tree topologies (with replacement). The results are consistent with the tree statistics discussed above. As expected, $g_1$ becomes more negative with increasing values of $q_{\mathrm{cutoff}}$, at least as long as one does not start to remove too many informative sites (data not shown).
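To make the skewness statistic concrete, the following minimal sketch computes the usual moment-based sample skewness $g_1 = m_3 / m_2^{3/2}$ of a list of tree lengths. The tree lengths are invented for illustration; in practice they would be the lengths of the randomly sampled trees reported by the phylogenetics program.

```python
def skewness_g1(values):
    """Moment-based sample skewness g1 = m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical tree-length samples (not taken from the thesis data).
left_skewed = [100, 118, 119, 120, 120, 121, 121, 122]  # a few much shorter trees
symmetric = [98, 99, 100, 101, 102]

print(skewness_g1(left_skewed))  # negative: long tail towards short trees
print(skewness_g1(symmetric))    # zero for a symmetric sample
```

A strongly negative $g_1$ means that only few trees are much shorter than the bulk of random trees, which is the pattern interpreted as phylogenetic signal.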

An alternative measure for the stability of a phylogenetic reconstruction is the bootstrap support for trees, in this case computed with the neighbor-joining method (Saitou and Nei, 1987). Table 3.2 summarizes changes in average bootstrap support for neighbor-joining trees computed using PAUP* 4.0b10 and 2000 bootstrap replicates (Felsenstein, 1985; Efron et al., 1996). Usually, there is a small but significant improvement of a few percent. However, in data sets that lead to very weakly supported phylogenies the improvement can be quite dramatic, as in the case of Sample 1.
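The improvement column of Table 3.2 appears to be the relative change in average bootstrap support. The sketch below assumes this definition (it is not stated explicitly in the text) and reproduces the tabulated values up to rounding.

```python
# Table 3.2: data set -> (raw support, support at cutoff q = 0.8)
TABLE_3_2 = {
    "Sample 1": (46.64, 67.71),
    "Sample 2": (70.89, 71.36),
    "Sample 3": (75.03, 76.89),
    "Sample 4": (82.28, 84.67),
    "Sample 5": (84.79, 86.25),
    "Sample 6": (90.28, 89.83),
}


def relative_improvement(raw, reduced):
    """Relative change of bootstrap support in percent (assumed definition)."""
    return 100.0 * (reduced - raw) / raw


for name, (raw, reduced) in TABLE_3_2.items():
    print(f"{name}: {relative_improvement(raw, reduced):+.1f} %")
```

Only Sample 6, which already has very high support in the raw alignment, comes out slightly negative.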

In order to study the effect of removing putative homoplastic sites in a more systematic way, my coworker S. Prohaska and I generated artificial data sets for caterpillar and balanced trees with 4 to 29 taxa using dawg (DNA Assembly With Gaps) (Cartwright, 2005). Fig. 3.8 shows the variation of the bootstrap support relative to the cutoff value $q$. The relative average bootstrap support of phylogenetic trees is the ratio of the average bootstrap support for the modified alignments to the bootstrap support obtained from the original alignment. Pairs of caterpillar and balanced trees with the same number of taxa were constructed so that (a) all leaves have the same evolutionary distance from the root and (b) all internal edges as well as all edges leading to leaves with maximal depth (maximal number of internal nodes on the path to the root) have the same 'unit length'. This unit length is set to 0.4 substitutions per site in the balanced trees. In the caterpillar trees the 'unit length' is scaled so that the total length equals that of the balanced tree with the same number of species. For each tree, we used dawg to generate 100 independent alignments using the following parameters: alignment length 800 nt, GTR model with $\gamma = 0.5$ and $\iota = 0.1$, and dawg's default substitution matrix for the GTR model.

Figure 3.8: The relative average bootstrap support of phylogenetic trees is computed as the ratio of the average bootstrap support for the modified alignments to the bootstrap support obtained from the original alignment. Values larger than 1 indicate an increase in tree quality. The curves show a distinct maximum that depends on the number of taxa and the topology of the tree. The maximum improvement increases with the number of taxa (indicated on the right margin of both panels for the highlighted curves). For clarity, error bars obtained from 100 replicates are shown only for $N = 10$ and $N = 25$ taxa. The tree topologies, caterpillar trees on the left and balanced trees on the right, are depicted by the insets.
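Conditions (a) and (b) on the guide trees can be illustrated for the caterpillar case. The sketch below writes such a tree in Newick notation, a widely used text format for trees; the function name and the leaf labels t1, t2, ... are hypothetical, and the rescaling of the unit length to match the total length of the balanced tree is deliberately left out.

```python
def caterpillar_newick(n_taxa, unit):
    """Newick string of a rooted caterpillar tree with equidistant leaves.

    All internal edges and the two deepest leaf edges have length `unit`;
    every other leaf edge is stretched so that each leaf lies at distance
    (n_taxa - 1) * unit from the root.
    """
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    tree = f"(t1:{unit:g},t2:{unit:g})"  # the deepest cherry
    for k in range(3, n_taxa + 1):
        # attach leaf k one level closer to the root
        tree = f"({tree}:{unit:g},t{k}:{(k - 1) * unit:g})"
    return tree + ";"


print(caterpillar_newick(4, 0.4))
```

For four taxa and a unit length of 0.4, every root-to-leaf path in the resulting string sums to 1.2, so condition (a) holds by construction.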

We observed a pronounced maximum of bootstrap support whose position and value, however, depend strongly on both the number of taxa and the topology of the tree. For small values of $q_{\mathrm{cutoff}}$ the tree stability increases because only the noisiest sites are removed. (In contrast, tree stability decreases immediately when randomly chosen alignment columns are removed; data not shown.) For large values of $q_{\mathrm{cutoff}}$, tree stability starts to decrease again because noisy starts to remove too many informative sites.

Empirically, for large data sets I found that $q_{\mathrm{cutoff}} \approx 0.8$ is a good compromise between these two effects. For small data sets with fewer than 15 taxa I found no improvements except for rather small $q_{\mathrm{cutoff}}$ values. This reflects the fact that, for small data sets, there are only few possible values of $\nu(C, \chi)$, and implies that noisy should be used only for at least moderately large data sets.

In general, the caterpillar trees admit larger improvements in bootstrap support than the balanced ones. I remark that the balanced trees are almost correctly reconstructed, while the caterpillar trees are poorly reconstructed, in particular at the deep nodes (data not shown).

3.5 Conclusion

It has been argued repeatedly that saturated (homoplastic) characters are detrimental to phylogeny reconstruction and, thus, should be removed from multiple sequence alignments, see e.g. Wägele (2005). Since homoplasy is defined relative to the unknown true tree, it is not obvious how to reliably identify the homoplastic characters without prior knowledge of that tree. Here, I show that cyclic orderings can be obtained robustly, e.g. from pairwise distance data, without detailed knowledge of the correct phylogenetic relationships. Given a circular ordering that is consistent with a phylogeny, the variation of character states of a given site along the circle is used to determine the (putative) degree of its randomization. This information can then be used to prune the sequence alignment. The computer program noisy implements this procedure.

High substitution rates that are not equally distributed among the sites of a sequence, caused e.g. by sequence constraints due to environmental pressure, can produce a considerable amount of phylogenetic noise in the data and thus so-called 'bad', phylogenetically misleading alignments. Such alignments can be improved by increasing the signal-to-noise ratio through the exclusion of noisy sites. Alignment modifications, like the concatenation of conserved blocks, are known to improve phylogenetic analysis and, carried out manually, are common practice. However, manual improvements are almost impossible for large and/or diverse alignments and typically make it hard to reproduce the results later on. Furthermore, they are not immune to the effects of wishful thinking. In contrast, a method such as noisy provides an essentially deterministic and unbiased solution.

It is important to note that 'good' alignments cannot be further improved by reducing the alignment length. While distance-based methods for phylogenetic reconstruction in particular are relatively robust and can tolerate a good fraction of phylogenetically uninformative sites (see in particular Ogden and Rosenberg (2006)), a high absolute number of informative sites is necessary to obtain reliable trees.

The analysis of artificial data sets suggests a set of simple rules that enable the user to decide under which conditions it makes sense to use noisy to process multiple sequence alignments prior to using them for phylogenetic reconstruction:

(1) If the original alignment already yields trees with very high average bootstrap support, there is nothing to be gained from this method.

(2) Data sets with fewer than about 10 taxa are unlikely to be improved.

(3) The cutoff value of $q$ depends on the tree topology and in particular on the number of taxa. It pays off to determine the maximum of the gain as a function of $q$ and to use the corresponding optimal cutoff value.
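Rule (3) amounts to a simple scan over candidate cutoff values. The sketch below is only an illustration: the support values are invented, and the function assumes that the average bootstrap support has already been computed for each reduced alignment by whatever tree-building pipeline is in use.

```python
def best_cutoff(support_by_q, raw_support):
    """Return (q, gain) for the cutoff with the largest relative support.

    `support_by_q` maps a cutoff value q to the average bootstrap support
    of the tree built from the alignment reduced at that cutoff.
    """
    gains = {q: s / raw_support for q, s in support_by_q.items()}
    q_best = max(gains, key=gains.get)
    return q_best, gains[q_best]


# Hypothetical scan (values invented for illustration only).
scan = {0.2: 71.0, 0.4: 74.5, 0.6: 78.0, 0.8: 80.5, 0.9: 76.0}
q_opt, gain = best_cutoff(scan, raw_support=70.0)
print(q_opt, round(gain, 3))
```

A gain close to 1 for every cutoff corresponds to rule (1): the original alignment was already good and nothing is to be gained.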

The analysis of several published data sets shows that removal of randomized sites consistently leads to more stable trees, irrespective of the method used for phylogeny reconstruction (neighbor joining, maximum parsimony, or maximum likelihood). While in benign data sets the effects on consistency indices, likelihood score, or bootstrap support are typically small and I do not observe changes in the reconstructed tree topologies, the effects of removing homoplastic sites can become dramatic for poor data sets, as the example of the cox1 genes of Sample 1 (Table 3.2) demonstrates. In some cases, the reconstructed tree topologies can be improved as well, see e.g. the example of the sea urchin phylogeny in Figure 3.6.

In this chapter I outlined a novel way to handle difficult pre-computed alignments by minimizing the number of randomized sites. In contrast to manual manipulation of alignments, reducing data sets using noisy is transparent and easy to reproduce. Randomized sites are, at best, phylogenetically uninformative or, in the worst case, misleading. Circular orderings allow homoplastic characters to be detected in a two-stage approach: in the first step, one constructs a circular ordering that minimizes the fraction of 'noisy' sites (as in Figure 3.3). In the second step, one then constructs the tree implied by the alignment obtained after elimination of all sites that appear to be highly randomized relative to that circular ordering.


CHAPTER 4

Gene Order Rearrangements

Mitochondrial genomes provide a valuable data set for phylogenetic studies, in particular of metazoan phylogeny, because of the extensive taxon sampling that is available. Beyond the traditional sequence-based analysis it is possible to extract additional phylogenetic information from the gene order. Gene order data present significant mathematical challenges not encountered when dealing with sequence data. Many evolutionary events may affect the gene order and gene content of a genome, and each of these events creates its own challenges (Moret et al., 2004).

Figure 4.1: Every vertical pair of arrows marks a breakpoint, so these two gene orders differ by two breakpoints.

In the typical case, metazoan mitochondrial genomes are circular and contain thirteen protein-coding genes, two rRNAs, twenty-two tRNAs, and one control region. A common method to compare the gene orders of two species is to compute a succession of genome rearrangement operations that transforms one order into the other. The focus of interest is most often on parsimonious rearrangement scenarios, which use a minimal number of operations. Among the most frequently considered rearrangement operations are inversions (called reversals in mathematics and bioinformatics), which are permutations that reverse the order of a contiguous segment of genes and change the sign of each reversed gene.

Commonly, methods for phylogenetic reconstruction from gene order can be divided into three classes: (i) distance-based methods, (ii) parsimony-based methods, and (iii) likelihood-based methods.

4.1 Breakpoint Distance

A breakpoint is an adjacency present in one genome, but not in the other. The breakpoint distance is then the number of breakpoints present; this measure is easily computed in linear time. However, it does not directly reflect rearrangement events, but only their final outcome. Figure 4.1 shows two breakpoints between two strings (e.g. genomes). Note that the gene subsequence 3 4 5 is identical to -5 -4 -3, since the latter is just the former read on the complementary strand (Moret et al., 2004).
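The definition above translates directly into a few lines of code. The following sketch treats genomes as lists of signed integers and identifies an adjacency (a, b) with its reverse-strand reading (-b, -a); the example genomes are hypothetical and are not the ones drawn in Figure 4.1.

```python
def adjacencies(order, circular=False):
    """Set of signed adjacencies, with (a, b) and (-b, -a) identified."""
    pairs = list(zip(order, order[1:]))
    if circular:
        pairs.append((order[-1], order[0]))
    # store each adjacency in a canonical form shared by both strands
    return {min((a, b), (-b, -a)) for a, b in pairs}


def breakpoint_distance(genome1, genome2, circular=False):
    """Number of adjacencies of genome1 that are absent from genome2."""
    return len(adjacencies(genome1, circular) - adjacencies(genome2, circular))


# Hypothetical example: the block 3 4 5 is inverted in the second genome.
g1 = [1, 2, 3, 4, 5, 6]
g2 = [1, 2, -5, -4, -3, 6]
print(breakpoint_distance(g1, g2))
```

In this example the adjacencies inside the inverted block are conserved, because -5 -4 -3 is 3 4 5 read on the other strand; only the two adjacencies at the block boundaries are broken.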

4.2 Inversion Distance

Given two signed gene orders of equal content, the inversion distance is simply the edit distance when inversion is the only operation allowed. Even though we have to consider only one type of rearrangement, this distance is very difficult to compute (Moret et al., 2004). For unsigned permutations, in fact, the problem is NP-hard (nondeterministic polynomial-time hard, see Appendix A). For signed permutations, however, it can be computed in linear time (Bader et al., 2001), using the theoretical results of Hannenhalli and Pevzner (1995).
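For very small signed permutations the inversion distance can be found by exhaustive search, which makes the definition tangible. The sketch below is a brute-force breadth-first search over linear signed permutations; it is not the linear-time algorithm of Bader et al. (2001), and it is only practical for a handful of genes because the search space grows as $2^n n!$.

```python
from collections import deque


def reversals(perm):
    """Yield every permutation reachable from `perm` by one reversal."""
    n = len(perm)
    for i in range(n):
        for j in range(i, n):
            segment = tuple(-g for g in reversed(perm[i:j + 1]))
            yield perm[:i] + segment + perm[j + 1:]


def inversion_distance_bfs(source, target):
    """Minimum number of reversals turning `source` into `target`."""
    source, target = tuple(source), tuple(target)
    if sorted(map(abs, source)) != sorted(map(abs, target)):
        raise ValueError("gene orders must have equal gene content")
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        perm, dist = queue.popleft()
        if perm == target:
            return dist
        for nxt in reversals(perm):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise RuntimeError("target not reachable")


print(inversion_distance_bfs([1, 2, 3, 4, 5, 6], [1, 2, -5, -4, -3, 6]))
```

Because every reversal is its own inverse, the distance returned by this search is symmetric in its two arguments.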

4.3 Parsimony Approaches

Of particular interest are most parsimonious rearrangement scenarios, which use a minimal number of rearrangement operations. Parsimony approaches to gene order reconstruction fall into two subcategories: first, encoding methods, which reduce gene order problems to sequence problems, and second, direct methods, which run optimization algorithms directly on the gene orders (Moret et al., 2004).


4.3.1 Encoding Methods

The running times of direct optimization approaches are exponential in the number of genomes and the number of genes. Therefore an approach that, while remaining exponential in the number of genomes, takes polynomial time in the number of genes may be of interest. For this reason, gene order data are reduced to sequence data through some type of encoding.

Maximum Parsimony on Binary Encoding (MPBE) (Cosner et al., 2000a,b) is an example of such a method. MPBE produces one character for each gene adjacency present in the data. If genes $i$ and $j$ occur as the adjacent pair $i\ j$ (or $-j\ -i$) in one of the genomes, a binary character is set up to indicate the presence or absence of this adjacency (coded 1 for presence and 0 for absence). The position of a character within the sequence is arbitrary, as long as it is the same for all genomes. By definition, there are at most $2n^2$ characters, so that the sequences are of length polynomial in the number of genes. Analyses using maximum parsimony will therefore run in time polynomial in the number of genes, but may require time exponential in the number of genomes.
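The binary encoding described above can be sketched as follows for circular genomes given as lists of signed integers. The function names and the two toy genomes are hypothetical; the column order is simply the sorted order of the canonical adjacencies, which is one arbitrary but consistent choice.

```python
def signed_adjacencies(order):
    """Adjacencies of a circular signed gene order, strand-independent."""
    n = len(order)
    adj = set()
    for idx in range(n):
        a, b = order[idx], order[(idx + 1) % n]
        adj.add(min((a, b), (-b, -a)))  # (a, b) is the same as (-b, -a)
    return adj


def mpbe_encode(genomes):
    """Return (characters, matrix) for a dict name -> circular gene order."""
    per_genome = {name: signed_adjacencies(order)
                  for name, order in genomes.items()}
    characters = sorted(set().union(*per_genome.values()))
    matrix = {name: "".join("1" if c in adj else "0" for c in characters)
              for name, adj in per_genome.items()}
    return characters, matrix


chars, matrix = mpbe_encode({"A": [1, 2, 3, 4], "B": [1, 2, -4, -3]})
for name, row in matrix.items():
    print(name, row)
```

The resulting 0/1 strings can be handed to any maximum parsimony program that accepts binary characters.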

However, while a parsimony analysis relies on independence among characters, the characters produced by MPBE are emphatically dependent; moreover, translating the evolutionary model of gene orders into a matching model of sequence evolution for the encoding is quite difficult. Note that this method suffers from several problems: (i) the ancestral sequences produced by the reconstruction method may not be valid encodings; (ii) none of the ancestral sequences can describe adjacencies not already present in the input data, thus limiting the possible rearrangements; and (iii) genomes must have equal gene content with no duplication (Moret et al., 2004).

Another example is the Maximum Parsimony on Multistate Encoding (MPME) method (Wang et al., 2002). This method uses exactly one character for each signed gene (thus $2n$ characters in all). The state of a character is the signed gene that follows it in the gene ordering (in the direction indicated by the sign). The position of each character within the sequence is arbitrary as long as it is consistent across all genomes, although it is most convenient to think of the $i$th character (with $i \le n$) as associated with gene $i$, and of the $(n+i)$th character as associated with gene $-i$. For instance, the circular gene order (1, -4, -3, -2) gives rise to the encoding (-4, 3, 4, -1, 2, 1, -2, -3) (Moret et al., 2004).
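The multistate encoding can be checked against the example given in the text. The sketch below assumes genes are numbered 1 to n and that the genome is circular; the function name is illustrative.

```python
def mpme_encode(order):
    """Multistate encoding of a circular signed gene order.

    Character i (1 <= i <= n) holds the successor of gene i, character
    n + i holds the successor of gene -i, each read in the direction in
    which that signed gene appears.
    """
    n = len(order)
    successor = {}
    for idx, gene in enumerate(order):
        nxt = order[(idx + 1) % n]
        successor[gene] = nxt    # forward strand: gene is followed by nxt
        successor[-nxt] = -gene  # reverse strand: -nxt is followed by -gene
    return ([successor[i] for i in range(1, n + 1)]
            + [successor[-i] for i in range(1, n + 1)])


print(mpme_encode([1, -4, -3, -2]))
```

For the circular order (1, -4, -3, -2) this yields (-4, 3, 4, -1, 2, 1, -2, -3), in agreement with the encoding quoted above.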

The analyses of Moret et al. indicate that the MPME method dominates the MPBE method. However, both methods still suffer from some of the same problems, as MPME also requires equal gene content with no duplication and can likewise create invalid encodings.

4.3.2 Direct Optimization

Sankoff and Blanchette (1998) proposed to reconstruct the breakpoint phylogeny. The breakpoint phylogeny favours the tree and the ancestral gene orders which together minimize the total number of breakpoints along all edges of the tree. This problem includes the breakpoint median problem as a special case, so that it is NP-hard even for fixed trees. Consequently, Sankoff and Blanchette (1998) suggested a heuristic called BPAnalysis, based on iterative improvement, for scoring a fixed tree and simply decided to examine all possible trees. The BPAnalysis heuristic can be summarized as follows:

For each possible tree do
    Initially label all internal nodes with gene orders
    Repeat
        For each internal node v, with neighbours labelled A, B, and C, do
            Solve the median problem on A, B, and C to yield label M
            If relabelling v with M improves the score of T, then do it
    until no internal node can be relabelled

This method is practicable for small data sets only, because of its enormous computational cost. BPAnalysis is expensive at every level. One main problem is the innermost loop, which repeatedly solves the breakpoint median problem, an NP-hard problem. Furthermore, the labelling procedure runs until no improvement is possible, and may therefore use a large number of iterations. Finally, the labelling procedure is applied to every possible tree topology, of which there is an exponential number.

For example, the number of unrooted, unordered trees on $n$ labelled leaves is $(2n-5)!!$ (double factorial), that is, $(2n-5)!! = (2n-5) \cdot (2n-7) \cdot (2n-9) \cdots 5 \cdot 3$. For just 13 genomes, we obtain about 13.7 billion trees; for 20 genomes, there are so many trees that merely counting to that value would take thousands of years on the fastest supercomputer (Moret et al., 2004).
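The growth of the number of tree topologies is easy to evaluate exactly. The following short sketch computes $(2n-5)!!$ with integer arithmetic; the function name is illustrative.

```python
def num_unrooted_trees(n_leaves):
    """Number of unrooted binary trees on n labelled leaves: (2n - 5)!!."""
    if n_leaves < 3:
        raise ValueError("defined for at least three leaves")
    count = 1
    for factor in range(3, 2 * n_leaves - 4, 2):  # 3, 5, ..., 2n - 5
        count *= factor
    return count


for n in (4, 5, 10, 13, 20):
    print(n, num_unrooted_trees(n))
```

Already for 20 genomes the count has 21 digits, which is why exhaustive enumeration of topologies is feasible only for small taxon sets.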

Moret et al. (2001) reimplemented the BPAnalysis heuristic and made extensive use of algorithmic engineering techniques (Moret et al., 2002) to speed it up. Moreover, they added the use of inversion distance in order to produce inversion phylogenies. The result of this reimplementation is GRAPPA (Genome Rearrangement Analysis under Parsimony and other Phylogenetic Algorithms) (Moret et al., 2001).

GRAPPA iterates over all possible tree topologies. For each topology, GRAPPA initialises the internal nodes. In the next step, for each internal node $\sigma$ a median $\mu$ of its three neighbours $\pi_1$, $\pi_2$, $\pi_3$ is computed, and $\sigma$ is replaced by $\mu$ if this improves the score of the tree. This procedure is repeated until no internal node can be relabelled. To solve the median problems arising in this procedure, GRAPPA offers several methods, including Siepel's median solver (Siepel and Moret, 2001) and Caprara's median solver (Caprara, 2003). Both approaches are branch-and-bound algorithms which solve the problem to optimality.

GRAPPA does not return optimal solutions, although it iterates over all possible tree topologies and solves each median problem optimally. This is because it cannot assign the labels of the internal nodes in an optimal way.

Over the years many different methods have been developed to use gene order information for the reconstruction of phylogenetic relationships. However, there are still numerous problems to deal with. In the following I introduce new approaches which provide a first step towards the solution of three main problems: (1) comparing gene order strings of different lengths; (2) using common reconstruction methods on gene orders; and, as a final step, (3) using genome arrangements to reconstruct phylogenetic relationships.


4.4 The circal algorithm

Here I present a novel approach to utilizing these data, based on cyclic list alignments of the gene orders. This method was developed in cooperation with Prof. Peter F. Stadler. A progressive alignment approach is used to combine pairwise list alignments into a multiple alignment of gene orders. Parsimony methods are then used to reconstruct phylogenetic trees, ancestral gene orders, and consensus patterns in a straightforward way. This method was applied to the study of the metazoan phylogeny based exclusively on mitochondrial gene arrangements. Furthermore, I will demonstrate that the approach is also applicable to the much larger genomes of chloroplasts.

This section first introduces the cyclic sequence alignment problem and describes a polynomial-time solution for arbitrary cost functions. Afterwards, the problem of directionality and the occurrence of duplicate mitochondrial genes are addressed. In the final subsection I consider various possibilities of extracting consensus gene arrangements from cyclic alignments.

4.4.1 Cyclic alignments

An alignment $A$ of two strings $x$ and $y$ is a sequence of pairs of the form $(x_i, y_j)$, $(x_i, -)$, and $(-, y_j)$ that preserves the order of sequence positions in both $x$ and $y$. A pair $(x_i, y_j)$ corresponds to a substitution of $x_i$ by $y_j$, a pair $(x_i, -)$ represents the deletion of $x_i$, and $(-, y_j)$ is the insertion of $y_j$. A maximal subsequence consisting of deletions $(x_i, -), (x_{i+1}, -), \ldots, (x_{i+q-1}, -)$ will be referred to as the deletion of the substring $x[i, i+q-1]$ of length $q$, and analogously for insertions. We consider here a cyclic variant $A^{\circ}$ of the alignment in which insertions and deletions of substrings are allowed to "wrap around" the ends of the alignment, so that e.g. $(x_1, -)$ and $(x_n, -)$ are part of the same deleted substring.

With each alignment we associate a cost function. We distinguish substitution costs $s(a, b)$ between two letters $a$ and $b$ and costs of insertions and deletions $g(a)$ for a substring $a$. The cost function $g(a)$ is called the gap cost function. The total cost of an alignment is the sum of the costs of the individual edit operations. We call the cost model additive if the gap cost function is additive:

$$
g(a) = \sum_{a_i \in a} g(a_i) \tag{4.1}
$$

Note that for additive gap costs the cost $f(A)$ of an alignment $A$ and the cost $f(A^{\circ})$ of its cyclic variant $A^{\circ}$ are the same. Gap costs have to be sub-additive, $g(a \cup b) \le g(a) + g(b)$. It follows that in general $f(A) \ge f(A^{\circ})$, since a "wrap-around" gap is no more expensive than two separate end-gaps.

In the context of cyclic alignments one naturally considers the strings themselves as cyclic (Bunke and Bühler, 1993; Gregor and Thomason, 1993; Maes, 1990; Mollineda et al., 2002). Formally, cyclic strings are usually introduced as equivalence classes w.r.t. the cyclic shift operator $\sigma$ that rotates a string by one position: $\sigma(x) = (x_2, \ldots, x_{n-1}, x_n, x_1)$. The cyclic string associated with an ordinary string $x$ is thus the equivalence class $[x] = \{x, \sigma(x), \sigma^2(x), \ldots, \sigma^{n-1}(x)\}$. An alignment of two cyclic strings is simply a cyclic alignment of two representatives $\sigma^k(x)$ and $\sigma^l(y)$ of $[x]$ and $[y]$. Of course we are interested in those representatives that yield the optimal alignment, i.e. that minimize

$$
f\bigl(A^{\circ}([x],[y])\bigr) = \min_{k,l} f\bigl(A^{\circ}(\sigma^k(x), \sigma^l(y))\bigr) \tag{4.2}
$$

where $A^{\circ}(p, q)$ denotes the cost-optimal cyclic alignment of the (non-cyclic) strings $p$ and $q$. This problem can be solved in $O(|x||y| \log(|x| + |y|))$ time and quadratic space in the case of additive cost functions (Maes, 1990; Gregor and Thomason, 1993), see also Landau et al. (1998). Unfortunately, this approach does not generalize to the problem at hand, which requires us to consider arbitrary cost functions.
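The shift operator and the equivalence class of a string can be written down in a few lines. The sketch below represents strings as tuples (or Python strings) and enumerates all distinct rotations; the helper names are illustrative only.

```python
def shift(x):
    """Cyclic shift: move the first element to the end."""
    return x[1:] + x[:1]


def cyclic_class(x):
    """All distinct rotations of x, i.e. the equivalence class [x]."""
    x = tuple(x)
    rotations, current = [], x
    for _ in range(len(x)):
        if current not in rotations:
            rotations.append(current)
        current = shift(current)
    return rotations


def same_cyclic_string(x, y):
    """True if y is a rotation of x."""
    return tuple(y) in cyclic_class(x)


print(cyclic_class("ABCD"))
print(same_cyclic_string("ABCD", "CDAB"))
```

Periodic strings have fewer than n distinct rotations, which is why the class is built as a list of distinct elements rather than assumed to have size n.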

A solution to the general problem can still be obtained with quadratic memory and in polynomial time. First we note that the optimal cyclic alignment of $[x]$ and $[y]$ is either the trivial alignment with cost $g(x) + g(y)$, in which $[x]$ is deleted and $[y]$ is inserted (this is cheaper than deleting $[x]$ and inserting $[y]$ in multiple intervals because of the subadditivity of the gap cost function), or it contains at least one pair of match positions, say $x_p$ and $y_q$. The cost $f(A_{pq})$ of such an alignment is given by

$$
f(A_{pq}) = s(x_p, y_q) + f\bigl(A(\sigma^p(x)[2..|x|], \sigma^q(y)[2..|y|])\bigr). \tag{4.3}
$$

Recall that the first position of $\sigma^p(x)$ is $x_p$, so that we consider a non-cyclic alignment which in the very first position has the match $(x_p, y_q)$, followed by the optimal alignment of the remainder of the rotated strings $\sigma^p(x)$ and $\sigma^q(y)$. Since the first position is a match, a possible end-gap cannot wrap around, so that we have to consider an optimal non-cyclic alignment $A$ of the substrings $\sigma^p(x)[2..|x|]$ and $\sigma^q(y)[2..|y|]$, see Figure 4.2. Each of these alignments can be computed in $O(|x| \cdot |y| \cdot \max(|x|, |y|))$ operations with arbitrary cost functions, see e.g. Dewey (2001), or in $O(|x| \cdot |y|)$ operations for affine gap cost functions (Gotoh, 1982). Thus, we can compute the optimal cyclic alignment by computing $A_{pq}$ for all pairs $(p, q)$ and taking the cheapest of these alignments and the trivial alignment.
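The construction just described can be sketched compactly. The code below is a naive illustration and not the circal implementation: it uses a textbook dynamic program for linear alignment with a general gap cost function, anchors a match at every pair of rotations, and compares the result with the trivial alignment. The match and gap functions at the bottom are toy assumptions chosen only to make the example easy to follow.

```python
from math import inf


def linear_alignment_cost(x, y, s, g):
    """Optimal non-cyclic alignment cost with arbitrary gap costs g(substring)."""
    n, m = len(x), len(y)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = inf
            if i > 0 and j > 0:  # (mis)match of x[i-1] with y[j-1]
                best = D[i - 1][j - 1] + s(x[i - 1], y[j - 1])
            for k in range(i):   # deletion of the block x[k:i]
                best = min(best, D[k][j] + g(x[k:i]))
            for k in range(j):   # insertion of the block y[k:j]
                best = min(best, D[i][k] + g(y[k:j]))
            D[i][j] = best
    return D[n][m]


def cyclic_alignment_cost(x, y, s, g):
    """Optimal cyclic alignment cost: anchor a match at every pair (p, q)."""
    best = g(x) + g(y)  # trivial alignment: delete x, insert y
    for p in range(len(x)):
        xr = x[p:] + x[:p]
        for q in range(len(y)):
            yr = y[q:] + y[:q]
            anchor = s(xr[0], yr[0])
            if anchor == inf:
                continue
            rest = linear_alignment_cost(xr[1:], yr[1:], s, g)
            best = min(best, anchor + rest)
    return best


def match(a, b):
    """Toy score: identical symbols align for free, others never align."""
    return 0 if a == b else inf


def gap(block):
    """Toy sub-additive gap cost: one-time cost 2 plus 1 per symbol."""
    return 2 + len(block)


print(linear_alignment_cost("ABCD", "BC", match, gap))
print(cyclic_alignment_cost("ABCD", "BC", match, gap))
```

In the toy example the linear alignment has to pay for two separate end-gaps (A and D), whereas the cyclic alignment merges them into a single wrap-around gap, which is exactly the effect expressed by $f(A) \ge f(A^{\circ})$.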



4.4.2 Encoding of Mitochondrial Genomes

Mitochondrial genomes can clearly be regarded as cyclically ordered lists of genes. In addition, however, the genes are oriented according to whether they are located on the heavy or the light strand. This is taken into account in the framework of list alignments by considering the same gene in different orientations as different objects (Figure 4.3).

Duplication and deletion of genes in mitochondrial genomes occurred frequently during the evolution of the Metazoa. In recent years, Paul Higgs and co-workers presented compelling evidence that duplication of a tRNA-Leu gene, followed by anticodon mutation and subsequent deletion of tRNA-Leu genes, has occurred at least five times during the evolution of the Metazoa (Higgs et al., 2003). Animal mitochondrial genomes, for example, usually have two transfer RNAs each for leucine and serine. While such duplicate genes are problematic in permutation-based approaches, they are naturally described in the list alignment model used here: there are two tRNA-Leu genes in the cyclic list. We simply use different symbols, 'L1' and 'L2', for the tRNA-Leu genes.
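The encoding described in this subsection maps naturally onto plain token lists. The sketch below reads records in the layout shown in Figure 4.3 (a header line starting with '>' followed by whitespace-separated gene symbols, with a leading '-' marking the opposite strand); this layout is inferred from the figure, and the function names as well as the shortened example record are illustrative assumptions.

```python
def parse_gene_orders(text):
    """Parse '>name' headers followed by gene tokens into a dict of lists."""
    genomes = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            genomes[name] = []
        elif name is not None:
            genomes[name].extend(line.split())
    return genomes


def flip(token):
    """Symbol of the same gene in the opposite orientation."""
    return token[1:] if token.startswith("-") else "-" + token


# Shortened excerpt of one record, for illustration only.
example = """>NC_001626 Petromyzon marinus
F 12S V 16S L2 ND1 I -Q M ND2
"""
orders = parse_gene_orders(example)
print(orders)
```

Because 'Q' and '-Q' are distinct tokens, and 'L1' and 'L2' are distinct symbols, orientation and duplicated tRNA genes need no special treatment in the alignment itself.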

4.4.3 Scoring Model

In the following we turn to a more detailed description of the scoring functions that underlie the pairwise linear alignments. The (mis)match scores are trivial because it does not make sense to align non-homologous genes, i.e. non-identical list entries: we simply have $s(x, y) = 0$ if $x = y$ and $s(x, y) = \infty$ if $x \neq y$. Our knowledge about the mechanisms of genomic rearrangements must therefore be incorporated into the indel scores. A further consequence is that only pairs of identical entries need to be considered as the anchoring match $(x_p, y_q)$, so that we have to compute only $O(D \max(|X|, |Y|))$ pairwise linear alignments, where $D$ is the maximum number of copies of a duplicated gene.

Figure 4.2: An alignment of two cyclic strings that contains the (mis)match $(x_p, y_q)$ is equivalent to a linear alignment of $\sigma^p(x)$ with $\sigma^q(y)$ with the constraint that $x_p$ and $y_q$ form a (mis)match. Note that, in contrast to the default of many alignment programs, "end-gaps" have to be scored here just like all other indels.

>NC_004419 Polyodon spathula
F 12S V 16S L2 ND1 I -Q M ND2 W -A -N -C -Y CO1 -S2 D CO2 K ATP8 ATP6 CO3 G ND3 R ND4L ND4 H S1 L1 ND5 -ND6 -E CYTB T -P
>NC_002639 Myxine glutinosa
F 12S V 16S L2 ND1 I -Q M ND2 W -A -N -C -Y CO1 -S2 D CO2 K ATP8 ATP6 CO3 G ND3 R ND4L ND4 H S1 L1 ND5 -ND6 -E CYTB T -P
>NC_001626 Petromyzon marinus
F 12S V 16S L2 ND1 I -Q M ND2 W -A -N -C -Y CO1 -S2 D CO2 K ATP8 ATP6 CO3 G ND3 R ND4L ND4 H S1 L1 ND5 -ND6 T -E CYTB -P
>NC_002177 Halocynthia roretzi
F G T ND6 L1 N G D CO3 ND4L C K 12S CO2 CYTB Y W I E ND2 H S1 R Q L2 ND5 M 16S ND1 ATP6 S2 CO1 ND3 A P ND4 V

Figure 4.3: Example of input data. The abbreviations of mitochondrial gene names are listed in Appendix A.

Figure 4.4: Effect of scores. Top: additive model with $\delta(P) = 20$, $\delta(r) = 10$, $\delta(t) = 3$, and $\eta(a) = 0$. Bottom: affine model with additive scores as above and $\eta(a_i) = 2\delta(a_i)$. As expected, the large one-time score, which essentially acts like a gap-open penalty, leads to alignments with a small number of large gaps. Protein-coding genes are shown in red/dark grey, tRNAs in yellow/light grey, rRNAs in orange, and gaps in black.

It is well known that tRNA genes are much more mobile than protein-coding mitochondrial genes (Boore, 1999). We therefore propose a scoring scheme that consists of two contributions for each inserted or deleted interval $a = [a_i, a_{i+1}, \ldots, a_{j-1}, a_j]$ of the cyclic list. We define (1) an additive score contribution $\delta(a_i)$, to which each deleted list entry $a_i$ contributes independently, and (2) a "one-time" contribution that allows us to distinguish between intervals that consist of tRNAs only and those that also contain proteins. This "one-time" score essentially plays the role of the gap-open penalty in the usual models of sequence alignment, see Figure 4.4. We find it convenient to define the one-time score $\eta(a_i)$ for each list entry individually and to compute the indel score for the interval $a$ as

\[ g(a) = \max_{i \le k \le j} \eta(a_k) + \sum_{k=i}^{j} \delta(a_k) \qquad (4.4) \]



The default scores for mitochondrial genomes distinguish between three types of genes: proteins $P$, ribosomal RNA genes $r$, and tRNAs $t$. The downside of this scoring model is that we are forced to use a computationally expensive algorithm to compute the linear list alignments. Default values for this scoring model have been chosen so that a number of test data sets with well-established phylogenies were well reconstructed from the resulting alignments. In the following we use

         P    r    t
    δ    4    3    3
    η    6    4    4          (4.5)
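As a small worked illustration of Equation (4.4) with the default scores (4.5), the following sketch computes the indel score of an interval from the types of its entries. The function name `indel_score`, the one-letter type codes, and the convention that an empty interval scores 0 are assumptions for this example and are not taken from the circal source.

```python
# Default scores from (4.5): P = protein, r = rRNA, t = tRNA.
DELTA = {"P": 4, "r": 3, "t": 3}   # additive contribution per entry
ETA = {"P": 6, "r": 4, "t": 4}     # one-time contribution per entry type


def indel_score(interval_types, delta=DELTA, eta=ETA):
    """Indel score g(a) of Equation (4.4) for a list of gene types.

    g(a) = max_k eta(a_k) + sum_k delta(a_k).
    An empty interval is given score 0 (an assumption of this sketch).
    """
    if not interval_types:
        return 0
    one_time = max(eta[t] for t in interval_types)
    additive = sum(delta[t] for t in interval_types)
    return one_time + additive


# An interval of three tRNAs: 4 + (3 + 3 + 3) = 13
print(indel_score(["t", "t", "t"]))
# The same length, but containing one protein: 6 + (3 + 4 + 3) = 16
print(indel_score(["t", "P", "t"]))
```

The one-time term depends only on the "most expensive" entry of the interval, so extending a gap that already contains a protein costs only the additive part.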

Alignment distances, of course, do not always correctly reproduce the relative ordering of distances with respect to reversals or transpositions. We do not consider this a problem, since the mechanisms of mitochondrial genome rearrangements are not very well understood. Nevertheless, one may use a scoring model that is dominated by the one-time score $\eta_k$, i.e. for which $\delta_k \ll \eta_k$, instead of the default scores of the circal program. Such a scoring scheme in essence counts the number of rearrangement events. It measures cut-and-paste events instead of reversals, however, i.e. it does not distinguish whether a translocated block is re-inserted in the same or in the reverse orientation, and it is insensitive to the location of reinsertion.

4.4.4 Multiple Cyclic Alignments

The pairwise cyclic alignment procedure outlined in the previous two subsections can be generalized to multiple sequences by means of the same progressive alignment approach that is used, for example, in clustalw (Thompson et al., 1994): We first compute all pairwise distances using the cyclic alignment procedure. From the resulting distance matrix we construct a guide tree, in our case using the WPGMA clustering method (Sokal and Michener, 1958). This guide tree is used to align profiles of aligned cyclic lists in the same way as individual lists. Finally, the scoring scheme is extended in the obvious way from individual lists to alignments: both the one-time score $\eta$ and the additive score $\delta$ for a profile position are computed as the sum over all entries in each column of the two profiles (alignments) that are to be combined. Equation (4.4) is then used to determine the indel score for an interval.
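The guide-tree step can be sketched as follows. This is a generic textbook WPGMA (the distance from a merged cluster to any other cluster is the plain average of the two former distances), written for illustration and not taken from the circal source. The function name `wpgma`, the nested-tuple tree, and the `frozenset`-keyed distance dictionary are assumptions of this example.

```python
def wpgma(names, dist):
    """Build a guide tree by WPGMA clustering.

    names: list of distinct taxon labels.
    dist:  dict mapping frozenset({a, b}) to the pairwise distance.
    Returns the tree as nested 2-tuples of labels.
    """
    clusters = list(names)
    d = dict(dist)
    while len(clusters) > 1:
        # Pick the closest pair of current clusters (first pair wins on ties).
        a, b = min(
            ((x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]),
            key=lambda pair: d[frozenset(pair)],
        )
        merged = (a, b)
        rest = [c for c in clusters if c != a and c != b]
        # WPGMA update: unweighted mean of the two old distances.
        for c in rest:
            d[frozenset((merged, c))] = (
                d[frozenset((a, c))] + d[frozenset((b, c))]
            ) / 2
        clusters = rest + [merged]
    return clusters[0]


distances = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 6.0,
    frozenset(("B", "C")): 8.0,
}
print(wpgma(["A", "B", "C"], distances))
```

In a progressive alignment, the inner nodes of such a tree would be visited bottom-up, aligning the two child profiles at each node.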



[Figure 4.5 (graphic not reproduced): alignment of the mitochondrial gene orders of the sampled taxa (rows, listed from Polyodon spathula to Lampetra fluviatilis); the columns are labelled with mitochondrial gene names.]

Figure 4.5: Graphical display of an alignment of vertebrate mitochondrial genome arrangements. Proteins, rRNAs, and tRNAs are shown in red, orange, and yellow, respectively. Most vertebrates share a common gene order. However, there are numerous small deviations that mostly involve transposed tRNA genes, in particular in the bird and reptile lineages, see e.g. Mindell et al. (1998) and Townsend and Larson (2002).

4.4.5 Implementation

The algorithm described here is implemented in the program package circal, which is written in ANSI C. The circal package is distributed under the GNU General Public License (GPL)¹. The current implementation of circal produces a nexus format file of the alignment as well as a graphical overview in PostScript format, see Figure 4.5. Datasets of about 30 rather diverse gene arrangements can be computed within about a quarter of an hour on a common PC (Linux operating system on a dual Pentium IV with two 2.4 GHz CPUs and 1 GByte RAM). The full protostome data set runs for about 1 day on the same PC.

¹ Download from http://www.bioinf.uni-leipzig.de/Publications/SUPPLEMENTS/04-015/.



4.4.6 Tree Reconstruction

Phylogenetic trees can be inferred from the cyclic list alignments by any of the usual approaches. One might simply use the multiple alignment to re-compute a pairwise distance matrix, possibly using a more sophisticated distance measure than those used for constructing the alignment. We do not pursue this approach here, since the alignment contains much more information than just the mutual distances. Maximum likelihood methods are applicable in principle (Larget and Simon, 2002), although it seems non-trivial to derive a good rate model for mitochondrial genome arrangements. We therefore resort to maximum parsimony. Since each column of the alignment marks only the presence (1) or absence (0) of a gene in a particular alignment position, it is straightforward to apply standard programs such as PAUP* (Swofford and Olsen, 1990) or phylip (Felsenstein, 1989) to the corresponding 0/1 strings. Alternatively, the position of a gene in the list alignment can be interpreted as a distinct character state, as described in subsection 4.4.8. This results in a string representation in which each mitochondrial gene corresponds to a single column. In practice, we observe little difference between the two approaches. Bootstrap support values can of course be computed in the same way as for conventional sequence alignments.
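The two character encodings described above can be sketched as follows. The alignment layout used here (one row per taxon, `None` for a gap, a leading "-" for the reverse strand) and the function names are assumptions made for illustration; circal itself writes a nexus file. The sketch also assumes that every column contains at least one gene.

```python
def presence_absence(alignment):
    """Encode every alignment column as 1 (gene present) or 0 (gap)."""
    return {
        taxon: "".join("0" if entry is None else "1" for entry in row)
        for taxon, row in alignment.items()
    }


def position_characters(alignment):
    """Encode each gene as one character whose state is its column.

    State 0 means the gene is absent; state s >= 1 means the gene sits in
    the s-th of the columns that refer to it (either orientation).
    """
    rows = list(alignment.values())
    labels = []
    for column in zip(*rows):
        # Unsigned gene name of the column (assumes one gene per column).
        labels.append(next(e.lstrip("-") for e in column if e is not None))
    genes = sorted(set(labels))
    columns_of = {g: [i for i, l in enumerate(labels) if l == g] for g in genes}
    encoded = {}
    for taxon, row in alignment.items():
        states = []
        for gene in genes:
            state = 0
            for s, col in enumerate(columns_of[gene], start=1):
                if row[col] is not None:
                    state = s
            states.append(state)
        encoded[taxon] = states
    return genes, encoded


toy = {
    "X": ["A", "B", None, "C"],
    "Y": ["A", None, "B", "C"],
}
print(presence_absence(toy))
print(position_characters(toy))
```

In the toy alignment gene B occupies two columns, so it yields two 0/1 columns in the first encoding but a single three-state character in the second.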

4.4.7 Consensus Gene Arrangements

An apparent shortcoming of the list-alignment approach is that the same object (gene) may appear multiple times in the alignment, i.e. there are multiple columns of the final alignment that refer to the same protein or tRNA. We argue that this actually is an advantage. We can now identify a subgroup of aligned genome arrangements (or use the whole alignment) to obtain a consensus genome arrangement. To this end, we simply compare all columns that refer to the same gene (in both orientations) and select the one with the most non-gap entries. In case of duplicated tRNAs, say, we may take the two most populated columns. The result is a "valid" mitochondrial genome arrangement that describes the consensus of the group in question.

By leaving out all columns that contain fewer than a minimum number of non-gap entries we can directly extract conserved parts of the gene order even if they do not correspond to conserved intervals.
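The column selection just described can be sketched in a few lines. As before, the row-per-taxon layout with `None` for gaps, the function name `consensus_arrangement`, and the parameter `min_support` are illustrative assumptions and not the circal interface. The sketch keeps exactly one column per gene, so it does not cover the duplicated-tRNA case, and it keeps the first column in case of a tie.

```python
def consensus_arrangement(alignment, min_support=1):
    """Pick, for every gene, the most populated alignment column.

    alignment:   {taxon: [signed gene label or None, ...]}, equal row lengths.
    min_support: columns with fewer non-gap entries are ignored.
    Returns the selected signed gene labels in alignment order.
    """
    rows = list(alignment.values())
    best = {}  # unsigned gene name -> (support, column index, signed label)
    for index, column in enumerate(zip(*rows)):
        entries = [e for e in column if e is not None]
        if not entries or len(entries) < min_support:
            continue
        label = entries[0]          # all entries of a column carry one label
        gene = label.lstrip("-")    # compare both orientations of a gene
        if gene not in best or len(entries) > best[gene][0]:
            best[gene] = (len(entries), index, label)
    chosen = sorted(best.values(), key=lambda item: item[1])
    return [label for _, _, label in chosen]


toy = {
    "X": ["A", "B", "C", None],
    "Y": ["A", None, "C", "-B"],
    "Z": ["A", None, "C", "-B"],
}
print(consensus_arrangement(toy))                 # majority position of B
print(consensus_arrangement(toy, min_support=3))  # fully conserved part only
```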

4.4.8 Ancestral Genome Organization

It might be surprising at first glance that the multiple alignment can be used to reconstruct the ancestral genome organization, since each gene $k$ can appear in multiple columns of the list alignment. Suppose the number of these columns is $m_k$. Clearly, gene $k$ is present in at most one of these columns in each taxon. A deletion of $k$ is represented by the absence of $k$ in all $m_k$ columns belonging to gene $k$. Thus, we can regard the position of $k$ in the list alignment as a character with $m_k + 1$ possible states. Consequently, we can use standard parsimony approaches to obtain the ancestral state (position and orientation) of each gene, see Figure 4.6. On the other hand, if we allow duplicate genes, or assume a duplication-deletion mechanism for mitochondrial genome rearrangements such as those proposed by Macey et al. (1998) and Lavrov et al. (2002), we may use the simple presence/absence patterns of genes in the list alignment itself to reconstruct ancestral gene orders by means of maximum parsimony.

As an example, we computed the reconstructed gene order of the echinoderm ancestor starting from 8 mitochondrial genomes. The presence/absence pattern of genes in the list alignment was converted into a 0/1 character matrix and analyzed using PAUP*. The resulting presence/absence pattern was then re-translated into the gene order shown in Figure 4.6.
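To make the idea of a parsimony reconstruction for a single character concrete, the following sketch runs the bottom-up pass of Fitch's algorithm on a rooted binary tree. This is a standard textbook procedure chosen here for illustration; the analysis in this work was carried out with PAUP*, and the tree layout (nested 2-tuples) and the function name `fitch_root_states` are assumptions of the example.

```python
def fitch_root_states(tree, leaf_state):
    """Bottom-up pass of Fitch parsimony for one character.

    tree:       nested 2-tuples with taxon labels at the leaves.
    leaf_state: dict mapping each taxon label to its character state,
                e.g. the column index of a gene (0 = gene absent).
    Returns (candidate state set at the root, minimum number of changes).
    """
    if not isinstance(tree, tuple):
        return {leaf_state[tree]}, 0
    left, right = tree
    left_states, left_cost = fitch_root_states(left, leaf_state)
    right_states, right_cost = fitch_root_states(right, leaf_state)
    common = left_states & right_states
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # Otherwise one change is required on one of the two child branches.
    return left_states | right_states, left_cost + right_cost + 1


tree = (("A", "B"), ("C", "D"))
position_of_gene = {"A": 1, "B": 1, "C": 2, "D": 1}
print(fitch_root_states(tree, position_of_gene))
```

In the toy example three of four taxa carry the gene in its first column, so state 1 is inferred at the root with a single change on the branch to taxon C.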

4.4.9 Mitochondrial Genomes

Analyses of mitochondrial genomes have significantly contributed to the reconstruction of deep metazoan phylogeny (Boore and Brown, 1998). For example, the phylogenetic position of Tentaculata (Lophophorata), either as protostomes, as the sister group of deuterostomes, or even as members of the deuterostomes, was a matter of long and controversial debate. Mitochondrial genome analyses of the brachiopod Terebratulina retusa, based on both gene order and nucleotide sequences, convincingly support a protostome relationship with an affiliation to the spiral-cleaving molluscs and annelids (Stechmann and Schlegel, 1999). However, some results of mitochondrial genome comparisons challenge classical evidence, such as the monophyly of insects (Nardi et al., 2003).

We applied our novel method to a comprehensive data set of metazoan mitochondrial genomes.

> ancestral_echinoderm
CYTB F 12S E T P -Q N L1 -A W C -V M -D Y G L2 ND1 I ND2 16S CO1 R ND4L CO2 K ATP8 ATP6 CO3 S2 ND3 ND4 H -S1 ND5 -ND6

Figure 4.6: Reconstructed gene order of the ancestral echinoderm starting from the gene orders of 3 sea urchins (Strongylocentrotus purpuratus, Paracentrotus lividus, Arbacia lixula), 2 brittle stars (Ophiopholis aculeata, Ophiura lutkeni), 1 sea star (Asterina pectinifera), 1 crinoid (Florometra serratissima), and 1 sea cucumber (Cucumaria miniata).

[Figure 4.7 (graphic not reproduced): tree with bootstrap values at the nodes. Collapsed clades, with the number of taxa in parentheses: Arthropoda (69), Mollusca - Cephalopoda (1), Mollusca - Gastropoda (1), Mollusca - Polyplacophora (1), Nematoda - Spirurida (3), Nematoda - Rhabditida (5), Annelida (2), Plathelminthes (10), Nematoda - Enoplea (1), Mollusca - Bivalvia (1), Mollusca - Scaphidida (1), Mollusca - Bivalvia (1), Mollusca - Gastropoda (6), Echinodermata (9).]

Figure 4.7: Maximum parsimony tree of the mitochondrial gene order of the protostomian data set. MP analysis was performed using PAUP* with the heuristic search method (10 random stepwise additions and TBR branch swapping, 100 bootstrap replicates). For the sake of clarity, triangles represent subtrees that are not shown in full resolution. Echinodermata were used as outgroup. Data and accession numbers are provided in Appendix A.



Phylogenetic reconstructions reported here were performed by aligning the mitochondrial gene orders (proteins, rRNAs, and tRNAs) using circal and reconstructing maximum parsimony trees using PAUP*. Gaps were treated as missing data. The data were also bootstrapped using the MP method (1000 replicates).

An analysis using 166 metazoan taxa showed a good separation between the deuterostomian and the protostomian lineages (Fritzsch et al., 2004a,b). In Figure 4.7 we show a phylogenetic reconstruction of 103 taxa from diverse protostome groups using 9 echinoderms as outgroup. Our approach supports the monophyly of arthropods, annelids, platyhelminthes, and nematodes, with the exception of a single taxon (Enoplea). However, the monophyly of molluscs was not recovered. The genome arrangements of molluscs are very variable (Hoffmann et al., 1992; Dreyer and Steiner, 2004). This is clearly a case where the ancestral gene order has been wiped out by rapid rearrangements. A multifurcation within the arthropod clade makes further systematic analysis impossible. Nevertheless, some tendencies can be recognized: for example, we find support for a clade of Chelicerata (10 taxa) and Myriapoda (4 taxa).

An ongoing debate in protostome phylogeny concerns the affiliation of the arthropods either with the annelids in the traditional articulates, or with the nematodes in the clade Ecdysozoa (Adoutte et al., 2000). Our results do not support the latter hypothesis, which is based on other molecular evidence such as nuclear rRNA, Hox gene sequences, and EST analyses (Dunn et al., 2008).

Furthermore, I analyzed 60 complete mitochondrial genomes of vertebrates, hemichordates, and cephalochordates. The vertebrate gene order underwent only very few rearrangements, see Figure 4.5. Thus, phylogenetic information cannot be recovered with the data set analyzed. It is noteworthy that within the chordates the hemichordates share the only rearrangement of a mitochondrial protein with the birds. Clearly, these are independent rearrangement events. This example suggests that changes in mitochondrial genomes are not unbiased random events; multiple list alignments can be used to determine likelihood differences between rearrangements from sufficiently large datasets.

While testing the circal program I observed that the resolution of the reconstructed trees and their agreement with well-established phylogenetic hypotheses improve with increasing taxon sampling. The reason is probably that a dense taxon coverage leads to smaller differences between the gene orders of adjacent taxa, which improves the quality of the multiple alignment because the underlying pairwise alignments become less error prone. This contrasts with the observation of Rosenberg and Kumar (2001) that increased taxon sampling does not lead to substantial improvements in sequence-based methods.



4.4.10 Chloroplast Genomes

Besides mitochondrial genomes, plastid genomes are a second field of application (Doyle et al., 1992; Odintsova and Yurina, 2003). While data are still sparse, a genome database has recently become available (Kurihara and Kunisawa, 2004). Chloroplast genomes, with about 100 protein coding genes, are much larger than animal mitochondrial genomes. In order to demonstrate that the list alignment approach is feasible for realistic datasets, we analyzed the protein gene order of the 20 chloroplast genomes listed in Table 4.1 (Wolf et al., 2005).

Figure 4.8: (top) Phylogenetic tree derived from circular list alignments of 20 chloroplast genomes from land plants. Numbers give bootstrap support from 1000 replicates in percent. (below) The circal alignment using uniform edit costs clearly identified blocks of protein genes whose relative ordering is conserved among land plants.

Orthology of chloroplast protein coding genes was checked by blast (Altschul et al., 1997) both with other chloroplast genomes and GenBank (Maglott et al., 2005). Unknown open reading frames without clear orthologs in other chloroplasts were removed from the gene lists. We ran circal with uniform scoring using protein coding genes only. The maximum parsimony tree resulting from the circular list alignment is shown in Figure 4.8. It correctly groups the angiosperms with the exception of the filicopsida, which should appear as the sister group of angiosperms. The resolution of the basal lineages is poor as expected, given that only single representatives are available.

Table 4.1: GenBank accession numbers and sources of chloroplast gene maps for sampled taxa (Wolf et al., 2005)

Taxon                                                                        GenBank accession
Charophytes
  Chaetosphaeridium globosum (Nordstedt) Klebahn                             NC_004115
Liverworts
  Marchantia polymorpha L.                                                   NC_001319
Mosses
  Physcomitrella patens (Hedw.) Bruch and W. P. Schimper                     NC_005087
Hornworts
  Anthoceros formosae Stephani                                               NC_004543
Lycophytes
  Huperzia lucidula (Michx.) Trevisan                                        AY660566
Moniliforms
  Adiantum capillus-veneris L.                                               NC_004766
  Psilotum nudum (L.) P.Beauv.                                               NC_003386
Conifers
  Pinus koraiensis Siebold and Zucc.                                         NC_004677
  Pinus thunbergii Franco                                                    NC_001631
Angiosperms
  Amborella trichopoda Baill.                                                NC_005086
  Arabidopsis thaliana (L.) Heynh.                                           NC_000932
  Atropa belladonna L.                                                       NC_004561
  Calycanthus floridus L.                                                    NC_004993
  Lotus japonicus (Regel) K.Larsen                                           NC_002694
  Nicotiana tabacum L.                                                       NC_001879
  Oenothera elata Kunth ssp. hookeri (Torr. & A.Gray) W. Dietr. and W. L. Wagner   NC_002693
  Oryza sativa L.                                                            NC_001320
  Spinacia oleracea L.                                                       NC_002202
  Triticum aestivum L.                                                       NC_002762
  Zea mays L.                                                                NC_001666


CHAPTER 4. GENE ORDER REARRANGEMENTS 4.5. THECREXALGORITHM

4.5 The CREx Algorithm

Based on the assumption that gene order information, in this case of the mitochondrial genome, includes useful phylogenetic signals, we developed a new algorithm called CREx (Common interval Rearrangement Explorer) (Bernt et al., 2007) to analyse this gene order information. The CREx algorithm heuristically computes pairwise rearrangement scenarios for gene order data. Possible phylogenetic events in such scenarios are reversals, transpositions, reverse transpositions, and the more complex tandem duplication random loss (TDRL) operations. CREx can detect such events as patterns in the signed strong interval tree (Berard et al., 2007), a data structure representing gene groups that appear consecutively in a set of gene orders. The basic strategy underlying the study is to identify unambiguous information that does not provide full resolution of the phylogeny but can be used as an 'anchor'. In this case the directional information of TDRLs is used to provide very strong evidence for monophyletic groups. This project was developed in close cooperation with the group of Prof. Dr. Martin Middendorf, head of the Department of Computer Science in Leipzig. In the following I give a short survey of the necessary formal basics and of the CREx algorithm we developed.

4.5.1 Basic Definitions

Rearrangement Operations

A permutation of size $n$ is a permutation of the elements $\{1, 2, \ldots, n\}$. A signed permutation of size $n$ is a permutation of size $n$ where every element has an additional sign ("+" or "−") that defines its orientation ("+" is usually omitted). In the following, a signed permutation $\pi = (\pi_1, \ldots, \pi_n)$ is just called a permutation. A reversal $\rho_R(i, j)$, $1 \le i \le j \le n$, applied to a permutation $\pi$ of size $n$ transforms it into

\[ \pi \circ \rho_R(i, j) = (\pi_1, \ldots, \pi_{i-1}, -\pi_j, \ldots, -\pi_i, \pi_{j+1}, \ldots, \pi_n). \]

A transposition $\rho_T(i, j, k)$, $1 \le i \le j < k \le n$, applied to $\pi$ transforms it into

\[ \pi \circ \rho_T(i, j, k) = (\pi_1, \ldots, \pi_{i-1}, \pi_{j+1}, \ldots, \pi_k, \pi_i, \ldots, \pi_j, \pi_{k+1}, \ldots, \pi_n). \]

A reverse transposition $\rho_{rT}(i, j, k)$, with $1 \le i \le j \le n$ and ($1 \le k < i$) or ($j < k \le n$), applied to $\pi$ transforms it (here shown for $j < k$) into

\[ \pi \circ \rho_{rT}(i, j, k) = (\pi_1, \ldots, \pi_{i-1}, -\pi_k, \ldots, -\pi_{j+1}, \pi_i, \ldots, \pi_j, \pi_{k+1}, \ldots, \pi_n). \]

A tandem duplication random loss operation $\rho_{TDRL}$ duplicates a contiguous segment of genes in tandem, followed by the loss of one copy of each of the duplicated genes. In this case, such an operation is considered as a TDRL only when it changes the gene order and is different from a transposition.
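The four operations defined above can be written down directly on signed permutations stored as lists of non-zero integers. This is a minimal sketch following the definitions in this section and not the CREx implementation; the function names are chosen for illustration, indices are 1-based as in the text, and the reverse transposition covers only the case $j < k$. The TDRL is expressed on the whole permutation by naming the genes whose first copy is kept (the hypothetical parameter `keep_first`), which includes a TDRL on a contiguous segment as a special case.

```python
def reversal(pi, i, j):
    """rho_R(i, j): reverse pi_i..pi_j and flip the signs."""
    p = list(pi)
    return p[:i - 1] + [-x for x in reversed(p[i - 1:j])] + p[j:]


def transposition(pi, i, j, k):
    """rho_T(i, j, k): exchange the blocks pi_i..pi_j and pi_{j+1}..pi_k."""
    p = list(pi)
    return p[:i - 1] + p[j:k] + p[i - 1:j] + p[k:]


def reverse_transposition(pi, i, j, k):
    """rho_rT(i, j, k) for j < k: as a transposition, but the block
    pi_{j+1}..pi_k is moved in reversed order with flipped signs."""
    p = list(pi)
    return p[:i - 1] + [-x for x in reversed(p[j:k])] + p[i - 1:j] + p[k:]


def tdrl(pi, keep_first):
    """Tandem duplication random loss on the whole permutation.

    Genes in keep_first survive in the first copy, all others in the
    second copy; signs are left unchanged.
    """
    first = [x for x in pi if abs(x) in keep_first]
    second = [x for x in pi if abs(x) not in keep_first]
    return first + second


identity = [1, 2, 3, 4, 5]
print(reversal(identity, 2, 4))
print(transposition(identity, 2, 3, 5))
print(reverse_transposition(identity, 2, 3, 5))
print(tdrl([1, 2, 3, 4], {2, 4}))
```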

A scenario for two signed permutations $\pi$ and $\sigma$ is a sequence of rearrangement operations that transform $\pi$ into $\sigma$. A sequence with a minimal (weighted) number of operations is called parsimonious.

Common Intervals and Strong Interval Trees

An interval of a permutation $\pi$ is a set of consecutive elements of the permutation $\pi$. Let $\Pi$ be a set of signed permutations of size $n$. A common interval (Uno and Yagiura, 2000; Heber and Stoye, 2001) of $\Pi$ is a subset of $\{1, 2, \ldots, n\}$ that is an interval in each $\pi \in \Pi$. The singletons $\{i\}$, $i \in \{1, 2, \ldots, n\}$, and the set $\{1, 2, \ldots, n\}$ of all elements are called trivial common intervals. Let $C(\Pi)$ be the set of all common intervals of $\Pi$. Two intervals $c$ and $c'$ overlap if $c \cap c' \neq \emptyset$, $c \not\subset c'$, and $c' \not\subset c$. If two intervals do not overlap they commute. A common interval is called a strong common interval if it does not overlap with any other common interval. The set of all strong common intervals can be computed in $O(kn)$ time for $k$ signed permutations of size $n$ (Bergeron et al., 2005, 2008). The strong interval tree (SIT) (Berard et al., 2007) of $\Pi$ is a tree $T(\Pi)$ whose nodes are exactly the strong common intervals of $\Pi$, such that the root node is the interval containing all elements, the leaves are the singletons, and the edges are defined by the minimal inclusion relation of the intervals (i.e. there is an edge between nodes $c$ and $c'$ if $c' \subset c$ and there is no node $c''$ with $c' \subset c'' \subset c$). Each node is given a sign ($+$ or $-$). If the children of a node appear in the same order in both input gene orders, the node is called linear increasing ($+$); if the children of a node appear in opposite order in the two gene orders, it is linear decreasing ($-$); otherwise the node is called prime (see Figure 4.9(b)). For a more comprehensive introduction to SITs see Berard et al. (2007). The importance of the SIT is that it greatly facilitates the identification of the genome rearrangement operations in the CREx algorithm (Bernt et al., 2007). A genomic rearrangement operation $\rho$ applied to $\pi \in \Pi$ is said to be preserving for $\Pi$ if it does not destroy any common interval $c \in C(\Pi)$ (i.e., $C(\Pi) = C(\Pi \cup \{\pi \circ \rho\})$). An operation is not preserving if there exists a common interval that does not persist after applying the rearrangement operation.
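For two gene orders, common intervals and strong common intervals can be enumerated naively from the definitions above. The sketch below is quadratic in the number of candidate intervals and is meant only to make the definitions concrete; it is neither the linear-time algorithm of Bergeron et al. nor the CREx code, and the function names are assumptions. Signs are ignored, since common intervals are sets of gene names.

```python
def common_intervals(p1, p2):
    """All common intervals of two permutations, as frozensets of genes."""
    position = {abs(g): idx for idx, g in enumerate(p2)}
    found = []
    n = len(p1)
    for a in range(n):
        lo = hi = position[abs(p1[a])]
        for b in range(a, n):
            q = position[abs(p1[b])]
            lo, hi = min(lo, q), max(hi, q)
            # p1[a..b] is an interval of p2 iff its genes span b - a + 1 slots.
            if hi - lo == b - a:
                found.append(frozenset(abs(g) for g in p1[a:b + 1]))
    return found


def overlap(c, d):
    """c and d overlap: they intersect and neither contains the other."""
    return bool(c & d) and not c <= d and not d <= c


def strong_common_intervals(p1, p2):
    """Common intervals that overlap no other common interval."""
    intervals = common_intervals(p1, p2)
    return [c for c in intervals if not any(overlap(c, d) for d in intervals)]


p1 = [1, 2, 3, 4]
p2 = [1, 3, 2, 4]
print(len(common_intervals(p1, p2)), len(strong_common_intervals(p1, p2)))
```

In this example $\{1, 2, 3\}$ and $\{2, 3, 4\}$ are common intervals that overlap each other, so neither is strong; the strong common intervals are the four singletons, $\{2, 3\}$, and the full set, which are the nodes of the strong interval tree.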

4.5.2 Methods

Berard et al. (2007) recently proposed using strong interval trees for the computation of optimal sorting scenarios based on inversions that conserve combinatorial structures of genomes. A closely related data structure, known as PQ-tree (Booth and Lueker, 1976), was shown in Parida (2006) to be suitable for studying large scale rearrangement operations, with the focus on inversions and transpositions.



[Figure 4.9 (graphics not reproduced): panels (a), (b), and (c) showing echinoderm mitochondrial gene orders, the strong interval tree, and the TDRL.]

Figure 4.9: (a): Mitochondrial gene order rearrangement scenario inferred by CREx for the given phylogeny of Asteroidea (A), Echinoidea (E), and Holothuroidea (H), showing one reversal (REV) and one TDRL; (b): SIT of E for E→H (TDRL scenario); (c): TDRL of E→H as suggested by CREx. Figures 4.9(b) and 4.9(c) are exported from CREx.

Formally, a common interval is a subset of genes that appear consecutively in two (or more) input gene orders (Berard et al., 2007). Two intervals $I$ and $J$ are said to commute if either $I \subset J$, or $I \supset J$, or $I \cap J = \emptyset$ holds. A common interval is a strong interval if it commutes with every common interval. Finally, a strong interval tree for two gene orders is a rooted tree that has exactly one leaf for each gene and exactly one inner node for each strong common interval. The edges of the tree are defined by the inclusion order of the set of strong intervals. There are two types of inner nodes. An inner node is called (increasing or decreasing) linear if its child nodes are in a left-to-right or right-to-left order; otherwise the child nodes are not ordered and the inner node is called prime, as described in Section 4.5.1.

The basic principle for computing heuristic gene order rearrangement scenarios with CREx is to detect patterns in the SITs that reflect the corresponding genome rearrangement operations. For example, reversals are very simple to detect because this operation is reflected as a sign difference between a parent node and its child nodes in the SIT. Prime nodes are good indicators for TDRL events. Both transpositions and reverse transpositions also lead to recognizable patterns in the SIT. CREx uses a stepwise approach to suggest a genome rearrangement scenario: First it identifies transpositions and reverse transpositions, then reversals are identified based on the sign differences between connected nodes in the SIT. In these first two steps CREx only operates on linear nodes. In the third step, the prime nodes are analyzed to identify combinations of reversal and TDRL operations (including transpositions) which can explain the corresponding prime nodes (Bernt et al., 2007). The order of these steps is important for identifying all of the rearrangement events and their combinations.


Figure 4.10: Genomic rearrangement events considered in CREx: (a) inversion, (b) transposition, (c) reverse transposition, and (d) tandem duplication random loss.

TDRLs are of particular interest for phylogenetic analysis since the distance measure between gene orders that is based on the minimum number of TDRLs is not symmetric (Chaudhuri et al., 2006). In particular, in many cases a rearrangement can be explained by a single TDRL in one direction, while reversing the rearrangement would require more than a single operation. It follows that TDRLs imply the evolutionary direction of the rearrangement in many cases and hence allow the reconstruction of the ancestral state from a comparison of two gene orders without considering an outgroup. This feature makes TDRLs particularly valuable for phylogenetic studies and suggests that a detailed reconstruction of the rearrangement history of gene orders can lead to more detailed and more certain phylogenetic conclusions. In contrast, reversals, transpositions, and reverse transpositions are inherently symmetric, hence ancestral states cannot be reconstructed without additional outgroup information.
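The asymmetry can be checked on a small example. By the definition in Section 4.5.1, the result of a TDRL consists of the genes kept in the first copy followed by the genes kept in the second copy, each in their original relative order. A target order is therefore reachable by at most one TDRL exactly when, read in the numbering of the source order, it decomposes into at most two increasing runs. The sketch below tests this condition for unsigned gene orders; it is a consequence of the definition worked out for illustration, not code from CREx, and the function name is an assumption.

```python
def one_tdrl_suffices(source, target):
    """True if target arises from source by at most one TDRL.

    Both arguments are unsigned gene orders over the same genes.
    The target must split into a prefix and a suffix that each keep
    the relative order of the source, i.e. contain at most one descent.
    """
    rank = {gene: idx for idx, gene in enumerate(source)}
    ranks = [rank[gene] for gene in target]
    descents = sum(1 for x, y in zip(ranks, ranks[1:]) if x > y)
    return descents <= 1


a = [1, 2, 3, 4]
b = [2, 4, 1, 3]
print(one_tdrl_suffices(a, b))  # keep 2 and 4 in the first copy, 1 and 3 in the second
print(one_tdrl_suffices(b, a))  # the reverse direction needs more than one TDRL
```

For this pair a single TDRL explains the change from `a` to `b` but not from `b` to `a`, which is the kind of directional signal described above.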

If it can be deduced from the gene orders of other taxa that the corresponding gene sequence in the ancestor genome of both sister taxa equals the gene order in one sister taxon, the rearrangement operation is assigned to the branch leading to the other sister taxon.

4.5.3 The CREx Algorithm

CREx (Bernt et al., 2007) is an algorithm to heuristically determine preserving rearrangement operations for pairs of unichromosomal genomes. The algorithm uses the fact that each of the four rearrangement operations considered here corresponds to a pattern in the SIT. To illustrate this, each of the four rearrangement operations is applied to the identity permutation and the resulting SIT is computed. Figures 4.10 and 4.11 show the applied rearrangement operations and the resulting SITs. More formally, the following patterns appear for the different operations when applied to a permutation $\pi$:

• If a reversal $\rho_R(i, j)$ is applied, a linear node with a linear parent node of opposite sign occurs in the corresponding SIT (see also Berard et al. (2007)). The linear node reflects the common interval of all elements that are inverted.

• If a transposition $\rho_T(i, j, k)$ is applied, the corresponding SIT has a linear node with elements $\{\pi_i, \ldots, \pi_k\}$ that has two linear children reflecting the common intervals $\{\pi_i, \ldots, \pi_j\}$ and $\{\pi_{j+1}, \ldots, \pi_k\}$. The sign of the node needs to be different from the signs of the child nodes.

• If a reverse transposition $\rho_{rT}(i, j, k)$ is applied, the corresponding SIT has a linear node with elements $\{\pi_i, \ldots, \pi_k\}$. One child is a linear node reflecting the common interval of elements $\{\pi_i, \ldots, \pi_j\}$ that are not inverted by the reverse transposition. This child must have a different sign from its parent. The other involved elements are singletons that are direct child nodes of node $\{\pi_i, \ldots, \pi_k\}$ and must have the same sign.

• A tandem duplication random loss operation $\rho_{TDRL}$ leads to a prime node reflecting all the elements involved in the rearrangement operation.

Figure 4.11: Strong interval tree of the identity permutation and the resulting permutation after applying the corresponding genomic rearrangement event as given in Figure 4.10. The first of the two gene orders (not shown) in the example is 1234 and the other gene order has been obtained by one of the following operations (the corresponding order can be found in the root of the tree): (a) inversion, (b) transposition, (c) reverse transposition, (d) tandem duplication random loss. Prime nodes are depicted by ellipses and linear nodes by rectangles, where the sign in the square on top of a node indicates whether the node is increasing or decreasing (+/-).

The CREx algorithm computes the strong interval tree for two input permutations $\pi_1$ and $\pi_2$. Then CREx searches for patterns corresponding to rearrangement operations. If a pattern is identified, the corresponding rearrangement operation $\rho$ is included in the scenario to be computed and the next pattern is searched in the strong interval tree of $\pi_1 \circ \rho$ and $\pi_2$ (the pattern for $\rho$ will not occur in this strong interval tree). This process is repeated until a complete scenario is inferred. If the genomic rearrangement operations to be inferred do not overlap, the pattern identification of the CREx algorithm works. Obviously, the search order for patterns is very important. If reversals were identified before transpositions and reverse transpositions, then all transpositions and reverse transpositions would be inferred as reversal operations in the scenario. Therefore, the search order for the patterns of the genomic rearrangement operations is (i) transpositions, (ii) reverse transpositions, (iii) reversals, and (iv) TDRL operations.

Special care has to be taken when prime nodes occur in the SIT. In the CREx algorithm a prime node is an indicator for one or several TDRLs. As a TDRL operation will not change the sign of the elements involved, reversals are utilized to equalize the sign of all the elements in a prime node. The CREx algorithm uses a heuristic approach to identify a parsimonious number of reversals and TDRLs for the corresponding prime node. Let $\pi_1$ and $\pi_2$ be the two permutations of the elements in the prime node. Two variants are included in the latest version of CREx to infer the reversals that are needed: (i) (reversals first) a set of reversals is applied to the origin permutation $\pi_1$ to equalize the signs (with respect to $\pi_2$), and then, starting from the resulting permutation, the minimum number of TDRLs is computed (Chaudhuri et al., 2006); or (ii) (reversals last) first a set of reversals is applied to $\pi_2$ (resulting in $\pi_2'$, such that all the signs are equalized with respect to $\pi_1$). Then a minimal number of TDRLs is inferred to transform permutation $\pi_1$ into permutation $\pi_2'$. Note that the number of different possible parsimonious scenarios to equalize the signs grows exponentially with the number of blocks of elements that have different signs in both permutations. The CREx algorithm uses a brute-force approach and each possible minimal set of reversals is tested, resulting in a potentially different number of TDRLs per reversal set. Scenarios for which the sum of the number of reversals and TDRLs is minimal are considered as possible scenarios. Furthermore, note that the resulting scenario for a prime node is not guaranteed to be parsimonious, as a mixed sequence of reversals and TDRLs may result in a smaller scenario.

4.5.4 The Implementation of the CREx Algorithm

Based on the described algorithm there is a web-based application, also called CREx, for

analyzing gene orders based on the application server Zope. The algorithms handling the common

interval data structures and for computing scenarios were implemented in C++ and integrated

into Zope via Python modules. For drawing publication-ready downloadable versions of the

SITs, ReportLab was used. CREx, the program and the algorithm, a tutorial, and several

detailed examples are available online at http://pacosy.informatik.uni-leipzig.de/crex .

After uploading the gene order data in FASTA format to CREx, a distance matrix is computed

and displayed. The elements are colored according to the distance values so that gene orders with

a small evolutionary distance can be easily identified. The pairwise distances can be computed

as common interval distance, breakpoint, or reversal distance. Columns, rows, and individual

elements of the distance matrix can be selected; for the selected elements the SITs are computed

and displayed as a tree or as a family diagram (Bergeron and Stoye, 2006). As briefly described

in Section 4.5.2, the structure of the SIT can be used to infer genomic rearrangement operations

connecting a pair of input gene orders. CREx uses this heuristic approach to suggest a

rearrangement scenario based on common intervals. CREx allows the user to select individual operations

of the scenario to highlight the affected common intervals.
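As an illustration of how such a pairwise distance matrix can be filled, the following Python sketch computes the breakpoint distance, one of the measures named above. It is a minimal example under stated assumptions: gene orders are treated as linear lists of signed integers, the taxon names are invented, and the functions are hypothetical rather than part of the CREx code base.

```python
def breakpoint_distance(p, q):
    """Count adjacencies of p that are not conserved in q (linear, signed)."""
    conserved = set()
    for a, b in zip(q, q[1:]):
        conserved.add((a, b))
        conserved.add((-b, -a))  # same adjacency read on the other strand
    return sum(1 for pair in zip(p, p[1:]) if pair not in conserved)


def distance_matrix(orders):
    """Pairwise breakpoint distances for a dict name -> signed gene order."""
    names = sorted(orders)
    return {(x, y): breakpoint_distance(orders[x], orders[y])
            for x in names for y in names}


orders = {
    "taxon_a": [1, 2, 3, 4, 5],
    "taxon_b": [1, -3, -2, 4, 5],  # one reversal away from taxon_a
    "taxon_c": [1, 4, 2, 3, 5],    # one transposition away from taxon_a
}
matrix = distance_matrix(orders)
print(matrix[("taxon_a", "taxon_b")])  # 2
print(matrix[("taxon_a", "taxon_c")])  # 3
```

For circular genomes the adjacency between the last and the first gene would have to be counted as well; the sketch omits this for brevity.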

4.5.5 Real World Example

Mitochondrial genomes have been a particularly fruitful data set for phylogenetic reconstructions

due to their limited size and availability of a large number of informative data sets. In addition to

the sequence data of their proteins, rRNAs, and tRNAs, the order of the genes on the circular

mitochondrial genome of animals has received extensive attention as a phylogenetic marker since

the seminal work of Watterson et al. (1982) and Sankoff et al. (1992).

In a recent study of echinoderms (Perseke et al., 2008) the CREx algorithm was tested for the first time.

It was possible to identify all of the known rearrangement events and, beyond that, in most cases

the direction of each event and the branch on which it probably happened. Furthermore, the

well-described TDRL of the holothurians (Arndt and Smith, 1998) could be detected, and two novel

gene orders of mitochondrial genomes of the ophiuroid Ophiura albida and the crinoid Antedon

mediterranea were analysed.

O. albida possesses the same gene order as O. lutkeni, which differs substantially from that of

Ophiopholis aculeata (Scouras et al., 2004). So far, only two distinct gene orders, which

are very different, are known for ophiuroids. Unfortunately, CREx is not able to resolve a plausible

rearrangement sequence because there are too few conserved parts between the two gene orders. Thus, as

published in Scouras et al. (2004), the ancestral state of ophiuroids remains unresolved.

Surprisingly, the gene order of A. mediterranea differs from the consensus gene order of the

crinoids, which is represented by F. serratissima and P. gracilis (Scouras and Smith, 2001), in

three regions. All rearrangements within the crinoids could be completely resolved by means of

CREx. The mitochondrial genome of A. mediterranea includes two identical copies of tRNAVal,

which are absent in all other published crinoids. Furthermore, CREx was able to detect one

transposition of the gene ND4L coupled with the tRNAArg (T2, see Figure 4.13) and one TDRL

event of 6 tRNA genes and the control region (TDRL2). Following this event, UAS II was created

by duplication of the four tRNAs (tRNAVal-tRNAAsp-tRNAThr-tRNAGlu) and subsequent loss of

the copies of tRNAAsp, tRNAThr, and tRNAGlu. The reconstructed ancestral state of crinoids is

identical to the gene orders of F. serratissima and P. gracilis, which is in agreement with Scouras

and Smith (2006).

Figure 4.12: Maps of the mitogenomes of the crinoid Antedon mediterranea (AM404181, 16169 nt) and the ophiuroid Ophiura albida (AM404180, 16580 nt). The images were generated from the GenBank files with the mitochondrial visualization tool ’mtviz’ (application note in preparation). It can be found at http://pacosy.informatik.uni-leipzig.de/mtviz .

The known gene orders of the other echinoderm groups (Asteroidea, Echinoidea, and Holothuroidea)

show no rearrangements within each class, and only a few rearrangements are necessary to

transform the gene order of one group into that of another.

Between the Asteroidea and the Echinoidea there is only one well-described inversion I1 of 16

genes (Asakawa et al., 1995; Smith et al., 1990). Unfortunately, there is only one complete

mitochondrial genome of the Holothuroidea; its gene order can be derived from the echinoid gene order

by a single TDRL event of one tRNA cluster (TDRL1) (Arndt and Smith, 1998). The CREx

analysis implies that Echinoidea represents the ancestral state within these three groups because

all events occurred on the branches leading to the Asteroidea or Holothuroidea. However, the

phylogenetic relationships between these three classes cannot be unambiguously reconstructed

from either the genome rearrangements or the sequence analysis.

The differences between the echinoid gene order and ancestral crinoid gene order can be ex-

plained by one inversion (I2), see Scouras et al. (2004), followed by one tandem duplication

random loss (TDRL3), and one reverse transposition (rT1), cf. Figure 4.13.

The I2+TDRL3 event concerns the 16S rRNA, the genes ND1 and ND2, as well as tRNAIle, tRNALeu2,

tRNAGly, and tRNATyr. We suggest that these events are mechanistically coupled, i.e. constitute a

single rearrangement event. Alternatively, the putative I2+TDRL3 event can be explained by an

inversion with two additional transposition events (see Figure A.4), or as a decoupled event of an

inversion and one tandem duplication random loss event TDRL3. The direction of the TDRL3

implies that it occurred on the branch leading to echinoids, asteroids, and holothuroids. In

contrast, the reverse transposition rT1 of the fragment containing the 12S rRNA and three tRNAs

provides no direct information about its location on the two branches.

Figure 4.13: Maximum likelihood analysis of thirteen protein coding genes of the echinoderm mt genomes. The numbers show the bootstrap values for ML/NJ/MP and the posterior probability of the Bayesian analysis. If these values are 100 or 1.00, respectively, they are represented by solid points to save space. The tree has been rooted with two hemichordates. Branch lengths are proportional to evolutionary distance. I = Inversions (reversals), T = Transpositions, rT = Reverse Transpositions, TDRL = Tandem Duplication Random Loss.

Scouras and his collaborators (Scouras and Smith, 2001; Scouras et al., 2004; Scouras and Smith,

2006) suggested, based mostly on the nucleotide bias in crinoids and the putative reverse

orientation of the control region relative to the protein-coding genes, that the reverse transposition

occurred in the crinoid lineage. This is consistent with the analysed data. Accepting that the

rT1 event is crinoid specific and that I2 and TDRL3 are indeed part of the same rearrangement

event, it is possible to conclude that the ancestral gene order of asteroids, echinoids,

holothuroids, and crinoids corresponds to the crinoid gene order without the reverse transposition

rT1 (Figure 4.13). Furthermore, the gene arrangement of the common ancestor of echinoids, holothuroids, and

asteroids coincides with the extant echinoid gene order. This scenario is consistent with two

phylogenetic hypotheses (A and B1), see Figure 4.14. If the coupling of the I2 and the TDRL3

events is rejected (Hypothesis B2 in Figure 4.14), an alternative scenario becomes plausible,

which places an inversion I1 event on the Echinozoa (echinoid+holothuroid) branch and makes

the Asteroidea gene order the ancestral state of the echinoid+holothuroid+asteroid group. Note

that in this scenario the inversion I1 event implies two different gene orders depending on which

group (Asteroidea, I1a, or Echinozoa, I1b) represents the ancestral state.

Figure 4.14: Two phylogenetic hypotheses are consistent with the most parsimonious rearrangement scenario. Hypothesis A: based on phylogenetic analyses of the amino acid sequences. This scenario implies that the events I2 and TDRL3 must be coupled. Hypothesis B: based on the gene order analysis; in this scenario the events I2 and TDRL3 can be either coupled (B1) or decoupled (B2). Both variants are shown. A = Asteroidea, E = Echinoidea, H = Holothuroidea, C = Crinoidea.

In the analysis reported here the (putative) control regions (as annotated in GenBank and/or the

corresponding literature) were included. This additional information results in better support

for the TDRL2 event (reversing the direction of the rearrangement now requires two TDRLs instead

of only two transpositions). Interestingly, most of the eight rearrangements contain or are located

close to the control region. A frequent involvement of the control region in genome

rearrangements was noted before in chordates (Boore and Brown, 1998).

In Figure 4.13 the rearrangement operations determined by CREx are mapped to the consensus

phylogenetic tree obtained from a careful analysis of the mitochondrial protein sequences. The

CREx results alone are not sufficient to resolve the phylogenetic relationships completely.

However, the results of CREx are consistent with the molecular phylogeny. It should be noted that

the gene order analysis fails to provide unambiguous information exactly for those nodes that

contradict the preferred phylogenetic hypothesis, in particular the position of the ophiuroids, see

below.

4.5.6 Current Developments

The CREx algorithm was extended to better handle alternative scenarios, ordered scenarios, and

combinations of inversion and tandem duplication random loss events. Furthermore, a novel

algorithm called TREx (Tree Rearrangement Explorer) was developed. The TREx algorithm

utilizes the CREx algorithm. It takes as input a binary rooted phylogenetic tree and the gene

orders of a set of taxa and heuristically infers the corresponding rearrangement operations on the

edges of the tree. Using the TREx algorithm it is now possible to test the gene order information

in relation to different hypotheses and/or different phylogenetic trees.

The aim of this project is to use the gene order information to reconstruct phylogenetic trees and

to determine ancestral gene orders.

CHAPTER 5

Summary

The inclusion of molecular markers in the reconstruction of phylogenetic relationships has increased

over the last decades. Molecular markers represent an essential part at different taxonomic levels.

In particular, they are used in areas where morphological data contain no further useful

information. However, the increase in experimental data poses new challenges for

phylogenetic reconstruction. Bioinformatics, as a new field of research, has taken up these challenges

over the last few years and provides numerous tools to handle this abundance of data.

At present, several methods and approaches allow biologists to analyse molecular data

extensively and carefully. Based on the understanding of biology at the molecular level it is possible,

in the majority of cases, to resolve the phylogenetic relationships of organisms that have been

separated for a long time.

This work is a further step towards finding algorithms and developing new programs that give biologists basic

tools for the careful analysis of such data.

In my work, I followed two different approaches. The first approach deals with the quality of

data sets. With the increase of molecular data and the demand to resolve the metazoan deep

phylogenies, it becomes more difficult to compare the data. In many cases the differences between

the sequences are large, for example owing to long divergence times or radiation

events in their history. It is therefore very hard to identify and use the information for phylogenetic

reconstruction. Multiple substitutions, point mutations, wobbling third positions in protein-coding

genes, and/or simple variable parts in sequences lead to alignment positions whose character

information can no longer be interpreted in ’the correct way’. Such a critical alignment makes

it impossible to reconstruct accurate relationships between the species. Frequently, scientists

delete ambiguous parts or positions by hand, which is most often impossible to reproduce.

Therefore we developed the program noisy, which allows the user to detect random-like

positions with undefined information, check whether these sites are random-like, and delete them

in a transparent and reproducible way. First, noisy computes a cyclic ordering of the taxa

set, whereby the user can choose between two techniques (NeighborNet or QNet).

Subsequently, a reliability score q is calculated for each character. The number of character-state

alterations along the cyclic ordering is counted and compared to the count observed in random shufflings. The uniform

pseudo-random number generator Mersenne Twister is used to generate the random shufflings.

Furthermore, the fraction r of sites with q > qcutoff (0.8 or a user-given value) among all variable

sites in the alignment is computed. At the end, noisy exports a PostScript file visualizing

the quality of the sites of the reordered input alignment, a record of their reliability scores as

xy-data, and a modified alignment for further analysis from which sites with reliability

q < qcutoff are removed.
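The idea behind the reliability score can be sketched in a few lines of Python. The sketch makes several assumptions that are not taken from noisy itself: the alignment column is already arranged in the cyclic ordering of the taxa, and q is defined here simply as the fraction of random shufflings that show strictly more character-state changes than the observed column. The function names and the number of shufflings are hypothetical.

```python
import random


def count_changes(column):
    """Count state changes between neighbours in a cyclic ordering."""
    n = len(column)
    return sum(1 for i in range(n) if column[i] != column[(i + 1) % n])


def reliability(column, shuffles=1000, seed=0):
    """Fraction of shufflings with strictly more changes than observed."""
    rng = random.Random(seed)  # CPython's generator is a Mersenne Twister
    observed = count_changes(column)
    states = list(column)
    more = 0
    for _ in range(shuffles):
        rng.shuffle(states)
        if count_changes(states) > observed:
            more += 1
    return more / shuffles


clustered = "AAAACCCC"    # states follow the cyclic ordering
alternating = "ACACACAC"  # as many changes as the column allows
print(count_changes(clustered), count_changes(alternating))  # 2 8
print(reliability(alternating))                              # 0.0
print(reliability(clustered) > 0.5)                          # True
```

Under this assumed definition, a column whose states cluster along the ordering scores high, while a column that looks like noise scores low and would fall below the cutoff.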

The modified alignment can be analysed with standard reconstruction tools. The

reconstruction becomes more stable and, for data sets that include very variable sequences,

the topology can change. In addition, the noisy output allows other scientists to retrace

the deletion pattern.

The second approach uses the information of mitochondrial gene order. Mitochondrial gene

orders contain comprehensive information about very old splits, for example the metazoan deep

phylogeny. It has been shown in the literature that mitochondrial genomes can be a good phylogenetic

marker when suitable methods are used. Recently, some implementations of different algorithms and

ideas were published. However, all of them are plagued by several problems, such as the limitation

of hypothetical scenarios or the lack of an option to compare sequences of different lengths.

In this dissertation I present two novel algorithms for dealing with mitochondrial gene order

information and their implementation. The first algorithm, implemented in the program circal,

solves the problem of gene order strings with unequal lengths by encoding

the gene order information in list alignments. A progressive alignment approach is used to

combine pairwise list alignments into a multiple alignment of gene orders. This very simple and fast

approach makes it possible to subsequently apply common reconstruction methods such as Neighbor Joining,

Maximum Parsimony, Maximum Likelihood, or Bayesian analysis. Alignments of mitochondrial

gene orders constructed in this way can readily be used to reconstruct phylogenetic

trees as well as ancestral gene orders using standard approaches. It could be shown that this

method is able to resolve large genome data sets with about 100 characters. Furthermore, we

demonstrated that only 37 characters and their direction information are sufficient to reconstruct

phylogenetic relationships.
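The problem of comparing circular gene orders of unequal length can be illustrated with a deliberately naive Python sketch. It is not the circal algorithm: it merely rotates one gene list through all positions and scores each rotation by the length of the longest common subsequence with the other list. The function names and the example gene orders are invented for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two gene lists."""
    previous = [0] * (len(b) + 1)
    for gene_a in a:
        current = [0]
        for j, gene_b in enumerate(b, start=1):
            if gene_a == gene_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def best_rotation(a, b):
    """Try every rotation of the circular order b against the fixed list a."""
    scored = [(lcs_length(a, b[s:] + b[:s]), s) for s in range(len(b))]
    score, shift = max(scored, key=lambda pair: pair[0])
    return shift, score


order_a = ["COX1", "COX2", "ATP8", "ATP6", "COX3", "ND3"]
order_b = ["ATP6", "COX3", "COX1", "COX2", "ATP8"]  # ND3 missing, rotated
print(best_rotation(order_a, order_b))              # (2, 5)
```

Rotating the second list by two positions aligns all five shared genes; the missing gene would appear as a gap in a pairwise list alignment.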

Additionally, we developed and implemented the new algorithm CREx, based on the detection of

patterns in so-called strong interval trees, which reflect the corresponding genome rearrangement

operations. Furthermore, we included the tandem duplication random loss operation (TDRL) as

an additional plausible event. Because information about the direction of this event is available,

the operation is very useful for reconstructing phylogenetic events on the level of gene order

rearrangements. Moreover, it can explain complex rearrangement events of three or more inversions

with sometimes only one TDRL. However, it is important to bear in mind that no rearrangement

operation has been verified in practice.
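Strong interval trees are built from the common intervals of two gene orders. As a minimal illustration, the following Python sketch enumerates all common intervals of two unsigned, linear permutations by brute force in quadratic time. It is a simplification for exposition: signs, circularity, and the construction of the tree itself are ignored, and the function name is hypothetical.

```python
def common_intervals(p, q):
    """Return all gene sets (size >= 2) that are contiguous in both p and q."""
    position_in_p = {gene: i for i, gene in enumerate(p)}
    relabeled = [position_in_p[gene] for gene in q]
    found = []
    for start in range(len(relabeled)):
        low = high = relabeled[start]
        for end in range(start + 1, len(relabeled)):
            low = min(low, relabeled[end])
            high = max(high, relabeled[end])
            # q[start:end + 1] is also contiguous in p exactly when its
            # positions in p cover a gap-free range.
            if high - low == end - start:
                found.append(frozenset(q[start:end + 1]))
    return found


intervals = common_intervals([1, 2, 3, 4, 5], [2, 1, 3, 5, 4])
print(len(intervals))  # 5
```

For this pair the common intervals are {1,2}, {1,2,3}, {3,4,5}, {4,5}, and the full gene set.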

The CREx approach allows the user to compare gene order strings in a simple way. The graphical

output gives a good overview of the events that occurred between two divergently evolving genomes.

It is possible to choose different distance matrices and to take into account whether the genome is linear

or circular. Furthermore, hypothetical tree reconstructions can be tested based on the gene order

information. The analyses using CREx have shown that this algorithm is able to identify

published and well-described rearrangement scenarios and, moreover, to detect new or shorter

pathways of comparative gene order evolution.

A future step of this work will use the gene order information to reconstruct phylogenetic trees

without additional input, as shown in the outlook of the TREx algorithm. All programs,

algorithms, source codes, and information are freely available from the web. The results of my

dissertation are reflected by the list of publications, which include, besides different phylogenetic

’state of the art’ analyses, several novel methods.

Bibliography

Adoutte, A., G. Balavoine, N. Lartillot, O. Lespinet, B. Prud’homme, and R. de Rosa,

2000. The new animal phylogeny: reliability and implications. Proc. Natl. Acad. Sci. USA

97:4453–4456.

Alberts, B., A. Johnson, J. Lewis, M. Raff, K. Roberts, and P. Walter, 1994. Molecular Biology

of the Cell. Garland Publishing Inc., New York.

Altmann, R., 1890. Die Elementarorganismen und ihre Beziehungen zu den Zellen. Verlag von

Veit & Comp., Leipzig.

Altschul, S. F., T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman,

1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search

programs. Nucl. Acids Res. 25:3389–3402.

Anderson, S., A. T. Bankier, B. G. Barrell, M. H. L. de Bruijn, A. Coulson, J. Drouin, I. C.

Eperon, D. P. Nierlich, B. A. Roe, F. Sanger, et al., 1981. Sequence and organization of the

human mitochondrial genome. Nature 290:457–465.

Anderson, S., M. H. de Bruijn, A. R. Coulson, I. C. Eperon, F. Sanger, and I. G. Young, 1982.

Complete sequence of bovine mitochondrial DNA. Conserved features of the mammalian

mitochondrial genome. J Mol Biol 156:683–717.

Arndt, A. and M. Smith, 1998. Mitochondrial gene rearrangement in the sea cucumber genus

Cucumaria. Mol. Biol. Evol. 15:1009–1016.

Asakawa, S., H. Himeno, K. Miura, and K. Watanabe, 1995. Nucleotide sequence and gene

organization of the starfish Asterina pectinifera mitochondrial genome. Genetics 140:1047–1060.

Attardi, G., 1981. Organization and expression of the mammalian mitochondrial genome: a

lesson in economy. Trends Biochem Sci 6:86–89, 100–103.

Bader, D. A., B. M. E. Moret, and M. Yan, 2001. A linear-time algorithm for computing inversion

distance between signed permutations with an experimental study. Journal of Computational

Biology 8(5):483–491.

Bandelt, H. J. and A. W. M. Dress, 1992. Split decomposition: A new and useful approach to

phylogenetic analysis of distance data. Mol. Phyl. Evol. 1:242–252.

Berard, S., A. Bergeron, C. Chauve, and C. Paul, 2007. Perfect sorting by reversals is not always

difficult. IEEE/ACM Transactions on Computational Biology and Bioinformatics 4:4–16.

Bergeron, A., C. Chauve, F. de Montgolfier, and M. Raffinot, 2005. Computing common intervals

of k permutations, with applications to modular decomposition of graphs. In ESA, pp. 779–

790.

Bergeron, A., C. Chauve, F. de Montgolfier, and M. Raffinot, 2008. Computing common intervals

of K permutations, with applications to modular decomposition of graphs. SIAM J. Discrete

Math. To appear.

Bergeron, A. and J. Stoye, 2006. On the similarity of sets of permutations and its applications to

genome comparison. Journal of Computational Biology 13:1340–1354.

Bernhard, D., G. Fritzsch, P. Glöckner, and C. Wurst, 2005. Molecular insights into speciation in

the Agrilus viridis complex and the genus Trachys (Coleoptera: Buprestidae). Eur. J. Entomol.

102:599–605.

Bernhard, D., C. Schmidt, A. Korte, G. Fritzsch, and R. G. Beutel, 2006. From terrestrial to

aquatic habitats and back again - molecular insights into the evolution and phylogeny of

Hydrophiloidea (Coleoptera) using multigene analyses. Zoologica Scripta 35:597–606.

Bernt, M., D. Merkle, K. Ramsch, G. Fritzsch, M. Perseke, D. Bernhard, M. Schlegel, P. F.

Stadler, and M. Middendorf, 2007. CREx: Inferring Genomic Rearrangements Based on

Common Intervals. Bioinformatics 23:2957–2958.

Bhamrah, H. S. and K. Juneja, 2002. Modern Zoology. Anmol Publications PVT. LTD., New

Delhi.

Boehme, M. U., G. Fritzsch, A. Tippmann, M. Schlegel, and T. U. Berendonk, 2007. The

complete mitochondrial genome of the green lizard Lacerta viridis viridis (Reptilia: Lacertidae)

and its phylogenetic position within squamate reptiles. Gene 394:69–77.

Boore, J. L., 1999. Animal mitochondrial genomes. Nucl. Acids Res. 27:1767–1780.

Boore, J. L. and W. M. Brown, 1998. Big trees from little genomes: mitochondrial gene order as

a phylogenetic tool. Curr. Opinion Gen. Devel. 8:668–674.

Booth, K. and G. Lueker, 1976. Testing for the consecutive ones property, interval graphs, and

graph planarity using PQ-tree algorithms. J. Comp. System Sci. 13:335–379.

Borst, P., 1972. Mitochondrial nucleic acids. Ann. Rev. Biochem. 41:333–376.

Bryant, D. and V. Moulton, 2004. Neighbor-net: An agglomerative method for the construction

of phylogenetic networks. Mol. Biol. Evol. 21:255–265.

Bryant, D. and V. Moulton, 2007. Consistency of neighbor-net. Algorithms for Molecular

Biology 2:8. Preprint.

Buneman, P., 1971. The recovery of trees from measures of dissimilarity. In F. R. Hodson, D. G.

Kendall, and P. Tautu, eds., Mathematics and the Archeological and Historical Sciences, pp.

387–395. Edinburgh University Press, Edinburgh, UK.

Bunke, H. and U. Bühler, 1993. Applications of approximate string matching to 2D shape

recognition. Patt. Recogn. 26:1797–1812.

Cameron, C. B., J. R. Garey, and B. J. Swalla, 2000. Evolution of the chordate body plan:

New insights from phylogenetic analyses of deuterostome phyla. Proc. Natl. Acad. Sci. USA

97:4469–4474.

Caprara, A., 2003. The reversal median problem. INFORMS Journal on Computing 15:93–113.

Cartwright, R., 2005. DNA assembly with gaps (Dawg): Simulating sequence evolution.

Bioinformatics 21 S3:iii31–iii38.

Chan, D., 2006. Mitochondria: Dynamic organelles in disease, aging, and development. Cell

125:1241–1252.

Chaudhuri, K., K. Chen, R. Mihaescu, and S. Rao, 2006. On the tandem duplication-random

loss model of genome rearrangement. In SODA, pp. 564–570.

Chipuk, J., L. Bouchier-Hayes, and D. Green, 2006. Mitochondrial outer membrane permeabi-

lization during apoptosis: the innocent bystander scenario. Cell Death and Differentiation

13:1396–1402.

Coenye, T. and P. Vandamme, 2003. Extracting phylogenetic information from whole-genome

sequencing projects: the lactic acid bacteria as a test case. Microbiology 149:3507–3517.

Cosner, M., R. Jansen, B. Moret, L. Raubeson, L.-S. Wang, T. Warnow, and S. Wyman, 2000a.

An empirical comparison of phylogenetic methods on chloroplast gene order data in

Campanulaceae. In D. Sankoff and J. Nadeau, eds., Comparative Genomics: Empirical and Analytical

Approaches to Gene Order Dynamics, Map Alignment, and the Evolution of Gene Families,

pp. 99–121. Kluwer Academic Pub., Dordrecht, Netherlands.

Cosner, M., R. Jansen, B. Moret, L. Raubeson, L.-S. Wang, T. Warnow, and S. Wyman, 2000b.

A new fast heuristic for computing the breakpoint phylogeny and a phylogenetic analysis of a

group of highly rearranged chloroplast genomes. In Proc. 8th Intl Conf. on Intelligent Systems

for Molecular Biology (ISMB00), pp. 104–115.

Dellaporta, S., A. Xu, S. Sagasser, W. Jakob, M. Moreno, L. Buss, and B. Schierwater, 2006.

Mitochondrial genome of Trichoplax adhaerens supports Placozoa as the basal lower metazoan

phylum. Proc Natl Acad Sci USA 103:8751–8756.

Dewey, T. G., 2001. A sequence alignment algorithm with an arbitrary gap penalty function. J.

Comp. Biol. 8:177–190.

Dobzhansky, T. and A. H. Sturtevant, 1938. Inversions in the chromosomes of Drosophila

pseudoobscura. Genetics 23:28–64.

Doyle, J. J., J. I. Davis, R. J. Soreng, D. Garvin, and M. J. Anderson, 1992. Chloroplast DNA

inversions and the origin of the grass family (Poaceae). Proc. Natl. Acad. Sci. USA 89:7722–7726.

Dreyer, H. and G. Steiner, 2004. The complete sequence and gene organization of the

mitochondrial genome of the gadilid scaphopod Siphonondentalium lobatum (Mollusca). Mol. Phylog.

Evol. 31:605–617.

Dunn, C. W., A. Hejnol, D. Q. Matus, K. Pang, W. E. Browne, S. A. Smith, E. Seaver, G. W.

Rouse, M. Obst, G. Edgecombe, et al., 2008. Broad phylogenomic sampling improves

resolution of the animal tree of life. Nature 452:745–750.

Efron, B., E. Halloran, and S. Holmes, 1996. Bootstrap confidence levels for phylogenetic trees.

Proc. Natl. Acad. Sci. USA 93:7085–7090.

Farris, J. S., 1989. The retention index and the rescaled consistency index. Cladistics 5:417–419.

Feagin, J. E., 1992. The 6-kb element of Plasmodium falciparum encodes mitochondrial

cytochrome genes. Mol Biochem Parasitol 52:145–148.

Felsenstein, J., 1985. Confidence limits on phylogenies: An approach using the bootstrap.

Evolution 39:783–791.

Felsenstein, J., 1989. PHYLIP – phylogeny inference package (version 3.2). Cladistics 5:164–166.

Fitz-Gibbon, S. T. and C. H. House, 1999. Whole genome-based phylogenetic analysis of

free-living microorganisms. Nucl. Acids Res. 27:4218–4222.

Fritzsch, G., M. Schlegel, and P. F. Stadler, 2004a. Consensus arrangements of mitochondrial

genomes. In Proceedings of the German Conference on Bioinformatics (GCB 04).

Fritzsch, G., M. Schlegel, and P. F. Stadler, 2004b. Metazoan deep phylogenies: Can the

Cambrian explosion be resolved with molecular markers? In Proceedings of the 12th International

Conference on Intelligent Systems for Molecular Biology (ISMB) / 3rd European Conference

on Computational Biology (ECCB) 2004.

Fukuhara, H., F. Sor, R. Drissi, N. Dinoul, I. Miyakawa, S. Rousset, and A. Viola, 1993. Linear

mitochondrial DNAs of yeasts: frequency of occurrence and general features. Mol Cell Biol

13:2309–2314.

Gotoh, O., 1982. An improved algorithm for matching biological sequences. J. Mol. Biol.

162:705–708.

Green, D., 1998. Apoptotic pathways: the roads to ruin. Cell 94:695–698.

Gregor, J. and M. G. Thomason, 1993. Dynamic programming alignment of sequences repre-

senting cyclic patterns. IEEE Trans. Patt. Anal. Mach. Intell. 15:129–135.

Grünewald, S., 2006. QNet. Unpublished Technical Report.

Grünewald, S., V. Moulton, and A. Spillner, 2007. Consistency of the QNet algorithm for

generating planar split networks from weighted quartets. Disc. Appl. Math. To appear.

Hannenhalli, S. and P. A. Pevzner, 1995. Transforming cabbage into turnip (polynomial al-

gorithm for sorting signed permutations by reversals). In Proceedings 27th ACM Symp. on

Theory of Computing (STOC’95), pp. 178–189.

Heber, S. and J. Stoye, 2001. Algorithms for finding gene clusters. In Proceedings of WABI

2001, number 2149 in LNCS, pp. 252–263.

Helfenbein, K. G., H. Fourcade, R. G. Vanjani, and J. L. Boore, 2004. The mitochondrial genome

of Paraspadella gotoi is highly reduced and reveals that chaetognaths are a sister group to

protostomes. Proc Natl Acad Sci USA 101:10639–10643.

Hermann, G., J. Thatcher, J. Mills, K. Hales, M. Fuller, J. Nunnari, and J. Shaw, 1998.

Mitochondrial fusion in yeast requires the transmembrane GTPase Fzo1p. J. Cell Biol. 143:359–373.

Higgs, P. G., D. Jameson, H. Jow, and M. Rattray, 2003. The evolution of tRNA-Leu genes in

animal mitochondrial genomes. J. Mol. Evol. 57:435–445.

Hillis, D. M. and J. P. Huelsenbeck, 1992. Signal, noise, and reliability in molecular phylogenetic

analysis. J. Heredity 83:189–195.

Hoffmann, R. J., J. L. Boore, and W. M. Brown, 1992. A novel mitochondrial genome

organization for the blue mussel, Mytilus edulis. Genetics 131:397–412.

Huson, D. H., 1998. SplitsTree: analyzing and visualizing evolutionary data. Bioinformatics

14:68–73.

Kluge, A. G. and J. S. Farris, 1969. Quantitative phyletics and the evolution of anurans. Syst.

Zool. 18:1–32.

Korte, A., I. Ribera, R. G. Beutel, and D. Bernhard, 2004. Interrelationships of staphyliniform

groups inferred from 18S and 28S rDNA sequences, with special emphasis on Hydrophiloidea

(Coleoptera, Staphyliniformia). J. Zool. Syst. Evol. Research 42:281–288.

Kumazawa, Y., 2007. Mitochondrial genomes from major lizard families suggest their

phylogenetic relationships and ancient radiations. Gene 388:19–26.

Kurihara, K. and T. Kunisawa, 2004. A gene order database of plastid genomes. Data Sci. J.

3:60–79.

Landau, G. M., E. W. Myers, and J. P. Schmidt, 1998. Incremental string comparison. SIAM J.

Comput. 27:557–582.

Larget, B. and D. L. Simon, 2002. Bayesian phylogenetic inference from animal mitochondrial

genome arrangements. J. Royal Statist. Soc. B 64:681–693.

Lavrov, D. V., J. L. Boore, and W. M. Brown, 2002. Complete mtDNA sequences of two

millipedes suggest a new model for mitochondrial gene rearrangements: Duplication and

nonrandom loss. Mol. Biol. Evol. 19:163–169.

Lee, M., 2005. Squamate phylogeny, taxon sampling, and data congruence. Org. Divers. Evol.

5:25–45.

Lehr, E., G. Fritzsch, and A. Müller, 2005. An analysis of Andes frogs (Phrynopus:

Leptodactylidae) phylogeny based on 12S and 16S mitochondrial DNA sequences. Zoologica Scripta

34:593–603.

Macey, J. R., J. A. Schulte II, A. Larson, and T. J. Papenfuss, 1998. Tandem duplication via

light-strand synthesis may provide a precursor for mitochondrial genomic rearrangement. Mol. Biol.

Evol. 15:71–75.

Maes, M., 1990. On a cyclic string-to-string correction problem. Inform. Process. Lett. 35:73–78.

Maglott, D., J. Ostell, K. D. Pruitt, and T. Tatusova, 2005. Entrez Gene: Gene-centered

information at NCBI. Nucleic Acids Res 33:D54–D58.

Mannella, C., 2006. Structure and dynamics of the mitochondrial inner membrane cristae.

Biochimica et Biophysica Acta (BBA) - Mol Cell Res. 1763:542–548.

Margulis, L., 1970. Origin of Eukaryotic Cells. Yale Univ. Press, New Haven.

Margulis, L., 1981. Symbiosis in Cell Evolution. Freeman, San Francisco, San Francisco.

Martin, W. and M. Muller, 1998. The hydrogen hypothesis forthe first eukaryote. Nature

392:37–41.



Matsumoto, M., 1998. Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator. ACM Trans. on Modeling and Computer Simulation 8:3–30.

McBride, H., M. Neuspiel, and S. Wasiak, 2006. Mitochondria: more than just a powerhouse. Curr. Biol. 16:551–560.

Mereschkowsky, C., 1905. Über Natur und Ursprung der Chromatophoren im Pflanzenreiche. Biol. Centralbl. 25:593–604.

Mindell, D. P., M. D. Sorenson, and D. E. Dimcheff, 1998. Multiple independent origins of mitochondrial gene order in birds. Proc. Natl. Acad. Sci. USA 95:10693–10697.

Mollineda, R. A., E. Vidal, and F. Casacuberta, 2002. Cyclic sequence alignments: approximate versus optimal techniques. Int. J. Pattern Rec. Artif. Intel. 16:291–299.

Moret, B. M. E., D. A. Bader, and T. Warnow, 2002. High-performance algorithm engineering for computational phylogenetics. J. Supercomputing 22:99–111.

Moret, B. M. E., J. Tang, and T. Warnow, 2004. Reconstructing phylogenies from gene-content and gene-order data. In Gascuel O (ed) Mathematics of Evolution and Phylogeny, pp. 321–352. Oxford University Press, New York.

Moret, B. M. E., S. K. Wymann, D. A. Bader, T. Warnow, and M. Yan, 2001. A new implementation and detailed study of breakpoint analysis. In Proc. 6th Pacific Symp. on Biocomputing (PSB’01), pp. 583–594. World Scientific Pub.

Moritz, C. and W. M. Brown, 1986. Tandem duplications of D-loop and ribosomal RNA sequences in lizard mitochondrial DNA. Science 233:1425–1427.

Moritz, C. and W. M. Brown, 1987. Tandem duplications in animal mitochondrial DNAs: variation in incidence and gene content among lizards. Proc. Natl. Acad. Sci. 84:7183–7187.

Moritz, C., T. E. Dowling, and W. M. Brown, 1987. Evolution of animal mitochondrial DNA: relevance for population biology and systematics. Annu. Rev. Ecol. Syst. 18:269–292.

Nardi, F., G. Spinsanti, J. L. Boore, A. Carapelli, R. Dallai, and F. Frati, 2003. Hexapod origins: Monophyletic or paraphyletic? Science 299:1887–1889.

Nass, S. and M. M. K. Nass, 1963. Intramitochondrial fibers with DNA characteristics. J. Cell Biol. 19:593–629.



Nieselt-Struwe, K. and A. von Haeseler, 2001. Quartet-mapping, a generalization of the likelihood mapping procedure. Mol. Biol. Evol. 18:1204–1219.

Notsu, Y., S. Masood, T. Nishikawa, N. Kubo, G. Akiduki, M. Nakazono, A. Hirai, and K. Kadowaki, 2002. The complete sequence of the rice (Oryza sativa L.) mitochondrial genome: frequent DNA sequence acquisition and loss during the evolution of flowering plants. Mol. Genet. Genomics 268:434–445.

Odintsova, M. S. and N. P. Yurina, 2003. Plastid genomes of higher plants and algae: Structure and functions. Mol. Biol. (Mosk.) 37:649–662. Translated from Molekulyarnaya Biologiya, Vol. 37, No. 5, 2003, pp. 768–783.

Ogden, T. and M. Rosenberg, 2006. Multiple sequence alignment accuracy and phylogenetic inference. Syst. Biol. 55:314–328.

Oh-hama, T., 1997. Evolutionary consideration on 5-aminolevulinate synthase in nature. Orig. Life Evol. Biosph. 27:405–412.

Parida, L., 2006. A PQ framework for reconstructions of common ancestors and phylogeny. In Comparative Genomics, volume 4205, pp. 141–155. Springer, Berlin. RECOMB 2006 International Workshop, RCG 2006 Montreal, Canada, September 24-26, 2006 Proceedings.

Perseke, M., G. Fritzsch, K. Ramsch, M. Bernt, D. Merkle, M. Middendorf, D. Bernhard, P. F. Stadler, and M. Schlegel, 2008. Evolution of mitochondrial gene orders in echinoderms. Mol. Phyl. Evol. 47:855–864.

Qi, J., B. Wang, and B.-l. Hao, 2004. Whole proteome prokaryote phylogeny without sequence alignment: a K-string composition approach. J. Mol. Evol. 58:1–11.

Racker, E., 1976. A New Look at Mechanisms in Bioenergetics. Academic Press, New York.

Reijnders, L., 1975. The origin of mitochondria. J. Mol. Evol. 5:167–176.

Ris, H. and R. Singh, 1961. Electron microscope studies on blue-green algae. J. Biophys. Biochem. Cytol. 9:63–80.

Rosenberg, M. S. and S. Kumar, 2001. Incomplete taxon sampling is not a problem for phylogenetic inference. Proc. Natl. Acad. Sci. USA 98:10751–10756.



Rossier, M., 2006. T channels and steroid biosynthesis: in search of a link with mitochondria. Cell Calcium 40:155–164.

Saitou, N. and M. Nei, 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406–425.

Sankoff, D. and M. Blanchette, 1998. Multiple genome rearrangement and breakpoint phylogeny. J. Comput. Biol. 5:555–570.

Sankoff, D., G. Leduc, N. Antoine, B. Paquin, B. F. Lang, and R. Cedergren, 1992. Gene order comparisons for phylogenetic inference: evolution of the mitochondrial genome. Proc. Natl. Acad. Sci. USA 89:6575–6579.

Scanlon, J. and I. Reynolds, 1998. Effects of oxidants and glutamate receptor activation on mitochondrial membrane potential in rat forebrain neurons. J. Neurochem. 71:2392–2400.

Scheffler, I. E., 1999. Mitochondria. John Wiley and Sons Inc., New York.

Schimper, A. F. W., 1883. Über die Entwicklung der Chlorophyllkörner und Farbkörper. Bot. Zeitung 41:105–146.

Scouras, A., K. Beckenbach, A. Arndt, and J. M. Smith, 2004. Complete mitochondrial genome DNA sequence for two ophiuroids and a holothuroid: the utility of protein gene sequence and gene maps in the analyses of deep deuterostome phylogeny. Mol. Phylogenet. Evol. 31:50–65.

Scouras, A. and J. M. Smith, 2001. A novel mitochondrial gene order in the crinoid echinoderm Florometra serratissima. Mol. Biol. Evol. 18:61–73.

Scouras, A. and J. M. Smith, 2006. The complete mitochondrial genomes of the sea lily Gymnocrinus richeri and the feather star Phanogenia gracilis: signature nucleotide bias and unique nad4L gene rearrangement within crinoids. Mol. Phylogenet. Evol. 39:323–334.

Siepel, A. and B. M. E. Moret, 2001. Finding an optimal inversion median: Experimental results. In Proc. WABI, number 2149 in LNCS, pp. 189–203.

Simon, C., F. Frati, A. Beckenbach, B. Crespi, H. Liu, and P. Flook, 1994. Evolution, weighting, and phylogenetic utility of mitochondrial gene sequences and a compilation of conserved polymerase chain reaction primers. Ann. Entomol. Soc. Am. 87:651–701.



Smith, M. J., D. K. Banfield, K. Doteval, S. Gorski, and D. J. Kowbel, 1990. Nucleotide sequence of nine protein-coding genes and 22 tRNAs in the mitochondrial DNA of the sea star Pisaster ochraceus. J. Mol. Evol. 31:195–204.

Snel, B., P. Bork, and M. A. Huynen, 1999. Genome phylogeny based on gene content. Nature Genet. 21:108–110.

Sokal, R. R. and C. D. Michener, 1958. A statistical method for evaluating systematic relationships. Univ. Kansas Sci. Bull. 38:1409–1438.

Stechmann, A. and M. Schlegel, 1999. Analysis of the complete mitochondrial DNA sequence of the brachiopod Terebratulina retusa places Brachiopoda within the protostomes. Proc. R. Soc. Lond. B 266:1–10.

Stocking, C. and E. Gifford, 1959. Incorporation of thymidine into chloroplasts of Spirogyra. Biochem. Biophys. Res. Comm. 1:159–164.

Stockley, B., A. B. Smith, T. Littlewood, H. A. Lessios, and J. A. Mackenzie-Dodds, 2005. Phylogenetic relationships of spatangoid sea urchins (Echinoidea): taxon sampling density and congruence between morphological and molecular estimates. Zool. Scripta 34:447–468.

Swofford, D. L., 2002. PAUP*: Phylogenetic Analysis Using Parsimony (*and Other Methods) Version 4.0b10. Sinauer Associates, Sunderland, MA. Handbook and Software.

Swofford, D. L. and G. J. Olsen, 1990. Phylogeny reconstruction. In D. M. Hillis and C. Moritz, eds., Molecular Systematics, pp. 411–501. Sinauer Associates, Sunderland, MA.

Thompson, J. D., D. G. Higgins, and T. J. Gibson, 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties, and weight matrix choice. Nucl. Acids Res. 22:4673–4680.

Townsend, T. and A. Larson, 2002. Molecular phylogenetics and mitochondrial genomic evolution in the Chamaeleonidae (Reptilia, Squamata). Mol. Phylog. Evol. 23:22–36.

Townsend, T. M., A. Larson, E. Louis, and J. Macey, 2004. Molecular phylogenetics of Squamata: The position of snakes, amphisbaenians, and dibamids, and the root of the squamate tree. Syst. Biol. 53:735–757.

Uno, T. and M. Yagiura, 2000. Fast algorithms to enumerate all common intervals of two permutations. Algorithmica 26:290–309.



Vidal, N. and S. B. Hedges, 2005. The phylogeny of squamate reptiles (lizards, snakes, and amphisbaenians) inferred from nine nuclear protein-coding genes. C. R. Biologies 328:1000–1008.

Voet, D., J. G. Voet, and C. W. Pratt, 2006. Fundamentals of Biochemistry, 2nd Edition. John Wiley and Sons, Inc., New York.

Wägele, J.-W., 2005. Foundations of Phylogenetic Systematics. Verlag Dr. Friedrich Pfeil, Munich, Germany.

Wallin, I. E., 1923. The mitochondria problem. The American Naturalist 57(650):255–261.

Wang, L.-S., R. K. Jansen, B. M. E. Moret, L. A. Raubeson, and T. Warnow, 2002. Fast phylogenetic methods for genome rearrangement evolution: An empirical study. In Proc. 7th Pacific Symp. on Biocomputing (PSB’02), pp. 524–535. World Scientific Pub.

Watterson, G. A., W. J. Ewens, T. E. Hall, and A. Morgan, 1982. The chromosome inversion problem. J. Theor. Biol. 99:1–7.

Wetzel, R., 1995. Zur Visualisierung abstrakter Ähnlichkeitsbeziehungen. Ph.D. thesis, Bielefeld University, Germany.

Wolf, P. G., K. G. Karol, D. F. Mandoli, J. Kuehl, K. Arumuganathan, M. W. Ellis, B. D. Mishler, D. G. Kelch, R. G. Olmstead, and J. L. Boore, 2005. The first complete chloroplast genome sequence of a lycophyte, Huperzia lucidula (Lycopodiaceae). Gene 350:117–128.


APPENDIX A

Appendix

NP-Hard

A problem H is NP-hard if and only if there is an NP-complete problem L that is polynomial-time Turing-reducible to H, i.e. $L \leq_T H$.

Figure A.1: Venn diagram for the P, NP, NP-complete, and NP-hard sets of problems.


Indices in Phylogenetic Reconstruction

Taken from Wägele (2005) and modified.

CI = consistency index

The index expresses the number of homoplasies as a proportion of the total number of character state changes in a topology. For a character $i$ in the data set, $n_i$ is the number of character states. The lowest number of character state changes $m_i$ to be expected in a topology is $n_i - 1$, implying a single occurrence of each apomorphic state.

If $s_i$ is the number of character state changes occurring in a topology, the consistency index for character $i$ is:

$c_i = m_i / s_i$ (A.1)

The consistency index $CI$ for the whole topology is calculated from the sum $M$ of all $m_i$ and the sum $S$ of all character state changes $s_i$ present in the topology:

$CI = M / S$ (A.2)

When homoplasies are present for a character in a topology, this character shows more state changes $s_i$ than the minimum number of changes $m_i$, and the index decreases. If no homoplasies are present, the consistency index attains $CI = 1$. However, the lower bound of the consistency index is not 0, and $c_i$ varies with topology. For this reason, Farris (1989) proposed two further quantities, the retention index (RI) and the rescaled consistency index (RC).

HI = homoplasy index

The homoplasy index $HI$ is complementary to the consistency index and is taken to be a measure of the proportion of character state changes that are caused by homoplasies. It can serve as a measure of the noise present in the data.

$HI = 1 - CI$ (A.3)

RI = retention index

The retention index (RI) was designed to be a measure of the amount of putative synapomorphies (in relation to a given data set) that are retained in a topology. The more putative analogies occur, the lower the RI value. The value $l_{max}$ is the maximal possible length of a dendrogram for a given data set. The retention index (RI) is calculated as follows:

$RI = \frac{l_{max} - S}{l_{max} - M}$ (A.4)

RC = rescaled consistency index

The rescaled consistency index (RC) is calculated as follows:

$RC = CI \times RI$ (A.5)
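The four indices above can be computed together from per-character step counts. The following is a minimal sketch, not the procedure of any particular phylogenetics package: the function name `parsimony_indices`, its parameter names, and the example numbers are hypothetical, and it assumes that the maximal dendrogram length is obtained by summing a per-character maximum number of steps.

```python
from fractions import Fraction


def parsimony_indices(min_steps, obs_steps, max_steps):
    """Return CI, HI, RI and RC for one topology.

    min_steps[i] -- m_i, minimum number of changes of character i (n_i - 1)
    obs_steps[i] -- s_i, changes of character i observed on the topology
    max_steps[i] -- assumed per-character maximum; the sum is used as l_max
    """
    M = sum(min_steps)
    S = sum(obs_steps)
    l_max = sum(max_steps)
    if S == 0 or l_max == M:
        raise ValueError("indices are undefined for uninformative data")
    ci = Fraction(M, S)                   # (A.2)
    hi = 1 - ci                           # (A.3)
    ri = Fraction(l_max - S, l_max - M)   # (A.4)
    rc = ci * ri                          # (A.5)
    return {"CI": ci, "HI": hi, "RI": ri, "RC": rc}


# Hypothetical data: three characters with 2, 2 and 3 states.
indices = parsimony_indices(min_steps=[1, 1, 2],
                            obs_steps=[1, 3, 2],
                            max_steps=[3, 4, 5])
print({name: str(value) for name, value in indices.items()})
```

Exact fractions are used so that the relation RC = CI × RI can be checked without rounding error; for the example data the sums are M = 4, S = 6 and a maximal length of 12.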



CIRCAL Material

Table A.1: Accession numbers of the taxa used

Accession number Organism Taxonomy

Annelida

NC 001673 Lumbricus terrestris Clitellata; Oligochaeta

NC 000931 Platynereis dumerilii Polychaeta; Palpata

Arthropoda - Chelicerata

NC 002010 Ixodes hexagonus Arachnida

NC 002074 Rhipicephalus sanguineus Arachnida

NC 004357 Ornithodoros moubata Arachnida

NC 004370 Ixodes persulcatus Arachnida

NC 004454 Varroa destructor Arachnida

NC 005291 Carios capensis Arachnida

NC 005292 Haemaphysalis flava Arachnida

NC 005293 Ixodes holocyclus Arachnida

NC 005820 Ornithodoros porcinus Arachnida

NC 005924 Heptathela hangzhouensis Arachnida

NC 005925 Ornithoctonus huwena Arachnida

NC 005942 Habronattus oregonensis Arachnida

NC 005963 Amblyomma triguttatum Arachnida

NC 006078 Ixodes uriae Arachnida

NC 003057 Limulus polyphemus Merostomata

Arthropoda - Crustacea

NC 000844 Daphnia pulex Branchiopoda

NC 001620 Artemia franciscana Branchiopoda

NC 004465 Triops cancriformis Branchiopoda

NC 006079 Triops longicaudatus Branchiopoda

NC 005937 Hutchinsoniella macracantha Cephalocarida

NC 005934 Armillifer armillatus Crustacea

NC 002184 Penaeus monodon Malacostraca

NC 003058 Pagurus longicarpus Malacostraca

NC 004251 Panulirus japonicus Malacostraca



NC 005037 Portunus trituberculatus Malacostraca

NC 006081 Squilla mantis Malacostraca

NC 011243 Cherax destructor Malacostraca

NC 005936 Pollicipes polymerus Maxillopoda

NC 008974 Tetraclita japonica Maxillopoda

NC 005306 Vargula hilgendorfii Ostracoda

NC 005938 Speleonectes tulumensis Remipedia

Arthropoda - Hexapoda

NC 002735 Tetrodontophora bielanensis Collembola

NC 005438 Gomphiocephalus hodgsoni Collembola

NC 006074 Onychiurus orientalis Collembola

NC 006075 Podura aquatica Collembola

NC 000857 Ceratitis capitata Pterygota

NC 000875 Anopheles quadrimaculatus Pterygota

NC 001322 Drosophila yakuba Pterygota

NC 001566 Apis mellifera Pterygota

NC 001709 Drosophila melanogaster Pterygota

NC 001712 Locusta migratoria Pterygota

NC 002084 Anopheles gambiae Pterygota

NC 002355 Bombyx mori Pterygota

NC 002609 Triatoma dimidiata Pterygota

NC 002660 Cochliomyia hominivorax Pterygota

NC 002697 Chrysomya putoria Pterygota

NC 003081 Tribolium castaneum Pterygota

NC 003367 Ostrinia nubilalis Pterygota

NC 003368 Ostrinia furnacalis Pterygota

NC 003372 Crioceris duodecimpunctata Pterygota

NC 003395 Bombyx mandarina Pterygota

NC 003970 Pyrocoelia rufa Pterygota

NC 004529 Melipona bicolor Pterygota

NC 004622 Antheraea pernyi Pterygota






NC 004816 Lepidopsocid RS-2001 Pterygota

NC 005333 Bactrocera oleae Pterygota

NC 005779 Drosophila mauritiana Pterygota

NC 005780 Drosophila sechellia Pterygota

NC 005781 Drosophila simulans Pterygota

NC 005939 Aleurodicus dugesii Pterygota

NC 005944 Philaenus spumarius Pterygota

NC 006076 Periplaneta fuliginosa Pterygota

NC 006133 Pteronarcys princeps Pterygota

NC 005437 Tricholepidion gertschi Thysanura

NC 006080 Thermobia domestica Thysanura

Arthropoda - Myriapoda

NC 002629 Lithobius forficatus Chilopoda

NC 005870 Scutigera coleoptrata Chilopoda

NC 003343 Narceus annularus Diplopoda

NC 003344 Thyropygus sp. Diplopoda

Echinodermata

NC 001627 Asterina pectinifera Eleutherozoa; Asterozoa

NC 004610 Pisaster ochraceus Eleutherozoa; Asterozoa

NC 005334 Ophiopholis aculeata Eleutherozoa; Asterozoa

NC 005930 Ophiura lutkeni Eleutherozoa; Asterozoa

NC 001453 Strongylocentrotus purpuratus Eleutherozoa; Echinozoa

NC 001572 Paracentrotus lividus Eleutherozoa; Echinozoa

NC 001770 Arbacia lixula Eleutherozoa; Echinozoa

NC 005929 Cucumaria miniata Eleutherozoa; Echinozoa

NC 001878 Florometra serratissima Pelmatozoa; Crinoidea

Mollusca

NC 005335 Lampsilis ornata Bivalvia; Palaeoheterodonta

NC 001276 Crassostrea gigas Bivalvia; Pteriomorphia

NC 002507 Loligo bleekeri Cephalopoda; Coleoidea

NC 002176 Pupa strigosa Gastropoda; Orthogastropoda






NC 004321 Roboastra europaea Gastropoda; Orthogastropoda

NC 005827 Aplysia californica Gastropoda; Orthogastropoda

NC 005940 Haliotis rubra Gastropoda; Orthogastropoda

NC 001761 Albinaria caerulea Gastropoda; Pulmonata

NC 001816 Cepaea nemoralis Gastropoda; Pulmonata

NC 005439 Biomphalaria glabrata Gastropoda; Pulmonata

NC 001636 Katharina tunicata Polyplacophora; Neoloricata

NC 005840 Siphonodentalium lobatum Scaphopoda; Gadilida

Nematoda

NC 001327 Ascaris suum Chromadorea; Ascaridida

NC 001328 Caenorhabditis elegans Chromadorea; Rhabditida

NC 003415 Ancylostoma duodenale Chromadorea; Rhabditida

NC 003416 Necator americanus Chromadorea; Rhabditida

NC 004806 Cooperia oncophora Chromadorea; Rhabditida

NC 005143 Strongyloides stercoralis Chromadorea; Rhabditida

NC 005941 Steinernema carpocapsae Chromadorea; Rhabditida

NC 001861 Onchocerca volvulus Chromadorea; Spirurida

NC 004298 Brugia malayi Chromadorea; Spirurida

NC 005305 Dirofilaria immitis Chromadorea; Spirurida

Platyhelminthes

NC 000928 Echinococcus multilocularis Cestoda; Eucestoda

NC 002547 Taenia crassiceps Cestoda; Eucestoda

NC 002767 Hymenolepis diminuta Cestoda; Eucestoda

NC 004022 Taenia solium Cestoda; Eucestoda

NC 004826 Taenia asiatica Cestoda; Eucestoda

NC 002354 Paragonimus westermani Trematoda; Digenea

NC 002529 Schistosoma mekongi Trematoda; Digenea

NC 002544 Schistosoma japonicum Trematoda; Digenea

NC 002545 Schistosoma mansoni Trematoda; Digenea

NC 002546 Fasciola hepatica Trematoda; Digenea



List of the Gene Abbreviations

Proteins

ATP synthase F0 subunit 6           ATP6    ATP6
ATP synthase F0 subunit 8           ATP8    ATP8
Cytochrome c oxidase subunit I      COX1    CO1
Cytochrome c oxidase subunit II     COX2    CO2
Cytochrome c oxidase subunit III    COX3    CO3
Cytochrome B                        CytB    CYTB
NADH dehydrogenase subunit 1        ND1     ND1
NADH dehydrogenase subunit 2        ND2     ND2
NADH dehydrogenase subunit 3        ND3     ND3
NADH dehydrogenase subunit 4        ND4     ND4
NADH dehydrogenase subunit 4L       ND4L    ND4L
NADH dehydrogenase subunit 5        ND5     ND5
NADH dehydrogenase subunit 6        ND6     ND6

rRNAs

Large            -      rRNA large    16S
Small            -      rRNA small    12S

tRNAs

Alanine          Ala    A
Arginine         Arg    R
Asparagine       Asn    N
Aspartic acid    Asp    D
Cysteine         Cys    C
Glutamine        Gln    Q
Glutamic acid    Glu    E
Glycine          Gly    G
Histidine        His    H
Isoleucine       Ile    I
Leucine          Leu    L2    UUR
Leucine          Leu    L1    CUN
Lysine           Lys    K
Methionine       Met    M
Phenylalanine    Phe    F
Proline          Pro    P
Serine           Ser    S1    AGN
Serine           Ser    S2    UCN
Threonine        Thr    T
Tryptophan       Trp    W
Tyrosine         Tyr    Y
Valine           Val    V



CREx Figures

CO1 R ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB F 12S E T -16S -ND2 -I -ND1 -L2 -G -Y D -M V -C -W A -L1 -N Q -P

CO1 R ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB F 12S E T P -Q N L1 -A W C -V M -D Y G L2 ND1 I ND2 16S

(a)

-16S -ND2 -I -ND1 -L2 -G -Y D -M V -C -W A -L1 -N Q -P

P -Q N L1 -A W C -V M -D Y G L2 ND1 I ND2 16S

(b)

Figure A.2: Inversion 1 (I1) - Echinoidea vs. Asteroidea: (a) family diagram for Echinoidea (top) and Asteroidea (bottom); (b) nondirectional inversion of 16 tRNAs.
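The inversion in panel (b) can be reproduced on signed gene lists, where a leading minus sign marks the opposite strand. This is an illustrative sketch only, not code from CREx; the helper names `flip` and `invert` are hypothetical. An inversion reverses the order of a block and switches the strand of every gene in it.

```python
def flip(gene):
    """Switch the strand of a signed gene label ('-P' <-> 'P')."""
    return gene[1:] if gene.startswith("-") else "-" + gene


def invert(genes, start, end):
    """Invert genes[start:end]: reverse the block and flip every strand."""
    block = [flip(g) for g in reversed(genes[start:end])]
    return genes[:start] + block + genes[end:]


# The two rows of panel (b) in Figure A.2.
row_1 = "-16S -ND2 -I -ND1 -L2 -G -Y D -M V -C -W A -L1 -N Q -P".split()
row_2 = "P -Q N L1 -A W C -V M -D Y G L2 ND1 I ND2 16S".split()

inverted = invert(row_1, 0, len(row_1))
print(" ".join(inverted))
print(inverted == row_2)
```

Because an inversion is its own inverse, applying `invert` to the second row returns the first, which is consistent with the caption calling the operation nondirectional.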

CO1 R ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB P -Q N L1 -A W C -V M -D -T -E -12S -F -L2 -G -16S -Y -ND2 -I -ND1

CO1 R CO2 K ATP8 ATP6 ND4L CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB P -Q N L1 -A W C -V M -D -T -E -12S -F -L2 -G -16S -Y -ND2 -I -ND1

(a)

ND4L CO2 K ATP8 ATP6

CO2 K ATP8 ATP6 ND4L

(b)

Figure A.3: Transposition 1 (T1) - Florometra serratissima, Phanogenia gracilis vs. Gymnocrinus richeri: (a) family diagram for Florometra serratissima, Phanogenia gracilis (top) and Gymnocrinus richeri (bottom); (b) the nondirectional transposition of the gene ND4L.
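A transposition exchanges two adjacent blocks without changing any strand. The sketch below replays panel (b) of Figure A.3; the function name `transpose` and its index convention are assumptions made for illustration, not the CREx interface.

```python
def transpose(genes, i, j, k):
    """Exchange the adjacent blocks genes[i:j] and genes[j:k]; strands are kept."""
    if not 0 <= i < j < k <= len(genes):
        raise ValueError("need 0 <= i < j < k <= len(genes)")
    return genes[:i] + genes[j:k] + genes[i:j] + genes[k:]


# The two rows of panel (b) in Figure A.3.
top = "ND4L CO2 K ATP8 ATP6".split()
bottom = "CO2 K ATP8 ATP6 ND4L".split()

moved = transpose(top, 0, 1, 5)      # ND4L jumps behind ATP6
restored = transpose(moved, 0, 4, 5)  # the same event read backwards
print(" ".join(moved))
print(moved == bottom, restored == top)
```

Moving ND4L behind the block CO2 K ATP8 ATP6 and moving that block in front of ND4L are the same event, so a single transposition explains the difference in either direction.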





(a)

F 12S E T P -Q N L1 -A W C -V M -D

P -Q N L1 -A W C -V M -D -T -E -12S -F

(b)

Y G L2 ND1 I ND2 16S

-16S -ND2 -I -ND1 -L2 -G -Y

(c)

-16S -ND2 -I -ND1 -L2 -G -Y

-ND2 -I -ND1 -L2 -G -16S -Y

(d)

-ND2 -I -ND1 -L2 -G -16S -Y

-L2 -G -16S -Y -ND2 -I -ND1

(e)

Figure A.4: Tandem duplication random loss 3 (TDRL3) - Florometra serratissima, Phanogenia gracilis vs. Echinoidea: (a) family diagram for Florometra serratissima, Phanogenia gracilis (top) and Echinoidea (bottom); (b) inversion transposition, (c) inversion, (d) first transposition, and (e) second transposition.
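The four steps of panels (b) to (e) can be replayed one after another on the gene lists shown in the figure. This is a hedged sketch: the function names are hypothetical, and `inverse_transpose` assumes that the operation labelled "inversion transposition" exchanges two adjacent blocks and inverts the block that originally came first, which is the reading that reproduces panel (b).

```python
def flip(gene):
    """Switch the strand of a signed gene label."""
    return gene[1:] if gene.startswith("-") else "-" + gene


def invert(genes, i, j):
    """Reverse genes[i:j] and flip the strand of each gene in it."""
    return genes[:i] + [flip(g) for g in reversed(genes[i:j])] + genes[j:]


def transpose(genes, i, j, k):
    """Exchange the adjacent blocks genes[i:j] and genes[j:k]."""
    return genes[:i] + genes[j:k] + genes[i:j] + genes[k:]


def inverse_transpose(genes, i, j, k):
    """Exchange the two blocks and invert the block that came first."""
    moved = [flip(g) for g in reversed(genes[i:j])]
    return genes[:i] + genes[j:k] + moved + genes[k:]


# Panel (b): the block F 12S E T is moved behind the tRNA cluster and inverted.
seg_b = "F 12S E T P -Q N L1 -A W C -V M -D".split()
after_b = inverse_transpose(seg_b, 0, 4, len(seg_b))

# Panels (c) to (e): inversion followed by two transpositions.
seg = "Y G L2 ND1 I ND2 16S".split()
after_c = invert(seg, 0, len(seg))
after_d = transpose(after_c, 0, 1, 6)
after_e = transpose(after_d, 0, 3, 7)

for step in (after_b, after_c, after_d, after_e):
    print(" ".join(step))
```

Each printed line corresponds to the lower row of the respective panel, so the scenario uses four elementary events in total.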



(a)

P -Q N L1 -A W C -V M -D -T -E -12S -F

F 12S E T P -Q N L1 -A W C -V M -D

(b)

-L2 -G -16S -Y -ND2 -I -ND1

ND1 I ND2 Y 16S G L2

(c)

ND1 I ND2 Y 16S G L2

Y G L2 ND1 I ND2 16S

(d)

Figure A.5: Tandem duplication random loss 3 (TDRL3) - Echinoidea vs. Florometra serratissima, Phanogenia gracilis: (a) family diagram for Echinoidea (top) and Florometra serratissima, Phanogenia gracilis (bottom); (b) inversion transposition, (c) inversion, and (d) tandem duplication random loss event. This scenario was favored in the analysis on grounds of parsimony.




CO1 R E P N L1 W -V ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB F 12S T -Q -A C M -D Y G L2 ND1 I ND2 16S

(a)

ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB F 12S E T P -Q N L1 -A W C -V

E P N L1 W -V ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB F 12S T -Q -A C

(b)

CO1 R E P N L1 W -V ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB F 12S T -Q -A C M -D Y G L2 ND1 I ND2 16S


(c)

E P N L1 W -V ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB F 12S T -Q -A C

P W ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB F 12S -Q C E N L1 -V T -A

P W ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB F 12S -Q C E N L1 -V T -A

W ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB F 12S C E -V T P -Q N L1 -A

W ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB F 12S C E -V T P -Q N L1 -A

ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB F 12S E T P -Q N L1 -A W C -V

(d)

Figure A.6: Tandem duplication random loss 1 (TDRL1): This example illustrates the asymmetry of TDRL operations (and of the corresponding distance measure): transforming the gene order of C. miniata (A and B) into the echinoid order requires three TDRL rearrangements. (a) Echinoidea vs. Cucumaria miniata and (b) the resulting TDRL; (c) Cucumaria miniata vs. Echinoidea and (d) the three resulting TDRLs in this direction.
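The asymmetry described in this caption can be checked directly on the gene lists of panels (b) and (d). The sketch below is an illustration under stated assumptions, not the CREx implementation: a TDRL is modelled as duplicating the whole segment in tandem and keeping every gene in exactly one of the two copies, and all function names are hypothetical. The lower bound follows from a simple counting argument: the result of one TDRL consists of two subsequences of its input, so each event can at most double the number of increasing runs relative to the starting order.

```python
from math import ceil, log2


def tdrl(genes, second_copy):
    """Apply one TDRL to a gene order.

    The order is duplicated in tandem; genes named in second_copy survive
    only in the second copy, all others only in the first. Strands never change.
    """
    first = [g for g in genes if g not in second_copy]
    second = [g for g in genes if g in second_copy]
    return first + second


def increasing_runs(source, target):
    """Count maximal runs of target that follow the left-to-right order of source."""
    position = {gene: index for index, gene in enumerate(source)}
    ranks = [position[gene] for gene in target]
    return 1 + sum(1 for a, b in zip(ranks, ranks[1:]) if b < a)


def tdrl_lower_bound(source, target):
    """At least ceil(log2(runs)) TDRLs are needed to turn source into target."""
    return ceil(log2(increasing_runs(source, target)))


echinoid = ("ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB "
            "F 12S E T P -Q N L1 -A W C -V").split()
cucumaria = ("E P N L1 W -V ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 "
             "-ND6 CYTB F 12S T -Q -A C").split()

# Panel (b): a single TDRL turns the echinoid order into that of C. miniata.
one_step = tdrl(echinoid, set(echinoid) - {"E", "P", "N", "L1", "W", "-V"})

# Panel (d): the three TDRLs of the opposite direction.
step = cucumaria
for second_copy in ({"E", "N", "L1", "-V", "T", "-A"},
                    {"P", "-Q", "N", "L1", "-A"},
                    {"W", "C", "-V"}):
    step = tdrl(step, second_copy)

print(one_step == cucumaria, tdrl_lower_bound(echinoid, cucumaria))
print(step == echinoid, tdrl_lower_bound(cucumaria, echinoid))
```

Relative to the echinoid order, the C. miniata order has two increasing runs, so one TDRL is enough; in the opposite direction there are six runs, and the bound of three events is met by the three steps of panel (d).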


CO1 CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H R ND4L S1 ND5 -ND6 CYTB P -Q N L1 W C M -A -V -D -T -E -12S -F -L2 -G -16S -Y -ND2 -I -ND1

(a)

R ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H

CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H R ND4L

(b)

-A W C -V M

W C M -A -V

(c)

Figure A.7: The rearrangements (T2, TDRL2) of Antedon mediterranea with respect to Florometra serratissima are shown above. Note that TDRL2 acts in the direction of A. mediterranea (in the direction of F. serratissima one more TDRL would be needed; see Figure A.8). Tandem duplication random loss 2 (TDRL2) - Florometra serratissima, Phanogenia gracilis vs. Antedon mediterranea: (a) family diagram for Florometra serratissima, Phanogenia gracilis (top) and Antedon mediterranea (bottom); (b) transposition T2, and (c) the TDRL3 favored in this analysis.



CO1 CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H R ND4L S1 ND5 -ND6 CYTB P -Q N L1 W C M -A -V -D -T -E -12S -F -L2 -G -16S -Y -ND2 -I -ND1


(a)

CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H R ND4L

R ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H

(b)

W C M -A -V

M -A W C -V

(c)

M -A W C -V

-A W C -V M

(d)

Figure A.8: Tandem duplication random loss 2 (TDRL2) - Antedon mediterranea vs. Florometra serratissima, Phanogenia gracilis: (a) family diagram for Antedon mediterranea (top) and Florometra serratissima, Phanogenia gracilis (bottom); (b) transposition T2; (c) and (d) the TDRLs of the alternative scenario (see Figure A.7).


Summary

The use of molecular markers for reconstructing phylogenetic relationships has increased strongly in recent decades. They are currently an important component of phylogenetic studies at various taxonomic levels. In particular, they are used whenever morphological data yield no resolution, or only insufficient resolution, of the phylogenetic relationships.

However, the wealth of collected molecular data and its steady growth confront scientists with new challenges in handling these data. The still fairly young discipline of bioinformatics, which combines aspects of biology, computer science, and mathematics, has developed various tools for handling and analysing these complex data.

Using the mitochondrial genome, this dissertation addresses two subproblems in working with molecular markers and presents an approach to solving each.

The first problem arises from the wealth of currently available data and its exponential growth. The basis of a phylogenetic reconstruction is a comparison of character strings and the information they carry, which in the case of molecular markers are very often gene sequences. This comparison is the foundation of a careful reconstruction of the relationships. However, multiple substitutions, point mutations and, for example, the frequent exchange (wobbling) of the third codon position in protein-coding genes lead to alignments whose information content is very hard or impossible to recognise. It has been shown in the literature that these positions often contain important information, but also phylogenetic noise. In the past, these positions were mostly discarded by manual removal, which often made it impossible to reproduce the results. This is where the algorithm NOISY, described and implemented in this dissertation, comes in. The resulting computer program allows the user to assign a reliability value to every position of a sequence alignment and to test to what extent the pattern at that position could also have been generated by chance. The underlying assumption is that, in sequence evolution, patterns only very rarely arise by chance. As a result, the user obtains a graphical overview of the information content and of the distribution of informative positions over the entire alignment, in a framework that anyone can reproduce.

The second problem I address concerns extracting phylogenetic information beyond sequence analysis. It has been shown in the literature that the arrangement of genes in the mitochondrial genome is a rich source of information, especially for very old splitting events. In this dissertation I present two new algorithms and their implementations to make exactly this information usable.

The first algorithm, implemented in the computer program CIRCAL, makes it possible to compare gene arrangements of different lengths by means of aligned lists. A progressive alignment approach was used to combine pairwise comparisons of aligned gene order lists with methods for the multiple comparison of gene arrangements. This very simple and fast approach then allows common reconstruction methods such as neighbor joining, maximum parsimony, maximum likelihood, or Bayesian methods to be applied to the result.

The second algorithm comprises the development and implementation of the CREx algorithm. It is based on the detection of patterns in so-called 'strong interval trees', which reflect the corresponding genome rearrangement operations. In addition, the so-called 'tandem duplication random loss' (TDRL) operation was recognised as a further complex event in mitochondrial genome evolution and implemented. Based on the information about which strand a gene is located on, this operation can describe very complex rearrangements involving a large number of inversions with a single event. A graphical overview makes it possible to quickly survey the operations that may have taken place between differently evolving genomes. Furthermore, hypothetical or competing topologies can be tested on the basis of the gene order information of the mitochondrial genomes. Analyses that I carried out with the CREx method showed that it is able both to recover published and well-described rearrangement scenarios and to identify new ones or detect shorter plausible scenarios of gene order evolution. In a future step, it should be possible to reconstruct phylogenetic relationships from the information of mitochondrial gene order alone.
