Evolution of mitochondrial genomes and reconstruction of phylogenetic relationships
Submitted to the Faculty of Biosciences, Pharmacy and Psychology of the University of Leipzig
DISSERTATION
for the attainment of the academic degree
DOCTOR RERUM NATURALIUM (Dr. rer. nat.)
presented by
Dipl. Biol. Guido Fritzsch
born on 10 September 1975 in Leipzig
Leipzig, 23 January 2009
Abstract
Nowadays, molecular markers are an essential part of the reconstruction of phylogenetic relationships between organisms. They make it possible to extend the information provided by morphological data and, beyond that, to resolve important questions such as the splits during the Cambrian radiation.
During the last few years, the amount of available molecular data has increased exponentially. A good example of such a marker is the mitochondrial genome, which carries phylogenetic information at different taxonomic levels. Bioinformatics, as an interdisciplinary field of research, is developing a wide range of tools that allow biologists to analyse molecular data extensively and carefully. This work is a further step towards finding algorithms and developing new programs that give biologists basic tools for a careful analysis of these data and for the reconstruction of phylogenetic relationships. My work follows two approaches. The first deals with the quality of data sets: multiple substitutions, point mutations, wobbling third positions in protein-coding genes, and simply variable parts of sequences lead to alignment positions whose character information can no longer be interpreted correctly. The second approach uses the information contained in the mitochondrial gene order, which preserves comprehensive information about very old splits, for example in the deep phylogeny of the Metazoa.
In this dissertation I present the development and implementation of three novel algorithms: one for assessing the quality of large sequence alignments and two for dealing with mitochondrial gene order information.
Bibliographical Data
Guido Fritzsch
Evolution of mitochondrial genomes and reconstruction of phylogenetic relationships
University of Leipzig, dissertation, 100 pages, 138 references, 25 figures, 3 tables
Abstract
This study includes various phylogenetic reconstructions of the relationships between different species. The focus lies on mitochondrial genomes and the manifold information they contain. Three novel approaches were developed that make it possible to validate, investigate, analyse, and use this information.
Abbreviations
ACC         Accession Number
ATP         Adenosine triphosphate
bp          base pair
COX I-III   genes for cytochrome c oxidase subunits I-III
cyt b       gene for cytochrome b
D-Loop      structure within the mitochondrial control region
de novo     beginning again
DNA         Deoxyribonucleic acid
mRNA        Messenger Ribonucleic acid
tRNA        Transfer Ribonucleic acid
kb          kilo base pair
ML          Maximum Likelihood
MP          Maximum Parsimony
mt          mitochondrial
mt DNA      mitochondrial Deoxyribonucleic acid
mt genome   mitochondrial genome
NCBI        National Center for Biotechnology Information
ND1-6       genes for NADH dehydrogenase subunits 1-6
NJ          Neighbor Joining
OH          origin of replication of the heavy strand of mitochondrial DNA
OL          origin of replication of the light strand of mitochondrial DNA
rRNA        ribosomal Ribonucleic acid
SET         Serial Endosymbiotic Theory
sp.         species
TIM         the inner membrane complex
Contents
1 The Whisper of the Leaves
2 Mitochondria
  2.1 History
  2.2 The Origin of Mitochondria
    2.2.1 The Serial Endosymbiosis Theory
    2.2.2 The Episome Theory
    2.2.3 The Hydrogen Hypothesis
  2.3 The Mitochondrial DNA
    2.3.1 The Function
    2.3.2 The Genome
    2.3.3 The Replication
    2.3.4 The Structure
    2.3.5 Mitochondrial Inheritance
3 Noisy
  3.1 Misleading Sites
  3.2 Trees, Metrics, and Weighted Split Systems
  3.3 Noise Detection Using Circular Split Systems
  3.4 Computational Results
  3.5 Conclusion
4 Gene Order Rearrangements
  4.1 Breakpoint Distance
  4.2 Inversion Distance
  4.3 Parsimony Approaches
    4.3.1 Encoding Methods
    4.3.2 Direct Optimization
  4.4 The circal Algorithm
    4.4.1 Cyclic Alignments
    4.4.2 Encoding of Mitochondrial Genomes
    4.4.3 Scoring Model
    4.4.4 Multiple Cyclic Alignments
    4.4.5 Implementation
    4.4.6 Tree Reconstruction
    4.4.7 Consensus Gene Arrangements
    4.4.8 Ancestral Genome Organization
    4.4.9 Mitochondrial Genomes
    4.4.10 Chloroplast Genomes
  4.5 The CREx Algorithm
    4.5.1 Basic Definitions
    4.5.2 Methods
    4.5.3 The CREx Algorithm
    4.5.4 The Implementation of the CREx Algorithm
    4.5.5 Real World Example
    4.5.6 Current Developments
5 Summary
Bibliography
A Appendix
CHAPTER 1
The Whisper of the Leaves
Molecular phylogeny, also known as molecular systematics, is a sub-discipline of molecular biology. It deals with the inference of evolutionary relationships among taxa, among organisms within populations, and among other heritable biological entities such as genes. Knowledge of such relationships between molecular markers or structures can also be used to study the properties of taxa, including intrinsic traits, ecological interactions, and geographic distributions.
In recent publications, molecular characters are often used as evolutionary markers. These manifold characters, such as nuclear or mitochondrial DNA or biochemical pathways, are good indicators for phylogenetic classification.
'Speciation events' are the processes of interest in molecular systematics. The traces left behind by such events are genetic differences between organisms, which can be analysed either directly at the level of nucleic acids or indirectly through the visible modification of morphological structures. The only information that exists and can be used consists of the apomorphies present in recent organisms, fossils, genomes, etc., and of similarities that indicate some properties of evolutionary processes. However, evolution does not only produce evolutionary novelties that can be interpreted as apomorphies. Real data often include novelties with high similarity to characters of distantly related organisms. At the molecular level, characters can easily suggest a close relationship to characters of distinct groups.
From a phylogenetic perspective, such characters are considered 'noise'. These characters, often summarized as homoplasy, contain manifold information, but in the majority of cases this information is modified, misleading, or simply a false signal. In biology, the term homoplasy describes similar structures that do not derive from common ancestry, i.e. that were not inherited from a shared ancestor. Homoplasies, especially analogies and convergences, can lead to wrong phylogenetic reconstructions if they are not recognized, because groups whose characteristics evolved independently are then combined.
A homoplasy can be (Wägele, 2005):
• a real homology on an incorrect phylogenetic tree, on which the character occurs as an apparent analogy
• an apparent reversal, which is a real plesiomorphy on the wrong topology
• a real analogy, convergence, or parallelism that evolved through independent events and is mapped on a correct phylogenetic tree. Due to its lack of complex structure it cannot be distinguished from a homology and has been coded as a homology.
• an analogy that originated from back mutations (reversals), is recorded on the correct phylogenetic tree, and cannot be distinguished from a homology.
However, an incorrect hypothesis of analogy that is based on an erroneous interpretation of a homology will not have the distribution of a homoplasy, because each single character will be coded with a different number.
At the molecular level, homoplastic characters include multiple substitutions (back and parallel substitutions). Such sites are most often formally informative, but their information can no longer be used correctly; the third codon position of protein-coding genes is a typical example.
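The special role of the third codon position can be made concrete with a small count. The following Python sketch is only an illustration, not part of the methods developed in this work: the function name and the two short in-frame sequences are hypothetical, and the sequences are assumed to be aligned without gaps.

```python
def differences_by_codon_position(seq_a, seq_b):
    """Count mismatches per codon position (1st, 2nd, 3rd) between two
    aligned, gap-free, in-frame coding sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    counts = [0, 0, 0]
    for index, (a, b) in enumerate(zip(seq_a, seq_b)):
        if a != b:
            counts[index % 3] += 1  # index % 3 == 2 is the third codon position
    return counts


# Hypothetical fragments; all four differences are synonymous
# in the standard genetic code (Ala, Leu, Lys).
seq_a = "ATGGCTCTGAAA"
seq_b = "ATGGCCTTAAAG"

print(differences_by_codon_position(seq_a, seq_b))  # [1, 0, 3]
```

In this toy example three of the four differences fall on third positions. Over large evolutionary distances such positions are likely to have changed more than once, so the observed differences underestimate the real number of substitutions.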
The task of biologists is to identify phylogenetically relevant data (homology, apomorphy, and/or plesiomorphy) and to separate them from random similarities, such as homoplastic characters. Note in this context that homology, apomorphy, and plesiomorphy are important concepts in cladistics. Each of these terms denotes a hypothesis that explains an observed fact in the best possible way. In practice, such hypotheses are often wrong and may even support incompatible groups.
In many cases, large evolutionary distances imply a large number of homoplastic sites. As most
protein-coding genes show dramatic variation in substitution rates that are correlated across the
sequence, this often leads to a patchwork pattern of phylogenetically informative and effectively
randomized regions. Furthermore, in highly variable regions, alignment errors accumulate, re-
sulting in sometimes misleading signals in phylogenetic reconstruction.
In the sense of parsimony, homoplastic sites cause additional steps in tree reconstruction, each of which corresponds to an additional ad hoc hypothesis. The aim is therefore to minimize the number of these ad hoc hypotheses as well.
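How a homoplastic site adds steps can be sketched with Fitch's small-parsimony algorithm, which counts the minimum number of state changes of one character on a fixed tree. The Python code below is a minimal illustration under stated assumptions: the tree is rooted and binary, and the four-taxon tree, the taxon names, and the two alignment columns are hypothetical.

```python
def fitch_steps(tree, states):
    """Minimum number of state changes for one character on a rooted binary tree.

    tree   -- nested 2-tuples with leaf names (str) at the tips
    states -- dict mapping leaf name -> observed character state
    """
    steps = 0

    def visit(node):
        nonlocal steps
        if isinstance(node, str):      # leaf: state set is the observed state
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:                     # children share a state: no change needed
            return common
        steps += 1                     # no shared state: one additional step
        return left | right

    visit(tree)
    return steps


# Hypothetical tree ((A,B),(C,D)) and two alignment columns.
tree = (("A", "B"), ("C", "D"))
compatible = {"A": "G", "B": "G", "C": "T", "D": "T"}   # supports the tree
homoplastic = {"A": "G", "B": "T", "C": "G", "D": "T"}  # conflicts with the tree

print(fitch_steps(tree, compatible))   # 1
print(fitch_steps(tree, homoplastic))  # 2
```

The compatible column needs a single change, whereas the conflicting column needs two on the same tree. The extra step is the additional ad hoc hypothesis that a parsimony analysis tries to avoid.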
Given the overwhelming amount of molecular data, the exponential increase in new data (currently approx. 30 new mitochondrial genomes per month are deposited in NCBI GenBank), and the complexity of the information they contain, it is indispensable to include techniques from computer science, bioinformatics, and mathematics in phylogenetic analysis and reconstruction. One main reason for the enormous accumulation of sequences from countless genes and species in several gene banks is the accelerated development of molecular methods, which creates the basis for modern phylogenetic analyses. Molecular phylogeny thus has a large overlap with bioinformatics, which is needed to develop tools that detect, extract, and assess phylogenetic information in molecular data. Interdisciplinary interactions open further possibilities for a more realistic and careful treatment of molecular data. The inclusion of fast algorithms and the combination of techniques from different disciplines, for example, allow the handling of huge data sets and the analysis of complex molecular data, such as the third codon position of protein-coding genes or sequence positions with multiple substitutions (homoplastic sites).
At present, a considerable number of important, so-called 'standard methods' exist, which form the basis for a sound analysis or reconstruction. These methods are implemented in numerous programs, which are essential to, e.g., align sequences, sample evolutionary rates, compute likelihoods, test evolutionary models, or reconstruct phylogenetic relationships. Because the question concerns these relationships, the results of a molecular phylogenetic analysis are most often visualized in the form of phylogenetic trees (cladograms, dendrograms, phylograms, tree graphs) or network graphs. With the advent of completely sequenced genomes these approaches are complemented by genome-wide comparisons of gene content (Fitz-Gibbon and House, 1999; Snel et al., 1999), gene order (Boore and Brown, 1998; Coenye and Vandamme, 2003), or composition measures (Qi et al., 2004).
During the last few years, I was able to gain experience in this field. My dissertation includes several studies based on such standard methods, including different alignment tools, analysis programs, and reconstruction methods such as Neighbor Joining, Maximum Parsimony, Maximum Likelihood, and Bayesian analysis. These studies gave me insight into the strengths and weaknesses of phylogenetic analyses. This work resulted in a number of publications, three of which are presented in the following.
Analysis of Andes frogs (Phrynopus, Leptodactylidae, Anura) phylogeny based on 12S and
16S mitochondrial rDNA sequences (Lehr et al., 2005)
South American leptodactylid frogs of the genus Phrynopus occur in cloud-forest, paramo, subparamo and puna habitats (1000 - 4400 m elevation) from Colombia to Bolivia. In 2005, there were 34 described species; however, many additional species new to science have been reported from Colombia, Peru, and Bolivia. The phylogeny of the species-diverse Phrynopus is unknown and the position of the genus within Leptodactylidae is poorly understood. We presented the results of a phylogenetic study based on 12S and 16S mitochondrial rDNA (see Figure 1.1). Fifteen species of Phrynopus from Bolivia to Ecuador are included, along with several other genera of Leptodactylidae and representatives of other frog families. Our results indicate that Phrynopus is phylogenetically nested within Eleutherodactylus, whereas Phyllonastes is phylogenetically nested within Phrynopus. Based on the recovered phylogeny, we transfer Phrynopus simonsii to Eleutherodactylus, and show that Phrynopus carpish needs to be removed from Phrynopus.
From terrestrial to aquatic habitats and back again - molecular insights into the evolution
and phylogeny of Hydrophiloidea (Coleoptera) using multigene analyses (Bernhard et al.,
2006)
The phylogenetic relationships within Hydrophiloidea have been a matter of controversial discussion for many years, and the supposedly repeated changes between aquatic and terrestrial lifestyles are not well understood. In order to address these issues we used an extensive molecular data set comprising sequences from six nuclear and mitochondrial genes. The analyses of the entire data set resulted in largely congruent tree topologies concerning the main branches (see Figure 1.2), independent of the analytical procedure. However, only Bayesian analyses yielded sufficiently high posterior probabilities, whereas bootstrap support values for most nodes were generally low. Our results are only partially congruent with hypotheses based on morphological analyses. Spercheidae were placed as the sister group of the remaining hydrophiloid subgroups. Hydrophiloidea excluding Spercheidae split into two clades: the 'helophorid lineage' comprising the small groups Epimetopidae, Hydrochidae, Georissidae, and Helophoridae, and the largest family, Hydrophilidae. Within Hydrophilidae, Hydrophilinae do not form a monophylum. The predominantly terrestrial Sphaeridiinae were placed as a subordinate clade within this subfamily. Furthermore, our data suggest a single origin of the aquatic lifestyle in Hydrophiloidea, with numerous secondary changes to terrestrial habits and tertiary changes to aquatic habitats within Sphaeridiinae.
Figure 1.1: Maximum likelihood tree topology for the pruned dataset with Pleurodema marmoratum (Leptodactylidae) as outgroup. Numbers at the nodes separated by a slash are, from left to right: bootstrap values for ML (TBR swapping algorithm), bootstrap values of 10,000 replicates for NJ, posterior probabilities for MrBayes, and bootstrap values for MP (10,000 replicates). A dash indicates that the branch was not found in the ML topology. Numbers in parentheses are numbers of individuals of the species being used, if more than one. Syntopic Phrynopus are connected with an arrow-headed line (Lehr et al., 2005).
Figure 1.2: Bayesian tree of the whole data set (SSU rDNA, LSU rDNA, 16S rDNA, COI, COII). Four species of the Histeroidea (Histeridae and Sphaeritidae) were chosen as outgroups. The number at each node refers to posterior probabilities. Habitats of adults (first letter) and larvae (second letter) are given in brackets. The assumed transitions in lifestyle are marked with arrows. A = aquatic, S = semiaquatic, T = terrestrial (Bernhard et al., 2006).
The complete mitochondrial genome of the Green Lizard Lacerta viridis viridis (Reptilia:
Lacertidae) and its phylogenetic position within squamate reptiles (Boehme et al., 2007)
For the first time, the complete mitochondrial genome was sequenced for a member of Lacertidae. Lacerta viridis viridis was sequenced in order to compare the phylogenetic relationships of this family to other reptilian lineages. Using the long polymerase chain reaction (long PCR) we characterized a mitochondrial genome of 17,156 bp that shows a typical vertebrate pattern with 13 protein-coding genes, 22 transfer RNAs (tRNA), two ribosomal RNAs (rRNA) and one major noncoding region. The noncoding region of L. v. viridis was characterized by a conspicuous 35 bp tandem repeat at its 5' terminus. A phylogenetic study including all currently available squamate mitochondrial sequences revealed the position of Lacertidae within a monophyletic squamate group. We obtained a close relationship of Lacertidae to Scincidae, Iguanidae, Varanidae, Anguidae, and Cordylidae. Although the internal relationships within this group yielded only a weak resolution and low bootstrap support, the revealed relationships were more congruent with morphological studies than with recent molecular analyses (Townsend and Larson, 2002; Townsend et al., 2004; Lee, 2005; Vidal and Hedges, 2005; Kumazawa, 2007).
In the ideal case, all methods applied to a data set of informative characters should result in one tree topology. However, data sets most often include conflicts, that is, contradictory characters that cannot be explained by inheritance. These conflicts in molecular data are summarized as homoplastic signals (see above), because it is difficult to distinguish between random sites, convergences, and predisposition.
A large part of the literature in molecular phylogeny is based upon the analysis of nucleic acid and/or amino acid sequences of individual genes or groups of genes. Such sequence-based methods for phylogenetic reconstruction are notoriously plagued by two effects: homoplastic sites and alignment errors. Of course, it is no problem to align sequences by hand or by eye if they are short, similar, and/or based on constant regions. However, this becomes very difficult or impossible for long and/or variable sequences (i.e. large evolutionary distances).
The Achilles' heel of a careful investigation is therefore the data set in its earliest phase. For example, the reconstruction of deep metazoan phylogeny is a challenge. The selection of suitable markers to resolve patterns of divergence is often one of the hardest steps. Mitochondrial genomes are an example of such markers. They represent an interesting system, because mitochondria have small complete genomes, which include several genes with different evolutionary rates and various degrees of conservation. Furthermore, the gene order of these compact genomes yields a particularly fruitful data set for phylogenetic reconstruction (Watterson et al., 1982; Sankoff et al., 1992). Changes of the gene order are due to gene rearrangements and can differ from species to species. In the literature, several genome rearrangement events are described, such as inversions (Dobzhansky and Sturtevant, 1938), transpositions, reverse transpositions, and so-called tandem duplication random loss (TDRL) events (Moritz and Brown, 1986, 1987; Moritz et al., 1987).
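The four rearrangement events named above can be sketched as operations on a signed gene order, in which each gene is an integer and the sign marks the coding strand. The Python code below is a simplified illustration and not the implementation developed later in this work. Function names and indices are hypothetical, the circular genome is treated as linear, and the TDRL is assumed to duplicate the whole gene order.

```python
def inversion(order, i, j):
    """Reverse the segment order[i:j] and flip the strand (sign) of each gene."""
    return order[:i] + [-g for g in reversed(order[i:j])] + order[j:]


def transposition(order, i, j, k):
    """Cut the segment order[i:j] and re-insert it before original index k (k >= j)."""
    return order[:i] + order[j:k] + order[i:j] + order[k:]


def reverse_transposition(order, i, j, k):
    """Like a transposition, but the moved segment is also inverted."""
    moved = [-g for g in reversed(order[i:j])]
    return order[:i] + order[j:k] + moved + order[k:]


def tdrl(order, keep_in_first_copy):
    """Tandem duplication random loss on the whole gene order.

    The order is duplicated in tandem and one copy of every gene is lost:
    genes in keep_in_first_copy survive in the first copy, all others
    survive in the second copy.
    """
    first = [g for g in order if g in keep_in_first_copy]
    second = [g for g in order if g not in keep_in_first_copy]
    return first + second


genome = [1, 2, 3, 4, 5, 6]
print(inversion(genome, 1, 4))                 # [1, -4, -3, -2, 5, 6]
print(transposition(genome, 1, 3, 5))          # [1, 4, 5, 2, 3, 6]
print(reverse_transposition(genome, 1, 3, 5))  # [1, 4, 5, -3, -2, 6]
print(tdrl(genome, {1, 3, 5}))                 # [1, 3, 5, 2, 4, 6]
```

Note that a single TDRL can reorder many genes at once without changing any strand, which is why it cannot be replaced by a single inversion or transposition in general.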
This dissertation is based on the assumption that molecular data contain more information than has been used up to now, and it follows two main approaches.
The first approach deals with homoplastic sites, which pose a fundamental problem in phylogenetic reconstruction. One question arises: Is it possible to detect and to quantify informative positions in difficult regions by comparing molecular data? The reproducibility of the information content of alignment positions is of particular interest in this case.
The second approach looks at the information content of the gene order of mitochondrial genomes and at the possibilities to use it for phylogenetic reconstruction. Current tools are affected by two problems. First, it is hardly possible to compare two or more gene orders of different lengths. Secondly, there are no tools that include all known rearrangement scenarios, especially the tandem duplication random loss event. Therefore, one aim of this study is to reconstruct phylogenetic relationships from gene order information alone.
CHAPTER 2
Mitochondria
Mitochondria are small, rod-shaped organelles surrounded by two highly specialized concentric membranes. They resemble bacteria in their overall size and shape. The mitochondrial inner membrane is folded inward and forms the sites of aerobic respiration, generally the major energy production center in eukaryotes. Animal mitochondrial genomes are normally circular, about 16 kb in length, and encode thirteen proteins, two ribosomal RNAs (rRNA) and twenty-two transfer RNAs (tRNA), all of which are essential, because they are necessary for oxidative phosphorylation and the production of cellular ATP by the mitochondria.
2.1 History
In the years between 1850 and 1880 several cytologists independently observed granular or threadlike components of the cytoplasm that we now recognize as mitochondria. Historically, this structure has been given more than a dozen names, among them blepharoblasts, chondriokonts, chondriomites, chondrioplasts, chondriosomes, chondriospheres, fila, fuchsinophilic granules, Körner, Fadenkörper, mitogel, parabasal bodies, plasmasomes, plastochondria, plastosomes, vermicules, sarcosomes, interstitial bodies, bioblasts, and so on.
In 1854 the Swiss physiologist and histologist Rudolf Albert Ritter von Kölliker (1817-1905) was the first to observe small threadlike structures (granules) in the cytoplasm of striated muscle cells of insects, and he reached the conclusion that the cells had a membrane. These granules, which Retzius later called sarcosomes in 1890, were at first thought to be present only in muscle, but today we recognize the sarcosomes as the mitochondria of muscle cells. Some years later Flemming (1882) characterized filaments in the cytoplasm of other cell types.
In 1890 Altmann described a method of staining these structures with fuchsin that made it possible to demonstrate their occurrence in nearly all types of cells. In the same year he proposed that these granules were autonomous, elementary living units which form bacteria-like colonies in the cytoplasm of the host cell. In 1912 and 1913, B.F. Kingsbury and O. Warburg found that these granular, insoluble subcellular structures were associated with respiration, and this function challenged the theory of their role in genetics (Scheffler, 1999).
Because of their similarity to bacteria and their presumed capacity for independent existence, Altmann called them 'elementary living particles' or 'bioblasts'. It was Benda (1897) who coined the term mitochondria, which is derived from the Greek mitos = Faden (German) or thread (English) and chondros = Korn (German) or grain (English). Benda made valuable observations on their form and distribution in preparations stained with alizarin and crystal violet.
In 1914 Lewis, and seven years later Strangeways and Canti, observed that these rod-shaped organelles were highly plastic structures which continuously executed slow sinuous movements and sometimes underwent marked changes of shape. Over the years, the different staining reactions gave rise to much conjecture about the probable chemical nature and function of the mitochondria. However, more information had to await their isolation from cells in a quantity sufficient for chemical analysis. In 1934 Bensley and Hoerr presented the first attempts with some success. They separated mitochondria by differential centrifugation of cell homogenates. The method was further perfected by Claude in the early 1940s and by Hogeboom, Schneider, and Palade in 1948.
In the following years many researchers, such as Kennedy and Lehninger in 1949 or Green and his associates, carried out intensive studies of isolated mitochondria and demonstrated that they are the principal site of the oxidative reactions by which the energy in foodstuffs is made available for cell metabolism. The development of improved methods of fixation and thin sectioning for electron microscopy in the next years allowed a more detailed view of the structure of mitochondria. Based on these methods, Palade and Sjöstrand independently described the basic structural plan of the internal membranes in 1953 and coined the term cristae. In the next years, the focus changed to analyses of the biochemical functions and pathways of the inner membrane, the cristae, and the matrix, e.g. Racker (1976).
In 1963 the first definite identification of DNA in mitochondria was made (Nass and Nass, 1963). Later in the 1970s, the complete mitochondrial DNA of a mammal was successfully sequenced in Cambridge (Anderson et al., 1982). Almost simultaneously, Attardi (1981) and his coworkers at the California Institute of Technology determined the nature of the transcripts derived from this genome and set the stage for the identification of all the genes encoded by mammalian mtDNA.
Mitochondria occupy a central position in the understanding of the cell, the 'basic unit of life'. The study of mitochondria has provided not only a detailed view but also fundamental insights covering the entire spectrum from biophysics to cell biology and genetics.
Today, one of several frontiers is the integration of mitochondria into the cell and their distribution in the cytosol by means of their interaction with the cytoskeleton, especially in various highly differentiated cells. The overwhelming part of the literature in molecular phylogeny is based upon the analysis of nucleic acid and/or amino acid sequences of individual genes or groups of genes. With the advent of completely sequenced genomes these approaches are complemented by genome-wide comparisons of gene content (Fitz-Gibbon and House, 1999; Snel et al., 1999), gene order (Boore and Brown, 1998; Coenye and Vandamme, 2003), or composition measures (Qi et al., 2004).
2.2 The Origin of Mitochondria
2.2.1 The Serial Endosymbiosis Theory
At the 51st 'Versammlung Deutscher Naturforscher und Ärzte' (Assembly of German Naturalists and Physicians) in Kassel in 1878, Heinrich Anton de Bary (1831-1888) suggested, based on his work with lichens, using the term symbiosis for a close relationship between two species. In 1883 the German botanist Andreas Franz Wilhelm Schimper (1856-1901) observed that the division of chloroplasts in green plants closely resembled that of free-living cyanobacteria and proposed (in a footnote) that green plants had arisen from a symbiotic union of two organisms (Schimper, 1883). In 1890 R. Altmann (Altmann, 1890) noticed that the 'granular bodies' (mitochondria) in the cytoplasm of plant and animal cells display the staining properties of free-living microbes. Based on his work, Schimper may have been the trailblazer of the Endosymbiotic Theory, which was first articulated by the Russian botanist Konstantin Sergejewitsch Mereschkowski (1855-1921) in his 1905 work 'Über Natur und Ursprung der Chromatophoren im Pflanzenreiche' (On the nature and origin of chromatophores in the plant kingdom) (Mereschkowsky, 1905).
In the 1920s Ivan Emanuel Wallin (1883-1969) extended the idea of an endosymbiotic origin to mitochondria (Wallin, 1923). All these theories were initially dismissed or ignored over the following years. Their resurrection in the 1960s was based essentially on more detailed electron microscopic comparisons between cyanobacteria and chloroplasts (Ris and Singh, 1961) in combination with the discovery that plastids and mitochondria contain their own DNA (Stocking and Gifford, 1959). The so-called Serial Endosymbiotic Theory (SET) was fleshed out and popularized by Lynn Margulis (Margulis, 1970, 1981). In her 1981 work 'Symbiosis in Cell Evolution' she argued that eukaryotic cells originated as communities of interacting entities, including endosymbiotic spirochaetes that developed into eukaryotic flagella and cilia. This last idea has not received much acceptance, since flagella lack DNA and do not show ultrastructural similarities to prokaryotes.
This theory faces several objections. Neither mitochondria nor plastids can survive outside the cell, having lost many essential genes required for survival. This objection is easily accounted for by considering the long timespan over which mitochondria and plastids have coexisted with their hosts: genes and systems which were no longer necessary were simply deleted or, in many cases, transferred into the host genome instead. In fact these transfers constitute an important way for the host cell to regulate plastid or mitochondrial activity.
2.2.2 The Episome Theory
In 1972, R.A. Raff and H.R. Mahler presented a large body of evidence and assumed that mitochondria developed from proto-mitochondria, which derived from the proto-eukaryote inner membrane and contained genes for (mt) ribosomal components, tRNAs and several elements of the respiratory chain. Borst (1972) formulated an Episome Theory and supposed that the DNA of mitochondria left the nuclear DNA by a kind of amplification and became enclosed within a membrane containing the respiratory chain (Bhamrah and Juneja, 2002).
The Episome Theory leads to several problems. No assumption is made with respect to the question of whether the hypothetical episome was composed of prokaryote-like or eukaryote-like genetic material. Another problem is that the genes postulated to be present on the episome were probably located at a number of different sites on the proto-eukaryote genome, which would imply multiple successive insertions and deletions.
The main experimental approach used in biochemical studies claiming to shed light on the origin of mitochondria is the study of similarities among homologous components of bacteria and mitochondria. However, analysis of these results shows that it is impossible to decide between the Episome Theory and the Endosymbiosis Theories in this way (Reijnders, 1975).
CHAPTER 2. MITOCHONDRIA 2.3. THE MITOCHONDRIAL DNA
2.2.3 The Hydrogen Hypothesis
In 1998, William Martin and Miklos Muller published the Hydrogen Hypothesis in Nature
(Martin and Muller, 1998). They presented a new hypothesis for the origin of eukaryotic cells,
based on the comparative biochemistry of energy metabolism.
In contrast to the Serial Endosymbiosis Theory, which postulates the phagocytosis of an
α-proteobacterium or a cyanobacterium by a first primitive cell, the hydrogen hypothesis proposes a
different route to symbiosis and includes possible metabolic pathways. The authors argued that
the host (a methanogenic archaebacterium that consumed hydrogen and carbon dioxide and
produced methane) and a facultatively anaerobic eubacterium (the possible future mitochondrion,
which produced hydrogen and carbon dioxide as byproducts of anaerobic respiration)
started a symbiotic relationship based on the host’s hydrogen dependence (anaerobic syntrophy).
If correct, this hypothesis would imply that eukaryotes are chimeras with both archaebacterial
and eubacterial ancestry and that eukaryotes appeared later in evolution than prokaryotes.
Furthermore, the hydrogen hypothesis predicts that no primitively mitochondrion-lacking
eukaryotes ever existed.
2.3 The Mitochondrial DNA
2.3.1 The Function
Mitochondrial genes are involved in at least five basic processes: respiration and/or oxidative
phosphorylation, translation, and occasionally also transcription, RNA maturation and
protein import. The leading roles of mitochondria are the production of ATP and the regulation of
cellular metabolism (Voet et al., 2006). The central set of reactions involved in ATP production
is collectively known as the citric acid cycle. However, the mitochondrion has many other
metabolic tasks, such as:
• Regulation of the membrane potential (Voet et al., 2006)
• Apoptosis (programmed cell death) (Green, 1998)
• Glutamate-mediated excitotoxic neuronal injury (Scanlon and Reynolds, 1998)
• Cellular proliferation regulation (McBride et al., 2006)
• Regulation of cellular metabolism (McBride et al., 2006)
• Certain heme synthesis reactions (Oh-hama, 1997)
• Steroid synthesis (Rossier, 2006)
In general, the number of mitochondria and the complexity of their internal structure vary
with the energy requirements of the specific functions carried out by the cell. In cells that are
relatively inactive, the mitochondria tend to be few and their internal structure simple. On the
other hand, cells engaged in active transport, in the synthesis of fat from carbohydrate, or in the
conversion of chemical energy to mechanical work usually have large numbers of mitochondria
that contain a profusion of cristae.
2.3.2 The Genome
The genetic material of mitochondria, called mitochondrial DNA or the mitochondrial genome,
is similar in structure to prokaryotic genetic material. The mitochondrial chromosome
is, with rare exceptions (Fukuhara et al., 1993), a circular DNA molecule, which is much smaller
than prokaryotic chromosomes and, in contrast to them, exists in several copies.
Apart from some exceptions, such as among the gymnosperms, in which some families inherit
mitochondria or chloroplasts paternally, the mitochondria of a sexually reproducing species are
inherited maternally. The human mitochondrial genome consists of 16,571 base pairs, which
encode only 13 proteins, 22 tRNAs, and 2 rRNAs (Anderson et al., 1981). In humans, and
probably in metazoans in general, 100 to 10,000 separate copies of mitochondrial DNA are usually
present per cell (egg and sperm cells are exceptions).
The size of known mitochondrial DNA in most eukaryotic phyla ranges from 11 to 60 kbp;
however, there are some unusual exceptions. Among the organisms whose mt DNA has been
completely sequenced, two extreme outliers are known. The first is the apicomplexan protist
Plasmodium sp., with a minuscule mitochondrial genome of 6 kbp (Feagin, 1992), and the second
is rice (Oryza sativa), whose mt DNA at 490 kbp (Notsu et al., 2002) is about 80 times larger
than that of Plasmodium sp. In mammals, each circular mitochondrial DNA molecule consists
of 15,000 to 17,000 base pairs, which encode the same 37 genes: 13 for proteins (polypeptides),
22 for transfer RNA (tRNA) and one each for the small and large subunits of ribosomal RNA
(rRNA).
This pattern is also seen among most metazoans, although in some cases one or more of the 37
genes is absent and the mt DNA size range is greater. As of June 2008, approximately 1,189
completely sequenced mitochondrial genomes of Metazoa were available at NCBI GenBank (Maglott
et al., 2005). The shortest sequenced genome, that of Paraspadella gotoi (Acc: NC_006083,
Helfenbein et al. (2004)), comprises 11,423 bp and the longest, that of Trichoplax adhaerens
(Acc: NC_008151, Dellaporta et al. (2006)), comprises 43,079 bp. The average over all lengths
is 16,652 bp.
Figure 2.1: Comparison of the compactness of (a) Homo sapiens (NC_001807) (Anderson et al., 1981) and (b) Saccharomyces cerevisiae (NC_001224)
Compared to the nuclear genome, the mitochondrial genome possesses some very interesting
features:
• Except in ciliates, all the genes are carried on a single circular DNA molecule.
• The genetic material is not bounded by a nuclear envelope.
• The DNA is not packed into chromatin.
• The genome contains little non-coding DNA (’junk’ DNA, or introns).
• Some codons do not follow the universal rules in translation. Instead they resemble those of purple
non-sulfur bacteria.
• Some bases are considered to be part of two different genes: both as the last base of one gene and
as the first base of the next gene.
Mitochondrial genes are transcribed as multigenic transcripts, which are cleaved and
polyadenylated to yield mature mRNAs. The proteins that are necessary for mitochondrial function
are not encoded by the mitochondrial genome alone. Most of them are encoded by genes in the cell
nucleus and imported into the mitochondrion (Anderson et al., 1981). The exact number of genes
encoded by the nucleus and by the mitochondrial genome differs between species.
2.3.3 The Replication
The replication of mitochondrial DNA is self-regulated in response to the energy demand of the
cell and consequently not linked to the cell cycle. At cell division, mitochondria are distributed
to the daughter cells essentially randomly during the division of the cytoplasm. Mitochondria
divide by binary fission similar to bacterial cell division; unlike bacteria, however, mitochondria
can also fuse with other mitochondria (Chan, 2006; Hermann et al., 1998).
Mitochondria replicate much like bacterial cells. The regulation of plasmids differs considerably
from the regulation of chromosomal replication. However, the machinery involved in the repli-
cation of plasmids is similar to that of chromosomal replication. D-loop replication is a process
by which chloroplasts and mitochondria replicate their genetic material.
In many organisms, one strand of the DNA in the mitochondrion or plastid comprises heavier
nucleotides (relatively more purines: adenine and guanine). This strand is called the H (heavy)
strand, in contrast to the L (light) strand, which comprises lighter nucleotides (pyrimidines:
thymine and cytosine). Replication begins with the heavy strand, starting at the D-loop (also
known as the control region), which also includes the replication origin. This origin opens, and
the heavy strand is replicated in one direction. After heavy strand replication has continued for
some time, a new light strand is also synthesized, through the opening of another origin of
replication. When diagrammed, the resulting structure looks like the letter D. The D-loop region
does not code for any genes; it is free to vary, with only a few selective limitations on size and
heavy/light strand factors.
2.3.4 The Structure
As described before, mitochondria are small, rod-shaped organelles surrounded by two highly
specialized concentric membranes, an inner membrane and an outer membrane, composed of
phospholipid bilayers and proteins (Alberts et al., 1994). However, the two membranes have different
properties.
Based on this double-membraned organization, there are five distinct compartments within the
mitochondrion: the outer mitochondrial membrane, the intermembrane space (the space
between the outer and inner membranes), the inner mitochondrial membrane, the cristae space
(formed by infoldings of the inner membrane), and the matrix (the space within the inner membrane).
Figure 2.2: (a) Mitochondria, minute sausage-shaped structures found in the hyaloplasm (clear cytoplasm) of the cell, are responsible for energy production. Mitochondria contain enzymes that help to convert food material into adenosine triphosphate (ATP), which can be used directly by the cell as an energy source. Microsoft® Encarta® Encyclopedia 2001. © 1993-2000 Microsoft Corporation. All rights reserved. (b) Mitochondria. Courtesy of Dr. Henry Jakubowski
The Outer Mitochondrial Membrane
The outer mitochondrial membrane, which encloses the entire organelle, has a protein-to-phospholipid
ratio similar to that of the eukaryotic plasma membrane (about 1:1 by weight). It contains numerous
integral proteins called porins. These porins form a relatively large internal channel (about 2-3
nm) that is permeable to molecules of 5000 daltons or less.
Larger molecules can cross the membrane by active transport if they carry a signaling sequence at
the N-terminus that binds to a large multisubunit protein called translocase of the outer membrane.
Disruption of the outer membrane permits proteins in the intermembrane space to leak into the
cytosol, leading to certain cell death (Chipuk et al., 2006). Furthermore, the outer membrane
contains enzymes involved in such diverse activities as the elongation of fatty acids, the oxidation of
epinephrine (adrenaline), and the degradation of tryptophan.
The Intermembrane Space
The space between the outer membrane and the inner membrane is called the intermembrane space.
Because the outer membrane is freely permeable to small molecules, the concentrations of small
molecules such as ions and sugars in the intermembrane space are the same as in the cytosol
(Alberts et al., 1994). In contrast, the composition of large proteins differs between the
intermembrane space and the cytosol. For example, one protein that is localized
to the intermembrane space is cytochrome c (Chipuk et al., 2006).
The Inner Mitochondrial Membrane
The inner mitochondrial membrane is folded inward and forms internal compartments known as
cristae. These compartments provide more space for proteins such as cytochromes to function
properly and efficiently. Furthermore, the inner mitochondrial membrane includes transport
proteins that move metabolites across this membrane in a highly controlled manner. The electron
transport chain is also located on the inner membrane of the mitochondria.
The inner mitochondrial membrane contains proteins with four types of functions (Alberts et al.,
1994):
• performance of the redox reactions of oxidative phosphorylation
• ATP synthase, which generates ATP in the matrix
• specific transport proteins that regulate metabolite passage
• protein import machinery.
The inner membrane of the mitochondria contains more than 100 different polypeptides and has
a very high protein-to-phospholipid ratio (more than 3:1 by weight, which is about 1 protein for
15 phospholipids). It is home to around 1/5 of the total protein in a mitochondrion (Alberts et al.,
1994).
In contrast to the outer membrane, the inner membrane lacks porins and is highly impermeable
to all molecules. To enter or exit the matrix, almost all ions and molecules require special
membrane transporters.
Another interesting fact is that the inner membrane of mitochondria is similar in lipid composition
to the membrane of prokaryotes, which lends support to theories like the endosymbiotic
theory (see above).
The Cristae
The inward folding of the inner mitochondrial membrane leads to numerous compartments
called cristae. These expand the surface area of the inner mitochondrial membrane and
increase its capacity to produce ATP. They are not simple random folds but rather invaginations
of the inner membrane, which can affect overall chemiosmotic function (Mannella, 2006).
For example, the surface area of the inner membrane, including cristae, in typical liver
mitochondria is about five times larger than that of the outer membrane. Mitochondria of cells
that have a greater demand for ATP, such as muscle cells, contain more cristae than typical
liver mitochondria (Alberts et al., 1994).
The Matrix
The space enclosed by the inner membrane is called the matrix. It contains about 2/3 of the
total protein in a mitochondrion (Alberts et al., 1994). The matrix holds a highly concentrated
mixture of hundreds of enzymes, special mitochondrial ribosomes, tRNA, and several copies
of the mitochondrial DNA genome. The major functions of the enzymes include the oxidation of
pyruvate and fatty acids, and the citric acid cycle (Alberts et al., 1994). The matrix is therefore
important in the production of ATP with the aid of the ATP synthase.
2.3.5 Mitochondrial Inheritance
In most multicellular organisms, mitochondrial DNA is inherited from the mother (maternal
inheritance). One reason is the excess of mitochondrial DNA molecules in the egg, which
contains 100,000 to 1,000,000 copies, in contrast to the sperm with only 100 to 1,000.
Further reasons are the degradation of sperm mt DNA in the fertilized egg and, at least in a few
organisms, the failure of sperm mitochondrial DNA to enter the egg.
In mammals, 99.99% of mitochondrial DNA (mt DNA) is inherited from the mother. Whatever
the mechanism, this single-parent (uniparental) pattern of mitochondrial DNA inheritance is
found in most animals, most plants, and in fungi as well.
CHAPTER 3
Noisy
3.1 Misleading Sites
As mentioned above, homoplastic sites are frequent and pose a problem for phylogenetic
reconstruction. A good data basis is therefore important, which means, for example, a
high-quality alignment. The columns of such an alignment can be classified based on their
character structure (see Figure 3.1).
In the parsimony framework, a distinction is made between parsimony informative sites and
parsimony non-informative sites. The parsimony non-informative sites include sites with constant
(i.e. equal) nucleotides and sites whose only variation consists of unique nucleotides (singletons).
In most cases of phylogenetic reconstruction, singleton sites only lengthen a branch. A site is
informative only when there are at least two different kinds of nucleotides at the site, each of which
is represented in at least two of the sequences under study. These informative positions are the
sites that are used to reconstruct phylogenetic trees. Note that Maximum Parsimony belongs
to a class of character-based tree estimation methods which use a matrix of discrete phylogenetic
characters to infer one or more optimal phylogenetic trees for a set of taxa. Other methods, like
Maximum Likelihood, include non-informative sites in the estimation too.
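The site classes described above can be illustrated with a minimal Python sketch. The function name classify_site and the representation of a column as a string with one character per taxon are assumptions made for this illustration only; gaps and missing data are not treated specially here.

```python
from collections import Counter

def classify_site(column):
    """Classify one alignment column (one character per taxon).

    Hypothetical helper for illustration; gaps and ambiguity codes are
    treated like ordinary character states.
    """
    counts = Counter(column)
    if len(counts) == 1:
        return "constant"
    # Character states that are present in at least two sequences.
    shared = [state for state, n in counts.items() if n >= 2]
    if len(shared) >= 2:
        return "informative"
    # Variable, but every deviating state is unique to a single sequence.
    return "singleton"

columns = ["AAAAA", "AAAAC", "AACCG", "AACCC"]
labels = [classify_site(c) for c in columns]
```

Here the third and fourth columns are parsimony informative because both A and C occur in at least two sequences, whereas the second column only carries a single deviating nucleotide.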
Apart from that, it is essential to classify the information content of informative sites in order
to minimize the number of homoplastic and/or randomized sites for a stable reconstruction.
The method I present in this thesis allows the identification of phylogenetically uninformative
homoplastic columns in a multiple sequence alignment, based on assessing the distribution of
character states along a cyclic ordering of taxa. Removal of these columns improves the
performance of phylogenetic reconstruction algorithms as measured by various indices of tree quality.
In particular, I obtain more stable trees due to the exclusion of alternative splits that arise solely
from randomized characters. The basic idea was conceived during a conversation with Prof.
Andreas Dress about a very difficult dataset that includes a lot of conflicting positions, which was
later published (Bernhard et al., 2006).
Figure 3.1: Differentiation between parsimony informative sites and parsimony non-informative sites.
3.2 Trees, Metrics, and Weighted Split Systems
Let $X$ denote a finite set of $n$ taxa. A split $S = A|\overline{A} = \overline{A}|A$ is a bipartition of the set $X$ of taxa,
i.e. a partition of $X$ into two disjoint, non-empty subsets $A$ and $\overline{A}$. Two such splits $A_1|\overline{A}_1$ and
$A_2|\overline{A}_2$ of $X$ are called compatible if one of the four intersections $A_1 \cap A_2$, $A_1 \cap \overline{A}_2$, $\overline{A}_1 \cap A_2$ and
$\overline{A}_1 \cap \overline{A}_2$ is empty. A split system is compatible if every pair of its splits is compatible.
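The compatibility condition can be checked directly from its definition. The following Python sketch assumes that a split is represented by one of its two sides (the other side is the complement in the taxon set); the function names are hypothetical.

```python
from itertools import combinations

def are_compatible(side1, side2, taxa):
    """True if at least one of the four pairwise intersections is empty.

    side1 and side2 each give one part A of a split A|complement(A);
    taxa is the full taxon set X. Strings work for single-letter taxa.
    """
    taxa = frozenset(taxa)
    a1, a2 = frozenset(side1), frozenset(side2)
    parts1 = (a1, taxa - a1)
    parts2 = (a2, taxa - a2)
    return any(not (p & q) for p in parts1 for q in parts2)

def is_compatible_system(sides, taxa):
    """A split system is compatible if every pair of its splits is."""
    return all(are_compatible(s, t, taxa) for s, t in combinations(sides, 2))

# AB|CDE and DE|ABC fit on one tree; AB|CDE and AC|BDE do not.
tree_like = is_compatible_system(["AB", "DE", "A"], "ABCDE")
conflicting = is_compatible_system(["AB", "AC"], "ABCDE")
```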
It is a well-known result that compatible split systems on $X$ are in 1:1 correspondence with the
so-called $X$-trees (Buneman, 1971), i.e. finite trees $T = (V, E)$ with vertex set $V$ and edge set
$E$ endowed with a map from $X$ into $V$ whose image contains (at least) all vertices of degree less
than $3$.
More specifically, this correspondence is given by:
(i) associating to any edge $e \in E$ of such a tree $T$ the bipartition $S_e$ of $X$ into those two
subsets of $X$ that are mapped into the (exactly) two distinct connected components of the
graph obtained from $T$ by deleting the edge $e$,
(ii) associating to $T$ the collection $\mathcal{S}(T) := \{S_e : e \in E\}$ of all such splits.
Associating a positive weight $\alpha_S$ to any such split $S = A|\overline{A}$ (e.g. the length of the edge $e$ in case
every edge in the tree is endowed with some predefined positive length and $S = S_e$ holds), one
can define the associated metric $d$ on $X$ by associating to any two taxa $x, y$ in $X$ the term
\[ d(x, y) := \sum_{S \in \mathcal{S}(T)} \alpha_S \, \delta_S(x, y) \tag{3.1} \]
where one puts, for any split $S = A|\overline{A} \in \mathcal{S}(T)$ and all $x, y \in X$, $\delta_S(x, y) := 0$ if $x, y \in A$
or $x, y \in \overline{A}$ holds, and $\delta_S(x, y) := 1$ otherwise (i.e. if $x$ and $y$ are separated by the split $S$),
implying that $d(x, y)$ is the total length of the unique path from (the image of) $x$ to (the image
of) $y$ relative to the given family of split weights $(\alpha_S)_{S \in \mathcal{S}(T)}$.
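Equation (3.1) translates almost literally into code. The sketch below is an illustration under the assumption that each weighted split is stored as a pair (side, weight), where side is one part of the split; the name split_metric and the example edge lengths are hypothetical.

```python
def split_metric(x, y, weighted_splits):
    """Sum the weights of all splits that separate taxa x and y (Eq. 3.1).

    weighted_splits: iterable of (side, weight); side is one part A of
    the split A|complement(A), so x and y are separated exactly when
    one of them lies in side and the other does not.
    """
    return sum(w for side, w in weighted_splits if (x in side) != (y in side))

# Tree ((A,B),(C,D)) with pendant edge lengths 1, 2, 3, 4 and an
# interior edge of length 5 corresponding to the split AB|CD.
splits = [({"A"}, 1), ({"B"}, 2), ({"C"}, 3), ({"D"}, 4), ({"A", "B"}, 5)]
d_ab = split_metric("A", "B", splits)  # path A-B uses the two pendant edges
d_ac = split_metric("A", "C", splits)  # path A-C also crosses the interior edge
```

In this example the distances coincide with the path lengths in the tree, as stated in the text.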
The goal is to detect homoplasy without determining a tree; thus it is necessary to admit more
general split systems. Circular split systems are a good option, which I introduce in the
following.
3.3 Noise Detection Using Circular Split Systems
A split system $\mathcal{S}$ is circular if the points in $X$ (i.e. the taxa) can be arranged on a circle such
that each split $S \in \mathcal{S}$ is induced by a division of that circle into two arcs by deleting two of its
(unlabeled) points. In this case, the circular ordering is said to represent the split system.
It is easy to verify that compatible split systems are circular (actually, every planar drawing of an
$X$-tree provides such a circular ordering), and that circular split systems are weakly compatible,
i.e. $A_1 \cap A_2 \cap A_3$, $A_1 \cap \overline{A}_2 \cap \overline{A}_3$, $\overline{A}_1 \cap A_2 \cap \overline{A}_3$ or $\overline{A}_1 \cap \overline{A}_2 \cap A_3$ is empty for any three splits
$A_1|\overline{A}_1$, $A_2|\overline{A}_2$, $A_3|\overline{A}_3$ in a circular split system, cf. Bandelt and Dress (1992). Any distance
constructed from a weighted circular split system is called a ’circular’ or Kalmanson metric;
see Figure 3.2.
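Whether a given circular ordering induces a split can be tested by checking that one side of the split occupies a contiguous arc of the circle. The following Python sketch is an illustration only; the function name and the list representation of the ordering are assumptions.

```python
def is_induced_by_ordering(side, ordering):
    """True if the split side|complement is induced by the circular ordering.

    The split is induced exactly when membership in `side` changes twice
    while walking once around the circle, i.e. `side` forms a single arc.
    """
    member = [taxon in side for taxon in ordering]
    # Index i - 1 wraps around to the last element for i = 0.
    changes = sum(member[i] != member[i - 1] for i in range(len(member)))
    return changes == 2

ordering = ["A", "B", "C", "D"]
ab_cd = is_induced_by_ordering({"A", "B"}, ordering)
ad_bc = is_induced_by_ordering({"A", "D"}, ordering)
ac_bd = is_induced_by_ordering({"A", "C"}, ordering)
```

For four taxa arranged as A, B, C, D, the splits AB|CD and AD|BC are induced by the ordering, whereas AC|BD is not.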
It has been observed that phylogenetic distance data are often circular or at most mildly
non-circular (Huson, 1998; Bandelt and Dress, 1992; Wetzel, 1995). Starting from a suitable distance
measure, we can construct a circular split system from an alignment without significantly
predetermining later tree constructions, since the circular split system still represents essentially
unfiltered data.
Figure 3.2: Summary of possible splits in a split system for 4 taxa (ABCD), and their possible relationship.
Circular split systems can be obtained in various ways. The computationally most straightforward
approach is the Neighbor-Net algorithm (Bryant and Moulton, 2004), which starts from a
distance matrix and computes the circular splits using an agglomerative procedure.
An alternative approach starts from weighted quartets. To this end, one first computes a weight
for each quartet, i.e. each pair of two pairs of taxa, $\{\{a, b\}, \{c, d\}\}$. This quartet weight is
interpreted as the support for the hypothesis that $\{a, b\}$ and $\{c, d\}$ are separated by an edge in the
correct phylogenetic tree. Quartet weights can be obtained in various ways. In the quartet-mapping
approach (Nieselt-Struwe and von Haeseler, 2001), for example, one starts with an alignment of
four sequences and defines the weight of a given quartet to be the fraction of alignment sites
(columns) in which $a = b \neq c = d$. One may modify this score by adding $1/2$ for every
additional column in which $a = b \neq c, d$ or $c = d \neq a, b$ holds. Quartet weights can also
be derived directly from distances (although, in this case, it seems preferable to use the faster
Neighbor-Net approach). A more sophisticated weighting scheme uses “expected branch
lengths”, i.e. the product of the posterior likelihood and the maximum likelihood branch length
of the interior edge of the corresponding quartet tree.
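The quartet-mapping weight described above can be sketched in a few lines of Python. This is an illustration, not the implementation used in QNet: the function name, the partial_credit switch, and the choice to divide the modified score by the alignment length as well are assumptions, and the half-credit case is read here as requiring the two remaining states to differ from each other.

```python
def quartet_weight(seq_a, seq_b, seq_c, seq_d, partial_credit=False):
    """Support for the quartet {{a, b}, {c, d}} from four aligned sequences."""
    if not (len(seq_a) == len(seq_b) == len(seq_c) == len(seq_d)) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    score = 0.0
    for a, b, c, d in zip(seq_a, seq_b, seq_c, seq_d):
        if a == b and c == d and a != c:
            score += 1.0  # column pattern a = b != c = d
        elif partial_credit and (
            (a == b and c != d and a not in (c, d))
            or (c == d and a != b and c not in (a, b))
        ):
            score += 0.5  # only one of the two pairs agrees
    return score / len(seq_a)

# Columns: AACC (full support), AACC (full support),
# CCCG (no support), AGTT (half support with partial credit).
w_plain = quartet_weight("AACA", "AACG", "CCCT", "CCGT")
w_half = quartet_weight("AACA", "AACG", "CCCT", "CCGT", partial_credit=True)
```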
The quartet $\{\{a, b\}, \{c, d\}\}$ is said to be realized by a cyclic ordering of $X$ if the straight line
connecting $a$ and $b$ and the straight line connecting $c$ and $d$ do not intersect in the interior of
the circle. There is a circular split system represented by a given cyclic ordering that contains a
split separating $a$ and $b$ from $c$ and $d$ if and only if $\{\{a, b\}, \{c, d\}\}$ is realized by that cyclic
ordering. Hence, to ensure that as much quartet information as possible is represented, QNet
(Grunewald, 2006) tries to find a cyclic ordering for which the sum of the weights of all realized
quartets is maximal.
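The notion of a realized quartet, and the quantity that QNet tries to maximize, can be made concrete as follows. This Python sketch only evaluates a given ordering; it does not reproduce the agglomerative search of QNet, and all names and example weights are hypothetical.

```python
def is_realized(pair1, pair2, ordering):
    """True if the chords joining pair1 and pair2 do not cross on the circle.

    The chords do not cross exactly when both taxa of pair2 lie on the
    same arc between the two taxa of pair1.
    """
    pos = {taxon: i for i, taxon in enumerate(ordering)}
    lo, hi = sorted(pos[t] for t in pair1)
    inside = [lo < pos[t] < hi for t in pair2]
    return inside[0] == inside[1]

def realized_weight(ordering, quartet_weights):
    """Sum of the weights of all quartets realized by the ordering."""
    return sum(
        w for (pair1, pair2), w in quartet_weights.items()
        if is_realized(pair1, pair2, ordering)
    )

weights = {
    (("A", "B"), ("C", "D")): 0.5,
    (("A", "C"), ("B", "D")): 0.25,
    (("A", "D"), ("B", "C")): 0.25,
}
score_abcd = realized_weight(["A", "B", "C", "D"], weights)
score_acbd = realized_weight(["A", "C", "B", "D"], weights)
```

With these weights the ordering A, B, C, D realizes the best-supported quartet and therefore scores higher than A, C, B, D.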
Both Neighbor-Net and QNet use the same agglomeration process to construct a cyclic
ordering. While Neighbor-Net tries to place taxa with a small distance close to each other,
QNet tries to construct a cyclic ordering that maximizes the sum of the weights
of the quartets it realizes. Hence, both methods construct cyclic orderings with the property
that groups of phylogenetically closely related taxa tend to assemble along an arc of the cycle.
Neighbor-Net and QNet are both consistent, i.e. if the distances or quartet weights
correspond to a circular split system, they find a cyclic ordering that represents that split system
(Bryant and Moulton, 2007; Grunewald et al., 2007).
For the present purpose, the important property of the circular split systems computed by Neighbor-Net
and QNet is that phylogenetically more closely related taxa are preferentially placed closer
together in the cyclic ordering, since they are separated by fewer splits with positive weights. Thus,
if a character $\chi = \chi_i$ (defined by some alignment site $i$ in a given alignment) is phylogenetically
“useful”, its character states will appear “clustered” along the cyclic ordering, independent of
the details of the branching order in individual subtrees. In contrast, if a character is completely
randomized, one will observe that its character states are randomly arranged along the cycle.
The amount of clustering can be easily quantified by the number $\nu = \nu(C, \chi)$ of adjacent pairs
of taxa with distinct character states along the cycle $C$. We have $\nu = 0$ for constant sites and
$\nu \geq 2$ for all non-constant sites. This number has to be compared with the numbers expected for
a random distribution of character values along the cycle, given the overall distribution of the
character values of $\chi$. To this end, we use a shuffling procedure, i.e. we repeatedly generate a
random cyclic ordering $C'$ of the same character states and compute the fraction $q = q(C, \chi)$ of
randomized samples with $\nu(C', \chi) > \nu(C, \chi)$. The frequency $p = 1 - q$ thus estimates the
probability that the character $\chi$ is randomized. Hence we can interpret $q$ as a reliability measure
for the phylogenetic information contained in the alignment site (relative to $C$). Note that we
obtain $q = 0$ for constant and singleton sites, which are phylogenetically uninformative, and
$q \approx 0.5$ for effectively randomized sites. Sites with $q \leq 0.5$ are “worse” than random and
contradict the given cyclic ordering, while support for the ordering is found in sites with $q \geq 0.5$.
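The break point count and the shuffling estimate of the reliability score can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the C++ code of noisy: the function names, the default number of shuffles, and the fixed default seed are hypothetical choices, and the character states are assumed to be given already in the order of the cycle.

```python
import random

def breakpoints(states):
    """Number of adjacent positions on the cycle with distinct states."""
    # Index i - 1 wraps around, so the last and first taxa are neighbors.
    return sum(states[i] != states[i - 1] for i in range(len(states)))

def reliability(states, n_shuffles=1000, rng=None):
    """Fraction of random arrangements with more break points than observed."""
    if rng is None:
        rng = random.Random(0)  # fixed seed only to make the sketch repeatable
    observed = breakpoints(states)
    shuffled = list(states)
    larger = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if breakpoints(shuffled) > observed:
            larger += 1
    return larger / n_shuffles

q_clustered = reliability("AAAACCCC")    # two compact blocks along the cycle
q_alternating = reliability("ACACACAC")  # maximal number of break points
q_singleton = reliability("AAAAAAAC")    # always exactly two break points
```

A clustered character obtains a high score because most random arrangements break up the two blocks, while constant, singleton and maximally scattered characters obtain a score of zero.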
The program noisy performs the following steps:
1. Compute the cyclic ordering $C$ from the input data using either QNet or NeighborNet.
2. For each character $\chi$
• Compute the number $\nu(C, \chi)$ of break points.
• Compute $N$ random cyclic orderings $C'$.
• For each cyclic ordering compute $\nu(C', \chi)$.
• Compute the fraction $q(C, \chi)$ of random orderings with $\nu(C', \chi) > \nu(C, \chi)$.
3. If $q(C, \chi)$ falls below a given threshold, then remove the character $\chi$.
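The final filtering step can be sketched as follows, assuming that the reliability scores have already been computed for every column and following the convention used for the output alignment, in which sites with a reliability below the cutoff are removed. The function name, the dictionary representation of the alignment, and the uniform treatment of all columns are assumptions of this sketch; how noisy itself handles constant sites is not specified here.

```python
def filter_alignment(alignment, q_scores, q_cutoff=0.8):
    """Keep only the columns whose reliability score reaches the cutoff.

    alignment: dict mapping taxon name to aligned sequence (hypothetical
    representation); q_scores: one reliability score per column.
    Returns the reduced alignment and the indices of the retained columns.
    """
    n_sites = len(q_scores)
    if any(len(seq) != n_sites for seq in alignment.values()):
        raise ValueError("every sequence needs one score per column")
    keep = [i for i, q in enumerate(q_scores) if q >= q_cutoff]
    reduced = {
        name: "".join(seq[i] for i in keep)
        for name, seq in alignment.items()
    }
    return reduced, keep

alignment = {"t1": "ACGT", "t2": "ACGA", "t3": "TCGA"}
reduced, kept = filter_alignment(alignment, [0.9, 0.1, 0.85, 0.4])
```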
The program noisy is implemented in ISO C++ and the source code is available for download
from http://www.bioinf.uni-leipzig.de/Software/noisy/ .
In a first phase, a cyclic ordering of the taxon set is computed. For this purpose, noisy includes
the corresponding subset of routines from David Bryant and Vincent Moulton’s NeighborNet
(Bryant and Moulton, 2004) and the QNet (Grunewald, 2006) packages. Subsequently, a
reliability score $q$ for each character is calculated. The number of character-state changes along
the cyclic ordering is counted and compared to the counts obtained under random shuffling. The
uniform pseudo-random number generator Mersenne Twister (Matsumoto, 1998) is used to
generate the random shuffling.
In order to assess whether the cyclic orderings obtained using QNet and NeighborNet reduce
the fraction of uninterpretable variation, the following randomization experiment was performed.
Given an alignment, all possible cyclic orderings were generated and, for each, the fraction $r$ of
sites with $q \leq 0.8$ among all variable sites in the alignment was computed. As shown in Fig. 3.3,
QNet and NeighborNet nearly minimize the fraction of “noisy” alignment sites for the 10 squamate
mitochondrial genomes. The program noisy exports a Postscript file visualizing the quality of
the sites of the reordered input alignment (see Fig. 3.5), records their reliability scores as
xy-data, and writes a modified alignment for further analysis from which sites with reliability
$q < q_{\mathrm{cutoff}}$ are removed. Figure 3.4 gives an overview of these scores $q$ in relation to the
structure within alignment positions, and Figure 3.5 shows typical examples of the distribution
of alignment sites with low and high reliability scores $q$.
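Enumerating all cyclic orderings is feasible only for small taxon sets: up to rotation and reflection there are $(n-1)!/2$ distinct orderings of $n$ taxa, i.e. 181,440 for $n = 10$. The following Python sketch shows one way to generate each ordering exactly once; it is an illustration with a hypothetical function name, not the code used for the experiment.

```python
from itertools import permutations

def cyclic_orderings(taxa):
    """Yield every cyclic ordering once, up to rotation and reflection."""
    taxa = list(taxa)
    if len(taxa) < 3:
        yield tuple(taxa)
        return
    # Fixing the first taxon removes rotations of the same cycle.
    for perm in permutations(range(1, len(taxa))):
        # Of each ordering and its mirror image, keep only the one whose
        # second element has the smaller index than its last element.
        if perm[0] < perm[-1]:
            yield (taxa[0],) + tuple(taxa[i] for i in perm)

n_orderings_5 = sum(1 for _ in cyclic_orderings("ABCDE"))
n_orderings_6 = sum(1 for _ in cyclic_orderings("ABCDEF"))
```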
Figure 3.3: Distribution of the fraction of randomized characters ($q(C, \chi) \leq 0.8$) among the variable characters in a set of 10 complete mitochondrial genomes as a function of the cyclic ordering (x-axis: fraction of noisy positions; y-axis: number of circular orderings; the orderings obtained from NeighborNet, ClustalW, and QNet/QM are marked). The cyclic orderings computed by NeighborNet or QNet indeed essentially minimize the fraction of putative randomized alignment sites. At least in this example, QNet with quartet-mapping-derived quartet weights performs best. “ClustalW” refers to the circular ordering implicitly constructed by ClustalW from its guide tree, which determines the order in which sequences and profiles are combined to yield the final alignment.
Figure 3.4: Classification scheme showing how noisy handles the information content of characters at different sites
3.4 Computational Results
As an example of the effect of removing noisy sites, I consider a data set of combined 28S
rRNA, 16S rRNA, and mitochondrial COI sequences of spatangoid sea urchins that was reported
to have a high level of homoplasy (Stockley et al., 2005). The raw sequence alignments lead
to significantly different phylogenetic trees for different methods and disagree substantially with
morphology-based results. As reported in the original paper, manual removal of homoplastic
sites improved the trees considerably. The application of noisy with a cutoff $q = 0.8$ leads to
comparable results without human intervention and produces a Maximum Parsimony (MP) tree
that is consistent with the results obtained by Bayesian and Maximum Likelihood (ML) methods
(in contrast to the manual procedure). The MP trees for the complete and the noisy-reduced
alignments are presented in Figure 3.6.
In order to assess the influences of the removal of unreliablesites from real and simulated align-
ments on phylogenetic reconstruction, I consider theqcutoff-dependency of the most used com-
mon indices assessing tree quality. Phylogenies were computed using maximum parsimony and
neighbor joining (Kimura 2-parameter model) as implemented in PAUP* 4.0b10 (Swofford,
2002). Scaled log-likelihood score (i.e. the log-likelihood divided by the length of the align-
ment), homoplasy index (HI) (Kluge and Farris, 1969), rescaled consistency index (RC) (Farris,
1989), and average bootstrap support (over all internal vertices) were used to assess the tree sta-
bility while topological changes were described by split distance1. The data sets are described
1sdist , www.daimi.au.dk/ ˜ mailund/split-dist.html
28
CHAPTER 3. NOISY 3.4. COMPUTATIONAL RESULTS
50 100
150
200
250
300
350
400
450
500
550
600
650
100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
1800
1900
2000
Figure 3.5: Distribution of homoplastic sites for the mitochondrialatp6 gene of squamata (top) and for18S RNA of Coleoptera from an analysis of (Korte et al., 2004)(bottom). In terms of quality, the two datasets are very different. While the majority of sites inatp6 are parsimony informative and approximatelyone third of the sites have a reliability score aboveqcutoff = 0.8, this is clearly not the case for the dataset by Korte et al. (2004) where most of the sites are constantor unreliable. The color codes for the sitesof the alignment are as follows:� site with missing data;� constant site;� singleton site;� parsimonyinformative site (at least two different character states occur in at least two taxa). The black bar below thealignment displays the sites withq ≥ qcutoff (upper half) andq < qcutoff (lower half). Their position isdisplayed below each alignment in nucleotide positions.
29
3.4. COMPUTATIONAL RESULTS CHAPTER 3. NOISY
2528
543
0.54
0.41
0.19
raw
2227
465
0.59
0.44
0.20
noisy
2076
260
0.50
0.56
0.28RC
RI
HI
PI−sites
length
Stockley
Conolampas sigsbei
Echinoneus cyclostomus
Paraster doederleini
Archeopneustes hystrixSpantagus matheyi
Spantagus raschi
Paramaretia multituerculata
Echinocardia laevigaster
Lovenia cordiformis
Allobrissus agassizii
Metalia spatagus
Plagiobrissus grandis
Linopneustes longispinus
Meoma ventricosa
Brissopsis atlanticaPaleopneustes cristatus
Brisaster fragilis
Amphipneustes lorioli
Abatus cavernosus
Amphipneustes lorioli
Abatus cavernosus
Brisaster fragilis
Paleopneustes cristatus
Allobrissus agassizii
Brissopsis atlantica
Meoma ventricosa
Linopneustes longispinus
Plagiobrissus grandis
Metalia spatagus
Lovenia cordiformis
Echinocardia laevigaster
Paramaretia multituerculata
Spantagus raschi
Spantagus matheyi
Archeopneustes hystrix
Paraster doederleini
Echinoneus cyclostomusConolampas sigsbei
Figure 3.6: MP trees of spatangoid sea urchins from combined28S rRNA, 16S rRNA, and mitochondrialCOI sequences (Stockley et al., 2005). On the left from original data, on the right from a reduced alignmentwith cutoff q = 0.8. The latter tree fits very well with the Bayesian and ML results reported in Stockleyet al. (2005) that were obtained from manually reduced alignments. In particular, the reduced MP treecorrectly showsBrissopsisandAllobrissusas sister groups and correctly identifies the large monophyleticclade consisting of theLinopneustes/Metalia andLovenia/Spatangusgroups to the exclusion ofMeomaand Archeopneustes. The included table compares the stability indices (HI = homoplasy index, RC =rescaled consistency index, RI = retention index) between the complete, Stockley’s manually improved,and thenoisy -reduced alignment.
Table 3.1: Randomized sites (at $q_{\mathrm{cutoff}} = 0.8$) in the 13 individual protein-coding genes within the 31 currently available complete mitochondrial genomes of Squamata. The last column gives the fraction of randomized variable sites.

Gene   length   singletons   q ≥ 0.8   random (%)
atp6      684           42       405        34.65
atp8      171            7       108        32.75
cox1     1536           88      1008        28.65
cox2      672           34       443        29.02
cox3      786           45       516        28.63
cytb     1131           74       676        33.69
nd1       942           44       589        32.80
nd2      1032           63       626        33.24
nd3       345           11       222        32.46
nd4      1371           65       831        34.65
nd4l      288           16       183        30.90
nd5      1803          103      1040        36.61
nd6       540           25       373        26.30
briefly in the caption of Tab. 3.2; they are available for download at
http://www.bioinf.uni-leipzig.de/Publications/SUPPLEMENTS/06-013/.
Figure 3.7 summarizes the results for alignments of mitochondrial protein-coding genes. The
other data sets, omitted here, show the same qualitative behaviour. Table 3.1 shows that the
fraction of effectively randomized sites varies considerably (from 26% to 37%) between different
proteins, even in the relatively benign case of mitochondrial genomes (Simon et al., 1994). As
expected, the homoplasy index is significantly reduced while the rescaled consistency index
increases with increasing values of $q_{\mathrm{cutoff}}$. Similarly, the scaled log-likelihood values increase
with the fraction of excluded randomized alignment sites. Note that while the tree-quality indices
improve consistently, indicating that the reconstructions become more stable, the absolute values
of the quality indices nevertheless depend strongly on the size and quality of the input alignments.
Hillis and Huelsenbeck (1992) suggested another method to estimate the phylogenetic
information content of an alignment: they determined the skewness test statistic $g_1$ of the
corresponding tree-length distribution. I analyzed the data with the random-tree option
implemented in PAUP* 4.0b10. For the data matrices, I generated 100,000 trees at random from all
possible tree topologies (with replacement). The results are consistent with the tree statistics
discussed above. As expected, $g_1$ becomes more negative with increasing values
of $q_{\mathrm{cutoff}}$, at least as long as one does not start to remove too many informative sites (data not
Figure 3.7: Dependency of tree-quality indices on the cut-off value $q_{\mathrm{cutoff}}$ for data set Sample 6. The stability of the trees is measured by the scaled log-likelihood $(\ln L)/n$, the homoplasy index (HI) (Kluge and Farris, 1969), and the rescaled consistency index (RC) (Farris, 1989) as computed by PAUP* 4.0b10. Data sets are alignments (supplied in the electronic supplement) of individual mitochondrial protein-coding genes. They vary in size (from about 170 to 1800 nt) and randomization.
Table 3.2: Comparison between original and reduced alignments

Data set    raw     cutoff q = 0.8   improvement (%)
Sample 1    46.64   67.71            45.1
Sample 2    70.89   71.36             0.6
Sample 3    75.03   76.89             2.5
Sample 4    82.28   84.67             2.9
Sample 5    84.79   86.25             1.7
Sample 6    90.28   89.83            -0.5

In the reduced alignments, sites with a reliability score below the cutoff value q = 0.8 are removed. Average bootstrap-support values (1000 replicates) are computed for neighbor-joining trees.
Real data sets: Sample 1: complete coding sequence of the mitochondrial gene cox1 from 17 adephagan aquatic beetles (G. Fritzsch, unpublished); Sample 2: 48 nuclear 18S RNA genes from basal deuterostomes (Cameron et al., 2000), available from http://chuma.cas.usf.edu/~garey/alignments/alignment.html; Sample 3: combined data set of 12S rRNA and the amino-acid coding gene ND1 from 41 Andean frogs (Lehr et al., 2005); Sample 4: complete set of protein-coding mitochondrial genes from 46 selected arthropods (see electronic supplement); Sample 5: 12S, 16S, and ND1 sequences of 30 jewel beetles (Bernhard et al., 2005); Sample 6: complete set of protein-coding mitochondrial genes from all 31 currently available Squamata (see electronic supplement).
shown).
An alternative measure for the stability of a phylogenetic reconstruction is the bootstrap support
of the trees, computed here with the neighbor-joining method (Saitou and Nei, 1987).
Table 3.2 summarizes changes in average bootstrap support for neighbor-joining trees computed
using PAUP* 4.0b10 and 2000 bootstrap replicates (Felsenstein, 1985; Efron et al., 1996).
Usually, there is a small but significant improvement of a few percent. However, in data sets that
lead to very weakly supported phylogenies the improvement can be quite dramatic, as in the case
of Sample 1.
In order to study the effect of removing putative homoplastic sites in a more systematic way, my
coworker S. Prohaska and I generated artificial data sets for caterpillar and balanced trees with
4 to 29 taxa using dawg (DNA Assembly With Gaps) (Cartwright, 2005). Fig. 3.8 shows the
variation of the bootstrap support relative to the cutoff value q. The relative average bootstrap
support of phylogenetic trees is the ratio of the average bootstrap support for the modified
alignments to the bootstrap support obtained from the original alignment. Pairs of caterpillar
and balanced trees with the same number of taxa were constructed so that (a) all leaves have the
same evolutionary distance from the root and (b) all internal edges as well as all edges leading
Figure 3.8: The relative average bootstrap support of phylogenetic trees is computed as the ratio of the average bootstrap support for the modified alignments to the bootstrap support obtained from the original alignment. Values larger than 1 indicate an increase in tree quality. The curves show a distinct maximum that depends on the number of taxa and the topology of the tree. The maximum improvement increases with the number of taxa (indicated on the right margin of both panels for the highlighted curves). For clarity, error bars obtained from 100 replicates are shown only for $N = 10$ and $N = 25$ taxa. The tree topologies, caterpillar trees on the left and balanced trees on the right, are depicted by the insets.
to leaves with maximal depth (maximal number of internal nodes on the path to the root) have
the same 'unit length'. This unit length is set to 0.4 substitutions per site in the binary trees. In
the caterpillar trees the 'unit length' is scaled so that the total length equals that of the binary
tree with the same number of species. For each tree, we used dawg to generate 100 independent
alignments using the following parameters: alignment length 800 nt, GTR model with $\gamma = 0.5$
and $\iota = 0.1$, and dawg's default substitution matrix for the GTR model.
We observed a pronounced maximum of bootstrap support whose position and value, however,
depend strongly on both the number of taxa and the topology of the tree. For small values of
$q_{\mathrm{cutoff}}$ the alignment stability increases because only the most noisy sites are removed. (In
contrast, tree stability decreases immediately when randomly chosen alignment columns are
removed; data not shown.) For large values of $q_{\mathrm{cutoff}}$, tree stability starts to decrease again because
noisy starts to remove too many informative sites.
Empirically, for large data sets I found that $q_{\mathrm{cutoff}} \approx 0.8$ is a good compromise between these
two effects. For small data sets with fewer than 15 taxa I found no improvements except for rather
small $q_{\mathrm{cutoff}}$ values, reflecting the fact that, for small data sets, there are not many possible
values of $\nu(C, \chi)$. This implies that noisy should be used only for at least moderately large
data sets.
In general, the caterpillar trees admit larger improvements in bootstrap support than the balanced
ones. I remark that the balanced trees are almost correctly reconstructed while the caterpillar
trees are poorly reconstructed, in particular at the deep nodes (data not shown).
3.5 Conclusion
It has been argued repeatedly that saturated (homoplastic) characters are detrimental to
phylogeny reconstruction and thus should be removed from multiple sequence alignments, see e.g.
Wägele (2005). Since homoplasy is defined relative to the unknown true tree, it is not obvious
how to reliably identify the homoplastic characters without prior knowledge of that tree. Here, I
show that cyclic orderings can be obtained robustly, e.g. from pairwise distance data, without
detailed knowledge of the correct phylogenetic relationships. Given a circular ordering that is
consistent with a phylogeny, the variation of character states of a given site along the circle is
used to determine the (putative) degree of its randomization. This information can then be used
to prune the sequence alignment. The computer program noisy implements this procedure.
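The idea of measuring variation along the circle can be illustrated with a deliberately simplified sketch. It only counts how often the character state of a site changes between neighbouring taxa on a given circular ordering; it is not the reliability score q computed by noisy, and the function name, taxon labels, and toy columns are hypothetical.

```python
def changes_along_circle(column, ordering):
    """Count character-state changes between circular neighbours.

    column   -- dict mapping taxon name to the character state at one site
    ordering -- list of taxon names in circular order (hypothetical input)
    """
    n = len(ordering)
    return sum(
        column[ordering[k]] != column[ordering[(k + 1) % n]]
        for k in range(n)
    )


# Toy data (assumed, not taken from any data set in this chapter).
ordering = ["t1", "t2", "t3", "t4", "t5", "t6"]
clean_site = dict(zip(ordering, "AAAGGG"))  # states form two contiguous arcs
noisy_site = dict(zip(ordering, "AGAGAG"))  # states alternate along the circle

print(changes_along_circle(clean_site, ordering))
print(changes_along_circle(noisy_site, ordering))
```

A site whose states form a few contiguous arcs on the circle changes state only a few times, whereas a randomized site changes state at almost every step.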
High substitution rates that are unequally distributed among the sites of the sequences, caused
e.g. by sequence constraints due to environmental pressure, can produce a considerable amount
of phylogenetic noise in the data and so-called 'bad', phylogenetically misleading alignments.
Such alignments can be improved by increasing the signal-to-noise ratio through exclusion of
noisy sites. Alignment modifications, like concatenation of conserved blocks, are known to
improve phylogenetic analysis and, carried out manually, are common practice. However, manual
improvements are almost impossible for large and/or diverse alignments and typically make
it hard to reproduce the results later on. Furthermore, they are not immune to the effects of
wishful thinking. In contrast, a method such as noisy provides an essentially deterministic
and unbiased solution.
It is important to note that 'good' alignments cannot be further improved by reducing
the alignment length. While distance-based methods for phylogenetic reconstruction in particular
are relatively robust and can tolerate a good fraction of phylogenetically uninformative sites (see
in particular Ogden and Rosenberg (2006)), a high absolute number of informative sites is
necessary to obtain reliable trees.
The analysis of artificial data sets suggests a set of simple rules that enable the user
to decide under which conditions it makes sense to use noisy to process multiple sequence
alignments prior to using them for phylogenetic reconstruction:
(1) If the original alignment already yields trees with very high average bootstrap support, there
is nothing to be gained from this method.
(2) Data sets with fewer than about 10 taxa are unlikely to be improved.
(3) The cutoff value of q depends on the tree topology and in particular on the number of
taxa. It pays off to determine the maximum of the gain as a function of q and to use the
corresponding optimal cutoff value.
The analysis of several published data sets shows that removal of randomized sites consistently
leads to more stable trees, irrespective of the method used for phylogeny reconstruction (neighbor
joining, maximum parsimony, or maximum likelihood). While in benign data sets the effects
on consistency indices, likelihood score, or bootstrap support are typically small and I do not
observe changes in the reconstructed tree topologies, the effects of removing homoplastic sites
can become dramatic for poor data sets, as the example of the cox1 genes of Sample 1 (Table 3.2)
demonstrates. In some cases, the reconstructed tree topologies can be improved as well, see e.g.
the example of the sea urchin phylogeny in Figure 3.6.
In this chapter I outlined a novel way to handle difficult pre-computed alignments by
minimizing the number of randomized sites. In contrast to manual manipulation of alignments,
reducing data sets using noisy is transparent and easy to reproduce. Randomized sites are,
at best, phylogenetically uninformative and, in the worst case, misleading. Circular
orderings allow homoplastic characters to be detected in a two-stage approach: in the first step, one
constructs a circular ordering that minimizes the fraction of 'noisy' sites (as in Figure 3.3).
In the second step, one then constructs the tree implied by the alignment obtained after
elimination of all sites that appear to be highly randomized relative to that circular ordering.
CHAPTER 4
Gene Order Rearrangements
Mitochondrial genomes provide a valuable data set for phylogenetic studies, in particular of
metazoan phylogeny, because of the extensive taxon sampling that is available. Beyond the
traditional sequence-based analysis it is possible to extract additional phylogenetic information from
the gene order. Gene order data present significant mathematical challenges not encountered
when dealing with sequence data. Many evolutionary events may affect the gene order and gene
content of a genome, and each of these events creates its own challenges (Moret et al., 2004).
Figure 4.1: Every vertical pair of arrows marks a breakpoint, so there are two breakpoints between these gene orders.
In the typical case, metazoan mitochondrial genomes are circular and
include thirteen protein-coding genes, two rRNAs, twenty-two tRNAs, and one control region.
A common method to compare the gene order of two species is to compute a sequence of
genome rearrangement operations that transforms one order into the other. The focus of interest
is most often on parsimonious rearrangement scenarios, which use a minimal number of
operations. Among the most frequently considered rearrangement operations are inversions
(called reversals in mathematics and (bio)informatics), which reverse the order of a contiguous
run of neighbouring genes and change the sign of each reversed gene.
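A reversal on a signed gene order can be written in a few lines. The sketch below assumes that a gene order is represented as a Python list of non-zero integers whose sign encodes the strand; the function name and the index convention (inclusive, zero-based) are choices made for this illustration only.

```python
def reversal(order, i, j):
    """Reverse the segment order[i..j] (inclusive) and flip the sign of each gene in it."""
    segment = [-gene for gene in reversed(order[i:j + 1])]
    return order[:i] + segment + order[j + 1:]


genome = [1, 2, 3, 4, 5]
inverted = reversal(genome, 1, 3)   # genes 2 3 4 become -4 -3 -2
restored = reversal(inverted, 1, 3) # the same reversal applied again undoes it
print(inverted, restored)
```

Applying the same reversal twice restores the original order, so every reversal is its own inverse.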
Commonly, methods for phylogenetic reconstruction from gene order can be divided into three
classes: (i) distance-based methods, (ii) parsimony-based methods, and (iii) likelihood-based
methods.
4.1 Breakpoint Distance
A breakpoint is an adjacency present in one genome but not in the other. The breakpoint distance
is then the number of breakpoints present; this measure is easily computed in linear time.
However, it does not directly reflect rearrangement events, but only their final outcome. Figure 4.1
shows two breakpoints between two strings (e.g. genomes). Note that the gene subsequence 3
4 5 is identical to -5 -4 -3, since the latter is just the former read on the complementary strand
(Moret et al., 2004).
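The following sketch counts breakpoints between two signed gene orders of equal gene content, treating the adjacency a b as identical to -b -a. The list-of-integers representation, the function names, and the example genomes are assumptions made for illustration.

```python
def signed_pairs(order, circular=True):
    """List consecutive gene pairs; close the circle if requested."""
    pairs = list(zip(order, order[1:]))
    if circular and len(order) > 1:
        pairs.append((order[-1], order[0]))
    return pairs


def breakpoint_distance(g1, g2, circular=True):
    """Count adjacencies of g1 that are absent from g2."""
    present = set()
    for a, b in signed_pairs(g2, circular):
        present.add((a, b))
        present.add((-b, -a))  # the same adjacency read on the other strand
    return sum((a, b) not in present for a, b in signed_pairs(g1, circular))


# One reversal of the block 2 3 4 creates two breakpoints.
print(breakpoint_distance([1, 2, 3, 4, 5], [1, -4, -3, -2, 5]))
# 3 4 5 and -5 -4 -3 are the same linear order read on opposite strands.
print(breakpoint_distance([3, 4, 5], [-5, -4, -3], circular=False))
```

Because the adjacencies of the second genome are stored in a hash set, the running time is linear in the number of genes, in line with the statement above.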
4.2 Inversion Distance
Given two signed gene orders of equal content, the inversion distance is simply the edit distance
when inversion is the only operation allowed. Even though we have to consider only one type
of rearrangement, this distance is very difficult to compute (Moret et al., 2004). For unsigned
permutations, in fact, the problem is NP-hard (nondeterministic polynomial-time hard, see
Appendix A). For signed permutations, it can be computed in linear time (Bader et al., 2001), using
the theoretical results of Hannenhalli and Pevzner (1995).
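For very small signed permutations the definition of the inversion distance as an edit distance can be checked by exhaustive breadth-first search. This brute-force sketch is not the linear-time algorithm of Bader et al. (2001); it treats the gene orders as linear rather than circular, and its function name is hypothetical. The search space grows as $2^n n!$, so it is only usable for a handful of genes.

```python
from collections import deque


def inversion_distance_bfs(source, target):
    """Minimum number of signed reversals turning source into target (tiny inputs only)."""
    source, target = tuple(source), tuple(target)
    distance = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        if current == target:
            return distance[current]
        n = len(current)
        for i in range(n):
            for j in range(i, n):
                # Reverse the block current[i..j] and flip its signs.
                block = tuple(-g for g in reversed(current[i:j + 1]))
                neighbour = current[:i] + block + current[j + 1:]
                if neighbour not in distance:
                    distance[neighbour] = distance[current] + 1
                    queue.append(neighbour)
    return None  # unreachable if both orders contain the same genes


print(inversion_distance_bfs([1, -3, -2], [1, 2, 3]))
print(inversion_distance_bfs([2, 1], [1, 2]))
```

The second call illustrates why signs matter: swapping two genes without changing their orientation cannot be done with a single reversal.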
4.3 Parsimony Approaches
Of particular interest are most parsimonious rearrangement scenarios, which use a minimal
number of rearrangement operations. Parsimony approaches to gene order reconstruction fall into
two subcategories: first, encoding methods, which reduce gene order problems to sequence
problems, and second, direct methods, which run optimization algorithms directly on the gene
order (Moret et al., 2004).
4.3.1 Encoding Methods
The running times of direct optimization approaches are exponential in both the number of genomes
and the number of genes. An approach that, while remaining exponential in the number
of genomes, takes time polynomial in the number of genes is therefore of considerable interest.
This is the motivation for reducing gene order data to sequences through some type of encoding.
Maximum Parsimony on Binary Encoding (MPBE) (Cosner et al., 2000a,b) is an
example of such a method. MPBE produces one character for each gene adjacency present in the
data. If genes $i$ and $j$ occur as the adjacent pair $i\ j$ (or $-j\ -i$) in one of the genomes, then
a binary character is set up to indicate the presence or absence of this adjacency (coded 1 for presence
and 0 for absence). The position of a character within the sequence is arbitrary, as long as it is the
same for all genomes. By definition, there are at most $2n^2$ characters, so that the sequences are
of lengths polynomial in the number of genes. Analyses using maximum parsimony will therefore
run in time polynomial in the number of genes, but may require time exponential in the number of
genomes.
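A minimal sketch of such a binary adjacency encoding is given below. It assumes circular genomes stored as lists of signed integers with equal gene content and no duplications; the canonical form chosen for an adjacency, the function names, and the two toy genomes are illustrative choices, not part of the published method.

```python
def adjacency_set(order):
    """Set of adjacencies of a circular signed gene order, in a canonical form."""
    n = len(order)
    result = set()
    for k in range(n):
        a, b = order[k], order[(k + 1) % n]
        # (a, b) and (-b, -a) denote the same adjacency; keep one representative.
        result.add(max((a, b), (-b, -a)))
    return result


def mpbe(genomes):
    """Return the character list and one 0/1 string per genome."""
    per_genome = {name: adjacency_set(order) for name, order in genomes.items()}
    characters = sorted(set().union(*per_genome.values()))
    matrix = {
        name: "".join("1" if c in adj else "0" for c in characters)
        for name, adj in per_genome.items()
    }
    return characters, matrix


characters, matrix = mpbe({"A": [1, 2, 3, 4], "B": [1, -3, -2, 4]})
print(characters)
print(matrix)
```

The resulting 0/1 strings can be passed to any maximum parsimony program that accepts binary characters; the column order is arbitrary but identical for all genomes.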
However, while a parsimony analysis relies on independence among characters, the characters
produced by MPBE are emphatically dependent; moreover, translating the evolutionary model
of gene orders into a matching model of sequence evolution for the encoding is quite difficult.
Note that this method suffers from several problems: (i) the ancestral sequences produced by
the reconstruction method may not be valid encodings; (ii) none of the ancestral sequences can
describe adjacencies not already present in the input data, thus limiting the possible
rearrangements; and (iii) genomes must have equal gene content with no duplication (Moret et al., 2004).
Another example is the Maximum Parsimony on Multistate Encoding (MPME) method (Wang
et al., 2002). This method uses exactly one character for each signed gene (thus $2n$ characters in all).
The state of a character is the signed gene that follows it in the gene ordering (in the direction
indicated by the sign). The position of each character within the sequence is arbitrary as long
as it is consistent across all genomes, although it is most convenient to think of the $i$th character
(with $i \le n$) as associated with gene $i$, and of the $(n+i)$th character as associated with gene $-i$. For
instance, the circular gene order (1, -4, -3, -2) gives rise to the encoding (-4, 3, 4, -1, 2, 1, -2, -3)
(Moret et al., 2004).
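The multistate encoding can be reproduced with a short sketch. It assumes a circular genome given as a list of signed integers in which every gene 1, ..., n occurs exactly once; the function name is hypothetical.

```python
def mpme_encoding(order):
    """Multistate encoding: for each signed gene, the signed gene that follows it."""
    n = len(order)
    successor = {}
    # Reading the circle on the complementary strand reverses order and signs.
    reverse_strand = [-gene for gene in reversed(order)]
    for strand in (order, reverse_strand):
        for k, gene in enumerate(strand):
            successor[gene] = strand[(k + 1) % n]
    positive = [successor[i] for i in range(1, n + 1)]
    negative = [successor[-i] for i in range(1, n + 1)]
    return positive + negative


print(mpme_encoding([1, -4, -3, -2]))
```

For the circular order (1, -4, -3, -2) this yields the encoding quoted in the text.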
The results of Moret's analyses indicate that the MPME method dominates the MPBE method.
However, both methods still suffer from some of the same problems, as they also require equal
gene content with no duplication, and they too can create invalid encodings.
4.3.2 Direct Optimization
Sankoff and Blanchette (1998) proposed to reconstruct the breakpoint phylogeny, i.e. the tree
and the ancestral gene orders which together minimize the
total number of breakpoints along all edges of the tree. This problem includes the breakpoint
median as a special case and is therefore NP-hard even for fixed trees. Consequently, Sankoff and
Blanchette (1998) suggested a heuristic called BPAnalysis, based on iterative improvement,
for scoring a fixed tree and simply decided to examine all possible trees. The BPAnalysis
heuristic is summarized as follows:
For each possible tree T do
    Initially label all internal nodes with gene orders
    Repeat
        For each internal node v, with neighbours labelled A, B, and C, do
            Solve the median problem on A, B, and C to yield label M
            If relabelling v with M improves the score of T, then do it
    until no internal node can be relabelled
This method is practicable for small data sets only because of its enormous computational cost.
BPAnalysis is expensive at every level. One main problem is the innermost loop, which
repeatedly solves the breakpoint median problem, an NP-hard problem. Furthermore, the
labelling procedure runs until no improvement is possible and may therefore use a large
number of iterations. Finally, the labelling procedure is applied to every possible tree topology, of
which there is an exponential number.
For example, the number of unrooted, unordered trees on $n$ labelled leaves is $(2n-5)!!$ (double
factorial), that is, $(2n-5)!! = (2n-5) \cdot (2n-7) \cdot (2n-9) \cdots 5 \cdot 3$. For just 13 genomes,
we obtain about 13.7 billion trees; for 20 genomes, there are so many trees that merely counting to that
value would take thousands of years on the fastest supercomputer (Moret et al., 2004).
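The growth of the number of tree topologies is easy to tabulate. The helper below simply evaluates the double factorial given above; its name is chosen for this illustration.

```python
def count_unrooted_trees(n):
    """Number of unrooted binary trees on n >= 3 labelled leaves: (2n - 5)!!."""
    if n < 3:
        raise ValueError("need at least three leaves")
    count = 1
    for factor in range(3, 2 * n - 4, 2):  # 3, 5, ..., 2n - 5
        count *= factor
    return count


for leaves in (4, 5, 10, 13, 20):
    print(leaves, count_unrooted_trees(leaves))
```

Python integers have arbitrary precision, so the count for 20 leaves is computed exactly.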
Moret et al. (2001) reimplemented the BPAnalysis heuristic and made extensive use of
algorithmic engineering techniques (Moret et al., 2002) to speed it up. Moreover, they added the use
of inversion distance in order to produce inversion phylogenies. The result of this
reimplementation is GRAPPA (Genome Rearrangement Analysis under Parsimony and other Phylogenetic
Algorithms) (Moret et al., 2001).
GRAPPA iterates over all possible tree topologies. For each topology, GRAPPA initialises the
internal nodes. In the next step, for each internal node $\sigma$ a median $\mu$ of its three neighbours
$\pi_1$, $\pi_2$, $\pi_3$ is computed, and $\sigma$ is replaced by $\mu$ if this improves the score of the tree.
This procedure is repeated until no internal node can be relabelled. To solve the resulting median
problems, GRAPPA offers several methods, among them Siepel's median solver (Siepel and Moret, 2001)
and Caprara's median solver (Caprara, 2003). Both approaches are branch-and-bound algorithms
which solve the problem to optimality.
GRAPPA is not guaranteed to return optimal solutions, although it iterates over all possible tree
topologies and solves each median problem optimally. This is because it cannot assign the labels of the
internal nodes in an optimal way.
Over the years many different methods have been developed to use gene order information for the
reconstruction of phylogenetic relationships. However, at present there are still numerous
problems to deal with. In the following I will introduce new approaches which provide a starting
point for the solution of three main problems: (1) comparing gene order strings of different lengths;
(2) applying common reconstruction methods to gene orders; and, as a final step, (3) using
genome arrangements to reconstruct phylogenetic relationships.
4.4 The circal algorithm
Here I present a novel approach utilizing these data based on cyclic list alignments of the gene
orders. This method was developed in cooperation with Prof. Peter F. Stadler. A progressive
alignment approach is used to combine pairwise list alignments into a multiple alignment
of gene orders. Parsimony methods are then used to reconstruct phylogenetic trees, ancestral gene
orders, and consensus patterns in a straightforward way. This method was applied to the
study of the metazoan phylogeny based exclusively on mitochondrial gene arrangements.
Furthermore, I will demonstrate that the approach is also applicable to the much larger genomes of
chloroplasts.
This section will first introduce the cyclic sequence alignment problem and describe a
polynomial-time solution for arbitrary cost functions. Afterwards, the problem of directionality and the
occurrence of duplicate mitochondrial genes will be addressed. In the final subsection I consider
various possibilities of extracting consensus gene arrangements from cyclic alignments.
4.4.1 Cyclic alignments
An alignment $A$ of two strings $x$ and $y$ is a sequence of pairs of the form $(x_i, y_j)$, $(x_i, -)$, and
$(-, y_j)$ that preserves the order of sequence positions in both $x$ and $y$. A pair $(x_i, y_j)$ corresponds
to a substitution of $x_i$ by $y_j$, a pair $(x_i, -)$ represents the deletion of $x_i$, and $(-, y_j)$ is the insertion
of $y_j$. A maximal subsequence consisting of deletions $(x_i, -), (x_{i+1}, -), \ldots, (x_{i+q-1}, -)$ will be
referred to as the deletion of the substring $x[i, i+q-1]$ of length $q$, and analogously for insertions.
We consider here a cyclic variant $A^{\circ}$ of the alignment in which we allow insertions and deletions
of substrings to "wrap around" the ends of the alignments, so that e.g. $(x_1, -)$ and $(x_n, -)$ are
part of the same deleted substring.
With each alignment we associate a cost function. We distinguish substitution costs $s(a, b)$
between two letters $a$ and $b$ and costs of insertions and deletions $g(a)$ for a substring $a$. The cost
function $g(a)$ is called the gap cost function. The total cost of an alignment is the sum of the
costs of its individual edit operations. We call the cost model additive if the gap cost
functions are additive:

$$g(a) = \sum_{a_i \in a} g(a_i) \tag{4.1}$$

Note that for additive gap costs the cost $f(A)$ of an alignment $A$ and the cost $f(A^{\circ})$ of its cyclic
variant $A^{\circ}$ are the same. Gap costs have to be sub-additive, $g(a \cup b) \le g(a) + g(b)$. It follows
that in general $f(A) \ge f(A^{\circ})$, since a "wrap-around" gap is never more expensive than two separate
end-gaps.
In the context of cyclic alignments one naturally considers the strings themselves as cyclic
(Bunke and Bühler, 1993; Gregor and Thomason, 1993; Maes, 1990; Mollineda et al., 2002).
Formally, cyclic strings are usually introduced as equivalence classes w.r.t. the cyclic shift
operator $\sigma$ that rotates a string by one position: $\sigma(x) = (x_2, \ldots, x_{n-1}, x_n, x_1)$. The cyclic string
associated with an ordinary string $x$ is thus the equivalence class $[x] = \{x, \sigma(x), \sigma^2(x), \ldots, \sigma^{n-1}(x)\}$.
An alignment of two cyclic strings is simply a cyclic alignment of two representatives $\sigma^k(x)$ and
$\sigma^l(y)$ of $[x]$ and $[y]$. Of course we are interested in those representatives that yield the optimal
alignment, i.e. that minimize

$$f\bigl(A^{\circ}([x], [y])\bigr) = \min_{k,l} f\bigl(A^{\circ}(\sigma^k(x), \sigma^l(y))\bigr) \tag{4.2}$$

where $A^{\circ}(p, q)$ denotes the cost-optimal cyclic alignment of the (non-cyclic) strings $p$ and $q$.
This problem can be solved in $O(|x||y| \log(|x| + |y|))$ time and quadratic space in the case of
additive cost functions (Maes, 1990; Gregor and Thomason, 1993), see also Landau et al. (1998).
Unfortunately, this approach does not generalize to the problem at hand, which requires us to
consider arbitrary cost functions.
A solution to the general problem can still be obtained with quadratic memory and in polynomial
time. First we note that the optimal circular alignment of $[x]$ and $[y]$ is either the trivial alignment
with cost $g(x) + g(y)$, in which $[x]$ is deleted and $[y]$ is inserted (this is cheaper than deleting $[x]$
and inserting $[y]$ in multiple intervals because of the subadditivity of the gap cost function), or the
optimal alignment contains at least one pair of match positions, say $x_p$ and $y_q$. The cost $f(A_{pq})$
of this alignment is given by

$$f(A_{pq}) = s(x_p, y_q) + f\bigl(A(\sigma^p(x)[2..|x|], \sigma^q(y)[2..|y|])\bigr). \tag{4.3}$$

Recall that the first position of $\sigma^p(x)$ is $x_p$, so that we consider a non-cyclic alignment which in the
very first position has the match $(x_p, y_q)$, followed by the optimal alignment of the remainder of
the rotated strings $\sigma^p(x)$ and $\sigma^q(y)$. Since the first position is a match, a possible end-gap cannot
wrap around, so that we have to consider an optimal non-cyclic alignment of the substrings
$\sigma^p(x)[2..|x|]$ and $\sigma^q(y)[2..|y|]$, see Figure 4.2. Each one of them can be computed in
$O(|x| \cdot |y| \cdot \max(|x|, |y|))$ operations with arbitrary cost functions, see e.g. Dewey (2001), or in $O(|x| \cdot |y|)$
operations for affine gap cost functions (Gotoh, 1982). Thus, we can compute the optimal cyclic
alignment by computing $A_{pq}$ for all pairs $(p, q)$.
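The enumeration over anchor pairs can be sketched directly. The code below is a plain, unoptimized illustration and not the circal implementation: `linear_cost` is a cubic-time dynamic program for arbitrary gap cost functions, `cyclic_cost` tries the trivial alignment and every anchored match, and the cost functions `s` and `g` as well as the example strings are assumptions made for this example (rotations use zero-based indices).

```python
import math


def linear_cost(x, y, s, g):
    """Optimal non-cyclic alignment cost with substitution cost s and gap cost g."""
    n, m = len(x), len(y)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i and j:  # (mis)match of x[i-1] with y[j-1]
                D[i][j] = min(D[i][j], D[i - 1][j - 1] + s(x[i - 1], y[j - 1]))
            for k in range(i):  # deletion of the substring x[k:i]
                D[i][j] = min(D[i][j], D[k][j] + g(x[k:i]))
            for k in range(j):  # insertion of the substring y[k:j]
                D[i][j] = min(D[i][j], D[i][k] + g(y[k:j]))
    return D[n][m]


def cyclic_cost(x, y, s, g):
    """Optimal cyclic alignment cost: trivial alignment or best anchored match."""
    best = g(x) + g(y)
    for p in range(len(x)):
        xr = x[p:] + x[:p]
        for q in range(len(y)):
            yr = y[q:] + y[:q]
            best = min(best, s(xr[0], yr[0]) + linear_cost(xr[1:], yr[1:], s, g))
    return best


s = lambda a, b: 0 if a == b else math.inf  # only identical entries may be matched
g = lambda segment: 2 + len(segment)        # sub-additive toy gap cost

print(linear_cost("ABCDE", "BCD", s, g), cyclic_cost("ABCDE", "BCD", s, g))
print(linear_cost("ABCD", "CDAB", s, g), cyclic_cost("ABCD", "CDAB", s, g))
```

In the first example the linear alignment pays for two separate end-gaps, whereas the cyclic alignment merges them into one wrap-around gap; in the second, the two strings are rotations of each other and the cyclic cost vanishes.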
4.4.2 Encoding of Mitochondrial Genomes
Mitochondrial genomes can clearly be regarded as cyclically ordered lists of genes. In addition,
however, the genes are oriented, depending on whether they are located on the heavy or the light
strand. This is taken into account in the framework of list alignments by considering the same
gene in different orientations as different objects (Figure 4.3).
Duplication and deletion of genes in mitochondrial genomes occurred frequently during the
evolution of the Metazoa. In recent years, Paul Higgs and co-workers presented compelling evidence
that duplication of a tRNA-Leu gene, followed by anticodon mutation and subsequent deletion
of tRNA-Leu genes, has occurred at least five times during the evolution of the Metazoa (Higgs
et al., 2003). Animal mitochondrial genomes, for example, usually have two transfer RNAs
for both leucine and serine. While such duplicate genes are problematic in permutation-based
approaches, they are naturally described in the list alignment model used here: there are simply two
tRNA-Leu genes in the cyclic list, for which we use the different symbols 'L1' and 'L2'.
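As an illustration of this encoding, the sketch below turns one whitespace-separated gene list, written in the style of the input shown in Figure 4.3 (a leading minus sign marking the opposite strand), into a list of (gene, orientation) pairs. The function name and the tuple representation are assumptions made for this example, not the data structures of circal.

```python
def parse_gene_order(line):
    """Split a gene list such as 'F 12S V -Q' into (name, orientation) pairs."""
    genes = []
    for token in line.split():
        if token.startswith("-"):
            genes.append((token[1:], -1))  # gene on the opposite strand
        else:
            genes.append((token, +1))
    return genes


order = parse_gene_order("F 12S V 16S L2 ND1 I -Q M ND2")
print(order)
# The same gene in different orientations is a different list entry.
print(("Q", +1) in order, ("Q", -1) in order)
```

Duplicated genes need no special treatment in this representation, because 'L1' and 'L2' are simply different names.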
4.4.3 Scoring Model
In the following we turn to a more detailed description of the scoring functions that underlie
the pairwise linear alignments. The (mis)match scores are trivial because it does not make sense
to align non-homologous genes, i.e. non-identical list entries. We simply have $\sigma(x, y) = 0$ if
$x = y$ and $\sigma(x, y) = \infty$ if $x \ne y$. Our knowledge about the mechanism of the genomic
rearrangements must therefore be incorporated into the indel scores. Because only identical entries
can be matched, we have to compute only $O(D \max(|X|, |Y|))$ pairwise linear alignments, where $D$
is the maximum number of copies of a duplicated gene.
Figure 4.2: An alignment of two cyclic strings that contains the (mis)match $(x_p, y_q)$ is equivalent to a linear alignment of $\sigma^p(x)$ with $\sigma^q(y)$ with the constraint that $x_p$ and $y_q$ form a (mis)match. Note that, in contrast to the default of many alignment programs, we have to score "end-gaps" just like all other indels here.
>NC_004419 Polyodon spathula
F 12S V 16S L2 ND1 I -Q M ND2 W -A -N -C -Y CO1 -S2 D CO2 K ATP8 ATP6 CO3 G ND3 R ND4L ND4 H S1 L1 ND5 -ND6 -E CYTB T -P
>NC_002639 Myxine glutinosa
F 12S V 16S L2 ND1 I -Q M ND2 W -A -N -C -Y CO1 -S2 D CO2 K ATP8 ATP6 CO3 G ND3 R ND4L ND4 H S1 L1 ND5 -ND6 -E CYTB T -P
>NC_001626 Petromyzon marinus
F 12S V 16S L2 ND1 I -Q M ND2 W -A -N -C -Y CO1 -S2 D CO2 K ATP8 ATP6 CO3 G ND3 R ND4L ND4 H S1 L1 ND5 -ND6 T -E CYTB -P
>NC_002177 Halocynthia roretzi
F G T ND6 L1 N G D CO3 ND4L C K 12S CO2 CYTB Y W I E ND2 H S1 R Q L2 ND5 M 16S ND1 ATP6 S2 CO1 ND3 A P ND4 V

Figure 4.3: Example of input data. The abbreviations of mitochondrial gene names are listed in Appendix A.
[Figure 4.4: two panels, each showing a multiple cyclic alignment of the mitochondrial gene orders of Branchiostoma floridae, Sus scrofa, Cavia porcellus, Chelonia mydas, Eumeces egregius, Hippopotamus amphibius, Mustelus manazo, Daphnia pulex, Ceratitis capitata, and Anopheles quadrimaculatus; the alignment columns are labelled with gene names.]
Figure 4.4: Effect of scores. Top: additive model with $\delta(P) = 20$, $\delta(r) = 10$, $\delta(t) = 3$, and $\eta(a) = 0$. Bottom: affine model with additive scores as above and $\eta(a_i) = 2\delta(a_i)$. As expected, the large one-time score, which essentially acts like a gap-open penalty, leads to alignments with a small number of large gaps. Protein-coding genes are shown in red/dark grey, tRNAs in yellow/light grey, rRNA in orange, gaps in black.
It is well known that tRNA genes are much more mobile than protein-coding mitochondrial
genes (Boore, 1999). We therefore propose a scoring scheme that consists of two contributions
for each inserted or deleted interval $a = [a_i, a_{i+1}, \ldots, a_{j-1}, a_j]$ of the cyclic list. We define (1)
an additive score contribution $\delta(a_i)$, to which each deleted list entry $a_i$ contributes independently,
and (2) a "one-time" contribution that allows us to distinguish between intervals that consist of
tRNAs only and those that also contain proteins. This "one-time" score essentially plays the role
of the gap-open penalty in the usual models of sequence alignments, see Figure 4.4. It is
convenient to define the one-time score $\eta(a_i)$ for each list entry individually and to compute the
indel score for the interval $a$ as

$$g(a) = \max_{i \le k \le j} \eta(a_k) + \sum_{k=i}^{j} \delta(a_k) \tag{4.4}$$
The default scores for mitochondrial genomes distinguish between three types of genes: proteins
P, ribosomal RNA genes r, and tRNAs t. The downside of this scoring model is that we are forced
to use a computationally expensive algorithm to compute the linear list alignments. Default
values for this scoring model have been chosen so that a number of test data sets with well-
established phylogenies were well reconstructed from the resulting alignments. In the following
we use
$$\begin{array}{c|ccc} & P & r & t \\ \hline \delta & 4 & 3 & 3 \\ \eta & 6 & 4 & 4 \end{array} \tag{4.5}$$
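To make Equation (4.4) concrete, the following minimal Python sketch evaluates the indel score with the default values from (4.5). It is an illustration only and not the circal code, which is written in ANSI C; the function name `indel_score`, the representation of an interval as a list of gene classes, and the score of 0 for an empty interval are assumptions made for this example.

```python
# Default scores from (4.5); gene classes: P = protein, r = rRNA, t = tRNA.
DELTA = {"P": 4, "r": 3, "t": 3}   # additive contribution of each list entry
ETA = {"P": 6, "r": 4, "t": 4}     # one-time contribution of each list entry


def indel_score(classes):
    """Indel score g(a) of an interval given as a list of gene classes (Eq. 4.4)."""
    if not classes:
        return 0  # assumption: an empty interval costs nothing
    one_time = max(ETA[c] for c in classes)     # max over eta(a_k)
    additive = sum(DELTA[c] for c in classes)   # sum over delta(a_k)
    return one_time + additive


# Hypothetical intervals: two tRNAs versus a protein flanked by two tRNAs.
trna_only = indel_score(["t", "t"])           # 4 + (3 + 3)
with_protein = indel_score(["t", "P", "t"])   # 6 + (3 + 4 + 3)
```

Because the one-time term is a maximum rather than a sum, an interval pays the higher protein value once as soon as it contains at least one protein, however many tRNAs accompany it.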
Alignment distances, of course, do not always correctly reproduce the relative ordering of distances
with respect to reversals or transpositions. We do not consider this a problem, since the mechanisms
of mitochondrial genome rearrangements are not very well understood. Nevertheless, one may
use a scoring model that is dominated by the one-time score $\eta_k$, i.e. one for which $\delta_k \ll \eta_k$, instead
of the default scores of the circal program. Such a scoring scheme in essence counts the
number of rearrangement events. It measures cut-and-paste events instead of reversals, however:
it does not distinguish whether a translocated block is re-inserted in the same or in the reverse
orientation, and it is insensitive to the location of reinsertion.
4.4.4 Multiple Cyclic Alignments
The pairwise cyclic alignment procedure outlined in the previous two subsections can be
generalized to multiple sequences by means of the same progressive alignment approach that is used,
for example, in clustalw (Thompson et al., 1994). We first compute all pairwise distances
using the cyclic alignment procedure. From the resulting distance matrix we construct a guide
tree, in our case using the WPGMA clustering method (Sokal and Michener, 1958). This guide
tree is used to align profiles of aligned cyclic lists in the same way as individual lists. Finally,
the scoring scheme is extended in the obvious way from individual lists to alignments: both the
one-time score η and the additive score δ for a profile position are computed as the sum over all
entries in each column of the two profiles (alignments) that are to be combined. Equation (4.4)
is then used to determine the indel score for an interval.
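The guide tree step can be sketched as follows. This is a minimal illustration of WPGMA clustering on a small distance matrix and is not taken from the circal sources; the function name `wpgma`, the dictionary keyed by `frozenset` pairs, and the example distances are assumptions. Taxon labels are assumed to be unique.

```python
def wpgma(names, dist):
    """Build a guide tree (nested tuples) from a symmetric distance matrix.

    names: list of unique taxon labels
    dist:  dict mapping frozenset({a, b}) -> pairwise distance
    """
    clusters = list(names)   # active clusters: labels or nested tuples
    d = dict(dist)           # working copy, extended with merged clusters
    while len(clusters) > 1:
        # choose the closest pair of active clusters
        a, b = min(
            ((x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]),
            key=lambda pair: d[frozenset(pair)],
        )
        merged = (a, b)
        rest = [c for c in clusters if c != a and c != b]
        for c in rest:
            # WPGMA: simple average of the two child distances,
            # regardless of how many taxa each child contains
            d[frozenset((merged, c))] = (d[frozenset((a, c))] + d[frozenset((b, c))]) / 2
        clusters = rest + [merged]
    return clusters[0]


# Hypothetical pairwise distances between four gene orders.
example = {
    frozenset(("A", "B")): 2, frozenset(("A", "C")): 6, frozenset(("A", "D")): 10,
    frozenset(("B", "C")): 6, frozenset(("B", "D")): 10, frozenset(("C", "D")): 8,
}
guide_tree = wpgma(["A", "B", "C", "D"], example)
```

In a progressive alignment the nested tuples would be traversed from the innermost pair outwards, aligning two lists or profiles at each internal node.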
Figure 4.5: Graphical display of an alignment of vertebrate mitochondrial genome arrangements. Proteins, rRNAs, and tRNAs are shown in red, orange, and yellow, resp. Most vertebrates share a common gene order. However, there are numerous small deviations that mostly involve transposed tRNA genes, in particular in the bird and reptile lineages, see e.g. Mindell et al. (1998) and Townsend and Larson (2002).
4.4.5 Implementation
The algorithm described here is implemented in the program package circal, which is written
in ANSI C. The circal package is distributed under the GNU General Public License (GPL)¹.
The current implementation of circal produces a NEXUS format file of the alignment as well
as a graphical overview in PostScript format, see Figure 4.5. Datasets of about 30 rather
diverse gene arrangements can be computed within about a quarter of an hour on a common PC
(Linux operating system on a dual Pentium IV with two 2.4 GHz CPUs and 1 GByte RAM). The
full protostome data set takes about one day on the same PC.
¹ Download from http://www.bioinf.uni-leipzig.de/Publications/SUPPLEMENTS/04-015/.
4.4.6 Tree Reconstruction
Phylogenetic trees can be inferred from the cyclic list alignments by any of the usual approaches.
One might simply use the multiple alignment to re-compute a pairwise distance matrix, possibly
using a more sophisticated distance measure than those used for constructing the alignment.
We do not pursue this approach here, since the alignment contains much more information than
just the mutual distances. Maximum likelihood methods are applicable in principle (Larget and
Simon, 2002), although it seems non-trivial to derive a good rate model for mitochondrial genome
arrangements. We therefore resort to maximum parsimony. Since each column of the alignment
marks only the presence (1) or absence (0) of a gene in a particular alignment position,
it is straightforward to apply standard programs such as PAUP* (Swofford and Olsen, 1990) or
phylip (Felsenstein, 1989) to the corresponding 0/1 strings. Alternatively, the position of a
gene in the list alignment can be interpreted as a distinct character state, as described in
subsection 4.4.8. This results in a string representation in which each mitochondrial gene corresponds
to a single column. In practice, we observe little difference between the two approaches.
Bootstrap support values can of course be computed in the same way as for conventional sequence
alignments.
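The presence/absence coding can be sketched in a few lines of Python. The in-memory layout of the alignment (one list per taxon, with `None` marking a gap) and the taxon and gene names are assumptions for illustration; circal itself writes the alignment to a NEXUS file.

```python
GAP = None  # assumed gap marker in this toy representation


def presence_absence(alignment):
    """Turn each aligned gene list into a 0/1 string (1 = gene present in the column)."""
    return {
        taxon: "".join("0" if entry is GAP else "1" for entry in row)
        for taxon, row in alignment.items()
    }


# Hypothetical list alignment of three taxa; tRNA L2 occupies two alternative columns.
alignment = {
    "taxon_A": ["CO1", "L2", GAP, "CO2"],
    "taxon_B": ["CO1", GAP, "L2", "CO2"],
    "taxon_C": ["CO1", "L2", GAP, "CO2"],
}
matrix = presence_absence(alignment)
```

Each resulting string is one row of the binary character matrix that a parsimony program takes as input.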
4.4.7 Consensus Gene Arrangements
An apparent shortcoming of the list-alignment approach is that the same object (gene) may
appear multiple times in the alignment, i.e. there are multiple columns of the final alignment that
refer to the same protein or tRNA. We argue that this actually is an advantage. We can
identify a subgroup of aligned genome arrangements (or use the whole alignment) to obtain a
consensus genome arrangement. To this end, we simply compare all columns that refer to the same
gene (in both orientations) and select the one with the most non-gap entries. In the case of duplicated
tRNAs, say, we may take the two most populated columns. The result is a “valid” mitochondrial
genome arrangement that describes the consensus of the group in question.
By leaving out all columns that contain fewer than a minimum number of non-gap entries we can
directly extract conserved parts of the gene order even if they do not correspond to conserved
intervals.
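The column selection described above can be sketched as follows. The sketch assumes the same toy representation as before (one list per taxon, `None` for gaps, a leading "-" for the reverse orientation) and that every column refers to exactly one gene. The function name `consensus_order` and its parameters `copies` and `min_entries` are hypothetical.

```python
def consensus_order(alignment, copies=None, min_entries=1):
    """Pick, per gene, the most populated column(s); return the genes in column order."""
    rows = list(alignment.values())
    by_gene = {}  # unsigned gene name -> [(non-gap count, column index, signed label)]
    for col in range(len(rows[0])):
        entries = [row[col] for row in rows if row[col] is not None]
        if not entries or len(entries) < min_entries:
            continue                      # drop empty or sparsely populated columns
        label = entries[0]                # all entries of a column name the same gene
        by_gene.setdefault(label.lstrip("-"), []).append((len(entries), col, label))
    chosen = []
    for gene, cols in by_gene.items():
        k = (copies or {}).get(gene, 1)   # e.g. 2 for a duplicated tRNA
        best = sorted(cols, key=lambda c: (-c[0], c[1]))[:k]
        chosen.extend((col, label) for _, col, label in best)
    return [label for _, label in sorted(chosen)]


# Hypothetical alignment: L2 sits in column 1 for two taxa and in column 2 for one.
alignment = {
    "taxon_A": ["CO1", "L2", None, "CO2"],
    "taxon_B": ["CO1", None, "L2", "CO2"],
    "taxon_C": ["CO1", "L2", None, "CO2"],
}
consensus = consensus_order(alignment)
```

Raising `min_entries` corresponds to leaving out weakly supported columns, so that only the conserved part of the gene order remains.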
4.4.8 Ancestral Genome Organization
It might be surprising at first glance that the multiple alignment can be used to reconstruct the
ancestral genome organization, since each gene k can appear in multiple columns of the list
alignment. Suppose the number of these columns is $m_k$. Clearly, gene k is present in at most
one of these columns in each taxon. A deletion of k is represented by the absence of k in all $m_k$
columns belonging to gene k. Thus, we can regard the position of k in the list alignment as a
character with $m_k + 1$ possible states. Consequently, we can use standard parsimony approaches
to obtain the ancestral state (position and orientation) of each gene, see Figure 4.6. On the other
hand, if we allow duplicate genes or assume a duplication-deletion mechanism for mitochondrial
genome rearrangements, such as those proposed by Macey et al. (1998) and Lavrov et al.
(2002), we may use the simple presence/absence patterns of genes in the list alignment itself to
reconstruct ancestral gene orders by means of maximum parsimony.
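The multistate coding with $m_k + 1$ states per gene can be illustrated with a short sketch. The data layout (one list per taxon, `None` for gaps, a leading "-" for reverse orientation) and the function name `position_characters` are assumptions; state 0 encodes absence of the gene, and states 1 to $m_k$ number the columns of that gene from left to right.

```python
def position_characters(alignment):
    """Encode, per gene, which of its m_k columns each taxon uses (0 = gene absent)."""
    n_cols = len(next(iter(alignment.values())))
    columns_of = {}  # unsigned gene name -> ordered list of columns that carry it
    for col in range(n_cols):
        for row in alignment.values():
            if row[col] is not None:
                columns_of.setdefault(row[col].lstrip("-"), []).append(col)
                break
    genes = sorted(columns_of)
    states = {}
    for taxon, row in alignment.items():
        states[taxon] = []
        for gene in genes:
            used = [i + 1 for i, col in enumerate(columns_of[gene]) if row[col] is not None]
            states[taxon].append(used[0] if used else 0)
    return genes, states


# Hypothetical alignment: L2 has m_k = 2 columns and is missing in taxon_C.
alignment = {
    "taxon_A": ["CO1", "L2", None, "CO2"],
    "taxon_B": ["CO1", None, "L2", "CO2"],
    "taxon_C": ["CO1", None, None, "CO2"],
}
genes, states = position_characters(alignment)
```

The state vectors have one entry per gene, which matches the representation in which each mitochondrial gene corresponds to a single character column.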
As an example, we reconstructed the gene order of the echinoderm ancestor starting
from 8 mitochondrial genomes. The presence/absence pattern of genes in the list alignment
was converted into a 0/1 character matrix and analyzed using PAUP*. The resulting
presence/absence pattern was then re-translated into the gene order shown in Figure 4.6.
4.4.9 Mitochondrial Genomes
Analyses of mitochondrial genomes have contributed significantly to the reconstruction of deep
metazoan phylogeny (Boore and Brown, 1998). For example, the phylogenetic position of
Tentaculata (Lophophorata), either as protostomes, sister group of deuterostomes, or even members
of the deuterostomes, was a matter of long and controversial debate. Mitochondrial genome
analyses of the brachiopod Terebratulina retusa, based on both gene order and nucleotide sequences,
convincingly support a protostome relationship with an affiliation to the spiral cleaving molluscs
and annelids (Stechmann and Schlegel, 1999). However, some results of mitochondrial
genome comparisons challenge classical evidence, such as the monophyly of insects
(Nardi et al., 2003).
We applied our novel method to a comprehensive data set of metazoan mitochondrial genomes.
> ancestral_echinoderm
CYTB F 12S E T P -Q N L1 -A W C -V M -D Y G L2 ND1 I ND2 16S CO1 R ND4L CO2 K ATP8 ATP6 CO3 S2 ND3 ND4 H -S1 ND5 -ND6
Figure 4.6: Reconstructed gene order of the ancestral echinoderm starting from the gene orders of 3 sea urchins (Strongylocentrotus purpuratus, Paracentrotus lividus, Arbacia lixula), 2 brittle stars (Ophiopholis aculeata, Ophiura lutkeni), 1 sea star (Asterina pectinifera), 1 crinoid (Florometra serratissima), and 1 sea cucumber (Cucumaria miniata).
[Tree figure. Tip labels with number of taxa, as extracted: Arthropoda (69), Mollusca − Cephalopoda (1), Mollusca − Gastropoda (1), Mollusca − Polyplacophora (1), Nematoda − Spirurida (3), Nematoda − Rhabditida (5), Annelida (2), Plathelminthes (10), Nematoda − Enoplea (1), Mollusca − Bivalvia (1), Mollusca − Scaphidida (1), Mollusca − Bivalvia (1), Mollusca − Gastropoda (6), Echinodermata (9).]
Figure 4.7: Maximum parsimony tree of the mitochondrial gene order of the protostomian data set. MP analysis was performed using PAUP* with the heuristic search method (10 random stepwise additions and the TBR branch swapping, 100 bootstrap replicates). For the sake of clarity, triangles represent subtrees which are not shown in full resolution. Echinodermata were used as outgroup. Data and accession numbers are provided in Appendix A.
The phylogenetic reconstructions reported here were performed by aligning the mitochondrial gene
orders (proteins, rRNAs, and tRNAs) using circal and reconstructing maximum parsimony
trees using PAUP*. Gaps were treated as missing data. The data were also bootstrapped using the
MP method (1000 replicates).
An analysis using 166 metazoan taxa showed a good separation between the deuterostomian and
the protostomian lineages (Fritzsch et al., 2004a,b). In Figure 4.7 we show a phylogenetic
reconstruction of 103 taxa from diverse protostome groups using 9 echinoderms as outgroup. Our
approach supports the monophyly of arthropods, annelids, platyhelminthes, and nematodes, with
the exception of a single taxon (Enoplea). However, the monophyly of molluscs was not
recovered. The genome arrangements of molluscs are very variable (Hoffmann et al., 1992; Dreyer
and Steiner, 2004). This is clearly a case where the ancestral gene order has been wiped out
by rapid rearrangements. A multifurcation within the arthropod clade makes further systematic
analysis impossible. Nevertheless, some tendencies can be recognized: for example, we find
support for a clade of chelicerata (10 taxa) and myriapoda (4 taxa).
An ongoing debate in protostome phylogeny concerns the affiliation of the arthropods either to
the annelids in the traditional Articulata, or to the nematodes in the clade Ecdysozoa (Adoutte
et al., 2000). Our results do not support the latter hypothesis, which is based on other molecular
evidence such as nuclear rRNA, Hox gene sequences, and EST analyses (Dunn et al., 2008).
Furthermore, I analyzed 60 complete mitochondrial genomes of vertebrates, hemichordates, and
cephalochordates. The vertebrate gene order underwent only very few rearrangements, see
Figure 4.5. Thus, phylogenetic information cannot be recovered with the data set analyzed. It is
noteworthy that within the chordates the hemichordates share the only rearrangement of a
mitochondrial protein with the birds. Clearly, these are independent rearrangement events. This
example suggests that changes in mitochondrial genomes are not unbiased random events;
multiple list alignments can be used to determine likelihood differences between rearrangements
from sufficiently large datasets.
While testing the circal program I observed that the resolution of the reconstructed trees
and their agreement with well-established phylogenetic hypotheses improve with increasing
taxon sampling. The reason is probably that a dense taxon coverage leads to smaller
differences between the gene orders of adjacent taxa, which improves the quality of the multiple
alignment because the underlying pairwise alignments become less error prone. This contrasts
with the observation of Rosenberg and Kumar (2001) that increased taxon sampling does not lead to
substantial improvements in sequence-based methods.
4.4.10 Chloroplast Genomes
Besides mitochondrial genomes, plastid genomes are a second field of application (Doyle et al.,
1992; Odintsova and Yurina, 2003). While data are still sparse, a genome database has recently
become available (Kurihara and Kunisawa, 2004). Chloroplast genomes, with about 100 protein
coding genes, are much larger than animal mitochondrial genomes. In order to demonstrate that the list
alignment approach is feasible for realistic datasets, we analyzed the protein gene order of the 20
chloroplast genomes listed in Table 4.1 (Wolf et al., 2005).
Figure 4.8: (top) Phylogenetic tree derived from circular list alignments of 20 chloroplast genomes from land plants. Numbers give bootstrap support from 1000 replicates in percent. (below) The circal alignment using uniform edit costs clearly identified blocks of protein genes whose relative ordering is conserved among land plants.
Orthology of chloroplast protein coding genes was checked by blast (Altschul et al., 1997)
both with other chloroplast genomes and GenBank (Maglott et al., 2005). Unknown open
reading frames without clear orthologs in other chloroplasts were removed from the gene lists. We
ran circal with uniform scoring using protein coding genes only. The maximum parsimony
tree resulting from the circular list alignment is shown in Figure 4.8. It correctly groups the
angiosperms with the exception of the filicopsida, which should appear as the sister group of
angiosperms. The resolution of the basal groups is poor, as expected, given that only single
representatives are available.

Table 4.1: GenBank accession numbers and sources of chloroplast gene maps for sampled taxa (Wolf et al., 2005)

Taxon                                                                             GenBank accession
Charophytes
  Chaetosphaeridium globosum (Nordstedt) Klebahn                                  NC_004115
Liverworts
  Marchantia polymorpha L.                                                        NC_001319
Mosses
  Physcomitrella patens (Hedw.) Bruch and W. P. Schimper                          NC_005087
Hornworts
  Anthoceros formosae Stephani                                                    NC_004543
Lycophytes
  Huperzia lucidula (Michx.) Trevisan                                             AY660566
Moniliforms
  Adiantum capillus-veneris L.                                                    NC_004766
  Psilotum nudum (L.) P.Beauv.                                                    NC_003386
Conifers
  Pinus koraiensis Siebold and Zucc.                                              NC_004677
  Pinus thunbergii Franco                                                         NC_001631
Angiosperms
  Amborella trichopoda Baill.                                                     NC_005086
  Arabidopsis thaliana (L.) Heynh.                                                NC_000932
  Atropa belladonna L.                                                            NC_004561
  Calycanthus floridus L.                                                         NC_004993
  Lotus japonicus (Regel) K.Larsen                                                NC_002694
  Nicotiana tabacum L.                                                            NC_001879
  Oenothera elata Kunth ssp. hookeri (Torr. & A.Gray) W. Dietr. and W. L. Wagner  NC_002693
  Oryza sativa L.                                                                 NC_001320
  Spinacia oleracea L.                                                            NC_002202
  Triticum aestivum L.                                                            NC_002762
  Zea mays L.                                                                     NC_001666
4.5 The CREx Algorithm
Based on the assumption that gene order information, in this case of the mitochondrial
genome, includes useful phylogenetic signals, we developed a new algorithm called CREx (Common
interval Rearrangement Explorer) (Bernt et al., 2007) to analyse this gene order information.
The CREx algorithm heuristically computes pairwise rearrangement scenarios for gene order
data. Possible rearrangement events in such scenarios are reversals, transpositions, reverse
transpositions, and the more complex tandem duplication random loss (TDRL) operations. CREx
can detect such events as patterns in the signed strong interval tree (Berard et al., 2007), a data
structure representing gene groups that appear consecutively in a set of gene orders. The basic
strategy underlying the study is to identify unambiguous information that, of course, does not
provide full resolution of the phylogeny but can be used as an 'anchor'. In this case the directional
information of TDRLs is used to provide very strong evidence for monophyletic groups. This
project was developed in close cooperation with the group of Prof. Dr. Martin Middendorf,
head of the Department of Computer Science, in Leipzig. In the following I give a short
survey of the necessary formal basics and of the CREx algorithm.
4.5.1 Basic Definitions
Rearrangement Operations
A permutation of size n is a permutation of the elements {1, 2, . . . , n}. A signed permutation
of size n is a permutation of size n where every element has an additional sign (“+” or “−”)
that defines its orientation (“+” is usually omitted). In the following, a signed permutation
$\pi = (\pi_1, \ldots, \pi_n)$ is just called a permutation. A reversal $\rho_R(i, j)$, $1 \le i \le j \le n$, applied to a
permutation π of size n transforms it into
$\pi \circ \rho_R(i, j) = (\pi_1, \ldots, \pi_{i-1}, -\pi_j, \ldots, -\pi_i, \pi_{j+1}, \ldots, \pi_n)$.
A transposition $\rho_T(i, j, k)$, $1 \le i \le j < k \le n$, applied to π transforms it into
$\pi \circ \rho_T(i, j, k) = (\pi_1, \ldots, \pi_{i-1}, \pi_{j+1}, \ldots, \pi_k, \pi_i, \ldots, \pi_j, \pi_{k+1}, \ldots, \pi_n)$.
A reverse transposition $\rho_{rT}(i, j, k)$, with $1 \le i \le j \le n$ and ($1 \le k < i$)
or ($j < k \le n$), applied to π transforms it (here shown for $j < k$) into
$\pi \circ \rho_{rT}(i, j, k) = (\pi_1, \ldots, \pi_{i-1}, -\pi_k, \ldots, -\pi_{j+1}, \pi_i, \ldots, \pi_j, \pi_{k+1}, \ldots, \pi_n)$.
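The three operations defined so far can be written down directly for signed permutations stored as Python lists of non-zero integers. This is an illustrative sketch that follows the definitions above with 1-based, inclusive indices; the function names are assumptions, and the reverse transposition covers only the case j < k shown in the text.

```python
def reversal(pi, i, j):
    """Reverse the order of pi_i..pi_j and flip their signs."""
    return pi[:i - 1] + [-x for x in reversed(pi[i - 1:j])] + pi[j:]


def transposition(pi, i, j, k):
    """Exchange the adjacent blocks pi_i..pi_j and pi_{j+1}..pi_k."""
    return pi[:i - 1] + pi[j:k] + pi[i - 1:j] + pi[k:]


def reverse_transposition(pi, i, j, k):
    """Case j < k: move pi_{j+1}..pi_k, inverted, in front of pi_i..pi_j."""
    return pi[:i - 1] + [-x for x in reversed(pi[j:k])] + pi[i - 1:j] + pi[k:]


identity = [1, 2, 3, 4, 5]
after_reversal = reversal(identity, 2, 4)
after_transposition = transposition(identity, 2, 3, 5)
after_reverse_transposition = reverse_transposition(identity, 2, 3, 5)
```

Applying the same reversal twice restores the original permutation, which reflects the symmetry of this operation.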
A tandem duplication random loss operation $\rho_{TDRL}$ duplicates a contiguous segment of genes in
tandem, followed by the loss of one copy of each of the duplicated genes. In this case, such an
operation is considered a TDRL only when it changes the gene order and is different from a
transposition.
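A TDRL can be sketched in the same list representation. The function name `tdrl` and the parameter `keep_first`, which names the genes that survive in the first copy, are assumptions for this illustration; all other genes of the segment survive in the second copy.

```python
def tdrl(pi, i, j, keep_first):
    """Tandem duplication random loss on the segment pi_i..pi_j (1-based, inclusive).

    keep_first: set of genes retained in the first copy of the duplicated segment;
                the remaining genes of the segment are retained in the second copy.
    """
    segment = pi[i - 1:j]
    first = [g for g in segment if g in keep_first]        # survivors of copy 1
    second = [g for g in segment if g not in keep_first]   # survivors of copy 2
    return pi[:i - 1] + first + second + pi[j:]


# Duplicate 1 2 3 4 to 1 2 3 4 1 2 3 4, keep 2 and 4 in copy 1 and 1 and 3 in copy 2.
result = tdrl([1, 2, 3, 4], 1, 4, {2, 4})
```

Note that some choices of `keep_first` leave the order unchanged or reproduce a transposition; following the convention above, such outcomes would not be counted as TDRLs.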
A scenario for two signed permutations π and σ is a sequence of rearrangement operations
that transform π into σ. A sequence with a minimal (weighted) number of operations is called
parsimonious.
Common Intervals and Strong Interval Trees
An interval of a permutation π is a set of consecutive elements of the permutation π. Let Π be
a set of signed permutations of size n. A common interval (Uno and Yagiura, 2000; Heber and
Stoye, 2001) of Π is a subset of {1, 2, . . . , n} that is an interval in each π ∈ Π. The singletons
{i}, i ∈ {1, 2, . . . , n}, and the set {1, 2, . . . , n} of all elements are called trivial common intervals.
Let C(Π) be the set of all common intervals of Π. Two intervals c and c′ overlap if c ∩ c′ ≠ ∅,
c ⊄ c′, and c′ ⊄ c. If two intervals do not overlap, they commute. A common interval is called
a strong common interval if it does not overlap with any other common interval. The set of
all strong common intervals can be computed in O(kn) time for k signed permutations of size
n (Bergeron et al., 2005, 2008). The strong interval tree (SIT; Berard et al., 2007) of Π is a
tree T(Π) whose nodes are exactly the strong common intervals of Π, such that the root node
is the interval containing all elements, the leaves are the singletons, and the edges are defined
by the minimal inclusion relation of the intervals (i.e. there is an edge between nodes c and c′
if c′ ⊂ c and there is no node c′′ with c′ ⊂ c′′ ⊂ c). Each node is given a sign (+ or −). If
the children of a node appear in the same order in both input gene orders, the node is called
linear increasing (+); if the children of a node appear in opposite order in the two gene orders,
it is linear decreasing (−); otherwise the node is called prime (see Figure 4.9(b)). For a more
comprehensive introduction to SITs see Berard et al. (2007). The importance of the SIT is that it
greatly facilitates the identification of the genome rearrangement operations in the CREx
algorithm (Bernt et al., 2007). A genomic rearrangement operation ρ applied to π ∈ Π is said to be
preserving for Π if it does not destroy any common interval c ∈ C(Π) (i.e., C(Π) = C(Π ∪ {π ◦ ρ})). An
operation is not preserving if there exists a common interval that does not persist after
applying the rearrangement operation.
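For small examples, common and strong common intervals of two gene orders can be enumerated by brute force. The sketch below is deliberately naive and is not the O(kn) algorithm cited above; it ignores signs, handles exactly two permutations, and the function names are assumptions.

```python
def common_intervals(p, q):
    """All common intervals of two permutations of the same elements (signs ignored)."""
    pos = {abs(g): idx for idx, g in enumerate(q)}
    found = set()
    for i in range(len(p)):
        lo = hi = pos[abs(p[i])]
        for j in range(i, len(p)):
            lo = min(lo, pos[abs(p[j])])
            hi = max(hi, pos[abs(p[j])])
            if hi - lo == j - i:          # p[i..j] is also contiguous in q
                found.add(frozenset(abs(g) for g in p[i:j + 1]))
    return found


def overlap(c, d):
    """Two intervals overlap if they intersect and neither contains the other."""
    return bool(c & d) and not c <= d and not d <= c


def strong_intervals(p, q):
    """Common intervals that overlap with no other common interval."""
    ci = common_intervals(p, q)
    return {c for c in ci if not any(overlap(c, d) for d in ci)}


all_common = common_intervals([1, 2, 3, 4], [1, 3, 2, 4])
strong = strong_intervals([1, 2, 3, 4], [1, 3, 2, 4])
```

In this example {1, 2, 3} and {2, 3, 4} are common intervals that overlap each other, so neither is strong; the strong intervals are the singletons, {2, 3}, and the full set, which are exactly the nodes of the strong interval tree.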
4.5.2 Methods
Berard et al. (2007) recently proposed the use of strong interval trees for the computation of
optimal sorting scenarios based on inversions that conserve combinatorial structures of genomes.
A closely related data structure, known as the PQ-tree (Booth and Lueker, 1976), was shown
by Parida (2006) to be suitable for studying large scale rearrangement operations, with the focus
on inversions and transpositions.
Figure 4.9: (a): Mitochondrial gene order rearrangement scenario inferred by CREx for the given phylogeny of Asteroidea (A), Echinoidea (E), and Holothuroidea (H) showing one reversal (REV) and one TDRL; (b): SIT of E for E→H (TDRL scenario); (c): TDRL of E→H as suggested by CREx; Figures 4.9(b) and 4.9(c) are exported from CREx.
Formally, a common interval is a subset of genes that appear consecutively in two (or more)
input gene orders (Berard et al., 2007). Two intervals I and J are said to commute if either
I ⊂ J, or I ⊃ J, or I ∩ J = ∅ holds. A common interval is a strong interval if it commutes with
every common interval. Finally, a strong interval tree for two gene orders is a rooted tree that has
exactly one leaf for each gene and exactly one inner node for each strong common interval. The
edges of the tree are defined by the inclusion order of the set of strong intervals. Two types of
inner nodes exist. An inner node is called (increasing or decreasing) linear if its child nodes
are in a left-to-right or right-to-left order; otherwise the child nodes are not ordered and the inner
node is called prime, as described in Section 4.5.1.
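The distinction between linear and prime nodes can be illustrated with a small helper that looks only at the order of a node's children. It assumes that the children are given as sets of unsigned genes, listed in the order in which they occur in the first gene order; the function name `node_type` and the returned labels are assumptions.

```python
def node_type(children, second_order):
    """Classify an SIT node from its children, listed in first-order sequence.

    children:     list of sets of unsigned genes (one set per child node)
    second_order: the second input gene order
    """
    pos = {abs(g): idx for idx, g in enumerate(second_order)}
    # leftmost position of every child in the second gene order
    starts = [min(pos[g] for g in child) for child in children]
    if starts == sorted(starts):
        return "linear increasing (+)"
    if starts == sorted(starts, reverse=True):
        return "linear decreasing (-)"
    return "prime"


# Root node for the gene orders 1 2 3 4 and 2 4 1 3: the children are scrambled.
root = node_type([{1}, {2}, {3}, {4}], [2, 4, 1, 3])
```

Since every child is a common interval, it occupies a contiguous block in the second gene order, so comparing the leftmost positions is sufficient to compare the blocks.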
The basic principle for computing heuristic gene order rearrangement scenarios with CREx is
to detect patterns in the SITs that reflect the corresponding genome rearrangement operations.
For example, reversals are very simple to detect because this operation is reflected as a sign
difference between parent node and child nodes in the SIT. Prime nodes are good indicators for TDRL
events. Both transpositions and reverse transpositions also lead to recognizable patterns in
the SIT. CREx uses a stepwise approach to suggest a genome rearrangement scenario: first it
identifies transpositions and reverse transpositions, then reversals are identified based on the sign
differences between connected nodes in the SIT. In these first two steps CREx only operates
on linear nodes. In the third step, the prime nodes are analyzed to identify combinations of
reversal and TDRL operations (including transpositions) which can explain the corresponding
prime nodes (Bernt et al., 2007). The order of these steps is important for identifying all of
the rearrangement events and their combinations.
Figure 4.10: Genomic rearrangement events considered in CREx: (a) inversion, (b) transposition, (c) reverse transposition, and (d) tandem duplication random loss
TDRLs are of particular interest for phylogenetic analysis since the distance measure between
gene orders that is based on the minimum number of TDRLs is not symmetric (Chaudhuri et al.,
2006). In particular, in many cases a rearrangement can be explained by a single TDRL in one
direction, while reversing the rearrangement would require more than a single operation. It
follows that TDRLs imply the evolutionary direction of the rearrangement in many cases and hence
allow the reconstruction of the ancestral state from a comparison of two gene orders without
considering an outgroup. This feature makes TDRLs particularly valuable for phylogenetic
studies and suggests that a detailed reconstruction of the rearrangement history of gene orders can
lead to more detailed and more certain phylogenetic conclusions. In contrast, reversals,
transpositions, and reverse transpositions are inherently symmetric; hence ancestral states cannot be
reconstructed without additional outgroup information.
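The asymmetry can be checked on a small example. The sketch below tests whether one TDRL is enough to turn one gene order into another, under the assumptions that both orders contain the same genes with identical signs and that the duplicated segment may span the whole gene order. The function name `one_tdrl_suffices` is hypothetical, and the check does not separate transpositions from TDRLs in the strict sense.

```python
def one_tdrl_suffices(source, target):
    """True if a single duplication of `source` followed by losses yields `target`."""
    if source == target:
        return False                       # nothing to explain
    pos = {g: idx for idx, g in enumerate(source)}
    ranks = [pos[g] for g in target]       # target rewritten in source coordinates
    # After one TDRL the result consists of the survivors of copy 1 followed by
    # the survivors of copy 2, each in source order: at most one descent.
    descents = sum(1 for a, b in zip(ranks, ranks[1:]) if a > b)
    return descents <= 1


forward = one_tdrl_suffices([1, 2, 3, 4], [2, 4, 1, 3])
backward = one_tdrl_suffices([2, 4, 1, 3], [1, 2, 3, 4])
```

Here the forward direction needs only one TDRL, whereas the backward direction cannot be achieved by a single TDRL, so the comparison of the two gene orders alone suggests which one is ancestral.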
If it can be deduced from the gene orders of other taxa that the corresponding gene sequence in
the ancestor genome of both sister taxa equals the gene order in one sister taxon, the
rearrangement operation is assigned to the branch leading to the other sister taxon.
4.5.3 The CREx Algorithm
CREx (Bernt et al., 2007) is an algorithm to heuristically determine preserving rearrangement
operations for pairs of unichromosomal genomes. The algorithm uses the fact that each of the
four rearrangement operations that are considered here corresponds to a pattern in the SIT. To
illustrate this, each of the four rearrangement operations is applied to the identity permutation and
the resulting SIT is computed. Figures 4.10 and 4.11 show the applied rearrangement operations
and the resulting SITs. More formally, the following patterns appear for the different operations
when applied to a permutation π:
• If a reversal $\rho_R(i, j)$ is applied, a linear node with a linear parent node of opposite sign
occurs in the corresponding SIT (see also Berard et al. (2007)). The linear node reflects
the common interval of all elements that are inverted.
Figure 4.11: Strong interval tree of the identity permutation and the resulting permutation after applying the corresponding genomic rearrangement event as given in Figure 4.10. The first of the two gene orders (not shown) in the example is 1234 and the other gene order has been obtained by one of the following operations (the corresponding order can be found in the root of the tree): (a) inversion, (b) transposition, (c) reverse transposition, (d) tandem duplication random loss. Prime nodes are depicted by ellipses, and linear nodes by rectangles, where the sign in the square on top of a node indicates if the node is increasing or decreasing (+/-).
• If a transposition $\rho_T(i, j, k)$ is applied, the corresponding SIT has a linear node with
elements $\{\pi_i, \ldots, \pi_k\}$ that has two linear children reflecting the common intervals
$\{\pi_i, \ldots, \pi_j\}$ and $\{\pi_{j+1}, \ldots, \pi_k\}$. The sign of the node needs to be different from the
signs of the child nodes.
• If a reverse transposition $\rho_{rT}(i, j, k)$ is applied, the corresponding SIT has a linear node
with elements $\{\pi_i, \ldots, \pi_k\}$. One child is a linear node reflecting the common interval of
elements $\{\pi_i, \ldots, \pi_j\}$ that are not inverted due to the reverse transposition. This child must
have a different sign from its parent. The other involved elements are singletons that are direct
child nodes of node $\{\pi_i, \ldots, \pi_k\}$ and must have the same sign.
• A tandem duplication random loss operation $\rho_{TDRL}$ leads to a prime node reflecting all the
elements involved in the rearrangement operation.
The CREx algorithm computes the strong interval tree for two input permutations $\pi_1$ and $\pi_2$.
Then CREx searches for patterns corresponding to rearrangement operations.
If a pattern is identified, the corresponding rearrangement operation ρ is included in the scenario
to be computed, and the next pattern is searched for in the strong interval tree of $\pi_1 \circ \rho$ and $\pi_2$ (the
pattern for ρ will not occur in this strong interval tree). This process is repeated until a complete
scenario is inferred. If the genomic rearrangement operations to be inferred do not overlap, the
pattern identification of the CREx algorithm works. Obviously, the search order for patterns is
very important. If reversals were identified before transpositions and reverse transpositions, then
all transpositions and reverse transpositions would be inferred as being reversal operations in the
scenario. Therefore, the search order for the patterns of the genomic rearrangement operations is
i.) transpositions, ii.) reverse transpositions, iii.) reversals, and iv.) TDRL operations.
Special care has to be taken when prime nodes occur in the SIT. In the CREx algorithm a prime
node is an indicator for one or several TDRLs. As a TDRL operation will not change the sign
of the elements involved, reversals are used to equalize the signs of all the elements in a prime
node. The CREx algorithm uses a heuristic approach to identify a parsimonious number of
reversals and TDRLs for the corresponding prime node. Let $\pi_1$ and $\pi_2$ be the two permutations
of the elements in the prime node. Two variants are included in the latest version of CREx
to infer the reversals that are needed: i.) (reversals first) a set of reversals is applied to the
origin permutation $\pi_1$ to equalize the signs (with respect to $\pi_2$), and then, starting from the
resulting permutation, the minimum number of TDRLs is computed (Chaudhuri et al., 2006); or
ii.) (reversals last) first a set of reversals is applied to $\pi_2$, resulting in $\pi_2'$ such that all signs
are equalized with respect to $\pi_1$; then a minimal number of TDRLs is inferred to transform
permutation $\pi_1$ into permutation $\pi_2'$. Note that the number of different possible parsimonious
scenarios to equalize the signs grows exponentially with the number of blocks of elements that
have different signs in both permutations. The CREx algorithm uses a brute-force approach in which
each possible minimal set of reversals is tested, resulting in a potentially different number of
TDRLs per reversal set. Scenarios for which the sum of the number of reversals and TDRLs is
minimal are considered as possible scenarios. Furthermore, note that the resulting scenario for
a prime node is not guaranteed to be parsimonious, as a mixed sequence of reversals and TDRLs
may result in a smaller scenario.
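The first step of the sign equalization, finding the elements whose signs differ, can be sketched as follows. Grouping these elements into maximal runs of consecutive elements of the first permutation is one possible reading of the term "blocks" and is an assumption of this sketch, as is the function name `sign_blocks`; the selection of reversal sets and the subsequent TDRL computation are not shown.

```python
def sign_blocks(pi1, pi2):
    """Maximal runs of consecutive elements of pi1 whose sign differs from that in pi2."""
    positive_in_2 = {abs(g): g > 0 for g in pi2}
    blocks, current = [], []
    for g in pi1:
        if (g > 0) != positive_in_2[abs(g)]:
            current.append(g)          # sign differs: extend the current block
        else:
            if current:
                blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


# Hypothetical prime-node permutations: genes 1, 2 and 4 have opposite signs.
blocks = sign_blocks([1, 2, 3, 4, 5], [3, -1, -2, 5, -4])
```

Each such block is a candidate for a reversal that brings the signs of its elements in line with the other permutation.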
4.5.4 The Implementation of the CREx Algorithm
Based on the described algorithm there is a web-based application, also called CREx, for
analyzing gene orders; it is built on the application server Zope. The algorithms for handling the common
interval data structures and for computing scenarios were implemented in C++ and integrated
into Zope via Python modules. ReportLab was used for drawing publication-ready, downloadable
versions of the SITs. CREx (the program and the algorithm), a tutorial, and several
detailed examples are available online at http://pacosy.informatik.uni-leipzig.de/crex.
After uploading the gene order data in FASTA format to CREx, a distance matrix is computed
and displayed. The elements are colored according to the distance values so that gene orders with
a small evolutionary distance can be easily identified. The pairwise distances can be computed
as common interval distance, breakpoint, or reversal distance. Columns, rows, and individual
elements of the distance matrix can be selected; for the selected elements the SITs are computed
and displayed as a tree or as a family diagram (Bergeron and Stoye, 2006). As briefly described
in Section 4.5.2, the structure of the SIT can be used to infer genomic rearrangement operations
connecting a pair of input gene orders. CREx uses this heuristic approach to suggest a rearrangement
scenario based on common intervals. CREx allows the user to select individual operations
of the scenario to highlight the affected common intervals.
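As an illustration of one of the pairwise distances mentioned above, the following Python sketch computes a breakpoint distance between two signed gene orders. It is a minimal sketch under assumptions, not the CREx implementation: genes are encoded as non-zero integers whose sign gives the orientation, both orders contain the same genes, and an adjacency read on the opposite strand counts as conserved. The function names are hypothetical.

```python
def adjacencies(order, circular=True):
    """Return the set of signed adjacencies of a gene order.

    Each adjacency (a, b) is stored together with its reverse-strand
    reading (-b, -a), so that the reading direction does not matter.
    """
    pairs = list(zip(order, order[1:]))
    if circular and len(order) > 1:
        pairs.append((order[-1], order[0]))
    adj = set()
    for a, b in pairs:
        adj.add((a, b))
        adj.add((-b, -a))
    return adj


def breakpoint_distance(order1, order2, circular=True):
    """Number of adjacencies of order1 that are not conserved in order2."""
    adj2 = adjacencies(order2, circular)
    pairs = list(zip(order1, order1[1:]))
    if circular and len(order1) > 1:
        pairs.append((order1[-1], order1[0]))
    return sum(1 for pair in pairs if pair not in adj2)


a = [1, 2, 3, 4, 5]
b = [1, -3, -2, 4, 5]   # reversal of the segment 2 3
print(breakpoint_distance(a, b))
```

The `circular` flag is an assumption added here to reflect that mitochondrial genomes are usually circular; with `circular=False` the adjacency between the last and the first gene is ignored.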
4.5.5 Real World Example
Mitochondrial genomes have been a particularly fruitful data source for phylogenetic reconstructions
due to their limited size and the availability of a large number of informative data sets. In addition to
the sequence data of their protein-coding genes, rRNAs, and tRNAs, the order of the genes on the circular
mitochondrial genome of animals has received extensive attention as a phylogenetic marker since
the seminal work of Watterson et al. (1982) and Sankoff et al. (1992).
The CREx algorithm was first tested in a recent study of echinoderms (Perseke et al., 2008).
It was possible to identify all of the known rearrangement events and, beyond that, in most cases
the direction of the event and the branch on which it probably happened. Furthermore, the well
described TDRL of the holothurians (Arndt and Smith, 1998) could be detected, and two novel
gene orders of mitochondrial genomes, those of the ophiuroid Ophiura albida and the crinoid
Antedon mediterranea, were analysed.
O. albida possesses the same gene order as O. lutkeni, which differs substantially from that of
Ophiopholis aculeata (Scouras et al., 2004). So far, only two distinct gene orders, which
are very different, are known for ophiuroids. Unfortunately, CREx is not able to resolve a plausible
rearrangement sequence because there are too few conserved parts between the two gene orders. Thus, as
published in Scouras et al. (2004), the ancestral state of ophiuroids remains unresolved.
Surprisingly, the gene order of A. mediterranea differs in three regions from the consensus gene
order of the crinoids, which is represented by F. serratissima and P. gracilis (Scouras and Smith, 2001).
All rearrangements within the crinoids could be completely resolved by means of
CREx. The mitochondrial genome of A. mediterranea includes two identical copies of tRNAVal,
a duplication that is absent in all other published crinoids. Furthermore, CREx was able to detect one
transposition of the gene ND4L coupled with tRNAArg (T2, see Figure 4.13) and one TDRL
event of 6 tRNA genes and the control region (TDRL2). Following this event, UAS II was created
by duplication of the four tRNAs (tRNAVal-tRNAAsp-tRNAThr-tRNAGlu) and subsequent loss of
the copies of tRNAAsp, tRNAThr, and tRNAGlu. The reconstructed ancestral state of crinoids is
identical to the gene orders of F. serratissima and P. gracilis, which is in agreement with Scouras
and Smith (2006).

Figure 4.12: Maps of the mitogenomes of the crinoid Antedon mediterranea (AM404181, 16169 nt) and the ophiuroid Ophiura albida (AM404180, 16580 nt). The images were generated from the GenBank files with the mitochondrial visualization tool ’mtviz’ (application note in preparation), which can be found at http://pacosy.informatik.uni-leipzig.de/mtviz.

The known gene orders of the other echinoderm groups (Asteroidea, Echinoidea, and Holothuroidea)
show no rearrangements within each class, and only a few rearrangements are necessary to
transform the gene order of one group into that of another.
Between the Asteroidea and the Echinoidea there is only one well described inversion, I1, of 16
genes (Asakawa et al., 1995; Smith et al., 1990). Unfortunately, only one complete
mitochondrial genome of the Holothuroidea is available; its gene order can be derived from the
echinoid gene order by a single TDRL event of one tRNA cluster (TDRL1) (Arndt and Smith, 1998). The CREx
analysis implies that Echinoidea represents the ancestral state within these three groups because
all events occurred on the branches leading to the Asteroidea or Holothuroidea. However, the
phylogenetic relationships between these three classes cannot be unambiguously reconstructed
from either the genome rearrangements or the sequence analysis.
The differences between the echinoid gene order and the ancestral crinoid gene order can be
explained by one inversion (I2), see Scouras et al. (2004), followed by one tandem duplication
random loss (TDRL3), and one reverse transposition (rT1), cf. Figure 4.13.
The I2+TDRL3 event concerns the 16S rRNA, the genes ND1 and ND2, as well as tRNAIle, tRNALeu2,
tRNAGly, and tRNATyr. We suggest that these events are mechanistically coupled, i.e. constitute a
single rearrangement event. Alternatively, the putative I2+TDRL3 event can be explained by an
inversion with two additional transposition events (see Figure A.4), or as a decoupled event of an
inversion and one tandem duplication random loss event TDRL3. The direction of TDRL3
implies that it occurred on the branch leading to echinoids, asteroids, and holothuroids. In contrast,
the reverse transposition rT1 of the fragment containing the 12S rRNA and three tRNAs
provides no direct information about its location on the two branches.

Figure 4.13: Maximum likelihood analysis of thirteen protein coding genes of the echinoderm mt genomes. The numbers show the bootstrap values for ML/NJ/MP and the posterior probability of the Bayesian analysis. Values of 100 or 1.00, respectively, are represented by solid points to save space. The tree has been rooted with two hemichordates. Branch lengths are proportional to evolutionary distance. The clades are labelled Hemichordata, Ophiuroidea, Asteroidea, Echinoidea, Holothuroidea, and Crinoidea, and the rearrangement events T1, TDRL2+T2, rT1, TDRL1, I1, and I2+TDRL3 are mapped onto the branches. I = Inversions (reversals), T = Transpositions, rT = Reverse Transpositions, TDRL = Tandem Duplication Random Loss.

Scouras and his collaborators (Scouras and Smith, 2001; Scouras et al., 2004; Scouras and Smith,
2006) suggested, based mostly on the nucleotide bias in crinoids and the putative reverse orientation
of the control region relative to the protein-coding genes, that the reverse transposition
occurred in the crinoid lineage. This is consistent with the analysed data. Accepting that the
rT1 event is crinoid specific and that the I2+TDRL3 events are indeed part of the same rearrangement
event, it is possible to conclude that the ancestral gene order of asteroids, echinoids,
holothuroids, and crinoids corresponds to the crinoid gene order without the reverse transposition
rT1 (Figure 4.13). Furthermore, the ancestral arrangement of the echinoid, holothuroid, and
asteroid ancestor coincides with the extant echinoid gene order. This scenario is consistent with two
phylogenetic hypotheses (A and B1), see Figure 4.14. If the coupling of the I2 and the TDRL3
events is rejected (Hypothesis B2 in Figure 4.14), an alternative scenario becomes plausible,
which places the inversion I1 on the Echinozoa (echinoid+holothuroid) branch and makes
the Asteroidea gene order the ancestral state of the echinoid+holothuroid+asteroid group. Note
that in this scenario the inversion I1 implies two different gene orders depending on which
group (Asteroidea, I1a, or Echinozoa, I1b) represents the ancestral state.

Figure 4.14: Two phylogenetic hypotheses are consistent with the most parsimonious rearrangement scenario. Hypothesis A: based on phylogenetic analyses of the amino acid sequences. This scenario implies that the events I2 and TDRL3 must be coupled. Hypothesis B: based on the gene order analysis; in this scenario the events I2 and TDRL3 can be either coupled (B1) or decoupled (B2). Both variants are shown. A = Asteroidea, E = Echinoidea, H = Holothuroidea, C = Crinoidea.

In the analysis reported here the (putative) control regions (as annotated in GenBank and/or the
corresponding literature) were included. This additional information results in better support
for the TDRL2 event (reversing the direction of the rearrangement now requires two TDRLs instead
of only two transpositions). Interestingly, most of the eight rearrangements contain or are located
close to the control region. A frequent involvement of the control region in genome rearrangements
was noted before in chordates (Boore and Brown, 1998).
In Figure 4.13 the rearrangement operations determined by CREx are mapped to the consensus
phylogenetic tree obtained from a careful analysis of the mitochondrial protein sequences. The
CREx results alone are not sufficient to resolve the phylogenetic relationships completely. However,
the results of CREx are consistent with the molecular phylogeny. It should be noted that
the gene order analysis fails to provide unambiguous information exactly for those nodes that
contradict the preferred phylogenetic hypothesis, in particular the position of the ophiuroids, see
below.
4.5.6 Current Developments
The CREx algorithm was extended to better handle alternative scenarios, ordered scenarios, and
combinations of inversion and tandem duplication random loss events. Furthermore, a novel
algorithm called TREx (Tree Rearrangement Explorer) was developed. The TREx algorithm
utilizes the CREx algorithm: it takes as input a binary rooted phylogenetic tree and the gene
orders of a set of taxa and heuristically infers the corresponding rearrangement operations on the
edges of the tree. Using the TREx algorithm it is now possible to test the gene order information
in relation to different hypotheses and/or different phylogenetic trees.
The aim of this project is to use the gene order information to reconstruct phylogenetic trees and
to determine ancestral gene orders.
CHAPTER 5
Summary
The inclusion of molecular markers in the reconstruction of phylogenetic relationships has increased
over the last decades. Molecular markers play an essential part at different taxonomic levels.
In particular, they are used in areas where morphological data contain no further useful
information. However, the increase of experimental data poses new challenges for
phylogenetic reconstruction. Bioinformatics, as a new field of research, has taken up these challenges
over the last few years and provides numerous tools to handle this abundance of data.
At present, several methods and approaches allow biologists to analyse molecular data extensively
and carefully. Based on the understanding of biology at the molecular level it is possible,
in the majority of cases, to resolve the phylogenetic relationships of organisms that have been
separated for a long time.
This work is a further step towards finding algorithms and developing new programs that give
biologists basic tools for careful work in the analysis of such data.
In my work, I followed two different approaches. The first approach deals with the quality of
data sets. With the increase of molecular data and the demand to resolve the metazoan deep
phylogenies, it becomes more difficult to compare the data. In many cases the differences between
the sequences are large, for example because of long divergence times or radiation
events in their history, so that it is very hard to identify and use the information for phylogenetic
reconstruction. Multiple substitutions, point mutations, wobbling third positions in protein-coding
genes, and/or simply variable parts in sequences lead to alignment positions whose character
information cannot be interpreted in ’the correct way’ any more. Such a critical alignment makes
it impossible to reconstruct accurate relationships between the species. Frequently, scientists
delete ambiguous parts or positions by hand, which is in most cases impossible to reproduce.
Therefore we developed the program noisy, which allows the user to detect random-like positions
with undefined information, to check whether these sites are indeed random-like, and to delete them
in a transparent and reproducible way. First, noisy computes a cyclic ordering of the taxon
set, whereby the user can choose between two techniques (NeighborNet or QNet). Subsequently,
a reliability score q is calculated for each character: the number of character-state
alterations is counted and compared to the count observed in random shufflings. The uniform
pseudo-random number generator Mersenne Twister is used to generate the random shufflings.
Furthermore, the fraction r of sites with q > qcutoff (0.8 or a user-given value) among all variable
sites in the alignment is computed. At the end, noisy exports a PostScript file visualizing
the quality of the sites of the reordered input alignment, records their reliability scores as
xy-data, and writes a modified alignment for further analysis from which sites with reliability
q < qcutoff are removed.
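The idea behind the reliability score can be sketched in a few lines of Python. This is only an illustration under stated assumptions and not the scoring formula implemented in noisy: the taxa are assumed to be already arranged in a cyclic ordering, a character column is given as a string in that order, and the hypothetical score q is defined here as the fraction of random shufflings that need at least as many state changes around the cycle as the observed column. Python's random module is itself based on the Mersenne Twister.

```python
import random


def cyclic_changes(column):
    """Count character-state changes between neighbours around the cycle."""
    n = len(column)
    return sum(1 for i in range(n) if column[i] != column[(i + 1) % n])


def reliability(column, shuffles=1000, seed=0):
    """Hypothetical score in [0, 1]; values near 1 indicate a non-random site."""
    rng = random.Random(seed)          # Mersenne Twister generator
    observed = cyclic_changes(column)
    states = list(column)
    at_least = 0
    for _ in range(shuffles):
        rng.shuffle(states)
        if cyclic_changes(states) >= observed:
            at_least += 1
    return at_least / shuffles


clean_site = "AAAAACCCCC"    # two state changes around the cycle
noisy_site = "ACACACACAC"    # ten state changes, the maximum
print(reliability(clean_site), reliability(noisy_site))
```

A site whose states form contiguous blocks along the cyclic ordering receives a score close to 1, whereas a site that alternates as often as a random column would receives a score close to 0 and would fall below a cutoff such as 0.8.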
The modified alignment can be analysed with standard reconstruction tools; the
reconstruction becomes more stable and, for data sets that include very variable sequences,
the topology can change. In addition, the noisy output allows other scientists to trace
which sites were deleted.
The second approach uses the information of mitochondrial gene order. Mitochondrial gene
orders contain comprehensive information about very old splittings, for example the metazoan deep
phylogeny. It has been shown in the literature that mitochondrial genomes can be a good phylogenetic
marker when suitable methods are used. Recently, some implementations of different algorithms and
ideas were published. However, all of them are plagued by several problems, such as the limited
set of hypothetical scenarios or the missing option to compare sequences of different lengths.
In this dissertation I present two novel algorithms for dealing with mitochondrial gene order
information, together with their implementation. The first algorithm, implemented in the program
circal, solves the problem of gene order strings with unequal lengths by encoding
the gene order information in list alignments. A progressive alignment approach is used to combine
pairwise list alignments into a multiple alignment of gene orders. This very simple and fast
approach makes it possible to subsequently apply common reconstruction methods such as Neighbor
Joining, Maximum Parsimony, Maximum Likelihood, or Bayesian analysis. Alignments of mitochondrial
gene orders constructed in this way can readily be used to reconstruct phylogenetic
trees as well as ancestral gene orders using standard approaches. It could be shown that this
method is able to resolve large genome data sets with about 100 characters. Furthermore, we
demonstrated that only 37 characters and their direction information are sufficient to reconstruct
phylogenetic relationships.
Additionally, we developed and implemented the new algorithm CREx, based on the detection of
patterns in so-called strong interval trees, which reflect the corresponding genome rearrangement
operations. Furthermore, we included the tandem duplication random loss operation (TDRL) as
an additional plausible event. Because information about the direction of this event is available,
the operation is very useful for reconstructing phylogenetic events on the level of gene order
rearrangements. Moreover, a single TDRL can sometimes explain complex rearrangement events that
would otherwise require three or more inversions. However, it is important to bear in mind that no
rearrangement operation has been verified in practice.
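Strong interval trees are derived from the common intervals of two gene orders, i.e. sets of genes that are contiguous in both orders; the strong intervals are those common intervals that do not overlap any other common interval. As a minimal illustration, the following Python sketch lists all common intervals with at least two elements. It is a naive quadratic enumeration for linear, unsigned permutations, not one of the more efficient algorithms from the literature and not the CREx implementation, and the function name is hypothetical.

```python
def common_intervals(perm1, perm2):
    """Return the common intervals (size >= 2) of two permutations as frozensets."""
    # Position of every gene in perm1
    pos = {gene: i for i, gene in enumerate(perm1)}
    result = set()
    n = len(perm2)
    for i in range(n):
        lo = hi = pos[perm2[i]]
        for j in range(i + 1, n):
            p = pos[perm2[j]]
            lo, hi = min(lo, p), max(hi, p)
            # perm2[i..j] is contiguous in perm1 iff its positions fill lo..hi
            if hi - lo == j - i:
                result.add(frozenset(perm2[i:j + 1]))
    return result


intervals = common_intervals([1, 2, 3, 4, 5], [2, 1, 3, 5, 4])
for interval in sorted(intervals, key=lambda s: (len(s), sorted(s))):
    print(sorted(interval))
```

In this example the intervals {1, 2, 3} and {3, 4, 5} overlap each other, so neither of them would be a strong interval.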
The CREx approach allows the user to compare gene order strings in a straightforward way. The graphical
output gives a good overview of the events that occurred between two divergently evolving genomes.
It is possible to choose between different distance measures and to take into account whether the
genome is linear or circular. Furthermore, hypothetical tree reconstructions can be tested based on
the gene order information. The analyses using CREx have shown that this algorithm is able to identify
published and well described rearrangement scenarios and, moreover, to detect new or shorter paths
of comparative gene order evolution.
A future step of this work will be to use the gene order information to reconstruct phylogenetic trees
without additional input, as outlined for the TREx algorithm. All programs, algorithms,
source codes, and information are freely available from the web. The results of my
dissertation are reflected by the list of publications, which include, besides different phylogenetic
’state of the art’ analyses, several novel methods.
Bibliography
Adoutte, A., G. Balavoine, N. Lartillot, O. Lespinet, B. Prud’homme, and R. de Rosa,
2000. The new animal phylogeny: reliability and implications. Proc. Natl. Acad. Sci. USA
97:4453–4456.
Alberts, B., A. Johnson, J. Lewis, M. Raff, K. Roberts, and P. Walter, 1994. Molecular Biology
of the Cell. Garland Publishing Inc., New York.
Altmann, R., 1890. Die Elementarorganismen und ihre Beziehungen zu den Zellen. Verlag von
Veit & Comp., Leipzig.
Altschul, S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman,
1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs. Nucl. Acids Res. 25:3389–3402.
Anderson, S., A. T. Bankier, B. G. Barrell, M. H. L. de Bruijn, A. Coulson, J. Drouin, I. C.
Eperon, D. P. Nierlich, B. A. Roe, F. Sanger, et al., 1981. Sequence and organization of the
human mitochondrial genome. Nature 290:457–465.
Anderson, S., M. H. de Bruijn, A. R. Coulson, I. C. Eperon, F. Sanger, and I. G. Young, 1982.
Complete sequence of bovine mitochondrial DNA. Conserved features of the mammalian
mitochondrial genome. J. Mol. Biol. 156:683–717.
Arndt, A. and M. Smith, 1998. Mitochondrial gene rearrangement in the sea cucumber genus
Cucumaria. Mol. Biol. Evol. 15:1009–1016.
Asakawa, S., H. Himeno, K. Miura, and K. Watanabe, 1995. Nucleotide sequence and gene
organization of the starfish Asterina pectinifera mitochondrial genome. Genetics 140:1047–1060.
Attardi, G., 1981. Organization and expression of the mammalian mitochondrial genome: a
lesson in economy. Trends Biochem. Sci. 6:86–89, 100–103.
Bader, D. A., B. M. E. Moret, and M. Yan, 2001. A linear-time algorithm for computing inversion
distance between signed permutations with an experimental study. Journal of Computational
Biology 8(5):483–491.
Bandelt, H. J. and A. W. M. Dress, 1992. Split decomposition: A new and useful approach to
phylogenetic analysis of distance data. Mol. Phyl. Evol. 1:242–252.
Berard, S., A. Bergeron, C. Chauve, and C. Paul, 2007. Perfect sorting by reversals is not always
difficult. IEEE-ACM Transactions on Computational Biology and Bioinformatics 4:4–16.
Bergeron, A., C. Chauve, F. de Montgolfier, and M. Raffinot, 2005. Computing common intervals
of k permutations, with applications to modular decomposition of graphs. In ESA, pp. 779–
790.
Bergeron, A., C. Chauve, F. de Montgolfier, and M. Raffinot, 2008. Computing common intervals
of K permutations, with applications to modular decomposition of graphs. SIAM J. Discrete
Math. To appear.
Bergeron, A. and J. Stoye, 2006. On the similarity of sets of permutations and its applications to
genome comparison. Journal of Computational Biology 13:1340–1354.
Bernhard, D., G. Fritzsch, P. Glockner, and C. Wurst, 2005. Molecular insights into speciation in
the Agrilus viridis complex and the genus Trachys (Coleoptera: Buprestidae). Eur. J. Entomol.
102:599–605.
Bernhard, D., C. Schmidt, A. Korte, G. Fritzsch, and R. G. Beutel, 2006. From terrestrial to
aquatic habitats and back again - molecular insights into the evolution and phylogeny of
Hydrophiloidea (Coleoptera) using multigene analyses. Zoologica Scripta 35:597–606.
Bernt, M., D. Merkle, K. Ramsch, G. Fritzsch, M. Perseke, D. Bernhard, M. Schlegel, P. F.
Stadler, and M. Middendorf, 2007. CREx: Inferring Genomic Rearrangements Based on
Common Intervals. Bioinformatics 23:2957–2958.
Bhamrah, H. S. and K. Juneja, 2002. Modern Zoology. Anmol Publications PVT. LTD., New
Delhi.
Boehme, M. U., G. Fritzsch, A. Tippmann, M. Schlegel, and T. U. Berendonk, 2007. The
complete mitochondrial genome of the green lizard Lacerta viridis viridis (Reptilia: Lacertidae)
and its phylogenetic position within squamate reptiles. Gene 394:69–77.
Boore, J. L., 1999. Animal mitochondrial genomes. Nucl. Acids Res. 27:1767–1780.
Boore, J. L. and W. M. Brown, 1998. Big trees from little genomes: mitochondrial gene order as
a phylogenetic tool. Curr. Opinion Gen. Devel. 8:668–674.
Booth, K. and G. Lueker, 1976. Testing for the consecutive ones property, interval graphs, and
graph planarity using PQ-tree algorithms. J. Comp. System Sci. 13:335–379.
Borst, P., 1972. Mitochondrial nucleic acids. Ann. Rev. Biochem. 41:333–376.
Bryant, D. and V. Moulton, 2004. Neighbor-net: An agglomerative method for the construction
of phylogenetic networks. Mol. Biol. Evol. 21:255–265.
Bryant, D. and V. Moulton, 2007. Consistency of neighbor-net. Algorithms for Molecular
Biology 2:8. Preprint.
Buneman, P., 1971. The recovery of trees from measures of dissimilarity. In F. R. Hodson, D. G.
Kendall, and P. Tautu, eds., Mathematics and the Archeological and Historical Sciences, pp.
387–395. Edinburgh University Press, Edinburgh, UK.
Bunke, H. and U. Buhler, 1993. Applications of approximate string matching to 2D shape
recognition. Patt. Recogn. 26:1797–1812.
Cameron, C. B., J. R. Garey, and B. J. Swalla, 2000. Evolution of the chordate body plan:
New insights from phylogenetic analyses of deuterostome phyla. Proc. Natl. Acad. Sci. USA
97:4469–4474.
Caprara, A., 2003. The reversal median problem. INFORMS Journal on Computing 15:93–113.
Cartwright, R., 2005. DNA assembly with gaps (Dawg): Simulating sequence evolution.
Bioinformatics 21 S3:iii31–iii38.
Chan, D., 2006. Mitochondria: Dynamic organelles in disease, aging, and development. Cell
125:1241–1252.
Chaudhuri, K., K. Chen, R. Mihaescu, and S. Rao, 2006. On the tandem duplication-random
loss model of genome rearrangement. In SODA, pp. 564–570.
Chipuk, J., L. Bouchier-Hayes, and D. Green, 2006. Mitochondrial outer membrane permeabi-
lization during apoptosis: the innocent bystander scenario. Cell Death and Differentiation
13:1396–1402.
Coenye, T. and P. Vandamme, 2003. Extracting phylogenetic information from whole-genome
sequencing projects: the lactic acid bacteria as a test case. Microbiology 149:3507–3517.
Cosner, M., R. Jansen, B. Moret, L. Raubeson, L.-S. Wang, T. Warnow, and S. Wyman, 2000a.
An empirical comparison of phylogenetic methods on chloroplast gene order data in
Campanulaceae. In D. Sankoff and J. Nadeau, eds., Comparative Genomics: Empirical and Analytical
Approaches to Gene Order Dynamics, Map Alignment, and the Evolution of Gene Families,
pp. 99–121. Kluwer Academic Pub., Dordrecht, Netherlands.
Cosner, M., R. Jansen, B. Moret, L. Raubeson, L.-S. Wang, T. Warnow, and S. Wyman, 2000b.
A new fast heuristic for computing the breakpoint phylogeny and a phylogenetic analysis of a
group of highly rearranged chloroplast genomes. In Proc. 8th Intl Conf. on Intelligent Systems
for Molecular Biology (ISMB00), pp. 104–115.
Dellaporta, S., A. Xu, S. Sagasser, W. Jakob, M. Moreno, L. Buss, and B. Schierwater, 2006.
Mitochondrial genome of Trichoplax adhaerens supports Placozoa as the basal lower metazoan
phylum. Proc. Natl. Acad. Sci. USA 103:8751–8756.
Dewey, T. G., 2001. A sequence alignment algorithm with an arbitrary gap penalty function. J.
Comp. Biol. 8:177–190.
Dobzhansky, T. and A. H. Sturtevant, 1938. Inversions in the chromosomes of Drosophila
pseudoobscura. Genetics 23:28–64.
Doyle, J. J., J. I. Davis, R. J. Soreng, D. Garvin, and M. J. Anderson, 1992. Chloroplast DNA
inversions and the origin of the grass family (Poaceae). Proc. Natl. Acad. Sci. USA 89:7722–7726.
Dreyer, H. and G. Steiner, 2004. The complete sequence and gene organization of the mitochondrial
genome of the gadilid scaphopod Siphonondentalium lobatum (Mollusca). Mol. Phylog.
Evol. 31:605–617.
Dunn, C. W., A. Hejnol, D. Q. Matus, K. Pang, W. E. Browne, S. A. Smith, E. Seaver, G. W.
Rouse, M. Obst, G. Edgecombe, et al., 2008. Broad phylogenomic sampling improves resolution
of the animal tree of life. Nature 452:745–750.
Efron, B., E. Halloran, and S. Holmes, 1996. Bootstrap confidence levels for phylogenetic trees.
Proc. Natl. Acad. Sci. USA 93:7085–7090.
Farris, J. S., 1989. The retention index and the rescaled consistency index. Cladistics 5:417–419.
Feagin, J. E., 1992. The 6-kb element of Plasmodium falciparum encodes mitochondrial
cytochrome genes. Mol. Biochem. Parasitol. 52:145–148.
Felsenstein, J., 1985. Confidence limits on phylogenies: An approach using the bootstrap.
Evolution 39:783–791.
Felsenstein, J., 1989. Phylip – phylogeny inference package (version 3.2). Cladistics 5:164–166.
Fitz-Gibbon, S. T. and C. H. House, 1999. Whole genome-based phylogenetic analysis of
free-living microorganisms. Nucl. Acids Res. 27:4218–4222.
Fritzsch, G., M. Schlegel, and P. F. Stadler, 2004a. Consensus arrangements of mitochondrial
genomes. In Proceedings of the German Conference on Bioinformatics (GCB 04).
Fritzsch, G., M. Schlegel, and P. F. Stadler, 2004b. Metazoan deep phylogenies: Can the
Cambrian explosion be resolved with molecular markers? In Proceedings of the 12th International
Conference on Intelligent Systems for Molecular Biology (ISMB) / 3rd European Conference
on Computational Biology (ECCB) 2004.
Fukuhara, H., F. Sor, R. Drissi, N. Dinoul, I. Miyakawa, S. Rousset, and A. Viola, 1993. Linear
mitochondrial DNAs of yeasts: frequency of occurrence and general features. Mol. Cell Biol.
13:2309–2314.
Gotoh, O., 1982. An improved algorithm for matching biological sequences. J. Mol. Biol.
162:705–708.
Green, D., 1998. Apoptotic pathways: the roads to ruin. Cell 94:695–698.
Gregor, J. and M. G. Thomason, 1993. Dynamic programming alignment of sequences
representing cyclic patterns. IEEE Trans. Patt. Anal. Mach. Intell. 15:129–135.
Grunewald, S., 2006. QNet. Unpublished Technical Report.
Grunewald, S., V. Moulton, and A. Spillner, 2007. Consistency of the QNet algorithm for
generating planar split networks from weighted quartets. Disc. Appl. Math. To appear.
Hannenhalli, S. and P. A. Pevzner, 1995. Transforming cabbage into turnip (polynomial al-
gorithm for sorting signed permutations by reversals). In Proceedings 27th ACM Symp. on
Theory of Computing (STOC’95), pp. 178–189.
Heber, S. and J. Stoye, 2001. Algorithms for finding gene clusters. In Proceedings of WABI
2001, number 2149 in LNCS, pp. 252–263.
Helfenbein, K. G., H. Fourcade, R. G. Vanjani, and J. L. Boore, 2004. The mitochondrial genome
of Paraspadella gotoi is highly reduced and reveals that chaetognaths are a sister group to
protostomes. Proc. Natl. Acad. Sci. USA 101:10639–10643.
Hermann, G., J. Thatcher, J. Mills, K. Hales, M. Fuller, J. Nunnari, and J. Shaw, 1998.
Mitochondrial fusion in yeast requires the transmembrane GTPase Fzo1p. J. Cell Biol. 143:359–373.
Higgs, P. G., D. Jameson, H. Jow, and M. Rattray, 2003. The evolution of tRNA-Leu genes in
animal mitochondrial genomes. J. Mol. Evol. 57:435–445.
Hillis, D. M. and J. P. Huelsenbeck, 1992. Signal, noise, and reliability in molecular phylogenetic
analysis. J. Heredity 83:189–195.
Hoffmann, R. J., J. L. Boore, and W. M. Brown, 1992. A novel mitochondrial genome
organization for the blue mussel, Mytilus edulis. Genetics 131:397–412.
Huson, D. H., 1998. Splitstree: analyzing and visualizing evolutionary data. Bioinformatics
14:68–73.
Kluge, A. G. and J. S. Farris, 1969. Quantitative phyletics and the evolution of anurans. Syst.
Zool. 18:1–32.
Korte, A., I. Ribera, R. G. Beutel, and D. Bernhard, 2004. Interrelationships of staphyliniform
groups inferred from 18S and 28S rDNA sequences, with special emphasis on Hydrophiloidea
(Coleoptera, Staphyliniformia). J. Zool. Syst. Evol. Research 42:281–288.
Kumazawa, Y., 2007. Mitochondrial genomes from major lizard families suggest their
phylogenetic relationships and ancient radiations. Gene 388:19–26.
Kurihara, K. and T. Kunisawa, 2004. A gene order database of plastid genomes. Data Sci. J.
3:60–79.
Landau, G. M., E. W. Myers, and J. P. Schmidt, 1998. Incremental string comparison. SIAM J.
Comput. 27:557–582.
Larget, B. and D. L. Simon, 2002. Bayesian phylogenetic inference from animal mitochondrial
genome arrangements. J. Royal Statist. Soc. B 64:681–693.
Lavrov, D. V., J. L. Boore, and W. M. Brown, 2002. Complete mtDNA sequences of two
millipedes suggest a new model for mitochondrial gene rearrangements: Duplication and
nonrandom loss. Mol. Biol. Evol. 19:163–169.
Lee, M., 2005. Squamate phylogeny, taxon sampling, and data congruence. Org. Divers. Evol.
5:25–45.
Lehr, E., G. Fritzsch, and A. Muller, 2005. An analysis of Andes frogs (Phrynopus:
Leptodactylidae) phylogeny based on 12S and 16S mitochondrial DNA sequences. Zoologica Scripta
34:593–603.
Macey, J. R., J. A. Schulte II, A. Larson, and T. J. Papenfuss, 1998. Tandem duplication via
light-strand synthesis may provide a precursor for mitochondrial genomic rearrangement. Mol. Biol.
Evol. 15:71–75.
Maes, M., 1990. On a cyclic string-to-string correction problem. Inform. Process. Lett. 35:73–78.
Maglott, D., J. Ostell, K. D. Pruitt, and T. Tatusova, 2005. Entrez Gene: Gene-centered
information at NCBI. Nucleic Acids Res. 33:D54–D58.
Mannella, C., 2006. Structure and dynamics of the mitochondrial inner membrane cristae.
Biochimica et Biophysica Acta (BBA) - Mol. Cell Res. 1763:542–548.
Margulis, L., 1970. Origin of Eukaryotic Cells. Yale Univ. Press, New Haven.
Margulis, L., 1981. Symbiosis in Cell Evolution. Freeman, San Francisco.
Martin, W. and M. Muller, 1998. The hydrogen hypothesis for the first eukaryote. Nature
392:37–41.
Matsumoto, M., 1998. Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom
number generator. ACM Trans. on Modeling and Computer Simulation 8:3–30.
McBride, H., M. Neuspiel, and S. Wasiak, 2006. Mitochondria: more than just a powerhouse.
Curr Biol 16:551–560.
Mereschkowsky, C., 1905. Über Natur und Ursprung der Chromatophoren im Pflanzenreiche. Biol
Centralbl 25:593–604.
Mindell, D. P., M. D. Sorenson, and D. E. Dimcheff, 1998. Multiple independent origins of
mitochondrial gene order in birds. Proc. Natl. Acad. Sci. USA 95:10693–10697.
Mollineda, R. A., E. Vidal, and F. Casacuberta, 2002. Cyclic sequence alignments: approximate
versus optimal techniques. Int. J. Pattern Rec. Artif. Intel. 16:291–299.
Moret, B. M. E., D. A. Bader, and T. Warnow, 2002. High-performance algorithm engineering
for computational phylogenetics. J Supercomputing 22:99–111.
Moret, B. M. E., J. Tang, and T. Warnow, 2004. Reconstructing phylogenies from gene-content
and gene-order data. In O. Gascuel (ed.), Mathematics of Evolution and Phylogeny, pp. 321–352.
Oxford University Press, New York.
Moret, B. M. E., S. K. Wyman, D. A. Bader, T. Warnow, and M. Yan, 2001. A new implementation
and detailed study of breakpoint analysis. In Proc. 6th Pacific Symp. on Biocomputing
(PSB'01), pp. 583–594. World Scientific Pub.
Moritz, C. and W. M. Brown, 1986. Tandem duplications of D-loop and ribosomal RNA sequences
in lizard mitochondrial DNA. Science 233:1425–1427.
Moritz, C. and W. M. Brown, 1987. Tandem duplications in animal mitochondrial DNAs: variation
in incidence and gene content among lizards. Proc. Natl. Acad. Sci. USA 84:7183–7187.
Moritz, C., T. E. Dowling, and W. M. Brown, 1987. Evolution of animal mitochondrial DNA:
relevance for population biology and systematics. Annu. Rev. Ecol. Syst. 18:269–292.
Nardi, F., G. Spinsanti, J. L. Boore, A. Carapelli, R. Dallai, and F. Frati, 2003. Hexapod origins:
Monophyletic or paraphyletic? Science 299:1887–1889.
Nass, S. and M. M. K. Nass, 1963. Intramitochondrial fibers with DNA characteristics. J Cell
Biol 19:593–629.
Nieselt-Struwe, K. and A. von Haeseler, 2001. Quartet-mapping, a generalization of the likelihood
mapping procedure. Mol. Biol. Evol. 18:1204–1219.
Notsu, Y., S. Masood, T. Nishikawa, N. Kubo, G. Akiduki, M. Nakazono, A. Hirai, and K. Kadowaki,
2002. The complete sequence of the rice (Oryza sativa L.) mitochondrial genome: frequent
DNA sequence acquisition and loss during the evolution of flowering plants. Mol Genet
Genomics 268:434–445.
Odintsova, M. S. and N. P. Yurina, 2003. Plastid genomes of higher plants and algae: Structure
and functions. Mol. Biol. (Mosk.) 37:649–662. Translated from Molekulyarnaya Biologiya,
Vol. 37, No. 5, 2003, pp. 768-783.
Ogden, T. and M. Rosenberg, 2006. Multiple sequence alignment accuracy and phylogenetic
inference. Syst. Biol. 55:314–328.
Oh-hama, T., 1997. Evolutionary consideration on 5-aminolevulinate synthase in nature. Orig
Life Evol Biosph 27:405–412.
Parida, L., 2006. A PQ framework for reconstructions of common ancestors and phylogeny.
In Comparative Genomics, volume 4205, pp. 141–155. Springer, Berlin. RECOMB 2006
International Workshop, RCG 2006 Montreal, Canada, September 24-26, 2006 Proceedings.
Perseke, M., G. Fritzsch, K. Ramsch, M. Bernt, D. Merkle, M. Middendorf, D. Bernhard, P. F.
Stadler, and M. Schlegel, 2008. Evolution of mitochondrial gene orders in echinoderms. Mol.
Phyl. Evol. 47:855–864.
Qi, J., B. Wang, and B.-l. Hao, 2004. Whole proteome prokaryote phylogeny without sequence
alignment: a K-string composition approach. J. Mol. Evol. 58:1–11.
Racker, E., 1976. A New Look at Mechanisms in Bioenergetics. Academic Press, New York.
Reijnders, L., 1975. The origin of mitochondria. J Mol Evol 5:167–176.
Ris, H. and R. Singh, 1961. Electron microscope studies on blue-green algae. J Biophys Biochem
Cytol 9:63–80.
Rosenberg, M. S. and S. Kumar, 2001. Incomplete taxon sampling is not a problem for phylogenetic
inference. Proc. Natl. Acad. Sci. USA 98:10751–10756.
Rossier, M., 2006. T channels and steroid biosynthesis: in search of a link with mitochondria.
Cell Calcium 40:155–164.
Saitou, N. and M. Nei, 1987. The neighbor-joining method: a new method for reconstructing
phylogenetic trees. Mol. Biol. Evol. 4:406–425.
Sankoff, D. and M. Blanchette, 1998. Multiple genome rearrangement and breakpoint phylogeny.
J. Computational Biology 5:555–570.
Sankoff, D., G. Leduc, N. Antoine, B. Paquin, B. F. Lang, and R. Cedergren, 1992. Gene order
comparisons for phylogenetic inference: evolution of the mitochondrial genome. Proc. Natl.
Acad. Sci. USA 89:6575–6579.
Scanlon, J. and I. Reynolds, 1998. Effects of oxidants and glutamate receptor activation on
mitochondrial membrane potential in rat forebrain neurons. J Neurochem 71:2392–2400.
Scheffler, I. E., 1999. Mitochondria. John Wiley and Sons Inc., New York.
Schimper, A. F. W., 1883. Über die Entwicklung der Chlorophyllkörner und Farbkörper. Bot
Zeitung 41:105–146.
Scouras, A., K. Beckenbach, A. Arndt, and J. M. Smith, 2004. Complete mitochondrial genome
DNA sequence for two ophiuroids and a holothuroid: the utility of protein gene sequence and
gene maps in the analyses of deep deuterostome phylogeny. Mol Phylogenet Evol 31:50–65.
Scouras, A. and J. M. Smith, 2001. A novel mitochondrial gene order in the crinoid echinoderm
Florometra serratissima. Mol Biol Evol 18:61–73.
Scouras, A. and J. M. Smith, 2006. The complete mitochondrial genomes of the sea lily
Gymnocrinus richeri and the feather star Phanogenia gracilis: signature nucleotide bias and unique
nad4L gene rearrangement within crinoids. Mol Phylogenet Evol 39:323–334.
Siepel, A. and B. M. E. Moret, 2001. Finding an optimal inversion median: Experimental results.
In Proc. WABI, number 2149 in LNCS, pp. 189–203.
Simon, C., F. Frati, A. Beckenbach, B. Crespi, H. Liu, and P. Flook, 1994. Evolution, weighting,
and phylogenetic utility of mitochondrial gene sequences and a compilation of conserved
polymerase chain reaction primers. Ann. Entomol. Soc. Am. 87:651–701.
Smith, M. J., D. K. Banfield, K. Doteval, S. Gorski, and D. J. Kowbel, 1990. Nucleotide sequence
of nine protein-coding genes and 22 tRNAs in the mitochondrial DNA of the sea star Pisaster
ochraceus. J Mol Evol 31:195–204.
Snel, B., P. Bork, and M. A. Huynen, 1999. Genome phylogeny based on gene content. Nature
Genet. 21:108–110.
Sokal, R. R. and C. D. Michener, 1958. A statistical method for evaluating systematic relationships.
Univ. Kansas Sci. Bull. 38:1409–1438.
Stechmann, A. and M. Schlegel, 1999. Analysis of the complete mitochondrial DNA sequence of
the brachiopod Terebratulina retusa places Brachiopoda within the protostomes. Proc. R. Soc.
Lond. B 266:1–10.
Stocking, C. and E. Gifford, 1959. Incorporation of thymidine into chloroplasts of Spirogyra.
Biochem. Biophys. Res. Comm. 1:159–164.
Stockley, B., A. B. Smith, T. Littlewood, H. A. Lessios, and J. A. Mackenzie-Dodds, 2005.
Phylogenetic relationships of spatangoid sea urchins (Echinoidea): taxon sampling density
and congruence between morphological and molecular estimates. Zool. Scripta 34:447–468.
Swofford, D. L., 2002. PAUP*: Phylogenetic Analysis Using Parsimony (*and Other Methods)
Version 4.0b10. Sinauer Associates, Sunderland, MA. Handbook and Software.
Swofford, D. L. and G. J. Olsen, 1990. Phylogeny reconstruction. In D. M. Hillis and C. Moritz,
eds., Molecular Systematics, pp. 411–501. Sinauer Associates, Sunderland MA.
Thompson, J. D., D. G. Higgins, and T. J. Gibson, 1994. CLUSTAL W: improving the sensitivity
of progressive multiple sequence alignment through sequence weighting, position specific gap
penalties, and weight matrix choice. Nucl. Acids Res. 22:4673–4680.
Townsend, T. and A. Larson, 2002. Molecular phylogenetics and mitochondrial genomic evolution
in the Chamaeleonidae (Reptilia, Squamata). Mol. Phylog. Evol. 23:22–36.
Townsend, T. M., A. Larson, E. Louis, and J. Macey, 2004. Molecular phylogenetics of Squamata:
The position of snakes, amphisbaenians, and dibamids, and the root of the squamate tree. Syst.
Biol. 53:735–757.
Uno, T. and M. Yagiura, 2000. Fast algorithms to enumerate all common intervals of two
permutations. Algorithmica 26:290–309.
Vidal, N. and S. B. Hedges, 2005. The phylogeny of squamate reptiles (lizards, snakes, and
amphisbaenians) inferred from nine nuclear protein-coding genes. C. R. Biologies 328:1000–1008.
Voet, D., J. G. Voet, and C. W. Pratt, 2006. Fundamentals of Biochemistry, 2nd Edition. John
Wiley and Sons, Inc., New York.
Wägele, J.-W., 2005. Foundations of Phylogenetic Systematics. Verlag Dr. Friedrich Pfeil,
Munich, Germany.
Wallin, I. E., 1923. The mitochondria problem. The American Naturalist 57(650):255–261.
Wang, L.-S., R. K. Jansen, B. M. E. Moret, L. A. Raubeson, and T. Warnow, 2002. Fast phylogenetic
methods for genome rearrangement evolution: An empirical study. In Proc. 7th Pacific
Symp. on Biocomputing (PSB'02), pp. 524–535. World Scientific Pub.
Watterson, G. A., W. J. Ewens, T. E. Hall, and A. Morgan, 1982. The chromosome inversion
problem. J. Theor. Biol. 99:1–7.
Wetzel, R., 1995. Zur Visualisierung abstrakter Ähnlichkeitsbeziehungen. Ph.D. thesis, Bielefeld
University, Germany.
Wolf, P. G., K. G. Karol, D. F. Mandoli, J. Kuehl, K. Arumuganathan, M. W. Ellis, B. D. Mishler,
D. G. Kelch, R. G. Olmstead, and J. L. Boore, 2005. The first complete chloroplast genome
sequence of a lycophyte, Huperzia lucidula (Lycopodiaceae). Gene 350:117–128.
APPENDIX A
Appendix
NP-Hard
A problem H is NP-hard if and only if there is an NP-complete problem L that is polynomial-time
Turing-reducible to H, i.e. $L \le_T H$.
Figure A.1: Venn diagram for the P, NP, NP-complete, and NP-hard sets of problems.
APPENDIX A. APPENDIX
Indices in Phylogenetic Reconstruction
Taken from Wägele (2005) and modified.
CI = consistency index
The index evaluates the number of homoplasies as a proportion of the total character state changes
of a topology. For a character $i$ in the data set, $n_i$ is the number of character states. The lowest
number of character state changes $m_i$ to be expected in a topology is $n_i - 1$, implying
a single occurrence of each apomorphic state.
When $s_i$ is the number of character state changes occurring in a topology, the consistency index for a
character $i$ is:
$c_i = m_i / s_i$ (A.1)
The consistency index $CI$ for the whole topology is calculated from the sum $M$ of all $m_i$ and the
sum $S$ of all character state changes $s_i$ present in the topology:
$CI = M / S$ (A.2)
When homoplasies are present for a character in a topology, this character shows more state
changes $s_i$ than the minimum number of changes $m_i$, and the index decreases. If no homoplasies
are present, the consistency index is $CI = 1$. However, the lower bound of the consistency
index is not 0, and $c_i$ varies with topology. For this reason, Farris (1989) proposed two more
quantities called the retention index (RI) and the rescaled consistency index (RC).
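Equations (A.1) and (A.2) can be illustrated with a short Python sketch. The function name consistency_indices and the three-character example values are hypothetical and not taken from the analyses in this thesis; the sketch assumes that every character changes at least once on the tree (s_i >= 1).

```python
def consistency_indices(min_changes, observed_changes):
    """Return ([c_i for each character], CI) following (A.1) and (A.2).

    min_changes      -- m_i, minimum state changes per character (n_i - 1)
    observed_changes -- s_i, state changes per character on the topology
    Assumption of this sketch: every s_i is at least 1.
    """
    if len(min_changes) != len(observed_changes):
        raise ValueError("need one m_i and one s_i per character")
    per_character = [m / s for m, s in zip(min_changes, observed_changes)]
    ensemble_ci = sum(min_changes) / sum(observed_changes)  # CI = M / S
    return per_character, ensemble_ci


# Hypothetical example: three characters with 2, 2 and 3 states.
m = [1, 1, 2]
s = [1, 2, 4]  # the second and third characters show homoplasy
per_char, ci = consistency_indices(m, s)
```

Note that CI is computed from the sums M and S, so it is generally not the mean of the individual c_i values.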
HI = homoplasy index
The homoplasy index $HI$ is complementary to the consistency index and is taken to be a measure
of the proportion of character state changes that are caused by homoplasies. It can serve as a measure
of the noise present in the data.
$HI = 1 - CI$ (A.3)
RI = retention index
The retention index (RI) was designed to be a measure of the amount of putative synapomorphies
(in relation to a given data set) which are retained in a topology. The more putative analogies
occur, the lower the RI value. The value $l_{max}$ is the maximal possible length of a dendrogram
for a given data set. The retention index (RI) is calculated as follows:
$RI = \frac{l_{max} - S}{l_{max} - M}$ (A.4)
RC = rescaled consistency index
The rescaled consistency index (RC) is calculated as follows:
$RC = CI \times RI$ (A.5)
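Once M, S, and l_max are known for a topology, the four indices of this section follow directly from (A.2) to (A.5). The Python sketch below is a minimal illustration; the function name tree_indices and the numeric example are assumptions made here, and l_max > M is assumed so that the denominator of (A.4) is nonzero.

```python
def tree_indices(M, S, l_max):
    """Compute CI, HI, RI and RC from the totals M, S and l_max.

    M     -- sum of the minimum numbers of state changes m_i
    S     -- sum of the observed numbers of state changes s_i
    l_max -- maximal possible tree length for the data set (assumed > M)
    """
    if S <= 0 or l_max <= M:
        raise ValueError("requires S > 0 and l_max > M")
    ci = M / S                      # (A.2)
    hi = 1 - ci                     # (A.3)
    ri = (l_max - S) / (l_max - M)  # (A.4)
    rc = ci * ri                    # (A.5)
    return {"CI": ci, "HI": hi, "RI": ri, "RC": rc}


# Hypothetical totals, chosen only for illustration.
indices = tree_indices(M=4, S=7, l_max=10)
```

A tree without homoplasy (S = M) gives CI = RI = RC = 1 and HI = 0, while a tree of maximal length (S = l_max) gives RI = RC = 0.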
CIRCAL Material
Table A.1: Accession numbers of used taxa
Accession number Organism Taxonomy
Annelida
NC 001673 Lumbricus terrestris Clitellata; Oligochaeta
NC 000931 Platynereis dumerilii Polychaeta; Palpata
Arthropoda - Chelicerata
NC 002010 Ixodes hexagonus Arachnida
NC 002074 Rhipicephalus sanguineus Arachnida
NC 004357 Ornithodoros moubata Arachnida
NC 004370 Ixodes persulcatus Arachnida
NC 004454 Varroa destructor Arachnida
NC 005291 Carios capensis Arachnida
NC 005292 Haemaphysalis flava Arachnida
NC 005293 Ixodes holocyclus Arachnida
NC 005820 Ornithodoros porcinus Arachnida
NC 005924 Heptathela hangzhouensis Arachnida
NC 005925 Ornithoctonus huwena Arachnida
NC 005942 Habronattus oregonensis Arachnida
NC 005963 Amblyomma triguttatum Arachnida
NC 006078 Ixodes uriae Arachnida
NC 003057 Limulus polyphemus Merostomata
Arthropoda - Crustacea
NC 000844 Daphnia pulex Branchiopoda
NC 001620 Artemia franciscana Branchiopoda
NC 004465 Triops cancriformis Branchiopoda
NC 006079 Triops longicaudatus Branchiopoda
NC 005937 Hutchinsoniella macracantha Cephalocarida
NC 005934 Armillifer armillatus Crustacea
NC 002184 Penaeus monodon Malacostraca
NC 003058 Pagurus longicarpus Malacostraca
NC 004251 Panulirus japonicus Malacostraca
NC 005037 Portunus trituberculatus Malacostraca
NC 006081 Squilla mantis Malacostraca
NC 011243 Cherax destructor Malacostraca
NC 005936 Pollicipes polymerus Maxillopoda
NC 008974 Tetraclita japonica Maxillopoda
NC 005306 Vargula hilgendorfii Ostracoda
NC 005938 Speleonectes tulumensis Remipedia
Arthropoda - Hexapoda
NC 002735 Tetrodontophora bielanensis Collembola
NC 005438 Gomphiocephalus hodgsoni Collembola
NC 006074 Onychiurus orientalis Collembola
NC 006075 Podura aquatica Collembola
NC 000857 Ceratitis capitata Pterygota
NC 000875 Anopheles quadrimaculatus Pterygota
NC 001322 Drosophila yakuba Pterygota
NC 001566 Apis mellifera Pterygota
NC 001709 Drosophila melanogaster Pterygota
NC 001712 Locusta migratoria Pterygota
NC 002084 Anopheles gambiae Pterygota
NC 002355 Bombyx mori Pterygota
NC 002609 Triatoma dimidiata Pterygota
NC 002660 Cochliomyia hominivorax Pterygota
NC 002697 Chrysomya putoria Pterygota
NC 003081 Tribolium castaneum Pterygota
NC 003367 Ostrinia nubilalis Pterygota
NC 003368 Ostrinia furnacalis Pterygota
NC 003372 Crioceris duodecimpunctata Pterygota
NC 003395 Bombyx mandarina Pterygota
NC 003970 Pyrocoelia rufa Pterygota
NC 004529 Melipona bicolor Pterygota
NC 004622 Antheraea pernyi Pterygota
NC 004816 Lepidopsocid RS-2001 Pterygota
NC 005333 Bactrocera oleae Pterygota
NC 005779 Drosophila mauritiana Pterygota
NC 005780 Drosophila sechellia Pterygota
NC 005781 Drosophila simulans Pterygota
NC 005939 Aleurodicus dugesii Pterygota
NC 005944 Philaenus spumarius Pterygota
NC 006076 Periplaneta fuliginosa Pterygota
NC 006133 Pteronarcys princeps Pterygota
NC 005437 Tricholepidion gertschi Thysanura
NC 006080 Thermobia domestica Thysanura
Arthropoda - Myriapoda
NC 002629 Lithobius forficatus Chilopoda
NC 005870 Scutigera coleoptrata Chilopoda
NC 003343 Narceus annularus Diplopoda
NC 003344 Thyropygus sp. Diplopoda
Echinodermata
NC 001627 Asterina pectinifera Eleutherozoa; Asterozoa
NC 004610 Pisaster ochraceus Eleutherozoa; Asterozoa
NC 005334 Ophiopholis aculeata Eleutherozoa; Asterozoa
NC 005930 Ophiura lutkeni Eleutherozoa; Asterozoa
NC 001453 Strongylocentrotus purpuratus Eleutherozoa; Echinozoa
NC 001572 Paracentrotus lividus Eleutherozoa; Echinozoa
NC 001770 Arbacia lixula Eleutherozoa; Echinozoa
NC 005929 Cucumaria miniata Eleutherozoa; Echinozoa
NC 001878 Florometra serratissima Pelmatozoa; Crinoidea
Mollusca
NC 005335 Lampsilis ornata Bivalvia; Palaeoheterodonta
NC 001276 Crassostrea gigas Bivalvia; Pteriomorphia
NC 002507 Loligo bleekeri Cephalopoda; Coleoidea
NC 002176 Pupa strigosa Gastropoda; Orthogastropoda
NC 004321 Roboastra europaea Gastropoda; Orthogastropoda
NC 005827 Aplysia californica Gastropoda; Orthogastropoda
NC 005940 Haliotis rubra Gastropoda; Orthogastropoda
NC 001761 Albinaria caerulea Gastropoda; Pulmonata
NC 001816 Cepaea nemoralis Gastropoda; Pulmonata
NC 005439 Biomphalaria glabrata Gastropoda; Pulmonata
NC 001636 Katharina tunicata Polyplacophora; Neoloricata
NC 005840 Siphonodentalium lobatum Scaphopoda; Gadilida
Nematoda
NC 001327 Ascaris suum Chromadorea; Ascaridida
NC 001328 Caenorhabditis elegans Chromadorea; Rhabditida
NC 003415 Ancylostoma duodenale Chromadorea; Rhabditida
NC 003416 Necator americanus Chromadorea; Rhabditida
NC 004806 Cooperia oncophora Chromadorea; Rhabditida
NC 005143 Strongyloides stercoralis Chromadorea; Rhabditida
NC 005941 Steinernema carpocapsae Chromadorea; Rhabditida
NC 001861 Onchocerca volvulus Chromadorea; Spirurida
NC 004298 Brugia malayi Chromadorea; Spirurida
NC 005305 Dirofilaria immitis Chromadorea; Spirurida
Platyhelminthes
NC 000928 Echinococcus multilocularis Cestoda; Eucestoda
NC 002547 Taenia crassiceps Cestoda; Eucestoda
NC 002767 Hymenolepis diminuta Cestoda; Eucestoda
NC 004022 Taenia solium Cestoda; Eucestoda
NC 004826 Taenia asiatica Cestoda; Eucestoda
NC 002354 Paragonimus westermani Trematoda; Digenea
NC 002529 Schistosoma mekongi Trematoda; Digenea
NC 002544 Schistosoma japonicum Trematoda; Digenea
NC 002545 Schistosoma mansoni Trematoda; Digenea
NC 002546 Fasciola hepatica Trematoda; Digenea
List of the Gene Abbreviations
Proteins
ATP synthase F0 subunit 6        ATP6    ATP6
ATP synthase F0 subunit 8        ATP8    ATP8
Cytochrome c oxidase subunit I   COX1    CO1
Cytochrome c oxidase subunit II  COX2    CO2
Cytochrome c oxidase subunit III COX3    CO3
Cytochrome B                     CytB    CYTB
NADH dehydrogenase subunit 1     ND1     ND1
NADH dehydrogenase subunit 2     ND2     ND2
NADH dehydrogenase subunit 3     ND3     ND3
NADH dehydrogenase subunit 4     ND4     ND4
NADH dehydrogenase subunit 4L    ND4L    ND4L
NADH dehydrogenase subunit 5     ND5     ND5
NADH dehydrogenase subunit 6     ND6     ND6
rRNAs
Large    -    rRNA large    16S
Small    -    rRNA small    12S
tRNAs
Alanine          Ala    A
Arginine         Arg    R
Asparagine       Asn    N
Aspartic acid    Asp    D
Cysteine         Cys    C
Glutamine        Gln    Q
Glutamic acid    Glu    E
Glycine          Gly    G
Histidine        His    H
Isoleucine       Ile    I
Leucine          Leu    L2    UUR
Leucine          Leu    L1    CUN
Lysine           Lys    K
Methionine       Met    M
Phenylalanine    Phe    F
Proline          Pro    P
Serine           Ser    S1    AGN
Serine           Ser    S2    UCN
Threonine        Thr    T
Tryptophan       Trp    W
Tyrosine         Tyr    Y
Valine           Val    V
CREx Figures
CO1 R ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB F 12S E T -16S -ND2 -I -ND1 -L2 -G -Y D -M V -C -W A -L1 -N Q -P
CO1 R ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB F 12S E T P -Q N L1 -A W C -V M -D Y G L2 ND1 I ND2 16S
(a)
-16S -ND2 -I -ND1 -L2 -G -Y D -M V -C -W A -L1 -N Q -P
P -Q N L1 -A W C -V M -D Y G L2 ND1 I ND2 16S
(b)
Figure A.2: Inversion 1 (I1) - Echinoidea vs. Asteroidea: (a) family diagram for Echinoidea (top) and Asteroidea (bottom); (b) nondirectional inversion of 16 tRNAs.
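The inversion in panel (b) can be reproduced on signed gene orders: the affected segment is reversed and every gene in it changes strand, i.e. sign. The following Python sketch is illustrative only; the helper names flip and invert are hypothetical and do not refer to the CREx implementation.

```python
def flip(gene):
    """Switch the strand of a signed gene name, e.g. 'ND1' <-> '-ND1'."""
    return gene[1:] if gene.startswith("-") else "-" + gene


def invert(order, start, end):
    """Invert order[start:end]: reverse the segment and flip every sign."""
    segment = [flip(gene) for gene in reversed(order[start:end])]
    return order[:start] + segment + order[end:]


# The two segments shown in panel (b) of Figure A.2.
top = "-16S -ND2 -I -ND1 -L2 -G -Y D -M V -C -W A -L1 -N Q -P".split()
bottom = "P -Q N L1 -A W C -V M -D Y G L2 ND1 I ND2 16S".split()
inverted = invert(top, 0, len(top))
```

Applying the same inversion twice restores the original order, which is why the figure describes this operation as nondirectional.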
CO1 R ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB P -Q N L1 -A W C -V M -D -T -E -12S -F -L2 -G -16S -Y -ND2 -I -ND1
CO1 R CO2 K ATP8 ATP6 ND4L CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB P -Q N L1 -A W C -V M -D -T -E -12S -F -L2 -G -16S -Y -ND2 -I -ND1
(a)
ND4L CO2 K ATP8 ATP6
CO2 K ATP8 ATP6 ND4L
(b)
Figure A.3: Transposition 1 (T1) - Florometra serratissima, Phanogenia gracilis vs. Gymnocrinus richeri: (a) family diagram for Florometra serratissima, Phanogenia gracilis (top) and Gymnocrinus richeri (bottom); (b) the nondirectional transposition of the gene ND4L.
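A transposition exchanges two adjacent blocks without changing any sign. The Python sketch below reproduces panel (b) under that reading; the function name transpose and its index convention are assumptions chosen for illustration.

```python
def transpose(order, i, j, k):
    """Exchange the adjacent blocks order[i:j] and order[j:k]."""
    if not 0 <= i <= j <= k <= len(order):
        raise ValueError("indices must satisfy 0 <= i <= j <= k <= len(order)")
    return order[:i] + order[j:k] + order[i:j] + order[k:]


# The two segments shown in panel (b) of Figure A.3.
top = "ND4L CO2 K ATP8 ATP6".split()
bottom = "CO2 K ATP8 ATP6 ND4L".split()
moved = transpose(top, 0, 1, 5)   # ND4L moves behind CO2 K ATP8 ATP6
back = transpose(moved, 0, 4, 5)  # the reverse transposition
```

The reverse move is again a single transposition, so the operation counts the same in both directions.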
CO1 R ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB F 12S E T P -Q N L1 -A W C -V M -D Y G L2 ND1 I ND2 16S
CO1 R ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB P -Q N L1 -A W C -V M -D -T -E -12S -F -L2 -G -16S -Y -ND2 -I -ND1
(a)
F 12S E T P -Q N L1 -A W C -V M -D
P -Q N L1 -A W C -V M -D -T -E -12S -F
(b)
Y G L2 ND1 I ND2 16S
-16S -ND2 -I -ND1 -L2 -G -Y
(c)
-16S -ND2 -I -ND1 -L2 -G -Y
-ND2 -I -ND1 -L2 -G -16S -Y
(d)
-ND2 -I -ND1 -L2 -G -16S -Y
-L2 -G -16S -Y -ND2 -I -ND1
(e)
Figure A.4: Tandem duplication random loss 3 (TDRL3) - Florometra serratissima, Phanogenia gracilis vs. Echinoidea: (a) family diagram for Florometra serratissima, Phanogenia gracilis (top) and Echinoidea (bottom); (b) inversion transposition, (c) inversion, (d) first transposition, and (e) second transposition.
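The inversion transposition of panel (b) combines the two elementary operations: one block is moved past its neighbour and inverted in the process. The sketch below reproduces that panel in Python; the function name inverse_transpose and its index convention are hypothetical.

```python
def flip(gene):
    """Switch the strand of a signed gene name, e.g. 'F' <-> '-F'."""
    return gene[1:] if gene.startswith("-") else "-" + gene


def inverse_transpose(order, i, j, k):
    """Move block order[i:j] behind block order[j:k] and invert the moved block."""
    if not 0 <= i <= j <= k <= len(order):
        raise ValueError("indices must satisfy 0 <= i <= j <= k <= len(order)")
    moved = [flip(gene) for gene in reversed(order[i:j])]
    return order[:i] + order[j:k] + moved + order[k:]


# The two segments shown in panel (b) of Figure A.4.
top = "F 12S E T P -Q N L1 -A W C -V M -D".split()
bottom = "P -Q N L1 -A W C -V M -D -T -E -12S -F".split()
result = inverse_transpose(top, 0, 4, len(top))
```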
CO1 R ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB P -Q N L1 -A W C -V M -D -T -E -12S -F -L2 -G -16S -Y -ND2 -I -ND1
CO1 R ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB F 12S E T P -Q N L1 -A W C -V M -D Y G L2 ND1 I ND2 16S
(a)
P -Q N L1 -A W C -V M -D -T -E -12S -F
F 12S E T P -Q N L1 -A W C -V M -D
(b)
-L2 -G -16S -Y -ND2 -I -ND1
ND1 I ND2 Y 16S G L2
(c)
ND1 I ND2 Y 16S G L2
Y G L2 ND1 I ND2 16S
(d)
Figure A.5: Tandem duplication random loss 3 (TDRL3) - Echinoidea vs. Florometra serratissima, Phanogenia gracilis: (a) family diagram for Echinoidea (top) and Florometra serratissima, Phanogenia gracilis (bottom); (b) inversion transposition, (c) inversion, and (d) tandem duplication random loss event. This scenario was favored in the analysis on grounds of parsimony.
CO1 R ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB F 12S E T P -Q N L1 -A W C -V M -D Y G L2 ND1 I ND2 16S
CO1 R E P N L1 W -V ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB F 12S T -Q -A C M -D Y G L2 ND1 I ND2 16S
(a)
ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB F 12S E T P -Q N L1 -A W C -V
E P N L1 W -V ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB F 12S T -Q -A C
(b)
CO1 R E P N L1 W -V ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB F 12S T -Q -A C M -D Y G L2 ND1 I ND2 16S
CO1 R ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB F 12S E T P -Q N L1 -A W C -V M -D Y G L2 ND1 I ND2 16S
(c)
E P N L1 W -V ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB F 12S T -Q -A C
P W ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB F 12S -Q C E N L1 -V T -A
P W ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB F 12S -Q C E N L1 -V T -A
W ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB F 12S C E -V T P -Q N L1 -A
W ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB F 12S C E -V T P -Q N L1 -A
ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB F 12S E T P -Q N L1 -A W C -V
(d)
Figure A.6: Tandem duplication random loss 1 (TDRL1): This example illustrates the asymmetry of TDRL operations (and of the corresponding distance measure) clearly: a single TDRL transforms the Echinoid order into that of C. miniata, whereas three TDRL rearrangements are needed to transform the gene order of C. miniata (A and B) into the Echinoid order. (a) Echinoidea vs. Cucumaria miniata and (b) the resulting TDRL; (c) Cucumaria miniata vs. Echinoidea and (d) the three resulting TDRLs in this direction.
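The asymmetry can be checked numerically. The Python sketch below relies on a known closed-form result for the TDRL model that is not stated in this appendix and is used here as an assumption: if the target order, read in terms of source positions, consists of k maximal increasing runs, then ceil(log2(k)) TDRL steps are required. The function name tdrl_steps is hypothetical, and the sketch treats the segments as linear orders over the same signed genes.

```python
def tdrl_steps(source, target):
    """Number of TDRL steps from source to target: ceil(log2(number of runs)).

    Assumption of this sketch: both orders are linear and contain the same
    signed genes; TDRL events never change the strand of a gene.
    """
    if sorted(source) != sorted(target):
        raise ValueError("source and target must contain the same signed genes")
    position = {gene: index for index, gene in enumerate(source)}
    ranks = [position[gene] for gene in target]
    runs = 1 + sum(1 for a, b in zip(ranks, ranks[1:]) if b < a)
    return (runs - 1).bit_length()  # integer form of ceil(log2(runs))


# The two segments shown in panel (b) of Figure A.6.
echinoidea = ("ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB "
              "F 12S E T P -Q N L1 -A W C -V").split()
c_miniata = ("E P N L1 W -V ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 "
             "-ND6 CYTB F 12S T -Q -A C").split()
forward = tdrl_steps(echinoidea, c_miniata)
backward = tdrl_steps(c_miniata, echinoidea)
```

Under the stated assumption the two directions give one and three steps respectively, in agreement with panels (b) and (d).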
CO1 R ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB P -Q N L1 -A W C -V M -D -T -E -12S -F -L2 -G -16S -Y -ND2 -I -ND1
CO1 CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H R ND4L S1 ND5 -ND6 CYTB P -Q N L1 W C M -A -V -D -T -E -12S -F -L2 -G -16S -Y -ND2 -I -ND1
(a)
R ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H
CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H R ND4L
(b)
-A W C -V M
W C M -A -V
(c)
Figure A.7: The rearrangements (T2, TDRL2) of Antedon mediterranea with respect to Florometra serratissima are shown above. Note that TDRL2 acts in the direction of A. mediterranea (in the direction of F. serratissima one more TDRL would be needed - see Figure A.8). Tandem duplication random loss 2 (TDRL2) - Florometra serratissima, Phanogenia gracilis vs. Antedon mediterranea: (a) family diagram for Florometra serratissima, Phanogenia gracilis (top) and Antedon mediterranea (bottom); (b) transposition T2, and (c) the TDRL3 favored in this analysis.
CO1 CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H R ND4L S1 ND5 -ND6 CYTB P -Q N L1 W C M -A -V -D -T -E -12S -F -L2 -G -16S -Y -ND2 -I -ND1
CO1 R ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H S1 ND5 -ND6 CYTB P -Q N L1 -A W C -V M -D -T -E -12S -F -L2 -G -16S -Y -ND2 -I -ND1
(a)
CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H R ND4L
R ND4L CO2 K ATP8 ATP6 CO3 -S2 ND3 ND4 H
(b)
W C M -A -V
M -A W C -V
(c)
M -A W C -V
-A W C -V M
(d)
Figure A.8: Tandem duplication random loss 2 (TDRL2) - Antedon mediterranea vs. Florometra serratissima, Phanogenia gracilis: (a) family diagram for Antedon mediterranea (top) and Florometra serratissima, Phanogenia gracilis (bottom); (b) transposition T2, (c) and (d) the alternative scenario TDRLs - see Figure A.7.
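A single TDRL event can be modelled as follows: the segment is duplicated in tandem, and each gene is then lost from exactly one of the two copies. The Python sketch below applies this model to the five-gene segments of Figures A.7 and A.8; the function name tdrl and the way the retained genes are specified are assumptions made for illustration and do not describe the CREx implementation.

```python
def tdrl(order, keep_in_first_copy):
    """Apply one tandem duplication random loss event to a gene order.

    Genes in keep_in_first_copy survive in the first copy; all other genes
    survive in the second copy. Relative order within each copy is preserved.
    """
    kept = set(keep_in_first_copy)
    if not kept <= set(order):
        raise ValueError("keep_in_first_copy contains genes that are not in order")
    first_copy = [gene for gene in order if gene in kept]
    second_copy = [gene for gene in order if gene not in kept]
    return first_copy + second_copy


florometra = "-A W C -V M".split()
antedon = "W C M -A -V".split()

# One event suffices in the direction of A. mediterranea (Figure A.7, panel c).
forward = tdrl(florometra, {"W", "C", "M"})

# Two events are needed in the opposite direction (Figure A.8, panels c and d).
step1 = tdrl(antedon, {"M", "-A"})
step2 = tdrl(step1, {"-A", "W", "C", "-V"})
```

The intermediate order after the first event of the reverse direction is M -A W C -V, as shown in panel (c) of Figure A.8.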
Summary
The use of molecular markers for reconstructing phylogenetic relationships has increased strongly over the last decades. They are currently an important component of phylogenetic studies at different taxonomic levels. In particular, they are drawn on whenever morphological data yield no resolution, or only insufficient resolution, of the evolutionary relationships.
However, the wealth of collected molecular data and its steady growth confront scientists with new challenges in handling these data. The still fairly young discipline of bioinformatics, which unites aspects of biology, computer science, and mathematics, has developed various tools for handling and analysing such complex data.
Using the mitochondrial genome, this dissertation addresses two subproblems in dealing with molecular markers and presents an approach to solving each.
The first problem arises from the wealth of currently available data and its exponential growth. The foundation of a phylogenetic reconstruction is a comparison of character strings and the information they carry; in the case of molecular markers these are very often gene sequences. This comparison is the basis for a careful reconstruction of the relationships. However, multiple substitutions, point mutations, and, for example, the frequent exchange (wobbling) of the third codon position in protein-coding genes lead to alignments whose information content is very hard or impossible to recognise. It has been shown in the literature that these positions often contain important information but also phylogenetic noise. In the past, such positions were mostly discarded by manual removal, which often made it impossible to reproduce the results. This is the starting point for the NOISY algorithm described and implemented in this dissertation. The resulting computer program lets the user assign a reliability value to every position in a sequence alignment and test to what extent the pattern at that position could also have arisen by chance. The underlying assumption is that patterns produced by sequence evolution only rarely arise by chance. As a result, the user obtains a graphical overview of the information content and of the distribution of informative positions across the entire alignment, within a framework that anyone can reproduce.
The second problem I took up concerns tapping phylogenetic information beyond sequence analysis. It has been shown in the literature that the arrangement of genes in the mitochondrial genome is a rich source of information, above all for very old splitting events. In this dissertation I present two new algorithms and their implementation to make exactly this information usable.
The first algorithm, implemented in the computer program CIRCAL, makes it possible to compare gene arrangements of different lengths by means of aligned lists. A progressive approach was used to combine pairwise comparisons of aligned order lists with methods for the multiple comparison of gene arrangements. This very simple and fast approach subsequently makes it possible to apply common reconstruction methods such as Neighbor Joining, Maximum Parsimony, Maximum Likelihood, or Bayesian methods.
The second algorithm comprises the development and implementation of the CREx algorithm. It is based on detecting patterns in so-called 'strong interval trees', which reflect the corresponding genome rearrangement operations. In addition, the so-called 'tandem duplication random loss' operation (TDRL) was recognised as a further complex event in mitochondrial genome evolution and implemented. Based on the information about which strand a gene lies on, this operation can describe very complex rearrangements involving a large number of inversions with a single event. A graphical overview makes it possible to quickly survey the operations that may have taken place between differently evolving genomes. Furthermore, hypothetical or competing topologies can be tested on the basis of the gene order information of the mitochondrial genomes. Analyses that I carried out with the CREx method showed that it is able both to recover published and well-described rearrangement scenarios and to identify new ones, or to detect shorter plausible scenarios of gene order evolution. In a future step it should be possible to reconstruct phylogenetic relationships from the information in mitochondrial gene order alone.