Computational investigations into
cis-regulation in eukaryotes
Laurence Ettwiller
Jesus College
A dissertation submitted to the University of Cambridge
for the degree of Doctor of Philosophy
European Molecular Biology Laboratory
European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton, Cambridge, CB10 1SD
United Kingdom
Email: [email protected]
December 22, 2005
To my grandmother, for everything she taught me, especially courage, perseverance and so many other things.
This thesis is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specifically indicated in the text.
This thesis does not exceed the specified length limit of 300 pages as defined by the Biology Degree Committee.
This thesis has been typeset in 12pt type using LaTeX 2ε according to the specifications defined by the Board of Graduate Studies and the Biology Degree Committee.
I would like to thank everyone who supported me during my PhD. This includes the Ensembl team, especially Ben Paten, Abel Ureta-Vidal, Manu Mongin, Martin Hammond and Arek Kasprzyk. I would also like to thank Ewan Birney, my supervisor, for all his help and support. Lastly, I thank my parents, my family and my friends, Arnaud, Sylvain, Chloe, Wei, Shu Ching and Ling, and of course my boyfriend Tom.
This thesis presents essentially two computational methods that I developed to locate cis-regulatory motifs in eukaryotes. Both methods are based on information that has been shown in the past to be successful in locating regulatory regions, but the approaches I used are novel.
The first method uses information about the co-regulation of genes to derive a dictionary of interesting motifs. This is done by uncovering potential mappings between the upstream regulatory sequences of genes and protein functions in S. cerevisiae. In contrast to the conventional approach, which defines co-regulated groups of genes on the basis of similar expression profiles, co-expression has been investigated using functional networks. The idea behind the investigation is that proteins involved in the same cellular process should be regulated in synergy. Motifs of interest should therefore be limited to a specific set of genes, and this set of genes should have a significant non-random correlation with the input functional information.
The second method uses comparative genomics and the notion that functional regions are conserved across species. This method predicts a dictionary of regulatory motifs based on occurrence in non-coding regions that are conserved between many vertebrate species. Once the dictionary of motifs is obtained, the genome-wide distribution of the motifs is investigated and, based on these results, functional regions for transcriptional control are predicted.
Contents
1 Introduction
  1.1 Biological background
    1.1.1 DNA accessibility
    1.1.2 Trans-activator/repressor and cis regulatory elements
    1.1.3 Post-transcriptional regulation of gene expression
    1.1.4 Gene regulation and cellular function
  1.2 Experimental approaches to find regulatory regions
    1.2.1 One-by-one gene analysis
    1.2.2 High throughput analysis
  1.3 Bioinformatic approach to finding regulatory elements
    1.3.1 Background
    1.3.2 Finding over-represented motifs on unrelated sequences
    1.3.3 Phylogenetic footprinting to find cis-regulatory elements
    1.3.4 Issues
    1.3.5 Finding eukaryotic promoters
2 Finding regulatory regions using functional information in yeast
  2.1 Introduction
  2.2 Example: the nucleotide pathway in yeast
  2.3 Useful functional network
    2.3.1 Metabolic network
    2.3.2 Protein interaction
  2.4 Generating and assessing motifs
    2.4.1 Generating motifs
    2.4.2 Assessment of the motifs using functional networks
  2.5 Results
    2.5.1 Significant motifs
    2.5.2 Non-random behaviour of significant motifs
    2.5.3 Assessment of known transcription factor binding sites
    2.5.4 Inferring functionality to putative motifs
    2.5.5 Promoter scanning
    2.5.6 Discovering cis-regulatory elements using functional network in higher eukaryotes
  2.6 Conclusion
3 Evolution dynamic of cis-regulatory regions in higher eukaryotes
  3.1 Detailed analysis of a specific example: the Atonal 5 gene
    3.1.1 The Atonal 5 protein
    3.1.2 The promoter of atonal5 gene
    3.1.3 The Atonal5 motif
    3.1.4 Experimental validations
    3.1.5 Conclusion regarding this example
  3.2 Global run of promoterwise
    3.2.1 Promoterwise: the algorithm
    3.2.2 Defining the cut-off
    3.2.3 Results
    3.2.4 Genes with conserved 5’ proximal intergenic regions
  3.3 Conclusion
4 Defining a mammalian dictionary of regulatory motifs
  4.1 Introduction
  4.2 Finding functional motifs
    4.2.1 Derivation of a reliable motif dictionary
    4.2.2 Finding region of clustered motifs on the human genome
  4.3 Experimental evaluation of the methodology
    4.3.1 The FOXM1 gene
    4.3.2 The ARF3 gene
    4.3.3 The Q99JW1 gene
    4.3.4 The Q9BU67 gene
    4.3.5 The SM31 gene
    4.3.6 The ZIC1 gene
  4.4 Conclusion
5 Effect of the ATG triplet on gene expression in yeast
  5.1 Introduction
  5.2 ATG codon at the genomic level
  5.3 ATG codon at the transcript level
  5.4 The upf genes
  5.5 Conclusion
6 Conclusion
  6.1 Perspective and further work
A Publications during the PhD work
B Finding regulatory motifs using functional network in yeast: material and method
  B.1 Networks generation
    B.1.1 Metabolic network
    B.1.2 Protein interaction network
  B.2 Pattern search
  B.3 Overlap score
  B.4 Standard deviation score
  B.5 Pattern clustering and sequence logo generation
C Yeast significant motifs
Bibliography
List of Tables
2.1 Assessment of known sites
3.1 Atonal5 homologs gene names and locations
3.2 Scores for different species
3.3 Human-mouse enriched gene classes
3.4 Human-fugu enriched gene classes
3.5 Human-mouse under-represented gene classes
4.1 Table of motifs
4.2 Candidates: ensembl id
C.1 Significant motifs in yeast
List of Figures
1.1 Gene structure
1.2 Gene expression - an overview
1.3 Lac operon in bacteria
1.4 Example of a synexpression group in higher eukaryote
2.1 Example of the nucleotides pathway in yeast
2.2 Graph data structure
2.3 Overall schema
2.4 Overlap score explanation
2.5 Overlap score distribution for MDS network
2.6 Overlap score distribution for KEGG network
2.7 Overlap score distribution for Cellzome network
2.8 Overlap network for motif TGACTC
2.9 Overlap network for motif d(A)-d(T)
2.10 Motif location relative to coding start site
2.11 Promoter scanning example
3.1 Promoterwise: the schema
3.2 GFP construct under Atonal5 promoter
3.3 Conserved region 1 in the Atonal 5 promoter
3.4 Conserved region in the Atonal 5 promoter
3.5 Candidate genes for CCACCTG motif
3.6 Known Atonal5 targets with conserved motifs
3.7 Schema of the procedure
3.8 Promoterwise: positive upstream region function of the score cut-off
3.9 Promoterwise: are hits reverse-complemented?
3.10 Promoterwise: example of an inversion
3.11 GO category
4.1 Schema of the procedure
4.2 Occurrence of motifs in conserved/non conserved regions
4.3 Density function of the motif occurrence
4.4 Occurrence of motifs in conserved/non conserved regions for cg motifs
4.5 Conserved motifs in conserved regions
4.6 Density of motifwise hits around gene starts
4.7 Comparison with transfac
4.8 Motifwise example
4.9 Candidate: Foxm1
4.10 Candidate: ARF3
4.11 Candidate: Q99JW1
4.12 Candidate: Q9BU67
4.13 Candidate: SM31
4.14 Candidate: SM31 fish construct
4.15 Candidate: ZIC1
5.1 Distribution of ATG upstream of the coding start
5.2 Density distribution of expression data in yeast
5.3 Effect of the first 5’ ATG on expression in yeast
5.4 Effect of the presence of an ATG in the 5’UTR on expression in yeast
5.5 The upf mutants
Chapter 1
Introduction
This past decade has witnessed a major change in the biological sciences
due to the rapid development of high throughput technologies; in particular,
DNA sequencing. It is now possible to sequence whole genomes, and as of
now around 20 eukaryote and over 100 prokaryote genomes are either finished
or about to be finished. This includes the completion of the human genome in
2001 by the International Human Genome Sequencing Consortium (IHGSC
et al., 2001), one of the great milestones in the field of biology. Considering
that just 50 years have passed since the discovery of the structure of DNA by
Watson and Crick (Watson and Crick, 1953), this is an important advance in biology.
The availability of this flood of information of genome sequences and other
data has also revolutionized the way scientists approach biological problems.
The analysis of these data has tremendous potential from the understanding
of basic biological processes to human medicine. However, this raw informa-
tion needs to be treated using computational procedures and as a consequence
of such demand, the field of bioinformatics has blossomed.
The field of computational biology is quite large, and its overall aim is to
answer biological questions using computational tools. The mechanisms of
gene regulation, and more specifically the prediction of cis-regulatory elements,
remain among the questions that are mostly unsolved. This is the subject of
the PhD work presented here. I will first introduce the biological background
and then the existing computational approaches that attempt to address
this challenge.
1.1 Biological background
Deoxyribonucleic acid (DNA), the molecule that stores the genetic information
of nearly all organisms, is a polymer composed of four types of chemical
units called nucleotides. The polymer is arranged in a double helix of two
complementary anti-parallel chains. In eukaryotes the DNA is organized in
chromosomes and the complete set of chromosomes constitutes the genome.
For example, Homo sapiens has 3.2 × 10^9 base pairs (bp) in 24 chromosomes
that contain virtually all the information any cell needs for its maintenance,
propagation and differentiation. Most Homo sapiens cells are diploid;
they possess 2 complete sets of chromosomes.
One of the earliest features discovered in the DNA is the coding gene and we
know now that it is encoded on a limited physical stretch of DNA that ulti-
mately determines the sequence of a protein (see Figure 1.1 for details). The
gene is composed of exons interrupted by introns and the coding sequence
is flanked by untranslated regions (UTR) necessary for the stability and reg-
ulation of the transcript. The protein coding gene has a well understood
grammar, making it relatively easy to differentiate this feature from the rest
of the genome (Burge and Karlin, 1997).
Scientists have a good idea of the number of coding genes per mammalian
genome, currently estimated to be between 25,000 and 28,000 for human (Crollius
et al., 2000) or mouse (Waterston et al., 2002). Non-coding genes have
also been characterised, genes for tRNA and rRNA being the most studied.
Recently, more non-coding RNA types have been discovered (Eddy, 2001).
For example, micro RNAs were found to be involved in the regulation of the
translation of coding mRNAs. So far, no good estimate has been given of
how many of these non-coding genes are present in the genome.
Proteins, the product of coding genes, are the basic functional molecules
of the cell and have many roles, from catalysing biochemical reactions to
regulating complex pathways. The DNA that is not coding for proteins has
a number of associated functions including gene regulation.
To be functionally active, coding genes need to be transcribed into mRNA
molecules which, in turn, are translated into proteins that may or may not need
further processing to become functional. This whole process, common to all
living organisms, is termed gene expression and is fundamental to the
understanding of life. Gene expression involves many steps that are described
in more detail in Figure 1.2. The first step, commonly called transcription,
consists of generating a pre-messenger RNA (or pre-mRNA) from a DNA
template; the intron(s) are then spliced out during the maturation of the
transcript to form a mature mRNA. In eukaryotes, most of the synthesis of mRNA
precursors is done by the RNA polymerase II complex.
Figure 1.1: Typical gene structure in eukaryotes: the gene contains exon(s) and often intron(s) that are spliced out during maturation. (Diagram labels: intergenic DNA, 5’UTR, exon 1, intron, exon 2, 3’UTR, coding sequence, gene.)
The mRNA is, in turn, used as a template for the synthesis of the polypeptide
chain in a process called translation, and is catalysed by ribosomes. While
being synthesised, the nascent polypeptide adopts a 3D structure and eventually
forms a native protein with biological function. Only a subset of all
possible proteins is present in a particular cell type, and it is important to
keep tight control of this subset. The presence of a protein at the wrong time
or place can be deleterious for the cell or the organism.
The regulation of gene expression is therefore crucial for living organisms
and happens in all the stages described in Figure 1.2. Nevertheless, the com-
mitment of the cell to make mRNA is the most effective point of control in
gene expression. Despite its importance, many aspects of the regulation of
transcription remain unclear. What is known is that some of its elements lie
mainly in the intergenic DNA; the other elements are epigenetic, but it is
not clear in what proportion the epigenetic factors influence gene regulation.
It is the success of recruiting the transcriptional machinery a few base pairs
upstream of the start of the gene that determines the expression of the gene.
As we will see below, many levels of regulation dictate this success or failure.
Figure 1.2: The central dogma in biology: One DNA strand is used as a template to synthesise a pre-mRNA by the RNA polymerase (transcription). This pre-mRNA is then matured into a mRNA and, consequently, is used as a template by the ribosome machinery to produce a polypeptide (translation). (Diagram: DNA is transcribed into the mRNA AUG CAG AAA GCU, which is translated into the protein Met-Gln-Lys-Ala.)
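The two steps of Figure 1.2 can be illustrated with a short sketch. It is only an illustration: the function names are hypothetical, the codon table is deliberately restricted to the four codons of the figure plus the three stop codons, and the input is assumed to be the coding (sense) DNA strand, so that transcription amounts to replacing T with U.

```python
# Partial codon table (assumption: only the codons needed for the
# Figure 1.2 example, plus the three standard stop codons).
CODON_TABLE = {
    "AUG": "Met", "CAG": "Gln", "AAA": "Lys", "GCU": "Ala",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA sequence for a DNA coding (sense) strand."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> list[str]:
    """Translate an mRNA codon by codon, stopping at the first stop codon."""
    peptide = []
    usable_length = len(mrna) - len(mrna) % 3  # ignore a trailing partial codon
    for i in range(0, usable_length, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "Stop":
            break
        peptide.append(residue)
    return peptide


mrna = transcribe("ATGCAGAAAGCT")
peptide = translate(mrna)
print(mrna, peptide)
```

A complete implementation would use the full 64-codon genetic code; the restricted table keeps the example short and makes any unexpected codon fail loudly with a KeyError.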
1.1.1 DNA accessibility
The transcription machinery, as well as the necessary associated proteins,
needs to physically reach the location on the DNA molecule that will permit
the complex to start transcription. Yet the genome is compressed by
a linear factor of about 1 × 10^4 to 1 × 10^5, and this compression is achieved by
proteins (mainly histones, but also non-histone proteins) that form a dynamic
polymer called chromatin. The degree and type of compression vary according
to many factors that influence the chromatin conformation. At a much
higher level, this compression forms a well-defined chromosomal
architecture with densely and loosely packed regions. Both the chromosomal
architecture and the chromatin structure determine the accessibility of the
DNA to the transcriptional machinery.
1. Chromosomal architecture: Recent technical advances have shown evidence
of discrete territories in an individual chromosome where some
parts of the DNA are deeply buried and others are easily accessible to
a battery of proteins (Cremer and Cremer, 2001). This architecture is
well-defined and is cell type specific or developmental stage specific, so
that different parts of the DNA are accessible in different cells.
The location of a DNA region relative to other regions in the nucleus,
such as the interchromatin compartment or the nuclear lamina, is very important
for gene expression, and a remodelling of this architecture leads
to a long-term change in gene expression.
2. Chromatin conformation: Even if a DNA region is exposed to less
condensed areas of the nucleus, the local structure of the chromatin affects
the accessibility of the transcription start site. It has been shown
that many post-translational modifications of the histones determine
the state of the chromatin (open or closed), and only open chromatin
allows efficient gene transcription. Histone modifications include acetylation,
phosphorylation and methylation, and the combinatorial nature
of these modifications has led to the proposal of a histone code
(Turner, 2000) along the same lines as the genetic code.
1.1.2 Trans-activator/repressor and cis regulatory elements
In exposed open chromatin regions, trans-activators and repressors play a
key role in gene expression.
These trans-acting elements are proteins that either bind directly to the DNA
or bind to another trans-factor. The mode of control depends on the nature
of the protein, but it usually directly enhances and/or inhibits the initiation of
transcription; it can also play a role in modifying the chromatin structure.
A transcription factor that binds DNA generally has two domains, the activation
domain and the DNA binding domain, and may form homo- or hetero-dimers.
The binding to DNA is usually sequence-specific, meaning that selectivity
is given by direct contact between the polypeptide chain of the protein
and the exposed edges of the base pairs in the DNA (usually in the major
groove). These direct interactions can be complemented by the bendability
of the DNA, but this is usually a secondary effect. Each transcription factor
recognises a specific DNA sequence called a cis-regulatory element.
These elements are usually located in intergenic DNA around the gene that
they regulate, but can also be found in introns (especially the first intron
(Majewski and Ott, 2002)). The promoter is the region located directly
upstream of the gene. In addition to containing gene-specific regulatory ele-
ments, the promoter can also contain all the necessary binding sites for the
basal transcription machinery like the CAAT or the TATAA box, though not
all promoters contain these signals.
Many mammalian promoters also contain so-called ’CpG islands’. CpG is a
special dinucleotide in the human genome. Indeed, in higher eukaryotes a
significant number of CpG dinucleotides are methylated and the methylated
nucleotide is misrecognised by the DNA polymerase machinery with a higher
frequency than the background mutation rate (Sved and Bird, 1990). The
amount of CpG in the genome is therefore much lower than expected. In
cis-regulatory regions, methylation occurs less frequently around functional
elements in order to keep the chromatin open. Consequently, the fraction of
CpGs is higher there, and these CpG-rich regions around genes, called CpG
islands, are easily recognised.
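The notion of a CpG count that is "lower than expected" can be made concrete with an observed/expected ratio. The sketch below uses one commonly used definition (observed CpG count, scaled by sequence length, divided by the product of the C and G counts); this particular formula and the function name are assumptions introduced for illustration and are not taken from this thesis.

```python
def cpg_observed_expected(seq: str) -> float:
    """Ratio of observed CpG dinucleotides to the number expected
    from the C and G content of the sequence alone."""
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0  # no CpG is possible without both C and G
    cpg_count = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    return cpg_count * len(seq) / (c_count * g_count)


print(cpg_observed_expected("CGCG"))      # CpG-rich toy sequence
print(cpg_observed_expected("CACAGTGT"))  # C and G present, but no CpG
```

A ratio well below 1 corresponds to the CpG depletion described above, whereas windows with a ratio close to or above 1 are candidates for CpG islands; any threshold used in practice would have to be chosen separately.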
Cis-regulatory elements found further away from the genes are in regions
called modules (also called enhancer or locus-control regions). A module is
defined as a cluster of binding sites that produces a discrete aspect of the
total transcription profile. A single module typically contains about 6 to 15
binding sites and binds 4 to 8 different transcription factors (Arnone and
Davidson, 1997).
Variation in the affinity of a binding site is commonly achieved by slightly
changing the nucleotide sequence of the element. This variability in the
sequence of the element results in fine-tuned control of the expression of the gene,
but also implies that the strictly conserved sequence can be very small (typically
6-10 bp) and very difficult to detect against the background noise.
An excellent review of cis-regulatory sites is given by Wray et al. (2003).
1.1.3 Post-transcriptional regulation of gene expression
Once the pre-mRNA is synthesised, transcript maturation and turnover, as
well as translation, are also under tight control. For example,
the rate of translation of the ferritin heavy chain mRNA is controlled by the
iron-responsive element (IRE) binding protein that acts as a translational
repressor by binding to the IRE site located on the transcript (Munro et al.,
1988).
Post-transcriptional regulation is mainly achieved by controlling the rate of
degradation of the messenger RNA. Indeed, at any moment the total amount
of a specific transcript in the cell is the result of two antagonistic processes,
namely the rate of RNA synthesis (transcription), and the rate of degrada-
tion (RNA catabolism). The degradation process is an active process that
involves many regulatory and enzymatic steps. It may seem a waste of
energy to actively degrade a transcript, but it has been shown that degradation
is also a powerful mechanism for gene regulation. Furthermore, each
transcript has a different degradation rate, and this rate can vary greatly
from condition to condition (cell type, cell cycle, stress).
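The balance between synthesis and degradation described above can be written as a minimal model. As an illustration only, assume a constant synthesis rate k_s and first-order degradation with rate constant k_d, where m(t) is the abundance of the transcript; these symbols and the first-order assumption are introduced here and are not part of the original text.

```latex
\frac{dm}{dt} = k_s - k_d\, m
\qquad\Longrightarrow\qquad
m^{*} = \frac{k_s}{k_d},
\qquad
t_{1/2} = \frac{\ln 2}{k_d}
```

Under this assumption the steady-state level m* changes as much when k_d is halved as when k_s is doubled, which is one way to see why regulated degradation can be as effective a control point as transcription.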
Many pathways of mRNA turnover have been reported in the literature
(Parker and Song, 2004). The most studied process involves shortening of
the poly(A) tail followed by the decapping of the transcript, and finally the
5’-3’ exonucleolytic degradation. Other pathways involve either the direct
decapping of the transcript, the use of a 3’ to 5’ exonucleolytic decay or the
use of endonucleolytic cleavage by endonucleases.
Each transcript has a different intrinsic susceptibility to degradation by
these pathways and, in addition, cis- or trans-factors can act upon the transcript
and change its rate of degradation. For example, it has been shown
that premature termination codons trigger the decapping of the mRNA,
which exposes the transcript to 5’ to 3’ exonuclease degradation (Maquat
and Carmichael, 2001). This process, known as nonsense-mediated mRNA
decay, or NMD, is known to be used as a surveillance mechanism to promptly
remove mRNAs carrying frameshift or nonsense mutations.
The exact mechanism of NMD remains obscure but it is known to be tightly
coupled with translation by ribosomes. In mammals, if translation terminates
more than 50-55 nucleotides upstream of the last exon-exon junction,
the stop codon is considered premature and NMD is triggered. In yeast, where
fewer transcripts bear introns, NMD seems to be triggered when a significant
part of the mRNA length is free of ribosomes. In yeast, genetic studies
identified three proteins that are involved in NMD (upf1, upf2 and upf3).
Mutation in any one of these three proteins leads to a defective NMD without
affecting the other degradation processes (Schell et al., 2003; Cui et al., 1995).
Homologues of upf1, upf2 and upf3 are found in human and were shown
to be involved in NMD as well.
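The mammalian rule stated above can be sketched as a simple check on transcript coordinates. This is a simplified illustration: the function name and arguments are hypothetical, positions are assumed to be 0-based coordinates on the mature mRNA, and the default threshold of 50 nucleotides is an assumption taken from the lower bound of the 50-55 range quoted in the text.

```python
def predicted_nmd_target(stop_codon_end: int,
                         exon_lengths: list[int],
                         threshold: int = 50) -> bool:
    """Apply the '50-55 nucleotide' rule to a spliced transcript.

    stop_codon_end: position just past the stop codon on the mature mRNA.
    exon_lengths:   lengths of the exons, in 5' to 3' order.
    threshold:      minimal distance (nt) between the stop codon and the
                    last exon-exon junction for NMD to be predicted.
    """
    if len(exon_lengths) < 2:
        return False  # no exon-exon junction, so the rule cannot apply
    last_junction = sum(exon_lengths[:-1])
    return last_junction - stop_codon_end > threshold


exons = [200, 150, 300]                   # last junction at position 350
print(predicted_nmd_target(250, exons))   # stop 100 nt upstream of the junction
print(predicted_nmd_target(320, exons))   # stop only 30 nt upstream
print(predicted_nmd_target(400, exons))   # stop in the last exon
```

The check deliberately ignores the yeast situation, where the text indicates that the trigger is related to the ribosome-free length of the mRNA rather than to exon-exon junctions.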
Since most eukaryotic translation happens by a scanning process and not
via an internal ribosome entry, a premature stop codon triggered by an up-
stream open reading frame (uORF), for example, can possibly result in an
extended 3’ region of the transcript free of ribosomes and turn on the NMD
process for that transcript. Chapter 5 is devoted to the study of uORFs in
yeast transcripts and their effect on gene expression.
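As a rough illustration of what searching for uORFs involves, the following sketch lists every ATG located in the 5’ region of a transcript together with its first in-frame stop codon. The function name, the DNA alphabet and the coordinate convention (0-based, end exclusive) are assumptions made for this example; it is not the procedure used in Chapter 5.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_uorfs(transcript: str, main_start: int) -> list[tuple[int, int]]:
    """Return (start, end) for each ORF whose ATG lies entirely upstream
    of the main start codon at position main_start.

    'end' is the position just past the first in-frame stop codon.
    Upstream ATGs without an in-frame stop in the transcript are skipped.
    """
    transcript = transcript.upper()
    uorfs = []
    for start in range(0, main_start - 2):
        if transcript[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the frame set by this upstream ATG.
        for pos in range(start + 3, len(transcript) - 2, 3):
            if transcript[pos:pos + 3] in STOP_CODONS:
                uorfs.append((start, pos + 3))
                break
    return uorfs


# Toy transcript: 5'UTR "GGATGAAATAGCCC" followed by the main ORF "ATGGCTGCTTAA".
toy_transcript = "GGATGAAATAGCCCATGGCTGCTTAA"
print(find_uorfs(toy_transcript, main_start=14))
```

In this toy transcript the upstream ATG at position 2 is followed by an in-frame TAG, so a single short uORF is reported that terminates before the main start codon.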
1.1.4 Gene regulation and cellular function
Sets of genes are usually expressed simultaneously in order to produce pro-
teins that, together, perform a given task. To be functionally active, pro-
teins need to associate with others, and the type of association defines the
functional information. Consequently, genes that are co-regulated are of-
ten functionally related. This has been proven to be true for many cases,
both in eukaryotes and prokaryotes, despite a very different mechanism of
co-regulation between these two groups.
1.1.4.1 Mechanisms of co-regulation of functionally related proteins in prokaryotes
Co-regulation in prokaryotes is often due to operons, and even though this
work involves only eukaryotes, operons are nice examples of a well studied
mechanism that keeps functionally related genes under similar regulations.
Operons in prokaryotes were first described by F. Jacob and J. Monod in
1960 (Jacob et al., 1960). The operon is a coordinately regulated unit that
contains a set of genes. At the genomic level this unit consists of genes that
are contiguous on the same strand of DNA and a regulatory unit located
directly in the upstream region. Operons have been studied for many years,
and it has been shown that, in most cases, operon units contain functionally
related genes, often the complete set of genes involved in one particular pathway.
This organisation is believed to be advantageous for the coordinated
expression of related genes, but it has also been suggested that operons play
an important role in gene transfer because a complete functional unit can be
given to another bacterium by the transfer of a single limited stretch of
DNA (Lawrence and Roth, 1996). On the genome, the presence of operons
leads to two interesting features: functionally related genes remain together
even across many species, and genes within operons have much shorter
intergenic distances. Based on these observations, one study estimated a total
of 630-700 operons in E. coli (Salgado et al., 2000). Figure 1.3 shows one
of the most studied operons in E. coli, the Lac operon.
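The observation that genes within operons lie on the same strand and have short intergenic distances suggests a simple grouping heuristic, sketched below. This is a toy illustration and not the method of the cited study: the Gene record, the function name, the 50 bp gap threshold and the gene coordinates are all hypothetical.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int
    end: int
    strand: str  # "+" or "-"


def group_candidate_operons(genes: list[Gene], max_gap: int = 50) -> list[list[str]]:
    """Group adjacent genes that share a strand and are separated by at
    most max_gap bases; groups of a single gene are discarded."""
    groups: list[list[Gene]] = []
    for gene in sorted(genes, key=lambda g: g.start):
        if groups:
            previous = groups[-1][-1]
            same_strand = gene.strand == previous.strand
            if same_strand and gene.start - previous.end <= max_gap:
                groups[-1].append(gene)
                continue
        groups.append([gene])
    return [[g.name for g in group] for group in groups if len(group) > 1]


# Made-up coordinates, for illustration only.
toy_genes = [
    Gene("geneA", 100, 1000, "+"),
    Gene("geneB", 1020, 2000, "+"),
    Gene("geneC", 2030, 2500, "+"),
    Gene("geneD", 3000, 3500, "+"),
    Gene("geneE", 3510, 4000, "-"),
]
print(group_candidate_operons(toy_genes))
```

Here geneA, geneB and geneC are grouped, geneD is excluded because of the 500 bp gap, and geneE is excluded because it lies on the opposite strand.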
Operons occur rarely in eukaryotes, apart from nematodes where a large
portion of genes is arranged in operons. The mechanism for nematode operons
is entirely different from that in bacteria. The bacterial operon produces a
polycistronic mRNA, while the nematode operon produces a polycistronic pre-mRNA
that is trans-spliced into many mono-cistronic mRNAs. As in prokaryotes,
genes that encode functionally related proteins have been shown to occur
often in the same operon, suggesting a similar selection pressure in nematodes
to co-express functionally related proteins. In higher eukaryotes operon
structures have not been characterized and seem unlikely to occur.
Figure 1.3: The lac operon in E. coli consists of 3 genes involved in the catabolism of lactose. These genes are under the control of a single promoter that is repressed by the operator in absence of lactose. Once the promoter is activated a polycistronic mRNA is synthesised. (Diagram labels: promoter, operator, structural genes; transcription and translation yield galactosidase, permease and transacetylase.)
1.1.4.2 Mechanisms of gene expression in eukaryotes
Co-expression of genes that are involved in a common process have been
widely reported. An excellent review article by (Niehrs and Pollet, 1999)
summarises the current knowledge of co-expression of functionally related
genes in eukaryotes, or what the authors call a ’synexpression group’. In
yeast, where expression of the entire transcriptome can be easily monitored,
synexpression groups were reported in various biological processes like the
cell cycle, metabolism or protein bio-synthesis. Synexpression groups have
also been wildly reported in higher eukaryotes, including mammalian organ-
isms. For example, genes involved in the synthesis of cholesterol also have a
1.1. BIOLOGICAL BACKGROUND 11
reductaseHMG CoA
C5C6 C15 C30 C30 C30 C29C2
C4+
C5
IPP isomerase
Squaleneepoxidase
Cyt. P450demethylase
FDPfarnesyltransferase
CholesterolC27
B
A
Figure 1.4: A) biosynthesis pathway for the production of cholesterol inhumans.(from (Niehrs and Pollet, 1999)) B) expression profiles of HMG CoAreductase (1) IPP isomerase (triplicate)(2-4), farnesyl-diphosphate farnesyltransferase (5), squalene epoxidase (6), Cytochrome P450 lanosterol 1,4-alfa-demethylase (7) in starved human fibroblasts after serum addition. These 7genes have similar expression profiles and are functionally related by beingpart of the same metabolic pathway (Iyer et al., 1999).
very similar expression profile in starved human fibroblasts after serum
addition, as shown in Figure 1.4 (Niehrs and Pollet, 1999; Iyer et al., 1999).
Since proteins involved in the same biological process often physically
interact or form complexes, several studies have correlated synexpression with
protein interactions or complexes in yeast (Ge et al., 2001) as well as in
Drosophila (Walhout et al., 2002).
In contrast to prokaryotes, where the mechanism of regulation of operons is
well understood, much less is known about the mechanism of co-expression
of synexpression groups in eukaryotes. Some current models suggest that
tissue-specific genes in higher eukaryotes are arranged in
discrete, independently controlled segments of chromatin. Enhancers and
locus control regions (LCRs) can also affect several genes at once. A
well-known example is the globin cluster in humans. The globin genes are
under the control of a single LCR that lies far upstream of the cluster and
appears to act by controlling chromatin condensation (Bungert et al., 1995).
Many LCRs are thought to be present in the human genome, where they regulate
a variety of cell-type-specific genes.
Nevertheless, despite this higher level of control, regulation can also be
achieved by the binding of regulatory proteins to elements in the proximal promoter
1.2. EXPERIMENTAL APPROACHES TO FIND REGULATORY REGIONS 12
and, therefore, co-regulated genes should contain significantly more copies
of a given cis-regulatory motif in their upstream sequences than other genes.
This has been shown to be true in yeast by Hughes et al. (2000), who used
sets of genes grouped according to different sources (YPD, the Munich
Information Center for Protein Sequences (MIPS) and SGD).
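The enrichment argument above can be made concrete with a simple statistical test. The following Python sketch is illustrative only: the function name, the example counts and the choice of a hypergeometric test are assumptions made here, not the procedure of Hughes et al. (2000). It asks how likely it is to see at least the observed number of motif-bearing genes in a group if the group had been drawn at random from the genome.

```python
from math import comb


def hypergeom_tail(total, with_motif, group, group_with_motif):
    """Probability of seeing at least `group_with_motif` motif-bearing genes
    when `group` genes are drawn without replacement from `total` genes,
    of which `with_motif` carry the motif in their upstream sequence."""
    upper = min(group, with_motif)
    favourable = sum(
        comb(with_motif, k) * comb(total - with_motif, group - k)
        for k in range(group_with_motif, upper + 1)
    )
    return favourable / comb(total, group)


# Hypothetical numbers: 6000 genes, 300 with the motif upstream,
# and a co-regulated group of 20 genes of which 8 carry the motif.
p_value = hypergeom_tail(6000, 300, 20, 8)
print(f"P(X >= 8) = {p_value:.3g}")
```

A small probability suggests that the motif is over-represented in the group; in practice a correction for the number of motifs tested would also be needed.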
1.2 Experimental approaches to find regulatory regions
Cis-regulatory elements have been studied for decades by a myriad of sci-
entists. Most of the techniques they have developed are labor-intensive and
difficult to scale up and, consequently, focus on one gene or one element of
regulation. More recently, global approaches to deciphering gene regulation
in a genome-wide manner have been applied.
1.2.1 One-by-one gene analysis
Many techniques have been developed to localise the binding sites of
regulatory proteins. The most widely used are DNase footprinting (Leblanc and
Moss, 2001) and the mobility shift assay (Chan et al., 2004). Both are based
on the change in the physical properties of a DNA fragment when a protein
binds to it specifically. DNase footprinting exploits the fact that the bound
protein protects the DNA from degradation by DNase I. The mobility shift
assay exploits the difference in mobility of the DNA fragment on a
non-denaturing gel when the protein is bound to it.
Another approach is genetic analysis, in which the isolation of mutants in
the DNA binding site helps to identify which positions in the binding site
are important (Walter and Biggin, 1996).
Even though attempts have been made to apply these techniques in a
high-throughput manner to entire genomes, these approaches remain
time-consuming and, consequently, can only be applied to a few cases at a
time.
1.2.2 High throughput analysis
To study the entire transcriptome of an organism, a number of high-throughput
methods have been developed. Two methods are particularly relevant
for deriving cis-regulatory regions. The first approach is indirect and
involves microarray technology; the second, ChIP, attempts to locate directly
the regions that are important for the binding of trans-acting factors.
Micro-array analysis monitors the relative amount of transcript in a pop-
ulation of cells at a given time by measuring the hybridisation between an
immobilised DNA or oligonucleotide sequence and the corresponding cDNA
derived from the sample. This measure can be either absolute (that is, the
intensity of hybridisation in relation to the amount of transcript in the
cell) or relative (the intensity of hybridisation in condition 1 relative to
the intensity of hybridisation in condition 2, in order to measure
differential expression). For the relative measure, two dyes are used to
label the cDNA from the two samples respectively, and both are hybridised
onto the same immobilised
probe. By repeating the measure at different times and/or under different
conditions, it is possible to obtain the expression profiles for a large set of
genes. Genes that have similar expression profiles are said to be co-regulated.
Because co-regulated genes are believed to be under the control of a similar
set of transcription factors, these genes should possess common regulatory
regions. Microarrays therefore find cis-regulatory regions only indirectly, by
providing co-regulation information, but this approach has proven to be
very successful, particularly in S. cerevisiae (Brazma et al., 1998; Hughes
et al., 2000).
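As a minimal sketch of how relative measurements become expression profiles, the Python fragment below converts two-channel intensities into log ratios and compares two profiles with a Pearson correlation. The function names, the intensity values and the use of Pearson correlation as the similarity measure are assumptions chosen for illustration; the studies cited above use their own clustering procedures.

```python
from math import log2, sqrt


def log_ratios(condition_1, condition_2):
    """Log2 ratio of the two dye intensities at each time point."""
    return [log2(a / b) for a, b in zip(condition_1, condition_2)]


def pearson(x, y):
    """Pearson correlation between two expression profiles of equal length."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Hypothetical intensities for two genes over four time points,
# each measured against the same reference sample.
gene_a = log_ratios([200, 400, 800, 400], [200, 200, 200, 200])
gene_b = log_ratios([150, 310, 590, 280], [150, 150, 150, 150])
print(round(pearson(gene_a, gene_b), 3))
```

Genes whose profiles correlate strongly across many conditions would be candidates for the same co-regulated group.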
Although yeasts are eukaryotes and therefore more complex than
bacteria, they share many of the technical advantages that make bacteria easy
to handle in diverse investigations. Furthermore, the genomic organisation of
yeast is much less complex than that of higher eukaryotes; it has therefore
been harder to find cis-regulatory motifs in higher eukaryotes using
microarrays.
Chromatin immunoprecipitation (ChIP) (Weinmann and Farnham, 2002)
does not monitor gene expression per se but instead investigates
directly the interactions between proteins, for example transcription factors,
1.3. BIOINFORMATIC APPROACH TO FINDING REGULATORY ELEMENTS 14
and DNA. Coupled with whole-genome DNA microarrays, ChIP allows the
identification of the DNA binding sites of any given transcription factor
and, by extension, can suggest possible co-expression. In both the microarray
and the chromatin IP approaches, bioinformatic tools are needed in order to
identify over-represented motifs that are believed to be cis-regulatory elements.
1.3 Bioinformatic approach to finding regulatory elements
1.3.1 Background
Bioinformatics is based on the prediction of certain characteristics of biolog-
ical entities. These entities can be sequences and in this case, one of the
most common approaches is to find other related sequences in order to infer
function or to gather more information about the given sequence. Finding
related sequences is achieved by using alignment algorithms that also pro-
duce the best alignment.
Homologous sequences derived from a common ancestor can undergo substitution,
insertion and deletion, and the rate of these changes varies according
to the selection pressure. Alignment algorithms should therefore take all
these events into account. Many such algorithms have been developed and can
be classified according to their characteristics. They can be divided roughly
into global and local algorithms which, in turn, can be separated into
pair-wise and multiple alignment methods. Pair-wise global alignment
algorithms such as the one developed by Needleman and Wunsch (1970) consider
the entire sequences, whereas local alignment algorithms such as the one
developed by Smith and Waterman (1981) focus on the region of greatest
similarity. Fasta (Pearson, 1991) and Blast (Altschul et al., 1990), both
pair-wise local alignment algorithms, provide rapid alternatives to the
Smith-Waterman algorithm by first finding exactly matching
words. This step confines the subsequent search to a small fraction of the
entire search space. Many other alignment algorithms have been developed,
each to answer specific questions.
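To illustrate what "local" means here, the following Python sketch computes the score of the best local alignment in the manner of Smith and Waterman. It is a simplified illustration: the scoring values (match, mismatch and a linear gap penalty) are arbitrary assumptions, and only the score, not the alignment itself, is returned.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b.

    Cell values are never allowed to drop below zero, which is what lets
    an alignment start and end anywhere in the two sequences."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


# Two otherwise unrelated sequences sharing the block ACGCGT.
print(smith_waterman_score("GGGACGCGTCCC", "TTTACGCGTAAA"))  # 12
```

A global (Needleman-Wunsch) alignment of the same two sequences would also have to pay for the unrelated flanks, which is why short shared blocks are easier to see with a local method.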
As outlined earlier, binding sites on DNA for transcription factors are usually
very short, and two identical binding sites are usually not the result of common
ancestry. Alignment algorithms are of use only if the surrounding sequence is
believed to be derived from the same ancestor and the identity is high enough.
That is often not the case and, therefore, algorithms that find
over-represented motifs in a set of sequences are sometimes a better way to
find protein binding sites on DNA. As we have seen in section 1.2.2, the set
of sequences can be derived, for example, from microarray analysis, and the
sequences are not believed to share a common ancestor.
1.3.2 Finding over-represented motifs on unrelated sequences
Typically, data derived from microarray analysis were clustered into groups of
co-expressed genes that are believed to have common motifs in the
corresponding upstream regions. These studies have been done mostly on yeast,
and many algorithms to find over-represented motifs have been developed. MEME
(Bailey and Elkan, 1995), AlignACE (Roth et al., 1998), DIALIGN
(Morgenstern et al., 1998) and Teiresias (Rigoutsos and Floratos, 1998) are
four examples of such techniques, but many more have been reported in the
literature (Hertz and Stormo, 2000; Brazma et al., 1998; Hughes et al., 2000).
1.3.3 Phylogenetic footprinting to find cis-regulatory elements
1.3.3.1 Introduction
Evolutionary information is used extensively in computational biology to
infer function. For example, if two entities share features, then knowledge
can be transferred between them: if two genes share sequence homology, and
hence a common ancestor, they are likely to share a similar function. This
notion has been widely applied in bioinformatics and is routinely used in
automated genome annotation.
Different functional elements in the genome are under different selection
pressures. A good example of this is the coding region, where substitution at
the third codon position is far more common than at the other positions. Because
of the degeneracy of the genetic code, mutation of the third nucleotide is
generally silent (referred to as a synonymous change). Regulatory regions are
under different selection pressure from non-functional DNA and, consequently,
evolution can be used as a tool to locate them as well.
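The statement about the third codon position can be checked directly against the standard genetic code. The Python sketch below is an illustration added here, not part of the analyses in this thesis: it enumerates every single-nucleotide change at each codon position and reports the fraction that leaves the encoded amino acid (or stop signal) unchanged. Treating stop codons like any other codon is a simplifying assumption.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_fraction(position):
    """Fraction of single-base changes at `position` (0, 1 or 2) that are silent."""
    silent = total = 0
    for codon, amino_acid in CODON_TABLE.items():
        for base in BASES:
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            total += 1
            silent += CODON_TABLE[mutant] == amino_acid
    return silent / total


for position in range(3):
    print(position + 1, round(synonymous_fraction(position), 3))
```

The third position gives by far the largest silent fraction, which is the asymmetry that makes codon position a useful signature of selection on coding sequence.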
The discovery of regulatory regions in intergenic DNA through cross-species
comparison is often termed phylogenetic footprinting, by analogy with
DNase footprinting (Tagle et al., 1988). It is based on the observation
that functionally important regions tend to accumulate mutations more slowly
than non-functional regions. The technique can therefore be used to predict
transcription factor binding sites (TFBS). It has been applied to
well-studied genes for a long time. Typically, the homologues of the gene of
interest are identified in many related species and, after sequencing
the upstream regions or DNase hypersensitive sites, various alignment
techniques are used to locate the specific region of interest that is most
probably involved in transcriptional regulation.
However, the protocol used for phylogenetic footprinting depends largely
on the gene studied. For genes that play key roles in general biological
processes, a few distantly related species are used. For example, in the
study of the promoter region of the stem cell leukemia gene (which encodes a
bHLH transcription factor), the authors used human, mouse, chicken,
pufferfish (fugu) and zebrafish (Gottgens et al., 2002). For genes that are
involved in taxon-specific processes, remote species do not have homologues,
and pair-wise comparison with related species does not have enough resolving
power. Recent approaches have used phylogenetic shadowing (the use of the
additive collective divergence of many very closely related species to
distinguish functional sites) with success. As more and more fully sequenced
genomes appear, phylogenetic shadowing is likely to give very interesting
results in the future, and not only for taxon-specific genes. At present,
because few closely related genomes have been fully sequenced, phylogenetic
shadowing can be applied only to specific examples where enough orthologous
sequences are available. This is the case for the study of the mammalian
growth hormone gene, which involved 13 different yet related mammals
(Krawczak et al., 1999). More recently, another group has applied
phylogenetic shadowing with great success to different regions of the human
genome, using a total of 13 to 17 different primates (Boffelli et al., 2003).
Genome-wide phylogenetic footprinting, as opposed to gene-centric
phylogenetic footprinting, is a fairly new technique because it requires the
complete sequences of at least two related genomes. The general strategy
described in most of the previous work has been to align sequences from
orthologous pairs in two or more species and to predict TFBS using known
position weight matrices. Most of these tools provide graphical interfaces to
display the results. Interestingly, the first eukaryotic organisms to be
compared were higher eukaryotes such as human and mouse. With the newly
sequenced yeast genomes of S. paradoxus, S. mikatae and S. bayanus (Kellis et
al., 2003), phylogenetic footprinting is now successfully applied on a large
scale in yeast. As is traditional for complex studies, yeast was the first
organism in which an attempt was made to find binding sites de novo using
only phylogenetic and co-occurrence information (Chiang et al., 2003). The
authors found around 1000 closely spaced hexamer pairs that are
conserved in at least 3 yeast species. Many of these hexamers correspond to
known transcription factor binding sites. Another study (Kellis et al., 2003)
looked at the conservation scores of motifs and found 72 genome-wide
elements, including most of the known regulatory motifs as well as new motifs.
1.3.3.2 Methods for phylogenetic footprinting
As seen above, alignment tools have been developed to estimate whether
DNA or protein sequences are derived from the same ancestor. Alignment
techniques have also been used extensively to study cis-regulatory regions.
This approach consists of aligning regions of homology in the non-coding
sequences in the vicinity of orthologous genes from two or more species. Most
of the work has been done on well-studied examples such as the alpha-globin
cluster (Flint et al., 2001), the SCL locus (Gottgens et al., 2002), the
Hoxb4 gene (Aparicio et al., 1995) and other regions that often correspond to
loci involved in human disease (Loots et al., 2000; Dubchak et al., 2000).
Nevertheless, more general analyses have been done on whole genomes or
functional subsets (Levy et al., 2001; Webb et al., 2002; Elnitski et al.,
2003; Dieterich et al., 2002).
Because regulatory elements tend to be quite short conserved sequences
relative to the background noise, and because the order and orientation
of these elements are not always conserved, algorithms like DBA
(Jareborg et al., 1999) or the Bayes block aligner (Zhu et al., 1998), which
focus on aligning highly conserved ungapped blocks while allowing large gaps,
are theoretically better suited to identifying cis-regulatory regions. The
work in chapter 3 uses Promoterwise, an alignment program derived from DBA, to
analyse intergenic regions in higher eukaryotes. In practice, any alignment
technique will pick up regulatory elements located in modules of long, highly
conserved regions. The question remains of how many regulatory elements
are located in non-conserved sequences. The answer is species-dependent,
but an increasing number of studies show evidence of modular organisation
of cis-regulatory sites (Berman et al., 2002), while other studies have shown
examples of regulatory elements lying in regions of very low sequence identity.
Substantial resolving power is added by including more than two sequences
in a multiple sequence alignment, since each lineage has diverged
independently after separation from a common ancestor. Programs that perform
such alignments include Yama2 (Chao et al., 1993), ClustalW (Thompson et al.,
1994), Multalign (Corpet, 1988), Dialign (Morgenstern et al., 1998) and others.
Since Dialign does not use a gap penalty and starts by identifying short
conserved regions, it is better suited to identifying regulatory regions
than ClustalW.
Once the alignment is made, conserved sequences need to be located. For
alignments involving only two sequences, a simple metric of X % conservation
over at least Y nucleotides is usually used. Dubchak et al. (2000) used two
alignments of a 200 kb sequence (human 5q31), human with dog and human with
mouse, to define the cutoff criteria X and Y for conserved sequences by
maximising the percentage of regions that are common to all three species. In
other cases, a simple ranking of identity scores seems
to give better results than fixed settings (Flint et al., 2001). For multiple
alignments, more parameters need to be taken into account, such as the
phylogenetic relation between species or the nucleotide frequencies at each position.
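The "X % over at least Y nucleotides" criterion can be written as a sliding window over a pair-wise alignment. The Python sketch below is a minimal illustration under stated assumptions: the function name is hypothetical, the two aligned sequences are given as equal-length strings with '-' for gaps, gapped columns count as non-conserved, and the thresholds in the example are arbitrary rather than those used in the studies cited above.

```python
def conserved_windows(aligned_a, aligned_b, min_identity, window):
    """Start columns of every window of `window` alignment columns in which
    at least `min_identity` (a fraction between 0 and 1) of columns are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = [int(x == y and x != "-") for x, y in zip(aligned_a, aligned_b)]
    hits = []
    count = sum(identical[:window])
    for start in range(len(identical) - window + 1):
        if start > 0:
            # Slide the window by one column: add the new column, drop the old one.
            count += identical[start + window - 1] - identical[start - 1]
        if count / window >= min_identity:
            hits.append(start)
    return hits


# Toy alignment: the first eight columns are identical, the last two differ.
print(conserved_windows("ACGTACGTAA", "ACGTACGTCC", min_identity=1.0, window=8))  # [0]
```

Overlapping hits would normally be merged into conserved regions before any further analysis.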
Because alignment only indicates which regions are common
to two or more species, the challenge for these techniques is to assess
whether these regions of homology are indeed involved in regulation. This is
why alignment has often been used in conjunction with known transcription
factor binding sites, usually from the Transfac database (Wingender et al., 2000).
Although motif over-representation techniques can theoretically be used for
phylogenetic footprinting, they were designed to compare evolutionarily
independent sequences (see section 1.3.2) and therefore do not take into
account the evolutionary relationship between homologous sequences. To
overcome this problem, Blanchette and Tompa (2002) developed another method,
Footprinter, which takes into account the phylogenetic tree relating the
sequences and is therefore more suitable for comparing orthologous sequences.
It identifies all the DNA motifs that have evolved at
a slower rate than the surrounding region.
A major drawback is the relatively poor performance of this approach on a
small set of orthologues. Indeed, all these motif-finding techniques perform
much better with increasing numbers of sequences, where the distinction
between conserved motifs and diverged background becomes clearer. Blanchette
systematically used more than three species and increased the number of
sequences by including paralogues. With the sequencing of more organisms
this will become less problematic but, in order to work, the approach assumes
that a majority of these organisms retain sufficiently conserved motifs
within the analysed segment, which may not be valid (see section 1.3.4).
Another problem is that these methods do not work as well with long sequences
and, as metazoan promoters may lie a considerable distance away from the
transcription start site, this limits their utility.
Nevertheless, these approaches will find motifs that satisfy the criteria
independently of the identity of the surrounding sequence. This is not the
case with global alignment, where the noise of the diverged non-functional
background can overwhelm the short conserved signal.
1.3.4 Issues
All these studies, regardless of the complexity of the species, show an
overall enrichment of putative transcription factor binding sites in
conserved non-coding genomic sequences or footprints, and many other studies
have linked evolutionarily conserved regions to experimentally determined
regulatory elements (Aparicio et al., 1995). There is no doubt that this
technique is successful in finding TFBS genome-wide. However, it is also
quite clear that even for
relatively close species not all TFBS are conserved. For example, according
to one estimate for human and mouse, only 50 percent of known TFBS are located
in conserved regions (Levy and Hannenhalli, 2002). It follows that this type
of method does not find all transcription factor binding sites but only the
subset that is important enough to be conserved throughout all the species
studied. It is nevertheless important to understand why conservation fails in
so many cases.
First of all, alteration in gene regulation, and therefore alteration of TFBS,
seems to have been the primary substrate for the evolution of species. King
and Wilson (1975) suggested that most of the genetic
causes of phenotypic differences between humans and the great apes lie in the
regulatory sequences that control the timing and pattern of gene activity.
Many other homologous genes have been shown to have distinct
temporal and spatial expression. For example, the β-myosin heavy chain
is the major isoform in the adult ventricle of humans but not of hamsters,
and consequently the cis-acting elements involved in tissue specificity would
be expected to differ between the two species. Even when the functional
binding site is conserved, some divergence in the nucleotide sequence of
the site can be seen, even between very closely related species. For example,
a study of an androgen-inducible gene in different mouse species showed that
the regulatory sites of this gene have undergone substitutions and insertions
that change their affinities for their respective nuclear factors and modify
the expression of the gene (Chaudhuri et al., 1991). Gene duplication arising
from whole-genome or segmental duplication is also a substrate for mutation
in the regulatory regions of both duplicated genes (see the
duplication-degeneration-complementation model proposed by Force et al., 1999).
Consequently, the conservation of cis-regulatory regions is a good indicator
of conservation of the spatial and temporal expression of an orthologue.
Loots et al. (2000) used the prior knowledge that transgenic
mice bearing the human 5q31 region, which contains Il4, Il13 and Il5 as well
as their regulatory regions, correctly expressed the human transgenes to
hypothesise that the cis-regulatory regions should be conserved between human
and mouse, and found by cross-species sequence comparison that this is indeed
mostly the case. Because this type of study is difficult to scale up to the whole
genome level, genome-wide phylogenetic footprinting cannot integrate this
information yet. Conversely, non-conserved regions or divergence in shared
binding sites that have arisen from positive selection are very interesting be-
cause they can explain the difference between species, but for now there are
no techniques to distinguish between positive selection and random mutation.
Another issue has more to do with the characteristics of TFBS. Indeed,
because of the relatively small size of a regulatory motif (5-25 bp), these
motifs can easily be modified, duplicated and reversed, and can appear or
disappear throughout evolution without affecting the expression of the gene.
For the alpha-globin cluster, the MARE motif has been found in human, mouse,
chicken and pufferfish, but in pufferfish this motif has a different location
and appears to be reversed (Flint et al., 2001). A very interesting study
by Ludwig et al. (2000) showed that the stripe 2 elements
had undergone substitutions and indels that considerably modified the
cis-acting elements and the spacing between them, even between two species as
close as D. pseudoobscura and D. melanogaster. Yet the expression profile
stays the same, raising the hypothesis of compensatory mutation. These
authors suggest that stabilising selection has allowed mutational turnover of
functionally important sites and, at the same time, maintained functional
conservation of gene expression. They predict that such a pattern of
substitution will be a common theme in cis-regulatory regions, but the extent
of such substitution seems to differ from species to species, with
vertebrates having overall more conserved cis-regulatory regions than
invertebrates.
Therefore, the question of how evolution affects cis-acting motifs is
important to consider. Ideally one would like to study orthologous genes in
different species that have the same expression pattern but have had enough
evolutionary time that only functionally conserved sequences are alignable.
However, there seems to be no ideal large-scale comparison that would satisfy
the characteristics of all the genes. Intra-mammal comparisons (e.g. mouse
to human) would include mammalian-specific genes but would also have a
large amount of non-functional conservation. Intra-vertebrate analysis (e.g.
fish to human), on the other hand, can locate functional regions with more
specificity but (a) the signal is currently very hard to detect and (b) it is
considerably less obvious whether one should expect functional conservation
of the same motifs.
I will investigate alignments of promoter regions in chapter 3 and go on
to use these alignments to define motifs in chapter 4.
1.3.5 Finding eukaryotic promoters
Many methods to predict promoters or, more precisely, the location of the
transcription start site (TSS) have been developed for higher eukaryotes.
PromoterInspector (Scherf et al., 2000), for example, uses a set of
motifs that are over-represented in promoter regions. Another algorithm,
Eponine (Down and Hubbard, 2002), is a probabilistic method for detecting
transcription start sites in mammalian genomic sequences. It consists of a
set of DNA weight matrices recognising specific sequence motifs. Each of
these elements is associated with a position distribution relative to the
transcription start site. These elements are:
1. A diffuse preference for CpG motifs that correspond to the CpG island.
2. A TATAA box motif around 30 bp upstream of the TSS.
3. Two CpG-rich weight matrices flanking the TATAA motif.
This procedure is based on a model learned from the Eukaryotic Promoter
Database (EPD; Schmid et al., 2004), an annotated non-redundant
collection of eukaryotic Pol II promoters for which the transcription start
site has been determined experimentally.
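To show what scoring a sequence with a DNA weight matrix involves, the Python sketch below scans a sequence with a log-odds matrix for a TATAA-like motif. It is an illustration only and is not Eponine's model: the matrix values are invented, a uniform background is assumed, and the position distribution relative to the TSS that Eponine attaches to each matrix is left out.

```python
from math import log2


def column(preferred, p=0.85):
    """Hypothetical frequency column: probability p for the preferred base,
    with the remainder shared equally by the other three bases."""
    rest = (1 - p) / 3
    return {base: (p if base == preferred else rest) for base in "ACGT"}


# Invented matrix for a TATAA-like motif (illustrative values only).
TATAA_MATRIX = [column(base) for base in "TATAA"]


def pwm_score(site, matrix=TATAA_MATRIX, background=0.25):
    """Log-odds score of a site of the same width as the matrix."""
    return sum(log2(col[base] / background) for col, base in zip(matrix, site))


def best_hit(sequence, matrix=TATAA_MATRIX):
    """Return (score, position) of the highest-scoring window in the sequence."""
    width = len(matrix)
    return max(
        (pwm_score(sequence[i:i + width], matrix), i)
        for i in range(len(sequence) - width + 1)
    )


score, position = best_hit("GGGCTATAAGGC")
print(position, round(score, 2))
```

In a positional model of the kind described above, such a score would additionally be weighted by how plausible the hit's distance from the candidate TSS is.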
Chapter 2
Finding regulatory regions using functional information in yeast
2.1 Introduction
As we have seen in the introduction, transcription factors are among the major
players in gene expression and bind to small stretches of semi-conserved
DNA that are usually located within a limited distance upstream of the
gene. In contrast to the well-defined gene structure, cis-regulatory elements
are poor in information and, even though it should be theoretically possible
to find these elements de novo using nothing but the DNA sequence, most
approaches so far have used additional information. One of the most
successful approaches involves the use of related genomes to find regions of
conservation. Chapter 3 is devoted to the use of comparative genomics to
locate cis-regulatory sites.
Another very successful approach to locating cis-regulatory elements is the
use of microarray technology. Microarray analysis measures the amount of
a specific mRNA in the cell, which reflects the balance between the rates of
synthesis and degradation of the mRNA molecule. By repeating the measurement
at different times under different conditions, it is possible to obtain the
expression profile of each gene studied. Genes that have similar expression
profiles (called co-regulated genes) are believed to be regulated by a
similar set of regulatory elements. The mechanism of such regulation is very
different between
2.2. EXAMPLE: THE NUCLEOTIDE PATHWAY IN YEAST 24
prokaryotes and eukaryotes; in the latter, each gene has its own regulatory
regions. Therefore, binding sites for a particular transcription factor are
expected to be enriched in the upstream regions of a set of co-regulated genes.
This has been shown to be true in numerous cases.
The approach taken here also uses information about co-regulation, but
derives it not from microarray analysis but from the fact that genes that
have similar functions have a strong tendency to form clusters of co-regulated
genes. Given the functions of genes, it is therefore theoretically possible to
bypass the microarray data and define co-regulated genes via their
functional similarity. Eukaryotic genes whose products have similar
functions should therefore display an enrichment in given cis-regulatory sites.
This is the basis of the method described in this chapter. An example of a
well-studied pathway in yeast (the nucleotide pathway) is presented first,
with a manual search for cis-regulatory elements, before the approach is
extended to an automatic procedure that finds regulatory elements
using any functional network.
2.2 Example: the nucleotide pathway in yeast
The cell cycle is a highly coordinated process that involves the production of
newly synthesised DNA strands. During S phase, the cell should possess an
elevated level of dNTPs (the DNA precursors) as well as of all the enzymes and
accessory proteins that are involved in the biosynthesis of DNA. The
nucleotide pathway is therefore a good example with which to test the
hypothesis of co-regulation within a pathway in yeast. Because nucleotides
are also used at times other than the S phase of the cell cycle, and because
the pathway that produces nucleotides is not a linear chain of reactions, one
would expect only a subset of enzymes to be co-regulated. In order to find
potential regulatory motifs, the upstream regions of all the genes encoding
enzymes that are involved in the DNA polymerisation pathway were retrieved.
DNA was chosen as the start compound because strong co-regulation among the
subunits of the polymerase is expected. By manual analysis, two motifs appear
to occur significantly more often in the 0.5 kb upstream of these genes. Using
the pathway relationships in KEGG, the two motifs were then searched for
recursively upstream of the genes encoding neighbouring enzymes in the
pathway. Results are shown in Figure 2.1. This
figure shows the network of reactions leading toward biosynthesis of DNA
Figure 2.1: Diagram of the selected routes for the biosynthesis of DNA as described in KEGG. Labeled in blue are the enzymes that catalyse the reaction, and the circles are the compounds used by these enzymes. The arrows have the direction of the reaction in 'normal' physiological condition of the cell but none of these reactions are considered irreversible. Certain enzymes are labeled with red circles and/or green rectangles (symbolising the MluI and the unknown motif respectively) that are found at least once within 500 bp upstream of the genes. Labels recovered from the figure: compounds dTDP, dCDP, dADP, dGDP, TDP, dTTP, dCTP, dATP, dGTP, dUMP, dTMP, CDP, ADP, GDP, ATP and DNA (labelled ADN); genes YBL035C, YBR278W, YKL114C, YDL102W, YEL055C, YJR006W, YNL102W, YNL262W, YOR330C, YPL167C, YPR175W, YOR074C, YER070W, YIL066C, YJL026W, YKL067W, YOR116C, YJR057W and YGL180W.
and the presence of these motifs in the upstream regions of the corresponding
genes.
The first motif (ACGCGTNA) is well known and has previously been called the
MluI site (Verma et al., 1991). The MluI site has been shown to bind a
regulatory protein that is involved in the regulation of cell division cycle
genes (CDC genes). The genes experimentally verified to have this site in
their upstream region are CDC21 (YOR074C), CDC2 (YDL102W), CDC6 (YJL194W)
(Verma et al., 1991) and POL1 (YNL102W) (Moll et al., 1992), but these genes
are not the only ones to have this site, as shown in Figure 2.1. Further
experimental work is needed to verify that the additional occurrences of this
motif function as binding sites. The second motif does not seem to have been
reported in the literature.
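A search of this kind is straightforward to sketch in Python. The fragment below looks for the ACGCGTNA motif in an upstream sequence, treating N as any nucleotide. It is a hypothetical illustration: the function names are invented here, and scanning both strands is an assumption, since the text above does not state how strand was handled in the manual analysis.

```python
import re

MOTIF = "ACGCGTNA"
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "[ACGT]"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Reverse complement of an upper-case DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def motif_positions(upstream, motif=MOTIF):
    """Start positions (in forward-strand coordinates) of motif matches on the
    forward strand and on the reverse strand of an upstream sequence."""
    # A lookahead is used so that overlapping matches are all reported.
    pattern = re.compile("(?=(" + "".join(IUPAC[base] for base in motif) + "))")
    upstream = upstream.upper()
    forward = [m.start() for m in pattern.finditer(upstream)]
    reverse = [
        len(upstream) - m.start() - len(motif)
        for m in pattern.finditer(reverse_complement(upstream))
    ]
    return forward, reverse


# Toy upstream fragment; a real run would use the 500 bp upstream of each gene.
print(motif_positions("TTACGCGTGATT"))  # ([2], [0])
```

Because the core ACGCGT is its own reverse complement, a single site often matches on both strands, as the toy example shows; only the flanking NA positions differ.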
This example was the result of careful manual analysis, but a fully auto-
mated procedure to find such motifs was developed. This uses the degree of
2.3. USEFUL FUNCTIONAL NETWORK 26
concordance between the motifs present upstream of genes and any functional network.
2.3 Useful functional network
The availability of functional information in a large-scale manner is essential
for this approach and is dependent on the organism studied as well as the
type of interaction.
The first type of functional information used is direct protein-protein
interaction, taken from two large-scale experiments in yeast published by
Gavin et al. (2002) and Ho et al. (2002). Another well-known type of functional
information, used in the example above (see 2.2), is the small-molecule metabolic
reaction catalysed by enzymes. In this instance, the link is not physical, as
in the case of protein-protein interaction, but indirect, via metabolites.
Because of the relative simplicity of the yeast genome organisation and the
availability of large-scale experiments, the choice was made to work mostly
with S. cerevisiae. The methodology is attempted on H. sapiens later in this
chapter.
2.3.1 Metabolic network
Enzymes are among the best-characterised elements in the cell, having been the
first biological molecules to be studied. Their metabolites are usually well defined
and, because the product of one enzyme is usually the substrate of other
reactions, relationships between enzymes are easily derived. In the early days
of biology, a continuous stretch of such relationships was called a pathway,
and the concept of pathways remains today, even though the topology
is better viewed as a network than as a set of linear pathways.
An example of a computationally defined metabolic network is the small-molecule
metabolic network described in the KEGG database
(Kanehisa, 1997), a store of all known enzymatic reactions for many species.
Taking only yeast, KEGG contains 623 enzyme-encoding genes, which
correspond to about 10% of the yeast gene set. This metabolic network can
be represented as a bipartite graph that contains two node types (enzymes
and metabolites) and two types of edges (enzymes linked to metabolites and
metabolites linked to enzymes). Under normal physiological conditions, an
enzymatic reaction has a direction from the substrate to the product, but in
this study directionality is not used. The resulting graph is therefore
undirected.
Because the interest here is in the relationships between enzymes, and because
the network structure needs to be simplified, a monopartite graph can be derived
that contains only one type of node (enzyme) and one type of edge (enzyme
linked to enzyme). This procedure is illustrated in Figure 2.2. Ubiquitous
substrates such as water or CO2 are hubs in the bipartite graph, and
the resulting monopartite graph would connect enzymes that have no real
metabolic link; it is therefore important to remove these non-specific
metabolites from the bipartite graph (see Appendix B). The resulting monopartite
graph contains 623 nodes and 26,426 edges, with an average of 34 edges per
node.
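The projection described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in this work: the function name, the toy enzymes and the choice of excluded metabolites are assumptions made for the example, and the actual list of non-specific metabolites is the one given in Appendix B.

```python
from collections import defaultdict
from itertools import combinations


def project_to_enzymes(enzyme_to_metabolites, excluded_metabolites=frozenset()):
    """Project a bipartite enzyme-metabolite graph onto enzymes only.

    Two enzymes are linked if they share at least one metabolite that is
    not listed in `excluded_metabolites` (for example water or CO2).
    Returns a set of undirected edges, each a frozenset of two enzymes.
    """
    # Invert the mapping: metabolite -> enzymes acting on it.
    metabolite_to_enzymes = defaultdict(set)
    for enzyme, metabolites in enzyme_to_metabolites.items():
        for metabolite in metabolites:
            if metabolite not in excluded_metabolites:
                metabolite_to_enzymes[metabolite].add(enzyme)

    # Every pair of enzymes sharing a metabolite becomes one edge.
    edges = set()
    for enzymes in metabolite_to_enzymes.values():
        for a, b in combinations(sorted(enzymes), 2):
            edges.add(frozenset((a, b)))
    return edges


# Toy data, invented for illustration (not taken from KEGG).
toy = {
    "E1": {"dUMP", "H2O"},
    "E2": {"dUMP", "dTMP"},
    "E3": {"dTMP", "H2O"},
    "E4": {"H2O"},
}
with_hubs = project_to_enzymes(toy)
without_hubs = project_to_enzymes(toy, excluded_metabolites={"H2O"})
```

In this toy case water links E4 to enzymes with which it shares no specific metabolite; once water is excluded, only the E1-E2 and E2-E3 links remain.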
2.3.2 Protein interaction
Because complexes are entities that can only be functional when all the
necessary proteins are present in the cell, information about direct or indirect
physical interactions between proteins within complexes should be very
useful for the discovery of cis-regulatory elements.
Two high-throughput datasets on yeast protein interactions were used and
are subsequently referred to as the Cellzome network (Gavin et al., 2002) and the
MDS network (Ho et al., 2002). These datasets correspond to large-scale
identification of protein complexes in S. cerevisiae by mass spectrometry.
Both studies used a set of target proteins fused with either protein A and
the calmodulin-binding peptide (tandem affinity purification (Rigaut et al.,
1999), used by Cellzome) or the Flag epitope tag (MDS). The resulting fusion
proteins were purified together with the interacting yeast proteins, and the
purified complexes were analysed by tandem mass spectrometry to identify the
associated proteins. The raw data used here consist of each bait
protein linked to all the identified proteins that co-precipitate with it.
One of the major differences between the two methodologies is that Cellzome
uses the natural promoter to express the tagged protein, while MDS uses
a construct under a strong inducible promoter. In the first case, some proteins
may not be expressed under the conditions of the experiment (haploid cells in
mid-log phase), but the expression, as well as the binding to other proteins,
better reflects the physiological conditions in a cell. On the other hand, the use of a
strong inducible promoter guarantees a detectable amount of tagged protein,
but the binding to other proteins may not reflect any biological interaction.
In both cases, these methods are unlikely to detect transient interactions or
interactions occurring only in specific states.
As with the metabolic network, protein interaction networks can be
represented as bipartite graphs, with complexes and proteins as the
two types of nodes. A monopartite graph can be derived which contains only
proteins as nodes. The resulting monopartite graph contains 1,411 nodes and
34,844 edges for the Cellzome network, and 1,699 nodes and 151,670 edges in
the case of the MDS network.
2.4 Generating and assessing motifs
The basic workflow of the method is presented in Figure 2.3. The input data
are the upstream regions of the yeast genes (see Materials and Methods) and
functional information from either metabolic interactions or direct
protein-protein interactions. Two approaches were used to generate motifs, and each
motif was assessed using the functional network. A scoring scheme
was developed which quantitatively assesses the degree of concordance
between the motif and the functional network. A significance level is assigned
to each motif score using a brute-force randomisation procedure. The significant
motifs are then clustered.
2.4.1 Generating motifs
Two slightly different approaches for generating motifs can be used. Most
results shown in this chapter were generated using the over-represented motif
approach described in 2.4.1.1. The exhaustive approach was used for the
promoter scanning method described in 2.5.5.
Figure 2.2: Monopartite (or unipartite) representation of the interaction network (bottom) derived from its bipartite representation (top). Only one type of node (white) is kept, by adding an edge between two of these nodes only if they were linked to a common black node in the bipartite graph. The labels "compounds" and "enzymes" apply to the KEGG network, and the labels "complexes" and "proteins" apply to the protein interaction networks. A nucleation set is defined as all the proteins that either act upon the same compound or are part of the same complex.
Figure 2.3: Overall schema of the procedure. Two pattern discovery steps are possible before the assessment using the functional network. To assess significance, a randomisation step is performed, and significant motifs are then clustered.
Workflow steps shown in the figure:
Generating motifs (first approach): starting from the bipartite representation of the interaction network (A: metabolic network; B: protein interaction network), only compounds (A) or complexes (B) linked to 3 or more genes are kept. The upstream regions of these genes are used for pattern discovery. Teiresias is run on the selected set of genes (parameters: minimum of 8 nucleotides, 2 wild cards allowed).
Generating motifs (second approach): all 6-, 7- and 8-mer motifs.
Assessment: for each pattern found, build a pattern network (fully connected graph whose nodes are the selected genes). Overlap the pattern network with the interaction network (unipartite representation) and calculate the overlap score.
Randomisation: create pattern networks using random genes with the same number of nodes. Overlap each random pattern network with the interaction network (unipartite representation) and calculate the overlap score. Define a cutoff value.
Clustering: cluster the patterns that have an overlap score equal to or higher than the cutoff value, and build a sequence logo for each pattern cluster.
2.4.1.1 Over-represented motifs
One part of the functional network (the nucleation set), derived from the
bipartite representation of the functional network (see Figure 2.2), can be used
to derive a set of over-represented motifs that are then assessed using the
other part of the network.
As seen in Chapter 1, a broad range of programs that find over-represented
motifs in a set of genes is freely available. Most of these programs were
initially developed to find over-represented motifs within a cluster of
co-regulated genes derived from micro-array analysis. In all cases, given a
background model, the algorithm finds over-represented motifs within a set
of sequences that are believed to lack any evolutionary relationship. The
advantage of such an approach is the enrichment of potential candidates in
the motif set, allowing a broader definition of the motif dictionary.
Some computational methods, like Gibbs sampling or expectation maximisation,
have a complex background model to eliminate ubiquitous motifs (for
example, low-complexity motifs) from the significant set. Because the
measurement of the concordance with the functional network is the filtering step
here, the approach requires the proposed motif set to be as large as possible.
Teiresias (Rigoutsos and Floratos, 1998), a fast algorithm that exhaustively
retrieves all possible motifs satisfying given parameters, is well suited to this
methodology. Using Teiresias with loose parameters (see Appendix B for
details), 197,922 patterns are generated from the entire KEGG network, while
197,111 and 320,405 patterns are generated from the Cellzome and MDS
networks respectively. Although these patterns are technically not random,
their numbers and distributions across genes are not convincing signals for
credible motifs; instead, they provide a large initial set to be filtered in the
assessment step.
2.4.1.2 Exhaustive enumeration
All possible motifs from a motif dictionary can be assessed in an exhaustive
manner. To avoid prohibitive computational time, the search space was
limited to discrete motifs of a given length, without gaps, that contain
enough information content (typically more than five nucleotides long) and
have at least two locations in the upstream regions of all annotated genes
(typically less than 14 nucleotides long). This procedure for generating
motifs has the advantage of being independent of the functional network, so
the subsequent assessment step can be done on the whole functional graph
(without the need to remove the nucleation set; see 2.4.2). A potential
drawback of this approach is that the search space has to be restricted
to avoid prohibitive computational time.
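As an illustration of the exhaustive enumeration, the following Python sketch collects every ungapped k-mer in a set of upstream sequences and keeps those seen at least twice. It is an assumed, simplified reading of the procedure: the function name, the default length range and the toy sequences are hypothetical, only the forward strand is scanned, and no information-content filter beyond the minimum length is applied.

```python
from collections import defaultdict


def enumerate_motifs(upstream, min_len=6, max_len=8, min_occurrences=2):
    """Map each ungapped k-mer to the set of genes whose upstream region has it.

    `upstream` maps a gene identifier to its upstream sequence. Only k-mers
    made of A, C, G and T that occur at least `min_occurrences` times in
    total (overlapping occurrences included) are returned.
    """
    genes_with = defaultdict(set)
    counts = defaultdict(int)
    for gene, seq in upstream.items():
        seq = seq.upper()
        for k in range(min_len, max_len + 1):
            for start in range(len(seq) - k + 1):
                kmer = seq[start:start + k]
                if set(kmer) <= set("ACGT"):  # skip masked or ambiguous bases
                    counts[kmer] += 1
                    genes_with[kmer].add(gene)
    return {m: genes for m, genes in genes_with.items()
            if counts[m] >= min_occurrences}


# Toy upstream regions, invented for illustration.
toy_upstream = {"g1": "ACGCGTAA", "g2": "TTACGCGT", "g3": "GGGGGGGG"}
toy_motifs = enumerate_motifs(toy_upstream, min_len=6, max_len=6)
```

Each retained motif comes with the set of genes that carry it, which is exactly the input needed to build a motif network in the assessment step.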
2.4.2 Assessment of the motifs using functional networks
Functional networks are not simple clusters of functionally related proteins
but rather highly connected graphs that link some of the proteins together.
Simplistic notions of clustering will not capture the sparse co-regulation
within these networks. What can be defined as a set of genes involved in
the same biological function is not a trivial question and cannot be
answered using the network alone. The computational problem faced here is,
therefore, not to find significantly over-represented motifs in a cluster of genes
(which was only done in the initial step to derive a dictionary of putative
candidates), but rather to measure the concordance of the occurrences of a motif
with the functional network. The best method I developed was to
create a 'motif network', a fully connected graph whose nodes are all the genes
that possess the motif, and to superimpose this graph on the original
functional network. Only edges present in both networks
remain, and the resulting 'overlap network' is the intersection of the
functional and motif networks. If the technique for generating motifs uses
one part of the network (as is the case in 2.4.1.1), then the seed that built the
pattern (the nucleation set) needs to be discounted from the functional network
before assessing the pattern.
The next step was to develop a single numerical value that
reflects the overall complexity of the overlap network, in order to assess
and compare all the possible motifs. This overlap score needs to take into
account the number of nodes and edges in the overlap network. To do so,
each edge of the overlap graph contributes to the score (apart from the edges
that belong to the nucleation set, if the motif that built the overlap graph
is derived from it). Unlike in the motif network, the nodes in the initial
functional network do not have a fixed number of edges per node and,
therefore, the overlap network was expected to vary greatly in size according to
the nodes involved. To address this issue, the contribution of each
overlapping edge was divided by the total number of edges
that its two nodes possess in the initial functional network (see equation B.3):
\[ S = \sqrt{\sum_{i} \frac{1}{a_i + b_i - 1}} \]
The summation is over all common edges i present in both networks, each
connecting node A_i to node B_i, where a_i and b_i are the numbers of edges of
A_i and B_i in the functional network. The denominator a_i + b_i − 1 is the total
number of edges from both nodes, with the edge being counted included only once.
This procedure is illustrated in Figure 2.4, and a more detailed explanation is
given in Appendix B.
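The overlap score can be written compactly in Python. The sketch below is an illustration under stated assumptions, not the original implementation: node degrees are taken from the full functional network, an edge is treated as belonging to the nucleation set when both of its ends are in that set, and the toy graph is invented (it is not the graph of Figure 2.4).

```python
from math import sqrt


def overlap_score(motif_genes, functional_edges, nucleation_set=frozenset()):
    """Square root of the sum of 1 / (a_i + b_i - 1) over overlapping edges.

    `functional_edges` lists each undirected edge once as a (gene, gene) pair.
    An edge overlaps if both of its genes carry the motif, because the motif
    network is fully connected.
    """
    motif_genes = set(motif_genes)
    # Number of edges of every node in the full functional network.
    degree = {}
    for a, b in functional_edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1

    total = 0.0
    for a, b in functional_edges:
        if a not in motif_genes or b not in motif_genes:
            continue  # edge absent from the motif network
        if a in nucleation_set and b in nucleation_set:
            continue  # edge internal to the seed that generated the motif
        total += 1.0 / (degree[a] + degree[b] - 1)
    return sqrt(total)


# Toy functional network, invented for illustration.
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"),
             ("D", "E"), ("E", "F"), ("E", "G"), ("G", "H")]
toy_score = overlap_score({"A", "B", "C", "D", "E", "F", "G"}, toy_edges,
                          nucleation_set={"A", "B", "C"})
```

With this form of the score, the arithmetic quoted in Figure 2.4 is reproduced: sqrt(1/6 + 1/4 + 1/3) is approximately 0.866 for the real pattern network, and sqrt(1/5) is approximately 0.447 for the random one.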
The issue remained of how to assess the significance of such an overlap score.
Theoretical derivation of the statistics is difficult, as the network topology
is variable. However, it is computationally feasible using a brute-force
randomisation technique. Indeed, the value of the overlap score depends mainly
on:
1 the number of times a particular motif is seen in the upstream regions
of annotated genes.
2 the functional network topology.
3 the extent of concordance between the motif network and the functional
network.
[1] depends on each motif and [2] is fixed for a given network.
[3] is the aspect to be assessed.
The contribution of [1] and [2] can be estimated for each motif network size using a
brute-force randomisation, and any deviation from this estimate is treated as a
significant concordance between the motif and the functional information.
To do so, 100,000 fully connected pattern networks of different sizes (from 0
to 500 nodes) were generated, with random gene identifiers but with all other
aspects of the network remaining the same. The overlap score of each of these
random networks with the functional network was then calculated.
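The randomisation can be sketched as follows. This is a hedged illustration only: the function names and the toy chain network are assumptions, and random genes are drawn here from the full list of gene identifiers supplied by the caller, which may differ from the exact gene pool used in this work.

```python
import random
from math import sqrt


def build_degree(edges):
    """Number of edges of each node in the functional network."""
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return degree


def overlap_score(genes, edges, degree):
    """Overlap score of a fully connected network on `genes` with `edges`."""
    genes = set(genes)
    return sqrt(sum(1.0 / (degree[a] + degree[b] - 1)
                    for a, b in edges if a in genes and b in genes))


def random_overlap_scores(edges, all_genes, size, n_random, seed=0):
    """Scores of `n_random` random pattern networks of `size` genes each."""
    rng = random.Random(seed)
    degree = build_degree(edges)
    pool = sorted(all_genes)  # fixed order so that a given seed is reproducible
    return [overlap_score(rng.sample(pool, size), edges, degree)
            for _ in range(n_random)]


# Toy example: 20 genes linked in a simple chain (invented for illustration).
toy_genes = [f"g{i}" for i in range(20)]
toy_edges = [(toy_genes[i], toy_genes[i + 1]) for i in range(19)]
toy_random = random_overlap_scores(toy_edges, toy_genes, size=5, n_random=200)
```

Repeating the call for every network size of interest gives the empirical distribution of random scores against which real motif networks are compared.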
Figure 2.4: The left-hand panel shows an example real network, with three edges forming an initial seed of nodes (A, B, C), the nucleation set. One of the patterns discovered using this seed is also found upstream of genes D, E, F and G, many of which share edges with the functional network. The edges used are D−E, E−F and E−G, and the overlap score is sqrt(1/(3+4−1) + 1/(2+3−1) + 1/(1+3−1)) = 0.86. In contrast, the right-hand panel shows an example random network, which was chosen to have the same number of nodes (4) as the proposed pattern network. In this case, however, only one edge (G−H) is shared, and the overlap score is sqrt(1/(3+3−1)) = 0.44.
Figure 2.5: Each point in these graphs corresponds to an overlap score between a random motif network (top) or a real motif network (bottom) derived from the MDS network. The overlap score (y-axis) is plotted as a function of the pattern network size (x-axis, 0 to 500 nodes). The dotted line corresponds to four standard deviations from the mean overlap score for each pattern network size.
Figure 2.6: Each point in these graphs corresponds to an overlap score between a random motif network (top) or a real motif network (bottom) derived from the KEGG network. The overlap score (y-axis) is plotted as a function of the pattern network size (x-axis, 0 to 500 nodes). The dotted line corresponds to four standard deviations from the mean overlap score for each pattern network size.
Figure 2.7: Each point in these graphs corresponds to an overlap score between a random motif network (top) or a real motif network (bottom) derived from the Cellzome network. The overlap score (y-axis) is plotted as a function of the pattern network size (x-axis, 0 to 500 nodes). The dotted line corresponds to four standard deviations from the mean overlap score for each pattern network size.
2.5 Results
Figures 2.7, 2.5 and 2.6 show the overlap score of random and real pattern
networks as a function of the total occurrence of the pattern for the Cellzome,
MDS and metabolic networks respectively. In all cases, the randomised
networks show a consistent, well-behaved trend: the score increases linearly
with the number of nodes. For each network size, the normality of the distribution
of the overlap score was assessed, and it was found that, for small network
sizes, many networks have an overlap score of zero, which makes the
distribution skewed. This skewness becomes less pronounced as the network size
increases. For network sizes of more than 150, there is a good fit to a normal
distribution (see Appendix B). The same tests have been performed using
the chi-square method, resulting in similar conclusions.
A regression line corresponding to four standard deviations above the mean
of the random overlap scores for each network size was constructed, and real
networks having a score above this line were considered significant. Most of the
patterns found produce a network of genes that has little or no concordance
with the functional network. However, 647 motifs have a network of genes
that shows a much higher overlap score. In other words, these specific motifs
are found upstream of genes that have a significantly higher probability of
being interaction partners.
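The significance cutoff can be illustrated with a short Python sketch. It assumes that the line is an ordinary least-squares fit through the points (network size, mean + 4 SD of the random scores); the function names are hypothetical and the population standard deviation is used for simplicity.

```python
from statistics import mean, pstdev


def significance_line(random_scores_by_size, n_sd=4.0):
    """Fit a line through (size, mean + n_sd * SD) of the random scores.

    `random_scores_by_size` maps a network size to the list of random
    overlap scores obtained for that size (at least two sizes are needed).
    Returns the slope and intercept of the least-squares line.
    """
    xs, ys = [], []
    for size, scores in sorted(random_scores_by_size.items()):
        xs.append(size)
        ys.append(mean(scores) + n_sd * pstdev(scores))
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def is_significant(size, score, slope, intercept):
    """True if a real network scores above the cutoff line at its size."""
    return score > slope * size + intercept


# Toy random scores with no spread, so the line passes through the means.
toy_slope, toy_intercept = significance_line(
    {10: [1.0, 1.0, 1.0], 20: [2.0, 2.0, 2.0], 30: [3.0, 3.0, 3.0]})
```

A real motif network of a given size is then called significant when its overlap score lies above the fitted line at that size.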
2.5.1 Significant motifs
These patterns required some further processing to be useful. First, the
pattern discovery systems output discrete patterns, so that, for example,
GANTATG and GNATATG are treated as two distinct patterns despite
their obvious overlap. The patterns were therefore clustered using their genomic
locations (see Appendix B). This procedure is followed by a single-linkage
clustering, reducing the set of interesting patterns to a total of
42 motifs for the three functional networks considered in the study.
Conservative parameters were deliberately chosen, to be confident that the
motifs retained are of interest.
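A single-linkage clustering of patterns by genomic location can be sketched as below. The exact criterion is given in Appendix B and is not reproduced here; this example assumes, purely for illustration, that two patterns are linked when at least half of the positions covered by the less frequent pattern are also covered by the other one. All names, the threshold and the toy occurrences are hypothetical.

```python
def covered_positions(occurrences, length):
    """Set of (gene, position) pairs covered by a pattern of given length."""
    return {(gene, start + offset)
            for gene, start in occurrences
            for offset in range(length)}


def single_linkage_clusters(patterns, min_shared_fraction=0.5):
    """Group patterns whose genomic occurrences overlap (single linkage).

    `patterns` maps a pattern string to a set of (gene, start) occurrences.
    Patterns are merged transitively with a union-find structure.
    """
    names = sorted(patterns)
    covered = {n: covered_positions(patterns[n], len(n)) for n in names}
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, p in enumerate(names):
        for q in names[i + 1:]:
            shared = len(covered[p] & covered[q])
            smaller = min(len(covered[p]), len(covered[q]))
            if smaller and shared / smaller >= min_shared_fraction:
                parent[find(p)] = find(q)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), set()).add(n)
    return sorted(clusters.values(), key=lambda c: sorted(c))


# Toy occurrences, invented for illustration.
toy_patterns = {
    "GANTATG": {("g1", 10), ("g2", 40)},
    "GNATATG": {("g1", 10), ("g2", 40)},
    "CACGTG": {("g3", 100)},
}
toy_clusters = single_linkage_clusters(toy_patterns)
```

In the toy data the two overlapping patterns GANTATG and GNATATG fall into one cluster, while CACGTG remains on its own.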
The final set of 42 motifs, connecting to a total of 2,457 genes (about 40
percent of the yeast genome), is tabulated in Appendix C. Some clusters are
well-known motifs that bind known transcription factors in yeast, and the
regulated genes predicted by this analysis match the experimental evidence
previously published. A more detailed analysis of the motifs and their
corresponding overlap networks is available online:
http://www.ebi.ac.uk/ettwille/genome research paper 2003/result overlap.html
What follows is a more detailed analysis of interesting motifs.
2.5.1.1 Motif GGTGGCAAA
One of the strongest motifs that is significant in both the Cellzome and
MDS networks is GGTGGCAAA. This motif, identified in cluster 6, has
previously been called the proteasome-associated control element, or PACE, and
is known to bind Rpn4p, a transcription factor that controls expression
of genes related to the ubiquitin-proteasome pathway in yeast (Mannhaupt
et al., 1999).
In my hands, both this motif and the reverse-complementary motif
represented in cluster 23 are found mainly upstream of proteasome genes.
Within the overlap network, all the genes found code for
proteasome subunits, apart from a protein from the cytoplasmic chaperonin
complex and a protein involved in the ubiquitin-mediated degradation
pathway, both of which are also related to protein degradation.
For the remaining genes in the overlap network, which are annotated as having
unknown function, this is strong evidence that they are
either proteasome subunits or, more generally, involved in protein degradation.
2.5.1.2 Motif TGACTC
The motif identified in cluster 33 (see Appendix C) has previously been
reported to be located upstream of 30-40 yeast genes encoding enzymes in 11
different amino-acid biosynthesis pathways (Arndt and Fink, 1986). This is
the well-known binding site of the transcriptional regulator GCN4,
which positively regulates the production of protein synthesis precursors in
response to amino-acid starvation.
The genes from the overlap network with the metabolic graph mostly encode
proteins involved in amino-acid biosynthesis, but also tRNA
synthetases for most amino acids, as well as a couple of enzymes involved in
purine metabolism. Figure 2.8 shows the overlap network obtained
using the exact motif TGACTC. The nodes in the highly connected part of
the network are mainly genes involved in amino-acid biosynthesis
pathways. Nodes on the periphery are mostly genes coding for tRNA synthetases
or genes involved in purine metabolism.
2.5.1.3 Motif AAAATTTT
The motif AAAATTTT is an interesting motif that scores very highly in
all three functional networks studied. Also known as the poly(dA-dT) element,
this motif has been shown to create localised DNA distortion at either end
of the element (Koo et al., 2000), providing a region of access for
transcription factors (Iyer and Struhl, 1995). Indeed, in order to bind efficiently
to their target sites, most transcription factors need open chromatin
(Koch and Thiele, 1999). To achieve this chromatin conformation,
the cell either remodels the chromatin after a stimulus or constitutively keeps
the chromatin open using DNA structural elements that induce nucleosome
destabilisation. The poly(dA-dT) element is an example of such an element.
Maintaining the chromatin in an open conformation allows rapid transcriptional
responses. In this study it was found that this apparently widely occurring
pattern is found very often upstream of genes that are involved in
transcription and translation processes. Figure 2.9 shows the overlap network
obtained using the exact motif AAAATTTT on the metabolic network. The overlap
network topology is very different from the one derived from the TGACTC
motif, as it is formed mainly of two sets of highly connected nodes, one of which
consists mainly of RNA polymerase genes and the other mainly of tRNA synthetase genes.
The ubiquity of this motif for such basic processes suggests that it could
be a 'global state' switch for yeast. For example, one hypothesis is that it
could be involved in the cell's response to constantly changing conditions.
Re-adaptation often involves production of proteins and enzymes so that the cell
is able to use the new resources of its environment. Having a common
and simple regulatory element such as the adenine-thymine tract, which
controls the rate of production of most of the genes that are involved in the
transcription/translation machinery, could enable the cell to rapidly boost
the production of new proteins and, therefore, quickly adapt to new
situations. This is an interesting example of a functional motif that, even though
important for gene regulation, is probably not a binding site for a protein.
Figure 2.8: Overlap network for the exact motif TGACTC. Most of the genes code for proteins involved in amino-acid synthesis.
2.5.2 Non-random behaviour of significant motifs
Certain motifs with significant overlap scores also display other non-random
behaviour, such as a tight positional distribution relative to the start codon.
This is reflected by the standard deviation score, or SD score (see Appendix
B), which measures how significant the positional distribution of a given
motif is in overlap genes versus random genes that also have the motif. Figure
2.10 shows the location of cluster 4 (SD p value = 0.00 against the Cellzome
network) for the overlap genes. A total of 15 motifs showed a significant
spatial distribution.
Because of the variability of the 5' UTR, a much tighter distribution should
be obtained when looking at the relative distance between the motif and the
transcription start site. However, the amount of information regarding
transcription start sites is very limited in yeast.
2.5.3 Assessment of known transcription factor binding sites
The process of finding new potential motifs depends on the Teiresias parameters.
However, known motifs that do not satisfy this initial pattern generation
step can still display significant overlap scores. From a list of putative
transcription factor binding sites, about 20 percent appear to have a significant
overlap score for at least one of the networks. Table 2.1 shows some of the
known sites that have significant overlap score(s).
The motif TATATAAA (an extended TATA box) shows a surprisingly
high overlap score with the metabolic and MDS networks, even though the
TATA box is present in most yeast genes. The consensus TATATAAA
is only present in 463 genes in the yeast genome. The 71 overlap genes do
not belong to any well-defined functional group, but most of them are genes
that code for enzymes used in basal metabolism, e.g. sugar metabolism.
Figure 2.9: Overlap network between the d(A)-d(T) motif network and the metabolic network. Essentially, the network can be clustered into two groups of genes, the first group being composed mostly of genes involved in transcription; for example, tRNA synthetases and RNA polymerase subunits. The second group is composed of genes implicated in translation, such as translation initiation factors or ribosomal proteins.
Figure 2.10: Motif locations on the genome relative to the start codon (ATG) of the overlap genes (with Cellzome), within 600 bp upstream of the translation start site. The motif, from cluster 4, is GAGATGAG (see Appendix C).
2.5.4 Inferring functionality to putative motifs
Because a sequence of fewer than 10 defined nucleotides is expected
to occur at random in the genome, the occurrences of motifs corresponding
to transcription factor binding sites (typically shorter than 10-mers) are
not going to be limited to functional locations. Indeed, one can imagine
various mechanisms (wrong context, inaccessibility of the DNA, chromatin
structure, for example) by which a potential TFBS has no functionality. Therefore,
the set of genes that simply have a putative motif in the upstream region
may be dominated by noise that hides the subset of
genes where the motif does have a functional role. This is the case for
AAAATTTT, which occurs upstream of 825 yeast genes, and no apparent
cluster of functionality can be derived from this set. Nevertheless, using only
the overlap network derived from a functional network where the motif shows
a significant score, it is possible to enrich the set with 'real' locations and,
consequently, to infer a possible biological function for the motif. In this case, the
'overlap' set includes only 106 genes, mostly genes encoding
proteins involved in transcription.
Binding motif for  Literature description      Overlap genes              Consensus            metabolic  Cellzome  MDS
MET31/32           (Blaiseau et al., 1997)     methionine biosynthesis    AAACTGTG             5.25       1.46      0.96
HAP2               (Mantovani, 1998)           oxidative phosphorylation  ACCAAT.A             6.51       0.11      1.36
GRF2/REB1          (Chasman et al., 1990)      unknown                    [TC]..[TC][TC]ACCCG  1.88       4.62      3.45
PHO4               (Hayashi and Oshima, 1991)  Met, Thr, Asn synthesis    CACGTG               6.36       1.95      1.94
MBP1/MBF1          (Lowndes et al., 1991)      DNA replication            ACGCGT.A             4.41       4.74      3.81
RPN4p              (Mannhaupt et al., 1999)    proteasome                 GGTGGCAAA            0.33       11.62     13.46
GCN4               (Hope and Struhl, 1985)     AA synthesis               TGACTCA              8.66       3.39      2.87
CBF1               (Dowell et al., 1992)       unknown                    TCAC.TGA             5.21       0.5       1.23
TFIID-TBP          (Struhl, 1995)              unknown                    TATATAAA             4.91       1.39      3.61
Table 2.1: Known transcription factor binding sites that have significant overlap scores. The values are the number of standard deviations from the mean of the random 'overlap score'. The overlap genes column is a functional annotation based on the annotations of the overlap genes.
This example shows that this approach, in addition to finding potential TFBSs,
can also be used successfully to infer a function for the motif, on the
condition that the overlap network with the appropriate functional network is
significant. Along the same lines, functions can be inferred for genes of
unknown function if these genes are part of a significant overlap network.
This is the case for the motif GGTGGCAAA studied above, where the
unknown genes are most probably part of the protein degradation pathway in
yeast.
2.5.5 Promoter scanning
Instead of adopting a motif-centric view, the same analysis can be done for
one or a few promoters. This approach is more applicable for
experimental biologists, who often work on a limited set of genes. The upstream
region of a gene is scanned by sliding a variable-length window. The minimum
motif length is 6, and the maximum motif length is set by the requirement that
the motif occur at least twice in all upstream sequences. For each sequence
defined by that window, the overlap score is then calculated using one or
several functional networks. The procedure is exactly the same as in 2.4.2,
except that the nucleation set is not removed from the motif network. In
summary, all genes that have the motif in their upstream region form the motif
network, which is then overlapped with the functional network. The overlap
score obtained is normalised to 3 standard deviations from the mean of all
scores coming from 100 random motif networks having the same size as the
real motif network.
Figure 2.11: Example of promoter scanning for the yeast gene YDR156W, encoding the RNA polymerase I subunit A14 (500 bp upstream of the gene, 5' to 3'). The x-axis represents the position of the window relative to the start of the studied gene (in bp). The y-axis represents the overlap score, normalised at three standard deviations so that all values less than 0 are not significant. The overlap scores for the Cellzome, MDS and metabolic networks are in blue, red and yellow respectively. Motifs labelled in the figure: CCCGTCTA, AAAATTTT, CTCATCG and GTGGCAAAA.
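The promoter scanning procedure can be sketched in Python as follows. Several points are assumptions made for the example rather than details taken from this work: a window is extended for as long as its sequence is found in at least two upstream regions (a simplification of "occurs at least twice"), the normalised score is computed as the z-score against the random networks minus 3, and all names and toy sequences are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, pstdev


def overlap_score(genes, edges, degree):
    """Overlap score of a fully connected network on `genes` with `edges`."""
    genes = set(genes)
    return sqrt(sum(1.0 / (degree[a] + degree[b] - 1)
                    for a, b in edges if a in genes and b in genes))


def scan_promoter(target, upstream, edges, min_len=6, n_random=100, seed=0):
    """Return (start, motif, normalised score) for each window of `target`."""
    rng = random.Random(seed)
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    pool = sorted(upstream)
    seq = upstream[target]
    results = []
    for start in range(len(seq) - min_len + 1):
        for end in range(start + min_len, len(seq) + 1):
            motif = seq[start:end]
            genes = [g for g, s in upstream.items() if motif in s]
            if len(genes) < 2:
                break  # a longer window cannot match more genes
            real = overlap_score(genes, edges, degree)
            rand = [overlap_score(rng.sample(pool, len(genes)), edges, degree)
                    for _ in range(n_random)]
            sd = pstdev(rand)
            z = (real - mean(rand)) / sd if sd > 0 else 0.0
            results.append((start, motif, z - 3.0))  # below 0: not significant
    return results


# Toy upstream regions and functional network, invented for illustration.
toy_upstream = {"g1": "TTAAAATTTTCC", "g2": "GGAAAATTTTGA",
                "g3": "CCCCCCCCCCCC", "g4": "ACGTACGTACGT"}
toy_scan = scan_promoter("g1", toy_upstream, [("g1", "g2"), ("g3", "g4")])
```

Plotting the normalised score against the window start position, one curve per functional network, gives a profile of the kind shown in Figure 2.11.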
An example of such an analysis is shown in Figure 2.11. The example gene
shown in this figure is the RNA polymerase I subunit A14 gene. In the
proximal region of the promoter, two patterns have a strong overlap score:
AAAATTTT and CTCATCG. A significant overlap score can also be seen for the
MDS network and the KEGG network (in the case of AAAATTTT). Interestingly,
motif AAAATTTT occurs 569 times, CTCATCG occurs 209 times, and both motifs
co-occur in the same upstream region 73 times. This co-occurrence is much
higher than one would expect by chance (p = 8.66 × 10^-20). This result is
in accordance with the now widely held view that transcription factor
binding sites co-occur in functional units (Manke et al., 2003).
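One common way to assess such a co-occurrence is a hypergeometric tail probability. The sketch below is an assumption for illustration: the text does not state which test produced the p-value above, and the total number of upstream regions used in the example call (6000) is a hypothetical round figure, not a value taken from this work.

```python
from math import comb


def cooccurrence_pvalue(n_total, n_a, n_b, n_both):
    """P(X >= n_both) for X ~ Hypergeometric(n_total, n_a, n_b).

    n_total: number of upstream regions considered
    n_a, n_b: regions containing motif A and motif B respectively
    n_both: regions containing both motifs
    """
    upper = min(n_a, n_b)
    tail = sum(comb(n_a, k) * comb(n_total - n_a, n_b - k)
               for k in range(n_both, upper + 1))
    return tail / comb(n_total, n_b)


# Counts from the text; n_total=6000 is an assumed, illustrative value.
p = cooccurrence_pvalue(6000, 569, 209, 73)
```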
2.5.6 Discovering cis-regulatory elements using functional networks in higher eukaryotes
This technique was applied to human using the KEGG database as the
functional network, with a negative result. All the human genes were
retrieved and the 1 kb upstream of each gene start was repeat-masked. The
same procedure was applied to this new dataset, and only one motif appeared
to be significant. This motif is unknown and is most probably a false
positive. This negative result is not surprising, considering the much
higher complexity of the human genome compared to yeast.
One obvious reason is that a significant number of regulatory regions may
be several kb away from the gene start, in enhancers or locus-control
regions. Furthermore, because of the high number of coding genes, the
signal-to-noise ratio may be too small to produce anything significant. Yet,
beyond technical problems inherent to genome complexity, gene regulation in
human probably obeys different rules from those in yeast. Indeed, human
cells are usually in a constant environment and do not need to adapt to
different external conditions by expressing new pathways. Furthermore,
iso-enzymes are very common in humans, and because they are expressed in
very different tissues and at different times, their regulation is very
different. Large-scale protein-protein interaction maps specific to a given
cell type may be more suitable than the metabolic network for such an
analysis. Unfortunately, to date no such large-scale study has been done on
higher eukaryotes.
2.6 Conclusion
Many previous studies have shown that genes with similar expression
profiles are more likely to encode interacting proteins (Ge et al., 2001).
This study goes a step further by using this relationship to find
cis-regulatory motifs, assuming that co-regulated genes have common
regulatory element(s) in their upstream regions. This approach identifies 42
potential sites that are strongly suspected to be involved in gene
expression, most likely via transcriptional regulation. These correspond to
some well-known motifs and other novel cases.
The availability of good quality functional networks is a major limiting step
for this approach, especially when considering higher eukaryotes. With the
completion of more large scale studies using new techniques, this problem
will become less prevalent, and attempts to use this technique on higher eu-
karyotes like humans can be made. Chromatin IP appears to be one of the
most promising techniques to use for this approach. Indeed, Chromatin IP
identifies regions where a particular transcription factor binds and, by exten-
sion, also identifies downstream genes that are potential targets. Applied to
many transcription factors on a genome-wide analysis, the resulting network
can be used as the functional network. Chromatin IP has been used success-
fully on yeast (Lee et al., 2002), as well as on higher eukaryotes (Li et al.,
2003).
Beyond its usefulness for cis-regulatory motif discovery, this method can
also be used to infer a function for a particular motif or gene. It can also
be used to refine the current understanding of functional interactions.
Taking the example of the nucleotide pathway discussed above, NTPs are known
to be used in many biological processes, and the type of enzymes that act
upon these compounds varies greatly. Nevertheless, the overlap network
between the nucleotide pathway and the MluI motif network essentially
highlights CDC genes involved in the cell cycle. This refinement towards
functional modules can be broadly applied to metabolic networks, protein
interaction networks, or any other functional network. Potentially, each
significant overlap network obtained here can be considered as a refinement
of the initial network.
Now that the genomes of four other yeast-related species have been
completed, combining this analysis with evolutionary information would be
expected to produce even more interesting results. Indeed, one would expect
an overall conservation of the functional network topology across related
species.
Finally, the concept of the overlap network can be applied to more
biological problems than just the discovery of cis-regulatory elements. One
can imagine, for example, evaluating experimental networks relative to a
reference network using the overlap score.
Chapter 3
Evolutionary dynamics of cis-regulatory regions in higher eukaryotes
As discussed in the introduction, cis-regulatory regions have very different
evolutionary dynamics from coding sequences: insertions, deletions,
translocations and inversions can be quite common. Consequently, the
homology between even closely related species can be hard to detect and
interpret, as conventional tools have often been designed for coding
sequences. Promoterwise, a pair-wise alignment algorithm, has been
specifically developed by Dr. Ewan Birney to address these types of issues.
The basic schema is represented in Figure 3.1.
The algorithm begins by locating every possible small ungapped match of
six out of seven nucleotides. These matches are extended and merged when
possible. The algorithm then uses the pair-HMM from DBA (Jareborg et al.,
1999) to align the matches. The resulting hits are then sorted according to
their log-odds score. The aligned regions are independent of strand
direction, gap length and position in the sequence, making this procedure
particularly well suited to regulatory region comparisons.
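The seeding step can be illustrated with a brute-force sketch. This is not the Promoterwise implementation (which is part of the Wise2 package); it only shows what "six out of seven nucleotides" means for an ungapped match. The function name and the quadratic scan are assumptions made for clarity, and only the forward strand is scanned; the opposite strand would be handled by passing the reverse complement of one sequence.

```python
def find_seeds(seq_a, seq_b, window=7, min_matches=6):
    """Return (i, j) start positions of ungapped windows that match at
    min_matches or more of their window positions."""
    seeds = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(1 for k in range(window)
                          if seq_a[i + k] == seq_b[j + k])
            if matches >= min_matches:
                seeds.append((i, j))
    return seeds


# The two example sequences shown in the Promoterwise schema.
seeds = find_seeds("ATGGCGGTGGGGATCCAACC", "ATGGCGGAGGCGATACATCC")
```

Neighbouring seeds on the same diagonal (same value of j - i) are the natural candidates for the extension and merging step.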
Promoterwise was first used on specific examples, with manual curation of
the data, in order to identify potential problems, and was then used to
perform a systematic homology search on the whole genome. This chapter
summarises the results obtained.
[Figure 3.1: flow diagram of Promoterwise. Steps shown: find small ungapped matches (6 base pair matches) as a heuristic for reducing alignment time; extension and merge of close seeds; DNA Block Aligner (match blocks A to D separated by gaps and unmatched regions, with state labels match (0.65), gap (0.05), unmatch (0.99) and blockopen (0.01)); rating the alignments using the log-odds bit score; Promoterwise output. Example sequences: ATGGCGGTGGGGATCCAACC and ATGGCGGAGGCGATACATCC.]
Figure 3.1: Promoterwise schema.
3.1 Detailed analysis of a specific example: the Atonal 5 gene
A number of manual analyses were performed in collaboration with the
vertebrate developmental group of Jochen Wittbrodt. These analyses are
useful not only for understanding specific gene expression patterns involved
in vertebrate development but also for providing insight for global
comparisons. By using a well-described gene, the results can be interpreted
with confidence.
3.1.1 The Atonal 5 protein
The study of the promoter of atonal 5 gene is a collaborative project with
Filippo Del Bene from the Wittbrodt lab.
The developing eye, and in particular retinal neuron development, is a good
model for the study of pattern formation and cell fate determination in
developing embryos, and has an active international research community.
Atonal 5, a basic helix-loop-helix transcription factor, is a regulator of
retinal ganglion cell (RGC) development and is expressed in retinal
progenitors (Vetter and Brown, 2001). The neuronal retina contains 7
different types of neural and glial cells, and atonal 5 is critical for the
development of RGCs.
The genes that have been shown to be regulated by atonal5 are:
1. Delta1 gene: induces the lateral inhibition of differentiation toward
the RGC fate (Schneider et al., 2001).
2. MyT1 gene: allows the cell to escape Notch inhibition and adopt the
RGC fate (Schneider et al., 2001).
3. Brn3 gene: POU homeodomain transcription factor, important for RGC
development and survival (Hutcheson and Vetter, 2001).
4. nAchR gene: neural nicotinic acetylcholine receptor (Hernandez et al.,
1995).
Atonal5 has also been shown to regulate its own expression (Matter-Sadzinski
et al., 2001).
Figure 3.2: Transgenic medaka embryo expressing a GFP construct under the medaka atonal5 promoter (5kb). GFP is expressed in a population of neurons in the retina which project their axons to the brain, forming the two optic nerves that cross at the optic chiasm (Filippo Del Bene, personal communication).
3.1.2 The promoter of atonal5 gene
Filippo Del Bene isolated and sequenced the 5kb region immediately
upstream of the gene in medaka (Oryzias latipes) and showed experimentally
that this region is sufficient to express a GFP reporter in the correct
spatial and temporal pattern. Figure 3.2 shows the image of a fish embryo
expressing GFP from a construct containing the 5kb upstream sequence of the
medaka atonal5 gene.
Atonal 5 is a good candidate for finding interesting regulatory motifs
because it has been experimentally shown that the 5kb upstream sequence
should possess all the necessary regulatory regions for its correct spatial
and temporal expression in the developing fish. This gene is also a good
candidate because its product is involved in key pathways for eye formation,
a process shared by most vertebrates; as such, orthologues are available
from distant species such as mammals and chicken.
The first step was to find the correct homologue of the medaka atonal5 gene
in all fully sequenced vertebrates currently available: human (Homo
sapiens), rat (Rattus norvegicus), mouse (Mus musculus), zebrafish (Danio
rerio), fugu (Fugu rubripes) and chicken (Gallus gallus). To do so, the
medaka atonal5 protein sequence was searched with BLASTP against all the
genomes, and the best hit was retrieved as the orthologous gene. The
annotation (Ensembl IDs) and genomic location of the orthologous genes are
shown in Table 3.1. These results are in accordance with the best reciprocal
hits in the Ensembl-Compara database (Birney et al., 2004).

Species            Ensembl ID          Gene chromosomal location      Upstream length
Homo sapiens       ENSG00000179774     10.69883835-69885409           50983 bp
Rattus norvegicus  ENSRNOG00000000384  20.26945745-26950000           21114 bp
Mus musculus       ENSMUSG00000036816  10.62748738-62771738           23420 bp
Danio rerio        ENSDARG00000022606  ctg9353.113860-128060          14570 bp
Fugu rubripes      SINFRUG00000130186  Chr-scaffold-1775.16614-21579  5094 bp
Gallus gallus      ENSGALG00000003931  6.9845459-9845914              29799 bp
Oryzias latipes    -                   -                              2881 bp

Table 3.1: Ensembl IDs and locations of the atonal5 homologues in the Ensembl 16.0 release, apart from Gallus gallus, for which the Ensembl pre-release was used. The upstream length is the length of the intergenic region up to the next upstream annotated gene.
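The best-reciprocal-hit criterion used as a cross-check above can be sketched as follows. This is a generic illustration, not the Ensembl-Compara pipeline: hits are assumed to be available as (query, subject, score) tuples from two searches run in opposite directions, and the function names and identifiers in the example are hypothetical.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def best_reciprocal_hits(hits_a_vs_b, hits_b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    best_ab = best_hits(hits_a_vs_b)
    best_ba = best_hits(hits_b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy example with made-up identifiers and scores.
a_vs_b = [("medaka_ath5", "human_X", 210.0), ("medaka_ath5", "human_Y", 95.0)]
b_vs_a = [("human_X", "medaka_ath5", 205.0), ("human_Y", "medaka_other", 120.0)]
pairs = best_reciprocal_hits(a_vs_b, b_vs_a)
```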
A BLAST search of the 5kb upstream region of the medaka atonal5 gene
against fugu revealed a gene that has not been annotated in medaka. Once
this upstream gene is excluded (assuming, therefore, that no regulatory
motifs are located within the coding sequence of the upstream gene, which is
a reasonable starting hypothesis), the resulting upstream medaka sequence is
believed to be 2881 bp long.
The upstream regions of the homologous genes were retrieved manually. Each
corresponds to the whole non-coding region stretching from the upstream gene
to the annotated gene start of the atonal5 homologue. Promoterwise was run
using a number of parameters for all possible pairs, and each region of
homology was compared manually with the other comparisons in order to find
overlapping conserved sequences across at least three species.

[Figure 3.3: multiple sequence alignment (Jalview view) of medaka, fugu, zebrafish, mouse, human, rat and chicken. Region 1: common region upstream of the atonal5 gene in human, mouse, rat, chicken, fugu, zebrafish and medaka, located about 500 bp upstream of the annotated gene start (in human).]
Figure 3.3: The best conserved region in the upstream region of the atonal5 genes. Visualisation tool from Jalview (Clamp et al., 2004).
Regions of homology with fugu extend up to 1636 nt upstream of the start
codon of the medaka atonal5 gene. This result is in accordance with the
observation that new constructs containing a little more than 2kb upstream
of the gene are sufficient to trigger expression, whereas 1.5kb seems to
give a weaker result (experimental results from Filippo Del Bene).
Essentially, three regions with homology in at least two other species were
found, annotated as regions one, two and three, as shown in Figure 3.3 and
Figure 3.4. Region 1, represented in Figure 3.3, is the most proximal to the
start of the atonal5 gene in medaka (about 500 base pairs away) and the most
conserved, as it is found in all six species studied. Because of the long
divergence time (450 million years) between mammals and fish, only the
strictly essential motif is presumed to have been conserved, which includes
the motif CCACCTG, repeated twice with a conserved gap of three nucleotides.
This alignment also included chicken.
This very well conserved sequence is a good candidate for a transcription
factor binding site, possibly the atonal5 binding site itself, since the product
of atonal 5 gene regulates the expression of its gene (Matter-Sadzinski et al.,
2001). Flanking this motif are a conserved AG-rich region (possibly an SP1
site) upstream and a putative TATAA box downstream. Interestingly, the
distance between the putative motif and the TATAA box is either 19
(mammals), 14 (medaka, fugu) or 9 (zebrafish); apart from chicken, these
distances differ by multiples of 5, which corresponds to about half a turn
of the DNA helix.
Downstream from the putative TATAA box is a conserved CT rich region
(possibly a SP1 site as well). It is interesting to note that the whole region
is flanked by two putative SP1 sites that are reverse complements and can
therefore be involved in secondary structure of the DNA.
The annotated gene start is located about 500 bp away from the TATAA box,
and this distance is more or less conserved between all the species studied.
Regions 2 and 3, represented in Figure 3.4, are common only to the fish
species and are located about 1150 and 1500 bp away from the start of the
medaka atonal5 gene respectively, much further upstream than region 1.
Because no mammalian sequences were included, the resolution is not as good
as for region 1.
3.1.3 The Atonal5 motif
If the motif CCACCTG is the binding site for the atonal 5 protein, genes
that also have this conserved motif may be target genes of transcription fac-
tor atonal 5.
The next step was, therefore, to find other genes that have the motif
CCACCTG or its reverse complement in an upstream region that is conserved
throughout the human, mouse, rat and fish orthologues. To find such cases, a
simple pattern-matching program was developed. This program fetched all the
orthologues of human genes in mouse, rat, fugu and zebrafish and retrieved
the 5kb upstream sequences (see the following section for more details).
Promoterwise was then run on the mammalian orthologous pairs. To be
selected, a gene needs to have:
1. the motif in a conserved upstream region (Promoterwise bitscore > 25,
see next section for justification of this cut-off) when considering the
human-mouse and human-rat comparisons.
2. the motif in the orthologous intergenic region of at least one fish (fugu
or zebrafish), independent of the alignment information.

[Figure 3.4: multiple sequence alignments (Jalview view) of D. rerio, F. rubripes and O. latipes for two regions. Region 2: common region upstream of the atonal5 gene only in zebrafish, fugu and medaka, located about 2kb upstream of the annotated gene start (in fugu). Region 3: common region upstream of the atonal5 gene only in zebrafish, fugu and medaka, located about 2.6kb upstream of the annotated gene start (in fugu).]
Figure 3.4: The two medaka regulatory regions common with fugu and zebrafish. Visualisation tool from Jalview (Clamp et al., 2004).
No conservation information is used for fish because alignments of
functional regions are at the limit of detectability in mammal-fish
comparisons. Indeed, the conserved region 1 in the atonal5 example had a
bitscore of only 16 in the human-fugu pair-wise comparison, a score that can
often occur by chance.
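The two selection criteria can be sketched as a small filter. This is an illustrative reconstruction, not the program written for this work: the data layout (a dictionary of aligned blocks per mammalian comparison, each with its bitscore, and a list of fish upstream sequences) and all function names are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def motif_positions(seq, motif="CCACCTG"):
    """0-based start positions of the motif or its reverse complement."""
    seq = seq.upper()
    targets = {motif, reverse_complement(motif)}
    width = len(motif)
    return [i for i in range(len(seq) - width + 1)
            if seq[i:i + width] in targets]


def is_candidate(conserved_blocks, fish_upstreams,
                 motif="CCACCTG", min_bits=25.0):
    """Apply the two criteria listed above.

    conserved_blocks: {"human-mouse": [(bitscore, human_block_seq), ...],
                       "human-rat":   [(bitscore, human_block_seq), ...]}
    fish_upstreams: upstream sequences of the fugu/zebrafish orthologues.
    """
    in_mammals = all(
        any(bits > min_bits and motif_positions(block, motif)
            for bits, block in conserved_blocks.get(pair, []))
        for pair in ("human-mouse", "human-rat")
    )
    in_fish = any(motif_positions(seq, motif) for seq in fish_upstreams)
    return in_mammals and in_fish
```

The fish sequences are searched for the bare motif only, mirroring the choice explained above of not relying on mammal-fish alignments.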
A total of 128 candidates were identified using the motif CCACCTG (or its
reverse complement). Figure 3.5 shows a few examples where the data were
manually curated and aligned across species. In all these examples, no
significant alignments were detected in the mammal-fish comparisons; only
the motif was present. These genes are also known to be expressed in retinal
ganglion cells.
To validate the hypothesis that the CCACCTG motif is the binding site for
atonal5, the known targets of atonal 5 (Delta1, MyT1, Brn3 and nAchR, see
3.1.1) were manually checked for the conserved motif within the 5kb upstream
of the genes. The result is shown in Figure 3.6. Because these genes were
not found by the automated procedure, the motif cannot match the consensus
CCACCTG in at least one of the species. This is the case for Brn3, where
only the human site differs from the consensus, being CCACCTC (reverse
complement) instead of CCACCTG. In the case of Delta1, the consensus
sequence CCACCTG is replaced by GCACCTG in all species. The two other target
genes of atonal5 (MyT1 and nAchR) do not seem to possess similar motifs
within 5kb upstream of the genes.
3.1.4 Experimental validations
Filippo Del Bene (EMBL) confirmed experimentally both the predicted binding
site for atonal and some of the predicted candidate genes. The binding site
was confirmed both in vitro and in vivo to be CCACCTG. EMSAs
(Electrophoretic Mobility Shift Assays) were performed on the region
containing the two wild-type motifs (see Figure 3.3). A similar assay was
performed on mutants in which the motif was changed. Ath5 binds only the
wild-type motif and not the mutant in which the E-box was altered. An in
vivo assay using GFP expression confirmed the in vitro result.

[Figure 3.5: three multiple sequence alignments (Jalview view) around the conserved motif. dlx2 (human, mouse, rat, zebrafish): dlx2 is known to be co-expressed with nBrn3 and seems to be involved in defining the retinal ganglion and inner nuclear layers of the developing and adult mouse retina (de Melo et al., 2003). rx (human, rat, mouse, fugu): the gene encoding the Rx/rax transcription factor (Casarosa et al., 1997) belongs to a subfamily of the paired-like homeobox genes (Galliot et al., 1999); a previous report showed that RX was able to define the retina-diencephalon territory in the anterior neural plate (Andreazzoli et al., 1999). slit1 (human, mouse, rat, fugu): Slit1 is expressed in retinal ganglion cells and is responsible for regulating axon guidance and cell migration (Plump et al., 2002).]
Figure 3.5: Examples of candidate genes that possess the conserved consensus motif CCACCTG within 5kb upstream in mammals and at least one fish.

[Figure 3.6: two multiple sequence alignments (Jalview view). Delta1: ENSG00000112577, ENSMUSG00000014773, ENSRNOG00000014667, ENSDARG00000020219, SINFRUG00000146981, SINFRUG00000149486. Brn3: H_sapiens, M_musculus_A, M_musculus_B, R_norvegicus_A, R_norvegicus_B, D_rerio.]
Figure 3.6: Motifs upstream of the known Atonal 5 target genes Brn3 and Delta1. MyT1 and nAchR, the other two targets of Atonal5, do not possess similar motifs.
Thirty predicted direct target genes were tested for their expression
pattern in the fish retina and compared to the expression pattern of atonal
itself. Twenty of these candidates show an expression pattern similar to
that of atonal. In other words, more than 60 % of the predicted genes tested
were confirmed to be co-expressed with atonal.
3.1.5 Conclusion regarding this example
Aligning non-coding regions accurately and distinguishing cis-regulatory
elements from the background noise is a much harder problem than for coding
sequences. Nevertheless, an alignment procedure adapted to non-coding DNA,
such as Promoterwise, combined with a careful manual analysis of the
results, is a powerful strategy that can give impressive results. Indeed,
for the atonal 5 regulatory region, the conserved site CCACCTG in region 1
is almost certainly a binding site for a transcription factor, possibly
atonal5. Once a putative transcription factor binding site is identified, it
can be used to screen the entire genome for candidate genes that may be
regulated by the same factor.
When applying this procedure it is important to consider, in addition to
the biological issues discussed in the introduction, the following technical
issues:
1. Wrong gene annotation or wrong orthology mapping, which leads to the
comparison of two regions that are not related.
2. Missing exons or unannotated 5’ UTRs for one or both members of an
orthologous pair, which lead either to two unrelated regions or to related
regions that are not upstream of the gene of interest but are rather exonic
or intronic sequences.
3. Unannotated upstream genes that would account for most of the signal,
since genes are usually more conserved than intergenic sequences.
4. If the upstream gene is on the opposite strand to the gene studied, then
the regulatory region of the upstream gene will potentially be detected as
well.
With the continuous improvement of genome annotation these problems will
become less prevalent, and one can imagine an automated procedure that
would find potential transcription factor binding sites and automatically
find other candidate genes that may be under similar regulation. This is the
focus of the next section and the following chapter.

[Figure 3.7: flow diagram. Data sources: COMPARA 17_1 and ENSMART 17_1. Steps: get homolog relationships (many-to-many relationships); get 5000 bp upstream of genes; run RepeatMasker and remove upstream genes; run Promoterwise for each homologous pair. Species shown: H. sapiens, M. musculus, R. norvegicus, F. rubripes, D. rerio, D. melanogaster, D. pseudoobscura, A. gambiae, C. elegans, C. briggsae.]
Figure 3.7: Procedure for running Promoterwise on complete genomes.
3.2 Global run of promoterwise
I wished to develop a comprehensive view of regulatory conservation between
the fully sequenced vertebrates. This analysis was done in order to obtain a
global picture of non-coding sequence conservation in the upstream regions
of genes. The schema of the procedure is shown in Figure 3.7. The species
considered correspond to all the fully sequenced genomes in Ensembl (Birney
et al., 2004), and the orthology relationships were derived from the
Ensembl-Compara database. In order to consider all possibilities, all
relationships were used, including best reciprocal hits (BRH) and reciprocal
hits based on synteny (RHS). For each orthologous pair, the 5kb
repeat-masked sequence upstream of each gene was retrieved, and the upstream
gene was removed if necessary. Promoterwise was then run on these sequences.
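The retrieval of the upstream region, with truncation at the neighbouring upstream gene, can be sketched as below. This is a simplified illustration under stated assumptions: coordinates are 0-based and half-open, only a gene on the forward strand is handled, and the function name and the example coordinates are hypothetical rather than taken from Ensembl.

```python
def upstream_region(gene_start, other_genes, flank=5000):
    """Return (start, end) of the upstream region of a forward-strand gene.

    The region covers at most `flank` bases before gene_start and is cut
    at the end of any annotated gene that reaches into it.
    other_genes: iterable of (start, end) intervals of other genes.
    """
    region_start = max(0, gene_start - flank)
    for start, end in other_genes:
        if start < gene_start and end > region_start:
            # Clip the region so that it begins after the upstream gene.
            region_start = max(region_start, min(end, gene_start))
    return region_start, gene_start


# Made-up coordinates: an upstream gene ending 2881 bp before the gene start.
region = upstream_region(10000, [(6000, 7119)])
```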
3.2.1 Promoterwise : the algorithm
Dr. Ewan Birney developed Promoterwise. Promoterwise is a pragmatic
heuristic that seeds from small ungapped matches (6 base pairs out of 7) on
both strands, extends the seeds and merges close seeds. Dynamic-programming
style routines are then used across the resulting co-linear regions. To do
so, the established DNA Block Aligner (DBA; Jareborg et al., 1999) model is
used for the co-linear alignment. The DBA model allows small insertions and
deletions in functional regions interrupted by potentially long
non-functional regions. The resulting set of DBA alignments is then resolved
into one set of alignments by a simple greedy method: all the alignments are
rated by their log-odds bit score, and progressively less likely alignments
are accepted only if they do not use bases used by previously accepted
alignments. Promoterwise has been incorporated into the Wise2 package.
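The greedy resolution step lends itself to a short sketch. This is not the Wise2 code. It assumes that each alignment is summarised by its bit score and by one half-open interval on each sequence, and that an alignment is rejected if it reuses bases on either sequence; the names are hypothetical.

```python
def resolve_alignments(alignments):
    """Greedily keep the best-scoring alignments that share no bases.

    alignments: list of (bits, (a_start, a_end), (b_start, b_end)),
    with half-open intervals on sequence A and sequence B.
    """
    used_a, used_b = set(), set()
    accepted = []
    # Visit alignments from the most to the least likely.
    for bits, (a0, a1), (b0, b1) in sorted(alignments,
                                           key=lambda aln: aln[0],
                                           reverse=True):
        bases_a = set(range(a0, a1))
        bases_b = set(range(b0, b1))
        if bases_a & used_a or bases_b & used_b:
            continue  # reuses bases of an already accepted alignment
        accepted.append((bits, (a0, a1), (b0, b1)))
        used_a |= bases_a
        used_b |= bases_b
    return accepted


example = [
    (30.0, (250, 320), (400, 470)),
    (119.0, (0, 300), (50, 350)),
    (27.0, (400, 450), (500, 550)),
]
kept = resolve_alignments(example)
```

In this toy example the 30-bit alignment is dropped because it overlaps the 119-bit alignment on sequence A, while the 27-bit alignment is kept.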
3.2.2 Defining the cut-off
Promoterwise systematically produces random low-scoring alignments. It is
therefore important to define a cut-off score in order to distinguish
alignments due to negative selection from random alignments. Two methods
were developed, both of which worked well in defining this cut-off.
3.2.2.1 Percentage of positive pairs as a function of the cut-off score
The first method looks at the percentage of positive pairs, that is, the
percentage of orthologous pairs that have a hit above the cut-off, as a
function of the score cut-off. If we assume that significant and
non-significant hits have distinct score distributions, and that a number of
orthologous pairs are wrong or do not possess related upstream sequences,
then we expect to see a drastic drop in positive pairs once the cut-off
allows mostly significant hits only.
The results shown in Figure 3.8 follow what is expected. The drop of positive
pairs depends largely on the species considered, but overall, the drop occurs
somewhere around 25 bit score cut-off.
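The curve behind this first method can be computed as in the following sketch. It is an illustration only: it assumes that the best Promoterwise bit score of each orthologous pair has already been collected (0 for pairs without any hit), and the function name is hypothetical.

```python
def percent_positive_pairs(best_scores, cutoffs):
    """Percentage of orthologous pairs whose best hit exceeds each cut-off.

    best_scores: best bit score per orthologous pair (0 if no hit).
    cutoffs: iterable of score cut-offs to evaluate.
    """
    n_pairs = len(best_scores)
    return {
        cutoff: 100.0 * sum(1 for s in best_scores if s > cutoff) / n_pairs
        for cutoff in cutoffs
    }


# Toy scores for four orthologous pairs.
curve = percent_positive_pairs([10, 30, 50, 0], [0, 25, 40])
```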
3.2.2.2 Strand conservation as a function of the cut-off score
If we assume that homologous regions between two species also retain their
strand direction most of the time, the overall fraction of same-strand hits
as a function of the bit score of the alignment is a good indicator of the
fraction of homologous alignments versus random alignments. The result is
shown in Figure 3.9. The fraction of same-strand hits as a function of the
bit score of the alignment depends on the two species compared, but reaches
a plateau very close to 1 at scores higher than 40 bits in most cases.

[Figure 3.8: plot titled "percentage of upstream regions that have a hit as a function of the score cutoff". x-axis: score cutoff (0 to 140); y-axis: percentage of upstream regions (0 to 100); series: human-fugu, human-mouse, mouse-rat.]
Figure 3.8: Positive upstream regions as a function of the score cut-off.
A notable exception is the comparison between rat and mouse upstream re-
gions, where a significant fraction of hits with bit score higher than 40 are still
on opposite strands relative to each other. This is very interesting and one
hypothesis to explain such an observation is that rat and mouse are so closely
related that (a) non-functional regions are still alignable, and (b) very often
these non-functional regions are inverted in one species without negative se-
lection. Further work needs to be done to assess the dynamics of inversion of
non-functional regions, but if this hypothesis is true it means that the rate of
inversion is quite high for non-functional regions in non-coding sequences, but
that these inversions are negatively selected in functional non-coding regions.
Inversion of well conserved regions between more remote species like human
and rodents is a rare event but does occur as Figure 3.10 shows.
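The strand-conservation measure can be sketched by binning hits on their bit score. This is an assumed, simplified formulation: each hit is reduced to its bit score and a flag saying whether both aligned regions are on the same strand, and the bin width of 10 bits and the function name are arbitrary illustrative choices.

```python
from collections import defaultdict


def same_strand_fraction(hits, bin_width=10):
    """Fraction of same-strand hits per bit-score bin.

    hits: iterable of (bits, same_strand) with same_strand a bool.
    Returns {bin_lower_bound: fraction}, ordered by bin.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [same strand, total]
    for bits, same_strand in hits:
        score_bin = int(bits // bin_width) * bin_width
        counts[score_bin][1] += 1
        if same_strand:
            counts[score_bin][0] += 1
    return {score_bin: same / total
            for score_bin, (same, total) in sorted(counts.items())}


fractions = same_strand_fraction(
    [(5, True), (8, False), (42, True), (47, True)]
)
```

For random hits the fraction is expected to be near 0.5, which is why low-scoring bins are informative about the noise level.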
[Figure 3.9: plot titled "effect of the score on the proportion of reverse complement hits". x-axis: score (0 to 100); y-axis: ratio of forward-forward versus forward-reverse hits (0.3 to 1); series: D. melanogaster / A. gambiae, C. elegans / C. briggsae, F. rubripes / D. rerio, H. sapiens / M. musculus, H. sapiens / R. norvegicus, M. musculus / R. norvegicus.]
Figure 3.9: Fraction of hits with both sequences on the forward strand as a function of the score. The high ratio of reverse-complemented hits at low scores is due to the fact that these hits are random. Most of the significant hits (> 30) involve sequences that are both on the same strand.
Figure 3.10: Alignments between H. sapiens gene ENSG00000091527 (top) and
M. musculus gene ENSMUSG00000032803 (bottom). The green boxes correspond
to aligned regions with a bit score of 25 or more on the plus/plus strand.
The inverted region in red is about 300 bp long with an alignment score of
119.
3.2.3 Results
Taking a score of 25 as a conservative threshold for significant
alignments, Table 3.2 shows the percentage of upstream regions that have
at least one region with a significant score. The percentage of upstream
regions with a Promoterwise score higher than 25 is 62.65% for H. sapiens
versus M. musculus and 67.13% for H. sapiens versus R. norvegicus (91
million years divergence). As expected, the rat and mouse comparison gives
the highest percentage of upstream regions with significant Promoterwise
scores (73%).
The similarity drops quite dramatically when comparing mammals and fish
(450 million years divergence), with 3.32% and 3.65% of H. sapiens genes
having significant upstream region homology with F. rubripes and D. rerio
respectively. The figure is even lower for A. gambiae and D. melanogaster
(250 million years divergence), where only 1.44% of homologues give
significant Promoterwise scores.
Mammals and fish:
                R. norvegicus         M. musculus            F. rubripes         D. rerio
H. sapiens      14653/9838 (67.13%)   18328/11483 (62.65%)   10300/342 (3.32%)   7961/291 (3.65%)
R. norvegicus   -                     18703/13640 (72.93%)   10456/387 (3.70%)   8049/314 (3.90%)
M. musculus                           -                      10732/340 (3.17%)   8267/276 (3.33%)
F. rubripes                                                  -                   7566/1128 (14.9%)

Diptera:
                D. melanogaster
A. gambiae      8025/116 (1.44%)

Nematodes:
                C. briggsae
C. elegans      11714/6621 (56.52%)

Table 3.2: Total orthologous pairs / number of pairs that have at least
one region with a score higher than 25 bits (percentage of positives). All
the orthologous pairs were retrieved from Ensembl Compara release 16.0.
Figure 3.11: Hypergeometric distribution used to calculate the probability
of seeing D by chance. The sets shown in the diagram are:
A: all possible orthologous pairs
B: orthologous pairs with GO ID X
C: orthologous pairs with conserved sequence (Promoterwise score > cutoff)
D: orthologous pairs with conserved sequence and GO ID X
3.2.4 Genes with conserved 5’ proximal intergenic regions
Visual inspection of the data suggests that many positive upstream regions
belong to specific sets of genes. These fall into particular classes of
proteins, for example transcription factors or key genes involved in
developmental processes.
In order to test systematically for an enrichment of particular classes of
genes among highly conserved regulatory regions between fish and mammals,
all the Gene Ontology (GO) annotations (Harris et al., 2004) were mapped
to Ensembl genes. Annotations that show a significant enrichment in the
positive set, that is, the set of genes with significant conservation in
the upstream region, were then selected. This enrichment was estimated
using a hypergeometric distribution, as shown in Figure 3.11.
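
As an illustration of this test, the probability of observing at least D pairs by chance can be computed directly from the hypergeometric distribution. The sketch below uses the set names A, B, C and D of Figure 3.11; the function name is hypothetical, and the choice of the upper tail P(X >= D) is an assumption, since the text does not state which tail was used.

```python
from math import comb

def enrichment_pvalue(total_pairs, pairs_with_go, conserved_pairs, conserved_with_go):
    """Upper-tail hypergeometric probability P(X >= D).

    total_pairs       = |A|, all orthologous pairs
    pairs_with_go     = |B|, pairs annotated with GO ID X
    conserved_pairs   = |C|, pairs with a conserved upstream region
    conserved_with_go = |D|, conserved pairs annotated with GO ID X
    """
    upper = min(pairs_with_go, conserved_pairs)
    denominator = comb(total_pairs, conserved_pairs)
    numerator = sum(
        comb(pairs_with_go, k) * comb(total_pairs - pairs_with_go, conserved_pairs - k)
        for k in range(conserved_with_go, upper + 1)
    )
    return numerator / denominator
```

Under-represented classes can be tested in the same way by summing the lower tail (k from 0 to D) instead.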
The results for conserved upstream regions in the human-mouse and
human-fugu comparisons are shown in Tables 3.3 and 3.4 respectively.
The positive sets in both cases mostly contain genes that are involved in
development and are transcription factors. The human-mouse positive set
additionally contains genes involved in signal transduction pathways.
Conversely, the same study can be done for under-represented classes in
the positive set; Table 3.5 shows the results for the human-mouse
comparison. Globally, genes that code for proteins located in the
ribosome, as well as olfactory genes, seem to be under-represented. No
significant classes can be found when looking at the human-zebrafish
comparison.
Although one cannot rule out technical artifacts that would explain such
an enrichment, such as a better mapping for certain types of genes, this
result is in accordance with the common belief that key proteins involved
in development, such as transcription factors, have very well conserved
regulatory regions. This result also implies that the patterns of
expression of transcription factors and developmental genes are overall
conserved throughout evolution.
3.3 Conclusion
From the specific examples described in this chapter and the many cases
reported in the literature, it becomes clear that conservation of
non-coding DNA can be detected by alignment-based methods for relatively
close species only. As shown in this chapter, alignment methods only work
on remote species for genes that tend to be very well conserved across
evolution (e.g. transcription factors or key genes involved in
developmental processes), which account for only 3% of all genes.
Significant alignments of related sequences usually conserve the strand
direction, but in the case of the rat-mouse comparison a significant
number of alignable regions have been flipped, suggesting that inversion
of possibly neutral sequences is a common process. If this is the case,
one would expect the comparison between human (Homo sapiens) and
chimpanzee (Pan troglodytes) to behave in the same way, as the divergence
time between these two species is very small.
GO category   Type                 Probability   GO annotation
GO:0005578    cellular component   8.31e-09      extracellular matrix
GO:0006357    biological process   3.58e-09      regulation of transcription from Pol II promoter
GO:0001501    biological process   3.24e-10      skeletal development
GO:0007165    biological process   2.21e-10      signal transduction
GO:0007267    biological process   1.94e-10      cell-cell signaling
GO:0007399    biological process   2.80e-11      neurogenesis
GO:0005634    cellular component   2.35e-11      nucleus
GO:0007275    biological process   1.86e-11      development
GO:0006355    biological process   1.78e-11      regulation of transcription, DNA-dependent
GO:0003700    molecular function   4.02e-12      transcription factor activity

Table 3.3: Ten most significant GO category enrichments when using
human-mouse conserved upstream sequences (at least one region with
Promoterwise bitscore > 100).
GO category   Type                 Probability   GO annotation
GO:0001501    biological process   0.000119      skeletal development
GO:0007399    biological process   7.07e-05      neurogenesis
GO:0003702    molecular function   5.06e-05      RNA polymerase II transcription factor activity
GO:0008151    biological process   3.30e-05      cell growth and/or maintenance
GO:0007345    biological process   3.04e-05      embryogenesis and morphogenesis
GO:0007507    biological process   2.05e-05      heart development
GO:0005634    cellular component   1.38e-11      nucleus
GO:0007275    biological process   1.29e-11      development
GO:0003700    molecular function   6.07e-12      transcription factor activity
GO:0006355    biological process   5.46e-12      regulation of transcription, DNA-dependent

Table 3.4: Ten most significant GO category enrichments when using
human-fugu conserved upstream sequences (at least one region with
Promoterwise bitscore > 25).
GO category   Type                 Probability   GO annotation
GO:0003735    molecular function   2.0e-7        structural constituent of ribosome
GO:0005739    cellular component   2.4e-6        mitochondrion
GO:0005840    cellular component   1.4e-5        ribosome
GO:0004984    molecular function   1.4e-5        olfactory receptor activity

Table 3.5: GO classes that are significantly under-represented in the set
of conserved upstream sequences (at least one region with Promoterwise
bitscore > 100).
Conversely, a comparison of very remote species gives very good resolving
power, as the non-functional homologous sequences are usually not alignable
anymore. Personal experience has shown that the best results have been ob-
tained by using a hybrid strategy, combining alignment from Promoterwise
(when comparing relatively close species like human and mouse) with basic
motif-search techniques (when dealing with more remote species like human
and fish). This strategy can be used to obtain a number of genes that have
conserved elements across mammals and fish, indicative of functionality. In
the case of the CCACCTG motif analysed in the first part of this chapter,
this strategy produced about 120 possible target genes, while a simple search
of this motif without the conservation information retrieves virtually all the
genes in the human genome. Because genes that have the same conserved
putative regulatory site may also be under similar regulatory mechanisms,
this method may be used as a quick and cheap alternative for micro-array
analysis in retrieving candidate genes.
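
The motif-search half of this hybrid strategy can be illustrated with a minimal sketch: given the conserved upstream regions already extracted for each gene, keep the genes in which the motif occurs on either strand. The data structure (a dictionary from gene ID to a list of conserved sequences) and the function names are assumptions made for the example, not the pipeline actually used.

```python
def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def genes_with_conserved_motif(conserved_regions, motif):
    """Return the set of genes whose conserved regions contain the motif.

    conserved_regions: dict mapping a gene ID to a list of conserved
    upstream sequences (hypothetical input format). The motif is searched
    as an exact string on both strands.
    """
    motif = motif.upper()
    motif_rc = reverse_complement(motif)
    hits = set()
    for gene, regions in conserved_regions.items():
        if any(motif in r.upper() or motif_rc in r.upper() for r in regions):
            hits.add(gene)
    return hits
```

Restricting the search to conserved regions is what reduces the candidate list; the same exact-match search over all upstream sequence would return nearly every gene for a short motif such as CCACCTG.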
This strategy implies that the motif is known either by previous experimen-
tal evidence or by a careful manual study of a regulatory region of specific
genes, as has been done and described for atonal 5.
In the next chapter, the results obtained here are used to go further and
automatically propose a set of motifs, based on the fact that they are glob-
ally found significantly more often within conserved regions.
Chapter 4
Defining a mammalian dictionary of regulatory motifs
4.1 Introduction
As we have seen in the previous chapter, the success of phylogenetic
footprinting using alignment algorithms depends largely on the species
distance and the gene considered. Transcription factor genes or genes
involved in key processes, especially during embryogenesis, show strong
promoter conservation, but taxon-specific genes show no conservation in
the promoter. In these cases, the simple comparative genomic approach
using alignment algorithms is of no use.
Nevertheless, if the strategy is not gene-centric but rather to construct a
dictionary of regulatory elements used throughout the genome, then the sole
requirement for detection would be to have enough instances of conserved
motifs throughout the genome. In other words, a regulatory motif may be
significantly conserved even though the absolute conservation corresponds
only to a fraction of all possible cases. This is the basic approach of this
chapter.
Once the dictionary of motifs is constructed, the genome-wide distribution of
the motifs is then investigated and, based on these results, functional regions
for transcriptional control can be predicted.
4.2 Finding functional motifs
Figure 4.1 shows the schema of the procedure used to find motifs that
occur more often in conserved regions. The first few steps (labelled red
in the figure) were studied extensively in the previous chapter and
consist of retrieving the conserved regions in the 5 kb upstream of
orthologous genes. In this instance, because the approach is
human-centric, I only considered pair-wise comparisons between human and
other species. Only alignments that satisfy a certain cutoff were kept.
The score cutoff was estimated in the previous chapter to be around 25
bits; this is the cutoff that I used here for intra-mammalian comparisons.
Only around three percent of all pair-wise comparisons gave significant
alignment(s) when comparing mammals and fish. Therefore, the cutoff was
lowered to 10 bits for these comparisons in order to include more
orthologous pairs. The rate of false positives was expected to increase as
a result, which raises the question of whether fish should be included in
this study at all.
4.2.1 Derivation of a reliable motif dictionary
As shown in Figure 4.1, the next step was to generate all possible motifs,
typically all exact boxes of 6, 7 and 8 bases (6-, 7- and 8-mers), and for
each of them to evaluate the total occurrence in the human genome (in
upstream regions of genes) and the occurrence in conserved regions. The
logic behind this is that functional motifs should occur more often in
conserved regions relative to their total occurrence.
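
The counting step can be sketched as follows. This is a simplified illustration under stated assumptions: sequences are given as plain strings, overlapping occurrences are counted, only one strand is scanned, and windows containing masked or ambiguous bases are skipped. The function names are hypothetical and this is not the Motifwise implementation.

```python
from collections import Counter

def count_kmers(sequences, sizes=(6, 7, 8)):
    """Count every exact k-mer (k in sizes) over a list of DNA strings."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for k in sizes:
            for i in range(len(seq) - k + 1):
                kmer = seq[i:i + k]
                # Skip windows with repeat-masked or ambiguous bases (e.g. N).
                if set(kmer) <= set("ACGT"):
                    counts[kmer] += 1
    return counts

def conservation_table(upstream_seqs, conserved_seqs, sizes=(6, 7, 8)):
    """Map each motif to (total occurrence, occurrence in conserved regions)."""
    total = count_kmers(upstream_seqs, sizes)
    conserved = count_kmers(conserved_seqs, sizes)
    return {motif: (total[motif], conserved[motif]) for motif in total}
```

Each motif then becomes one point in a plot of conserved occurrence against total occurrence.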
Motifs composed of two or more boxes separated by a fixed or variable
distance were ignored. Furthermore, I expect the effect of overlapping
structure in motifs (Robin et al., 2002) to be minimal, since what is
measured is not an absolute count but rather a ratio between total and
conserved occurrence.
The best signal-to-noise ratio was obtained when regions in human that have
promoterwise hits above the appropriate cutoff for at least 2 other species out
of the four considered (mouse, rat, fugu, zebrafish) were defined as conserved.
As one can expect, most of the signal came from the human-mouse and
human-rat pairs, and very little from the human-fugu and/or
human-zebrafish pairs. Dropping fugu and zebrafish from the analysis had,
therefore, little or no effect on the result. Nevertheless, these two
species proved to be very useful in identifying candidate genes for
experimental confirmation (see section 4.3).

Figure 4.1: Schema of the procedure used to calculate to what degree a
motif is found in conserved regions. The first part (in red) was discussed
in chapter 3. Steps recovered from the flow chart:
1. ENSMART 17_1: get 5000 bp upstream of genes (human, mouse, rat, fugu,
zebrafish); run RepeatMasker; remove upstream genes.
2. COMPARA 17_1: get homolog relationships (many-to-many relationships).
3. Run Promoterwise for each homologous pair, with human as the query
sequence (human-mouse, human-rat, human-fugu, human-zebrafish); see
Chapter 3.
4. Keep only sequences with score > cut-off (intra-mammal comparisons:
cut-off = 25; mammal-fish comparisons: cut-off = 10). A region is defined
as conserved only if present in the conserved regions of two or more
homologous pairs.
5. Generate all possible 6-, 7- and 8-mer motifs.
6. Run Motifwise on the human genome to obtain, for each motif, the total
occurrence in human and the conserved occurrence.
7. Select motifs with high conservation.

Figure 4.2: Occurrence of all possible exact 6-, 7- and 8-mer motifs in
conserved regions as a function of the total occurrence. This analysis was
done upstream (right graph) and downstream (left graph) of human genes.
Figure 4.2 shows the distribution of all the possible motifs when considering
the upstream (right graph) and downstream region of genes (left graph). In
both cases, the x-axis is the total occurrence of motifs in either the upstream
or downstream sequences of human, while the y-axis represents the number
of times a motif occurs in conserved regions (as defined above) for upstream
or downstream human sequences.
In both cases the distribution of occurrence in conserved regions is a func-
tion of the total occurrences, with most of the motifs falling into a limited
range of possibilities. In the upstream regions, a significant number of motifs
show a different partition in favour of conserved locations. This represents
about 30,000 motifs; 34.8% of all possible motifs considered in that study.
A closer look at the composition of these motifs revealed the presence of at
least one CpG within the sequence for many of these motifs.
As described in the introduction, CpG is a special dinucleotide that is
under-represented in mammalian genomes. To observe this
under-representation, density plots of the occurrence of CpG and non-CpG
motifs were generated for the upstream and downstream regions. In both
cases, CpG motifs are under-represented (see Figure 4.3).
The same analysis as before was repeated, but this time CpG motifs were
distinguished from non-CpG motifs. The results are shown in Figure 4.4.
The distributions across conserved and all regions in human are the same
for CpG and non-CpG motifs when considering the downstream region of
genes. In upstream regions, however, there is a striking difference: CpG
motifs tend to be found more often in conserved regions. These results are
in accordance with the hypothesis that CpG islands are correlated with
functional regions.
To rule out an effect of the CG composition, the same analysis was
repeated, but this time the criterion was the presence of at least one GpC
(instead of CpG) in the motifs. No difference in the distribution in
conserved regions can be seen, suggesting that it is specifically the CpG
dinucleotide that is correlated with the higher occurrence in conserved
regions.
Clearly, it is not possible to ignore the CpG effect. Strategies, then, need to
be found in order to circumvent the CpG evolutionary dynamic. To do so,
these two approaches were developed:
1. First, CpG and non-CpG motif counts can be considered as two distinct
distributions, and outliers in both distributions can be retrieved.
Outliers were defined as motifs lying above a regression line
corresponding to four standard deviations from the mean of the conserved
occurrences for each total occurrence. The two sets are then concatenated
to form the final set of significant motifs.
2. Secondly, a slightly different approach is to look only at conserved
regions. Within conserved regions, a motif occurrence can either be fully
conserved between the two species considered or carry at least one
substitution and/or indel. The number of conserved occurrences can then be
evaluated as a function of the total occurrence in conserved regions for
each motif. The result can be seen in Figure 4.5. Using this metric, no
difference can be observed between the CpG and the non-CpG motifs. The
outliers, defined as above, were retrieved.

Figure 4.3: Density function of the motif occurrence downstream (top) and
upstream (bottom) of human genes for CpG (black) and non-CpG motifs (blue)
(x-axis: total occurrence; y-axis: density).

Figure 4.4: Same analysis as in Figure 4.2 (left graph: downstream region;
right graph: upstream region), but with CpG motifs labelled in green.
The two methodologies answer two slightly different questions. The first
one finds motifs that occur more often in conserved regions (the motif
itself does not need to be conserved between species). The second one
finds motifs that, when found in conserved regions, have a tendency to be
conserved as well. In both approaches, the outliers have a higher
conservation than expected, which is what one would expect for functional
sites such as transcription factor binding sites.
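
The outlier selection of the first approach can be approximated with the sketch below. It is a simplification and not the procedure used in the thesis: instead of fitting a regression line, motifs are grouped into bins of similar total occurrence and a motif is flagged when its conserved count exceeds the bin mean by more than four standard deviations. The bin width, the function names and the input format are all assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev

def find_outliers(counts, n_sd=4.0, bin_width=50):
    """Flag motifs with an unusually high conserved count.

    counts: dict mapping motif -> (total_occurrence, conserved_occurrence).
    Motifs are binned by total occurrence (assumed bin width) and compared
    with the mean and standard deviation of their bin.
    """
    bins = defaultdict(list)
    for motif, (total, conserved) in counts.items():
        bins[total // bin_width].append((motif, conserved))
    outliers = set()
    for members in bins.values():
        values = [conserved for _, conserved in members]
        if len(values) < 2:
            continue  # not enough motifs to estimate a spread
        mu, sd = mean(values), pstdev(values)
        for motif, conserved in members:
            if sd > 0 and conserved > mu + n_sd * sd:
                outliers.add(motif)
    return outliers

def significant_motifs(counts, **kwargs):
    """Treat CpG and non-CpG motifs as two distributions, then merge."""
    cpg = {m: v for m, v in counts.items() if "CG" in m}
    non_cpg = {m: v for m, v in counts.items() if "CG" not in m}
    return find_outliers(cpg, **kwargs) | find_outliers(non_cpg, **kwargs)
```

The second approach can reuse `find_outliers` unchanged by passing, for each motif, the pair (total occurrence in conserved regions, identical occurrence in conserved regions).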
Figure 4.5: Occurrence of conserved motifs in conserved regions as a
function of the total occurrence in conserved regions (x-axis: total
motifs in conserved regions; y-axis: identical motifs in conserved
regions). Globally, CpG motifs (in green) and non-CpG motifs have the same
distribution of conserved occurrences relative to the occurrence in
conserved regions.

Motif        Conserved only   Identical in conserved   Description
SCGGAAGYG    +                +                        Elk1
CCTTTAAG     +                +                        -
AGGAAGT      +                +                        -
GGAAGTGA     +                +                        -
CCACGTGA     +                +                        E-box
AGCCAATSR    +                +                        CAAT box
CTGACGT      +                +                        AP-1 CRE
RCGTCACK     +                +                        AP-1 CRE
YCCCGCCCCC   +                +                        SP1 site (Berg, 1992)
ATGCAAAT     +                +                        -
TAATTA       +                -                        CHX10
TAATGAG      +                -                        -
GCCGGAA      +                +                        Elk1
TAAACA       +                +                        FREAC-2
CCCGGAAG     +                +                        -
GGTGAG       +                +                        -
TCACGTGA     +                +                        E-box
GATTGGT      +                +                        reverse-complement CAAT box
TTCCGCC      +                +                        -
CACGTGGG     +                +                        -
GCAGCTG      +                +                        AP-4
CCCTTTAA     +                +                        -
ATTGGCTG     +                +                        reverse-complement CAAT box
CGCAGGCG     +                +                        -
CGCGCGC      +                +                        -
CTATAAA      +                -                        Consensus sequence for TATAA box
TGACTCAG     +                +                        AP-1 TRE TF (Transfac M00174)
CACGTGAC     +                +                        E-box
CCCTCCC      +                +                        SP1 site (Berg, 1992)
GCGCATGCG    +                +                        α-PAL/NRF1
GCGCGTG      +                +                        -
GTTGCTA      +                +                        -
TGACATCA     -                +                        AP-1
AGGTCAC      -                +                        -
CCACCTGC     -                +                        E12
TGACGTCAC    +                +                        AP-1 CRE
CTCGCGAGA    +                +                        -
CCAATCAG     +                +                        CAAT box
CCATTGG      +                +                        reverse-complement CAAT box
TTCCGGT      +                +                        -
CCACGTGG     +                +                        -
TGCGCA       +                +                        -

Table 4.1: Significant motifs (+) for occurrence in conserved regions
(second column) and/or for identical occurrence in conserved regions
(third column). Most of the motifs are significant in both metrics. The
last column is a manual annotation of the motif based on a literature
search.

Outliers were analysed and are shown in Table 4.1. Some motifs are well
known binding sites, as is the case for:
1. Activator protein 1 (AP-1) is a dimeric complex that can form many
different combinations of heterodimers and homodimers of the JUN, FOS, ATF
and MAF protein families. The main DNA response element is the
TPA-responsive element (TRE, with the consensus binding site TGACTCA), but
different dimers can preferentially bind to the cAMP response element
(CRE, consensus binding site TGACGTCA). These transcription factors are
well known in the field of oncology, as they are considered to be highly
oncogenic. Both the TRE and CRE binding sites were found to be
significantly more conserved in conserved regions (Eferl and Wagner,
2003).
2. GCGCATGCG matches the palindromic consensus binding site
(YGCGCATGCGCR) for α-PAL/NRF1 (also called NRF-1, α-PAL,
α-palindrome-binding protein or nuclear respiratory factor 1). α-PAL was
initially detected as a transcription factor involved in the regulation of
the expression of the eukaryotic initiation factor 2 α (eIF-2α), a
translation initiation factor (Jacob et al., 1989), but the motif was
later found to be functional in other promoters (Drouin et al., 1997).
3. CCCTCCC and CCCGCCC are elements found ubiquitously in eukaryotic
promoters and are the binding sites for the SP1 protein. Known for more
than 20 years, SP1 sites were thought to be involved in basal
transcription mechanisms (Dynan and Tjian, 2000), but more recent studies
are challenging this perception in favour of a more complex mechanism of
regulation in which many of the SP family members (Jackson et al., 1990;
Black et al., 2001) and other transcription factors (BTEB-BTEB2; Nielsen
et al., 1998; Sogawa et al., 1993) interact and compete for these same
sites. The outcome can be either an activation or a repression of the
target gene.
4. The E-box (CACGTG) is found upstream of many genes and is the binding
site for the Max-Max homodimer (Blackwood and Eisenman, 1991) and for Max
heterodimers with Myc (Blackwood and Eisenman, 1991), Mad1 (Ayer et al.,
1993), Mxi1 (Zervos et al., 1993), Mad3 and Mad4. All of these
transcription factors are well known proto-oncogenes (Ryan and Birnie,
1997).
5. The CAAT box is found in many eukaryotic promoters, usually about
75 bp upstream of the start of transcription.
6. The ubiquitous TATAA box was reported to be found about 25-35 bp away
from many transcription start sites. It is the binding site of the TATAA
box binding protein (TBP), which is part of the basal transcription
machinery. The motif found here has an additional cytosine 5’ of the
consensus TATAA box.
Most of these elements bind either proteins involved in the basal transcrip-
tional machinery or transcription factors that have a broad range of activity.
This is due to the nature of the methodology employed, which only selects
motifs based on their overall enrichment in conserved regions or in conserva-
tion. Transcription factors that act upon a few genes would have very few
binding sites and the background of non-functional sites would hide the few
functional motifs.
Interestingly, most of the patterns show significance in both methodologies,
suggesting that these motifs are found more often in conserved regions and
are more conserved as well. A notable exception is the putative TATAA box
that is only more conserved in conserved regions.
4.2.2 Finding regions of clustered motifs on the human genome
The regulation of eukaryotic genes is complex and often involves multiple
regulatory proteins that bind to the regulatory region within a relatively
short distance of each other. The concept of regulatory modules composed
of clusters of regulatory motifs has been highlighted in many published
works that show the presence of such modules upstream of well studied
genes (Arnone and Davidson, 1997; Berman et al., 2002).
The concept of modules implies that the density of cis-regulatory elements
should be higher in these regions than anywhere else in the genome. Based
on this assumption, a search was made for transcription control regions in
the human genome using Motifwise, an algorithm developed by Ewan Birney,
to predict regions with a higher density of motifs from the dictionary of
section 4.2.1.
Using the whole human genome sequence, a total of 190,593 hits were found.
As a measure of how well Motifwise locates cis-regulatory regions, the
distribution of hits was plotted relative to the closest annotated
transcription start or end sites, as shown in Figure 4.6. Most of the
regions that control transcription are expected to lie close to the
transcription start site. Indeed, while no significant fraction of hits
occurs around the end of the transcript, most of the Motifwise hits occur
within 4 kb of the transcript starts. This result is clear evidence of a
biological association between the clustering of cis-regulatory sites
(given by Motifwise hits) and the start of transcription.
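
A distribution of this kind can be computed along the following lines. The sketch assumes a single chromosome, ignores strand, and takes hit positions and transcription start coordinates as plain integers; the function names, the window and the bin width are illustrative choices rather than the settings used for Figure 4.6.

```python
from bisect import bisect_left
from collections import Counter

def nearest_offset(position, sorted_starts):
    """Signed distance from a hit to the closest start site.

    sorted_starts must be a non-empty, sorted list of coordinates.
    """
    i = bisect_left(sorted_starts, position)
    candidates = sorted_starts[max(i - 1, 0):i + 1]
    nearest = min(candidates, key=lambda s: abs(position - s))
    return position - nearest

def offset_histogram(hit_positions, starts, window=4000, bin_width=500):
    """Histogram of hit offsets within +/- window of the nearest start."""
    sorted_starts = sorted(starts)
    histogram = Counter()
    for pos in hit_positions:
        offset = nearest_offset(pos, sorted_starts)
        if -window <= offset <= window:
            histogram[(offset // bin_width) * bin_width] += 1
    return histogram
```

Running the same function with transcript end coordinates instead of start coordinates gives the control curve relative to gene ends.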
To rule out the possibility of overtraining, the same analysis was
repeated on the human genome without chromosomes 6, 20 and 22, and
Motifwise was then run on these held-out chromosomes. The result was
identical to the one obtained using the whole genome. Another possible
artifact is the CpG islands that are generally located upstream of genes.
A significant number of motifs in the set contain CpG, and Motifwise may,
therefore, simply detect the higher density of CpG that is common in CpG
islands. To rule out this hypothesis, a Motifwise run was done using only
non-CpG motifs. The result, shown in Figure 4.6, still shows an enrichment
of Motifwise hits around the transcription start sites, though weaker than
when using the whole motif set.

Figure 4.6: Density of predictions by Motifwise relative to the annotated
gene starts or ends (x-axis: distance in bp, from -4000 to 4000; y-axis:
percentage of all hits). Curves: relative to gene start; relative to gene
start (only non-CpG motifs); relative to gene end.

Figure 4.7: Fraction of positive regions from Motifwise using either
Transfac motifs or our positive set of motifs as a function of the score
of the region (x-axis: score cutoff; y-axis: fraction). Positive regions
are regions found by Motifwise that are within 4 kb of gene starts or gene
ends. Curves: positive motifs upstream; Transfac motifs upstream; positive
motifs downstream; Transfac motifs downstream.

Figure 4.8: Motifwise result: Ensembl detailed view showing one example of
a Motifwise hit (in green) on the human genome. The gene (in red) Q96MV9
does not have any description. The hit is a few bp away from the
transcription start site.
Another way of visualising the data is to evaluate the fraction of
positive hits (those around the transcription start) as a function of the
Motifwise score cut-off.
As a control, Transfac motifs were used in Motifwise to scan the human
genome. A total of 190,593 hits were found and located relative to the
gene start site. This number corresponds roughly to the total number of
hits found using the conserved motif dictionary (190,425 hits).
The distribution, however, is very different. Indeed, at a cut-off of 10
bits, only 2.15 percent of the regions fall into the 4 kb window around
the start/end of genes. Figure 4.7 shows the ratio of positive regions
(within 4 kb of the start of a gene) for the conserved motif dictionary
and for the Transfac motifs as a function of the score of the region in
Motifwise. The percentage of positive regions remains very low for the
Transfac motif set (less than 5%).
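
The quantity plotted in Figure 4.7 can be sketched as below. The input format (one `(score, distance)` pair per predicted region, where the distance is measured to the nearest gene start or end) and the function name are assumptions made for the example.

```python
def positive_fraction_by_cutoff(hits, cutoffs, window=4000):
    """Fraction of regions within `window` bp of a gene for each score cutoff.

    hits: list of (score, distance_to_nearest_gene_start_or_end) pairs.
    Cutoffs for which no region remains are omitted from the result.
    """
    result = {}
    for cutoff in cutoffs:
        kept = [distance for score, distance in hits if score >= cutoff]
        if kept:
            positives = sum(abs(distance) <= window for distance in kept)
            result[cutoff] = positives / len(kept)
    return result
```

Applying the function to the hits obtained with the conserved motif dictionary and to those obtained with the control motif set gives two curves that can be compared directly.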
Gene name   Motif        H. sapiens        M. musculus          R. norvegicus        F. rubripes          D. rerio
FOXM1       TCACGTGA     ENSG00000111206   ENSMUSG00000001517   ENSRNOG00000005936   SINFRUG00000122591   ENSDARG00000003200
ARF3        TCTCGCGAGA   ENSG00000134287   ENSMUSG00000022995   ENSRNOG00000013924   SINFRUG00000152889   not available
Q99JW1      CACTTCCGG    ENSG00000129968   ENSMUSG00000003346   ENSRNOG00000018212   SINFRUG00000143059   not available
Q9BU67      GTCACGTG     ENSG00000165782   ENSMUSG00000035953   ENSRNOG00000009948   SINFRUG00000151440   not available
SM31        CACGTGAC     ENSG00000184900   ENSMUSG00000020265   ENSRNOG00000001222   SINFRUG00000140807   ENSDARG00000014254
ZIC1        CACGTGAC     ENSG00000152977   ENSMUSG00000032368   ENSRNOG00000014644   SINFRUG00000141943   ENSDARG00000015567

Table 4.2: Candidate genes and the corresponding conserved motifs, with
Ensembl IDs for all the species considered (Ensembl release 18).
4.3 Experimental evaluation of the methodology
In order to evaluate the methodology, candidate genes that satisfy a
number of criteria were selected and analysed in detail to locate the
region of conservation and possible flanking conserved regions. These
candidates should have orthologues in both mammals and fish and possess in
the upstream region a significant motif derived from the motif dictionary
(from section 4.2.1). These motifs have to be conserved in human, mouse,
rat and at least Fugu. Table 4.2 summarises the orthologue information for
each candidate.
Ideally, these candidates should have evidence of expression in the early
embryonic stage. Experimental analysis was done for all candidate genes by
Marcel Souren in the group of Jochen Wittbrodt (EMBL-Heidelberg). The
respective promoter regions were cloned from the Fugu rubripes genome and
inserted into a reporter vector. Deletions around the identified motifs in
the promoter were made for 3 constructs. The specific deletion constructs
showed lower ubiquitous expression in all three cases. For details see
(Ettwiller et al., 2005).
4.3.1 The FOXM1 gene
FOXM1 is part of the Forkhead box (FOX) transcription factor family and
has been implicated in both embryonic development and adult tissue
homeostasis; it has known orthologues in rodents and fish. The motif
common to all known orthologues, TCACGTGA, is located about 1 kb away from
the coding start in fugu. The conservation of the entire region around the
motif is shown in Figure 4.9 and consists essentially of three blocks of
conservation. The first block is a putative reverse-complemented CAAT box,
the second corresponds to the motif TCACGTGA, and the third consists of an
unknown motif.
Alignment columns 1-67:
H_sapiens     CCGGAATGCCGAGACAAGGCCGGCGCCGATTGGCGACGTTCC..GTCACGTGACCTTAACGCTCCGC
M_musculus    CCGGAACCCGGAGACAAG-CCGGTGCCGATTGGCGACGCTCC..GTCACGTGACCGCAACGCTCCGC
R_norvegicus  CCGGAACCCGGAGACAAG-CCGGTGCCGATTGGCGACGCTCC..GTCACGTGACCGCAACGCTCCGC
F_rubripes    CAACGTGACCAGTTCGTCCTGCCGCC..ATTGGCCAATCCGCGGGTCACGTGACGCGGCGAGGACGG
D_rerio       GTGGTGGCTCCGCCCATATGTGTTCGCCATTGGCCAC.CACGTGATCACGTGA.........GAGCG

Alignment columns 68-110:
H_sapiens     CGGCGCC...........AATTTCAAACAGCGGAACAAACTGA
M_musculus    CGGCGCC...........AATTTCAAACAGCGGAACAA-CTGA
R_norvegicus  CGGCGCC...........AATTTCAAACAGCGGAACAA-CTGA
F_rubripes    CAGCACGCGCAGCGCC.AAAATTCAAAAACGCCTCTGAATCGC
D_rerio       CAGCACCGCTCTGCGCCAAAATTCAAAATCTCACCAAATGCTC

Figure 4.9: Foxm1 regulatory region.
4.3.2 The ARF3 gene
The ARF3 gene is a member of the ADP-ribosylation factor family. ARF3 is
predominantly expressed in neuronal tissues during brain development but has
also been found to be expressed in all tissues (Moss et al., 1990), making
this gene a good test candidate. The promoter has been reported to lack a
TATA and a CAAT box (Haun et al., 1993), and the region between -58 and -17
bp upstream of the transcription start site in human has been shown to be
essential for full expression of the gene. The common motif for this gene is
the palindromic sequence TCTCGCGAGA, as shown in Figure 4.10; in human it is
located between -58 and -17 bp of the transcription start site, consistent
with the experimental results above. The whole region around the motif is
well conserved across mammals but not in fugu and zebrafish.
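What "palindromic" means for a DNA motif can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical and it assumes sequences made of the four unambiguous bases.

```python
# Complement table for the four unambiguous DNA bases (assumption: no IUPAC codes).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def is_palindromic(seq: str) -> bool:
    """A motif is palindromic when it equals its own reverse complement."""
    return seq.upper() == reverse_complement(seq)
```

A palindromic motif reads the same on both strands, so a search for it does not need to distinguish orientation.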
[Multiple sequence alignment, about 50 columns. Rows: H_sapiens, R_norvegicus, M_musculus, F_rubripes, D_rerio.]
Figure 4.10: ARF3 regulatory region.
4.3.3 The Q99JW1 gene
This gene is similar to the CGI-67 protein, which has been annotated as a
serine protease. The common motif for this gene in all species studied, apart
from zebrafish, is CACTTCCGG. Little is known about the gene.
[Multiple sequence alignment, about 25 columns. Rows: H_sapiens, M_musculus, R_norvegicus, F_rubripes.]
Figure 4.11: Q99JW1 regulatory region.
4.3.4 The Q9BU67 gene
The region of conservation between mammals and fish extends about a
hundred base pairs around the motif GTCACGTG, with a putative conserved
CAAT motif 25 bp upstream of the motif.
[Multiple sequence alignment, about 100 columns. Rows: H_sapiens, M_musculus, R_norvegicus, F_rubripes.]
Figure 4.12: Q9BU67 regulatory region.
4.3.5 The SM31 gene
Little is known about the function of this gene. The conserved motif is
CACGTGAC, located 200 bp away from the transcription start site in fugu (see
Figure 4.13). The upstream sequence also contains two other weakly conserved
regions flanking the motif. Fugu seems to have the reverse complement of
the motif, but not the rest of the region.
[Multiple sequence alignment, about 50 columns. Rows: H_sapiens, M_musculus, R_norvegicus, F_rubripes, D_rerio.]
Figure 4.13: SM31 regulatory region.
Marcel Souren from the vertebrate developmental group of Jochen Wittbrodt
made a deletion construct of the medaka SM31 promoter, as shown in Figure
4.14. The 41 bp region that contains the CACGTGAC motif was removed and the
resulting promoter (SM31del) was placed upstream of the reporter gene GFP.
The whole SM31 promoter (SM31) was also constructed as a control. The
constructs were transfected into medaka embryos and the transient expression
of GFP was monitored at 24 and 48 hours after transfection. The SM31
construct shows a strong and uniform GFP expression across the fish embryo
at both time points, whereas the SM31del construct does not show any
detectable GFP expression. This result suggests that the region containing
the motif CACGTGAC is required for a functional SM31 promoter.
4.3.6 The ZIC1 gene
Zic1 encodes a zinc-finger protein that is required for the development of
dorsal neural tissue. It is expressed at a high level in the cerebellum
and the developing cerebellum in human. The conserved motif CACGTGAC
(reverse complemented in D. rerio) is located about 300 bp away from the
annotated transcription start site and about 18 nt downstream of a conserved
putative CAAT box, as shown in Figure 4.15. Downstream of the motif is a
CCCTCCC region that seems to be conserved as well (putative SP1 site).
4.4 Conclusion
This chapter shows that a reliable set of cis-regulatory motifs can be
retrieved by using unique statistical properties of regulatory sites in the
context of comparative genomics. Indeed, functional motifs (a) are found more
often in conserved regions and (b) tend to be conserved themselves. So far,
these two properties have been used independently, but a combination of both
may be more powerful for predicting functionality.
In any case, these results confirm on a genomic scale the popular assumption
that regulatory sites are more conserved across species.
Another interesting finding is that CpG motifs are globally found more often
in conserved regions, and that this is characteristic of the upstream regions
of genes. However, when only conserved regions are considered, CpG motifs are
globally no more conserved than other motifs. This is probably due to the
conservation of CpG islands across mammals.
Because of the global nature of this approach, ubiquitous cis-regulatory
elements like the CAAT box show strong significance. This result suggests
that the methodology may be better suited to locating basic promoters than
specific regulatory sites that are functional only in a limited number of
genes. Nevertheless, variants of this method could be considered; for
example, using only a subset of genes that are known to be co-expressed in
order to predict more specific transcription factor binding sites.
Figure 4.14: Expression of the reporter gene GFP under the control of the SM31 promoter in medaka embryo. Only the construct containing the whole promoter shows a constant and uniform GFP expression, whereas the SM31del construct does not show any detectable GFP expression (from the vertebrate developmental group of Jochen Wittbrodt).
[Multiple sequence alignment, about 50 columns. Rows: H_sapiens, R_norvegicus, M_musculus, F_rubripes, D_rerio.]
Figure 4.15: ZIC1 regulatory region.
Chapter 5
Effect of the ATG triplet on gene expression in yeast
5.1 Introduction
As seen in the introduction, gene regulation also occurs at the
post-transcriptional level. In collaboration with Thomas Schlitt, I analysed
the effect of an additional ATG triplet upstream of gene starts in the yeast
S. cerevisiae. Additional ATG triplets in the 5’ UTR can be used as the
initiation codon by the scanning ribosome. Because an upstream ATG is often
quickly followed by an in-frame stop codon, the mRNA can potentially be left
without ribosomes along most of its length, possibly resulting in the
activation of the nonsense-mediated decay (NMD) mechanism. This study was
done both at the genomic level and, whenever UTR information was available,
at the transcript level.
5.2 ATG codon at the genomic level
In intergenic regions, ATG triplets should occur randomly across the yeast
genome, whereas their distribution is expected to change within genes. As
translation start sites are well annotated in yeast, I studied the
distribution of ATG upstream of the translation start site of all the genes
in the genome.
Figure 5.1 shows the distribution of ATG around the coding ATG. The distance
on the x axis is relative to the translation start site of the genes
(origin), and all ATGs were counted in a window from -200 to +200 bp. The
ATG distribution tends to be fairly constant beyond 100 bp upstream of the
coding start site. Closer to the start site, ATG tends to be
under-represented, and this tendency increases as the distance from the
coding start diminishes. To rule out low-complexity effects (CG/AT content),
the distribution of the reverse complement triplet (CAT) was also retrieved
and plotted in Figure 5.1. No such negative selection can be seen for the
triplet CAT. Other triplets were tested (AGT, TGA) and, again, no such effect
could be seen (data not shown). This observation was already made a few years
ago in many eukaryotic and prokaryotic genomes, including the yeast
S. cerevisiae (Saito and Tomita, 1999).
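The counting behind such a positional distribution can be sketched as follows. This is a minimal illustration, not the code used for Figure 5.1. The function name, the input layout (one sequence per gene with the index of the coding ATG) and the choice to skip the coding ATG itself at offset 0 are all assumptions.

```python
from collections import Counter


def triplet_distribution(regions, triplet="ATG", window=200):
    """Count occurrences of `triplet` at each offset relative to the coding start.

    regions: iterable of (sequence, start) pairs, where `start` is the 0-based
    index of the A of the annotated coding ATG (hypothetical input layout).
    Offsets are negative upstream and positive downstream of the start codon.
    """
    counts = Counter()
    for seq, start in regions:
        seq = seq.upper()
        first = max(0, start - window)
        last = min(len(seq) - len(triplet), start + window)
        for i in range(first, last + 1):
            # Assumption: the annotated start codon itself (offset 0) is skipped.
            if i != start and seq.startswith(triplet, i):
                counts[i - start] += 1
    return counts
```

Calling the function once with `triplet="ATG"` and once with `triplet="CAT"` gives the two curves that are compared in the text.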
The average 5’ UTR in yeast has a predicted length of about 130 bp or less
(Rogozin et al., 2001). This matches the region in which ATG is depleted and
indicates that the ATG codon is negatively selected in 5’ UTRs. Figure 5.1 [A]
also shows the distribution of ATG in the coding sequence (0 to +200 bp). The
three distributions of ATG correspond to the three frames, with the lowest
counts being the ATG in frame with the ORF (coding for methionine), the
medium counts frame 1, and the highest counts frame 2. Methionine (codon ATG)
is rarely used in proteins, and an ATG in frame 1 implies an in-frame codon
TGX (X being A, T, C or G), which corresponds to tryptophan, cysteine or a
stop codon, all of which are rare as well.
The next step was to include expression information in order to analyse the
effect on gene expression of the genomic distance between the first upstream
ATG and the translation start site. Expression data were derived from a
previous microarray analysis, of which only the absolute expression levels
for wild type yeast were used (Causton et al., 2001). The genes were split
into two groups: genes that have an ATG less than 50 bp upstream of the start
codon, and the rest. The absolute expression value of each gene was retrieved
and the density distribution of these expression values was plotted for both
groups. The result is summarised in Figure 5.2.
The two distributions are different, with many more genes with low expression
values in the close ATG set.
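The grouping step can be sketched in a few lines. This is an illustration under stated assumptions: "first upstream ATG" is taken to mean the ATG closest to the start codon, the distance is measured between the A of the upstream ATG and the A of the coding ATG, and the function names and input layout are hypothetical.

```python
def first_upstream_atg_distance(upstream):
    """Distance (bp) from the coding start to the closest upstream ATG.

    upstream: sequence immediately 5' of the start codon, ending at the base
    just before the A of the coding ATG. Returns None if no ATG is present.
    """
    idx = upstream.upper().rfind("ATG")
    if idx == -1:
        return None
    return len(upstream) - idx


def split_by_atg_distance(genes, cutoff=50):
    """Split expression values into 'near' (< cutoff bp) and 'far' groups.

    genes: iterable of (name, upstream_sequence, expression) tuples.
    Genes without any upstream ATG go to the 'far' group (assumption).
    """
    near, far = [], []
    for _name, upstream, expression in genes:
        d = first_upstream_atg_distance(upstream)
        (near if d is not None and d < cutoff else far).append(expression)
    return near, far
```

The two returned lists are what would be passed to a density estimate to compare the groups.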
Another way of looking at the data is to retrieve and average all the
expression values of genes that have an ATG between 0 and 40 bp upstream
of the translation start site, and to repeat the operation while moving the
window up to 200 bp. To measure the significance, 100 random datasets were
generated by shuffling the expression values and were compared with the real
data.
Figure 5.1: Distribution of ATG and the reverse complement CAT triplets upstream of the putative coding start. The number of ATG and CAT triplets is counted for each relative distance 5’ upstream of the start codon of all the annotated genes in yeast (x axis: distance in bp; y axis: count). [A] Distribution using a window of 200 bp upstream and downstream of the putative coding starts (0). [B] Close-up window between -200 and 0 bp away from the putative coding start.
Figure 5.2: Density distribution of expression values for genes that have a close first upstream ATG (distance < 50 bp, in black) and genes that have a distant first upstream ATG (distance > 50 bp, in blue). x axis: expression value; y axis: density.
Figure 5.3: Effect of the first upstream ATG triplet distance on expression in yeast. Moving average of expression as a function of the position of the first 5’ ATG relative to the start site for all yeast genes, for real and random data (x axis: ATG location in bp; y axis: absolute expression).
The result is plotted in Figure 5.3. The random data show an average
expression that is constant with respect to the ATG location, whereas the
real data show a good correlation between the average expression value and
the upstream ATG distance. This correlation holds up to 120 bp, which is
consistent with the result of Figure 5.1 and the literature.
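The moving-window average and its shuffled controls can be sketched as below. This is not the code used for Figure 5.3. The 40 bp window, the 200 bp limit and the 100 shuffles come from the text, but the step size, the function names and the fixed random seed are assumptions.

```python
import random
from statistics import mean


def windowed_means(distances, expression, width=40, step=10, max_dist=200):
    """Mean expression of genes whose ATG distance falls in [lo, lo + width).

    Returns a list of (lo, mean_or_None) pairs for windows sliding by `step`.
    """
    result = []
    lo = 0
    while lo + width <= max_dist:
        values = [e for d, e in zip(distances, expression) if lo <= d < lo + width]
        result.append((lo, mean(values) if values else None))
        lo += step
    return result


def shuffled_controls(distances, expression, n=100, seed=0, **window_args):
    """Repeat the windowed average on n datasets with shuffled expression values."""
    rng = random.Random(seed)
    controls = []
    for _ in range(n):
        permuted = list(expression)
        rng.shuffle(permuted)  # breaks any link between distance and expression
        controls.append(windowed_means(distances, permuted, **window_args))
    return controls
```

Comparing the real curve with the spread of the shuffled curves at each window position gives a simple visual measure of significance.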
Clearly, the location of the first ATG upstream of the translation start site
has an effect on global gene expression. Nevertheless, using genomic data
restricts the interpretation of the result.
The main issue with considering genomic sequences instead of transcript
information is the inability to distinguish UTRs from upstream regions;
consequently, it is not possible to distinguish an ATG in the UTR from
a random ATG occurring in the intergenic DNA.
5.3 ATG codon at the transcript level
UTR sequences are, therefore, valuable information for the correct
interpretation of the result. However, full length cDNA sequences in yeast
are sparse, and ESTs (Expressed Sequence Tags) are not guaranteed to be full
length and are biased towards 3’ ends. Nevertheless, considerably more ESTs
are available, and the most 5’ EST of a given gene can still be informative
if it is located in the 5’ UTR.
I mapped all available EST sequences to the yeast genome using BLAST
(Altschul et al., 1990) and located the start of the most upstream EST for
each gene. If this start is located at least 10 bp upstream of the coding
start, the resulting 5’ UTR (that is, the sequence from the start of the EST
to the start of the coding sequence) is analysed for possible ATGs. Using
this approach, a total of 515 yeast genes have 5’ UTR sequences. This figure
is a large underestimate, as most of the ESTs do not provide full length
cDNA information.
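Once the ESTs have been placed on the genome, the UTR extraction reduces to simple coordinate arithmetic. The sketch below assumes plus-strand genes, 0-based coordinates and EST start positions already obtained from the alignment step. The function names are hypothetical and the BLAST mapping itself is not shown.

```python
def utr_from_ests(genome, est_starts, cds_start, min_length=10):
    """Return the 5' UTR implied by the most upstream EST, or None.

    genome: chromosome sequence (plus strand, 0-based coordinates).
    est_starts: genomic start positions of the ESTs mapped to one gene.
    cds_start: position of the A of the annotated start codon.
    The UTR is kept only if the EST starts at least min_length bp upstream.
    """
    if not est_starts:
        return None
    most_upstream = min(est_starts)
    if cds_start - most_upstream < min_length:
        return None
    return genome[most_upstream:cds_start]


def utr_has_atg(utr):
    """True if the 5' UTR contains at least one ATG triplet."""
    return "ATG" in utr.upper()
```

A minus-strand gene would need the same logic applied to the reverse complement, which is left out here for brevity.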
To study the effect of ATG on expression, exactly the same approach as in
section 5.2 can be applied here. The distribution of expression values was
analysed for UTRs with and without ATG, and the result is summarised in
Figure 5.4. Here, as well, the two distributions are different, with many
more genes with low expression values in the UTR set with ATG.
Figure 5.4: Effect of the presence of an ATG in the 5’ UTR on expression in yeast. Density function of the distribution of expression values for UTRs with (black) and without (blue) ATG (x axis: expression value; y axis: density).
This result suggests that an upstream ATG in the 5’ UTR of a transcript
has a global negative effect on gene expression in the yeast S. cerevisiae.
One possible mechanism, reported in the literature as a surveillance
mechanism, is nonsense-mediated mRNA decay (NMD). As seen in the
introduction, this surveillance mechanism promptly removes mRNAs carrying a
frameshift or nonsense mutation. An additional ATG upstream of the coding
region can be used by the scanning ribosome, and a stop codon will then be
promptly reached.
5.4 The upf genes
In S. cerevisiae, three genes are required for NMD (see chapter
introduction). The effect of mutating each of these genes on the
transcriptome has been monitored using microarray analysis by Lelivelt and
Culbertson (1999). The main result of their analysis is that mutation of
the UPF genes causes accumulation of the transcripts of hundreds of genes.
This result suggests that NMD, in addition to being a surveillance
mechanism, is also involved in the regulation of numerous mRNAs. One
hypothesis that the authors mentioned in the discussion is the following:
“Although naturally occurring mRNA do not typically contain a prema-
ture stop codon, they could be targeted for rapid decay by an alternate
mechanism. For example, they might contain a stop codon at the end of
a translatable upstream ORF or some other sequence element that serves a
targeting function, or the normal stop codon the end of the ORF might have
the atypical property of triggering rapid decay. In any cases, it seems likely
that the Upf proteins cause changes in the abundance of naturally occurring
mRNAs through a mechanism involving mRNA decay”.
An opportunity is given here to strengthen their hypothesis by comparing
the data they obtained with our ATG location information.
For all the genes, the absolute expression values in wild type yeast and in
the upf123 mutant were retrieved, and the ratio of mutant to wild type
expression was calculated. The same analysis as in section 5.2 was done (see
Figure 5.3), this time replacing the expression value by the ratio on the
y axis. If the ATG location had no effect on the NMD degradation pathway, one
would expect a constant average ratio independent of the ATG location.
Figure 5.5: Effect of the first upstream ATG on the ratio of upf123 mutant over UPF123 wild type expression. Two series are plotted: upf123 mutant over wild type, and upf123 mutant over upf2 mutant (x axis: distance in bp; y axis: ratio).
As shown in Figure 5.5, this is clearly not the case. The upf123 mutant
globally shows an increased amount of transcript for genes that have an
upstream ATG less than 60 bp away from the coding start at the genomic level.
In their paper (Lelivelt and Culbertson, 1999), the authors also mentioned
that ’the same mRNAs respond to loss of UPF function regardless of which
of the UPF genes is disrupted’. In order to test the significance of the
result shown in Figure 5.5, the ratio of expression values for upf123/upf2
was calculated and plotted in the same figure. No such increase in transcript
level can be noticed when comparing the upf123 and upf2 mutants.
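The ratio comparison can be sketched as follows. This is a simplified two-group version rather than the moving average plotted in Figure 5.5. The dictionary-based input, the function name and the handling of genes without an upstream ATG are assumptions, and the 60 bp default simply mirrors the distance quoted above.

```python
from statistics import mean


def mean_ratio_by_group(numerator, denominator, distances, cutoff=60):
    """Mean expression ratio for genes with a near or far first upstream ATG.

    numerator, denominator: dicts mapping gene name to absolute expression
    (for example upf123 mutant and wild type, or upf123 and upf2 mutants).
    distances: dict mapping gene name to first upstream ATG distance (or None).
    Returns (mean ratio for distance < cutoff, mean ratio for the rest).
    """
    near, far = [], []
    for gene, reference in denominator.items():
        if gene not in numerator or reference <= 0:
            continue  # skip genes without a usable pair of measurements
        ratio = numerator[gene] / reference
        d = distances.get(gene)
        (near if d is not None and d < cutoff else far).append(ratio)
    return (mean(near) if near else None, mean(far) if far else None)
```

Running the same function with the upf2 mutant as the denominator gives the control comparison described in the text.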
This result suggests that the ratio difference between the mutant and the
wild type for genes having an ATG within 60 bp upstream of the coding
start is significant. This result also confirms the above statement by
Lelivelt and Culbertson (1999). The same result was obtained when replacing
the upf2 mutant with the upf1 or upf3 mutant (data not shown).
At the transcript level, the result was not as clear as with the genomic
data. Perhaps this is due to the limited number of 5’ UTRs available in
yeast and a strong noise level in the microarray data.
5.5 Conclusion
This chapter focuses on the effect of an ATG (and therefore a potential
additional ORF) upstream of the main coding start in yeast genes. At the
genomic level, a strong correlation is found between the ATG distance and
expression. When only the transcripts with 5’ UTR information are studied,
the same correlation is found between expression and the presence or absence
of an ATG. These results suggest that a potential uORF induces a
downregulation of the transcript, and the UPF data suggest that NMD could be
the mechanism for such downregulation.
These predictions now need to be confirmed using experimental procedures.
I suggest site-directed mutagenesis in order to remove the upstream ATG and
analyse the effect on expression in wild type and upf123 mutant yeast.
As NMD is also found in human, a similar study needs to be done in higher
eukaryotes.
Chapter 6
Conclusion
This thesis presents different computational methods developed to locate
cis-regulatory motifs in eukaryotes. Two types of biological information
have been used successfully: the first is the possible co-regulation of
genes, used to derive a dictionary of interesting motifs, whereas the second
is comparative genomics, based on the fact that functional regions are under
negative selection. Both of these approaches have been widely used in the
literature to derive functional motifs. Nevertheless, the methods presented
here have taken novel approaches that have highlighted new aspects of the
problem.
1. co-regulation : As we have seen in the introduction, the conventional
approach is to group genes on the basis of similar expression profiles
and then use the group of genes to derive over-represented motifs in
that cluster. These methods, however, are limited since they employ
a partitioning to identify co-regulation under particular experimental
conditions. The computational method that I developed first identifies
genes likely to be co-expressed solely because their gene products have
been shown experimentally to interact or are involved in the same
metabolic pathway. The second step of the method identifies all the genes
that have a particular motif in the upstream region. These two sets of
genes are then compared using a graph overlap approach. The motif is
selected only if it has a significant non-random concordance with the
functional network. This approach is novel in two ways: first, it uses
pathway information to deduce co-regulation; second, it uses a graph
overlap, rather than over-representation, to assess the motif.
2. comparative genomics : The approach used here is a hybrid between
alignment algorithms (that can only be applied to relatively close
species) and motif-based methods (that work only if enough remote
species are used). Applications are numerous, from the identification
of potential targets of a given transcription factor to the derivation of
a motif dictionary.
6.1 Perspective and further work
Clearly, cis-regulatory elements are biologically very important, and
perturbation of these regions leads in human to cancer and numerous other
diseases (Cooper, 1992). Despite their major role in the control of gene
expression, very few such elements have been well characterized and mapped
on the genome, mainly because of their apparently low information content.
Nevertheless, the trans-regulatory elements efficiently locate these regions
and recruit the transcription machinery. So why can we not accurately pre-
dict the location of such elements? Are we missing key information or do
we need to decipher a complicated code? I believe there are at least two
additional aspects to consider :
1. A “regulatory” code : It is clear that we have not fully understood
the dynamics of how a trans-acting factor finds the proper site and binds
the DNA. Conversely, looking at the protein sequence, or at the structure
when available, it is not possible to deduce the binding site. This is an
area of active research with some preliminary success (Benos et al., 2002).
Looking at the DNA sequence, the very low information content of
a typical binding site prevents any accurate prediction, but the few
cases that have been well studied suggest that the coordinated binding
of many transcription factors triggers the activation of the gene. Better
prediction of such sites should therefore take into account the context;
that is, the presence of other cis-regulatory elements relative to each
other. This is not a trivial task, as the relative distances are highly
variable and can be important, but preliminary work is also encouraging
(Manke et al., 2003).
2. Missing information : As described in the introduction, the
epigenetic state of the DNA determines the accessibility of a particular
region to biological molecules, and the knowledge of this state does
not seem to be clearly encoded in the primary sequence. The extent of the
role of epigenetic factors in the binding of transcription factors is not
clear, but many recent publications tend to show a significant role (Cremer
and Cremer, 2001). Accurate knowledge of the chromatin state and of the
location of DNA regions relative to other nuclear components could,
therefore, be the missing elements for a better comprehension of gene
expression regulation.
Both points are the subject of many studies and I expect good progress in
that field in the next few years.
Another area of interest is the difference in gene expression between species
and the phenotypic evolution that results from it. As we have seen in the
literature, much work has been done on the conservation of cis-regulatory
regions across species, with this Ph.D being one of many examples.
Nevertheless, considerably less effort has been devoted to the differences,
which are likely to constitute an important component of phenotypic
evolution. This topic is under-represented in present genomic studies, yet
is very important in many respects. Indeed, King and Wilson suggested that
most of the genetic causes of phenotypic differences between humans and the
great apes lie in the regulatory sequences that control the timing and
pattern of gene activity (King and Wilson, 1975).
This suggestion, made almost 30 years ago, is now supported by a couple
of studies that clearly show the extent of transcription factor binding site
divergence between even very close species. For example, a study by
Dermitzakis and Clark (2002) suggested that 32 to 40 % of known
cis-regulatory regions in human are not functional in rodents.
Even in distinct populations of the same species, alteration of
cis-regulatory regions seems to be widespread and results in allelic
divergence in the expression level of genes. This polymorphism in the
population is believed to have a profound influence on disease and drug
susceptibility between individuals, as well as to be the primary substrate
for the evolution of species. A study by Rockman et al. (2002) estimates
that humans have more than 16,000 functional cis-regulatory variants, a much
higher figure than for amino-acid variations. With the completion and
release of the chimp genome and the systematic detection of SNPs within the
human population, much more interesting work can be done at that level.
Evolution of cis-regulatory sites can be caused either by gradual
semi-neutral mutations that involve a change of a single or a few
nucleotides, or by a drastic change due to a whole functional rearrangement.
In the first case, because of the small size of transcription factor binding
sites and the degeneracy of the protein-DNA recognition code, sites can
easily be modified or spontaneously appear somewhere else without major
consequences on the phenotype for the next generation. In contrast, a
deletion, insertion or inversion of a whole region that contains regulatory
sites is most probably going to have profound effects on gene expression
and, if positively selected, will also have a profound effect on the
population phenotype. The point to stress here is the time scale: while the
first scenario may result in gradual, subtle variations, the second may lead
to immediate phenotypic effects with strong selection pressure.
The first obvious question that comes to mind concerns the expression
pattern and timing of the particular genes in species where a large
rearrangement in the regulatory region has occurred. These genes can be
compared to the types of genes that have been shown to be very well
conserved. Are these genes encoding key proteins in the network of protein
interactions or, on the contrary, are they encoding peripheral proteins that
are not essential for the species’ survival? One can imagine ’universal’
genes that allow for drastic changes in their promoter, suggesting hot spot
elements for phenotypic variation and speciation.
Appendix A
Publications during the PhD work
1. Ettwiller L and Paten B. Guilt by Multiple Association. Heredity, 2004 Apr 7.
2. Ettwiller L, Down T, Andrews D, Paten B, Wittbrodt J, Birney E. Derivation of a reliable cis-regulatory motif dictionary from genome sequence information. Manuscript in preparation.
3. Ettwiller L, Rung J, Birney E. (2003). Discovering Novel cis-Regulatory Motifs Using Functional Networks. Genome Research, 13:883-895.
4. Ureta-Vidal A., Ettwiller L., Birney E. (2003). Comparative genomics: genome-wide analysis in metazoan eukaryotes. Nat Rev Genet., 4:251-62.
5. Bateman A., Birney E., Cerruti L., Durbin R., Ettwiller L., Eddy SR., Griffiths-Jones S., Howe KL., Marshall M., Sonnhammer EL. (2002). The Pfam protein families database. Nucleic Acids Res., 30:276-80.
Appendix B
Finding regulatory motifs using functional network in yeast : material and method
B.1 Networks generation
B.1.1 Metabolic network
The KEGG database (Kanehisa, 1997) was used for this study. Only reactions linked to enzymes of the yeast S. cerevisiae were used. All reactions were considered as reversible, resulting in an undirected graph. Interactions are represented only once to avoid signal amplification. An all versus all BLAST (ftp://ftp.mcbi.mih.gov/blast) was performed for the upstream sequences (600 bp) of all yeast genes, and interactions that involved genes with homologous upstream sequences were removed from the network (blastn on plus/plus strands with all default parameters except for the Expectation value e, set at 0.000001). A total of 24 interactions were removed from the network. Furthermore, some metabolic compounds that are involved in many reactions were removed from the dataset. These are H2O, ATP, NAD, NADH, NADPH, NADP, ADP, CoA, O2, CO2, NH3, pyrophosphate, UDP, ”Protein”, ”peptide” and phosphate.
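One way to build such an undirected, de-duplicated network is sketched below. The edge rule used here (two genes are linked when their reactions share a compound that is not in the excluded list) is an assumption made for illustration, as are the function name and the input layout. The excluded compounds are the ones listed above.

```python
from itertools import combinations

# Compounds removed because they take part in too many reactions (from the text).
EXCLUDED = {
    "H2O", "ATP", "NAD", "NADH", "NADPH", "NADP", "ADP", "CoA", "O2", "CO2",
    "NH3", "pyrophosphate", "UDP", "Protein", "peptide", "phosphate",
}


def metabolic_network(reactions, excluded=EXCLUDED):
    """Build an undirected gene network from enzyme-linked reactions.

    reactions: iterable of (gene, compounds) pairs, where `compounds` holds
    the substrates and products of a reaction catalysed by the gene product.
    Assumed edge rule: two genes are connected if they share a non-excluded
    compound. Each edge is stored once as an alphabetically sorted pair.
    """
    genes_by_compound = {}
    for gene, compounds in reactions:
        for compound in set(compounds) - excluded:
            genes_by_compound.setdefault(compound, set()).add(gene)
    edges = set()
    for genes in genes_by_compound.values():
        for a, b in combinations(sorted(genes), 2):
            edges.add((a, b))
    return edges
```

Storing each edge as a sorted pair in a set is what guarantees that an interaction is represented only once, whatever the reaction direction.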
B.1.2 Protein interaction network
Direct protein-protein interaction data were derived from two datasets of experimental results, identified as the Cellzome (Gavin et al., 2002) and MDS (Ho et al., 2002) datasets. Both are based on a large-scale approach to systematically identify protein complexes in S. cerevisiae. As for the metabolic network, the same all versus all BLAST as in B.1.1 was performed for the upstream sequences of all yeast genes, and protein interactions that involved genes with homologous upstream sequences were removed from the networks. A total of 2 and 30 interactions were removed from the Cellzome and MDS networks respectively.
B.2 Pattern search
The DNA regions considered are a fixed length of 600 base pairs upstream of S. cerevisiae genes (we also tried 400 and 300 bp, with more or less equivalent results). The genome data used are the complete genome of S. cerevisiae strain S288C (The yeast genome directory 1997). The pattern-searching program used for this study is Teiresias (Rigoutsos and Floratos, 1998). Teiresias is a combinatorial algorithm that identifies any motif satisfying given cutoffs. The cutoffs used here were the following: L=8, W=10 and K=3 for nucleation sets of fewer than 10 genes (K=4 otherwise), with L being the number of literals in the pattern, W the maximum extent of an elementary pattern, and K (used with the -v option) the minimum number of sequences in which the motif appears. The patterns obtained are therefore at least 8 defined nucleotides long with a maximum of 2 wild cards allowed.
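To make the pattern constraints concrete, the sketch below checks a pattern against the "8 literals, at most 2 wild cards" consequence of the cutoffs and finds the genes whose upstream region contains it. This is not Teiresias itself. It assumes that wild cards are written as '.', and the function names and the simplified cutoff check are hypothetical.

```python
import re


def satisfies_cutoff(pattern, min_literals=8, max_wildcards=2):
    """Simplified check: enough literals, few wild cards, literal at both ends."""
    wildcards = pattern.count(".")
    literals = len(pattern) - wildcards
    return (
        literals >= min_literals
        and wildcards <= max_wildcards
        and not pattern.startswith(".")
        and not pattern.endswith(".")
    )


def genes_with_pattern(pattern, upstream_by_gene):
    """Return the set of genes whose upstream sequence contains the pattern.

    '.' in the pattern matches any single nucleotide (assumed notation).
    """
    regex = re.compile(
        "".join("[ACGT]" if c == "." else re.escape(c) for c in pattern.upper())
    )
    return {g for g, seq in upstream_by_gene.items() if regex.search(seq.upper())}
```

The set returned by `genes_with_pattern` is the kind of gene list from which a pattern network can then be built and scored.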
B.3 Overlap score
The overlap score represents the number of common edges between the initial functional network and the proposed pattern network, normalised by the number of edges connected to the considered nodes. Each common edge is counted once but divided by the total number of edges from the two nodes; in addition, the total is raised to the power 0.5, as this corrects for the tendency of larger networks to produce large scores. The final form is shown in Equation B.3.
We do not count the initial seed edges which generated potential patterns in the scoring function.
\[
S = \sqrt{\sum_{i} \frac{1}{a_i + b_i - 1}}
\]
Summation is over all common edges (i) present in both networks connecting node Ai to node Bi. The denominator ai + bi − 1 is the total number of edges from both nodes, discounting the edge being counted.
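The score can be sketched in Python as follows. This is an illustration, not the original implementation. In particular, the text does not state in which network the edge counts are taken; the sketch assumes they are the node degrees in the functional network, and the function name and edge representation are hypothetical.

```python
from math import sqrt


def overlap_score(functional_edges, pattern_edges, seed_edges=()):
    """Square root of the sum over common edges of 1 / (a_i + b_i - 1).

    Edges are (node, node) pairs treated as undirected; self-loops are ignored.
    Assumption: a_i and b_i are node degrees in the functional network.
    Seed edges are excluded from the sum, as stated in the text.
    """
    def undirected(edges):
        return {frozenset(e) for e in edges if e[0] != e[1]}

    functional = undirected(functional_edges)
    common = (functional & undirected(pattern_edges)) - undirected(seed_edges)

    degree = {}
    for edge in functional:
        for node in edge:
            degree[node] = degree.get(node, 0) + 1

    total = 0.0
    for edge in common:
        a, b = tuple(edge)
        total += 1.0 / (degree[a] + degree[b] - 1)
    return sqrt(total)
```

With this normalisation, a shared edge between two highly connected nodes contributes less than a shared edge between two nodes that have few other partners.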
In order to model the overlap score, random networks of the same sizeas the proposed pattern network were created by choosing genes at random,including the seed nodes. The overlap score is calculated in an identical
manner. Other randomisation procedures were tried and produced essentially identical results. A linear relationship was observed between the number of nodes in the pattern network and the score. This relationship was estimated by linear regression.
Normality assessment was done using the Shapiro-Wilk test (Shapiro and Wilk, 1965). This test calculates a W statistic that tests whether a random sample of continuous values, x1, x2, ..., xn, comes from a normal distribution. For random networks with more than 150 nodes, the percentage of Shapiro-Wilk tests with a p value greater than 0.02 is 84 percent, 92.5 percent and 95 percent for the Cellzome, KEGG and MDS networks respectively.
B.4 Standard deviation score
Given a set of upstream regions containing a pattern a, the standard deviation of the different locations of this pattern with respect to the start codon of the genes is calculated as:
\[ \sigma_a = \sqrt{\frac{\sum (X - \mu)^2}{N - 1}} \]
with σa being the standard deviation of the pattern a, N the number of sequences in the set, µ the average location with respect to the start codon and X the location. The standard deviation score is based on comparing the standard deviation for the set of genes that comprise an overlap network with the standard deviation for an equally sized set of random genes that have the same pattern. This comparison is done one hundred times per pattern, and a p value, called the standard deviation score, is calculated from these comparisons. This score reflects a better conservation of the upstream location of the pattern within the overlap network. It is assumed here that a real pattern should conserve its position relative to the transcription start site and that the UTR regions in yeast are about the same length for all the genes within a set.
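The comparison can be sketched as below. The text does not give the exact form of the p value, so this sketch assumes an empirical one: the fraction of random gene sets whose standard deviation is no larger than that of the overlap genes. It also assumes one pattern location per gene and that random sets are drawn from all genes carrying the pattern. All names are hypothetical.

```python
import random
from statistics import stdev


def sd_score(positions, overlap_genes, n_trials=100, seed=0):
    """Empirical p value for positional conservation of a pattern.

    positions     -- dict: gene -> pattern location relative to the start codon
    overlap_genes -- genes of the overlap network (all must be in positions)
    """
    rng = random.Random(seed)
    overlap_genes = list(overlap_genes)
    # statistics.stdev uses the N - 1 denominator, as in the formula above.
    observed = stdev([positions[g] for g in overlap_genes])

    candidates = list(positions)  # every gene that carries the pattern
    hits = 0
    for _ in range(n_trials):
        sample = rng.sample(candidates, len(overlap_genes))
        if stdev([positions[g] for g in sample]) <= observed:
            hits += 1
    return hits / n_trials
```

A low value indicates that random sets of genes carrying the pattern rarely show locations as tightly grouped as those of the overlap network.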
B.5 Pattern clustering and sequence logo generation
Clustering was based on the genomic location of the patterns. For each pattern derived, all the exact locations of its occurrence in the upstream
regions of all the genes in the yeast genome were recorded. Two patterns were linked together if they shared at least 40 percent of genomic locations (exact location +/- 5 bp) for at least one pattern location profile. A final cluster contains all the patterns that are linked together (single linkage clustering). For each cluster of more than one motif, a sequence logo was then derived by retrieving all sequences in the upstream regions of overlap genes that match at least one of the motifs in the cluster. The sequences obtained were then aligned and a profile logo was built, based on the information content of each position in the alignment. Appendix C shows the different clusters obtained.
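The linking rule and the single linkage step can be sketched as follows. The representation of a location profile as a list of (gene, position) pairs and all function names are assumptions; the 40 percent threshold and the 5 bp tolerance come from the text. "For at least one pattern location profile" is read here as: the threshold is met from the point of view of either of the two patterns.

```python
def shared_fraction(locs_a, locs_b, tol=5):
    """Fraction of locations in locs_a within tol bp of a location in locs_b."""
    if not locs_a:
        return 0.0
    hits = sum(
        1 for gene, pos in locs_a
        if any(g == gene and abs(pos - p) <= tol for g, p in locs_b)
    )
    return hits / len(locs_a)


def linked(locs_a, locs_b, threshold=0.4, tol=5):
    """Two patterns are linked if either profile shares enough locations."""
    return (shared_fraction(locs_a, locs_b, tol) >= threshold
            or shared_fraction(locs_b, locs_a, tol) >= threshold)


def single_linkage(profiles, threshold=0.4, tol=5):
    """Group patterns into connected components of the link graph."""
    parent = {name: name for name in profiles}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    names = list(profiles)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if linked(profiles[a], profiles[b], threshold, tol):
                parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(clusters.values(), key=sorted)
```

Because linkage is single, two patterns that share no locations directly can still fall in the same cluster when a third pattern is linked to both.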
Appendix C
Yeast significant motifs
| id | occ. motifs | net. | SD KEGG | SD cell | SD MDS | function |
|---|---|---|---|---|---|---|
| cluster1 | 1941 316 | MCK | 5.73 | 14.80 | 15.04 | transcription-translation processes |
| cluster2 | 454 | MCK | 8.15 | 0.24 | 5.61 | unknown |
| cluster3 | 1384 | MC | 3.62 | 4.42 | 5.60 | unknown |
| cluster4 | 359 117 | MC | 1.24 | 10.49 | 9.86 | RNA metabolism |
| cluster5 | 413 27 | MC | 2.62 | 9.92 | 7.87 | RNA metabolism |
| cluster6 | 599 6 | MC | 0.61 | 10.19 | 12.97 | proteasome |
| cluster7 | 141 4 | M | 1.95 | 1.61 | 4.30 | unknown |
| cluster8 | 52 2 | M | 0.97 | 0.62 | 4.02 | cell cycle |
| cluster9 | 72 1 | M | 0.30 | 0.71 | 3.87 | unknown |
| cluster10 | 156 7 | MC | 1.71 | 3.79 | 2.57 | mRNA splicing |
| cluster11 | 62 1 | M | 2.01 | 1.38 | 4.98 | unknown |
| cluster12 | 68 1 | M | 3.29 | 2.11 | 2.92 | unknown |
| cluster13 | 155 5 | M | 1.25 | 0.53 | 3.85 | cell cycle |
| cluster14 | 98 2 | MC | 0.47 | 2.06 | 2.58 | unknown |
| cluster15 | 14 2 | M | 0.85 | 0.13 | 6.18 | unknown |
| cluster16 | 10 2 | M | 1.69 | 0.62 | 4.75 | unknown |
| cluster17 | 34 2 | M | 0.78 | 1.05 | 4.05 | unknown |
| cluster18 | 24 1 | M | 0.88 | 2.28 | 5.77 | unknown |
| cluster19 | 26 5 | C | 0.54 | 5.00 | 1.22 | unknown |
| cluster20 | 26 1 | C | 0.72 | 4.46 | 2.86 | unknown |
| cluster21 | 56 3 | C | 0.54 | 3.67 | 2.17 | cell cycle |
| cluster22 | 73 3 | C | 0.23 | 4.08 | 1.73 | unknown |
| cluster23 | 51 6 | MC | 0.65 | 4.92 | 3.20 | proteasome |
| cluster24 | 159 10 | C | 0.81 | 5.55 | 2.65 | unknown |
| cluster25 | 91 6 | C | 0.60 | 5.33 | 0.16 | transcription |
| cluster26 | 89 2 | C | 0.56 | 3.72 | 1.58 | unknown |
| cluster27 | 16 2 | C | 2.02 | 5.57 | 3.29 | unknown |
| cluster28 | 25 3 | C | 1.66 | 5.29 | 1.53 | unknown |
| cluster29 | 25 2 | C | 0.08 | 4.40 | 0.59 | unknown |
| cluster30 | 26 4 | C | 2.08 | 7.14 | 0.86 | unknown |
| cluster31 | 95 5 | K | 5.37 | 0.08 | 0.46 | unknown |
| cluster32 | 22 2 | K | 3.00 | 2.49 | 1.32 | unknown |
| cluster33 | 34 4 | K | 9.80 | 2.33 | 2.21 | AA synthesis |
| cluster34 | 55 10 | K | 4.64 | 1.34 | 4.42 | sugar metabolism |
| cluster35 | 31 3 | K | 8.77 | 0.64 | 1.16 | ATP synthesis |
| cluster36 | 17 2 | K | 2.74 | 1.17 | 0.14 | ethanol utilisation |
| cluster37 | 19 1 | K | 3.89 | 0.44 | 0.98 | unknown |
| cluster38 | 34 1 | K | 3.17 | 0.04 | 0.11 | unknown |
| cluster39 | 103 2 | K | 4.24 | 0.68 | 2.51 | unknown |
| cluster40 | 56 2 | K | 6.02 | 2.25 | 0.85 | unknown |
| cluster41 | 48 1 | K | 5.62 | 1.24 | 0.35 | unknown |
| cluster42 | 25 1 | K | 2.73 | 0.13 | 0.04 | unknown |
Table C.1: Summary of all the significant motifs found using functional networks. Occ (occurrence) is the total number of genes in the overlap network(s) derived from the relevant functional network(s) (see network column). Motifs is the number of motifs used to build the sequence logo. The network column (net) shows the functional network(s) in which the motif was initially found with a significant overlap score: K = KEGG, C = Cellzome and M = MDS. The standard deviation columns, SD KEGG, SD cell and SD MDS, give the number of standard deviations separating the motif's overlap score from the mean of the random overlap scores for the KEGG, Cellzome and MDS functional networks respectively. The function column is a functional annotation based on the annotation of the overlap genes.
Bibliography
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J.
(1990). Basic local alignment search tool. J Mol Biol, 215:403–410.
Andreazzoli, M., Gestri, G., Angeloni, D., Menna, E., and Barsacchi, G.
(1999). Role of Xrx1 in Xenopus eye and anterior brain development.
Development, 126:2451–2460.
Aparicio, S., Morrison, A., Gould, A., Gilthorpe, J., Chaudhuri, C., Rigby,
P., Krumlauf, R., and Brenner, S. (1995). Detecting conserved regulatory
elements with the model genome of the Japanese puffer fish, Fugu rubripes.
Proc Natl Acad Sci U S A, 92:1684–1688.
Arndt, K. and Fink, G. R. (1986). GCN4 protein, a positive transcription
factor in yeast, binds general control promoters at all 5’ TGACTC 3’
sequences. Proc Natl Acad Sci U S A, 83:8516–8520.
Arnone, M. I. and Davidson, E. H. (1997). The hardwiring of development:
organization and function of genomic regulatory systems. Development,
124:1851–1864.
Ayer, D. E., Kretzner, L., and Eisenman, R. N. (1993). Mad: a heterodimeric
partner for Max that antagonizes Myc transcriptional activity.
Cell, 72:211–222.
Bailey, T. L. and Elkan, C. (1995). The value of prior knowledge in discovering
motifs with MEME. Proc Int Conf Intell Syst Mol Biol, 3:21–29.
Benos, P. V., Lapedes, A. S., and Stormo, G. D. (2002). Is there a code for
protein-DNA recognition? Probab(ilistical)ly. Bioessays, 24:466–475.
Berg, J. M. (1992). Sp1 and the subfamily of zinc finger proteins with
guanine-rich binding sites. Proc Natl Acad Sci U S A, 89:11109–11110.
Berman, B. P., Nibu, Y., Pfeiffer, B. D., Tomancak, P., Celniker, S. E.,
Levine, M., Rubin, G. M., and Eisen, M. B. (2002). Exploiting transcription
factor binding site clustering to identify cis-regulatory modules involved in
pattern formation in the Drosophila genome. Proc Natl Acad Sci U S A,
99:757–762.
Birney, E., Andrews, T. D., Bevan, P., Caccamo, M., Chen, Y., Clarke, L.,
Coates, G., Cuff, J., Curwen, V., Cutts, T., Down, T., Eyras, E.,
Fernandez-Suarez, X. M., Gane, P., Gibbins, B., Gilbert, J., Hammond, M., Hotz,
H. R., Iyer, V., Jekosch, K., Kahari, A., Kasprzyk, A., Keefe, D., Keenan,
S., Lehvaslaiho, H., McVicker, G., Melsopp, C., Meidl, P., Mongin, E.,
Pettett, R., Potter, S., Proctor, G., Rae, M., Searle, S., Slater, G., Smedley,
D., Smith, J., Spooner, W., Stabenau, A., Stalker, J., Storey, R.,
Ureta-Vidal, A., Woodwark, K. C., Cameron, G., Durbin, R., Cox, A., Hubbard,
T., and Clamp, M. (2004). An overview of Ensembl. Genome Res,
14:925–928.
Black, A. R., Black, J. D., and Azizkhan-Clifford, J. (2001). Sp1 and Krüppel-like
factor family of transcription factors in cell growth regulation and cancer.
J Cell Physiol, 188:143–160.
Blackwood, E. M. and Eisenman, R. N. (1991). Max: a helix-loop-helix
zipper protein that forms a sequence-specific DNA-binding complex with
Myc. Science, 251:1211–1217.
Blaiseau, P. L., Isnard, A. D., Surdin-Kerjan, Y., and Thomas, D. (1997).
Met31p and Met32p, two related zinc finger proteins, are involved in
transcriptional regulation of yeast sulfur amino acid metabolism. Mol Cell Biol,
17:3640–3648.
Blanchette, M. and Tompa, M. (2002). Discovery of regulatory elements
by a computational method for phylogenetic footprinting. Genome Res,
12:739–748.
Boffelli, D., McAuliffe, J., Ovcharenko, D., Lewis, K. D., Ovcharenko, I.,
Pachter, L., and Rubin, E. M. (2003). Phylogenetic shadowing of primate
sequences to find functional regions of the human genome. Science,
299:1391–1394.
Brazma, A., Jonassen, I., Vilo, J., and Ukkonen, E. (1998). Predicting gene
regulatory elements in silico on a genomic scale. Genome Res, 8:1202–1215.
Bungert, J., Dave, U., Lim, K. C., Lieuw, K. H., Shavit, J. A., Liu, Q.,
and Engel, J. D. (1995). Synergistic regulation of human beta-globin gene
switching by locus control region elements HS3 and HS4. Genes Dev,
9:3083–3096.
Burge, C. and Karlin, S. (1997). Prediction of complete gene structures in
human genomic DNA. J Mol Biol, 268:78–94.
Casarosa, S., Andreazzoli, M., Simeone, A., and Barsacchi, G. (1997). Xrx1,
a novel Xenopus homeobox gene expressed during eye and pineal gland
development. Mech Dev, 61:187–198.
Causton, H. C., Ren, B., Koh, S. S., Harbison, C. T., Kanin, E., Jennings,
E. G., Lee, T. I., True, H. L., Lander, E. S., and Young, R. A. (2001).
Remodeling of yeast genome expression in response to environmental changes.
Mol Biol Cell, 12:323–337.
Chan, R. J., You, M., and Feng, G. S. (2004). Identification of trans-acting
factors by electrophoretic mobility shift assay. Methods Mol Biol, 249:7–20.
Chao, K. M., Hardison, R. C., and Miller, W. (1993). Constrained sequence
alignment. Bull Math Biol, 55:503–524.
Chasman, D. I., Lue, N. F., Buchman, A. R., LaPointe, J. W., Lorch, Y.,
and Kornberg, R. D. (1990). A yeast protein that influences the chromatin
structure of UASG and functions as a powerful auxiliary gene activator.
Genes Dev, 4:503–514.
Chaudhuri, A., Barbour, K. W., and Berger, F. G. (1991). Evolution of
messenger RNA structure and regulation in the genus Mus: the
androgen-inducible RP2 mRNAs. Mol Biol Evol, 8:641–653.
Chiang, D. Y., Moses, A. M., Kellis, M., Lander, E. S., and Eisen, M. B.
(2003). Phylogenetically and spatially conserved word pairs associated with
gene-expression changes in yeasts. Genome Biol, 4:R43–R43.
Clamp, M., Cuff, J., Searle, S. M., and Barton, G. J. (2004). The Jalview
Java alignment editor. Bioinformatics, 20:426–427.
Cooper, D. N. (1992). Regulatory mutations and human genetic disease.
Ann Med, 24:427–437.
Corpet, F. (1988). Multiple sequence alignment with hierarchical clustering.
Nucleic Acids Res, 16:10881–10890.
Cremer, T. and Cremer, C. (2001). Chromosome territories, nuclear
architecture and gene regulation in mammalian cells. Nat Rev Genet, 2:292–301.
Crollius, H. R., Jaillon, O., Bernot, A., Dasilva, C., Bouneau, L.,
Fischer, C., Fizames, C., Wincker, P., Brottier, P., Quétier, F., Saurin, W.,
and Weissenbach, J. (2000). Estimate of human gene number provided
by genome-wide analysis using Tetraodon nigroviridis DNA sequence. Nat
Genet, 25:235–238.
Cui, Y., Hagan, K. W., Zhang, S., and Peltz, S. W. (1995). Identification
and characterization of genes that are required for the accelerated
degradation of mRNAs containing a premature translational termination codon.
Genes Dev, 9:423–436.
de Melo, J., Qiu, X., Du, G., Cristante, L., and Eisenstat, D. D. (2003).
Dlx1, Dlx2, Pax6, Brn3b, and Chx10 homeobox gene expression defines the
retinal ganglion and inner nuclear layers of the developing and adult mouse
retina. J Comp Neurol, 461:187–204.
Dermitzakis, E. T. and Clark, A. G. (2002). Evolution of transcription
factor binding sites in Mammalian gene regulatory regions: conservation
and turnover. Mol Biol Evol, 19:1114–1121.
Dieterich, C., Cusack, B., Wang, H., Rateitschak, K., Krause, A., and
Vingron, M. (2002). Annotating regulatory DNA based on man-mouse genomic
comparison. Bioinformatics, pages S84–S90.
Dowell, S. J., Tsang, J. S., and Mellor, J. (1992). The centromere and
promoter factor 1 of yeast contains a dimerisation domain located
carboxy-terminal to the bHLH domain. Nucleic Acids Res, 20:4229–4236.
Down, T. A. and Hubbard, T. J. (2002). Computational detection and
location of transcription start sites in mammalian genomic DNA. Genome
Res, 12:458–461.
Drouin, R., Angers, M., Dallaire, N., Rose, T. M., Khandjian, W., and
Rousseau, F. (1997). Structural and functional characterization of the
human FMR1 promoter reveals similarities with the hnRNP-A2 promoter
region. Hum Mol Genet, 6:2051–2060.
Dubchak, I., Brudno, M., Loots, G. G., Pachter, L., Mayor, C., Rubin,
E. M., and Frazer, K. A. (2000). Active conservation of noncoding sequences
revealed by three-way species comparisons. Genome Res, 10:1304–1306.
Dynan, W. S. and Tjian, R. (2000). Control of eukaryotic messenger RNA
synthesis by sequence-specific DNA-binding proteins. Nature, 316:774–778.
Eddy, S. R. (2001). Non-coding RNA genes and the modern RNA world.
Nat Rev Genet, 2:919–929.
Eferl, R. and Wagner, E. F. (2003). AP-1: a double-edged sword in
tumorigenesis. Nat Rev Cancer, 3:859–868.
Elnitski, L., Hardison, R. C., Li, J., Yang, S., Kolbe, D., Eswara, P.,
O’Connor, M. J., Schwartz, S., Miller, W., and Chiaromonte, F. (2003).
Distinguishing regulatory DNA from neutral sites. Genome Res, 13:64–72.
Ettwiller, L., Paten, B., Souren, M., Loosli, F., Wittbrodt, J., and Birney, E.
(2005). The discovery, positioning and verification of a set of
transcription-associated motifs in vertebrates. Genome Biol, 6:R104–R104.
Flint, J., Tufarelli, C., Peden, J., Clark, K., Daniels, R. J., Hardison, R.,
Miller, W., Philipsen, S., Tan-Un, K. C., McMorrow, T., Frampton, J.,
Alter, B. P., Frischauf, A. M., and Higgs, D. R. (2001). Comparative genome
analysis delimits a chromosomal domain and identifies key regulatory
elements in the alpha globin cluster. Hum Mol Genet, 10:371–382.
Force, A., Lynch, M., Pickett, F. B., Amores, A., Yan, Y. L., and
Postlethwait, J. (1999). Preservation of duplicate genes by complementary,
degenerative mutations. Genetics, 151:1531–1545.
Galliot, B., de Vargas, C., and Miller, D. (1999). Evolution of homeobox
genes: Q50 Paired-like genes founded the Paired class. Dev Genes Evol,
209:186–197.
Gavin, A. C., Bosche, M., Krause, R., Grandi, P., Marzioch, M., Bauer,
A., Schultz, J., Rick, J. M., Michon, A. M., Cruciat, C. M., Remor, M.,
Höfert, C., Schelder, M., Brajenovic, M., Ruffner, H., Merino, A., Klein,
K., Hudak, M., Dickson, D., Rudi, T., Gnau, V., Bauch, A., Bastuck, S.,
Huhse, B., Leutwein, C., Heurtier, M. A., Copley, R. R., Edelmann, A.,
Querfurth, E., Rybin, V., Drewes, G., Raida, M., Bouwmeester, T., Bork,
P., Seraphin, B., Kuster, B., Neubauer, G., and Superti-Furga, G. (2002).
Functional organization of the yeast proteome by systematic analysis of
protein complexes. Nature, 415:141–147.
Ge, H., Liu, Z., Church, G. M., and Vidal, M. (2001). Correlation
between transcriptome and interactome mapping data from Saccharomyces
cerevisiae. Nat Genet, 29:482–486.
Gottgens, B., Barton, L. M., Chapman, M. A., Sinclair, A. M., Knudsen,
B., Grafham, D., Gilbert, J. G., Rogers, J., Bentley, D. R., and Green, A. R.
(2002). Transcriptional regulation of the stem cell leukemia gene
(SCL)--comparative analysis of five vertebrate SCL loci. Genome Res, 12:749–759.
Harris, M. A., Clark, J., Ireland, A., Lomax, J., Ashburner, M., Foulger,
R., Eilbeck, K., Lewis, S., Marshall, B., Mungall, C., Richter, J., Rubin,
G. M., Blake, J. A., Bult, C., Dolan, M., Drabkin, H., Eppig, J. T., Hill,
D. P., Ni, L., Ringwald, M., Balakrishnan, R., Cherry, J. M., Christie,
K. R., Costanzo, M. C., Dwight, S. S., Engel, S., Fisk, D. G., Hirschman,
J. E., Hong, E. L., Nash, R. S., Sethuraman, A., Theesfeld, C. L., Botstein,
D., Dolinski, K., Feierbach, B., Berardini, T., Mundodi, S., Rhee, S. Y.,
Apweiler, R., Barrell, D., Camon, E., Dimmer, E., Lee, V., Chisholm, R.,
Gaudet, P., Kibbe, W., Kishore, R., Schwarz, E. M., Sternberg, P., Gwinn,
M., Hannick, L., Wortman, J., Berriman, M., Wood, V., de la Cruz, N.,
Tonellato, P., Jaiswal, P., Seigfried, T., and White, R. (2004). The Gene
Ontology (GO) database and informatics resource. Nucleic Acids Res, pages
D258–D261.
Haun, R. S., Moss, J., and Vaughan, M. (1993). Characterization of the
human ADP-ribosylation factor 3 promoter. J Biol Chem, 268:8793–8800.
Hayashi, N. and Oshima, Y. (1991). Specific cis-acting sequence for PHO8
expression interacts with PHO4 protein, a positive regulatory factor, in
Saccharomyces cerevisiae. Mol Cell Biol, 11:785–794.
Hernandez, M. C., Erkman, L., Matter-Sadzinski, L., Roztocil, T., Ballivet,
M., and Matter, J. M. (1995). Characterization of the nicotinic acetylcholine
receptor beta 3 gene. J Biol Chem, 270:3224–3233.
Hertz, G. Z. and Stormo, G. D. (2000). Identifying DNA and protein
patterns with statistically significant alignments of multiple sequences.
Bioinformatics, 15:563–577.
Ho, Y., Gruhler, A., Heilbut, A., Bader, G. D., Moore, L., Adams, S. L.,
Millar, A., Taylor, P., Bennett, K., Boutilier, K., Yang, L., Wolting, C.,
Donaldson, I., Schandorff, S., Shewnarane, J., Vo, M., Taggart, J., Goudreault,
M., Muskat, B., Alfarano, C., Dewar, D., Lin, Z., Michalickova, K., Willems,
A. R., Sassi, H., Nielsen, P. A., Rasmussen, K. J., Andersen, J. R., Johansen,
L. E., Hansen, L. H., Jespersen, H., Podtelejnikov, A., Nielsen, E.,
Crawford, J., Poulsen, V., Sørensen, B. D., Matthiesen, J., Hendrickson, R. C.,
Gleeson, F., Pawson, T., Moran, M. F., Durocher, D., Mann, M., Hogue,
C. W., Figeys, D., and Tyers, M. (2002). Systematic identification of
protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature,
415:180–183.
Hope, I. A. and Struhl, K. (1985). GCN4 protein, synthesized in vitro,
binds HIS3 regulatory sequences: implications for general control of amino
acid biosynthetic genes in yeast. Cell, 43:177–188.
Hughes, J. D., Estep, P. W., Tavazoie, S., and Church, G. M. (2000).
Computational identification of cis-regulatory elements associated with groups
of functionally related genes in Saccharomyces cerevisiae. J Mol Biol,
296:1205–1214.
Hutcheson, D. A. and Vetter, M. L. (2001). The bHLH factors Xath5 and
XNeuroD can upregulate the expression of XBrn3d, a POU-homeodomain
transcription factor. Dev Biol, 232:327–338.
IHGSC, Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C.,
Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., Funke, R.,
Gage, D., Harris, K., Heaford, A., Howland, J., Kann, L., Lehoczky, J.,
LeVine, R., McEwan, P., McKernan, K., Meldrim, J., Mesirov, J. P.,
Miranda, C., Morris, W., Naylor, J., Raymond, C., Rosetti, M., Santos, R.,
Sheridan, A., Sougnez, C., Stange-Thomann, N., Stojanovic, N.,
Subramanian, A., Wyman, D., Rogers, J., Sulston, J., Ainscough, R., Beck, S.,
Bentley, D., Burton, J., Clee, C., Carter, N., Coulson, A., Deadman, R.,
Deloukas, P., Dunham, A., Dunham, I., Durbin, R., French, L., Grafham,
D., Gregory, S., Hubbard, T., Humphray, S., Hunt, A., Jones, M., Lloyd,
C., McMurray, A., Matthews, L., Mercer, S., Milne, S., Mullikin, J. C.,
Mungall, A., Plumb, R., Ross, M., Shownkeen, R., Sims, S., Waterston,
R. H., Wilson, R. K., Hillier, L. W., McPherson, J. D., Marra, M. A.,
Mardis, E. R., Fulton, L. A., Chinwalla, A. T., Pepin, K. H., Gish, W. R.,
Chissoe, S. L., Wendl, M. C., Delehaunty, K. D., Miner, T. L., Delehaunty,
A., Kramer, J. B., Cook, L. L., Fulton, R. S., Johnson, D. L., Minx, P. J.,
Clifton, S. W., Hawkins, T., Branscomb, E., Predki, P., Richardson, P.,
Wenning, S., Slezak, T., Doggett, N., Cheng, J. F., Olsen, A., Lucas, S.,
Elkin, C., Uberbacher, E., Frazier, M., Gibbs, R. A., Muzny, D. M., Scherer,
S. E., Bouck, J. B., Sodergren, E. J., Worley, K. C., Rives, C. M., Gorrell,
J. H., Metzker, M. L., Naylor, S. L., Kucherlapati, R. S., Nelson, D. L.,
Weinstock, G. M., Sakaki, Y., Fujiyama, A., Hattori, M., Yada, T., Toyoda, A.,
Itoh, T., Kawagoe, C., Watanabe, H., Totoki, Y., Taylor, T., Weissenbach,
J., Heilig, R., Saurin, W., Artiguenave, F., Brottier, P., Bruls, T., Pelletier,
E., Robert, C., Wincker, P., Smith, D. R., Doucette-Stamm, L.,
Rubenfield, M., Weinstock, K., Lee, H. M., Dubois, J., Rosenthal, A., Platzer,
M., Nyakatura, G., Taudien, S., Rump, A., Yang, H., Yu, J., Wang, J.,
Huang, G., Gu, J., Hood, L., Rowen, L., Madan, A., Qin, S., Davis, R. W.,
Federspiel, N. A., Abola, A. P., Proctor, M. J., Myers, R. M., Schmutz, J.,
Dickson, M., Grimwood, J., Cox, D. R., Olson, M. V., Kaul, R., Raymond,
C., Shimizu, N., Kawasaki, K., Minoshima, S., Evans, G. A., Athanasiou,
M., Schultz, R., Roe, B. A., Chen, F., Pan, H., Ramser, J., Lehrach, H.,
Reinhardt, R., McCombie, W. R., de la Bastide, M., Dedhia, N., Blöcker, H.,
Hornischer, K., Nordsiek, G., Agarwala, R., Aravind, L., Bailey, J. A.,
Bateman, A., Batzoglou, S., Birney, E., Bork, P., Brown, D. G., Burge, C. B.,
Cerutti, L., Chen, H. C., Church, D., Clamp, M., Copley, R. R., Doerks, T.,
Eddy, S. R., Eichler, E. E., Furey, T. S., Galagan, J., Gilbert, J. G.,
Harmon, C., Hayashizaki, Y., Haussler, D., Hermjakob, H., Hokamp, K., Jang,
W., Johnson, L. S., Jones, T. A., Kasif, S., Kaspryzk, A., Kennedy, S., Kent,
W. J., Kitts, P., Koonin, E. V., Korf, I., Kulp, D., Lancet, D., Lowe, T. M.,
McLysaght, A., Mikkelsen, T., Moran, J. V., Mulder, N., Pollara, V. J.,
Ponting, C. P., Schuler, G., Schultz, J., Slater, G., Smit, A. F., Stupka, E.,
Szustakowski, J., Thierry-Mieg, D., Thierry-Mieg, J., Wagner, L., Wallis,
J., Wheeler, R., Williams, A., Wolf, Y. I., Wolfe, K. H., Yang, S. P., Yeh,
R. F., Collins, F., Guyer, M. S., Peterson, J., Felsenfeld, A., Wetterstrand,
K. A., Patrinos, A., Morgan, M. J., Szustakowki, J., de Jong, P., Catanese,
J. J., Osoegawa, K., Shizuya, H., Choi, S., and Chen, Y. J. (2001). Initial
sequencing and analysis of the human genome. Nature, 409:860–921.
Iyer, V. and Struhl, K. (1995). Poly(dA:dT), a ubiquitous promoter element
that stimulates transcription via its intrinsic DNA structure. EMBO J,
14:2570–2579.
Iyer, V. R., Eisen, M. B., Ross, D. T., Schuler, G., Moore, T., Lee, J. C.,
Trent, J. M., Staudt, L. M., Hudson, J., Boguski, M. S., Lashkari, D.,
Shalon, D., Botstein, D., and Brown, P. O. (1999). The transcriptional
program in the response of human fibroblasts to serum. Science, 283:83–87.
Jackson, S. P., MacDonald, J. J., Lees-Miller, S., and Tjian, R. (1990). GC
box binding induces phosphorylation of Sp1 by a DNA-dependent protein
kinase. Cell, 63:155–165.
Jacob, F., Perrin, D., Sanchez, C., and Monod, J. (1960). [Operon: a group
of genes with the expression coordinated by an operator]. C R Hebd Seances
Acad Sci, 250:1727–1729.
Jacob, W. F., Silverman, T. A., Cohen, R. B., and Safer, B. (1989).
Identification and characterization of a novel transcription factor participating
in the expression of eukaryotic initiation factor 2 alpha. J Biol Chem,
264:20372–20384.
Jareborg, N., Birney, E., and Durbin, R. (1999). Comparative analysis of
noncoding regions of 77 orthologous mouse and human gene pairs. Genome
Res, 9:815–824.
Kanehisa, M. (1997). A database for post-genome analysis. Trends Genet,
13:375–376.
Kellis, M., Patterson, N., Endrizzi, M., Birren, B., and Lander, E. S. (2003).
Sequencing and comparison of yeast species to identify genes and regulatory
elements. Nature, 423:241–254.
King, M. C. and Wilson, A. C. (1975). Evolution at two levels in humans
and chimpanzees. Science, 188:107–116.
Koch, K. A. and Thiele, D. J. (1999). Functional analysis of a
homopolymeric (dA-dT) element that provides nucleosomal access to yeast and
mammalian transcription factors. J Biol Chem, 274:23752–23760.
Koo, H. S., Wu, H. M., and Crothers, D. M. (2000). DNA bending at
adenine·thymine tracts. Nature, 320:501–506.
Krawczak, M., Chuzhanova, N. A., and Cooper, D. N. (1999). Evolution
of the proximal promoter region of the mammalian growth hormone gene.
Gene, 237:143–151.
Lawrence, J. G. and Roth, J. R. (1996). Selfish operons: horizontal transfer
may drive the evolution of gene clusters. Genetics, 143:1843–1860.
Leblanc, B. and Moss, T. (2001). DNase I footprinting. Methods Mol Biol,
148:31–38.
Lee, T. I., Rinaldi, N. J., Robert, F., Odom, D. T., Bar-Joseph, Z., Gerber,
G. K., Hannett, N. M., Harbison, C. T., Thompson, C. M., Simon, I.,
Zeitlinger, J., Jennings, E. G., Murray, H. L., Gordon, D. B., Ren, B.,
Wyrick, J. J., Tagne, J. B., Volkert, T. L., Fraenkel, E., Gifford, D. K., and
Young, R. A. (2002). Transcriptional regulatory networks in Saccharomyces
cerevisiae. Science, 298:799–804.
Lelivelt, M. J. and Culbertson, M. R. (1999). Yeast Upf proteins required
for RNA surveillance affect global expression of the yeast transcriptome.
Mol Cell Biol, 19:6710–6719.
Levy, S. and Hannenhalli, S. (2002). Identification of transcription factor
binding sites in the human genome sequence. Mamm Genome, 13:510–514.
Levy, S., Hannenhalli, S., and Workman, C. (2001). Enrichment of
regulatory signals in conserved non-coding genomic sequence. Bioinformatics,
17:871–877.
Li, Z., Calcar, S. V., Qu, C., Cavenee, W. K., Zhang, M. Q., and Ren,
B. (2003). A global transcriptional regulatory role for c-Myc in Burkitt’s
lymphoma cells. Proc Natl Acad Sci U S A, 100:8164–8169.
Loots, G. G., Locksley, R. M., Blankespoor, C. M., Wang, Z. E., Miller,
W., Rubin, E. M., and Frazer, K. A. (2000). Identification of a coordinate
regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons.
Science, 288:136–140.
Lowndes, N. F., Johnson, A. L., and Johnston, L. H. (1991). Coordination of
expression of DNA synthesis genes in budding yeast by a cell-cycle regulated
trans factor. Nature, 350:247–250.
Ludwig, M. Z., Bergman, C., Patel, N. H., and Kreitman, M. (2000).
Evidence for stabilizing selection in a eukaryotic enhancer element. Nature,
403:564–567.
Majewski, J. and Ott, J. (2002). Distribution and characterization of
regulatory elements in the human genome. Genome Res, 12:1827–1836.
Manke, T., Bringas, R., and Vingron, M. (2003). Correlating protein-DNA
and protein-protein interaction networks. J Mol Biol, 333:75–85.
Mannhaupt, G., Schnall, R., Karpov, V., Vetter, I., and Feldmann, H.
(1999). Rpn4p acts as a transcription factor by binding to PACE, a nonamer
box found upstream of 26S proteasomal and other genes in yeast. FEBS
Lett, 450:27–34.
Mantovani, R. (1998). A survey of 178 NF-Y binding CCAAT boxes.
Nucleic Acids Res, 26:1135–1143.
Maquat, L. E. and Carmichael, G. G. (2001). Quality control of mRNA
function. Cell, 104:173–176.
Matter-Sadzinski, L., Matter, J. M., Ong, M. T., Hernandez, J., and
Ballivet, M. (2001). Specification of neurotransmitter receptor identity in
developing retina: the chick ATH5 promoter integrates the positive and negative
effects of several bHLH proteins. Development, 128:217–231.
Moll, T., Dirick, L., Auer, H., Bonkovsky, J., and Nasmyth, K. (1992).
SWI6 is a regulatory subunit of two different cell cycle START-dependent
transcription factors in Saccharomyces cerevisiae. J Cell Sci Suppl,
16:87–96.
Morgenstern, B., Frech, K., Dress, A., and Werner, T. (1998). DIALIGN:
finding local similarities by multiple sequence alignment. Bioinformatics,
14:290–294.
Moss, J., Tsuchiya, M., Tsai, S. C., Adamik, R., Bobak, D. A., Price, S. R.,
Nightingale, M. S., and Vaughan, M. (1990). Structural and functional
characterization of ADP-ribosylation factors, 20 kDa guanine nucleotide-binding
proteins that activate cholera toxin. Adv Second Messenger Phosphoprotein
Res, 24:83–88.
Munro, H. N., Aziz, N., Leibold, E. A., Murray, M., Rogers, J., Vass,
J. K., and White, K. (1988). The ferritin genes: structure, expression, and
regulation. Ann N Y Acad Sci, 526:113–123.
Needleman, S. B. and Wunsch, C. D. (1970). A general method applicable
to the search for similarities in the amino acid sequence of two proteins. J
Mol Biol, 48:443–453.
Niehrs, C. and Pollet, N. (1999). Synexpression groups in eukaryotes.
Nature, 402:483–487.
Nielsen, S. J., Praestegaard, M., Jorgensen, H. F., and Clark, B. F. (1998).
Different Sp1 family members differentially affect transcription from the
human elongation factor 1 A-1 gene promoter. Biochem J, pages 511–517.
Parker, R. and Song, H. (2004). The enzymes and control of eukaryotic
mRNA turnover. Nat Struct Mol Biol, 11:121–127.
Pearson, W. R. (1991). Searching protein sequence libraries: comparison
of the sensitivity and selectivity of the Smith-Waterman and FASTA
algorithms. Genomics, 11:635–650.
Plump, A. S., Erskine, L., Sabatier, C., Brose, K., Epstein, C. J., Goodman,
C. S., Mason, C. A., and Tessier-Lavigne, M. (2002). Slit1 and Slit2
cooperate to prevent premature midline crossing of retinal axons in the mouse
visual system. Neuron, 33:219–232.
Rigaut, G., Shevchenko, A., Rutz, B., Wilm, M., Mann, M., and Séraphin,
B. (1999). A generic protein purification method for protein complex
characterization and proteome exploration. Nat Biotechnol, 17:1030–1032.
Rigoutsos, I. and Floratos, A. (1998). Combinatorial pattern discovery in
biological sequences: The TEIRESIAS algorithm. Bioinformatics, 14:55–67.
Robin, S., Daudin, J. J., Richard, H., Sagot, M. F., and Schbath, S. (2002).
Occurrence probability of structured motifs in random sequences. J Comput
Biol, 9:761–773.
Rockman, M. V. and Wray, G. A. (2002). Abundant raw
material for cis-regulatory evolution in humans. Mol Biol Evol,
19:1991–2004.
Rogozin, I. B., Kochetov, A. V., Kondrashov, F. A., Koonin, E. V., and
Milanesi, L. (2001). Presence of ATG triplets in 5’ untranslated regions
of eukaryotic cDNAs correlates with a ’weak’ context of the start codon.
Bioinformatics, 17:890–900.
Roth, F. P., Hughes, J. D., Estep, P. W., and Church, G. M. (1998). Finding
DNA regulatory motifs within unaligned noncoding sequences clustered by
whole-genome mRNA quantitation. Nat Biotechnol, 16:939–945.
Ryan, K. M. and Birnie, G. D. (1997). Analysis of E-box DNA binding
during myeloid differentiation reveals complexes that contain Mad but not
Max. Biochem J, pages 79–85.
Saito, R. and Tomita, M. (1999). On negative selection against ATG
triplets near start codons in eukaryotic and prokaryotic genomes. J Mol
Evol, 48:213–217.
Salgado, H., Moreno-Hagelsieb, G., Smith, T. F., and Collado-Vides, J.
(2000). Operons in Escherichia coli: genomic analyses and predictions.
Proc Natl Acad Sci U S A, 97:6652–6657.
Schell, T., Kocher, T., Wilm, M., Seraphin, B., Kulozik, A. E., and Hentze,
M. W. (2003). Complexes between the nonsense-mediated mRNA
decay pathway factor human upf1 (up-frameshift protein 1) and essential
nonsense-mediated mRNA decay factors in HeLa cells. Biochem J,
373:775–783.
Scherf, M., Klingenhoff, A., and Werner, T. (2000). Highly specific
localization of promoter regions in large genomic sequences by PromoterInspector:
a novel context analysis approach. J Mol Biol, 297:599–606.
Schmid, C. D., Praz, V., Delorenzi, M., Périer, R., and Bucher, P. (2004).
The Eukaryotic Promoter Database EPD: the impact of in silico primer
extension. Nucleic Acids Res, pages D82–D85.
Schneider, M. L., Turner, D. L., and Vetter, M. L. (2001). Notch signaling
can inhibit Xath5 function in the neural plate and developing retina. Mol
Cell Neurosci, 18:458–472.
Shapiro, S. S. and Wilk, M. B. (1965). An analysis of variance test for normality (complete samples). Biometrika, 52:591–611.
Smith, T. F. and Waterman, M. S. (1981). Identification of common
molecular subsequences. J Mol Biol, 147:195–197.
Sogawa, K., Imataka, H., Yamasaki, Y., Kusume, H., Abe, H., and Fujii-
Kuriyama, Y. (1993). cDNA cloning and transcriptional properties of a
novel GC box-binding protein, BTEB2. Nucleic Acids Res, 21:1527–1532.
Struhl, K. (1995). Yeast transcriptional regulatory mechanisms. Annu Rev
Genet, 29:651–674.
Sved, J. and Bird, A. (1990). The expected equilibrium of the CpG
dinucleotide in vertebrate genomes under a mutation model. Proc Natl Acad Sci
U S A, 87:4692–4696.
Tagle, D. A., Koop, B. F., Goodman, M., Slightom, J. L., Hess, D. L.,
and Jones, R. T. (1988). Embryonic epsilon and gamma globin genes of a
prosimian primate (Galago crassicaudatus). J Mol Biol, 203:439–455.
Thompson, J. D., Higgins, D. G., and Gibson, T. J. (1994). CLUSTAL
W: improving the sensitivity of progressive multiple sequence alignment
through sequence weighting, position-specific gap penalties and weight
matrix choice. Nucleic Acids Res, 22:4673–4680.
Turner, B. M. (2000). Histone acetylation and an epigenetic code. Bioessays,
22:836–845.
Verma, R., Patapoutian, A., Gordon, C. B., and Campbell, J. L. (1991).
Identification and purification of a factor that binds to the Mlu I cell cycle
box of yeast DNA replication genes. Proc Natl Acad Sci U S A, 88:7155–
7159.
Vetter, M. L. and Brown, N. L. (2001). The role of basic helix-loop-helix
genes in vertebrate retinogenesis. Semin Cell Dev Biol, 12:491–498.
Walhout, A. J., Reboul, J., Shtanko, O., Bertin, N., Vaglio, P., Ge, H., Lee,
H., Doucette-Stamm, L., Gunsalus, K. C., Schetter, A. J., Morton, D. G.,
Kemphues, K. J., Reinke, V., Kim, S. K., Piano, F., and Vidal, M. (2002).
Integrating interactome, phenome, and transcriptome mapping data for the
C. elegans germline. Curr Biol, 12:1952–1958.
Walter, J. and Biggin, M. D. (1996). DNA binding specificity of two
homeodomain proteins in vitro and in Drosophila embryos. Proc Natl Acad Sci
U S A, 93:2680–2685.
Waterston, R. H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J. F.,
Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., An, P.,
Antonarakis, S. E., Attwood, J., Baertsch, R., Bailey, J., Barlow, K., Beck,
S., Berry, E., Birren, B., Bloom, T., Bork, P., Botcherby, M., Bray, N.,
Brent, M. R., Brown, D. G., Brown, S. D., Bult, C., Burton, J., Butler,
J., Campbell, R. D., Carninci, P., Cawley, S., Chiaromonte, F., Chinwalla,
A. T., Church, D. M., Clamp, M., Clee, C., Collins, F. S., Cook, L. L.,
Copley, R. R., Coulson, A., Couronne, O., Cuff, J., Curwen, V., Cutts, T.,
Daly, M., David, R., Davies, J., Delehaunty, K. D., Deri, J., Dermitzakis,
E. T., Dewey, C., Dickens, N. J., Diekhans, M., Dodge, S., Dubchak, I.,
Dunn, D. M., Eddy, S. R., Elnitski, L., Emes, R. D., Eswara, P., Eyras,
E., Felsenfeld, A., Fewell, G. A., Flicek, P., Foley, K., Frankel, W. N.,
Fulton, L. A., Fulton, R. S., Furey, T. S., Gage, D., Gibbs, R. A., Glusman,
G., Gnerre, S., Goldman, N., Goodstadt, L., Grafham, D., Graves, T. A.,
Green, E. D., Gregory, S., Guigó, R., Guyer, M., Hardison, R. C., Haussler,
D., Hayashizaki, Y., Hillier, L. W., Hinrichs, A., Hlavina, W., Holzer, T.,
Hsu, F., Hua, A., Hubbard, T., Hunt, A., Jackson, I., Jaffe, D. B.,
Johnson, L. S., Jones, M., Jones, T. A., Joy, A., Kamal, M., Karlsson, E. K.,
Karolchik, D., Kasprzyk, A., Kawai, J., Keibler, E., Kells, C., Kent, W. J.,
Kirby, A., Kolbe, D. L., Korf, I., Kucherlapati, R. S., Kulbokas, E. J., Kulp,
D., Landers, T., Leger, J. P., Leonard, S., Letunic, I., Levine, R., Li, J., Li,
M., Lloyd, C., Lucas, S., Ma, B., Maglott, D. R., Mardis, E. R., Matthews,
L., Mauceli, E., Mayer, J. H., McCarthy, M., McCombie, W. R., McLaren,
S., McLay, K., McPherson, J. D., Meldrim, J., Meredith, B., Mesirov, J. P.,
Miller, W., Miner, T. L., Mongin, E., Montgomery, K. T., Morgan, M.,
Mott, R., Mullikin, J. C., Muzny, D. M., Nash, W. E., Nelson, J. O., Nhan,
M. N., Nicol, R., Ning, Z., Nusbaum, C., O’Connor, M. J., Okazaki, Y.,
Oliver, K., Overton-Larty, E., Pachter, L., Parra, G., Pepin, K. H.,
Peterson, J., Pevzner, P., Plumb, R., Pohl, C. S., Poliakov, A., Ponce, T. C.,
Ponting, C. P., Potter, S., Quail, M., Reymond, A., Roe, B. A., Roskin,
K. M., Rubin, E. M., Rust, A. G., Santos, R., Sapojnikov, V., Schultz, B.,
Schultz, J., Schwartz, M. S., Schwartz, S., Scott, C., Seaman, S., Searle, S.,
Sharpe, T., Sheridan, A., Shownkeen, R., Sims, S., Singer, J. B., Slater, G.,
Smit, A., Smith, D. R., Spencer, B., Stabenau, A., Stange-Thomann, N.,
Sugnet, C., Suyama, M., Tesler, G., Thompson, J., Torrents, D., Trevaskis,
E., Tromp, J., Ucla, C., Ureta-Vidal, A., Vinson, J. P., von Niederhausern,
A. C., Wade, C. M., Wall, M., Weber, R. J., Weiss, R. B., Wendl, M. C.,
West, A. P., Wetterstrand, K., Wheeler, R., Whelan, S., Wierzbowski, J.,
Willey, D., Williams, S., Wilson, R. K., Winter, E., Worley, K. C., Wyman,
D., Yang, S., Yang, S. P., Zdobnov, E. M., Zody, M. C., and Lander, E. S.
(2002). Initial sequencing and comparative analysis of the mouse genome.
Nature, 420:520–562.
Watson, J. D. and Crick, F. H. (1953). Molecular structure of nucleic acids;
a structure for deoxyribose nucleic acid. Nature, 171:737–738.
Webb, C. T., Shabalina, S. A., Ogurtsov, A. Y., and Kondrashov, A. S.
(2002). Analysis of similarity within 142 pairs of orthologous intergenic
regions of Caenorhabditis elegans and Caenorhabditis briggsae. Nucleic
Acids Res, 30:1233–1239.
Weinmann, A. S. and Farnham, P. J. (2002). Identification of unknown
target genes of human transcription factors using chromatin
immunoprecipitation. Methods, 26:37–47.
Wingender, E., Chen, X., Hehl, R., Karas, H., Liebich, I., Matys, V.,
Meinhardt, T., Prüß, M., Reuter, I., and Schacherer, F. (2000). TRANSFAC:
an integrated system for gene expression regulation. Nucleic Acids Res,
28:316–319.
Wray, G. A., Hahn, M. W., Abouheif, E., Balhoff, J. P., Pizer, M.,
Rockman, M. V., and Romano, L. A. (2003). The evolution of
transcriptional regulation in eukaryotes. Mol Biol Evol, 20:1377–1419.
Zervos, A. S., Gyuris, J., and Brent, R. (1993). Mxi1, a protein that
specifically interacts with Max to bind Myc-Max recognition sites. Cell, 72:223–
232.
Zhu, J., Liu, J. S., and Lawrence, C. E. (1998). Bayesian adaptive sequence
alignment algorithms. Bioinformatics, 14:25–39.