Computational investigations into

cis-regulation in eukaryotes

Laurence Ettwiller
Jesus College

A dissertation submitted to the University of Cambridge
for the degree of Doctor of Philosophy

European Molecular Biology Laboratory
European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton, Cambridge, CB10 1SD
United Kingdom

Email: [email protected]

December 22, 2005

To my grandmother, for everything she taught me, especially courage, perseverance and so many other things.

This thesis is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specifically indicated in the text.

This thesis does not exceed the specified length limit of 300 pages as defined by the Biology Degree Committee.

This thesis has been typeset in 12pt type using LaTeX 2ε according to the specifications defined by the Board of Graduate Studies and the Biology Degree Committee.

I would like to thank everyone who supported me during my PhD. This includes the Ensembl team, especially Ben Paten, Abel Ureta-Vidal, Manu Mongin, Martin Hammond and Arek Kasprzyk. I would also like to thank Ewan Birney, my supervisor, for all his help and support. Lastly, I thank my parents, my family and my friends Arnaud, Sylvain, Chloe, Wei, Shu Ching and Ling and, of course, my boyfriend Tom.

This thesis presents essentially two computational methods that I developed to locate cis-regulatory motifs in eukaryotes. Both methods are based on information that has been shown in the past to be successful in locating regulatory regions, but the approaches I used are novel.

The first method uses information about the co-regulation of genes to derive a dictionary of interesting motifs. This is done by uncovering potential mappings between the upstream regulatory sequences of genes and protein functions in S. cerevisiae. In contrast to the conventional approach, which uses groups of genes that are co-regulated on the basis of similar expression profiles, co-expression has been investigated using functional networks. The idea behind the investigation is that proteins involved in the same cellular process should be regulated in synergy. Motifs of interest should therefore be limited to a specific set of genes, and this set of genes should have a significant non-random correlation with the input functional information.

The second method uses comparative genomics and the notion that functional regions are conserved across species. This method predicts a dictionary of regulatory motifs based on their occurrence in non-coding regions that are conserved between many vertebrate species. Once the dictionary of motifs is obtained, the genome-wide distribution of the motifs is investigated and, based on these results, functional regions for transcriptional control are predicted.

Contents

1 Introduction
    1.1 Biological background
        1.1.1 DNA accessibility
        1.1.2 Trans-activator/repressor and cis regulatory elements
        1.1.3 Post-transcriptional regulation of gene expression
        1.1.4 Gene regulation and cellular function
    1.2 Experimental approaches to find regulatory regions
        1.2.1 one-by-one gene analysis
        1.2.2 High throughput analysis
    1.3 Bioinformatic approach to finding regulatory elements
        1.3.1 Background
        1.3.2 Finding over-represented motifs on unrelated sequences
        1.3.3 Phylogenetic footprinting to find cis-regulatory elements
        1.3.4 Issues
        1.3.5 Finding eukaryotic promoters

2 Finding regulatory regions using functional information in yeast
    2.1 Introduction
    2.2 Example: the nucleotide pathway in yeast
    2.3 Useful functional network
        2.3.1 Metabolic network
        2.3.2 Protein interaction
    2.4 Generating and assessing motifs
        2.4.1 Generating motifs
        2.4.2 Assessment of the motifs using functional networks
    2.5 Results
        2.5.1 Significant motifs
        2.5.2 Non-random behaviour of significant motifs
        2.5.3 Assessment of known transcription factor binding sites
        2.5.4 Inferring functionality to putative motifs
        2.5.5 Promoter scanning
        2.5.6 Discovering cis-regulatory elements using functional network in higher eukaryotes
    2.6 Conclusion

3 Evolution dynamic of cis-regulatory regions in higher eukaryotes
    3.1 Detailed analysis of a specific example: the Atonal 5 gene
        3.1.1 The Atonal 5 protein
        3.1.2 The promoter of atonal5 gene
        3.1.3 The Atonal5 motif
        3.1.4 Experimental validations
        3.1.5 Conclusion regarding this example
    3.2 Global run of promoterwise
        3.2.1 Promoterwise: the algorithm
        3.2.2 Defining the cut-off
        3.2.3 Results
        3.2.4 Genes with conserved 5’ proximal intergenic regions
    3.3 Conclusion

4 Defining a mammalian dictionary of regulatory motifs
    4.1 Introduction
    4.2 Finding functional motifs
        4.2.1 Derivation of a reliable motif dictionary
        4.2.2 Finding region of clustered motifs on the human genome
    4.3 Experimental evaluation of the methodology
        4.3.1 The FOXM1 gene
        4.3.2 The ARF3 gene
        4.3.3 The Q99JW1 gene
        4.3.4 The Q9BU67 gene
        4.3.5 The SM31 gene
        4.3.6 The ZIC1 gene
    4.4 Conclusion

5 Effect of the ATG triplet on gene expression in yeast
    5.1 Introduction
    5.2 ATG codon at the genomic level
    5.3 ATG codon at the transcript level
    5.4 The upf genes
    5.5 Conclusion

6 Conclusion
    6.1 Perspective and further work

A Publications during the PhD work

B Finding regulatory motifs using functional network in yeast: material and method
    B.1 Networks generation
        B.1.1 Metabolic network
        B.1.2 Protein interaction network
    B.2 Pattern search
    B.3 Overlap score
    B.4 Standard deviation score
    B.5 Pattern clustering and sequence logo generation

C Yeast significant motifs

Bibliography

List of Tables

2.1 Assessment of known sites
3.1 Atonal5 homologs gene names and locations
3.2 Scores for different species
3.3 Human-mouse enriched gene classes
3.4 Human-fugu enriched gene classes
3.5 Human-mouse under-represented gene classes
4.1 Table of motifs
4.2 Candidates: ensembl id
C.1 Significant motifs in yeast

List of Figures

1.1 Gene structure
1.2 Gene expression - an overview
1.3 Lac operon in bacteria
1.4 Example of a synexpression group in higher eukaryote
2.1 Example of the nucleotides pathway in yeast
2.2 Graph data structure
2.3 Overall schema
2.4 Overlap score explanation
2.5 Overlap score distribution for MDS network
2.6 Overlap score distribution for KEGG network
2.7 Overlap score distribution for Cellzome network
2.8 Overlap network for motif TGACTC
2.9 Overlap network for motif d(A)-d(T)
2.10 Motif location relative to coding start site
2.11 Promoter scanning example
3.1 Promoterwise: the schema
3.2 GFP construct under Atonal5 promoter
3.3 Conserved region 1 in the Atonal 5 promoter
3.4 Conserved region in the Atonal 5 promoter
3.5 Candidate genes for CCACCTG motif
3.6 Known Atonal5 targets with conserved motifs
3.7 Schema of the procedure
3.8 Promoterwise: positive upstream region function of the score cut-off
3.9 Promoterwise: are hits reverse-complemented?
3.10 Promoterwise: example of an inversion
3.11 GO category
4.1 Schema of the procedure
4.2 Occurrence of motifs in conserved/non conserved regions
4.3 Density function of the motif occurrence
4.4 Occurrence of motifs in conserved/non conserved regions for cg motifs
4.5 Conserved motifs in conserved regions
4.6 Density of motifwise hits around gene starts
4.7 Comparison with transfac
4.8 Motifwise example
4.9 Candidate: Foxm1
4.10 Candidate: ARF3
4.11 Candidate: Q99JW1
4.12 Candidate: Q9BU67
4.13 Candidate: SM31
4.14 Candidate: SM31 fish construct
4.15 Candidate: ZIC1
5.1 Distribution of ATG upstream of the coding start
5.2 Density distribution of expression data in yeast
5.3 Effect of the first 5’ ATG on expression in yeast
5.4 Effect of the presence of an ATG in the 5’UTR on expression in yeast
5.5 The upf mutants

Chapter 1

Introduction

This past decade has witnessed a major change in the biological sciences

due to the rapid development of high throughput technologies; in particular,

DNA sequencing. It is now possible to sequence whole genomes, and as of

now around 20 eukaryote and over 100 prokaryote genomes are either finished

or about to be finished. This includes the completion of the human genome in

2001 by the International Human Genome Sequencing Consortium (IHGSC

et al., 2001), one of the great milestones in the field of biology. Considering

that just 50 years have passed since the discovery of the structure of DNA by Crick

and Watson (Watson and Crick, 1953), this is an important advance in biology.

The availability of this flood of information of genome sequences and other

data has also revolutionized the way scientists approach biological problems.

The analysis of these data has tremendous potential from the understanding

of basic biological processes to human medicine. However, this raw informa-

tion needs to be treated using computational procedures and as a consequence

of such demand, the field of bioinformatics has blossomed.

The field of computational biology is quite large and overall its aim is to

answer biological questions using computational tools. Understanding the

mechanisms of gene regulation, and more specifically predicting cis-regulatory

elements, is one of the questions that remain mostly unsolved. This is the

subject of my PhD work presented here. I will first introduce the biological

background and then the existing computational approaches that attempt to

solve this challenge.

1.1 Biological background

Deoxyribonucleic acid (DNA), the molecule that stores the genetic information
of nearly all organisms, is a polymer built from four types of chemical units
called nucleotides. The polymer is arranged in a double helix of two
complementary anti-parallel chains. In eukaryotes the DNA is organized in
chromosomes, and the complete set of chromosomes constitutes the genome.
For example, the Homo sapiens genome has 3.2 × 10^9 base pairs (bp) in 24
distinct chromosomes (22 autosomes plus X and Y) that contain virtually all
the information any cell needs for its maintenance, propagation and
differentiation. Most Homo sapiens cells are diploid: they possess two
complete sets of chromosomes.
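
The complementary, anti-parallel arrangement of the two chains can be illustrated with a minimal Python sketch. The function name and the example sequence are illustrative assumptions and are not taken from the work described in this thesis.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in its own 5' to 3' direction."""
    # Complement every base, then reverse, because the strands are anti-parallel.
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGCAGAAAGCT"))  # AGCTTTCTGCAT
```

Because each strand fully determines the other, a motif search usually has to consider a sequence together with its reverse complement.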

One of the earliest features discovered in the DNA is the coding gene and we

know now that it is encoded on a limited physical stretch of DNA that ulti-

mately determines the sequence of a protein (see Figure 1.1 for details). The

gene is composed of exons interrupted by introns and the coding sequence

is flanked by untranslated regions (UTR) necessary for the stability and reg-

ulation of the transcript. The protein coding gene has a well understood

grammar, making it relatively easy to differentiate this feature from the rest

of the genome (Burge and Karlin, 1997).

Scientists have a good idea of the number of coding genes per mammalian
genome, currently estimated at between 25,000 and 28,000 for human (Crollius
et al., 2000) or mouse (Waterston et al., 2002). Non-coding genes have also
been characterised, those for tRNA and rRNA being the most studied.
Recently, more non-coding RNA types have been discovered (Eddy, 2001).
For example, microRNAs were found to be involved in regulating the
translation of coding mRNAs. So far, no good estimate has been given of how
many of these non-coding genes are present in the genome.

Proteins, the product of coding genes, are the basic functional molecules

of the cell and have many roles, from catalysing biochemical reactions to

regulating complex pathways. The DNA that is not coding for proteins has

a number of associated functions including gene regulation.

To be functionally active, coding genes need to be transcribed into mRNA
molecules which, in turn, are translated into proteins that may or may not
need further processing to become functional. This whole process, common to
all living organisms, is termed gene expression and is fundamental to the
understanding of life. Gene expression involves many steps that are described
in more detail in Figure 1.2. The first step, commonly called transcription,
consists of generating a pre-messenger RNA (or pre-mRNA) from a DNA
template; the intron(s) are then spliced out during the maturation of the
transcript to form a mature mRNA. In eukaryotes, most of the synthesis of
mRNA precursors is done by the RNA polymerase II complex.

[Figure 1.1 (diagram), labels: intergenic DNA, 5’UTR, exon 1, intron, exon 2, 3’UTR, coding sequence, gene.]

Figure 1.1: Typical gene structure in eukaryotes: the gene contains exon(s) and often intron(s) that are spliced out during maturation.

The mRNA is, in turn, used as a template for the synthesis of the polypeptide
chain in a process called translation, which is catalysed by ribosomes. While
being synthesised, the nascent polypeptide adopts a 3D structure and
eventually forms a native protein with biological function. Only a subset of
all possible proteins is present in a particular cell type, and it is
important to keep tight control of this subset. The presence of a protein at
the wrong time or place can be deleterious for the cell or the organism.

The regulation of gene expression is therefore crucial for living organisms

and happens in all the stages described in Figure 1.2. Nevertheless, the com-

mitment of the cell to make mRNA is the most effective point of control in

gene expression. Despite its importance, many aspects of the regulation of

transcription remain unclear. What is known is that some of its elements lie

mainly in the intergenic DNA; the other elements are epigenetic, but it is

not clear in what proportion the epigenetic factors influence gene regulation.

It is the success of recruiting the transcriptional machinery a few base pairs

upstream of the start of the gene that determines the expression of the gene.

As we will see below, many levels of regulation dictate this success or failure.


[Figure 1.2 (diagram): a double-stranded DNA segment is transcribed (TRANSCRIPTION) into the mRNA AUG CAG AAA GCU, which is translated (TRANSLATION) into the protein Met-Gln-Lys-Ala.]

Figure 1.2: The central dogma in biology: One DNA strand is used as a template to synthesise a pre-mRNA by the RNA polymerase (transcription). This pre-mRNA is then matured into an mRNA and, consequently, is used as a template by the ribosome machinery to produce a polypeptide (translation).
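
The two steps summarised in Figure 1.2 can be sketched in a few lines of Python, using the short sequence AUG CAG AAA GCU that appears in the figure. This is only an illustration: the function names are assumptions, the codon table is deliberately restricted to the four codons of the example (a full table has 64 entries), and splicing is ignored.

```python
# Only the four codons needed for this example; the full genetic code has 64.
CODON_TABLE = {"AUG": "Met", "CAG": "Gln", "AAA": "Lys", "GCU": "Ala"}


def transcribe(coding_strand: str) -> str:
    """The mRNA has the sequence of the coding strand, with U in place of T."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> list[str]:
    """Read the mRNA codon by codon, starting from its first base."""
    usable = len(mrna) - len(mrna) % 3  # ignore an incomplete trailing codon
    codons = [mrna[i:i + 3] for i in range(0, usable, 3)]
    return [CODON_TABLE[codon] for codon in codons]


mrna = transcribe("ATGCAGAAAGCT")
print(mrna)             # AUGCAGAAAGCU
print(translate(mrna))  # ['Met', 'Gln', 'Lys', 'Ala']
```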


1.1.1 DNA accessibility

The transcription machinery, as well as the necessary associated proteins,
needs to physically reach the location on the DNA molecule that permits the
complex to start transcription. Yet the genome is compressed by a linear
factor of about 1 × 10^4 to 1 × 10^5. This compression is achieved by proteins
(mainly histones, but also non-histone proteins) that, together with the DNA,
form a dynamic polymer called chromatin. The degree and type of compression
vary according to many factors that influence the chromatin conformation. At
a much higher level, this compression forms a well-defined chromosomal
architecture with densely and loosely packed regions. Both the chromosomal
architecture and the chromatin structure determine the accessibility of the
DNA to the transcriptional machinery.

1. Chromosomal architecture: Recent technical advances have provided
evidence of discrete territories within an individual chromosome, where some
parts of the DNA are deeply buried and others are easily accessible to a
battery of proteins (Cremer and Cremer, 2001). This architecture is
well-defined and is specific to the cell type or developmental stage, so that
different parts of the DNA are accessible in different cells. The location of
a DNA region relative to other regions in the nucleus, such as the
interchromatin compartment or the nuclear lamina, is very important for gene
expression, and a remodelling of this architecture leads to a long-term
change in gene expression.

2. Chromatin conformation: Even if a DNA region lies in a less condensed
area of the nucleus, the local structure of the chromatin affects the
accessibility of the transcription start site. It has been shown that many
post-translational modifications of the histones determine the state of the
chromatin (open or closed), and only open chromatin allows efficient gene
transcription. Histone modifications include acetylation, phosphorylation and
methylation, and the combinatorial nature of these modifications has led to
the proposal of a histone code (Turner, 2000), along the same lines as the
genetic code.


1.1.2 Trans-activator/repressor and cis regulatory elements

In exposed open chromatin regions, trans-activators and repressors play a

key role in gene expression.

These trans-acting elements are proteins that either bind directly to the DNA
or bind to another trans-acting factor. The mode of control depends on the
nature of the protein: it usually enhances or inhibits the initiation of
transcription directly, but it can also act by modifying the chromatin
structure. A transcription factor that binds DNA generally has two domains,
the activation domain and the DNA-binding domain, and may form homo- or
heterodimers. The binding to DNA is usually sequence-specific, meaning that
selectivity is given by direct contact between the polypeptide chain of the
protein and the exposed edges of the base pairs in the DNA (usually in the
major groove). These direct interactions can be complemented by the
bendability of the DNA, but this is usually a secondary effect. Each
transcription factor recognises a specific DNA sequence called a
cis-regulatory element.

These elements are usually located in intergenic DNA around the gene that

they regulate, but can also be found in introns (especially the first intron

(Majewski and Ott, 2002)). The promoter is the region located directly

upstream of the gene. In addition to containing gene-specific regulatory ele-

ments, the promoter can also contain all the necessary binding sites for the

basal transcription machinery like the CAAT or the TATAA box, though not

all promoters contain these signals.

Many mammalian promoters also contain so-called ’CpG islands’. CpG is a
special dinucleotide in the human genome. In higher eukaryotes a significant
number of CpG dinucleotides are methylated, and the methylated nucleotide is
misrecognised by the DNA polymerase machinery with a higher frequency than
the background mutation rate (Sved and Bird, 1990). The amount of CpG in the
genome is therefore much lower than expected. In cis-regulatory regions,
methylation occurs less frequently around functional elements in order to
keep the chromatin open, so the fraction of CpGs is higher there. As a
result, CpG-rich regions around genes, called CpG islands, are easily
recognised.
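
To make "lower than expected" concrete, the sketch below computes a commonly used observed/expected CpG ratio, (number of CpG × sequence length) / (number of C × number of G). The choice of this particular ratio, the function name and the toy sequences are assumptions made for illustration and are not taken from this thesis.

```python
def cpg_observed_expected(seq: str) -> float:
    """Observed/expected CpG ratio: (#CG * N) / (#C * #G)."""
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0  # no CpG is possible without both C and G
    # "CG" cannot overlap itself, so str.count gives the exact number of CpGs.
    return seq.count("CG") * n / (c_count * g_count)


print(cpg_observed_expected("CGCGCGCG"))  # 2.0 (CpG-rich toy sequence)
print(cpg_observed_expected("CCCCGGGG"))  # 0.5 (same base content, one CpG)
```

A ratio well below 1 over the bulk of the genome, against values closer to 1 in a window near a gene, is the kind of contrast that makes CpG islands stand out.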

Cis-regulatory elements found further away from the genes are in regions

called modules (also called enhancers or locus-control regions). A module is

defined as a cluster of binding sites that produces a discrete aspect of the

total transcription profile. A single module typically contains about 6 to 15

binding sites and binds 4 to 8 different transcription factors (Arnone and

Davidson, 1997).

Variation in the affinity of a binding site is commonly achieved by slightly
changing the nucleotide sequence of the element. This variability in the
sequence element allows fine-tuned control of the expression of the gene, but
also implies that the strictly conserved sequence can be very small
(typically 6-10 bp) and very difficult to detect against the background
noise. An excellent review of cis-regulatory sites is given by Wray et al.
(2003).
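
A back-of-the-envelope calculation shows why such short sites are hard to separate from background. The sketch below assumes an idealised random sequence with equal base frequencies, which real genomes do not follow; the function name and the 1000 bp example length are likewise assumptions.

```python
def expected_chance_hits(motif_length: int, sequence_length: int,
                         both_strands: bool = True) -> float:
    """Expected number of exact matches to one fixed motif in random DNA
    with equal base frequencies (an idealised background model)."""
    positions = max(sequence_length - motif_length + 1, 0)
    hits = positions * 0.25 ** motif_length
    # Scanning both strands doubles the count (for non-palindromic motifs).
    return 2 * hits if both_strands else hits


# Chance matches expected in 1000 bp of upstream sequence, per motif length.
for k in (6, 8, 10):
    print(k, round(expected_chance_hits(k, 1000), 4))
```

Under these assumptions a specific 6-mer is expected to occur by chance roughly once in every two 1000 bp upstream regions, so an isolated match carries very little evidence on its own.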

1.1.3 Post-transcriptional regulation of gene expression

Once the pre-mRNA is synthesised, transcript maturation and turnover, as

well as translation, are mechanisms under tight control as well. For example,

the rate of translation of the ferritin heavy chain mRNA is controlled by the

iron-responsive element (IRE) binding protein that acts as a translational

repressor by binding to the IRE site located on the transcript (Munro et al.,

1988).

Post-transcriptional regulation is mainly achieved by controlling the rate of

degradation of the messenger RNA. Indeed, at any moment the total amount

of a specific transcript in the cell is the result of two antagonistic processes,

namely the rate of RNA synthesis (transcription), and the rate of degrada-

tion (RNA catabolism). The degradation process is an active process that

involves many regulatory and enzymatic steps. It seems to be a waste of

energy to actively degrade a transcript, but it has been shown that degra-

dation is also a powerful mechanism for gene regulation. Furthermore, each

transcript has a different degradation rate, and this rate can vary greatly

from condition to condition (cell type, cell cycle, stress).
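
The balance between synthesis and degradation can be illustrated with the simplest possible kinetic model, dM/dt = k_syn - k_deg * M, whose steady state is M = k_syn / k_deg. The constant synthesis rate and first-order decay are modelling assumptions introduced here for illustration, as are the function names and the rate values.

```python
import math


def steady_state_level(synthesis_rate: float, degradation_rate: float) -> float:
    """Transcript level at which synthesis and degradation balance."""
    return synthesis_rate / degradation_rate


def half_life(degradation_rate: float) -> float:
    """Half-life of a transcript under first-order decay."""
    return math.log(2) / degradation_rate


def simulate(synthesis_rate: float, degradation_rate: float,
             m0: float = 0.0, dt: float = 0.01, t_end: float = 100.0) -> float:
    """Euler integration of dM/dt = synthesis_rate - degradation_rate * M."""
    m = m0
    for _ in range(int(t_end / dt)):
        m += dt * (synthesis_rate - degradation_rate * m)
    return m


# Same synthesis rate, two decay rates (arbitrary units): the stable
# transcript accumulates to a five-fold higher level.
print(steady_state_level(10.0, 0.1), steady_state_level(10.0, 0.5))
```

In this simple model, changing the degradation rate alone shifts the transcript level as effectively as changing the synthesis rate, which is why degradation can serve as a regulatory mechanism.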


Many pathways of mRNA turnover have been reported in the literature

(Parker and Song, 2004). The most studied process involves shortening of

the poly(A) tail followed by the decapping of the transcript, and finally the

5’-3’ exonucleolytic degradation. Other pathways involve either the direct

decapping of the transcript, the use of a 3’ to 5’ exonucleolytic decay or the

use of endonucleolytic cleavage by endonucleases.

Each transcript has a different intrinsic susceptibility to degradation by
these pathways and, in addition, cis- or trans-factors can act upon the
transcript and change its rate of degradation. For example, it has been shown
that premature termination codons trigger the decapping of the mRNA, which
exposes the transcript to 5’ to 3’ exonuclease degradation (Maquat and
Carmichael, 2001). This process, known as nonsense-mediated mRNA decay (NMD),
is used as a surveillance mechanism to promptly remove mRNAs carrying
frameshift or nonsense mutations.

The exact mechanism of NMD remains obscure, but it is known to be tightly
coupled with translation by ribosomes. In mammals, if translation terminates
more than 50-55 nucleotides upstream of the last exon-exon junction, the
termination is considered premature and NMD is triggered. In yeast, where
fewer transcripts bear introns, NMD seems to be triggered when a significant
portion of the mRNA length is free of ribosomes. In yeast, genetic studies
identified three proteins that are involved in NMD (upf1, upf2 and upf3).
Mutation in any one of these three proteins leads to a defective NMD without
affecting the other degradation processes (Schell et al., 2003; Cui et al.,
1995). Homologues of upf1, upf2 and upf3 are found in human and were shown
to be involved in NMD as well.
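
The mammalian rule of thumb stated above can be written as a small predicate. This is a hedged sketch: the function name, the coordinate convention (positions counted along the spliced mRNA) and the default threshold of 50 nucleotides, the lower end of the 50-55 range, are assumptions.

```python
def triggers_nmd(stop_codon_end: int, exon_junctions: list[int],
                 threshold: int = 50) -> bool:
    """Apply the mammalian 50-55 nt rule to one spliced transcript.

    stop_codon_end: mRNA coordinate of the last base of the stop codon.
    exon_junctions: mRNA coordinates of the exon-exon junctions.
    """
    if not exon_junctions:
        return False  # no junction, so the rule cannot apply
    last_junction = max(exon_junctions)
    # Premature if termination lies more than `threshold` nt upstream.
    return last_junction - stop_codon_end > threshold


print(triggers_nmd(300, [200, 400]))  # True: stop is 100 nt upstream
print(triggers_nmd(380, [200, 400]))  # False: stop is only 20 nt upstream
```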

Since most eukaryotic translation happens by a scanning process and not via
internal ribosome entry, a premature stop codon introduced by an upstream
open reading frame (uORF), for example, can result in an extended 3’ region
of the transcript free of ribosomes and turn on the NMD process for that
transcript. Chapter 5 is devoted to the study of uORFs in yeast transcripts
and their effect on gene expression.
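
As a minimal illustration of what a uORF is, the following sketch scans a 5’ UTR for ATG triplets that are followed by an in-frame stop codon within the UTR. It is not the procedure used in Chapter 5; the function name and the toy sequences are hypothetical, and uORFs that run into the main coding sequence are deliberately ignored.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_uorfs(five_prime_utr: str) -> list[tuple[int, int]]:
    """Return (start, end) pairs, 0-based and end-exclusive, for every ATG in
    the 5' UTR that is followed by an in-frame stop codon inside the UTR."""
    utr = five_prime_utr.upper()
    uorfs = []
    for start in range(len(utr) - 2):
        if utr[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the frame set by this ATG.
        for pos in range(start + 3, len(utr) - 2, 3):
            if utr[pos:pos + 3] in STOP_CODONS:
                uorfs.append((start, pos + 3))
                break
    return uorfs


print(find_uorfs("CCATGAAATAGCC"))  # [(2, 11)]
```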


1.1.4 Gene regulation and cellular function

Sets of genes are usually expressed simultaneously in order to produce pro-

teins that, together, perform a given task. To be functionally active, pro-

teins need to associate with others, and the type of association defines the

functional information. Consequently, genes that are co-regulated are of-

ten functionally related. This has been proven to be true for many cases,

both in eukaryotes and prokaryotes, despite a very different mechanism of

co-regulation between these two groups.

1.1.4.1 Mechanisms of co-regulation of functionally related proteins in prokaryotes

Co-regulation in prokaryotes is often due to operons, and even though this
work involves only eukaryotes, operons are a good example of a well-studied
mechanism that keeps functionally related genes under similar regulation.

Operons in prokaryotes were first described by F. Jacob and J. Monod in
1960 (Jacob et al., 1960). The operon is a coordinately regulated unit that
contains a set of genes. At the genomic level this unit consists of genes
that are contiguous on the same strand of DNA and a regulatory unit located
directly upstream. Operons have been studied for many years, and it has been
shown that, in most cases, operon units contain functionally related genes,
often the complete set of genes involved in one particular pathway. This
organisation is believed to be advantageous for the coordinated expression of
related genes, but it has also been suggested that operons play an important
role in gene transfer, because a complete functional unit can be given to
another bacterium by the transfer of a single limited stretch of DNA
(Lawrence and Roth, 1996). On the genome, the presence of operons leads to
two interesting features: functionally related genes remain together even
across many species, and genes within operons have much shorter intergenic
distances. Based on these observations, one study estimated a total of
630-700 operons in E. coli (Salgado et al., 2000). Figure 1.3 shows one of
the most studied operons in E. coli, the lac operon.
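
The two genomic signatures mentioned above (same strand, short intergenic distance) suggest a very simple grouping heuristic, sketched below. This is an illustration only and not the method of the cited study: the `Gene` record, the gene names, the coordinates and the 50 bp gap threshold are all hypothetical.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int
    end: int
    strand: str  # "+" or "-"


def group_into_operons(genes: list[Gene], max_gap: int = 50) -> list[list[str]]:
    """Group adjacent same-strand genes separated by at most max_gap bp."""
    operons: list[list[str]] = []
    previous = None
    for gene in sorted(genes, key=lambda g: g.start):
        if (previous is not None and gene.strand == previous.strand
                and gene.start - previous.end <= max_gap):
            operons[-1].append(gene.name)  # extend the current candidate operon
        else:
            operons.append([gene.name])    # start a new candidate operon
        previous = gene
    return operons


genes = [Gene("geneA", 100, 1000, "+"), Gene("geneB", 1020, 2000, "+"),
         Gene("geneC", 2010, 2500, "+"), Gene("geneD", 3000, 3900, "-")]
print(group_into_operons(genes))  # [['geneA', 'geneB', 'geneC'], ['geneD']]
```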

Operons occur rarely in eukaryotes, apart from nematodes, where a large
portion of genes is arranged in operons. The mechanism for nematode operons
is entirely different from that in bacteria. The bacterial operon produces a
polycistronic mRNA, while the nematode produces a polycistronic pre-mRNA
that is trans-spliced into many monocistronic mRNAs. As in prokaryotes, genes
that encode functionally related proteins have been shown to occur often in
the same operon, suggesting a similar selection pressure in nematodes to
co-express functionally related proteins. In higher eukaryotes operon
structures have not been characterized and seem unlikely to occur.

[Figure 1.3 (diagram): Lactose operon in E. coli: promoter, operator and the lac operon structural genes, transcribed and translated into galactosidase, permease and transacetylase.]

Figure 1.3: The lac operon in E. coli consists of 3 genes involved in the catabolism of lactose. These genes are under the control of a single promoter that is repressed by the operator in the absence of lactose. Once the promoter is activated a polycistronic mRNA is synthesised.

1.1.4.2 Mechanisms of gene expression in eukaryotes

Co-expression of genes that are involved in a common process have been

widely reported. An excellent review article by (Niehrs and Pollet, 1999)

summarises the current knowledge of co-expression of functionally related

genes in eukaryotes, or what the authors call a ’synexpression group’. In

yeast, where expression of the entire transcriptome can be easily monitored,

synexpression groups were reported in various biological processes like the

cell cycle, metabolism or protein bio-synthesis. Synexpression groups have

also been wildly reported in higher eukaryotes, including mammalian organ-

isms. For example, genes involved in the synthesis of cholesterol also have a


reductaseHMG CoA

C5C6 C15 C30 C30 C30 C29C2

C4+

C5

IPP isomerase

Squaleneepoxidase

Cyt. P450demethylase

FDPfarnesyltransferase

CholesterolC27

B

A

Figure 1.4: A) biosynthesis pathway for the production of cholesterol inhumans.(from (Niehrs and Pollet, 1999)) B) expression profiles of HMG CoAreductase (1) IPP isomerase (triplicate)(2-4), farnesyl-diphosphate farnesyltransferase (5), squalene epoxidase (6), Cytochrome P450 lanosterol 1,4-alfa-demethylase (7) in starved human fibroblasts after serum addition. These 7genes have similar expression profiles and are functionally related by beingpart of the same metabolic pathway (Iyer et al., 1999).

very similar expression in starved human fibroblasts after serum addition, as shown in Figure 1.4 (Niehrs and Pollet, 1999; Iyer et al., 1999).

Since proteins involved in the same biological process often physically interact or form complexes, synexpression has also been correlated with protein interactions or complexes in yeast (Ge et al., 2001) as well as in Drosophila (Walhout et al., 2002).

In contrast to prokaryotes, where the mechanism of regulation of operons is well understood, much less is known about the mechanism of co-expression of synexpression groups in eukaryotes. For instance, some current models suggest that tissue-specific genes in higher eukaryotes are arranged in discrete, independently controlled segments of chromatin. Enhancers and locus-control regions (LCRs) can also affect many genes. A well-known example is the globin cluster in humans. The globin genes are under the control of a single LCR that lies far upstream of the cluster and appears to act by controlling chromatin condensation (Bungert et al., 1995). Many LCRs are thought to be present in the human genome, and they regulate a variety of cell-type-specific genes.

Nevertheless, despite this higher level of control, regulation can also be achieved by the binding of regulatory proteins to elements in the proximal promoter

and, therefore, co-regulated genes should have significantly more of a given cis-regulatory motif in their upstream sequences. This was shown to be true in yeast by Hughes et al. (2000), who used sets of genes grouped from different sources (YPD, the Munich Information Center for Protein Sequences, SGD).

1.2 Experimental approaches to find regulatory regions

Cis-regulatory elements have been studied for decades by a myriad of sci-

entists. Most of the techniques they have developed are labor-intensive and

difficult to scale up and, consequently, focus on one gene or one element of

regulation. More recently, global approaches to deciphering gene regulation

in a genome-wide manner have been applied.

1.2.1 One-by-one gene analysis

Many techniques have been developed to localise the binding sites of regulatory proteins. The most widely used are DNase footprinting (Leblanc and Moss, 2001) and the mobility shift assay. Both are based on the change in the physical properties of a DNA fragment when proteins bind to it specifically. The first method, DNase footprinting, relies on the fact that the bound protein protects the DNA from degradation by DNase I. The other method, the mobility shift assay (Chan et al., 2004), relies on the altered mobility of the DNA fragment on a non-denaturing gel when the protein is bound to the DNA.

Another approach is genetic analysis, where the isolation of mutants in the DNA binding site helps to identify which residues in the binding site are important (Walter and Biggin, 1996).

Even though attempts have been made to apply these techniques in a high-throughput manner to the entire genome, these approaches remain time-consuming and, consequently, can only be applied to a few cases at a time.

1.2.2 High throughput analysis

To study the entire transcriptome of an organism, a number of high-throughput methods have been developed. Two methods are particularly relevant for deriving cis-regulatory regions. The first approach is indirect and involves micro-array technology; the second, ChIP, attempts to directly locate the regions that are important for the binding of trans-acting factors.

Micro-array analysis monitors the relative amount of transcript in a pop-

ulation of cells at a given time by measuring the hybridisation between an

immobilised DNA or oligonucleotide sequence and the corresponding cDNA

derived from the sample. This measure can either be absolute (that is, the

intensity of hybridisation in relation to the amount of transcript in the

cell) or relative (the intensity of hybridisation in condition 1 relative to the

intensity of hybridisation in condition 2, in order to measure differential expression). For the relative measure, the cDNAs from the two samples are each labelled with a different dye and hybridised onto the same immobilised probe. By repeating the measure at different times and/or under different

conditions, it is possible to obtain the expression profiles for a large set of

genes. Genes that have similar expression profiles are said to be co-regulated.

Because co-regulated genes are believed to be under the control of a similar

set of transcription factors, these genes should possess common regulatory

regions. Micro-arrays therefore only indirectly find cis-regulatory regions by providing co-regulation information, but this approach has proved very successful, particularly in S. cerevisiae (Brazma et al., 1998; Hughes et al., 2000).
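The grouping of genes by similar expression profiles described above can be illustrated with a minimal sketch. It assumes that each gene has a list of two-channel intensities, converts them to log2 ratios, and compares profiles with the Pearson correlation coefficient. The function names, the gene names and the 0.9 threshold are hypothetical choices for illustration only; the studies cited above used their own clustering procedures.

```python
import math


def log_ratio_profile(cond1, cond2):
    """Convert paired two-channel intensities into log2 ratios (relative measure)."""
    return [math.log2(a / b) for a, b in zip(cond1, cond2)]


def pearson(x, y):
    """Pearson correlation between two profiles of equal length.

    Assumes both profiles vary (a constant profile gives a division by zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coexpressed_pairs(profiles, threshold=0.9):
    """Return gene pairs whose profiles correlate at or above the threshold.

    `profiles` maps a gene name to its list of log ratios; the threshold is
    an arbitrary illustrative value, not one taken from the literature.
    """
    genes = sorted(profiles)
    pairs = []
    for i, first in enumerate(genes):
        for second in genes[i + 1:]:
            if pearson(profiles[first], profiles[second]) >= threshold:
                pairs.append((first, second))
    return pairs


if __name__ == "__main__":
    example = {
        "geneA": [0.0, 1.0, 2.0, 1.0],
        "geneB": [0.1, 1.1, 2.1, 1.1],
        "geneC": [2.0, 1.0, 0.0, 1.0],
    }
    print(coexpressed_pairs(example))
```

In this toy example geneA and geneB rise and fall together and are paired, whereas geneC is anti-correlated with both and is left out.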

Although yeasts are eukaryotes and therefore have greater complexity than bacteria, they share many of the technical advantages that make bacteria easy to handle in diverse investigations. Furthermore, the genomic organisation of yeast is much less complex than that of higher eukaryotes; it has therefore been harder to find cis-regulatory motifs in higher eukaryotes using microarrays.

Chromatin immunoprecipitation (ChIP) (Weinmann and Farnham, 2002)

does not monitor gene expression per se but instead directly investigates the interactions between proteins (for example, transcription factors)

and DNA. Coupled with whole genome DNA microarrays, ChIP allows the

identification of the DNA binding sites of any given transcription factor and

by extension, can be used to infer possible co-expression. In both the micro-array and chromatin IP approaches, bioinformatic tools are needed to identify over-represented motifs that are believed to be cis-regulatory elements.

1.3 Bioinformatic approach to finding regulatory elements

1.3.1 Background

Bioinformatics is based on the prediction of certain characteristics of biolog-

ical entities. These entities can be sequences and in this case, one of the

most common approaches is to find other related sequences in order to infer

function or to gather more information about the given sequence. Finding

related sequences is achieved by using alignment algorithms that also pro-

duce the best alignment.

Homologous sequences derived from a common ancestor can undergo substitution, insertion and deletion, and the rate of these changes varies according to the selection pressure. Alignment algorithms should therefore take all these events into account. Many such algorithms have been developed and can be classified

according to their characteristics. These tools can be clustered roughly into

global and local algorithms which, in turn, can be separated into pair-wise

and multiple alignment methods. Pairwise global alignment algorithms such as that of Needleman and Wunsch (1970) consider the entire sequences, whereas local alignment algorithms such as that of Smith and Waterman (1981) focus on the region of greatest similarity. Fasta (Pearson, 1991) and Blast (Altschul et al., 1990), both pair-wise local alignment algorithms, provide rapid alternatives to the Smith-Waterman algorithm by first finding exactly matching words. This step confines the subsequent search to a small fraction of the

entire search space. Many other alignment algorithms have been developed,

each to answer specific questions.
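To make the distinction between global and local alignment concrete, the following is a minimal sketch of Smith-Waterman local alignment with a linear gap penalty. The scoring values (match 2, mismatch -1, gap -2) are arbitrary assumptions chosen for illustration, and the published algorithm allows more general gap costs.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b).

    Minimal local alignment with a linear gap penalty; the scoring
    parameters are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; a cell never drops below zero, which is what makes
    # the alignment local rather than global.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("TTTACGCGTAAA", "GGGACGCGTCCC"))
```

With these two toy sequences only the shared core is aligned and the unrelated flanks are ignored, which is the behaviour that distinguishes a local from a global method.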

As outlined earlier, binding sites on DNA for transcription factors are usually

very small, and two identical binding sites usually are not due to common


ancestry. Alignment algorithms are of use only if the surrounding sequence is

believed to be derived from the same ancestor and the identity high enough.

That is often not the case and, therefore, algorithms based on finding over-represented motifs in a set of sequences are sometimes a better approach to finding protein binding sites on DNA. As we have seen in section 1.2.2, the set of sequences can be, for example, derived from microarray analysis, and its members are not believed to share a common ancestor.

1.3.2 Finding over-represented motifs on unrelated sequences

Typically, data derived from microarray analysis were clustered into groups of co-expressed genes that are believed to have common motifs in the corresponding upstream regions. These studies have been done mostly on yeast, and many algorithms to find over-represented motifs have been developed. MEME (Bailey and Elkan, 1995), AlignACE (Roth et al., 1998), DIALIGN (Morgenstern et al., 1998) and Teiresias (Rigoutsos and Floratos, 1998) are four examples of such techniques, but many more have been reported in the literature (Hertz and Stormo, 2000; Brazma et al., 1998; Hughes et al., 2000).
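The idea of over-representation can be sketched in a few lines, under strong simplifying assumptions. The example below counts, for every exact k-mer, the number of sequences that contain it in a foreground set (for example, upstream regions of co-expressed genes) and in a background set, and ranks k-mers by the ratio of the two frequencies. This is not how MEME, AlignACE or the other cited tools work; the function names, the pseudocount and the default word length are hypothetical.

```python
from collections import Counter


def kmers_present(seq, k):
    """Set of distinct k-mers occurring in one sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sequence_counts(seqs, k):
    """For each k-mer, count the number of sequences that contain it."""
    counts = Counter()
    for seq in seqs:
        counts.update(kmers_present(seq.upper(), k))
    return counts


def enriched_kmers(foreground, background, k=6, pseudocount=1.0):
    """Rank k-mers by (fraction of foreground) / (fraction of background).

    The pseudocount avoids division by zero for k-mers that are absent
    from the background; its value is an arbitrary assumption.
    """
    fg = sequence_counts(foreground, k)
    bg = sequence_counts(background, k)
    scored = []
    for kmer, n_fg in fg.items():
        f_fg = (n_fg + pseudocount) / (len(foreground) + pseudocount)
        f_bg = (bg[kmer] + pseudocount) / (len(background) + pseudocount)
        scored.append((f_fg / f_bg, kmer))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored


if __name__ == "__main__":
    fg = ["TTACGCGTAA", "GGACGCGTCC", "CAACGCGTTG"]
    bg = ["TTTTTTTTTT", "GGGGGGGGGG", "CACACACACA", "ATATATATAT"]
    print(enriched_kmers(fg, bg)[0])
```

A simple ratio gives no measure of statistical significance, and exact words cannot capture degenerate positions, which is why the published methods rely on probabilistic models or pattern languages instead.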

1.3.3 Phylogenetic footprinting to find cis-regulatory elements

1.3.3.1 Introduction

Evolutionary information is used extensively in computational biology to in-

fer function. For example, if two entities share features, then knowledge can

be inferred between them; if two genes share sequence homology, and hence

a common ancestor, they are likely to share a similar function. This notion

has been widely applied in bioinformatics and is routinely used in automated

genome annotation.

Different functional elements in the genomes are under different selection

pressure. A good example of this is the coding region, where substitution at the third codon position is far more common than at the other positions. Because of the degeneracy of the genetic code, mutation of the third nucleotide is


generally silent (referred to as synonymous changes). Regulatory regions are

under different selection pressure than the non-functional DNA and, conse-

quently, evolution can be used as a tool to locate them as well.

The discovery of regulatory regions in intergenic DNA through cross-species comparison is often termed phylogenetic footprinting, by analogy with DNase footprinting (Tagle et al., 1988). It is based on the observation that functionally important regions tend to have a lower substitution rate than non-functional regions, and it can therefore be used to predict transcription factor binding sites (TFBS). The technique has long been applied to well-studied genes. Typically, the homologues of the gene of interest are found in many related species and, after sequencing the upstream regions or DNase hypersensitive sites, various alignment techniques are used to locate the specific region of interest that is most probably involved in transcription regulation.

However, the protocol used for phylogenetic footprinting depends largely

on the gene studied. Indeed, for genes that play key roles in general bio-

logical processes, very few but distant species are used. For example, in the

study of the stem cell leukemia gene (bHLH transcription factor) promoter

region the authors used human, mouse, chicken, pufferfish (fugu) and ze-

brafish (Gottgens et al., 2002). For genes that are involved in taxa-specific

processes, remote species do not have homologues, and pair-wise comparison

with related species will not have enough resolving power. Recent approaches have used phylogenetic shadowing (the use of the additive collective divergence of many very closely related species to distinguish functional sites) with success. As more and more fully sequenced genomes appear, phylogenetic shadowing is bound to give very interesting results in the future, and not only for taxa-specific genes. At present, owing to the lack of fully sequenced closely related genomes, phylogenetic shadowing can only be applied to very specific examples where enough orthologous sequences are available. This is the case for the study of the mammalian growth hormone gene, which involves 13 different yet related mammals (Krawczak et al., 1999). More recently, another group has applied phylogenetic shadowing with great success to different regions of the human genome, using a total of 13 to 17 different primates (Boffelli et al., 2003).


Genome-wide phylogenetic footprinting, as opposed to gene-centric phylogenetic footprinting, is a fairly new technique because it requires the completion of at least two related genomes. The general strategy described in most previous work has been to align sequences from orthologous pairs in two or more species and to predict TFBS using known position weight matrices. Most of these tools have graphical interfaces to display the results. Interestingly, the first eukaryotic organisms to be compared were higher eukaryotes such as human and mouse. With the newly sequenced yeast genomes of S. paradoxus, S. mikatae and S. bayanus (Kellis et al., 2003), phylogenetic footprinting is now successfully applied on a large scale in yeast. As most complex studies have traditionally been done first in yeast, an attempt to find binding sites de novo using only phylogenetic and co-occurrence information was made by Chiang et al. (2003). They found around 1000 closely spaced hexamer pairs that are conserved in at least 3 yeast species. Many of these hexamers correspond to known transcription factor binding sites. Another study (Kellis et al., 2003) looked at the conservation scores of motifs and found 72 genome-wide elements, including most of the known regulatory motifs as well as new motifs.

1.3.3.2 Methods for phylogenetic footprinting

As seen above, alignment tools have been developed in order to estimate if

DNA or protein sequences are derived from the same ancestor. In the case

of cis-regulatory regions, alignment techniques have been used extensively.

This approach consists of aligning regions of homology in the non-coding sequences in the vicinity of orthologous genes from two or more species. Most of the work has been done on well-studied examples such as the alpha-globin cluster (Flint et al., 2001), the SCL locus (Gottgens et al., 2002), the Hoxb4 gene (Aparicio et al., 1995) and other regions that often correspond to loci involved in human disease (Loots et al., 2000; Dubchak et al., 2000). Nevertheless, more general analyses have been done on whole genomes or functional subsets (Levy et al., 2001; Webb et al., 2002; Elnitski et al., 2003; Dieterich et al., 2002).

Because regulatory elements tend to be quite short conserved sequences relative to the background noise, and because the order and orientation of these elements are not always conserved, algorithms such as DBA (Jareborg et al., 1999) or the Bayes block aligner (Zhu et al., 1998), which focus on aligning highly conserved ungapped blocks while allowing large gaps, are theoretically better suited to identifying cis-regulatory regions. The work in chapter 3 uses Promoterwise, an alignment program derived from DBA, to analyse intergenic regions in higher eukaryotes. In practice, any alignment technique will pick up regulatory elements located in modules of long, highly conserved regions. The question remains of how many regulatory elements are located in non-conserved sequences. The answer is species-dependent, but an increasing number of studies show evidence of modular organisation of cis-regulatory sites (Berman et al., 2002), while other studies have shown examples of regulatory elements lying in regions of very low sequence identity as well.

Substantial resolving power is added by including more than two sequences in a multiple sequence alignment, since each lineage has diverged independently after separation from a common ancestor. Programs that perform such alignments include Yama2 (Chao et al., 1993), ClustalW (Thompson et al., 1994), Multalign (Corpet, 1988) and Dialign (Morgenstern et al., 1998). Since Dialign does not use a gap penalty and starts by identifying short conserved regions, it is better suited to identifying regulatory regions than ClustalW.

Once the alignment is made, conserved sequences need to be located. For alignments involving only two sequences, a simple metric of X % conservation over at least Y nucleotides is usually used. Dubchak et al. (2000) used two alignments of a 200 kb sequence (human 5q31), human with dog and human with mouse, to define cutoff criteria X and Y for conserved sequences based on maximising the percentage of regions that are common to the three species. In other cases, a simple ranking of identity scores seems to give better results than fixed settings (Flint et al., 2001). For multiple alignments, more parameters need to be taken into account, such as the phylogenetic relationship between species or the nucleotide frequencies at each position.
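The "X % conservation over at least Y nucleotides" criterion for a pairwise alignment can be sketched as a sliding window over the aligned columns. The example below is only an illustration: the function name, the default values of X (`min_identity`) and Y (`window`), and the choice to merge overlapping windows are assumptions, not the settings of any cited study.

```python
def conserved_regions(aln_a, aln_b, min_identity=0.7, window=100):
    """Return half-open (start, end) column intervals that pass the cutoff.

    aln_a and aln_b are the two rows of a pairwise alignment (same length,
    '-' for gaps). A window passes when at least `min_identity` of its
    columns are identical non-gap characters; passing windows that overlap
    or abut are merged into one region.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    n = len(aln_a)
    match = [1 if (x == y and x != "-") else 0
             for x, y in zip(aln_a.upper(), aln_b.upper())]
    regions = []
    if n < window:
        return regions

    current = sum(match[:window])  # identical columns in the first window
    start = 0
    while True:
        if current / window >= min_identity:
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], start + window)
            else:
                regions.append((start, start + window))
        if start + window >= n:
            break
        # Slide the window one column to the right.
        current += match[start + window] - match[start]
        start += 1
    return regions


if __name__ == "__main__":
    a = "ACGT" * 5 + "A" * 10 + "ACGT" * 5
    b = "TGCA" * 5 + "A" * 10 + "TGCA" * 5
    print(conserved_regions(a, b, min_identity=0.8, window=10))
```

Because every window that reaches the threshold is kept, the reported region extends slightly beyond the perfectly conserved core, which illustrates why the choice of X and Y affects the boundaries that are reported.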

Because alignment only provides the information of what region is common

to two or more species, the challenge for these techniques is to assess if these

regions of homology are indeed involved in regulation. This is why alignment

has often been used in conjunction with known transcription factor binding

sites, usually from the Transfac database (Wingender et al., 2000).


Although motif over-representation techniques can theoretically be used for phylogenetic footprinting, they were designed to compare evolutionarily independent sequences (see section 1.3.2) and therefore do not take into account the evolutionary relationship between homologous sequences. To overcome this problem, Blanchette and Tompa (2002) developed another method, Footprinter, which takes into account the phylogenetic tree relating the sequences and is therefore more suitable for comparing orthologous sequences. It identifies all the DNA motifs that have evolved at a slower rate than the surrounding region.

A major drawback is the relatively poor performance of this approach on a small set of orthologues. Indeed, all these motif-finding techniques perform much better with increasing numbers of sequences, where the distinction between conserved motifs and diverged background becomes clearer. Blanchette systematically used more than three species and increased the number of sequences by including paralogues. With the sequencing of more organisms this will become less problematic but, in order to work, the approach assumes that a majority of these organisms retain sufficiently conserved motifs within the analysed segment, which may not be valid (see section 1.3.4). Another problem is that these methods do not work as well with long sequences and, as metazoan promoters may lie a considerable distance away from the transcription start site, this limits their utility.

Nevertheless, these approaches will find motifs that satisfy the criteria independently of the surrounding sequence identity. This is not the case

with global alignment, where the noise of the diverged non-functional back-

ground can overcome the short conserved signal.

1.3.4 Issues

All these studies, regardless of species complexity, show an overall enrichment of putative transcription factor binding sites in conserved non-coding genomic sequences or footprints, and many other studies have linked evolutionarily conserved regions to experimentally determined regulatory elements (Aparicio et al., 1995). There is no doubt that this technique is successful in finding TFBS genome-wide. However, it is also quite clear that, even for relatively close species, not all TFBS are conserved. For example, only 50 percent of known TFBS are located in conserved regions, according to one estimate for human and mouse (Levy and Hannenhalli, 2002). It follows that these types of methods do not find all the transcription factor binding sites but only a subset that is important enough to be conserved throughout all the species studied. It is nevertheless important to understand why such conservation fails to happen in so many cases.

First of all, alteration in gene regulation, and therefore alteration of TFBS, seems to have been the primary substrate for the evolution of species. King and Wilson (1975) suggested that most of the genetic causes of phenotypic differences between humans and the great apes lie in the regulatory sequences that control the timing and pattern of gene activity. Many other homologous genes have been shown to have distinct temporal and spatial expression. As an example, the beta-myosin heavy chain is the major isoform in the adult ventricle of humans but not of hamsters, and consequently the cis-acting elements involved in tissue specificity would be expected to differ between the two species. Even when the functional binding site is conserved, some divergence in the nucleotide sequence of the site can be seen, even between very close species. For example, a study of an androgen-inducible gene in different mouse species shows that the regulatory sites for this gene have undergone substitutions and insertions that change their affinities for their respective nuclear factors and modify the expression of the gene (Chaudhuri et al., 1991). Gene duplication arising from whole-genome or segmental duplication is also a substrate for mutation in the regulatory regions of both duplicated genes (see the duplication-degeneration-complementation model proposed by Force et al., 1999).

Consequently, the conservation of cis-regulatory regions is a good indicator of conservation of the spatial and temporal expression of an orthologue. Loots et al. (2000) started from the prior knowledge that transgenic mice bearing the human 5q31 region, which contains IL4, IL13 and IL5 as well as their regulatory regions, correctly expressed the human transgenes. They proposed the hypothesis that the cis-regulatory regions should be conserved from human to mouse and found, by cross-species sequence comparison, that this is indeed mostly the case. Because this type of study is difficult to scale up to the whole-genome level, genome-wide phylogenetic footprinting cannot integrate this information yet. Conversely, non-conserved regions or divergence in shared binding sites that have arisen from positive selection are very interesting because they can explain the differences between species, but for now there are no techniques to distinguish between positive selection and random mutation.

Another issue has more to do with the characteristics of TFBS. Because of the relatively small size of a regulatory motif (5-25 bp), these motifs can easily be modified, duplicated and reversed, and can appear or disappear throughout evolution without affecting the expression of the gene. For the alpha-globin cluster, the MARE motif has been found in human, mouse, chicken and pufferfish, but in pufferfish this motif has a different location and appears to be reversed (Flint et al., 2001). A very interesting study performed by Ludwig et al. (2000) showed that two stripe elements had undergone substitutions and indels that considerably modified the cis-acting elements and the spacing between them in two species as close as D. pseudoobscura and D. melanogaster. Yet the expression profile stays the same, raising the hypothesis of compensatory mutation. These authors suggest that stabilising selection has allowed mutational turnover of functionally important sites and, at the same time, maintained functional conservation of gene expression. They predict that such a pattern of substitution will be a common theme in cis-regulatory regions, but the extent of such substitution seems to differ from species to species, with vertebrates having overall more conserved cis-regulatory regions than invertebrates.

Therefore, the question of how evolution affects cis-acting motifs is important to consider. Ideally one would like to study orthologous genes in different species which have the same expression pattern but have had enough

evolutionary time that only functionally conserved sequences are alignable.

However, there seems to be no ideal large-scale comparison that would satisfy

the characteristics of all the genes. Intra-mammal comparisons (e.g. mouse to human) would include mammalian-specific genes but would also show a large amount of non-functional conservation. Intra-vertebrate analysis (e.g. fish to human), on the other hand, can locate functional regions with more specificity, but (a) the signal is currently very hard to detect and (b) it is considerably less obvious whether one should expect functional conservation of the same motifs.


I will investigate alignments of promoter regions in chapter 3 and go on

to use these alignments to define motifs in chapter 4.

1.3.5 Finding eukaryotic promoters

Many methods to predict promoters or, more precisely, the location of the

transcription start site (TSS) have been developed in higher eukaryotes.

PromoterInspector (Scherf et al., 2000), for example, uses a set of over-

represented motifs in promoter regions. Another algorithm, Eponine (Down

and Hubbard, 2002) is a probabilistic method for detecting transcription

start sites in mammalian genomic sequences. It consists of a set of DNA

weight matrices recognising specific sequence motifs. Each of these elements

is associated with a position distribution relative to the transcription start

site. These elements are:

1. A diffuse preference for CpG motifs that correspond to the CpG island.

2. A TATAA box motif at around 30 bp upstream of the TSS.

3. Two CpG-rich weight matrices flanking the TATAA motif.

This procedure is based on a model learned from the Eukaryotic Pro-

moter Database (EPD, (Schmid et al., 2004)), an annotated non-redundant

collection of eukaryotic POL II promoters, experimentally defined by a tran-

scription start site.
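The combination of a weight matrix with a position distribution relative to the TSS, as described above for Eponine, can be illustrated with a toy sketch. This is not the Eponine model or its trained parameters: the single TATAA matrix, the uniform background, the Gaussian position penalty (mean -30, standard deviation 5) and all function names are assumptions made for illustration, and adding a log2 odds score to an unscaled Gaussian log weight is an ad hoc heuristic.

```python
import math

BACKGROUND = 0.25  # assumed uniform base composition


def log_odds_matrix(counts, pseudocount=1.0):
    """Build a log2-odds weight matrix from per-column base counts."""
    matrix = []
    for column in counts:
        total = sum(column.get(base, 0) for base in "ACGT") + 4 * pseudocount
        matrix.append({
            base: math.log2(((column.get(base, 0) + pseudocount) / total) / BACKGROUND)
            for base in "ACGT"
        })
    return matrix


def pwm_score(matrix, seq, pos):
    """Score the window of seq that starts at pos (seq must contain only ACGT)."""
    return sum(column[seq[pos + i]] for i, column in enumerate(matrix))


def position_log_weight(offset, mean, sd):
    """Unnormalised Gaussian log weight for a motif offset relative to the TSS."""
    return -0.5 * ((offset - mean) / sd) ** 2


def score_tss(seq, tss, matrix, mean_offset=-30, sd=5.0):
    """Best motif score over all positions, penalised by distance from the
    expected offset relative to a candidate TSS at index `tss`."""
    best = float("-inf")
    for pos in range(len(seq) - len(matrix) + 1):
        total = pwm_score(matrix, seq, pos) + position_log_weight(pos - tss, mean_offset, sd)
        best = max(best, total)
    return best


if __name__ == "__main__":
    tata = log_odds_matrix([{"T": 10}, {"A": 10}, {"T": 10}, {"A": 10}, {"A": 10}])
    sequence = "G" * 20 + "TATAA" + "G" * 25 + "C" * 10
    for candidate in (30, 50):
        print(candidate, round(score_tss(sequence, candidate, tata), 2))
```

In this toy sequence the TATAA motif starts at index 20, so a candidate TSS at index 50 (30 bp downstream) scores higher than one at index 30. A real predictor would combine several such elements and learn both the matrices and their position distributions from annotated promoters.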

Chapter 2

Finding regulatory regions using functional information in yeast

2.1 Introduction

As we have seen in the introduction, transcription factors are one of the ma-

jor players in gene expression and bind to small stretches of semi-conserved

DNA that are usually located a certain limited distance upstream of the

gene. Contrary to the well defined gene structure, cis-regulatory elements

are poor in information, and even though it should be theoretically possible

to find these elements de novo using nothing but the DNA sequence, most

of the approaches so far have used additional information. One of the most

successful approaches involves the use of related genomes to find regions of

conservation. Chapter 3 is devoted to the use of comparative genomics to

locate cis-regulatory sites.

Another very successful approach to locate cis-regulatory elements is the

use of micro-array technology. Microarray analysis measures the amount of

specific mRNA in the cell; that is, the net balance between the rates of biosynthesis and degradation of the mRNA molecule. By repeating the measure at different

times under different conditions, it is possible to obtain the expression pro-

file for each gene studied. Genes that have similar expression profiles (called

co-regulated genes) are believed to be regulated by a similar set of regula-

tory elements. The mechanism for such a regulation is very different between

prokaryotes and eukaryotes, and for the latter, each gene has its own regulatory regions. Therefore, binding sites for a particular transcription factor are expected to be enriched in the upstream regions of a set of co-regulated genes. This has been shown to be true in numerous cases.
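One common way to quantify such an enrichment is a hypergeometric test: given how many genes in the genome carry a motif upstream, how surprising is the number seen within a co-regulated set? The sketch below is a generic illustration under that assumption and is not necessarily the statistic used later in this chapter; the function names, the gene names and the use of an exact-string motif are hypothetical.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) for a hypergeometric variable.

    population: total number of genes; successes: genes carrying the motif;
    draws: size of the gene set; k: genes in the set carrying the motif.
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


def motif_enrichment(upstream, gene_set, motif):
    """Return (k, n, p-value) for an exact motif in the upstream sequences.

    `upstream` maps gene name to upstream sequence; genes of `gene_set`
    without an upstream sequence are ignored.
    """
    with_motif = {gene for gene, seq in upstream.items() if motif in seq.upper()}
    members = [gene for gene in gene_set if gene in upstream]
    k = sum(1 for gene in members if gene in with_motif)
    p_value = hypergeom_sf(k, len(upstream), len(with_motif), len(members))
    return k, len(members), p_value


if __name__ == "__main__":
    toy = {
        "g1": "TTACGCGTAA", "g2": "GGACGCGTCC", "g3": "TTTTTTTTTT",
        "g4": "GGGGGGGGGG", "g5": "CACACACACA", "g6": "ATATATATAT",
    }
    print(motif_enrichment(toy, ["g1", "g2"], "ACGCGT"))
```

With only six toy genes the p-value is 1/15, which shows why small gene sets rarely reach significance and why multiple testing matters when many motifs and many sets are examined.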

The approach taken here also uses the information about co-regulation, but

not derived from micro-array analysis; rather from the fact that genes that

have similar function have a strong tendency to form clusters of co-regulated

genes. Given the function of genes, it is therefore theoretically possible to

bypass the micro-array data and possibly define co-regulated genes via their

functional similarity. Eukaryotic genes whose products have similar func-

tionality should therefore display an enrichment in given cis-regulatory sites.

This is the basis of the method described in this chapter. An example of a well-studied pathway in yeast (the nucleotide pathway) is presented first, with a manual search for cis-regulatory elements, before the approach is extended to an automatic procedure that can find regulatory elements using any functional network.

2.2 Example: the nucleotide pathway in yeast

The cell cycle is a highly coordinated process that involves the production of newly synthesised DNA strands. During S phase, the cell should possess an elevated level of dNTPs (the DNA precursors) as well as all the enzymes and accessory proteins that are involved in the biosynthesis of DNA. The nucleotide pathway is therefore a good example with which to test the hypothesis of co-regulation within a pathway in yeast. Because nucleotides are also used at times other than the S phase of the cell cycle, and the pathway that produces nucleotides is not a linear chain of reactions, one would expect only a subset of enzymes to be co-regulated. In order to find potential regulatory motifs, the upstream regions of all the genes encoding enzymes involved in the DNA polymerisation pathway were retrieved. DNA was chosen as the starting compound because strong co-regulation among the subunits of the polymerase is expected. By manual analysis, two motifs appeared to be found significantly more often in the 0.5 kb upstream of these genes. Using the pathway relationships in KEGG, the two motifs were then searched for recursively upstream of the genes for neighbouring enzymes in the pathway. Results are shown in Figure 2.1. This figure shows the network of reactions leading toward biosynthesis of DNA

Figure 2.1: Diagram of the selected routes for the biosynthesis of DNA as described in KEGG. Labeled in blue are the enzymes that catalyse the reactions, and the circles are the compounds used by these enzymes. The arrows show the direction of each reaction under 'normal' physiological conditions of the cell, but none of these reactions is considered irreversible. Certain enzymes are labeled with red circles and/or green rectangles (symbolising the MluI motif and the unknown motif respectively) where the motif is found at least once within 500 bp upstream of the gene.

and the presence of these motifs in the upstream regions of the corresponding genes.

The first motif (ACGCGTNA) is well known and has previously been called the MluI site (Verma et al., 1991). The MluI site has been shown to bind a regulatory protein that is involved in the regulation of cell division cycle genes (CDC genes). The genes experimentally verified to have this site in the upstream region are CDC21 (YOR074C), CDC2 (YDL102W), CDC6 (YJL194W) (Verma et al., 1991) and POL1 (YNL102W) (Moll et al., 1992), but these genes are not the only ones to have this site, as shown in Figure 2.1. Further experimental work is needed to verify that the additional occurrences of this motif function as binding sites. The second motif does not seem to have been reported in the literature.
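Searching for a degenerate motif such as ACGCGTNA within 500 bp upstream of a gene can be sketched as follows. This is an illustrative reconstruction, not the procedure used for Figure 2.1: it assumes that each upstream sequence is given 5' to 3' with the gene start at the end of the string, expands IUPAC ambiguity codes into a regular expression, and scans both strands. All function names are hypothetical.

```python
import re

# IUPAC nucleotide ambiguity codes expanded to regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(motif):
    """Reverse complement of an upper-case motif written in IUPAC codes."""
    return motif.translate(COMPLEMENT)[::-1]


def motif_regex(motif):
    """Compile a motif to a regex; the lookahead reports overlapping hits."""
    body = "".join(IUPAC[code] for code in motif.upper())
    return re.compile("(?=(" + body + "))")


def find_motif(upstream, motif, window=500):
    """Return sorted (position, strand) hits in the last `window` bases.

    Positions are 0-based indices into the full upstream string and refer
    to the left-most base of the hit on the given sequence.
    """
    region = upstream.upper()[-window:]
    shift = len(upstream) - len(region)
    hits = []
    for strand, pattern in (("+", motif.upper()), ("-", reverse_complement(motif.upper()))):
        for match in motif_regex(pattern).finditer(region):
            hits.append((match.start() + shift, strand))
    return sorted(hits)


if __name__ == "__main__":
    example = "G" * 10 + "ACGCGTCA" + "G" * 10
    print(find_motif(example, "ACGCGTNA"))
```

Because the ACGCGT core is palindromic, a site can match on both strands when the flanking bases allow it, so hits should be counted per site rather than per strand when testing for enrichment.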

This example was the result of careful manual analysis, but a fully auto-

mated procedure to find such motifs was developed. This uses the degree of

concordance between the motifs present upstream of genes and any functional network.

2.3 Useful functional network

The availability of functional information in a large-scale manner is essential

for this approach and is dependent on the organism studied as well as the

type of interaction.

The first type of functional information used is direct protein-protein interaction, taken from two large-scale experiments in yeast published by Gavin et al. (2002) and Ho et al. (2002). Another well known type of functional information, used in the example above (see Section 2.2), is the small molecule metabolic reaction catalysed by enzymes. In this instance, the link is not physical, as in the case of protein-protein interaction, but rather indirect via metabolites.

Because of the relative simplicity of the yeast genome organisation and the availability of large-scale experiments, the choice was made to work mostly with S. cerevisiae. The methodology is attempted on H. sapiens later in this chapter.

2.3.1 Metabolic network

Enzymes are among the best-characterised elements in the cell, having been among the first biological molecules to be studied. Their metabolites are usually well defined and, because the product of one enzyme is usually the substrate of other reactions, relationships between enzymes are easily derived. In the early days of biology, a continuous stretch of such relationships was called a pathway; the concept of pathways remains today, even though the topology is better viewed as a network than as a set of linear pathways.

An example of a computationally defined metabolic network is the small molecule metabolic network described in the KEGG database (Kanehisa, 1997), a store of all known enzymatic reactions for many species. Taking only yeast, KEGG contains 623 enzyme-encoding genes, which correspond to about 10% of the yeast gene set. This metabolic network can be represented as a bipartite graph that contains two node types (enzymes and metabolites) and two types of edges (enzymes linked to metabolites and metabolites linked to enzymes). Under normal physiological conditions, the enzymatic reaction has a direction from the substrate to the product, but in this study directionality is not used. The resulting graph is therefore undirected.

Because the interest here is in the relationships between enzymes, and because a simpler network structure is needed, a monopartite graph can be derived that contains only one type of node (enzyme) and one type of edge (enzyme linked to enzyme). This procedure is illustrated in Figure 2.2. Ubiquitous substrates such as water or CO2 are hubs in the bipartite graph, and a monopartite graph derived with them would connect enzymes that have no real metabolic link; it is therefore important to remove these non-specific metabolites from the bipartite graph (see Appendix B). The resulting monopartite graph contains 623 nodes and 26,426 edges, with an average of 34 edges per node.
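The projection from the bipartite to the monopartite graph can be sketched as follows. This is a minimal illustration and not the code used in this work: the function name `project_to_monopartite`, the `excluded_hubs` argument and the toy identifiers are all hypothetical, and the actual list of non-specific metabolites is the one described in Appendix B.

```python
from collections import defaultdict
from itertools import combinations


def project_to_monopartite(bipartite_edges, excluded_hubs=frozenset()):
    """Project (hub, member) pairs onto an undirected member-member graph.

    bipartite_edges: iterable of (hub, member) pairs, e.g. (compound, enzyme)
                     or (complex, protein).
    excluded_hubs:   hubs dropped before projecting (hypothetical list of
                     non-specific metabolites such as water or CO2).
    Returns (edges, nucleation_sets): a set of frozenset({u, v}) edges and a
    dict mapping each retained hub to the set of members linked to it.
    """
    nucleation_sets = defaultdict(set)
    for hub, member in bipartite_edges:
        if hub not in excluded_hubs:
            nucleation_sets[hub].add(member)

    edges = set()
    for members in nucleation_sets.values():
        # Two members are linked if they share at least one retained hub.
        for u, v in combinations(sorted(members), 2):
            edges.add(frozenset((u, v)))
    return edges, dict(nucleation_sets)


# Toy input with made-up identifiers (not real KEGG data).
toy = [
    ("H2O", "enzA"), ("H2O", "enzB"), ("H2O", "enzC"),
    ("dUMP", "enzA"), ("dUMP", "enzB"),
    ("dTMP", "enzB"), ("dTMP", "enzC"),
]
filtered_edges, nucleation = project_to_monopartite(toy, excluded_hubs={"H2O"})
unfiltered_edges, _ = project_to_monopartite(toy)
```

In the toy input, keeping the hub "H2O" would add a link between enzA and enzC that has no specific metabolic basis, which is the effect the filtering step is meant to avoid.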

2.3.2 Protein interaction

Because complexes are entities that can only be functional when all the nec-

essary proteins are present in the cell, information about direct or indirect

physical interactions between proteins within complexes should be very use-

ful for the discovery of cis-regulatory elements.

Two high-throughput datasets on yeast protein interactions were used and are subsequently referred to as the Cellzome network (Gavin et al., 2002) and the MDS network (Ho et al., 2002). These datasets correspond to large-scale identification of protein complexes in S. cerevisiae by mass spectrometry. Both studies used a set of target proteins fused either with protein A and the calmodulin binding peptide (tandem affinity purification (Rigaut et al., 1999), used by Cellzome) or with the Flag epitope tag (MDS). The resulting fusion proteins were purified together with the interacting yeast proteins, and the purified complexes were analysed by tandem mass spectrometry to identify the associated proteins. The raw data used here consist of each bait protein linked to all the identified proteins that co-precipitate with it.

One of the major differences between the two methodologies is that Cellzome uses the natural promoter to express the tagged protein, while MDS uses a construct under a strong inducible promoter. In the first case, some proteins may not be expressed under the conditions of the experiment (haploid cells in mid-log phase), but the expression, as well as the binding to other proteins, better reflects the physiological conditions in a cell. On the other hand, the use of a strong inducible promoter guarantees a detectable amount of tagged protein, but the binding to other proteins may not reflect any biological interaction. In both cases, these methods are unlikely to detect transient interactions or interactions occurring only in specific states.

As with the metabolic network, protein interaction networks can be represented as bipartite graphs, with complexes and proteins representing the two types of nodes. A monopartite graph can be derived which contains only proteins as nodes. The resulting monopartite graph contains 1,411 nodes and 34,844 edges for the Cellzome network, and 1,699 nodes and 151,670 edges for the MDS network.

2.4 Generating and assessing motifs

The basic workflow of the method is presented in Figure 2.3. The input data are the upstream regions of the yeast genes (see Materials and Methods) and functional information from either metabolic interactions or direct protein-protein interactions. Two approaches were used to generate motifs, and each motif was assessed using the functional network. A scoring scheme was developed which quantitatively assesses the degree of concordance between the motif and the functional network. A significance score is assigned to each motif using a brute force randomization procedure. The significant motifs are then clustered.

2.4.1 Generating motifs

Two slightly different approaches can be used for generating motifs. Most results shown in this chapter were generated using the over-represented motif approach described in 2.4.1.1. The exhaustive approach was used for the promoter scanning method described in 2.5.5.


[Figure 2.2, top panel: bipartite representation of the interaction network, linking compounds/complexes (A, B, C) to enzymes/proteins (1 to 8), with one nucleation set highlighted. Bottom panel: unipartite representation of the interaction network, containing only enzymes/proteins.]

Figure 2.2: Monopartite (or unipartite) representation of the graph (bottom) derived from a bipartite representation (top). Only one type of node (white) is kept, by adding an edge between two of these nodes only if they were linked to a common black node in the bipartite graph. The labels 'compounds' and 'enzymes' apply to the KEGG network, and the labels 'complexes' and 'proteins' apply to the protein interaction networks. A nucleation set is defined as all the proteins that either act upon the same compound or are part of the same complex.


[Figure 2.3, flowchart of the method. Input: the interaction network in bipartite representation (A: metabolic network; B: protein interaction network). Generating motifs, first approach: only compounds (A) or complexes (B) linked to 3 or more genes are kept, the upstream regions of these genes are used for pattern discovery, and Teiresias is run on the selected set of genes (parameters: minimum of 8 nucleotides, 2 wild cards allowed). Generating motifs, second approach: all 6-, 7- and 8-mer motifs. Assessment: for each pattern found, build a pattern network (a fully connected graph whose nodes are the selected genes), overlap it with the interaction network (unipartite representation) and calculate the overlap score. Randomization: create pattern networks using random genes with the same number of nodes, overlap each random pattern network with the interaction network (unipartite representation), calculate the overlap score, and define a cutoff value. Finally, cluster the patterns whose overlap score is equal to or higher than the cutoff value and build a sequence logo for each pattern cluster.]

Figure 2.3: Overall schema of the procedure. Two alternative pattern discovery steps are possible before the assessment using the functional network. To assess significance, a randomisation step is performed, and significant motifs are then clustered.


2.4.1.1 Over-represented motifs

One part of the functional network (the nucleation set), derived from the bi-

partite representation of the functional network (see Figure 2.2) can be used

to derive a set of over-represented motifs that would be assessed using the

other part of the network.

As seen in Chapter 1, a broad range of programs that find over-represented motifs in a set of genes is freely available. Most of these programs were initially developed to find over-represented motifs within a cluster of co-regulated genes derived from micro-array analysis. In all cases, given a background model, the algorithm finds over-represented motifs within a set of sequences that are believed to lack any evolutionary relationship. The advantage of such an approach is the enrichment of potential candidates in the motif set, allowing a broader definition of the motif dictionary.

Some computational methods, like Gibbs sampling or expectation maximisation, use a complex background model to eliminate ubiquitous motifs (for example, low complexity motifs) from the set of significant motifs. Because the measurement of the concordance with the functional network is the filtering step here, the approach requires the proposed motif set to be as large as possible. Teiresias (Rigoutsos and Floratos, 1998), a fast algorithm that exhaustively retrieves all possible motifs satisfying given parameters, is well suited for this methodology. Using Teiresias with loose parameters (see Appendix B for details), 197,922 patterns are generated from the entire KEGG network, while 197,111 and 320,405 patterns are generated from the Cellzome and MDS networks respectively. Although these patterns are technically not random, their numbers and distributions across genes are not in themselves convincing signals of credible motifs; instead, they provide a large initial set to be filtered in the assessment step.

2.4.1.2 Exhaustive enumeration

All possible motifs from a motif dictionary can be assessed in an exhaustive manner. To avoid prohibitive computational time, the search space was limited to discrete motifs of a given length, without gaps, that contain enough information content (typically more than five nucleotides long) and occur at least twice in the upstream regions of all annotated genes (which in practice limits them to typically less than 14 nucleotides). This procedure for generating motifs has the advantage of being independent of the functional network, so the subsequent assessment step can be done on the whole functional graph (without the need to remove the nucleation set; see 2.4.2). A potential drawback of this approach is that the search space must be restricted to avoid prohibitive computational time.
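As an illustration of the exhaustive approach, the sketch below enumerates all ungapped k-mers in a set of upstream sequences and keeps those seen at least twice. It is only a sketch under stated assumptions: the function name `enumerate_kmers`, its parameters and the toy sequences are hypothetical, only the given strand is scanned, and the 6 to 8 nucleotide range is taken from Figure 2.3.

```python
from collections import defaultdict

DNA = set("ACGT")


def enumerate_kmers(upstream, k_min=6, k_max=8, min_locations=2):
    """Return {motif: set of gene ids} for ungapped k-mers seen often enough.

    upstream: dict mapping a gene id to its upstream sequence (string).
    A k-mer is kept if it has at least `min_locations` occurrences, counted
    as distinct (gene, position) pairs across all upstream regions.
    """
    locations = defaultdict(set)
    for gene, seq in upstream.items():
        seq = seq.upper()
        for k in range(k_min, k_max + 1):
            for start in range(len(seq) - k + 1):
                kmer = seq[start:start + k]
                if set(kmer) <= DNA:          # skip masked or ambiguous bases
                    locations[kmer].add((gene, start))
    return {
        kmer: {gene for gene, _ in locs}
        for kmer, locs in locations.items()
        if len(locs) >= min_locations
    }


# Toy upstream regions (made-up sequences).
toy_upstream = {
    "g1": "TTGACTCATT",
    "g2": "GGTGACTCAG",
    "g3": "ACGTACGTAC",
}
motifs = enumerate_kmers(toy_upstream)
```

Each retained motif maps directly to the set of genes that form its motif network, so the output can be passed to the assessment step without reference to any nucleation set.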

2.4.2 Assessment of the motifs using functional networks

Functional networks are not simple clusters of functionally related proteins but rather highly connected graphs that link some of the proteins together. Simplistic notions of clustering will not capture the sparse co-regulation inside these networks. What can be defined as a set of genes involved in the same biological function is not a trivial question and cannot be answered using the network alone. The computational problem faced here is, therefore, not to find significantly over-represented motifs in a cluster of genes (which was only done in the initial step to derive a dictionary of putative candidates), but rather to measure the concordance of the occurrences of a motif with the functional network. The best method developed here was to create a 'motif network', a fully connected graph containing as nodes all the genes that possess the motif, and to superimpose this graph on the original functional network. Only edges present in both networks remain, and the resulting 'overlap network' is the intersection of the functional and motif networks. If the technique for generating motifs uses one part of the network (as is the case in 2.4.1.1), then the seed that built the pattern (the nucleation set) needs to be discounted from the functional network before assessing the pattern.

The next step was to develop a single numerical value that indicates the overall complexity of the overlap network, in order to assess and compare all the possible motifs. This overlap score needs to take into account the number of nodes and edges in the overlap network. To do so, all the edges of the overlap network contribute to the score (apart from the edges that belong to the nucleation set, if the motif that built the overlap graph is derived from it). Contrary to the motif network, the nodes in the initial functional network do not have a fixed number of edges per node and, therefore, the overlap network was expected to vary greatly in size according to the type of nodes involved. To address this issue, the contribution of each overlapping edge was weighted by the inverse of the total number of edges that its two nodes possess in the initial functional network (see equation B.3):

\[
S = \sqrt{\sum_{i} \frac{1}{a_i + b_i - 1}}
\]

The summation is over all common edges i, present in both networks, connecting node A_i to node B_i, where a_i and b_i are the numbers of edges of A_i and B_i in the initial functional network. The denominator a_i + b_i − 1 is the total number of edges from both nodes, with the edge between them counted only once. This procedure is illustrated in Figure 2.4 and a more detailed explanation is given in Appendix B.
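The overlap score can be sketched in a few lines. This is a minimal illustration rather than the implementation used in this work: the function name `overlap_score`, the toy network and the gene labels are hypothetical, and the treatment of the nucleation set (skipping edges with both ends in the set) is an assumed reading of the text above.

```python
import math
from collections import defaultdict


def overlap_score(functional_edges, motif_genes, nucleation_set=frozenset()):
    """Overlap score S = sqrt(sum over shared edges of 1 / (a + b - 1)).

    functional_edges: iterable of (gene_u, gene_v) pairs (undirected).
    motif_genes:      genes carrying the motif (nodes of the motif network).
    nucleation_set:   genes used to discover the motif; edges with both ends
                      in this set are skipped (assumed reading of the text).
    """
    edges = {frozenset(e) for e in functional_edges if e[0] != e[1]}
    degree = defaultdict(int)     # degrees in the initial functional network
    for edge in edges:
        for gene in edge:
            degree[gene] += 1

    motif_genes = set(motif_genes)
    nucleation_set = set(nucleation_set)
    total = 0.0
    for edge in edges:
        # The motif network is fully connected, so a functional edge is
        # shared whenever both of its ends carry the motif.
        if edge <= motif_genes and not edge <= nucleation_set:
            u, v = edge
            total += 1.0 / (degree[u] + degree[v] - 1)
    return math.sqrt(total)


# Toy functional network (made-up gene labels).
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "H"),
             ("D", "E"), ("E", "F"), ("E", "G"), ("G", "H")]
real = overlap_score(toy_edges, {"A", "B", "C", "D", "E", "F", "G"},
                     nucleation_set={"A", "B", "C"})
rand = overlap_score(toy_edges, {"C", "F", "G", "H"})
```

In the toy network the motif genes share the edges D-E, E-F and E-G with the functional network once the nucleation set is discounted, whereas the arbitrary gene set shares only C-H and G-H, so the first score is the higher of the two.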

The remaining issue was how to assess the significance of such an overlap score. Theoretical derivation of the statistics is difficult, as the network topology is variable. However, significance can be estimated computationally using a brute force randomisation technique. Indeed, the value of the overlap score depends mainly on:

1 the number of times a particular motif is seen in the upstream regions of annotated genes.

2 the functional network topology.

3 the extent of concordance between the motif network and the functional network.

[1] depends on each motif occurrence and [2] is fixed for a given network. [3] is the aspect to be assessed.

The contribution of [1] and [2] can be estimated for each motif occurrence using a brute force randomisation, and any deviation from this estimate is treated as a significant concordance between the motif and the functional information. To do so, 100,000 fully connected pattern networks of different sizes (from 0 to 500 nodes) were generated, with the gene identifiers being random but all the other aspects of the network remaining the same. The overlap score of each of these random networks with the functional network was then calculated.
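The randomisation can be sketched as below. The code is a hedged illustration, not the original implementation: the names `build_adjacency`, `overlap_score` and `random_overlap_scores`, the toy network and the small sample sizes are all hypothetical, and the nucleation-set correction is left out for brevity.

```python
import math
import random
from collections import defaultdict


def build_adjacency(edges):
    """Symmetric adjacency map {gene: set of neighbours} from edge pairs."""
    adjacency = defaultdict(set)
    for u, v in edges:
        if u != v:
            adjacency[u].add(v)
            adjacency[v].add(u)
    return dict(adjacency)


def overlap_score(adjacency, genes):
    """sqrt of the sum over shared edges of 1 / (deg(u) + deg(v) - 1)."""
    genes = set(genes)
    total = 0.0
    for u in genes:
        for v in adjacency.get(u, ()):
            if v in genes and u < v:      # count each undirected edge once
                total += 1.0 / (len(adjacency[u]) + len(adjacency[v]) - 1)
    return math.sqrt(total)


def random_overlap_scores(adjacency, all_genes, sizes, n_per_size, seed=0):
    """Null overlap scores for random gene sets, keyed by network size."""
    rng = random.Random(seed)
    all_genes = sorted(all_genes)
    null = defaultdict(list)
    for size in sizes:
        for _ in range(n_per_size):
            sample = rng.sample(all_genes, size)   # random gene identifiers
            null[size].append(overlap_score(adjacency, sample))
    return dict(null)


# Toy functional network; "X" and "Y" stand for genes absent from the network.
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "H"),
             ("D", "E"), ("E", "F"), ("E", "G"), ("G", "H")]
adjacency = build_adjacency(toy_edges)
genes = set(adjacency) | {"X", "Y"}
null = random_overlap_scores(adjacency, genes, sizes=[2, 4, 6], n_per_size=50)
```

Drawing the random genes from the full gene list, including genes that are absent from the functional network, keeps the size of the random motif network comparable to that of a real motif network; whether the original procedure sampled in exactly this way is an assumption.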


[Figure 2.4, left panel: real pattern network overlaid on the functional network, showing the nucleation set (A, B, C) and the other genes having the pattern (D, E, F, G); edges used: D−E, E−F and E−G; overlap score = √(1/(3+4−1) + 1/(2+3−1) + 1/(1+3−1)) = 0.86. Right panel: random pattern network; edge used: G−H; overlap score = √(1/(3+3−1)) = 0.44.]

Figure 2.4: The left-hand panel shows an example real network, with three edges forming an initial seed of nodes (A, B, C). One of the patterns discovered using this seed is also found in genes (D, E, F, G), many of which share edges with the functional network. The overlap score in this case is 0.86. In contrast, the right-hand panel shows an example random network, which was chosen to have the same number of nodes (4) as the proposed pattern network. In this case, however, only one edge is shared, and the overlap score is 0.44.


[Figure 2.5, two scatter plots for the MDS network: random overlap score as a function of motif network size, and real overlap score as a function of motif network size. x-axis: pattern network size (0 to 500); y-axis: overlap score (0 to 3.5).]

Figure 2.5: Each point in these graphs corresponds to an overlap score between a random motif network (top) or a real motif network (bottom) derived from the MDS network. The overlap score is plotted as a function of the size of the motif network. The dotted line corresponds to four standard deviations from the mean overlap score for each pattern network size.


[Figure 2.6, two scatter plots for the metabolic network: real overlap score as a function of pattern network size, and random overlap score as a function of pattern network size. x-axis: pattern network size (0 to 500); y-axis: overlap score (0 to 2.5).]

Figure 2.6: Each point in these graphs corresponds to an overlap score between a random motif network (top) or a real motif network (bottom) derived from the KEGG network. The overlap score is plotted as a function of the size of the motif network. The dotted line corresponds to four standard deviations from the mean overlap score for each pattern network size.


[Figure 2.7, two scatter plots for the Cellzome network: real overlap score as a function of pattern network size, and random overlap score as a function of pattern network size. x-axis: pattern network size (0 to 500); y-axis: overlap score (0 to 4.5).]

Figure 2.7: Each point in these graphs corresponds to an overlap score between a random motif network (top) or a real motif network (bottom) derived from the Cellzome network. The overlap score is plotted as a function of the size of the motif network. The dotted line corresponds to four standard deviations from the mean overlap score for each pattern network size.


2.5 Results

Figures 2.7, 2.5 and 2.6 show the overlap scores of random and real pattern networks as a function of the total number of occurrences of the pattern for the Cellzome, MDS and metabolic networks respectively. In all cases, the randomised networks show a consistent, well-behaved trend: the score increases linearly with the number of nodes. For each network size, the normality of the distribution of overlap scores was assessed, and it was found that, for small network sizes, many networks have an overlap score of zero, which makes the distribution skewed. This skewness becomes less pronounced as the network size increases. For network sizes of more than 150, there is a good fit to a normal distribution (see Appendix B). The same tests have been performed using the chi-square method, resulting in similar conclusions.

A regression line corresponding to four standard deviations from the mean

of overlap scores for each network size was constructed, and real networks

having a score above this line were considered significant. Most of the pat-

terns found produce a network of genes that have little or no concordance

with the functional network. However, 647 motifs have a network of genes

that shows a much higher overlap score. In other words, these specific motifs

are found upstream of genes that have a significantly higher probability of

being interaction partners.
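One way to read the cutoff described above is sketched here: for each network size, take the mean of the random overlap scores plus four standard deviations, fit a least-squares line through these points, and call a real network significant if its score lies above the line. The function names `fit_cutoff_line` and `is_significant` and the toy numbers are hypothetical, and the exact construction of the regression line in this work may differ.

```python
from statistics import mean, pstdev


def fit_cutoff_line(null_scores, n_sd=4.0):
    """Least-squares line through (size, mean + n_sd * sd) of the null scores.

    null_scores: dict mapping network size -> list of random overlap scores
                 (at least two distinct sizes are needed).
    Returns (slope, intercept).
    """
    xs = sorted(null_scores)
    ys = [mean(null_scores[x]) + n_sd * pstdev(null_scores[x]) for x in xs]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def is_significant(size, score, line):
    """True if the score lies above the cutoff line at this network size."""
    slope, intercept = line
    return score > slope * size + intercept


# Toy null distribution (made-up scores for three network sizes).
toy_null = {10: [0.1, 0.2, 0.3], 20: [0.3, 0.4, 0.5], 30: [0.5, 0.6, 0.7]}
line = fit_cutoff_line(toy_null)
```

Because the random scores are skewed for small network sizes, a cutoff expressed in standard deviations is only a rough guide there; it is better justified for the larger sizes, where the fit to a normal distribution is good.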

2.5.1 Significant motifs

These patterns required some further processing to be useful. The pattern discovery systems output discrete patterns, so that, for example, GANTATG and GNATATG would be treated as two distinct patterns despite their obvious overlap. The patterns were therefore first grouped using their genomic locations (see Appendix B). This procedure was then followed by a single linkage clustering, reducing the set of interesting patterns to a total of 42 motifs for the three functional networks considered in the study. Conservative parameters were deliberately chosen so that the motifs retained are likely to be genuinely interesting.
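A single linkage clustering of patterns by shared genomic locations can be sketched as follows. The linking criterion used here (shared locations making up at least half of the smaller location set) and the names `single_linkage_clusters` and `min_shared` are assumptions made for illustration; the criterion actually used is the one given in Appendix B.

```python
def single_linkage_clusters(pattern_locations, min_shared=0.5):
    """Group patterns whose genomic occurrences largely coincide.

    pattern_locations: dict mapping pattern -> set of (gene, position) pairs.
    Two patterns are linked when the shared locations make up at least
    `min_shared` of the smaller location set (a hypothetical criterion).
    Single linkage: clusters are the connected components of these links.
    """
    patterns = sorted(pattern_locations)
    parent = {p: p for p in patterns}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]     # path halving
            p = parent[p]
        return p

    for i, p in enumerate(patterns):
        for q in patterns[i + 1:]:
            a, b = pattern_locations[p], pattern_locations[q]
            smaller = min(len(a), len(b))
            if smaller and len(a & b) / smaller >= min_shared:
                parent[find(p)] = find(q)

    clusters = {}
    for p in patterns:
        clusters.setdefault(find(p), []).append(p)
    return sorted(clusters.values())


# Toy occurrences (made-up genes and positions).
toy_locations = {
    "GANTATG": {("g1", 10), ("g2", 40), ("g3", 7)},
    "GNATATG": {("g1", 10), ("g2", 40)},
    "GAATATG": {("g2", 40), ("g5", 3)},
    "TGACTC": {("g4", 100)},
}
clusters = single_linkage_clusters(toy_locations)
```

With single linkage, two patterns end up in the same cluster as soon as a chain of linked patterns connects them, even if they share no location directly.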

The final set of 42 motifs, connecting to a total of 2,457 genes (about 40 percent of the yeast genome), is tabulated in Appendix C. Some clusters are well known motifs that bind to known transcription factors in yeast, and the regulated genes predicted by this analysis match the experimental evidence previously published. A more detailed analysis of the motifs and their corresponding overlap networks is available online:

http://www.ebi.ac.uk/ettwille/genome research paper 2003/result overlap.html

What follows is a more detailed analysis of interesting motifs.

2.5.1.1 Motif GGTGGCAAA

One of the strongest motifs that have significance in both the Cellzome and MDS networks is GGTGGCAAA. This motif, identified in cluster 6, has previously been called the proteasome associated control element, or PACE, and is known to bind Rpn4p, a transcription factor that controls the expression of genes related to the ubiquitin-proteasome pathway in yeast (Mannhaupt et al., 1999).

In this analysis, both this motif and its reverse complement, represented in cluster 23, are found mainly upstream of proteasome genes. Restricting attention to the overlap network, all the genes found code for proteasome subunits, apart from a protein from the cytoplasmic chaperonin complex and a protein involved in the ubiquitin-mediated degradation pathway, both of which are also related to protein degradation.

For the remaining genes in the overlap network, which are annotated as having unknown function, this is strong evidence that they either encode proteasome subunits or are more generally involved in protein degradation.

2.5.1.2 Motif TGACTC

The motif identified in cluster 33 (see Appendix C) has previously been reported to be located upstream of 30-40 yeast genes encoding enzymes in 11 different amino-acid biosynthesis pathways (Arndt and Fink, 1986). This is the well known binding site of the transcriptional regulator GCN4, which positively regulates the production of protein synthesis precursors in response to amino-acid starvation.

The genes from the overlap network with the metabolic graph encode mostly proteins involved in amino-acid biosynthesis, but also tRNA synthetases for most amino acids, as well as a couple of enzymes involved in purine metabolism. Figure 2.8 shows the overlap network obtained using the exact motif TGACTC. The nodes in the highly connected part of the network are mainly genes involved in amino-acid biosynthesis pathways. Nodes on the periphery are mostly genes coding for tRNA synthetases or genes involved in purine metabolism.

2.5.1.3 Motif AAAATTTT

The motif AAAATTTT is an interesting motif that scores very highly in all three functional networks studied. Also known as the poly(dA-dT) element, this motif has been shown to create localised DNA distortion on either end of the element (Koo et al., 2000), providing a region of access for transcription factors (Iyer and Struhl, 1995). Indeed, in order to bind efficiently to their target sites, most transcription factors need open chromatin (Koch and Thiele, 1999). To achieve this chromatin conformation, the cell either remodels the chromatin after a stimulus or constitutively keeps the chromatin open using DNA structural elements that induce nucleosome destabilisation. The poly(dA-dT) element is one such structural element. Maintaining the chromatin in an open conformation allows rapid transcriptional responses. In this study it was found that this apparently widely occurring pattern is found very often upstream of genes that are involved in transcription and translation processes. Figure 2.9 shows the overlap network obtained using the exact motif AAAATTTT on the metabolic network. The overlap network topology is very different from the one derived from the TGACTC motif, as it is formed mainly of two sets of highly connected nodes, one consisting mainly of RNA polymerase genes and the other mainly of tRNA synthetase genes.

The ubiquity of this motif for such basic processes suggests that it could be a 'global state' switch for yeast. For example, one hypothesis is that it could be involved in the cell's response to constantly changing conditions. Re-adaptation often involves the production of proteins and enzymes that allow the cell to use the new resources of its environment. Having a common and simple regulatory element such as the adenine-thymine tract, controlling the rate of production of most of the genes involved in the transcription/translation machinery, could enable the cell to rapidly boost the production of new proteins and, therefore, quickly adapt to new situations. This is an interesting example of a functional motif that, even though important for gene regulation, is probably not a binding site for a protein.

Figure 2.8: Overlap network for the exact motif TGACTC. Most of the genes code for proteins involved in amino-acid synthesis.

2.5.2 Non-random behaviour of significant motifs

Certain motifs with significant overlap scores also display other non-random behaviour, such as a tight positional distribution relative to the start codon. This is reflected by the standard deviation score, or SD score (see Appendix B), which measures how significant the positional distribution of a motif is in overlap genes compared with random genes that also have the motif. Figure 2.10 shows the locations of cluster 4 (SD p value = 0.00 against the Cellzome network) for the overlap genes. A total of 15 motifs showed a significant spatial distribution.

Because of the variability of the 5' UTR, a much tighter distribution should be obtained when looking at the relative distance between the motif and the transcription start site. However, the amount of information regarding transcription start sites is very limited in yeast.
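The idea behind a positional significance test of this kind can be sketched with a simple resampling scheme. This is not the SD score of Appendix B, whose exact definition is not reproduced here; the function name `sd_p_value`, the resampling design and the toy positions are assumptions made purely for illustration.

```python
import random
from statistics import pstdev


def sd_p_value(overlap_positions, background_positions, n_draws=2000, seed=0):
    """Empirical p-value for a tight positional distribution.

    overlap_positions:    motif positions (bp upstream of the ATG) in the
                          overlap genes.
    background_positions: list of motif positions in all genes carrying the
                          motif (must be at least as long as the first list).
    Returns the fraction of random draws of the same size whose standard
    deviation is at most the observed one.
    """
    rng = random.Random(seed)
    observed = pstdev(overlap_positions)
    n = len(overlap_positions)
    hits = sum(
        pstdev(rng.sample(background_positions, n)) <= observed
        for _ in range(n_draws)
    )
    return hits / n_draws


# Toy positions (made-up): a tight cluster versus a spread-out set.
background = list(range(0, 600, 5))
tight = [100, 102, 104, 98, 101, 99, 103, 100]
spread = [0, 75, 150, 225, 300, 450, 525, 595]
p_tight = sd_p_value(tight, background)
p_spread = sd_p_value(spread, background)
```

A small p-value indicates that the motif positions in the overlap genes are more tightly grouped than would be expected for an arbitrary set of genes carrying the same motif.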

2.5.3 Assessment of known transcription factor binding sites

The process of finding new potential motifs depends on the Teiresias parameters. However, known motifs that do not pass this initial pattern generation step can still display significant overlap scores. From a list of putative transcription factor binding sites, about 20 percent appear to have a significant overlap score for at least one of the networks. Table 2.1 shows some of the known sites that have significant overlap score(s).

The motif TATATAAA (an extended TATA box) shows a surprisingly high overlap score with the metabolic and MDS networks, even though the TATA box is present in most yeast genes. The consensus TATATAAA itself is present in only 463 genes in the yeast genome. The 71 overlap genes do not belong to any well defined functional group, but most of them code for enzymes used in basal metabolism, e.g. sugar metabolism.

Figure 2.9: Overlap network between the d(A)-d(T) motif network and the metabolic network. Essentially, the network can be clustered into two groups of genes, the first group being composed mostly of genes involved in transcription, for example tRNA synthetases and RNA polymerase subunits. The second group is composed of genes implicated in translation, such as translation initiation factors or ribosomal proteins.

[Figure 2.10: motif locations from cluster 4 in the overlap genes, drawn over the 600 bp upstream of the translation start site (ATG).]

Figure 2.10: Motif locations on the genome relative to the start codon of the overlap genes (with Cellzome). The motif is GAGATGAG (see Appendix C).

2.5.4 Inferring functionality to putative motifs

Because a sequence of fewer than 10 defined nucleotides is expected to occur at random in the genome, the occurrences of motifs corresponding to transcription factor binding sites (typically shorter than 10-mers) are not going to be limited to functional locations. Indeed, one can imagine various mechanisms (wrong context, inaccessibility of the DNA or chromatin structure, for example) by which a potential TFBS has no functionality. Therefore, the set of genes that merely have a putative motif in the upstream region may be dominated by noise that hides the subset of genes where the motif does have a functional role. This is the case for AAAATTTT, which occurs upstream of 825 yeast genes, and no apparent cluster of functionality can be derived from this set. Nevertheless, using only the overlap network derived from a functional network where the motif shows a significant score, it is possible to enrich the set with 'real' locations and, consequently, infer a possible biological function for the motif. In this case, this 'overlap' set currently includes only 106 genes, mostly genes encoding proteins involved in transcription.

Binding motif for  Literature description      Overlap genes              Consensus            Metabolic  Cellzome  MDS
MET31/32           (Blaiseau et al., 1997)     methionine biosynthesis    AAACTGTG             5.25       1.46      0.96
HAP2               (Mantovani, 1998)           oxidative phosphorylation  ACCAAT.A             6.51       0.11      1.36
GRF2/REB1          (Chasman et al., 1990)      unknown                    [TC]..[TC][TC]ACCCG  1.88       4.62      3.45
PHO4               (Hayashi and Oshima, 1991)  Met Thr Asn synthesis      CACGTG               6.36       1.95      1.94
MBP1/MBF1          (Lowndes et al., 1991)      DNA replication            ACGCGT.A             4.41       4.74      3.81
RPN4p              (Mannhaupt et al., 1999)    proteasome                 GGTGGCAAA            0.33       11.62     13.46
GCN4               (Hope and Struhl, 1985)     AA synthesis               TGACTCA              8.66       3.39      2.87
CBF1               (Dowell et al., 1992)       unknown                    TCAC.TGA             5.21       0.5       1.23
TFIID-TBP          (Struhl, 1995)              unknown                    TATATAAA             4.91       1.39      3.61

Table 2.1: Known transcription factor binding sites that have significant overlap scores. The values are the numbers of standard deviations from the mean of the random overlap scores. The overlap genes column is a functional annotation based on the annotations of the overlap genes.

This example shows that this approach, in addition to finding potential TFBS, can also be used successfully to infer a function for the motif, on the condition that the overlap network with the appropriate functional network is significant. Along the same lines, a function can be inferred for genes of unknown function if these genes are part of a significant overlap network. This is the case for the motif GGTGGCAAA studied above, where the unknown genes are most probably part of the protein degradation pathway in yeast.

2.5.5 Promoter scanning

Instead of adopting a motif-centric view, the same analysis can be done for one or a few sets of promoters. This approach is better suited to experimental biologists, who often work on a limited set of genes. The upstream region of a gene is scanned by sliding a window of variable length. The minimum motif length is 6, and the maximum motif length is determined by the number of occurrences in all upstream sequences (the sequence must occur at least twice). For each sequence defined by the window, the overlap score is then calculated using one or many functional networks. The procedure is exactly the same as in section 2.4.2, except that the nucleation set was not removed from the motif network.

[Figure 2.11: plot of the overlap score (y-axis, with 0 marked) against position, 5’ to 3’, along the 500 bp upstream of the YDR156W gene (x-axis). Series: Cellzome, MDS, metabolism. Labelled motifs: CCCGTCTA, AAAATTTT, CTCATCG, GTGGCAAAA.]

Figure 2.11: Example of promoter scanning for the yeast gene YDR156W, encoding the RNA polymerase I subunit A14. The x-axis represents the position of the window relative to the start of the studied gene (in bp). The y-axis represents the overlap score normalised at three standard deviations, so that all values less than 0 are not significant. The overlap scores for the Cellzome, MDS and metabolic networks are in blue, red and yellow respectively.

In summary, all genes that have the motif in their upstream region form the motif network, which is then overlapped with the functional network. The overlap score obtained is normalized to 3 standard deviations from the mean of all scores coming from 100 random motif networks having the same size as the real motif network.
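The normalisation against random motif networks can be illustrated with a minimal Python sketch. It rests on assumptions that are not taken from this section: the overlap score is taken here to be the number of functional-network edges whose two genes both carry the motif (the actual definition is given earlier in the chapter), and "normalised to 3 standard deviations" is read as subtracting the mean plus three standard deviations of the random scores, so that values below 0 are not significant. The function names are hypothetical.

```python
import random
import statistics


def overlap_score(motif_genes, functional_edges):
    """Assumed score: count functional edges with both genes in the motif set."""
    motif_genes = set(motif_genes)
    return sum(1 for a, b in functional_edges
               if a in motif_genes and b in motif_genes)


def normalised_overlap(motif_genes, functional_edges, all_genes,
                       n_random=100, seed=0):
    """Shift the raw score so that 0 sits at mean + 3 SD of random networks."""
    rng = random.Random(seed)
    pool = sorted(all_genes)            # random.sample needs a sequence
    size = len(set(motif_genes))        # random networks have the same size
    random_scores = [
        overlap_score(rng.sample(pool, size), functional_edges)
        for _ in range(n_random)
    ]
    threshold = (statistics.mean(random_scores)
                 + 3 * statistics.pstdev(random_scores))
    return overlap_score(motif_genes, functional_edges) - threshold
```

With this reading, a motif whose genes are densely connected in the functional network gets a positive value, while a motif whose genes share no functional edge can never score above 0.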

An example of such an analysis is shown in Figure 2.11. The example gene shown in this figure is the RNA polymerase I subunit A14 gene. In the proximal region of the promoter, two patterns have a strong overlap score: AAAATTTT and CTCATCG. A significant overlap score can also be seen for the MDS network and the KEGG network (in the case of AAAATTTT). Interestingly, motif AAAATTTT occurs 569 times, CTCATCG occurs 209 times, and the two motifs co-occur in the same upstream region 73 times. This co-occurrence is much higher than one would expect by chance (p = 8.66 × 10⁻²⁰). This result is in accordance with the now widespread view that transcription factor binding sites co-occur in functional units (Manke et al., 2003).
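As an illustration of how a co-occurrence p-value of this kind can be obtained, the sketch below computes a one-sided hypergeometric tail probability. Two points are assumptions rather than facts from the text: the test actually used for the reported p-value is not stated here, and the total number of upstream regions (`n_regions`) is not given in this section, so the value 6000 in the usage note is only a placeholder.

```python
from math import comb


def cooccurrence_pvalue(n_regions, n_a, n_b, n_both):
    """P(X >= n_both) when n_b regions are drawn from n_regions, n_a of which carry motif A."""
    total = comb(n_regions, n_b)
    upper = min(n_a, n_b)
    # Sum the hypergeometric probabilities over all overlaps at least as large
    # as the observed one; math.comb returns 0 for impossible configurations.
    tail = sum(comb(n_a, k) * comb(n_regions - n_a, n_b - k)
               for k in range(n_both, upper + 1))
    return tail / total
```

For example, `cooccurrence_pvalue(6000, 569, 209, 73)` would test the counts quoted above under the assumed total of 6000 upstream regions; the result depends on that assumed total and is not meant to reproduce the published value.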

2.5.6 Discovering cis-regulatory elements using functional networks in higher eukaryotes

This technique was applied to human, using the KEGG database as the functional network, with a negative result. All the human genes were retrieved, and the 1 kb upstream of each gene start was repeat-masked. The same procedure was applied to this new dataset, and only one motif appeared to be significant. This motif is unknown and is most probably a false positive. This negative result is not surprising, considering the much higher complexity of the human genome compared to yeast.

One obvious reason is that a significant number of regulatory regions may be several kb away from the gene start, in enhancers or locus-control regions. Furthermore, because of the high number of coding genes, the signal-to-noise ratio may be too small to produce anything significant. Yet, beyond the technical problems inherent in genome complexity, gene regulation in human probably obeys different rules than in yeast. Indeed, cells in humans are usually in a constant environment and do not need to adapt to different external conditions by expressing new pathways. Furthermore, iso-enzymes are very common in humans and, because they are expressed in very different tissues and at different times, their regulation is very different. Large-scale protein-protein interaction maps specific to a given cell type may be more suitable than the metabolic network for such an analysis. Unfortunately, to date no such large-scale study has been done on higher eukaryotes.

2.6 Conclusion

Many previous studies have shown that genes with similar expression profiles are more likely to encode interacting proteins (Ge et al., 2001). This study goes a step further by trying to use this relationship to find cis-regulatory motifs, assuming that co-regulated genes have common regulatory element(s) in their upstream regions. This approach identifies 42 potential sites that are strongly suspected to be involved in gene expression, most likely via transcriptional regulation. These correspond to some well-known motifs and to other, novel cases.

The availability of good quality functional networks is a major limiting step

for this approach, especially when considering higher eukaryotes. With the

completion of more large scale studies using new techniques, this problem

will become less prevalent, and attempts to use this technique on higher eu-

karyotes like humans can be made. Chromatin IP appears to be one of the

most promising techniques to use for this approach. Indeed, Chromatin IP

identifies regions where a particular transcription factor binds and, by exten-

sion, also identifies downstream genes that are potential targets. Applied to

many transcription factors on a genome-wide analysis, the resulting network

can be used as the functional network. Chromatin IP has been used success-

fully on yeast (Lee et al., 2002), as well as on higher eukaryotes (Li et al.,

2003).

Beyond its usefulness for cis-regulatory motif discovery, this method can also be used to infer the function of a particular motif or gene. It can also be used to refine the current understanding of functional interactions. Taking the example of the nucleotide pathway discussed above, NTPs are known to be used in many biological processes, and the types of enzymes that act upon these compounds vary greatly. Nevertheless, the overlap network between the nucleotide pathway and the MluI motif network essentially highlights CDC genes involved in the cell cycle. This refinement towards functional modules can be broadly applied to metabolic networks, protein interaction networks, or any other functional network. Potentially, each significant overlap network obtained here can be considered as a refinement of the initial network.

Now that the genomes of four other yeast-related species have been completed, combining this analysis with evolutionary information would be expected to produce even more interesting results. Indeed, one would expect an overall conservation of the functional network topology across related species.

Finally, the concept of the overlap network can be applied to more biological problems than just the discovery of cis-regulatory elements. One can imagine, for example, evaluating experimental networks relative to a reference network using the overlap score.

Chapter 3

Evolutionary dynamics of cis-regulatory regions in higher eukaryotes

As discussed in the introduction, cis-regulatory regions have very different evolutionary dynamics from coding sequences, potentially allowing insertions, deletions, translocations and inversions to be quite common. Consequently, the homology between even close species can be hard to detect and interpret, as conventional tools have often been designed for coding sequences. Promoterwise, a pair-wise alignment algorithm, was developed specifically by Dr. Ewan Birney to address these types of issues. The basic schema is represented in Figure 3.1.

The algorithm begins by locating every small ungapped match of six out of seven nucleotides. These matches are extended and merged when possible. The algorithm then uses the pair-HMM from DBA (Jareborg et al., 1999) to align the matches. The resulting hits are then sorted according to their log-odds score. The aligned regions are independent of strand direction, gap length and position in the sequence, making this procedure particularly well suited for regulatory region comparisons.
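The seeding step (ungapped matches of six out of seven nucleotides) can be sketched as follows. This is a brute-force illustration only, not the Promoterwise implementation: it compares every pair of 7-nucleotide windows on the forward strand, whereas a real implementation would use a faster lookup and would also consider the reverse strand. The function name and parameters are hypothetical.

```python
def find_seeds(seq_a, seq_b, window=7, min_matches=6):
    """Return (i, j) start positions of ungapped windows sharing >= min_matches bases."""
    seeds = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            # Count identical positions between the two windows (no gaps).
            matches = sum(
                1 for x, y in zip(seq_a[i:i + window], seq_b[j:j + window])
                if x == y
            )
            if matches >= min_matches:
                seeds.append((i, j))
    return seeds


# The two sequences used as an illustration in the Promoterwise schema.
example_seeds = find_seeds("ATGGCGGTGGGGATCCAACC", "ATGGCGGAGGCGATACATCC")
```

Seeds that lie on the same diagonal (equal `i - j`) and close to each other are the natural candidates for the extension and merging step described above.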

Promoterwise has been used first on specific examples with manual cura-

tion of the data in order to identify potential problems, and then used to

perform a systematic homology search on the whole genome. This chapter

summarises the results obtained.

[Figure 3.1: schematic of the Promoterwise procedure. Steps shown: find small ungapped matches (6 base pair matches) as a heuristic for reducing alignment time, illustrated on the sequences ATGGCGGTGGGGATCCAACC and ATGGCGGAGGCGATACATCC; extension and merging of close seeds; DNA Block Aligner; rating of the alignments (using the log-odds bit score); Promoterwise output. The DNA Block Aligner panel shows states labelled match A, match B, match C, match D, gap and unmatch, with transitions labelled match (0.65), gap (0.05), unmatch (0.99) and blockopen (0.01).]

Figure 3.1: Promoterwise schema.

3.1 Detailed analysis of a specific example: the Atonal 5 gene

A number of manual analyses were performed in collaboration with the vertebrate developmental group of Jochen Wittbrodt. These analyses are useful not only for understanding specific gene expression patterns involved in vertebrate development but also for providing insight for global comparisons. By using a well-described gene, the results can be interpreted with confidence.

3.1.1 The Atonal 5 protein

The study of the promoter of the atonal 5 gene is a collaborative project with Filippo Del Bene from the Wittbrodt lab.

The developing eye and, in particular, retinal neuron development is a good model for the study of pattern formation and cell fate determination in developing embryos, and is supported by an active international research community. Atonal 5, a basic helix-loop-helix transcription factor, is a regulator of retinal ganglion cell (RGC) development and is expressed in retinal progenitors (Vetter and Brown, 2001). The neuronal retina contains 7 different types of neural and glial cells, and atonal 5 is critical for the development of RGCs.

The genes that have been shown to be regulated by atonal5 are:

1. Delta1 gene: induces the lateral inhibition of differentiation toward the RGC fate (Schneider et al., 2001).

2. MyT1 gene: allows the cell to escape Notch inhibition and adopt the RGC fate (Schneider et al., 2001).

3. Brn3 gene: POU homeodomain transcription factor, important for RGC development and survival (Hutcheson and Vetter, 2001).

4. nAchR gene: neural nicotinic acetylcholine receptor (Hernandez et al., 1995).

Atonal5 has also been shown to regulate its own expression (Matter-Sadzinski et al., 2001).


Figure 3.2: Transgenic medaka embryo expressing a GFP construct under the medaka atonal5 promoter (5 kb). GFP is expressed in a population of neurons in the retina which project their axons to the brain, forming the two optic nerves that cross at the optic chiasm (Filippo Del Bene, personal communication).

3.1.2 The promoter of atonal5 gene

Filippo Del Bene isolated and sequenced the 5 kb region immediately upstream of the gene in medaka (Oryzias latipes) and showed experimentally that this region is sufficient to express a GFP reporter in the correct spatial and temporal pattern. Figure 3.2 shows the image of a fish embryo expressing GFP from a construct containing the 5 kb upstream sequence of the medaka atonal5 gene.

Atonal 5 is a good candidate for finding interesting regulatory motifs because it has been shown experimentally that the 5 kb upstream sequence should contain all the regulatory regions necessary for its correct spatial and temporal expression in the developing fish. This gene is also a good candidate because its product is involved in key pathways for eye formation, a process shared by most vertebrates; as such, orthologues are available from remote species such as mammals and chicken.

The first step was to find the correct homologue of the medaka atonal5 gene in all fully sequenced vertebrates currently available: human (Homo sapiens), rat (Rattus norvegicus), mouse (Mus musculus), zebrafish (Danio rerio), fugu (Fugu rubripes) and chicken (Gallus gallus). To do so, the medaka atonal5 protein sequence was searched against all the genomes using BLASTP, and the best hit was retrieved as the orthologous gene. The annotation (Ensembl IDs) and genomic location of the orthologous genes are shown in Table 3.1. These results are in accordance with the best reciprocal hits in the Ensembl-Compara database (Birney et al., 2004).

Species              Ensembl ID            Gene chromosomal location        Upstream length
Homo sapiens         ENSG00000179774       10.69883835-69885409             50983 bp
Rattus norvegicus    ENSRNOG00000000384    20.26945745-26950000             21114 bp
Mus musculus         ENSMUSG00000036816    10.62748738-62771738             23420 bp
Danio rerio          ENSDARG00000022606    ctg9353.113860-128060            14570 bp
Fugu rubripes        SINFRUG00000130186    Chr-scaffold-1775.16614-21579    5094 bp
Gallus gallus        ENSGALG00000003931    6.9845459-9845914                29799 bp
Oryzias latipes      -                     -                                2881 bp

Table 3.1: Atonal5 homologues: Ensembl IDs and locations in the Ensembl 16.0 release, apart from Gallus gallus, which was taken from the Ensembl pre-release. The upstream length is the length of the intergenic region up to the next upstream annotated gene.

A BLAST search of the 5 kb upstream region of the medaka atonal5 gene against fugu revealed a gene that has not been annotated in medaka. Once this upstream gene is excluded (assuming, therefore, that no regulatory motifs can be located within the coding sequence of the upstream gene, which is a reasonable starting hypothesis), the resulting upstream medaka sequence is believed to be 2881 bp long.

The upstream regions of the homologous genes were retrieved manually. Each corresponds to the whole non-coding region stretching from the upstream gene to the annotated gene start of the atonal5 homologue. Promoterwise was run with a number of parameter settings on all possible pairs, and each region of homology was compared manually with the other comparisons in order to find conserved sequences that overlap across at least three species.

[Figure 3.3: multiple sequence alignment of Region 1, shown in Jalview. Rows: MEDAKA_REGION, FUGU, ZEBRA, MOUSE_REGION, HUMAN, RAT, CHICKEN. Panel title: Region 1: common region upstream of the atonal5 gene in human, mouse, rat, chicken, fugu, zebrafish and medaka, located about 500 bp upstream of the annotated gene start (in human).]

Figure 3.3: The best conserved regions in the upstream region of atonal5 genes. Visualisation tool from Jalview (Clamp et al., 2004).

Regions of homology with fugu stretch up to 1636 nt upstream of the start codon of the medaka atonal5 gene. This result is in accordance with the fact that new constructs containing little more than 2 kb upstream of the gene are sufficient to trigger expression, whereas 1.5 kb seems to give a weaker result (experimental results from Filippo Del Bene).

Essentially, three regions with homology in at least two other species were found, annotated as regions 1, 2 and 3, as shown in Figure 3.3 and Figure 3.4. Region 1, represented in Figure 3.3, is the most proximal to the start of the atonal5 gene in medaka (about 500 base pairs away) and the most conserved, as it is found in all six species studied. Because of the long divergence time (450 million years) between mammals and fish, only the strictly essential motif is presumed to have been conserved; this includes the motif CCACCTG, which is repeated twice with a conserved gap of three nucleotides. This alignment also included chicken.

This very well conserved sequence is a good candidate for a transcription factor binding site, possibly the atonal5 binding site itself, since the product of the atonal 5 gene regulates the expression of its own gene (Matter-Sadzinski et al., 2001). Flanking this motif are a conserved AG-rich region (possibly an SP1 site) upstream and a putative TATAA box downstream. Interestingly, the distance between the putative motif and the TATAA box is either 19 (mammals), 14 (medaka, fugu) or 9 (zebrafish); these distances differ from one another by multiples of 5, which corresponds to roughly a half turn of the DNA helix (chicken is the exception).

Downstream of the putative TATAA box is a conserved CT-rich region (possibly an SP1 site as well). It is interesting to note that the whole region is flanked by two putative SP1 sites that are reverse complements of each other and could therefore be involved in the secondary structure of the DNA.

The annotated gene start is located about 500 bp away from the TATAA box, and this distance is more or less conserved between all the species studied.

Regions 2 and 3, represented in Figure 3.4, are common only to the fish species and are located about 1150 and 1500 bp away from the start of the medaka atonal5 gene respectively, much further upstream than region 1. Because no mammalian sequences were included, the resolution is not as good as for region 1.

3.1.3 The Atonal5 motif

If the motif CCACCTG is the binding site for the atonal 5 protein, genes that also have this conserved motif may be target genes of the transcription factor atonal 5.

The next step was, therefore, to find other genes that have the motif CCACCTG, or its reverse complement, in an upstream region that is conserved throughout the human, mouse, rat and fish orthologues. To find such cases, a simple pattern matching program was developed. This program fetched all the orthologues of human genes in mouse, rat, fugu and zebrafish and retrieved the 5 kb upstream sequences (see the following section for more details). Promoterwise was then run on the mammalian orthologous pairs. To be selected, a gene needs to have:

1. the motif in a conserved upstream region (Promoterwise bit score > 25; see the next section for the justification of this cut-off) when considering the human-mouse and human-rat comparisons.

2. the motif in the orthologous intergenic region of at least one fish (fugu or zebrafish), independent of the alignment information.

[Figure 3.4: multiple sequence alignments of Region 2 and Region 3, shown in Jalview. Rows: D_rerio, F_rubripes, O_latipes. Panel titles: Region 2: common region upstream of the Atonal5 gene only in zebrafish, fugu and medaka, located about 2 kb upstream of the annotated gene start (in fugu). Region 3: common region upstream of the Atonal5 gene only in zebrafish, fugu and medaka, located about 2.6 kb upstream of the annotated gene start (in fugu).]

Figure 3.4: The two medaka regulatory regions common with fugu and zebrafish. Visualisation tool from Jalview (Clamp et al., 2004).

No conservation information is used for fish because alignments of functional regions are at the limit of detectability in mammal-fish comparisons. Indeed, the conserved region 1 in the atonal5 example had a bit score of only 16 in the human-fugu pair-wise comparison, a score that can often occur by chance.
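The two selection criteria can be sketched in a few lines of Python. This is not the original pattern matching program, whose code is not shown; the function names and the input layout are assumptions: conserved mammalian regions are passed as lists of (bit score, human sequence of the aligned block) pairs, and fish upstream regions as plain sequences.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def motif_positions(sequence, motif="CCACCTG"):
    """0-based start positions of the motif or its reverse complement."""
    sequence = sequence.upper()
    targets = {motif, reverse_complement(motif)}
    width = len(motif)
    return [i for i in range(len(sequence) - width + 1)
            if sequence[i:i + width] in targets]


def is_candidate(human_mouse_blocks, human_rat_blocks, fish_upstreams,
                 motif="CCACCTG", min_bits=25):
    """Apply criterion 1 (conserved in both mammalian pairs) and criterion 2 (any fish)."""
    def motif_in_conserved(blocks):
        return any(bits > min_bits and bool(motif_positions(seq, motif))
                   for bits, seq in blocks)

    mammals_ok = (motif_in_conserved(human_mouse_blocks)
                  and motif_in_conserved(human_rat_blocks))
    fish_ok = any(motif_positions(seq, motif) for seq in fish_upstreams)
    return mammals_ok and fish_ok
```

Note that the fish criterion deliberately ignores alignment scores, mirroring the reasoning above that mammal-fish alignments of functional regions are at the limit of detectability.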

A total of 128 candidates were identified using the motif CCACCTG (or its reverse complement). Figure 3.5 shows a few examples where the data were manually curated and aligned across the species in which the motif is conserved. In all these examples, no significant alignments were detected in the mammal-fish comparisons; only the motif was present. These genes are also known to be expressed in retinal ganglion cells.

To validate the hypothesis that the CCACCTG motif is the binding site for atonal5, the known targets of atonal 5 (Delta1, MyT1, Brn3 and nAchR; see section 3.1.1) were manually checked for the conserved motif within the 5 kb upstream of the genes. The result is shown in Figure 3.6. Because these genes were not found by the automated procedure, the motif cannot match the consensus CCACCTG in at least one of the species. This is the case for Brn3, where only the human site differs from the consensus: it is CCACCTC (reverse complement) rather than CCACCTG. In the case of Delta1, the consensus sequence CCACCTG is replaced by GCACCTG in all species. The two other target genes of atonal5 (MyT1 and nAchR) do not seem to possess similar motifs within 5 kb upstream of the genes.

3.1.4 Experimental validations

Filippo Del Bene (EMBL) confirmed experimentally both the predicted binding site for atonal and some of the predicted candidate genes. The binding site was confirmed both in vitro and in vivo to be CCACCTG. EMSAs (Electrophoretic Mobility Shift Assays) were performed on the region containing the two wild-type motifs (see Figure 3.3). A similar assay was performed on mutants in which the motif was changed. Ath5 binds only the wild-type motif and not the mutant in which the E-box was altered. An in vivo assay using GFP expression confirmed the in vitro result.

[Figure 3.5: three multiple sequence alignment panels, each preceded by a description of the gene.
dlx2 (rows: human, mouse, rat, zebra): dlx2 is known to be co-expressed with nBrn3 and seems to be involved in defining the retinal ganglion and inner nuclear layers of the developing and adult mouse retina (de Melo et al., 2003).
rx (rows: human, rat, mouse, fugu): The gene encoding the Rx/rax transcription factor (Casarosa et al., 1997) belongs to a subfamily of the paired-like homeobox genes (Galliot et al., 1999). A previous report showed that RX was able to define the retina-diencephalon territory in the anterior neural plate (Andreazzoli et al., 1999).
slit1 (rows: human, mouse, rat, Fugu): Slit1 is expressed in retinal ganglion cells and is responsible for regulating axon guidance and cell migration (Plump et al., 2002).]

Figure 3.5: Examples of candidate genes that possess the conserved consensus motif CCACCTG within 5 kb upstream in mammals and in at least one fish.

[Figure 3.6: two multiple sequence alignment panels.
Delta1 (rows: ENSG00000112577, ENSMUSG00000014773, ENSRNOG00000014667, ENSDARG00000020219, SINFRUG00000146981, SINFRUG00000149486).
Brn3 (rows: H_sapiens, M_musculus_A, M_musculus_B, R_norvegicus_A, R_norvegicus_B, D_rerio).]

Figure 3.6: Motifs upstream of the known Atonal 5 target genes Brn3 and Delta1. MyT1 and nAchR, the other two targets of Atonal5, do not possess similar motifs.

Thirty predicted direct target genes were tested for their expression pattern in the fish retina and compared to the expression pattern of atonal itself. Twenty of these candidates show an expression pattern similar to that of atonal. In other words, more than 60% of the predicted genes were confirmed to be co-expressed with atonal.

3.1.5 Conclusion regarding this example

Aligning non-coding regions accurately and distinguishing cis-regulatory elements from the background noise is a much harder problem than aligning coding sequences. Nevertheless, an alignment procedure adapted to non-coding DNA, such as Promoterwise, combined with a careful manual analysis of the results obtained, is a powerful strategy that can give impressive results. Indeed, for the atonal 5 regulatory region, the conserved site CCACCTG in region 1 is almost certainly a binding site for a transcription factor, possibly atonal5. When a putative transcription factor binding site is identified, it can be used to screen the entire genome to find potential candidate genes that may be regulated by the same factor.

When applying this procedure it is important to consider, in addition to the biological issues discussed in the introduction, the following technical issues:

1. Wrong gene annotation or wrong orthology mapping, leading to the comparison of two regions that are not related.

2. Missing exons or an unannotated 5’ UTR in one or both members of an orthologous pair, leading either to two unrelated regions or to related regions that are not upstream of the gene of interest but are rather exonic or intronic sequences.

3. Unannotated upstream genes, which would account for most of the signal, since genes are usually more conserved than intergenic sequences.

4. If the upstream gene is on the opposite strand to the gene studied, then the regulatory region of the upstream gene will potentially be detected as well.

[Figure 3.7: flowchart. Data sources: COMPARA 17_1, ENSMART 17_1. Boxes: get homolog relationships (many-to-many relationships); get 5000 bp upstream of genes; run RepeatMasker; remove upstream genes; run Promoterwise for each homologous pair. Species shown: H. sapiens, M. musculus, R. norvegicus, F. rubripes, D. rerio, D. melanogaster, D. pseudoobscura, A. gambiae, C. elegans, C. briggsae.]

Figure 3.7: Procedure for running Promoterwise on complete genomes.

With the continuous improvement of genome annotation these problems will become less prevalent, and one can imagine an automated procedure that would find potential transcription factor binding sites and automatically find other candidate genes that may be under similar regulation. This is the focus of the next section and the following chapter.

3.2 Global run of Promoterwise

I wished to develop a comprehensive view of regulatory conservation between the fully sequenced vertebrates. This analysis was done in order to obtain a global picture of non-coding sequence conservation in the upstream regions of genes. The schema of the procedure is shown in Figure 3.7. The species considered correspond to all the fully sequenced genomes in Ensembl (Birney et al., 2004), and the relationships were derived from the Ensembl-Compara database. In order to consider all possibilities, all relationships were used, including best reciprocal hits (BRH) and reciprocal hits based on synteny (RHS). For each orthologous pair, the 5 kb repeat-masked sequence upstream of each gene was retrieved, and the upstream gene was removed if necessary. Promoterwise was then run on these sequences.

3.2.1 Promoterwise : the algorithm

Promoterwise was developed by Dr. Ewan Birney. It is a pragmatic heuristic that seeds from small ungapped matches (6 base pairs out of 7) on both strands, extends the seeds and merges close seeds. Dynamic-programming-style routines are then used across the resulting co-linear regions. To do so, the established DNA Block Aligner (DBA; Jareborg et al., 1999) model is used for the co-linear alignment. The DBA model allows small insertions and deletions in functional regions interrupted by potentially long non-functional regions. The resulting set of DBA alignments is then resolved into one set of alignments by a simple greedy method: all the alignments are rated by their log-odds bit score, and progressively less likely alignments are accepted only if they do not use bases used by previously accepted alignments. Promoterwise has been incorporated into the Wise2 package.
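The greedy resolution step can be illustrated with a short Python sketch. It is not the Wise2 code: the tuple layout is hypothetical, ranges are half-open base coordinates, and the sketch assumes that a base counts as "used" if it is covered in either of the two sequences.

```python
def greedy_select(alignments):
    """Keep alignments by decreasing bit score, skipping any that reuse bases.

    alignments: list of (bits, (a_start, a_end), (b_start, b_end)),
    with half-open coordinate ranges on sequences A and B.
    """
    used_a, used_b = set(), set()
    accepted = []
    for bits, a_range, b_range in sorted(alignments, key=lambda x: x[0],
                                         reverse=True):
        a_bases = set(range(*a_range))
        b_bases = set(range(*b_range))
        # Reject the alignment if it touches a base already claimed
        # by a higher-scoring alignment in either sequence.
        if a_bases & used_a or b_bases & used_b:
            continue
        accepted.append((bits, a_range, b_range))
        used_a |= a_bases
        used_b |= b_bases
    return accepted
```

Because alignments are visited in order of decreasing score, the highest-scoring alignment is always kept, and each later one is kept only if it is fully disjoint from everything accepted before it.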

3.2.2 Defining the cut-off

Promoterwise systematically produces random low-scoring alignments. It is therefore important to define a cut-off score in order to distinguish alignments due to negative selection from random alignments. Two methods were developed, both of which worked well in defining this cut-off.

3.2.2.1 Percentage of positive pairs as a function of the cut-off score

The first method looks at the percentage of positive pairs, that is, the percentage of pairs that have a hit above the cut-off, as a function of the score cut-off. If we assume that significant and non-significant hits have distinct score distributions, and that a number of orthologous pairs are wrong or do not possess related upstream sequences, then we expect to see a drastic drop in positive pairs once the cut-off allows mostly significant hits only.

The results shown in Figure 3.8 follow this expectation. The drop in positive pairs depends largely on the species considered but, overall, it occurs at a cut-off of around 25 bits.
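The curve of positive pairs against the cut-off can be computed as in the minimal sketch below. The input format is an assumption: one best Promoterwise bit score per orthologous pair, with `None` for pairs that produced no hit at all.

```python
def percent_positive(best_scores, cutoffs):
    """Percentage of pairs whose best hit scores above each cut-off.

    best_scores: best bit score per orthologous pair (None if no hit).
    cutoffs: iterable of score cut-offs to evaluate.
    """
    n_pairs = len(best_scores)
    if n_pairs == 0:
        raise ValueError("at least one orthologous pair is required")
    return {
        cutoff: 100.0 * sum(1 for s in best_scores
                            if s is not None and s > cutoff) / n_pairs
        for cutoff in cutoffs
    }
```

Plotting the returned percentages against the cut-offs for each species pair gives a curve of the kind described above, where a sharp drop suggests the score at which random hits stop dominating.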

3.2.2.2 Strand conservation as a function of the cut-off score

If we assume that homologous regions between two species also retain their strand direction most of the time, the overall fraction of same-strand hits as a function of the bit score of the alignment is a good indicator of the fraction of homologous alignments versus random alignments. The result is shown in Figure 3.9.

[Figure 3.8: line plot titled "percentage of upstream region that have an hit function of the score cutoff". x-axis: score cutoff (0 to 140); y-axis: percentage of upstream regions (0 to 100). Series: human-fugu, human-mouse, mouse-rat.]

Figure 3.8: Positive upstream regions as a function of the score cut-off.

The fraction of same-strand hits as a function of the bit score of the alignment depends on the two species compared, but reaches a plateau very close to 1 at scores higher than 40 bits in most cases.
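The strand-conservation measurement can be sketched as a simple binning of hits by bit score. The input layout (one `(bits, same_strand)` pair per hit) and the bin width are assumptions made for illustration.

```python
from collections import defaultdict


def same_strand_fraction(hits, bin_width=10):
    """Fraction of same-strand hits per score bin.

    hits: iterable of (bits, same_strand) pairs, where same_strand is True
    when both aligned regions lie on the same strand.
    Returns {bin_start: fraction}, ordered by increasing score.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [same-strand hits, all hits]
    for bits, same_strand in hits:
        bin_start = int(bits // bin_width) * bin_width
        counts[bin_start][0] += bool(same_strand)
        counts[bin_start][1] += 1
    return {b: same / total for b, (same, total) in sorted(counts.items())}
```

If low-scoring hits are random, their fraction should sit near 0.5, and the score at which the fraction approaches 1 gives a second, independent estimate of a sensible cut-off.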

A notable exception is the comparison between rat and mouse upstream re-

gions, where a significant fraction of hits with bit score higher than 40 are still

on opposite strands relative to each other. This is very interesting and one

hypothesis to explain such an observation is that rat and mouse are so closely

related that (a) non-functional regions are still alignable, and (b) very often

these non-functional regions are inverted in one species without negative se-

lection. Further work needs to be done to assess the dynamics of inversion of

non-functional regions, but if this hypothesis is true it means that the rate of

inversion is quite high for non-functional regions in non-coding sequences, but

that these inversions are negatively selected in functional non-coding regions.

Inversion of well conserved regions between more remote species like human

and rodents is a rare event but does occur as Figure 3.10 shows.
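The same-strand fraction discussed in this section can be computed along the following lines. This is a sketch under stated assumptions: hits are given as (bit score, query strand, target strand) tuples, and scores are grouped into fixed-width bins. The binning and all names are illustrative choices, not taken from the original analysis.

```python
from collections import defaultdict


def same_strand_fraction(hits, bin_width=10):
    """Fraction of hits with both sequences on the same strand, per score bin.

    hits: iterable of (bit_score, query_strand, target_strand), strands as +1/-1
          (hypothetical input format).
    Returns {bin_start: fraction}, ordered by increasing bin start.
    """
    same = defaultdict(int)
    total = defaultdict(int)
    for score, query_strand, target_strand in hits:
        bin_start = int(score // bin_width) * bin_width
        total[bin_start] += 1
        if query_strand == target_strand:
            same[bin_start] += 1
    return {b: same[b] / total[b] for b in sorted(total)}


if __name__ == "__main__":
    toy_hits = [(5, 1, 1), (7, 1, -1), (45, 1, 1), (48, -1, -1)]  # made-up values
    print(same_strand_fraction(toy_hits))
```

For random alignments the fraction is expected to be close to 0.5, and for homologous alignments close to 1, which is the behaviour described above.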


[Figure 3.9 plot: ratio of forward-forward versus forward-reverse hits (y-axis, 0.3 to 1) as a function of the score (x-axis, 0 to 100). Plot title: effect of the score on the proportion of reverse complement hits. Series: D. melanogaster - A. gambiae; C. elegans - C. briggsae; F. rubripes - D. rerio; H. sapiens - M. musculus; H. sapiens - R. norvegicus; M. musculus - R. norvegicus.]

Figure 3.9: Fraction of hits with both sequences on the forward strand as a function of the score. The high proportion of reverse-complemented hits at low scores is due to the fact that these hits are random. Most significant hits (score > 30) involve sequences that are both on the same strand.


Figure 3.10: Alignments between H. sapiens gene ENSG00000091527 (top) and M. musculus gene ENSMUSG00000032803 (bottom). The green boxes correspond to aligned regions of 25 or more bitscore on the plus/plus strand. The inverted region in red is about 300 bp long with an alignment score of 119.

3.2.3 Results

Taking a score of 25 as a conservative threshold for significant alignments, Table 3.2 shows the percentage of upstream regions that have at least one region with a significant score. The percentage of upstream regions with a Promoterwise score higher than 25 for H. sapiens versus M. musculus or H. sapiens versus R. norvegicus (91 million years divergence) is around 60 %. As expected, the rat-mouse comparison gives the highest percentage of upstream regions with significant Promoterwise scores (73 %).

The similarity drops dramatically when comparing mammals and fish (450 million years divergence), with 3.32 % and 3.65 % of H. sapiens genes having significant upstream region homology with F. rubripes and D. rerio respectively. The drop is even more pronounced between A. gambiae and D. melanogaster (250 million years divergence), where only 1.44 % of homologues give significant Promoterwise scores.


Mammals and fish:

               H. sapiens  R. norvegicus         M. musculus            F. rubripes          D. rerio
H. sapiens     -           14653/9838 (67.13 %)  18328/11483 (62.65 %)  10300/342 (3.32 %)   7961/291 (3.65 %)
R. norvegicus              -                     18703/13640 (72.93 %)  10456/387 (3.70 %)   8049/314 (3.90 %)
M. musculus                                      -                      10732/340 (3.17 %)   8267/276 (3.33 %)
F. rubripes                                                             -                    7566/1128 (14.9 %)
D. rerio                                                                                     -

Diptera:

                 A. gambiae  D. melanogaster
A. gambiae       -           8025/116 (1.44 %)
D. melanogaster              -

Nematodes:

             C. elegans  C. briggsae
C. elegans   -           11714/6621 (56.52 %)
C. briggsae              -

Table 3.2: Total orthologous pairs / number of pairs that have at least one region with a score higher than 25 bits (percentage of positives). All the orthologous pairs were retrieved from Ensembl Compara release 16.0.


[Figure 3.11 diagram: nested sets labeled "Hypergeometric distribution". A: all possible orthologous pairs. B: orthologous pairs with GO ID X. C: orthologous pairs with conserved sequence (Promoterwise score > cutoff). D: orthologous pairs with conserved sequence and GO ID X.]

Figure 3.11: Hypergeometric distribution used to calculate the probability of seeing D by chance.

3.2.4 Genes with conserved 5’ proximal intergenic regions

Visual inspection of the data suggests that a high number of positive upstream regions correspond to specific sets of genes. These fall into particular classes of proteins, for example transcription factors or key genes involved in developmental processes.

In order to test systematically for an enrichment of particular classes of genes among highly conserved regulatory regions between fish and mammals, all the Gene Ontology (GO) annotations (Harris et al., 2004) were mapped to Ensembl genes, and annotations that show a significant enrichment in the positive set (that is, the set of genes that have significant conservation in the upstream region) were selected. This enrichment was estimated using a hypergeometric distribution, as shown in Figure 3.11.
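The enrichment test can be written out explicitly. The sketch below uses the set sizes named in Figure 3.11 (A: all orthologous pairs, B: pairs annotated with a given GO ID, C: pairs with conserved sequence, D: pairs with both) and computes the upper tail of the hypergeometric distribution. Whether the thesis used exactly this tail definition is an assumption, and the function names are illustrative.

```python
from math import comb


def enrichment_pvalue(A, B, C, D):
    """P(X >= D) for X ~ Hypergeometric(population A, B annotated, C drawn).

    A: all orthologous pairs; B: pairs with the GO ID;
    C: pairs with conserved sequence; D: pairs with both.
    """
    if not (0 <= B <= A and 0 <= C <= A):
        raise ValueError("B and C must lie between 0 and A")
    upper = min(B, C)
    # Exact integer arithmetic; math.comb returns 0 when k > n.
    tail = sum(comb(B, k) * comb(A - B, C - k) for k in range(D, upper + 1))
    return tail / comb(A, C)


def depletion_pvalue(A, B, C, D):
    """P(X <= D), the matching test for under-represented classes."""
    if not (0 <= B <= A and 0 <= C <= A):
        raise ValueError("B and C must lie between 0 and A")
    head = sum(comb(B, k) * comb(A - B, C - k) for k in range(0, min(D, B, C) + 1))
    return head / comb(A, C)


if __name__ == "__main__":
    # Toy numbers, not taken from the thesis.
    print(enrichment_pvalue(A=10, B=4, C=3, D=2))
```

Exact integer binomials are used here so that the sketch has no dependency beyond the standard library. For genome-scale set sizes a statistics library would be the more practical choice.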


The results for conserved upstream regions in the human-mouse and human-fugu comparisons are shown in Tables 3.3 and 3.4 respectively. The positive sets in both cases contain mostly transcription factors and genes involved in development. The human-mouse positive set additionally contains genes involved in signal transduction pathways.

Conversely, the same study can be done for under-represented classes in the positive set; Table 3.5 shows the results for the human-mouse comparison. Globally, genes that code for proteins located in the ribosome, as well as olfactory genes, seem to be under-represented. No significant classes can be found in the human-zebrafish comparison.

Although one cannot rule out technical artifacts that would explain such an enrichment, such as a better mapping for certain types of genes, this result is in accordance with the common belief that key proteins involved in development, such as transcription factors, have very well conserved regulatory regions. This result also implies that the patterns of expression of transcription factors and developmental genes are, overall, conserved throughout evolution.

3.3 Conclusion

From the specific examples described in this chapter and the many cases reported in the literature, it now becomes clear that conservation of non-coding DNA can be detected by alignment-based methods for relatively close species only. As shown in this chapter, alignment methods only work on remote species for genes that tend to be very well conserved across evolution (e.g. transcription factors or key genes involved in developmental processes), which account for only 3 % of all genes. Significant alignments of related sequences usually conserve the strand direction but, in the case of the rat-mouse comparison, a significant number of alignable regions have been flipped, suggesting that inversion of possibly neutral sequences is a common process. If this is the case, one would expect the comparison between human (Homo sapiens) and chimpanzee (Pan troglodytes) to behave the same way, as the divergence time between these two species is very small.


GO category   type                 Probability   GO annotation
GO:0005578    cellular component   8.31e-09      extracellular matrix
GO:0006357    biological process   3.58e-09      regulation of transcription from Pol II promoter
GO:0001501    biological process   3.24e-10      skeletal development
GO:0007165    biological process   2.21e-10      signal transduction
GO:0007267    biological process   1.94e-10      cell-cell signaling
GO:0007399    biological process   2.80e-11      neurogenesis
GO:0005634    cellular component   2.35e-11      nucleus
GO:0007275    biological process   1.86e-11      development
GO:0006355    biological process   1.78e-11      regulation of transcription, DNA-dependent
GO:0003700    molecular function   4.02e-12      transcription factor activity

Table 3.3: Ten most significant GO category enrichments when using human-mouse conserved upstream sequences (at least one region with Promoterwise bitscore > 100).


GO category   type                 Probability   GO annotation
GO:0001501    biological process   0.000119      skeletal development
GO:0007399    biological process   7.07e-05      neurogenesis
GO:0003702    molecular function   5.06e-05      RNA polymerase II transcription factor activity
GO:0008151    biological process   3.30e-05      cell growth and/or maintenance
GO:0007345    biological process   3.04e-05      embryogenesis and morphogenesis
GO:0007507    biological process   2.05e-05      heart development
GO:0005634    cellular component   1.38e-11      nucleus
GO:0007275    biological process   1.29e-11      development
GO:0003700    molecular function   6.07e-12      transcription factor activity
GO:0006355    biological process   5.46e-12      regulation of transcription, DNA-dependent

Table 3.4: Ten most significant GO category enrichments when using human-fugu conserved upstream sequences (at least one region with Promoterwise bitscore > 25).

GO category   type                 Probability   GO annotation
GO:0003735    molecular function   2.0e-7        structural constituent of ribosome
GO:0005739    cellular component   2.4e-6        mitochondrion
GO:0005840    cellular component   1.4e-5        ribosome
GO:0004984    molecular function   1.4e-5        olfactory receptor activity

Table 3.5: GO classes that are significantly under-represented in the set of conserved upstream sequences (at least one region with Promoterwise bitscore > 100).


Conversely, a comparison of very remote species gives very good resolving

power, as the non-functional homologous sequences are usually not alignable

anymore. Personal experience has shown that the best results have been ob-

tained by using a hybrid strategy, combining alignment from Promoterwise

(when comparing relatively close species like human and mouse) with basic

motif-search techniques (when dealing with more remote species like human

and fish). This strategy can be used to obtain a number of genes that have

conserved elements across mammals and fish, indicative of functionality. In

the case of the CCACCTG motif analysed in the first part of this chapter,

this strategy produced about 120 possible target genes, while a simple search

of this motif without the conservation information retrieves virtually all the

genes in the human genome. Because genes that have the same conserved

putative regulatory site may also be under similar regulatory mechanisms,

this method may be used as a quick and cheap alternative for micro-array

analysis in retrieving candidate genes.

This strategy implies that the motif is known either by previous experimen-

tal evidence or by a careful manual study of a regulatory region of specific

genes, as has been done and described for atonal 5.

In the next chapter, the results obtained here are used to go further and

automatically propose a set of motifs, based on the fact that they are glob-

ally found significantly more often within conserved regions.

Chapter 4

Defining a mammalian dictionary of regulatory motifs

4.1 Introduction

As we have seen in the previous chapter, the success of phylogenetic foot-

printing using alignment algorithms depends largely on the species distance

and the gene considered. Transcription factor genes or genes involved in key

processes, especially during embryogenesis, show strong promoter conservation, but taxon-specific genes have no conservation in the promoter. In these cases, the simple comparative genomic approach using alignment algorithms is of no use.

Nevertheless, if the strategy is not gene-centric but rather to construct a

dictionary of regulatory elements used throughout the genome, then the sole

requirement for detection would be to have enough instances of conserved

motifs throughout the genome. In other words, a regulatory motif may be

significantly conserved even though the absolute conservation corresponds

only to a fraction of all possible cases. This is the basic approach of this

chapter.

Once the dictionary of motifs is constructed, the genome-wide distribution of

the motifs is then investigated and, based on these results, functional regions

for transcriptional control can be predicted.


4.2 Finding functional motifs

Figure 4.1 shows the schema of the procedure used to find possible motifs

that are found more often in conserved regions. The first few steps (labeled

red in the figure) were extensively studied in the previous chapter and consist

of retrieving the conserved regions in the 5 kb upstream of orthologous genes.

In this instance, because the approach is human-centric, I only considered

pair-wise comparisons between human and other species. Only alignments

that satisfy a certain cutoff were kept. The score cutoff has been also esti-

mated in the previous chapter to be around 25 bits; this is the cutoff that I

used here for intra-mammalian comparison.

Around three percent of all pair-wise comparisons gave significant alignment(s) when comparing mammals and fish. Therefore, the cutoff was lowered to a bit score of 10 for these species comparisons in order to include more orthologous pairs. The rate of false positives was expected to increase as a result, however, which raises the question of whether fish should be included in this study at all.

4.2.1 Derivation of a reliable motif dictionary

As shown in Figure 4.1, the next step was to generate all possible motifs, typically all exact boxes of 6, 7 and 8 bases (6-, 7- and 8-mers), and, for each motif, to evaluate the total occurrence in the human genome (in upstream regions of genes) and the occurrence in conserved regions. The logic behind this is that functional motifs should occur more often in conserved regions relative to their total occurrence.

Motifs that are composed of two or more boxes separated by fixed or variable distances were ignored. Furthermore, I expect the effect of overlapping structure in motifs (Robin et al., 2002) to be minimal, since what is measured is not an absolute count but a ratio between total and conserved occurrences.
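The counting step can be illustrated with a short sketch. It is not the thesis implementation: it assumes that each upstream sequence is a plain string, that conserved regions are given as 0-based half-open (start, end) intervals on that string, that intervals do not overlap, and that only the given strand is counted. All names are hypothetical.

```python
from collections import Counter

DNA = set("ACGT")


def count_kmers(sequence, ks=(6, 7, 8)):
    """Count every exact k-mer (overlapping windows) for each k in ks.

    Windows containing anything other than A, C, G or T (for example masked
    repeats written as N) are skipped.
    """
    seq = sequence.upper()
    counts = Counter()
    for k in ks:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= DNA:
                counts[kmer] += 1
    return counts


def total_and_conserved(upstream, conserved_intervals, ks=(6, 7, 8)):
    """Return (total counts, counts inside conserved intervals) for one sequence.

    A k-mer is counted as conserved only if it lies entirely inside one interval.
    """
    total = count_kmers(upstream, ks)
    conserved = Counter()
    for start, end in conserved_intervals:
        conserved.update(count_kmers(upstream[start:end], ks))
    return total, conserved


if __name__ == "__main__":
    total, conserved = total_and_conserved("ACGTACGTAA", [(0, 7)], ks=(6,))
    for motif in sorted(total):
        print(motif, total[motif], conserved[motif])
```

Summing the two counters over all human upstream regions gives, for each motif, the pair (total occurrence, conserved occurrence) that the analysis compares.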

The best signal-to-noise ratio was obtained when regions in human that have Promoterwise hits above the appropriate cutoff for at least two other species out of the four considered (mouse, rat, fugu, zebrafish) were defined as conserved. As one can expect, most of the signal came from the human-mouse and human-rat pairs, and very little from the human-fugu and/or human-zebrafish pairs. Dropping fugu and zebrafish from the analysis had, therefore, little or no effect on the result. Nevertheless, these two species proved to be very useful in identifying candidate genes for experimental confirmation (see section 4.3).

[Figure 4.1 flow chart. Box labels: COMPARA 17_1, get homolog relationships (many-to-many relationships); ENSMART 17_1, get 5000 bp upstream of genes (human, mouse, rat, fugu, zebrafish); run RepeatMasker, remove upstream genes; run Promoterwise for each homologous pair with human as query sequence (human-mouse, human-rat, human-fugu, human-zebrafish), see Chapter 3; keep only sequences with score > cut-off (intra-mammal comparisons: cut-off = 25; mammal-fish comparisons: cut-off = 10); generate all possible 6-, 7- and 8-mer motifs; run Motifwise on the human genome; total occurrence in human for each motif; conserved occurrence for each motif (defined as conserved only if present in the conserved region of two or more homologous pairs); motifs with high conservation.]

Figure 4.1: Schema of the procedure used to calculate to what degree a motif is found in conserved regions. The first part (in red) was discussed in Chapter 3.

[Figure 4.2 plots. Panel titles: Downstream region, Upstream region.]

Figure 4.2: Occurrence of all possible exact 6-, 7- and 8-mer motifs in conserved regions as a function of the total occurrence. This analysis was done upstream (right graph) and downstream (left graph) of human genes.

Figure 4.2 shows the distribution of all the possible motifs when considering

the upstream (right graph) and downstream region of genes (left graph). In

both cases, the x-axis is the total occurrence of motifs in either the upstream

or downstream sequences of human, while the y-axis represents the number

of times a motif occurs in conserved regions (as defined above) for upstream

or downstream human sequences.

In both cases the distribution of occurrence in conserved regions is a func-

tion of the total occurrences, with most of the motifs falling into a limited

range of possibilities. In the upstream regions, a significant number of motifs

show a different partition in favour of conserved locations. This represents

about 30,000 motifs; 34.8% of all possible motifs considered in that study.

A closer look at the composition of these motifs revealed the presence of at

least one CpG within the sequence for many of these motifs.

As described in the introduction, CpG is a special dinucleotide that is under-represented in mammalian genomes. To observe this under-representation, density plots of the occurrence of CpG and non-CpG motifs were generated for the upstream and downstream regions. In both cases, CpG motifs are under-represented (see Figure 4.3).

The same analysis as before was repeated, but this time CpG motifs were distinguished from non-CpG motifs. The results are shown in Figure 4.4. The distributions across conserved and all regions in human are the same for CpG and non-CpG motifs when considering the downstream region of genes. However, a striking difference exists in upstream regions only: CpG motifs tend to be found more often in conserved regions. These results are in accordance with the hypothesis that CpG islands are correlated with functional regions.

To rule out an effect of the overall C and G composition, the same analysis was repeated, but this time the criterion was the presence of at least one GpC (instead of CpG) in the motif. No difference in the distribution in conserved regions can be seen, suggesting that it is specifically the CpG dinucleotide that is correlated with the higher occurrence in conserved regions.
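Splitting the motif set by the presence of a dinucleotide is a simple string test. The sketch below shows the CpG split and the GpC control side by side. The function name is hypothetical and the motif list is a toy example built from motifs mentioned in this chapter.

```python
def split_by_dinucleotide(motifs, dinucleotide="CG"):
    """Partition motifs into (containing the dinucleotide, not containing it).

    "CG" read 5' to 3' is the CpG dinucleotide; "GC" gives the GpC control.
    """
    target = dinucleotide.upper()
    with_dinucleotide = [m for m in motifs if target in m.upper()]
    without_dinucleotide = [m for m in motifs if target not in m.upper()]
    return with_dinucleotide, without_dinucleotide


if __name__ == "__main__":
    toy_motifs = ["TGACGTCA", "TGACTCA", "GCAGCTG"]
    print("CpG split:", split_by_dinucleotide(toy_motifs, "CG"))
    print("GpC split:", split_by_dinucleotide(toy_motifs, "GC"))
```

Because CpG and GpC motifs have the same C and G content, comparing the two splits separates a composition effect from an effect specific to the CpG dinucleotide.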

Clearly, it is not possible to ignore the CpG effect. Strategies, then, need to

be found in order to circumvent the CpG evolutionary dynamic. To do so,

these two approaches were developed:

1. First, CpG and non-CpG motif counts can be considered as two distinct distributions, and outliers in both distributions can be retrieved. Outliers were defined as motifs lying above a regression line corresponding to four standard deviations from the mean conservation for each total occurrence. The two sets are then concatenated to form the final set of significant motifs.

2. Secondly, a slightly different approach can be taken by looking only at conserved regions. Indeed, in conserved regions, motifs can either be fully conserved between the two species considered or have at least one substitution and/or indel. The number of conserved occurrences as a function of the total occurrence in conserved regions can then be evaluated for each motif. The result can be seen in Figure 4.5. Using this metric, no difference can be observed between the CpG and the non-CpG motifs. Outliers, defined as above, were retrieved.

[Figure 4.3 plots: density (y-axis) of the total motif occurrence (x-axis, 0 to 7000) in downstream regions (top) and in upstream regions (bottom) of H. sapiens.]

Figure 4.3: Density function of the motif occurrence downstream (top) and upstream (bottom) for CpG (black) and non-CpG motifs (blue).

[Figure 4.4 plots. Panel titles: Downstream region, Upstream region.]

Figure 4.4: Same analysis as in Figure 4.2, but with CpG motifs labeled in green.
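The outlier criterion used in both approaches can be approximated as follows. This is a simplified stand-in and not the procedure of the thesis: instead of fitting a regression line, motifs are grouped into bins of similar total occurrence, and a motif is flagged when its conserved count exceeds the bin mean by more than four standard deviations. The bin width and all names are assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def find_outliers(counts, n_sd=4.0, bin_width=50):
    """Return motifs whose conserved count is unusually high for their total count.

    counts: dict mapping motif -> (total_occurrence, conserved_occurrence).
    A motif is an outlier if conserved > mean + n_sd * sd, where mean and sd are
    taken over all motifs in the same total-occurrence bin.
    """
    bins = defaultdict(list)
    for total, conserved in counts.values():
        bins[total // bin_width].append(conserved)
    thresholds = {b: mean(values) + n_sd * pstdev(values) for b, values in bins.items()}
    return sorted(
        motif
        for motif, (total, conserved) in counts.items()
        if conserved > thresholds[total // bin_width]
    )


if __name__ == "__main__":
    # Toy data: 30 background motifs plus one strongly conserved motif.
    toy = {f"background_{i}": (100 + i, 10) for i in range(30)}
    toy["candidate"] = (120, 100)
    print(find_outliers(toy))
```

In the first approach this would be run separately on the CpG and non-CpG motif sets and the two outlier lists concatenated. In the second approach the pair is (occurrences in conserved regions, identical occurrences in conserved regions).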

The two methodologies answer two slightly different questions. The first one finds motifs that occur more often in conserved regions (the motif itself does not need to be conserved between species). The second one finds motifs that, when found in conserved regions, have a tendency to be conserved as well. In both approaches, the outliers have a higher conservation than expected, which is the behaviour anticipated for functional sites such as transcription factor binding sites.

[Figure 4.5 plot: identical motifs in conserved regions (y-axis, 0 to 350) as a function of the total number of motifs in conserved regions (x-axis, 0 to 600). Series: motifs without CpG; motifs with CpG.]

Figure 4.5: Occurrence of conserved motifs in conserved regions as a function of the total occurrence in conserved regions. Globally, CpG motifs (in green) and non-CpG motifs have the same distribution of conserved occurrences relative to the occurrence in conserved regions.

motifs        conserved only   identical in conserved   description
SCGGAAGYG     +                +                        Elk1
CCTTTAAG      +                +                        -
AGGAAGT       +                +                        -
GGAAGTGA      +                +                        -
CCACGTGA      +                +                        E-box
AGCCAATSR     +                +                        CAAT box
CTGACGT       +                +                        AP-1 CRE
RCGTCACK      +                +                        AP-1 CRE
YCCCGCCCCC    +                +                        SP1 site (Berg, 1992)
ATGCAAAT      +                +                        -
TAATTA        +                -                        CHX10
TAATGAG       +                -                        -
GCCGGAA       +                +                        Elk1
TAAACA        +                +                        FREAC-2
CCCGGAAG      +                +                        -
GGTGAG        +                +                        -
TCACGTGA      +                +                        E-box
GATTGGT       +                +                        reverse-complement CAAT box
TTCCGCC       +                +                        -
CACGTGGG      +                +                        -
GCAGCTG       +                +                        AP-4
CCCTTTAA      +                +                        -
ATTGGCTG      +                +                        reverse-complement CAAT box
CGCAGGCG      +                +                        -
CGCGCGC       +                +                        -
CTATAAA       +                -                        Consensus sequence for TATAA box
TGACTCAG      +                +                        AP-1 TRE TF (Transfac M00174)
CACGTGAC      +                +                        E-box
CCCTCCC       +                +                        SP1 site (Berg, 1992)
GCGCATGCG     +                +                        α-PAL/NRF1
GCGCGTG       +                +                        -
GTTGCTA       +                +                        -
TGACATCA      -                +                        AP-1
AGGTCAC       -                +                        -
CCACCTGC      -                +                        E12
TGACGTCAC     +                +                        AP-1 CRE
CTCGCGAGA     +                +                        -
CCAATCAG      +                +                        CAAT box
CCATTGG       +                +                        reverse-complement CAAT box
TTCCGGT       +                +                        -
CCACGTGG      +                +                        -
TGCGCA        +                +                        -

Table 4.1: Significant motifs (+) for occurrence in conserved regions (second column) and/or for identical occurrence in conserved regions (third column). Most of the motifs are significant under both metrics. The last column is a manual annotation of the motif based on a literature search.

Outliers were analysed and are shown in Table 4.1. Some motifs are well known binding sites, as is the case for:

1. The activator protein 1 (AP-1) is a dimeric complex that can form many different combinations of heterodimers and homodimers of the JUN, FOS, ATF and MAF protein families. The main DNA response element is the TPA-responsive element (TRE, with the consensus binding site TGACTCA), but different dimers can preferentially bind to the cAMP response element (CRE, consensus binding site TGACGTCA). These transcription factors are well known in the field of oncology, as they are considered to be highly oncogenic. Both the TRE and CRE binding sites were found to be significantly more conserved in conserved regions (Eferl and Wagner, 2003).

2. GCGCATGCG corresponds to the palindromic consensus binding site (YGCGCATGCGCR) for α-PAL/NRF1 (also called NRF-1, α-PAL; α-palindrome-binding protein; nuclear respiratory factor 1). α-PAL was initially detected as a transcription factor involved in the regulation of the expression of the eukaryotic initiation factor 2 α (eIF-2α), a translation initiation factor (Jacob et al., 1989), but the motif was later found to be functional in other promoters (Drouin et al., 1997).

3. CCCTCCC and CCCGCCC are elements found ubiquitously in eukaryotic promoters and are the binding sites for the SP1 protein. Known for more than 20 years, SP1 sites were thought to be involved in basal transcription mechanisms (Dynan and Tjian, 2000), but more recent studies are challenging this perception in favour of a more complex mechanism of regulation in which many of the SP family members (Jackson et al., 1990; Black et al., 2001) and other transcription factors (BTEB-BTEB2; Nielsen et al., 1998; Sogawa et al., 1993) interact and compete for these same sites. The outcome can be either an activation or a repression of the target gene.

4. The E-box (CACGTG) is found upstream of many genes and is the binding site for the Max-Max homodimer (Blackwood and Eisenman, 1991) and for heterodimers of Max with Myc (Blackwood and Eisenman, 1991), Mad1 (Ayer et al., 1993), Mxi1 (Zervos et al., 1993), Mad3 and Mad4. All of these transcription factors are well known proto-oncogenes (Ryan and Birnie, 1997).

5. The CAAT box is found in many eukaryotic promoters, usually about 75 bp upstream of the start of transcription.

6. The ubiquitous TATAA box was reported to be found about 25-35 bp away from many transcription start sites. It is the binding site of the TATAA box binding protein (TBP), which is part of the basal transcription machinery. The motif found here has an additional cytosine 5’ of the consensus TATAA box.

Most of these elements bind either proteins involved in the basal transcrip-

tional machinery or transcription factors that have a broad range of activity.

This is due to the nature of the methodology employed, which only selects

motifs based on their overall enrichment in conserved regions or in conserva-

tion. Transcription factors that act upon a few genes would have very few

binding sites and the background of non-functional sites would hide the few


functional motifs.

Interestingly, most of the patterns show significance in both methodologies,

suggesting that these motifs are found more often in conserved regions and

are more conserved as well. A notable exception is the putative TATAA box

that is only more conserved in conserved regions.

4.2.2 Finding regions of clustered motifs on the human genome

The regulation of eukaryotic genes is complex and often involves multiple regulatory proteins that bind to the regulatory region within a relatively short distance of each other. The concept of regulatory modules composed of clusters of regulatory motifs has been highlighted in many published works that show the presence of these regulatory modules upstream of well studied genes (Arnone and Davidson, 1997; Berman et al., 2002).

The concept of modules implies that the density of cis-regulatory elements

should be higher in these regions than anywhere else on the genome. Based

on this assumption, a search was made to find transcription control regions

on the human genome using Motifwise, an algorithm developed by Ewan

Birney, to predict regions of higher density of motifs from the dictionary of

section 4.2.1.

Using the whole human genome sequence, a total of 190,593 hits were found. As a measure of how well Motifwise locates cis-regulatory regions, the distribution of hits was plotted relative to the closest annotated transcription start or end site, as shown in Figure 4.6. Most of the regions that control transcription are expected to lie close to the transcription start site. Accordingly, while no significant fraction of hits occurs around the end of the transcript, most of the Motifwise hits occur within 4 kb of the transcript start. This result is clear evidence of a biological association between the clustering of cis-regulatory sites (given by Motifwise hits) and the start of transcription.
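The evaluation described here (how many predicted regions fall near an annotated transcription start) does not depend on how Motifwise itself works and can be sketched independently. The code below is an illustration only: it assumes hits and annotated sites are given as integer coordinates on a single chromosome, ignores gene strand, and uses hypothetical function names.

```python
from bisect import bisect_left


def distance_to_nearest(position, sorted_sites):
    """Signed distance (position - site) to the closest site in a sorted, non-empty list."""
    i = bisect_left(sorted_sites, position)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = sorted_sites[max(i - 1, 0):i + 1]
    return min((position - site for site in candidates), key=abs)


def fraction_within(hit_positions, sites, window=4000):
    """Fraction of hits lying within `window` bp of the nearest annotated site."""
    sorted_sites = sorted(sites)
    hits = list(hit_positions)
    near = sum(abs(distance_to_nearest(p, sorted_sites)) <= window for p in hits)
    return near / len(hits)


if __name__ == "__main__":
    toy_starts = [1000, 10000]             # made-up transcription start sites
    toy_hits = [1500, 500, 20000, 9000]    # made-up hit positions
    print([distance_to_nearest(h, sorted(toy_starts)) for h in toy_hits])
    print(fraction_within(toy_hits, toy_starts))
```

A histogram of the signed distances, computed once against transcription starts and once against transcript ends, gives a plot of the kind described for Figure 4.6.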

To rule out the possibility of overtraining, the same analysis was done on the human genome without chromosomes 6, 20 and 22. Motifwise was then run on these left-out chromosomes. The result was identical to the one obtained using the whole genome. Another possible artifact is the CpG islands that are generally located upstream of genes. A significant number of motifs in the set contain CpG, and Motifwise may, therefore, simply detect the higher density of CpG that is common in CpG islands. To rule out this hypothesis, a Motifwise run was done using only non-CpG motifs. The result, shown in Figure 4.6, still shows an enrichment of Motifwise hits around the transcription start sites, though weaker than when using the whole motif set.

[Figure 4.6 plot: percentage of all hits (y-axis, 0 to 3) as a function of the distance in bp (x-axis, -4000 to 4000). Series: relative to gene start; relative to gene start (only non-CpG motifs); relative to gene end.]

Figure 4.6: Density of predictions by Motifwise relative to the annotated gene starts or ends.

[Figure 4.7 plot: fraction of positive regions (y-axis, 0 to 0.5) as a function of the score cutoff (x-axis, 0 to 90). Series: positive motifs upstream; Transfac motifs upstream; positive motifs downstream; Transfac motifs downstream.]

Figure 4.7: Fraction of positive regions from Motifwise using either Transfac motifs or our positive set of motifs, as a function of the score of the region. Positive regions are regions found by Motifwise that are within 4 kb of gene starts or gene ends.

Figure 4.8: Motifwise result. Ensembl detailed view showing one example of a Motifwise hit (in green) on the human genome. The gene Q96MV9 (in red) does not have any description. The hit is a few bp away from the transcription start site.

Another way of visualising the data is to evaluate the fraction of positive hits (those around the transcription start) as a function of the Motifwise score cut-off.

As a control, Transfac motifs were used in Motifwise to scan the human genome. A total of 190,593 hits were found and located relative to the gene start site. This number corresponds roughly to the total number of hits found using the conserved motif dictionary (190,425 hits).

The distribution, however, is very different. Indeed, at a cut-off of 10 bits, only 2.15 percent of the regions fall into the 4 kb window around the start/end of genes. Figure 4.7 shows the fraction of positive regions (within 4 kb of the start of a gene) for the conserved motif dictionary and for the Transfac motifs as a function of the score of the region in Motifwise. The percentage of positive regions remains very low for the Transfac motif set (less than 5 %).


gene name  motif        H. sapiens        M. musculus         R. norvegicus       F. rubripes         D. rerio
FOXM1      TCACGTGA     ENSG00000111206   ENSMUSG00000001517  ENSRNOG00000005936  SINFRUG00000122591  ENSDARG00000003200
ARF3       TCTCGCGAGA   ENSG00000134287   ENSMUSG00000022995  ENSRNOG00000013924  SINFRUG00000152889  not available
Q99JW1     CACTTCCGG    ENSG00000129968   ENSMUSG00000003346  ENSRNOG00000018212  SINFRUG00000143059  not available
Q9BU67     GTCACGTG     ENSG00000165782   ENSMUSG00000035953  ENSRNOG00000009948  SINFRUG00000151440  not available
SM31       CACGTGAC     ENSG00000184900   ENSMUSG00000020265  ENSRNOG00000001222  SINFRUG00000140807  ENSDARG00000014254
ZIC1       CACGTGAC     ENSG00000152977   ENSMUSG00000032368  ENSRNOG00000014644  SINFRUG00000141943  ENSDARG00000015567

Table 4.2: Candidate genes and the corresponding conserved motifs, with Ensembl IDs for all the species considered (Ensembl release 18).

4.3 Experimental evaluation of the methodology

In order to evaluate the methodology, candidate genes that satisfy a number of criteria were selected and analysed in detail to locate the region of conservation and possible flanking conserved regions. These candidates should have orthologues in both mammals and fish and possess, in the upstream region, a significant motif from the motif dictionary (section 4.2.1). These motifs have to be conserved in human, mouse, rat and at least fugu. Table 4.2 summarises the orthologue information for each candidate.

Ideally, these candidates should have evidence of expression in the early embryonic stage. Experimental analysis was done for all candidate genes by Marcel Souren in the group of Jochen Wittbrodt (EMBL-Heidelberg). The respective promoter regions were cloned from the Fugu rubripes genome and inserted into a reporter vector. Deletions around the identified motifs in the promoter were made for three constructs. The specific deletion constructs showed lower ubiquitous expression in all three cases. For details see Ettwiller et al. (2005).


4.3.1 The FOXM1 gene

FOXM1 is part of the Forkhead box (FOX) transcription factor family, has been implicated in both embryonic development and adult tissue homeostasis, and has known orthologues in rodents and fish. The motif common to all known orthologues, TCACGTGA, is located about 1 kb away from the coding start in fugu. The conservation of the entire region around the motif is shown in Figure 4.9 and consists essentially of three blocks of conservation. The first block is a putative reverse-complemented CAAT box, the second corresponds to the motif TCACGTGA, and the third block consists of an unknown motif.

                      10         20         30         40         50         60
H_sapiens     CCGGAATGCC GAGACAAGGC CGGCGCCGAT TGGCGACGTT CC..GTCACG TGACCTTAAC
M_musculus    CCGGAACCCG GAGACAAG-C CGGTGCCGAT TGGCGACGCT CC..GTCACG TGACCGCAAC
R_norvegicus  CCGGAACCCG GAGACAAG-C CGGTGCCGAT TGGCGACGCT CC..GTCACG TGACCGCAAC
F_rubripes    CAACGTGACC AGTTCGTCCT GCCGCC..AT TGGCCAATCC GCGGGTCACG TGACGCGGCG
D_rerio       GTGGTGGCTC CGCCCATATG TGTTCGCCAT TGGCCAC.CA CGTGATCACG TGA.......

                      70         80         90        100        110
H_sapiens     GCTCCGCCGG CGCC...... .....AATTT CAAACAGCGG AACAAACTGA
M_musculus    GCTCCGCCGG CGCC...... .....AATTT CAAACAGCGG AACAA-CTGA
R_norvegicus  GCTCCGCCGG CGCC...... .....AATTT CAAACAGCGG AACAA-CTGA
F_rubripes    AGGACGGCAG CACGCGCAGC GCC.AAAATT CAAAAACGCC TCTGAATCGC
D_rerio       ..GAGCGCAG CACCGCTCTG CGCCAAAATT CAAAATCTCA CCAAATGCTC

Figure 4.9: Foxm1 regulatory region.

4.3.2 The ARF3 gene

The ARF3 gene is part of the ADP-ribosylation factor family. ARF3 is predominantly expressed in neuronal tissues during brain development but has also been found to be expressed in all tissues (Moss et al., 1990), making this gene a good test candidate. The promoter has been reported to lack a TATA and a CAAT box (Haun et al., 1993), and the region between -58 and -17 bp upstream of the transcription start site in human has been shown to be essential for full expression of the gene. The common motif for this gene is the palindromic sequence TCTCGCGAGA, as shown in Figure 4.10; in human it is located between -58 and -17 bp of the transcription start site, consistent with the above experimental results. The whole region around the motif is well conserved across mammals but not in Fugu and zebrafish.
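The palindromic property of the motif can be checked mechanically. The following is a minimal Python sketch, assuming upper-case DNA strings; the function names are illustrative and are not part of the original analysis. It computes the reverse complement of a motif and tests whether the motif reads the same on both strands.

```python
# Illustrative helpers (hypothetical names, not taken from the thesis).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(motif):
    """Return the reverse complement of an upper-case DNA string."""
    return motif.translate(COMPLEMENT)[::-1]


def is_palindromic(motif):
    """True if the motif is identical to its own reverse complement."""
    return motif == reverse_complement(motif)


if __name__ == "__main__":
    # Motifs quoted in this section.
    for motif in ("TCTCGCGAGA", "TCACGTGA", "CACGTGAC"):
        print(motif, reverse_complement(motif), is_palindromic(motif))
```

With this definition TCTCGCGAGA is its own reverse complement, whereas a non-palindromic motif such as CACGTGAC has the reverse complement GTCACGTG, which is why both orientations have to be considered when a motif is compared between species.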


Figure 4.10: ARF3 regulatory region. Multiple alignment of the H. sapiens, R. norvegicus, M. musculus, F. rubripes and D. rerio sequences (about 50 alignment columns).

4.3.3 The Q99JW1 gene

This gene is similar to the CGI-67 protein, which has been annotated as a serine protease. The common motif for this gene in all species studied, apart from zebrafish, is CACTTCCGG. Little is known about the gene.

Figure 4.11: Q99JW1 regulatory region. Multiple alignment of the H. sapiens, M. musculus, R. norvegicus and F. rubripes sequences (about 26 alignment columns).

4.3.4 The Q9BU67 gene

The region of conservation between mammals and fish extends about a hundred base pairs around the motif GTCACGTG, with a putative conserved CAAT motif 25 bp upstream of the motif.


Figure 4.12: Q9BU67 regulatory region. Multiple alignment of four sequences (about 100 alignment columns).


4.3.5 The SM31 gene

Little is known about the function of this gene. The conserved motif is CACGTGAC, located 200 bp away from the transcription start site in Fugu (see Figure 4.13). The upstream sequence also contains two other weakly conserved regions flanking the motif. Fugu seems to have the reverse complement of the motif, but not the rest of the region.

Figure 4.13: SM31 regulatory region. Multiple alignment of the H. sapiens, M. musculus, R. norvegicus, F. rubripes and D. rerio sequences (about 50 alignment columns).

Marcel Souren, from the vertebrate developmental group of Jochen Wittbrodt, made a deletion construct of the medaka SM31 promoter, as shown in Figure 4.14. The 41 bp region that contains the CACGTGAC motif was removed and the resulting promoter (SM31del) was placed upstream of the GFP reporter gene. The whole SM31 promoter (SM31) was also constructed as a control. The constructs were transfected into medaka embryos and the transient expression of GFP was monitored at 24 and 48 hours after transfection. The SM31 construct shows a strong and uniform GFP expression across the fish embryo at both time points, whereas the SM31del construct does not show any detectable GFP expression. This result suggests that the region containing the motif CACGTGAC is required for a functional SM31 promoter.

4.3.6 The ZIC1 gene

Zic1 encodes a zinc-finger protein that is required for the development of the dorsal neural tissue. It is highly expressed in the cerebellum and developing cerebellum in human. The conserved motif CACGTGAC (reverse complemented in D. rerio) is located about 300 bp away from the annotated transcription start site and about 18 nt downstream of a conserved putative CAAT box, as shown in Figure 4.15. Downstream of the motif is a CCCTCCC region that seems to be conserved as well (putative SP1 site).


4.4 Conclusion

This chapter shows that a reliable set of cis-regulatory motifs can be retrieved by using unique statistical properties of regulatory sites in the context of comparative genomics. Indeed, functional motifs (a) are found more often in conserved regions and (b) tend to be conserved themselves. So far, these two properties have been used independently, but a combination of both may be more powerful for predicting functionality.

In any case, these results confirm on a genomic scale the popular assumption that regulatory sites are more conserved across species.

Another interesting finding is that CpG motifs are globally found more often in conserved regions, and this is characteristic of the upstream regions of genes. However, when only conserved regions are considered, CpG motifs are globally no more conserved than other motifs. This is probably due to the conservation of CpG islands across mammals.

Because of the global nature of this approach, ubiquitous cis-regulatory el-

ements like the CAAT box show strong significance. This result suggests

that the methodology may be better adapted to localise the basic promoters,

rather than specific regulatory sites that are found to be functional only on

a limited number of genes. Nevertheless, variants of this method could be

considered; for example, the use of only a subset of genes that are known to

be co-expressed in order to predict more specific transcription factor binding

sites.


Figure 4.14: Expression of the reporter gene GFP under the control of the SM31 promoter in medaka embryo. Only the construct containing the whole promoter shows a constant and uniform GFP expression, whereas the SM31del construct does not show any detectable GFP expression (from the vertebrate developmental group of Jochen Wittbrodt).


Figure 4.15: ZIC1 regulatory region. Multiple alignment of the H. sapiens, R. norvegicus, M. musculus, F. rubripes and D. rerio sequences (about 50 alignment columns).

Chapter 5

Effect of the ATG triplet on gene expression in yeast

5.1 Introduction

As seen in the introduction, gene regulation also occurs at the post-transcriptional level. In collaboration with Thomas Schlitt, I analysed the effect of an additional ATG triplet upstream of gene starts in the yeast S. cerevisiae. Additional ATG triplet(s) in the 5’ UTR can be used as the initiation codon by the scanning ribosome. Because an upstream ATG is often quickly followed by an in-frame stop codon, the mRNA can potentially be left without ribosomes over most of its length, possibly resulting in the activation of the nonsense-mediated mRNA decay (NMD) mechanism. This study was done both at the genomic level and, whenever UTR information was available, at the transcript level.

5.2 ATG codon at the genomic level

In intergenic DNA, ATG should occur randomly across the yeast genome, whereas its distribution is expected to change within genes. As translation start sites are well annotated in yeast, I studied the distribution of ATG upstream of the translation start site of all the genes in the genome.

Figure 5.1 shows the distribution of ATG around the coding ATG. The distance on the x axis is relative to the translation start site of the genes (origin), and all ATG were counted in a window of -200 to +200 bp. The ATG distribution tends to be fairly constant beyond a distance of 100 bp upstream of the coding start site. Closer to the start site, ATG tends to be under-represented, and this tendency increases as the distance from the coding start diminishes. To rule out low-complexity effects (GC/AT content), the distribution of the ATG reverse complement triplet (CAT) was also retrieved and plotted in Figure 5.1. No such negative selection can be seen on the triplet CAT. Other triplets have been tested (AGT, TGA) and, again, no such effects could be seen (data not shown). This observation was already made a few years ago in many eukaryotic and prokaryotic genomes, including the yeast S. cerevisiae (Saito and Tomita, 1999).
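The counting procedure described above can be sketched as follows. This is a minimal Python illustration and not the code used for the analysis: it assumes that each gene is given as a pair (sequence on the coding strand, 0-based index of the coding ATG), and the function names are hypothetical.

```python
from collections import Counter


def triplet_positions(sequence, start, triplet="ATG", window=200):
    """Offsets, relative to the coding start, of every occurrence of `triplet`
    within `window` bp on either side of it. The coding start itself
    (offset 0) is skipped."""
    lo = max(0, start - window)
    hi = min(len(sequence) - 3, start + window)
    offsets = []
    for pos in range(lo, hi + 1):
        if pos != start and sequence[pos:pos + 3] == triplet:
            offsets.append(pos - start)
    return offsets


def triplet_distribution(genes, triplet="ATG", window=200):
    """Number of genes-by-position occurrences of `triplet` at each offset.

    `genes` is an iterable of (sequence, start) pairs (an assumed input
    format, chosen for this sketch)."""
    counts = Counter()
    for sequence, start in genes:
        counts.update(triplet_positions(sequence, start, triplet, window))
    return counts


if __name__ == "__main__":
    toy = "AAATGCCCATGGGATGTT"  # toy sequence, coding ATG assumed at index 8
    print(triplet_positions(toy, 8, "ATG"))
    print(triplet_positions(toy, 8, "CAT"))
```

Running the same function with the triplet CAT, AGT or TGA gives the control distributions, so that the ATG profile can be compared with triplets of similar base composition.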

The average 5’ UTR in yeast has a predicted length of about 130 bp (Rogozin et al., 2001) or less, indicating that the ATG codon is negatively selected in 5’ UTRs. Figure 5.1 [A] also shows the distribution of ATG in the coding sequence (0 to +200 bp). The three distributions of ATG correspond to the three frames, with the lowest counts being the ATG in frame with the ORF (coding for methionine), the medium counts being frame 1, and the highest counts being frame 2. Methionine (codon ATG) is rarely used in proteins, and an ATG in frame 1 produces a codon TGX (X being A, T, C or G) that corresponds to tryptophan, cysteine or a stop codon, all of which are rare as well.

The next step was to include the expression information in order to analyse the effect on gene expression of the genomic distance between the first upstream ATG and the translation start site. Expression data were derived from a previous microarray analysis, of which only the absolute expression levels for wild-type yeast were used (Causton et al., 2001). The genes were split into two groups: genes whose first ATG 5’ of the start codon lies less than 50 bp upstream, and the rest. The absolute expression value for each gene was retrieved and the density distribution of these expression values was plotted for both groups. The result obtained is summarised in Figure 5.2.

The two distributions are different, with many more genes with low expression values in the close-ATG set.

Another way of looking at the data is to retrieve and average all the expression values of genes that have an ATG between 0 and 40 bp upstream of the translation start site, and to repeat the operation while moving the window up to 200 bp. To measure the significance, 100 random datasets were generated by shuffling the expression values and were compared with the real data.
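The moving-window average and its shuffled control can be sketched as below. This is an illustration only: the input format (a list of (distance of the first upstream ATG, expression value) pairs) and the step of 10 bp between windows are assumptions, since the step size is not stated in the text, and the function names are hypothetical.

```python
import random
from statistics import mean


def window_means(genes, width=40, step=10, max_distance=200):
    """Average expression of the genes whose first upstream ATG falls in each
    window [lo, lo + width). `genes` is a list of (distance, expression).
    Windows that contain no gene are omitted."""
    result = []
    for lo in range(0, max_distance - width + 1, step):
        values = [expr for dist, expr in genes if lo <= dist < lo + width]
        if values:
            result.append((lo, mean(values)))
    return result


def shuffled_window_means(genes, n_random=100, seed=0, **kwargs):
    """Window averages for `n_random` datasets in which the expression values
    are shuffled among the genes while the ATG distances are kept."""
    rng = random.Random(seed)
    distances = [dist for dist, _ in genes]
    expressions = [expr for _, expr in genes]
    profiles = []
    for _ in range(n_random):
        rng.shuffle(expressions)
        profiles.append(window_means(list(zip(distances, expressions)), **kwargs))
    return profiles
```

Because only the expression values are permuted, each random profile has the same number of genes per window as the real data, so any trend in the real profile that is absent from the random profiles can be attributed to the ATG position rather than to window occupancy.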


Figure 5.1: Distribution of ATG and the reverse complement CAT triplets upstream of the putative coding start. The number of ATG and CAT triplets (y axis: count) is counted for each relative distance (x axis: distance in bp) 5’ upstream of the start codon of all the annotated genes in yeast. [A] Distribution using a window of 200 bp upstream and downstream of the putative coding starts (0). [B] Close-up window between -200 and 0 bp away from the putative coding start.


Figure 5.2: Density distribution of expression values for genes that have a close first upstream ATG (distance < 50 bp, in black) and genes that have a distant first upstream ATG (distance > 50 bp, in blue). The x axis gives the expression value and the y axis the density.


Figure 5.3: Effect of the first upstream ATG triplet distance on expression in yeast. Moving average of the absolute expression (y axis) as a function of the position in bp of the first 5’ ATG relative to the start site (x axis) for all yeast genes, for the real data and for the random data.


The result is plotted in Figure 5.3. For the random data, the average expression is constant as a function of the ATG location, whereas the real data show a good correlation between the average expression value and the upstream ATG distance. This correlation holds up to 120 bp, which is consistent with the result of Figure 5.1 and the literature.

Clearly, the location of the first ATG upstream of the translation start site has an effect on global gene expression. Nevertheless, using genomic data restricts the interpretation of the result.

The main issue with considering genomic sequences instead of transcript information is the inability to distinguish UTRs from upstream regions; consequently, it is not possible to distinguish an ATG in the UTR from a random ATG occurring in the intergenic DNA.

5.3 ATG codon at the transcript level

UTR sequences are, therefore, valuable information for the correct interpretation of the result. However, full-length cDNA sequences in yeast are sparse, and ESTs (Expressed Sequence Tags) are not guaranteed to be full length and are biased towards 3’ ESTs. Nevertheless, considerably more ESTs are available, and the most 5’ EST of a given gene can still be informative if located in the 5’ UTR.

I mapped all available EST sequences to the yeast genome using BLAST (Altschul et al., 1990) and located the start of the most upstream EST for each gene. If it is located at least 10 bp upstream of the coding start, the resulting 5’ UTR (that is, the sequence from the start of the EST to the start of the coding sequence) is analysed for possible ATG. Using this approach, a total of 515 yeast genes have 5’ UTR sequences. This figure is a large underestimate, as most of the ESTs do not provide full-length cDNA information.
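The rule used to derive a 5’ UTR from the most upstream EST can be written down in a few lines. The sketch below is illustrative only: it assumes 0-based coordinates on the coding strand, takes the EST mapping as already done, and uses hypothetical function names.

```python
def five_prime_utr(genome_seq, est_start, cds_start, min_length=10):
    """Sequence from the start of the most upstream EST up to (but excluding)
    the coding ATG, or None when the EST starts fewer than `min_length` bp
    upstream of the coding start."""
    if cds_start - est_start < min_length:
        return None
    return genome_seq[est_start:cds_start]


def has_upstream_atg(utr):
    """True if a 5' UTR was obtained and it contains at least one ATG."""
    return utr is not None and "ATG" in utr


if __name__ == "__main__":
    toy = "GGGGCCATGCCCTAACCATGAAA"  # toy sequence, coding ATG at index 17
    utr = five_prime_utr(toy, est_start=4, cds_start=17)
    print(utr, has_upstream_atg(utr))
```

Genes for which the function returns None are simply left out of the transcript-level analysis, which is one reason why the number of usable genes is much smaller than in the genomic analysis.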

To study the effect of ATG on expression, the exact same approach as in section 5.2 was applied. The distribution of expression values was analysed for UTRs with and without ATG, and the result is summarised in Figure 5.4. Here as well, the two distributions are different, with many more genes with low expression values in the set of UTRs with ATG.

Figure 5.4: Effect of the presence of an ATG in the 5’ UTR on expression in yeast. Density function of the distribution of expression values for UTRs with (black) and without (blue) ATG; the x axis gives the expression value and the y axis the density.


This result suggests that an upstream ATG in the 5’ UTR of a transcript has a global negative effect on gene expression in the yeast S. cerevisiae. One possible mechanism, reported in the literature as a surveillance mechanism, is nonsense-mediated mRNA decay, or NMD. As seen in the introduction, this surveillance mechanism promptly removes mRNAs having a frameshift or nonsense mutation. An additional ATG upstream of the coding region can be used by the scanning ribosome, and a stop codon will be promptly reached.

5.4 The upf genes

In S. cerevisiae, three genes are required for NMD (see chapter introduction). The effect of mutating each of these genes on the transcriptome has been monitored using microarray analysis by Lelivelt and Culbertson (1999). The main result of their analysis is that mutation of UPF genes causes the accumulation of hundreds of mRNAs. This result suggests that NMD, in addition to being a surveillance mechanism, is also involved in the regulation of numerous mRNAs. One hypothesis that the authors mentioned in the discussion is the following:

“Although naturally occurring mRNA do not typically contain a prema-

ture stop codon, they could be targeted for rapid decay by an alternate

mechanism. For example, they might contain a stop codon at the end of

a translatable upstream ORF or some other sequence element that serves a

targeting function, or the normal stop codon the end of the ORF might have

the atypical property of triggering rapid decay. In any cases, it seems likely

that the Upf proteins cause changes in the abundance of naturally occurring

mRNAs through a mechanism involving mRNA decay”.

An opportunity is given here to strengthen their hypothesis by comparing the data they obtained with our ATG location information.

For all the genes, the absolute expression values in wild-type yeast and in the upf123 mutant were retrieved, and the ratio of mutant to wild type was calculated. The same analysis as in section 5.2 was done (see Figure 5.3), this time replacing the expression value by the ratio on the y axis. If the ATG location had no effect on the NMD degradation pathway, one would expect a constant average ratio independent of the ATG location.


Figure 5.5: Effect of the presence of an ATG on the ratio of upf123 mutant to UPF123 wild type. The ratio (y axis) is plotted against the distance in bp of the first upstream ATG (x axis), for upf123 mutant over wild type and for upf123 mutant over upf2 mutant.


As shown in Figure 5.5, this is clearly not the case. The upf123 mutant globally shows an increased amount of transcript for genes that have an upstream ATG less than 60 bp away from the coding start at the genomic level. In their paper, Lelivelt and Culbertson (1999) also mentioned that ’the same mRNAs respond to loss of UPF function regardless of which of the UPF genes is disrupted’. In order to test the significance of the result shown in Figure 5.5, the ratio of expression values for upf123/upf2 was calculated and is also plotted in Figure 5.5. No such increase in transcript level can be seen when comparing the upf123 and upf2 mutants.

This result suggests that the ratio difference between the mutant and the wild type for genes having an ATG within 60 bp upstream of the coding start is significant. This result also confirms the above statement by Lelivelt and Culbertson (1999). The same result was obtained when replacing the upf2 mutant with the upf1 or upf3 mutant (data not shown).

At the transcript level, the result was not as clear as with the genomic data. This is perhaps due to the limited number of 5’ UTRs available in yeast and a strong noise level in the microarray data.

5.5 Conclusion

This chapter focuses on the effect of an ATG (and therefore a potential additional ORF) upstream of the main transcript in yeast genes. At the genomic level, a strong correlation can be found between the ATG distance and expression. When only the transcripts with 5’ UTR information are studied, the same correlation can be found between expression and the presence or absence of ATG. These results suggest that a potential uORF induces a downregulation of the transcript, and the UPF data suggest that NMD could be the mechanism for such downregulation.

These predictions now need to be confirmed experimentally. I suggest site-directed mutagenesis to remove upstream ATGs, followed by analysis of the effect on expression in wild-type and upf123 mutant yeast.

As NMD is also found in human, a similar study needs to be done in higher eukaryotes.

Chapter 6

Conclusion

This thesis presents different computational methods developed to locate

cis-regulatory motifs in eukaryotes. Basically two types of biological infor-

mation have been successfully used; the first uses the information of possible

co-regulation of genes to derive a dictionary of interesting motifs, whereas

the second uses a comparative genomics approach, based on the fact that

functional regions are under negative selection. Both of these approaches

have been widely used in the literature to derive functional motifs. Never-

theless, the methods presented here have taken novel approaches that have

highlighted new aspects of the problem.

1. co-regulation : As we have seen in the introduction, the conventional approach is to group genes on the basis of similar expression profiles and then use the group of genes to derive over-represented motifs in that cluster. These methods, however, are limited since they employ a partitioning to identify co-regulation under particular experimental conditions. The computational method that I developed first identifies genes likely to be co-expressed solely because their gene products have been shown experimentally to interact or are involved in the same metabolic pathway. The second step of the method identifies all the genes that have a particular motif in the upstream region. These two sets of genes are then compared using a graph overlap approach. The motif is selected only if it has a certain non-random concordance with the functional network. This approach is novel in two ways: first, it uses pathway information to deduce co-regulation; second, it uses a graph overlap, rather than over-representation, to assess the motif.

2. comparative genomics : The approach used here is a hybrid between alignment algorithms (that can only be applied to relatively close species) and motif-based methods (that work only if enough remote species are used). Applications are numerous, from the identification of potential targets of a given transcription factor to the derivation of a motif dictionary.

6.1 Perspective and further work

Clearly, cis-regulatory elements are biologically very important, and in human the perturbation of these regions leads to cancer and numerous other diseases (Cooper, 1992). Despite their major role in gene-expression control, very few such elements have been well characterized and mapped on the genome, mainly because of their apparent low information content.

Nevertheless, the trans-regulatory elements efficiently locate these regions and recruit the transcription machinery. So why can we not accurately predict the location of such elements? Are we missing key information or do we need to decipher a complicated code? I believe there are at least two additional aspects to consider:

1. A “regulatory” code : It is clear that we have not fully understood the dynamics of how a transfactor finds the proper site and binds the DNA. Conversely, looking at the protein sequence, or the structure when available, it is not possible to deduce the binding site. This is an area of active research with some preliminary success (Benos et al., 2002). Looking at the DNA sequence, the very low information content of a typical binding site prevents any accurate prediction, but the few cases that have been well studied suggest that the coordinated binding of many transcription factors triggers the activation of the gene. Better prediction of such sites should therefore take into account the context, that is, the presence of other cis-regulatory elements relative to each other. This is not a trivial task, as the relative distances are highly variable and can be important, but preliminary work is also encouraging (Manke et al., 2003).

2. Missing information : As described in the introduction, the epigenetic state of the DNA determines the accessibility of a particular region to biological molecules, and this state does not seem to be clearly encoded in the primary sequence. The extent of the role of epigenetic factors in the binding of transcription factors is not clear, but many recent publications show a significant role (Cremer and Cremer, 2001). Accurate knowledge of the chromatin state and the location of DNA regions relative to other nucleus components could, therefore, be missing elements for a better comprehension of gene expression regulation.

Both points are the subject of many studies and I expect good progress in

that field in the next few years.

Another area of interest is the difference in gene expression between species and the phenotypic evolution that results from it. As we have seen in the literature, much work has been done on cis-regulatory region conservation across species, this Ph.D. being one of many examples. Nevertheless, considerably less effort has been devoted to the differences, which are likely to constitute an important component of phenotypic evolution. This topic is under-represented in present genomic studies, yet it is very important in many respects. Indeed, King and Wilson suggested that most of the genetic causes of phenotypic differences between humans and the great apes lie in the regulatory sequences that control the timing and pattern of genic activity (King and Wilson, 1975).

This suggestion, made almost 30 years ago, is now supported by a couple of studies that clearly show the extent of transcription factor binding site divergence between even very close species. For example, a study by Dermitzakis and Clark (2002) suggested that 32 to 40 % of known cis-regulatory regions in human are not functional in rodents.

Even between distinct populations of the same species, alteration of cis-regulatory regions seems to be widespread and results in allelic divergence in the expression level of genes. This polymorphism in the population is believed to have a profound influence on differences in disease and drug susceptibility between individuals, as well as being the primary substrate for the evolution of species. A study by Rockman et al. (2002) estimates that humans have more than 16,000 functional cis-regulatory variants, a much higher figure than for amino-acid variations. With the completion and release of the chimp genome and the systematic detection of SNP information within the human population, a lot more interesting work can be done at that level.

Evolution of cis-regulatory sites can be caused either by a gradual semi-neutral mutation that involves a single or a few nucleotide change(s), or by a drastic change due to a whole functional rearrangement. In the first case, because of the small size of transcription factor binding sites and the degeneracy of the protein-DNA recognition code, sites can be easily modified or spontaneously appear somewhere else without major consequences on the phenotype for the next generation. In contrast, a deletion, insertion or inversion of whole regions that contain regulatory sites is most probably going to have profound effects on gene expression and, if positively selected, will also have a profound effect on the population phenotype. The point to stress here is the time scale: while the first scenario may result in gradual, subtle variations, the second may lead to immediate phenotypic effects with strong selection pressure.

The first obvious question that comes to mind involves the expression pattern and timing of these particular genes in species where a large rearrangement in the regulatory region has occurred. These genes can be compared to the types of genes that have been shown to be very well conserved. Do these genes encode key proteins in the network of protein interactions or, on the contrary, do they encode peripheral proteins that are not essential for the species’ survival? One can imagine ’universal’ genes that allow for drastic changes in their promoter, suggesting hot spot elements for phenotypic variations and speciation.

Appendix A

Publications during the PhD work

1. Ettwiller L and Paten B. Guilt by Multiple Association. Heredity, 2004 Apr 7.

2. Ettwiller L, Down T, Andrews D, Paten B, Wittbrodt J, Birney E. Derivation of a reliable cis-regulatory motif dictionary from genome sequence information. Manuscript in preparation.

3. Ettwiller L, Rung J, Birney E. (2003). Discovering Novel cis-Regulatory Motifs Using Functional Networks. Genome Research, 13:883-895.

4. Ureta-Vidal A., Ettwiller L., Birney E. (2003). Comparative genomics: genome-wide analysis in metazoan eukaryotes. Nat Rev Genet., 4:251-62.

5. Bateman A., Birney E., Cerruti L., Durbin R., Etwiller L., Eddy SR., Griffiths-Jones S., Howe KL., Marshall M., Sonnhammer EL. (2002). The Pfam protein families database. Nucleic Acids Res., 30:276-80.


Appendix B

Finding regulatory motifs using functional network in yeast : material and method

B.1 Networks generation

B.1.1 Metabolic network

The KEGG database (Kanehisa, 1997) was used for this study. Only reactions linked to enzymes of the yeast S. cerevisiae were used. All reactions were considered as reversible, resulting in an undirected graph. Interactions are only represented once to avoid signal amplification. An all-versus-all BLAST (ftp://ftp.mcbi.mih.gov/blast) was performed on the upstream sequences (600 bp) of all yeast genes, and interactions that involved genes with homologous upstream sequences were removed from the network (blastn on plus/plus strands with all default parameters except for the Expectation value e, set at 0.000001). A total of 24 interactions were removed from the network. Furthermore, some metabolic compounds that are involved in many reactions were removed from the dataset. These include H2O, ATP, NAD, NADH, NADPH, NADP, ADP, CoA, O2, CO2, NH3, pyrophosphate, UDP, "Protein", "peptide" and phosphate.
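The construction of the undirected network can be illustrated with a short Python sketch. It rests on an assumption that is not stated explicitly in this section, namely that two enzyme genes are linked when their reactions share a compound that is not in the excluded list. The input format, the function name and the toy reactions are hypothetical, and the BLAST-based filtering step is not reproduced.

```python
from itertools import combinations

# Compounds excluded because they take part in many reactions (list from the text).
UBIQUITOUS = {
    "H2O", "ATP", "NAD", "NADH", "NADPH", "NADP", "ADP", "CoA", "O2", "CO2",
    "NH3", "pyrophosphate", "UDP", "Protein", "peptide", "phosphate",
}


def build_metabolic_network(reactions, excluded=UBIQUITOUS):
    """Return a set of undirected edges (frozensets of two gene names).

    `reactions` is an iterable of (gene, compounds) pairs. Two genes are
    linked if they share at least one non-excluded compound (assumption of
    this sketch). Using a set of frozensets stores each interaction once,
    whatever the direction or the number of shared compounds."""
    genes_by_compound = {}
    for gene, compounds in reactions:
        for compound in compounds:
            if compound not in excluded:
                genes_by_compound.setdefault(compound, set()).add(gene)
    edges = set()
    for genes in genes_by_compound.values():
        for gene_a, gene_b in combinations(sorted(genes), 2):
            edges.add(frozenset((gene_a, gene_b)))
    return edges
```

Storing each edge as an unordered pair is what makes the graph undirected and guarantees that an interaction supported by several reactions is represented only once.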

B.1.2 Protein interaction network

Direct protein-protein interaction data were derived from two datasets of experimental results, identified as the Cellzome (Gavin et al., 2002) and MDS (Ho et al., 2002) datasets. Both are based on a large-scale approach to systematically identify protein complexes in S. cerevisiae. As for the metabolic network, the same all-versus-all BLAST as in B.1.1 was performed on the upstream sequences of all yeast genes, and protein interactions that involved genes with homologous upstream sequences were removed from the networks. A total of 2 and 30 interactions were removed from the Cellzome and MDS networks respectively.

B.2 Pattern search

The DNA regions considered are a fixed length of 600 base pairs upstream of S. cerevisiae genes (we also tried 400 and 300 bp with more or less equivalent results). The genome data used are the S. cerevisiae strain S288C complete genome (The yeast genome directory 1997). The pattern-searching program used for this study is Teiresias (Rigoutsos and Floratos, 1998). Teiresias is a combinatorial algorithm that identifies any motifs satisfying given cutoffs. The cutoffs used here were the following: L=8, W=10, k=3 -v (for nucleation sets of fewer than 10 genes, k=4 otherwise), with L being the number of literals in the pattern, W being the maximum extent of an elementary pattern, and K, used with -v, being the minimum number of sequences in which the motif appears. The patterns obtained are therefore at least 8 defined nucleotides long with a maximum of 2 wild cards allowed.
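To make the cutoffs concrete, the sketch below checks a candidate pattern against the simplified reading given above (at least 8 defined nucleotides, at most 2 wild cards) and lists the genes whose upstream region contains it. It does not reproduce Teiresias itself: the wild card is assumed to be written as ".", and the function names, gene identifiers and sequences are hypothetical.

```python
import re


def satisfies_cutoff(pattern, min_literals=8, max_wildcards=2):
    """Simplified check: enough defined nucleotides and few enough wild cards."""
    literals = sum(1 for char in pattern if char in "ACGT")
    wildcards = pattern.count(".")
    return literals >= min_literals and wildcards <= max_wildcards


def genes_with_pattern(pattern, upstream):
    """Genes whose upstream sequence contains the pattern.

    `upstream` maps a gene name to its upstream sequence. The "." wild card
    already means "any character" in a regular expression, so the pattern
    can be compiled as it is."""
    regex = re.compile(pattern)
    return {gene for gene, sequence in upstream.items() if regex.search(sequence)}


def has_support(pattern, upstream, min_support=3):
    """True if the pattern occurs in at least `min_support` upstream regions."""
    return len(genes_with_pattern(pattern, upstream)) >= min_support
```

The set returned by genes_with_pattern is the node set of what the next section calls the pattern network.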

B.3 Overlap score

The overlap score represents the number of common edges between the initial functional network and the proposed pattern network, normalised by the number of edges connected to the considered nodes. Each common edge is counted once but divided by the total number of edges from the two nodes; in addition, the resulting total is raised to the power 0.5, as this corrects for the tendency of larger networks to produce large scores. The final form is shown in Equation B.3.

We do not count the initial seed edges which generated potential patterns in the scoring function.

$$S = \sqrt{\sum_{i} \left( \frac{1}{a_i + b_i - 1} \right)}$$

Summation is over all common edges (i) present in both networks connecting node Ai to node Bi. The denominator ai + bi − 1 is the total number of edges from both nodes (ai from Ai and bi from Bi), with the edge being scored counted only once.
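A minimal Python sketch of this score follows. The text does not state in which network the edge counts ai and bi are taken; the sketch assumes the functional network, and the function and argument names are hypothetical.

```python
import math


def overlap_score(functional_edges, pattern_edges, seed_edges=()):
    """Square root of the sum over common edges of 1 / (a_i + b_i - 1).

    Assumption: a_i and b_i are node degrees in the functional network.
    Seed edges are skipped, as they generated the candidate pattern.
    Self-loops are ignored.
    """
    functional = {frozenset(e) for e in functional_edges if len(set(e)) == 2}
    pattern = {frozenset(e) for e in pattern_edges if len(set(e)) == 2}
    seeds = {frozenset(e) for e in seed_edges}

    # Degree of every node in the functional network.
    degree = {}
    for edge in functional:
        for node in edge:
            degree[node] = degree.get(node, 0) + 1

    total = 0.0
    for edge in (functional & pattern) - seeds:
        a, b = edge
        total += 1.0 / (degree[a] + degree[b] - 1)
    return math.sqrt(total)
```

For a functional chain A-B, B-C, C-D and a pattern network containing A-B and C-D, each common edge contributes 1/(1 + 2 − 1) = 0.5, so the score is the square root of 1.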

In order to model the overlap score, random networks of the same size as the proposed pattern network were created by choosing genes at random, including the seed nodes. The overlap score is calculated in an identical manner. Other randomisation procedures were tried and produced essentially identical results. A linear relationship was observed between the number of nodes in the pattern network and the score. This relationship was fitted by linear regression.
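The randomisation and the regression can be sketched as follows. This is an illustration only: `score_fn` is a hypothetical callback that returns the overlap score for a set of genes, the default of 100 trials is an arbitrary choice not given in the text, and `z_score` reflects an assumed way of expressing an observed score in standard deviations from the random mean.

```python
import random
import statistics


def random_scores(all_genes, seed_nodes, size, score_fn, n_trials=100, rng=None):
    """Score random gene sets of a given size that always contain the seeds."""
    rng = rng or random.Random()
    seeds = set(seed_nodes)
    pool = [g for g in all_genes if g not in seeds]
    scores = []
    for _ in range(n_trials):
        nodes = seeds | set(rng.sample(pool, size - len(seeds)))
        scores.append(score_fn(nodes))
    return scores


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def z_score(observed, scores):
    """Distance of an observed score from the random mean, in sample SDs."""
    return (observed - statistics.fmean(scores)) / statistics.stdev(scores)
```

Fitting the mean random score against network size with `fit_line` gives the expected score for any pattern network size without rerunning the randomisation for each size.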

Normality was assessed using the Shapiro-Wilk test (Shapiro and Wilk, 1965). This test calculates a W statistic that tests whether a random sample of continuous values x1, x2, ..., xn comes from a normal distribution. For random networks larger than 150 nodes, the percentage of Shapiro-Wilk tests with a p value greater than 0.02 is 84 percent, 92.5 percent and 95 percent for the Cellzome, KEGG and MDS networks respectively.

B.4 Standard deviation score

Given a set of upstream regions containing a pattern a, the standard deviation of the different locations of this pattern with respect to the start codon of the genes is calculated as:

$$\sigma_a = \sqrt{\frac{\sum (X - \mu)^2}{N - 1}}$$

with σa being the standard deviation of the pattern a, N the number of sequences in the set, µ the average location with respect to the start codon and X the location. The standard deviation score is based on comparing the standard deviation for the set of N genes that comprise an overlap network with the standard deviation for a set of N random genes that have the same pattern. This comparison is done one hundred times per pattern, and a p value, called the standard deviation score, is calculated from these comparisons. This score reflects a better conservation of the upstream location of the pattern within the overlap network. It is assumed here that a real pattern should conserve its position relative to the transcription start site and that the UTR regions in yeast are about the same length for all the genes within a set.
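The comparison can be sketched in Python as an empirical p value. The text does not give the exact formula, so this sketch assumes the score is the fraction of random gene sets whose positional standard deviation is no larger than that of the overlap set; it also assumes one pattern position per gene. All names are hypothetical.

```python
import random
import statistics


def sd_score(overlap_positions, background_positions, n_trials=100, rng=None):
    """Empirical p value for positional conservation of a pattern.

    overlap_positions: pattern locations (relative to the start codon) in
        the genes of the overlap network, one per gene.
    background_positions: locations of the same pattern in all genes that
        carry it, from which random sets of equal size are drawn.
    statistics.stdev uses the N - 1 denominator of the formula above.
    """
    rng = rng or random.Random()
    observed = statistics.stdev(overlap_positions)
    n = len(overlap_positions)
    hits = 0
    for _ in range(n_trials):
        sample = rng.sample(background_positions, n)
        if statistics.stdev(sample) <= observed:
            hits += 1
    return hits / n_trials
```

Under this reading, a value close to zero means that random sets of genes carrying the pattern rarely show positions as tightly grouped as those in the overlap network.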

B.5 Pattern clustering and sequence logo generation

Clustering was based on the genomic location of the patterns. For each pattern derived, all the exact locations of its occurrence in the upstream regions of all the genes in the yeast genome were recorded. Two patterns were linked together if they shared at least 40 percent of genomic locations (exact location +/- 5 bp) for at least one pattern location profile. A final cluster contains all the patterns that are linked together (single linkage clustering). For each cluster of more than one motif, a sequence logo was then derived by retrieving all sequences in the upstream region of overlap genes that match at least one of the motifs in the cluster. The sequences obtained were then aligned and a profile logo was built, based on the information content of each position in the alignment. Appendix C shows the different clusters obtained.
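The linking rule and the single linkage step can be sketched as below. This is a hedged illustration: the representation of a location profile as a set of (gene, position) pairs and the function names are assumptions, and a small union-find structure stands in for whatever clustering code was actually used.

```python
def shared_fraction(locs_a, locs_b, tol=5):
    """Fraction of locations in locs_a matched in locs_b on the same gene
    within +/- tol bp. Locations are (gene, position) pairs."""
    if not locs_a:
        return 0.0
    by_gene = {}
    for gene, pos in locs_b:
        by_gene.setdefault(gene, []).append(pos)
    matched = sum(
        1 for gene, pos in locs_a
        if any(abs(pos - q) <= tol for q in by_gene.get(gene, ()))
    )
    return matched / len(locs_a)


def cluster_patterns(profiles, min_shared=0.4, tol=5):
    """Single linkage clustering of patterns from their location profiles.

    profiles maps a pattern name to its set of (gene, position) pairs.
    Two patterns are linked if the shared fraction reaches min_shared for
    at least one of the two profiles.
    """
    parent = {name: name for name in profiles}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    names = sorted(profiles)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            best = max(shared_fraction(profiles[a], profiles[b], tol),
                       shared_fraction(profiles[b], profiles[a], tol))
            if best >= min_shared:
                parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())
```

Because linkage is transitive, two patterns that share no locations directly can still end up in the same cluster through an intermediate pattern, which is the behaviour expected of single linkage.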

Appendix C

Yeast significant motifs

id         occ.  motifs  net.  SD KEGG  SD cell  SD MDS  function
cluster1   1941  316     MCK   5.73     14.80    15.04   transcription-translation processes
cluster2   454           MCK   8.15     0.24     5.61    unknown
cluster3   1384          MC    3.62     4.42     5.60    unknown
cluster4   359   117     MC    1.24     10.49    9.86    RNA metabolism
cluster5   413   27      MC    2.62     9.92     7.87    RNA metabolism
cluster6   599   6       MC    0.61     10.19    12.97   proteasome
cluster7   141   4       M     1.95     1.61     4.30    unknown
cluster8   52    2       M     0.97     0.62     4.02    cell cycle
cluster9   72    1       M     0.30     0.71     3.87    unknown
cluster10  156   7       MC    1.71     3.79     2.57    mRNA splicing
cluster11  62    1       M     2.01     1.38     4.98    unknown
cluster12  68    1       M     3.29     2.11     2.92    unknown
cluster13  155   5       M     1.25     0.53     3.85    cell cycle
cluster14  98    2       MC    0.47     2.06     2.58    unknown
cluster15  14    2       M     0.85     0.13     6.18    unknown
cluster16  10    2       M     1.69     0.62     4.75    unknown
cluster17  34    2       M     0.78     1.05     4.05    unknown
cluster18  24    1       M     0.88     2.28     5.77    unknown
cluster19  26    5       C     0.54     5.00     1.22    unknown
cluster20  26    1       C     0.72     4.46     2.86    unknown
cluster21  56    3       C     0.54     3.67     2.17    cell cycle
cluster22  73    3       C     0.23     4.08     1.73    unknown
cluster23  51    6       MC    0.65     4.92     3.20    proteasome
cluster24  159   10      C     0.81     5.55     2.65    unknown
cluster25  91    6       C     0.60     5.33     0.16    transcription
cluster26  89    2       C     0.56     3.72     1.58    unknown
cluster27  16    2       C     2.02     5.57     3.29    unknown
cluster28  25    3       C     1.66     5.29     1.53    unknown
cluster29  25    2       C     0.08     4.40     0.59    unknown
cluster30  26    4       C     2.08     7.14     0.86    unknown
cluster31  95    5       K     5.37     0.08     0.46    unknown
cluster32  22    2       K     3.00     2.49     1.32    unknown
cluster33  34    4       K     9.80     2.33     2.21    AA synthesis
cluster34  55    10      K     4.64     1.34     4.42    sugar metabolism
cluster35  31    3       K     8.77     0.64     1.16    ATP synthesis
cluster36  17    2       K     2.74     1.17     0.14    ethanol utilisation
cluster37  19    1       K     3.89     0.44     0.98    unknown
cluster38  34    1       K     3.17     0.04     0.11    unknown
cluster39  103   2       K     4.24     0.68     2.51    unknown
cluster40  56    2       K     6.02     2.25     0.85    unknown
cluster41  48    1       K     5.62     1.24     0.35    unknown
cluster42  25    1       K     2.73     0.13     0.04    unknown

Table C.1: Summary of all the significant motifs found using functional networks. Occ. (occurrence) is the total number of genes in the overlap network(s) derived from the relevant functional network(s) (see the net. column). Motifs is the number of motifs used to build the sequence logo. The net. (network) column shows the network in which the motif was initially found with a significant overlap score, with K = KEGG, C = Cellzome and M = MDS. The standard deviation columns SD KEGG, SD cell and SD MDS give the number of standard deviations separating the motif's overlap score from the mean of random overlap scores, for the functional networks KEGG, Cellzome and MDS respectively. The function column is a functional annotation based on the annotation of the overlap genes.

Bibliography

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). Basic local alignment search tool. J Mol Biol, 215:403–410.

Andreazzoli, M., Gestri, G., Angeloni, D., Menna, E., and Barsacchi, G. (1999). Role of Xrx1 in Xenopus eye and anterior brain development. Development, 126:2451–2460.

Aparicio, S., Morrison, A., Gould, A., Gilthorpe, J., Chaudhuri, C., Rigby, P., Krumlauf, R., and Brenner, S. (1995). Detecting conserved regulatory elements with the model genome of the Japanese puffer fish, Fugu rubripes. Proc Natl Acad Sci U S A, 92:1684–1688.

Arndt, K. and Fink, G. R. (1986). GCN4 protein, a positive transcription factor in yeast, binds general control promoters at all 5’ TGACTC 3’ sequences. Proc Natl Acad Sci U S A, 83:8516–8520.

Arnone, M. I. and Davidson, E. H. (1997). The hardwiring of development: organization and function of genomic regulatory systems. Development, 124:1851–1864.

Ayer, D. E., Kretzner, L., and Eisenman, R. N. (1993). Mad: a heterodimeric partner for Max that antagonizes Myc transcriptional activity. Cell, 72:211–222.

Bailey, T. L. and Elkan, C. (1995). The value of prior knowledge in discovering motifs with MEME. Proc Int Conf Intell Syst Mol Biol, 3:21–29.

Benos, P. V., Lapedes, A. S., and Stormo, G. D. (2002). Is there a code for protein-DNA recognition? Probab(ilistical)ly. Bioessays, 24:466–475.

Berg, J. M. (1992). Sp1 and the subfamily of zinc finger proteins with guanine-rich binding sites. Proc Natl Acad Sci U S A, 89:11109–11110.


Berman, B. P., Nibu, Y., Pfeiffer, B. D., Tomancak, P., Celniker, S. E., Levine, M., Rubin, G. M., and Eisen, M. B. (2002). Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc Natl Acad Sci U S A, 99:757–762.

Birney, E., Andrews, T. D., Bevan, P., Caccamo, M., Chen, Y., Clarke, L., Coates, G., Cuff, J., Curwen, V., Cutts, T., Down, T., Eyras, E., Fernandez-Suarez, X. M., Gane, P., Gibbins, B., Gilbert, J., Hammond, M., Hotz, H. R., Iyer, V., Jekosch, K., Kahari, A., Kasprzyk, A., Keefe, D., Keenan, S., Lehvaslaiho, H., McVicker, G., Melsopp, C., Meidl, P., Mongin, E., Pettett, R., Potter, S., Proctor, G., Rae, M., Searle, S., Slater, G., Smedley, D., Smith, J., Spooner, W., Stabenau, A., Stalker, J., Storey, R., Ureta-Vidal, A., Woodwark, K. C., Cameron, G., Durbin, R., Cox, A., Hubbard, T., and Clamp, M. (2004). An overview of Ensembl. Genome Res, 14:925–928.

Black, A. R., Black, J. D., and Azizkhan-Clifford, J. (2001). Sp1 and krüppel-like factor family of transcription factors in cell growth regulation and cancer. J Cell Physiol, 188:143–160.

Blackwood, E. M. and Eisenman, R. N. (1991). Max: a helix-loop-helix zipper protein that forms a sequence-specific DNA-binding complex with Myc. Science, 251:1211–1217.

Blaiseau, P. L., Isnard, A. D., Surdin-Kerjan, Y., and Thomas, D. (1997). Met31p and Met32p, two related zinc finger proteins, are involved in transcriptional regulation of yeast sulfur amino acid metabolism. Mol Cell Biol, 17:3640–3648.

Blanchette, M. and Tompa, M. (2002). Discovery of regulatory elements by a computational method for phylogenetic footprinting. Genome Res, 12:739–748.

Boffelli, D., McAuliffe, J., Ovcharenko, D., Lewis, K. D., Ovcharenko, I., Pachter, L., and Rubin, E. M. (2003). Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science, 299:1391–1394.


Brazma, A., Jonassen, I., Vilo, J., and Ukkonen, E. (1998). Predicting gene regulatory elements in silico on a genomic scale. Genome Res, 8:1202–1215.

Bungert, J., Dave, U., Lim, K. C., Lieuw, K. H., Shavit, J. A., Liu, Q., and Engel, J. D. (1995). Synergistic regulation of human beta-globin gene switching by locus control region elements HS3 and HS4. Genes Dev, 9:3083–3096.

Burge, C. and Karlin, S. (1997). Prediction of complete gene structures in human genomic DNA. J Mol Biol, 268:78–94.

Casarosa, S., Andreazzoli, M., Simeone, A., and Barsacchi, G. (1997). Xrx1, a novel Xenopus homeobox gene expressed during eye and pineal gland development. Mech Dev, 61:187–198.

Causton, H. C., Ren, B., Koh, S. S., Harbison, C. T., Kanin, E., Jennings, E. G., Lee, T. I., True, H. L., Lander, E. S., and Young, R. A. (2001). Remodeling of yeast genome expression in response to environmental changes. Mol Biol Cell, 12:323–337.

Chan, R. J., You, M., and Feng, G. S. (2004). Identification of trans-acting factors by electrophoretic mobility shift assay. Methods Mol Biol, 249:7–20.

Chao, K. M., Hardison, R. C., and Miller, W. (1993). Constrained sequence alignment. Bull Math Biol, 55:503–524.

Chasman, D. I., Lue, N. F., Buchman, A. R., LaPointe, J. W., Lorch, Y., and Kornberg, R. D. (1990). A yeast protein that influences the chromatin structure of UASG and functions as a powerful auxiliary gene activator. Genes Dev, 4:503–514.

Chaudhuri, A., Barbour, K. W., and Berger, F. G. (1991). Evolution of messenger RNA structure and regulation in the genus Mus: the androgen-inducible RP2 mRNAs. Mol Biol Evol, 8:641–653.

Chiang, D. Y., Moses, A. M., Kellis, M., Lander, E. S., and Eisen, M. B. (2003). Phylogenetically and spatially conserved word pairs associated with gene-expression changes in yeasts. Genome Biol, 4:R43–R43.

Clamp, M., Cuff, J., Searle, S. M., and Barton, G. J. (2004). The Jalview Java alignment editor. Bioinformatics, 20:426–427.


Cooper, D. N. (1992). Regulatory mutations and human genetic disease. Ann Med, 24:427–437.

Corpet, F. (1988). Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res, 16:10881–10890.

Cremer, T. and Cremer, C. (2001). Chromosome territories, nuclear architecture and gene regulation in mammalian cells. Nat Rev Genet, 2:292–301.

Crollius, H. R., Jaillon, O., Bernot, A., Dasilva, C., Bouneau, L., Fischer, C., Fizames, C., Wincker, P., Brottier, P., Quétier, F., Saurin, W., and Weissenbach, J. (2000). Estimate of human gene number provided by genome-wide analysis using Tetraodon nigroviridis DNA sequence. Nat Genet, 25:235–238.

Cui, Y., Hagan, K. W., Zhang, S., and Peltz, S. W. (1995). Identification and characterization of genes that are required for the accelerated degradation of mRNAs containing a premature translational termination codon. Genes Dev, 9:423–436.

de Melo, J., Qiu, X., Du, G., Cristante, L., and Eisenstat, D. D. (2003). Dlx1, Dlx2, Pax6, Brn3b, and Chx10 homeobox gene expression defines the retinal ganglion and inner nuclear layers of the developing and adult mouse retina. J Comp Neurol, 461:187–204.

Dermitzakis, E. T. and Clark, A. G. (2002). Evolution of transcription factor binding sites in Mammalian gene regulatory regions: conservation and turnover. Mol Biol Evol, 19:1114–1121.

Dieterich, C., Cusack, B., Wang, H., Rateitschak, K., Krause, A., and Vingron, M. (2002). Annotating regulatory DNA based on man-mouse genomic comparison. Bioinformatics, pages S84–S90.

Dowell, S. J., Tsang, J. S., and Mellor, J. (1992). The centromere and promoter factor 1 of yeast contains a dimerisation domain located carboxy-terminal to the bHLH domain. Nucleic Acids Res, 20:4229–4236.

Down, T. A. and Hubbard, T. J. (2002). Computational detection and location of transcription start sites in mammalian genomic DNA. Genome Res, 12:458–461.


Drouin, R., Angers, M., Dallaire, N., Rose, T. M., Khandjian, W., and Rousseau, F. (1997). Structural and functional characterization of the human FMR1 promoter reveals similarities with the hnRNP-A2 promoter region. Hum Mol Genet, 6:2051–2060.

Dubchak, I., Brudno, M., Loots, G. G., Pachter, L., Mayor, C., Rubin, E. M., and Frazer, K. A. (2000). Active conservation of noncoding sequences revealed by three-way species comparisons. Genome Res, 10:1304–1306.

Dynan, W. S. and Tjian, R. (2000). Control of eukaryotic messenger RNA synthesis by sequence-specific DNA-binding proteins. Nature, 316:774–778.

Eddy, S. R. (2001). Non-coding RNA genes and the modern RNA world. Nat Rev Genet, 2:919–929.

Eferl, R. and Wagner, E. F. (2003). AP-1: a double-edged sword in tumorigenesis. Nat Rev Cancer, 3:859–868.

Elnitski, L., Hardison, R. C., Li, J., Yang, S., Kolbe, D., Eswara, P., O’Connor, M. J., Schwartz, S., Miller, W., and Chiaromonte, F. (2003). Distinguishing regulatory DNA from neutral sites. Genome Res, 13:64–72.

Ettwiller, L., Paten, B., Souren, M., Loosli, F., Wittbrodt, J., and Birney, E. (2005). The discovery, positioning and verification of a set of transcription-associated motifs in vertebrates. Genome Biol, 6:R104–R104.

Flint, J., Tufarelli, C., Peden, J., Clark, K., Daniels, R. J., Hardison, R., Miller, W., Philipsen, S., Tan-Un, K. C., McMorrow, T., Frampton, J., Alter, B. P., Frischauf, A. M., and Higgs, D. R. (2001). Comparative genome analysis delimits a chromosomal domain and identifies key regulatory elements in the alpha globin cluster. Hum Mol Genet, 10:371–382.

Force, A., Lynch, M., Pickett, F. B., Amores, A., Yan, Y. L., and Postlethwait, J. (1999). Preservation of duplicate genes by complementary, degenerative mutations. Genetics, 151:1531–1545.

Galliot, B., de Vargas, C., and Miller, D. (1999). Evolution of homeobox genes: Q50 Paired-like genes founded the Paired class. Dev Genes Evol, 209:186–197.


Gavin, A. C., Bosche, M., Krause, R., Grandi, P., Marzioch, M., Bauer, A., Schultz, J., Rick, J. M., Michon, A. M., Cruciat, C. M., Remor, M., Höfert, C., Schelder, M., Brajenovic, M., Ruffner, H., Merino, A., Klein, K., Hudak, M., Dickson, D., Rudi, T., Gnau, V., Bauch, A., Bastuck, S., Huhse, B., Leutwein, C., Heurtier, M. A., Copley, R. R., Edelmann, A., Querfurth, E., Rybin, V., Drewes, G., Raida, M., Bouwmeester, T., Bork, P., Seraphin, B., Kuster, B., Neubauer, G., and Superti-Furga, G. (2002). Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415:141–147.

Ge, H., Liu, Z., Church, G. M., and Vidal, M. (2001). Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nat Genet, 29:482–486.

Gottgens, B., Barton, L. M., Chapman, M. A., Sinclair, A. M., Knudsen, B., Grafham, D., Gilbert, J. G., Rogers, J., Bentley, D. R., and Green, A. R. (2002). Transcriptional regulation of the stem cell leukemia gene (SCL)--comparative analysis of five vertebrate SCL loci. Genome Res, 12:749–759.

Harris, M. A., Clark, J., Ireland, A., Lomax, J., Ashburner, M., Foulger, R., Eilbeck, K., Lewis, S., Marshall, B., Mungall, C., Richter, J., Rubin, G. M., Blake, J. A., Bult, C., Dolan, M., Drabkin, H., Eppig, J. T., Hill, D. P., Ni, L., Ringwald, M., Balakrishnan, R., Cherry, J. M., Christie, K. R., Costanzo, M. C., Dwight, S. S., Engel, S., Fisk, D. G., Hirschman, J. E., Hong, E. L., Nash, R. S., Sethuraman, A., Theesfeld, C. L., Botstein, D., Dolinski, K., Feierbach, B., Berardini, T., Mundodi, S., Rhee, S. Y., Apweiler, R., Barrell, D., Camon, E., Dimmer, E., Lee, V., Chisholm, R., Gaudet, P., Kibbe, W., Kishore, R., Schwarz, E. M., Sternberg, P., Gwinn, M., Hannick, L., Wortman, J., Berriman, M., Wood, V., de la Cruz, N., Tonellato, P., Jaiswal, P., Seigfried, T., and White, R. (2004). The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res, pages D258–D261.

Haun, R. S., Moss, J., and Vaughan, M. (1993). Characterization of the human ADP-ribosylation factor 3 promoter. J Biol Chem, 268:8793–8800.

Hayashi, N. and Oshima, Y. (1991). Specific cis-acting sequence for PHO8 expression interacts with PHO4 protein, a positive regulatory factor, in Saccharomyces cerevisiae. Mol Cell Biol, 11:785–794.


Hernandez, M. C., Erkman, L., Matter-Sadzinski, L., Roztocil, T., Ballivet, M., and Matter, J. M. (1995). Characterization of the nicotinic acetylcholine receptor beta 3 gene. J Biol Chem, 270:3224–3233.

Hertz, G. Z. and Stormo, G. D. (2000). Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, 15:563–577.

Ho, Y., Gruhler, A., Heilbut, A., Bader, G. D., Moore, L., Adams, S. L., Millar, A., Taylor, P., Bennett, K., Boutilier, K., Yang, L., Wolting, C., Donaldson, I., Schandorff, S., Shewnarane, J., Vo, M., Taggart, J., Goudreault, M., Muskat, B., Alfarano, C., Dewar, D., Lin, Z., Michalickova, K., Willems, A. R., Sassi, H., Nielsen, P. A., Rasmussen, K. J., Andersen, J. R., Johansen, L. E., Hansen, L. H., Jespersen, H., Podtelejnikov, A., Nielsen, E., Crawford, J., Poulsen, V., Sørensen, B. D., Matthiesen, J., Hendrickson, R. C., Gleeson, F., Pawson, T., Moran, M. F., Durocher, D., Mann, M., Hogue, C. W., Figeys, D., and Tyers, M. (2002). Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415:180–183.

Hope, I. A. and Struhl, K. (1985). GCN4 protein, synthesized in vitro, binds HIS3 regulatory sequences: implications for general control of amino acid biosynthetic genes in yeast. Cell, 43:177–188.

Hughes, J. D., Estep, P. W., Tavazoie, S., and Church, G. M. (2000). Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol, 296:1205–1214.

Hutcheson, D. A. and Vetter, M. L. (2001). The bHLH factors Xath5 and XNeuroD can upregulate the expression of XBrn3d, a POU-homeodomain transcription factor. Dev Biol, 232:327–338.

IHGSC, Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., Funke, R., Gage, D., Harris, K., Heaford, A., Howland, J., Kann, L., Lehoczky, J., LeVine, R., McEwan, P., McKernan, K., Meldrim, J., Mesirov, J. P., Miranda, C., Morris, W., Naylor, J., Raymond, C., Rosetti, M., Santos, R., Sheridan, A., Sougnez, C., Stange-Thomann, N., Stojanovic, N., Subramanian, A., Wyman, D., Rogers, J., Sulston, J., Ainscough, R., Beck, S., Bentley, D., Burton, J., Clee, C., Carter, N., Coulson, A., Deadman, R., Deloukas, P., Dunham, A., Dunham, I., Durbin, R., French, L., Grafham, D., Gregory, S., Hubbard, T., Humphray, S., Hunt, A., Jones, M., Lloyd, C., McMurray, A., Matthews, L., Mercer, S., Milne, S., Mullikin, J. C., Mungall, A., Plumb, R., Ross, M., Shownkeen, R., Sims, S., Waterston, R. H., Wilson, R. K., Hillier, L. W., McPherson, J. D., Marra, M. A., Mardis, E. R., Fulton, L. A., Chinwalla, A. T., Pepin, K. H., Gish, W. R., Chissoe, S. L., Wendl, M. C., Delehaunty, K. D., Miner, T. L., Delehaunty, A., Kramer, J. B., Cook, L. L., Fulton, R. S., Johnson, D. L., Minx, P. J., Clifton, S. W., Hawkins, T., Branscomb, E., Predki, P., Richardson, P., Wenning, S., Slezak, T., Doggett, N., Cheng, J. F., Olsen, A., Lucas, S., Elkin, C., Uberbacher, E., Frazier, M., Gibbs, R. A., Muzny, D. M., Scherer, S. E., Bouck, J. B., Sodergren, E. J., Worley, K. C., Rives, C. M., Gorrell, J. H., Metzker, M. L., Naylor, S. L., Kucherlapati, R. S., Nelson, D. L., Weinstock, G. M., Sakaki, Y., Fujiyama, A., Hattori, M., Yada, T., Toyoda, A., Itoh, T., Kawagoe, C., Watanabe, H., Totoki, Y., Taylor, T., Weissenbach, J., Heilig, R., Saurin, W., Artiguenave, F., Brottier, P., Bruls, T., Pelletier, E., Robert, C., Wincker, P., Smith, D. R., Doucette-Stamm, L., Rubenfield, M., Weinstock, K., Lee, H. M., Dubois, J., Rosenthal, A., Platzer, M., Nyakatura, G., Taudien, S., Rump, A., Yang, H., Yu, J., Wang, J., Huang, G., Gu, J., Hood, L., Rowen, L., Madan, A., Qin, S., Davis, R. W., Federspiel, N. A., Abola, A. P., Proctor, M. J., Myers, R. M., Schmutz, J., Dickson, M., Grimwood, J., Cox, D. R., Olson, M. V., Kaul, R., Raymond, C., Shimizu, N., Kawasaki, K., Minoshima, S., Evans, G. A., Athanasiou, M., Schultz, R., Roe, B. A., Chen, F., Pan, H., Ramser, J., Lehrach, H., Reinhardt, R., McCombie, W. R., de la Bastide, M., Dedhia, N., Blöcker, H., Hornischer, K., Nordsiek, G., Agarwala, R., Aravind, L., Bailey, J. A., Bateman, A., Batzoglou, S., Birney, E., Bork, P., Brown, D. G., Burge, C. B., Cerutti, L., Chen, H. C., Church, D., Clamp, M., Copley, R. R., Doerks, T., Eddy, S. R., Eichler, E. E., Furey, T. S., Galagan, J., Gilbert, J. G., Harmon, C., Hayashizaki, Y., Haussler, D., Hermjakob, H., Hokamp, K., Jang, W., Johnson, L. S., Jones, T. A., Kasif, S., Kaspryzk, A., Kennedy, S., Kent, W. J., Kitts, P., Koonin, E. V., Korf, I., Kulp, D., Lancet, D., Lowe, T. M., McLysaght, A., Mikkelsen, T., Moran, J. V., Mulder, N., Pollara, V. J., Ponting, C. P., Schuler, G., Schultz, J., Slater, G., Smit, A. F., Stupka, E., Szustakowski, J., Thierry-Mieg, D., Thierry-Mieg, J., Wagner, L., Wallis, J., Wheeler, R., Williams, A., Wolf, Y. I., Wolfe, K. H., Yang, S. P., Yeh, R. F., Collins, F., Guyer, M. S., Peterson, J., Felsenfeld, A., Wetterstrand, K. A., Patrinos, A., Morgan, M. J., Szustakowki, J., de Jong, P., Catanese, J. J., Osoegawa, K., Shizuya, H., Choi, S., and Chen, Y. J. (2001). Initial sequencing and analysis of the human genome. Nature, 409:860–921.

Iyer, V. and Struhl, K. (1995). Poly(dA:dT), a ubiquitous promoter element that stimulates transcription via its intrinsic DNA structure. EMBO J, 14:2570–2579.

Iyer, V. R., Eisen, M. B., Ross, D. T., Schuler, G., Moore, T., Lee, J. C., Trent, J. M., Staudt, L. M., Hudson, J., Boguski, M. S., Lashkari, D., Shalon, D., Botstein, D., and Brown, P. O. (1999). The transcriptional program in the response of human fibroblasts to serum. Science, 283:83–87.

Jackson, S. P., MacDonald, J. J., Lees-Miller, S., and Tjian, R. (1990). GC box binding induces phosphorylation of Sp1 by a DNA-dependent protein kinase. Cell, 63:155–165.

Jacob, F., Perrin, D., Sanchez, C., and Monod, J. (1960). [Operon: a group of genes with the expression coordinated by an operator]. C R Hebd Seances Acad Sci, 250:1727–1729.

Jacob, W. F., Silverman, T. A., Cohen, R. B., and Safer, B. (1989). Identification and characterization of a novel transcription factor participating in the expression of eukaryotic initiation factor 2 alpha. J Biol Chem, 264:20372–20384.

Jareborg, N., Birney, E., and Durbin, R. (1999). Comparative analysis of noncoding regions of 77 orthologous mouse and human gene pairs. Genome Res, 9:815–824.

Kanehisa, M. (1997). A database for post-genome analysis. Trends Genet, 13:375–376.

Kellis, M., Patterson, N., Endrizzi, M., Birren, B., and Lander, E. S. (2003). Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature, 423:241–254.

King, M. C. and Wilson, A. C. (1975). Evolution at two levels in humans and chimpanzees. Science, 188:107–116.


Koch, K. A. and Thiele, D. J. (1999). Functional analysis of a homopolymeric (dA-dT) element that provides nucleosomal access to yeast and mammalian transcription factors. J Biol Chem, 274:23752–23760.

Koo, H. S., Wu, H. M., and Crothers, D. M. (2000). DNA bending at adenine · thymine tracts. Nature, 320:501–506.

Krawczak, M., Chuzhanova, N. A., and Cooper, D. N. (1999). Evolution of the proximal promoter region of the mammalian growth hormone gene. Gene, 237:143–151.

Lawrence, J. G. and Roth, J. R. (1996). Selfish operons: horizontal transfer may drive the evolution of gene clusters. Genetics, 143:1843–1860.

Leblanc, B. and Moss, T. (2001). DNase I footprinting. Methods Mol Biol, 148:31–38.

Lee, T. I., Rinaldi, N. J., Robert, F., Odom, D. T., Bar-Joseph, Z., Gerber, G. K., Hannett, N. M., Harbison, C. T., Thompson, C. M., Simon, I., Zeitlinger, J., Jennings, E. G., Murray, H. L., Gordon, D. B., Ren, B., Wyrick, J. J., Tagne, J. B., Volkert, T. L., Fraenkel, E., Gifford, D. K., and Young, R. A. (2002). Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, 298:799–804.

Lelivelt, M. J. and Culbertson, M. R. (1999). Yeast Upf proteins required for RNA surveillance affect global expression of the yeast transcriptome. Mol Cell Biol, 19:6710–6719.

Levy, S. and Hannenhalli, S. (2002). Identification of transcription factor binding sites in the human genome sequence. Mamm Genome, 13:510–514.

Levy, S., Hannenhalli, S., and Workman, C. (2001). Enrichment of regulatory signals in conserved non-coding genomic sequence. Bioinformatics, 17:871–877.

Li, Z., Calcar, S. V., Qu, C., Cavenee, W. K., Zhang, M. Q., and Ren, B. (2003). A global transcriptional regulatory role for c-Myc in Burkitt’s lymphoma cells. Proc Natl Acad Sci U S A, 100:8164–8169.

Loots, G. G., Locksley, R. M., Blankespoor, C. M., Wang, Z. E., Miller, W., Rubin, E. M., and Frazer, K. A. (2000). Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons. Science, 288:136–140.

Lowndes, N. F., Johnson, A. L., and Johnston, L. H. (1991). Coordination of expression of DNA synthesis genes in budding yeast by a cell-cycle regulated trans factor. Nature, 350:247–250.

Ludwig, M. Z., Bergman, C., Patel, N. H., and Kreitman, M. (2000). Evidence for stabilizing selection in a eukaryotic enhancer element. Nature, 403:564–567.

Majewski, J. and Ott, J. (2002). Distribution and characterization of regulatory elements in the human genome. Genome Res, 12:1827–1836.

Manke, T., Bringas, R., and Vingron, M. (2003). Correlating protein-DNA and protein-protein interaction networks. J Mol Biol, 333:75–85.

Mannhaupt, G., Schnall, R., Karpov, V., Vetter, I., and Feldmann, H. (1999). Rpn4p acts as a transcription factor by binding to PACE, a nonamer box found upstream of 26S proteasomal and other genes in yeast. FEBS Lett, 450:27–34.

Mantovani, R. (1998). A survey of 178 NF-Y binding CCAAT boxes. Nucleic Acids Res, 26:1135–1143.

Maquat, L. E. and Carmichael, G. G. (2001). Quality control of mRNA function. Cell, 104:173–176.

Matter-Sadzinski, L., Matter, J. M., Ong, M. T., Hernandez, J., and Ballivet, M. (2001). Specification of neurotransmitter receptor identity in developing retina: the chick ATH5 promoter integrates the positive and negative effects of several bHLH proteins. Development, 128:217–231.

Moll, T., Dirick, L., Auer, H., Bonkovsky, J., and Nasmyth, K. (1992). SWI6 is a regulatory subunit of two different cell cycle START-dependent transcription factors in Saccharomyces cerevisiae. J Cell Sci Suppl, 16:87–96.

Morgenstern, B., Frech, K., Dress, A., and Werner, T. (1998). DIALIGN: finding local similarities by multiple sequence alignment. Bioinformatics, 14:290–294.


Moss, J., Tsuchiya, M., Tsai, S. C., Adamik, R., Bobak, D. A., Price, S. R., Nightingale, M. S., and Vaughan, M. (1990). Structural and functional characterization of ADP-ribosylation factors, 20 kDa guanine nucleotide-binding proteins that activate cholera toxin. Adv Second Messenger Phosphoprotein Res, 24:83–88.

Munro, H. N., Aziz, N., Leibold, E. A., Murray, M., Rogers, J., Vass, J. K., and White, K. (1988). The ferritin genes: structure, expression, and regulation. Ann N Y Acad Sci, 526:113–123.

Needleman, S. B. and Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol, 48:443–453.

Niehrs, C. and Pollet, N. (1999). Synexpression groups in eukaryotes. Nature, 402:483–487.

Nielsen, S. J., Praestegaard, M., Jorgensen, H. F., and Clark, B. F. (1998). Different Sp1 family members differentially affect transcription from the human elongation factor 1 A-1 gene promoter. Biochem J, pages 511–517.

Parker, R. and Song, H. (2004). The enzymes and control of eukaryotic mRNA turnover. Nat Struct Mol Biol, 11:121–127.

Pearson, W. R. (1991). Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms. Genomics, 11:635–650.

Plump, A. S., Erskine, L., Sabatier, C., Brose, K., Epstein, C. J., Goodman, C. S., Mason, C. A., and Tessier-Lavigne, M. (2002). Slit1 and Slit2 cooperate to prevent premature midline crossing of retinal axons in the mouse visual system. Neuron, 33:219–232.

Rigaut, G., Shevchenko, A., Rutz, B., Wilm, M., Mann, M., and Séraphin, B. (1999). A generic protein purification method for protein complex characterization and proteome exploration. Nat Biotechnol, 17:1030–1032.

Rigoutsos, I. and Floratos, A. (1998). Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm. Bioinformatics, 14:55–67.


Robin, S., Daudin, J. J., Richard, H., Sagot, M. F., and Schbath, S. (2002). Occurrence probability of structured motifs in random sequences. J Comput Biol, 9:761–773.

Rockman, M. V. and Wray, G. A. (2002). Abundant raw material for cis-regulatory evolution in humans. Mol Biol Evol, 19:1991–2004.

Rogozin, I. B., Kochetov, A. V., Kondrashov, F. A., Koonin, E. V., and Milanesi, L. (2001). Presence of ATG triplets in 5’ untranslated regions of eukaryotic cDNAs correlates with a ’weak’ context of the start codon. Bioinformatics, 17:890–900.

Roth, F. P., Hughes, J. D., Estep, P. W., and Church, G. M. (1998). Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat Biotechnol, 16:939–945.

Ryan, K. M. and Birnie, G. D. (1997). Analysis of E-box DNA binding during myeloid differentiation reveals complexes that contain Mad but not Max. Biochem J, pages 79–85.

Saito, R. and Tomita, M. (1999). On negative selection against ATG triplets near start codons in eukaryotic and prokaryotic genomes. J Mol Evol, 48:213–217.

Salgado, H., Moreno-Hagelsieb, G., Smith, T. F., and Collado-Vides, J. (2000). Operons in Escherichia coli: genomic analyses and predictions. Proc Natl Acad Sci U S A, 97:6652–6657.

Schell, T., Kocher, T., Wilm, M., Seraphin, B., Kulozik, A. E., and Hentze, M. W. (2003). Complexes between the nonsense-mediated mRNA decay pathway factor human upf1 (up-frameshift protein 1) and essential nonsense-mediated mRNA decay factors in HeLa cells. Biochem J, 373:775–783.

Scherf, M., Klingenhoff, A., and Werner, T. (2000). Highly specific localization of promoter regions in large genomic sequences by PromoterInspector: a novel context analysis approach. J Mol Biol, 297:599–606.

Schmid, C. D., Praz, V., Delorenzi, M., Périer, R., and Bucher, P. (2004).

The Eukaryotic Promoter Database EPD: the impact of in silico primer

extension. Nucleic Acids Res, pages D82–D85.

Schneider, M. L., Turner, D. L., and Vetter, M. L. (2001). Notch signaling

can inhibit Xath5 function in the neural plate and developing retina. Mol

Cell Neurosci, 18:458–472.

Shapiro, S. and Wilk, M. (1965). An analysis of variance test for normality (complete samples). Biometrika, 52:591–611.

Smith, T. F. and Waterman, M. S. (1981). Identification of common molec-

ular subsequences. J Mol Biol, 147:195–197.

Sogawa, K., Imataka, H., Yamasaki, Y., Kusume, H., Abe, H., and Fujii-

Kuriyama, Y. (1993). cDNA cloning and transcriptional properties of a

novel GC box-binding protein, BTEB2. Nucleic Acids Res, 21:1527–1532.

Struhl, K. (1995). Yeast transcriptional regulatory mechanisms. Annu Rev

Genet, 29:651–674.

Sved, J. and Bird, A. (1990). The expected equilibrium of the CpG dinu-

cleotide in vertebrate genomes under a mutation model. Proc Natl Acad Sci

U S A, 87:4692–4696.

Tagle, D. A., Koop, B. F., Goodman, M., Slightom, J. L., Hess, D. L.,

and Jones, R. T. (1988). Embryonic epsilon and gamma globin genes of a

prosimian primate (Galago crassicaudatus). J Mol Biol, 203:439–455.

Thompson, J. D., Higgins, D. G., and Gibson, T. J. (1994). CLUSTAL

W: improving the sensitivity of progressive multiple sequence alignment

through sequence weighting, position-specific gap penalties and weight ma-

trix choice. Nucleic Acids Res, 22:4673–4680.

Turner, B. M. (2000). Histone acetylation and an epigenetic code. Bioessays,

22:836–845.

Verma, R., Patapoutian, A., Gordon, C. B., and Campbell, J. L. (1991).

Identification and purification of a factor that binds to the Mlu I cell cycle

box of yeast DNA replication genes. Proc Natl Acad Sci U S A, 88:7155–

7159.

Vetter, M. L. and Brown, N. L. (2001). The role of basic helix-loop-helix

genes in vertebrate retinogenesis. Semin Cell Dev Biol, 12:491–498.

Walhout, A. J., Reboul, J., Shtanko, O., Bertin, N., Vaglio, P., Ge, H., Lee,

H., Doucette-Stamm, L., Gunsalus, K. C., Schetter, A. J., Morton, D. G.,

Kemphues, K. J., Reinke, V., Kim, S. K., Piano, F., and Vidal, M. (2002).

Integrating interactome, phenome, and transcriptome mapping data for the

C. elegans germline. Curr Biol, 12:1952–1958.

Walter, J. and Biggin, M. D. (1996). DNA binding specificity of two home-

odomain proteins in vitro and in Drosophila embryos. Proc Natl Acad Sci

U S A, 93:2680–2685.

Waterston, R. H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J. F.,

Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., An, P., An-

tonarakis, S. E., Attwood, J., Baertsch, R., Bailey, J., Barlow, K., Beck,

S., Berry, E., Birren, B., Bloom, T., Bork, P., Botcherby, M., Bray, N.,

Brent, M. R., Brown, D. G., Brown, S. D., Bult, C., Burton, J., Butler,

J., Campbell, R. D., Carninci, P., Cawley, S., Chiaromonte, F., Chinwalla,

A. T., Church, D. M., Clamp, M., Clee, C., Collins, F. S., Cook, L. L.,

Copley, R. R., Coulson, A., Couronne, O., Cuff, J., Curwen, V., Cutts, T.,

Daly, M., David, R., Davies, J., Delehaunty, K. D., Deri, J., Dermitzakis,

E. T., Dewey, C., Dickens, N. J., Diekhans, M., Dodge, S., Dubchak, I.,

Dunn, D. M., Eddy, S. R., Elnitski, L., Emes, R. D., Eswara, P., Eyras,

E., Felsenfeld, A., Fewell, G. A., Flicek, P., Foley, K., Frankel, W. N., Ful-

ton, L. A., Fulton, R. S., Furey, T. S., Gage, D., Gibbs, R. A., Glusman,

G., Gnerre, S., Goldman, N., Goodstadt, L., Grafham, D., Graves, T. A.,

Green, E. D., Gregory, S., Guigó, R., Guyer, M., Hardison, R. C., Haussler,

D., Hayashizaki, Y., Hillier, L. W., Hinrichs, A., Hlavina, W., Holzer, T.,

Hsu, F., Hua, A., Hubbard, T., Hunt, A., Jackson, I., Jaffe, D. B., John-

son, L. S., Jones, M., Jones, T. A., Joy, A., Kamal, M., Karlsson, E. K.,

Karolchik, D., Kasprzyk, A., Kawai, J., Keibler, E., Kells, C., Kent, W. J.,

Kirby, A., Kolbe, D. L., Korf, I., Kucherlapati, R. S., Kulbokas, E. J., Kulp,

D., Landers, T., Leger, J. P., Leonard, S., Letunic, I., Levine, R., Li, J., Li,

M., Lloyd, C., Lucas, S., Ma, B., Maglott, D. R., Mardis, E. R., Matthews,

L., Mauceli, E., Mayer, J. H., McCarthy, M., McCombie, W. R., McLaren,

S., McLay, K., McPherson, J. D., Meldrim, J., Meredith, B., Mesirov, J. P.,

Miller, W., Miner, T. L., Mongin, E., Montgomery, K. T., Morgan, M.,

Mott, R., Mullikin, J. C., Muzny, D. M., Nash, W. E., Nelson, J. O., Nhan,

M. N., Nicol, R., Ning, Z., Nusbaum, C., O’Connor, M. J., Okazaki, Y.,

Oliver, K., Overton-Larty, E., Pachter, L., Parra, G., Pepin, K. H., Peter-

son, J., Pevzner, P., Plumb, R., Pohl, C. S., Poliakov, A., Ponce, T. C.,

Ponting, C. P., Potter, S., Quail, M., Reymond, A., Roe, B. A., Roskin,

K. M., Rubin, E. M., Rust, A. G., Santos, R., Sapojnikov, V., Schultz, B.,

Schultz, J., Schwartz, M. S., Schwartz, S., Scott, C., Seaman, S., Searle, S.,

Sharpe, T., Sheridan, A., Shownkeen, R., Sims, S., Singer, J. B., Slater, G.,

Smit, A., Smith, D. R., Spencer, B., Stabenau, A., Stange-Thomann, N.,

Sugnet, C., Suyama, M., Tesler, G., Thompson, J., Torrents, D., Trevaskis,

E., Tromp, J., Ucla, C., Ureta-Vidal, A., Vinson, J. P., Niederhausern, A.

C. V., Wade, C. M., Wall, M., Weber, R. J., Weiss, R. B., Wendl, M. C.,

West, A. P., Wetterstrand, K., Wheeler, R., Whelan, S., Wierzbowski, J.,

Willey, D., Williams, S., Wilson, R. K., Winter, E., Worley, K. C., Wyman,

D., Yang, S., Yang, S. P., Zdobnov, E. M., Zody, M. C., and Lander, E. S.

(2002). Initial sequencing and comparative analysis of the mouse genome.

Nature, 420:520–562.

Watson, J. D. and Crick, F. H. (1953). Molecular structure of nucleic acids;

a structure for deoxyribose nucleic acid. Nature, 171:737–738.

Webb, C. T., Shabalina, S. A., Ogurtsov, A. Y., and Kondrashov, A. S.

(2002). Analysis of similarity within 142 pairs of orthologous intergenic

regions of Caenorhabditis elegans and Caenorhabditis briggsae. Nucleic

Acids Res, 30:1233–1239.

Weinmann, A. S. and Farnham, P. J. (2002). Identification of unknown

target genes of human transcription factors using chromatin immunopre-

cipitation. Methods, 26:37–47.

Wingender, E., Chen, X., Hehl, R., Karas, H., Liebich, I., Matys, V.,

Meinhardt, T., Prüss, M., Reuter, I., and Schacherer, F. (2000). TRANSFAC:

an integrated system for gene expression regulation. Nucleic Acids Res,

28:316–319.

Wray, G. A., Hahn, M. W., Abouheif, E., Balhoff, J. P., Pizer, M.,

Rockman, M. V., and Romano, L. A. (2003). The evolution of

transcriptional regulation in eukaryotes. Mol Biol Evol, 20:1377–1419.

Zervos, A. S., Gyuris, J., and Brent, R. (1993). Mxi1, a protein that specif-

ically interacts with Max to bind Myc-Max recognition sites. Cell, 72:223–

232.

Zhu, J., Liu, J. S., and Lawrence, C. E. (1998). Bayesian adaptive sequence

alignment algorithms. Bioinformatics, 14:25–39.
