Genomics Irena Artamonova Second European School of Bioinformatics Nijmegen, January 22, 2005


Complete genomes

[Figure: bar chart of complete genomes by year, 1995-2002; vertical axis from 0 to 90]

Brief calculation

Approximately 233 complete genomes with about 3000 genes in each on average.

Almost all genes are new and unstudied

In a lab: investigation of function of one gene requires one postdoc-year at least.

Hurrah!: we have work for all molecular biologists for thousands of years right now!

We have a new “complete genome”. What can we do with it now (in silico)?

(outline of the lecture)

• Gene recognition

• Prediction of regulation of gene expression

• Functional annotation of proteins

• Metabolic reconstruction

• Study of genome evolution

Main differences:

Prokaryotes and Eukaryotes

Gene recognition I. Prokaryotes

• Projection of known genes

• Genome comparisons

• Finding long ORFs

• Using DNA statistics

• Identification of gene starts

Size of a prokaryotic genome:

Pathogenic bacteria: from < 1 Mb and 600 genes

Free-living bacteria: up to 6-9 Mb, 9000 genes

E.g., Escherichia coli: 4.6 Mb, 4400 genes

Mapping “known” genes

BLASTx: //www.ncbi.nlm.nih.gov/BLAST/

A lot of information when a close genome is well-studied. But it happens rarely.

Problems: choice of thresholds, fine mapping of start positions in other cases. No perfect solutions.

Using long ORFs

–What minimal length is functional?

–Which Met is the start?
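A minimal sketch of the long-ORF scan, assuming the standard stop codons, ATG as the only start codon, and the first ATG after the previous stop as the gene start (one possible answer to “which Met?”, not a rule from the lecture). The function names and the `min_codons` threshold are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def orfs_in_strand(seq, min_codons):
    """Yield (start, end) of ATG...stop ORFs on one strand (0-based, end exclusive)."""
    for frame in range(3):
        start = None  # first ATG seen since the last stop codon in this frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in STOPS:
                # length is counted in codons, stop codon included
                if start is not None and (i + 3 - start) // 3 >= min_codons:
                    yield start, i + 3
                start = None

def find_orfs(seq, min_codons=100):
    """Long ORFs on both strands as (strand, start, end) in forward coordinates."""
    seq = seq.upper()
    n = len(seq)
    found = [("+", s, e) for s, e in orfs_in_strand(seq, min_codons)]
    found += [("-", n - e, n - s)
              for s, e in orfs_in_strand(reverse_complement(seq), min_codons)]
    return sorted(found, key=lambda orf: orf[1])

print(find_orfs("CCATGAAATTTGGGTAACC", min_codons=5))  # [('+', 2, 17)]
```

Raising `min_codons` trades missed short genes for fewer random ORFs; the default of 100 is only a placeholder.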

ORFs in a fragment of the K. pneumoniae genome

Frequencies of codons differ from frequencies of non-coding triplets:

• frequencies of amino acids (and their codons);

• frequencies of dipeptides;

• frequencies of synonymous codons (genome-specific, correlate with tRNA concentration).

Use of DNA statistics in gene recognition

Coding potential

A function measuring whether the genomic fragment is coding or non-coding based on its DNA statistics.

We can calculate coding potential for ORFs or for sliding window

“Sliding window” technique:

• Scan the DNA sequence with a sliding window of fixed size

• Calculate the coding potential for each window position and plot it above the sequence (horizontal axis)

• Choose the window size so as to minimize random noise
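The sliding-window scan can be sketched as follows. The coding potential used here is an assumption: the mean log-odds of in-frame triplets, with frequencies learned from known genes versus non-coding fragments. The function names, the pseudocount and the toy training data are hypothetical; the 96 nt default follows the E. coli example on the next slide.

```python
import math
from collections import Counter

def codon_log_odds(coding_seqs, noncoding_seqs, pseudo=1.0):
    """Log-odds of each triplet: frequency among codons of known genes
    versus frequency among triplets of non-coding fragments."""
    def triplet_freqs(seqs):
        counts = Counter(s[i:i + 3] for s in seqs for i in range(0, len(s) - 2, 3))
        total = sum(counts.values()) + 64 * pseudo
        return lambda t: (counts[t] + pseudo) / total
    f_cod, f_non = triplet_freqs(coding_seqs), triplet_freqs(noncoding_seqs)
    bases = "ACGT"
    return {a + b + c: math.log(f_cod(a + b + c) / f_non(a + b + c))
            for a in bases for b in bases for c in bases}

def coding_potential(seq, log_odds, window=96, step=3):
    """Return (window start, mean log-odds per triplet) for each window position."""
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        triplets = [seq[i:i + 3] for i in range(start, start + window, 3)]
        score = sum(log_odds.get(t, 0.0) for t in triplets) / len(triplets)
        profile.append((start, score))
    return profile

# Toy example: "genes" made of GCT/GAA codons, "intergenic" DNA made of TTT/AAA
odds = codon_log_odds(["GCTGAAGCTGAA"], ["TTTAAATTTAAA"])
for start, score in coding_potential("GCTGAA" * 4 + "TTTAAA" * 4, odds, window=12, step=12):
    print(start, round(score, 2))
```

A larger window smooths the profile but blurs gene boundaries; a smaller one does the opposite.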

Selection of window size for sliding window

E. coli: 96nt window

48nt window

Exact mapping of gene start positions

• Prokaryotes: the starting methionine is preceded by a ribosome-binding site (the so-called Shine-Dalgarno box, any part of GGAGGA)

• Extension of the nucleotide alignment with the orthologous region from a related genome: mutation patterns in the coding region differ from those in the intergenic region
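A small sketch of the second idea: count the variable alignment columns separately for codon positions 1, 2 and 3. The four rows are the first 13 codons of pdxB (from the GTG start) copied from the enterobacterial alignment in these slides; the function name is hypothetical.

```python
def variable_columns_by_codon_position(aligned_cds):
    """Count columns that are not identical in all rows, split by codon position 1, 2, 3.
    Assumes the alignment starts at the first position of a codon and has no gaps."""
    counts = [0, 0, 0]
    for col, column in enumerate(zip(*aligned_cds)):
        if len(set(column)) > 1:
            counts[col % 3] += 1
    return counts

pdxb = [
    "GTGAAAATTCTCGTTGATGAAAATATGCCTTACGCCCGC",  # Sty (Stm and Sen are identical here)
    "GTGAAAATCCTTGTTGATGAAAATATGCCTTATGCCCGC",  # Eco
    "GTGAAAATCCTCGTTGATGAAAATATGCCCTATGCCCGT",  # Kpn
    "GTGAAAATTCTGGTTGATGAAAATATGCCGTACGCTGAG",  # Ype
]
print(variable_columns_by_codon_position(pdxb))  # [1, 1, 6]
```

Most changes fall on the third codon position, which is the pattern expected inside a coding region; upstream of the start the changes show no such periodicity.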

rbsD in enterobacteria

Sty AGGGTTACACTGCGGC-CAGCGAAACGTTTCGCTAGTGGAGCAGAAAAATGAAGAAAGGC
Sen AGGGTTACACTGCGGC-CAGCGAAACGTTTCGCTAGTGGAGCAGAAAAATGAAGAAAGGC
Stm GGGGTTACACTGCGGC-CAGCGAAACGTTTCGCTAGTGGAGCAGAAAAATGAAGAAAGGC
Eco AGGATTAAACTGTGGGTCAGCGAAACGTTTCGCTGATGGAGAA-AAAAATGAAAAAAGGC
Ype TTTTCTAAACTCCTTGTTAGCGAAACGTTTCGCTCTTGGAGTA-GATCATGAAAAAAGGT
         ** ***       ****************  ***** *  *  ***** *****

Sty ACCGTACTCAACTCTGAAATCTCGTCGGTCATTTCCCGTCTGGGGCATACTGATACTCTG
Sen ACCGTACTCAACTCTGAAATCTCGTCGGTCATTTCCCGTCTGGGGCATACTGATACTCTG
Stm ACCGTACTCAACTCTGAAATCTCGTCGGTCATTTCCCGTCTGGGGCATACTGATACTCTG
Eco ACCGTTCTTAATTCTGATATTTCATCGGTGATCTCCCGTCTGGGACATACCGATACGCTG
Ype GTATTACTGAACGCTGATATTTCCGCGGTTATCTCCCGTCTGGGCCATACCGATCAGATT
        * ** **  **** ** **  **** ** *********** ***** ***    *

Pattern of nucleotide changes in protein-coding regions

Sty TCGCTCG--CAGCGGAAAGAGGATTACGCCCTTCGCCTGGAGGCTGTGCAGGGGC---GCCGGAGATGGGATGCATAATT
Stm TCGCTCG--CAGCGGAAAGAGGATTACGCCCTTCGCCTGGAGGCTGTGCAGGGGC---GCCGGAGATGGGATGCATAATT
Sen TCGCTCG--CAGCGGAAAGAGGATTACGCCCTTCGCCTGGAGGCTGTGCAGGGGC---GCCGGAGATGGGATGCATAATT
Eco TTGCCCG--TGCCAGACGGCAGATTATCTCCCTGACCTGGTGGTTGCCCAGGAGGAGGGCCGGAAATAGGTTGTATCATT
Kpn ----CGG--TGGCGCAGTGCCTGATGGG-CCTCGCCCTGGAGGACGGTCTGGCAT---ATCAGCAAGGGGGTGCGTCATG
Ype TTGTTAGAACAGGGGAAAACGGTAAACAGTGTGGCATTAGATGTCGGTTATAGCT-----CCGCCTCTGCTTTTATCGCC
          *        *                     * *  *  *              * *     *  *   *

Sty AATTATCCTTTAAC----------CATAAATCTGAGCAATA-TATGCTTGGCGGCCAGATTATGGC--ACACTTGTCCGG
Stm AATTATCCTTTAAC----------CATAAATCTGAGCAATA-TATGCCTGGCGGCCAGATTATGGC--ACACTTGTCCGG
Sen AATTATCCTTTAAC----------CATAAATCTGAGCAATA-TATGCCTGGCGGCCAGATTATGGC--ACACTTGTCCGG
Eco ACGTATCCTTATAC----------CTGAAATCTTCGCAAG--TATGCCTGGCCGCGAGATTATGGC--ACACTTGTCCGG
Kpn ATTCATCCTTTCGATATCGCGGTGCTGGAACCAGGTGATGAGTATGCCTGGCGGCCAGATTATGGC--ACACTTCCCCAG
Ype ATGTTTCAGCAAATAT--------CGGGTACCA-CGCCTGAGCGTTTCCGGCGGGGCAATAGTGGCTTATACTAAGCCCC
    *    **                 *    * *            *    *** *    **  ****  * ***   **

Sty TTAACTCTCGTT-CTCAAACAG------GTACGACAGTC--GTGAAAATTCTCGTTGATGAAAATATGCCTTACGCCCGC
Stm TTAACTCTCGTT-CTCAAACAG------GTACGACAGTC--GTGAAAATTCTCGTTGATGAAAATATGCCTTACGCCCGC
Sen TTAACTCTCGTT-CTCAAACAG------GTACGACAGTC--GTGAAAATTCTCGTTGATGAAAATATGCCTTACGCCCGC
Eco TTAACTCTCGT--CTCATACAG------GTAACACAAAC--GTGAAAATCCTTGTTGATGAAAATATGCCTTATGCCCGC
Kpn TTAACTCTCGTT-CTCAGACAG------GTACTGAACT---GTGAAAATCCTCGTTGATGAAAATATGCCCTATGCCCGT
Ype CTGTTTTTCATCTGTATGGCAGTTCGCTGTCGGAGAGTAAAGTGAAAATTCTGGTTGATGAAAATATGCCGTACGCTGAG
     *   * ** *   *    ***      **     *     ******** ** ***************** ** **
                                             123123123123123123123123123123123123123

pdxB in enterobacteria

Operons

The majority of genes in prokaryotes are transcribed in operons. Some examples of operons in eukaryotes: C. elegans

Ideas for de novo prediction of operon structure are trivial:

• Small distance between adjacent genes

• Co-orientation (genes lie on the same strand)

• More reliability when these features are conserved in different species

Additional arguments:

• Similar functional annotations of adjacent genes

• Observed co-expression

• Known average operon length
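The two basic criteria (small intergenic distance and co-orientation) can be sketched as below. The `Gene` record, the function name and the 100 bp gap threshold are hypothetical; conservation across species and the additional arguments are not modelled.

```python
from dataclasses import dataclass

@dataclass
class Gene:
    name: str
    start: int
    end: int
    strand: str  # "+" or "-"

def predict_operons(genes, max_gap=100):
    """Group adjacent genes that lie on the same strand and are at most max_gap bp apart."""
    operons, current = [], []
    for gene in sorted(genes, key=lambda g: g.start):
        same_unit = (current
                     and gene.strand == current[-1].strand
                     and gene.start - current[-1].end <= max_gap)
        if same_unit:
            current.append(gene)
        else:
            if current:
                operons.append(current)
            current = [gene]
    if current:
        operons.append(current)
    return [[g.name for g in operon] for operon in operons]

genes = [Gene("a", 1, 900, "+"), Gene("b", 950, 1800, "+"),
         Gene("c", 2500, 3300, "+"), Gene("d", 3320, 4000, "-")]
print(predict_operons(genes))  # [['a', 'b'], ['c'], ['d']]
```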

Training for a completely new genome

For all the methods discussed so far we need some initial knowledge about genes in the genome (DNA statistics, minimal ORF length, etc.), taken from known genes or their very close orthologs

When we have no information at all, we use an iterative process with initial parameters from very long ORFs (and/or distant orthologs with reconstructed structure) as genes, and regions with no ORFs as intergenic regions

Gene recognition II. Eukaryotes

Specifics:

• Exon-intron structure

• 9-10 coding exons per gene on average (human), ~5 exons (insects)

• Average length of internal exons is 120-130 nucleotides

• Very long introns (>10 Kb) are frequent, may be as long as > 1 Mb

• There are no Shine-Dalgarno sequences (the Kozak rule can be used instead, but it is much weaker)

=> ORFs and “sliding window” techniques are inapplicable!

The gene of rat chymotrypsin

Inapplicability of “sliding window” technique for eukaryotic genomes

Nothing (intergenic region)

Search for “known” genes

BlastX is reliable only for large exons (short introns are treated as long deletions)

What can we use instead? Splicing signals!

“Spliced alignment” is an alignment of DNA fragment with a sequence coding for a homologous protein. Unlike standard alignments, it is allowed to contain non-penalized long “deletions” flanked with splicing signals (that is, introns). BLAT, ProFrame, TWINSCAN

Spliced alignments of genomic sequences

VISTA (www-gsd.lbl.gov/vista/): human-dog-mouse

HMM (Hidden Markov Model)

Definition: An HMM is a 5-tuple (Q, V, p, A, E), where:

Q is a finite set of states, |Q| = N

V is a finite set of observation symbols per state, |V| = M

p is the vector of initial state probabilities.

A is the matrix of state transition probabilities, denoted by $a_{st}$ for each s, t ∈ Q:

$a_{st} \equiv P(x_i = t \mid x_{i-1} = s)$

E is the emission probability matrix: $e_{sk} \equiv P(v_k \text{ at time } t \mid q_t = s)$

Property: Emissions and transitions depend on the current state only and not on the past.

Output: Only the emitted symbols are observable, not the underlying random walk between states -> “hidden”
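To make the definition concrete, here is a sketch of Viterbi decoding, which recovers the most probable hidden state path from the emitted symbols. The two-state toy model (N: AT-rich, C: GC-rich) and all its probabilities are invented for illustration; real gene finders use far richer state sets.

```python
import math

def viterbi(observations, states, p, A, E):
    """Most probable state path for an HMM (Q=states, p=initial, A=transition, E=emission)."""
    # score[s]: log-probability of the best path that ends in state s
    score = {s: math.log(p[s]) + math.log(E[s][observations[0]]) for s in states}
    back = []  # one dict of back-pointers per step
    for symbol in observations[1:]:
        prev, score, pointers = score, {}, {}
        for t in states:
            best = max(states, key=lambda s: prev[s] + math.log(A[s][t]))
            score[t] = prev[best] + math.log(A[best][t]) + math.log(E[t][symbol])
            pointers[t] = best
        back.append(pointers)
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))

# Hypothetical toy parameters
p = {"N": 0.5, "C": 0.5}
A = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
E = {"N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
     "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}}
print(viterbi("AAAAAAGGGGGGAAAAAA", "NC", p, A, E))  # NNNNNNCCCCCCNNNNNN
```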

HMM-based Gene Finding

• GENSCAN (Burge 1997)

• FGENESH (Solovyev 1997)

• HMMgene (Krogh 1997)

• GENIE (Kulp 1996)

• GENMARK (Borodovsky & McIninch 1993)

• VEIL (Henderson, Salzberg, & Fasman 1997)

GenScan Overview

• Developed by Chris Burge (Burge 1997), in the research group of Samuel Karlin, Dept of Mathematics, Stanford Univ.

• Characteristics:

– Designed to predict complete gene structures: introns and exons, promoter sites, polyadenylation signals

– Incorporates: descriptions of transcriptional, translational and splicing signals; length distributions (Explicit State Duration HMMs); compositional features of exons, introns, intergenic and C+G regions

– Larger predictive scope: deals with partial and complete genes; multiple genes separated by intergenic DNA in a sequence; consistent sets of genes on either/both DNA strands

• Based on a general probabilistic model of genomic sequence composition and gene structure

GenScan Architecture

• It is based on Generalized HMM (GHMM)

• Models both strands at once

– Other models: predict on one strand first, then on the other strand

– Avoids prediction of overlapping genes on the two strands (rare)

• Each state may output a string of symbols (according to some probability distribution).

• Explicit intron/exon length modeling

• Special sensors for Cap-site and TATA-box

• Advanced splice site sensors

Regulation

Less than 5% of the human genome sequence is protein-coding. What is the role of the remaining DNA?

It has been suggested that a much larger part of the human genome encodes the regulatory machinery

Processes whose regulation we try to predict:

• Transcription (DNA → RNA)

• Splicing (pre-mRNA → mRNA)

• Translation (mRNA → protein)

Two types of analysis of regulation

Prediction of regulatory signal

Finding new sites

Identification of the signal

Signal is an ideal “site” or a set of ALL observed sites

Site is a representative of the signal in the genome

Deriving the signal ab initio I. Ubiquitous (necessary) signals

• Examples: promoters of transcription, ribosome-binding signal, acceptor and donor splicing sites, stop codon, polyadenylation signal

• We know many examples and some biological characteristics (and landmarks)

• Often short (4-6 nucleotides)

Re-alignment approaches

• Initial alignment by a biological landmark

– start of transcription for promoters

– start codon for ribosome binding sites

– exon-intron boundary for splicing sites

• Fix the width of the sliding window and the expected signal size

• Derive the signal (the most frequent word) within a sliding window

• Repeat for other parameters, select the best set

• Re-align anchoring on the signal

• Identify the signal positions (with non-uniform nucleotide frequencies)

Gene starts of Bacillus subtilis

dnaN ACATTATCCGTTAGGAGGATAAAAATG

gyrA GTGATACTTCAGGGAGGTTTTTTAATG

serS TCAATAAAAAAAGGAGTGTTTCGCATG

bofA CAAGCGAAGGAGATGAGAAGATTCATG

csfB GCTAACTGTACGGAGGTGGAGAAGATG

xpaC ATAGACACAGGAGTCGATTATCTCATG

metS ACATTCTGATTAGGAGGTTTCAAGATG

gcaD AAAAGGGATATTGGAGGCCAATAAATG

spoVC TATGTGACTAAGGGAGGATTCGCCATG

ftsH GCTTACTGTGGGAGGAGGTAAGGAATG

pabB AAAGAAAATAGAGGAATGATACAAATG

rplJ CAAGAATCTACAGGAGGTGTAACCATG

tufA AAAGCTCTTAAGGAGGATTTTAGAATG

rpsJ TGTAGGCGAAAAGGAGGGAAAATAATG

rpoA CGTTTTGAAGGAGGGTTTTAAGTAATG

rplM AGATCATTTAGGAGGGGAAATTCAATG

















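Aligned at the start codon (num.: number of sequences with the consensus nucleotide at each position, tens digit above units digit):

cons. aaagtatataagggagggttaataATG
num.  001000000000110110000000111
      760666658967228106888659666

After re-alignment on the signal:

cons. tacataaaggaggtttaaaaat
num.  0000000111111000000001
      5755779156663678679890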

Positional information content before and after re-alignment
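A rough sketch of the re-alignment idea on the B. subtilis gene starts listed above. As a simplification (an assumption, not the procedure of the lecture), the “most frequent word” is the k-word present in the largest number of fragments, searched over the whole fragment rather than a sliding window, and fragments lacking the word are dropped. Function names are hypothetical.

```python
from collections import Counter

GENE_STARTS = {
    "dnaN": "ACATTATCCGTTAGGAGGATAAAAATG", "gyrA": "GTGATACTTCAGGGAGGTTTTTTAATG",
    "serS": "TCAATAAAAAAAGGAGTGTTTCGCATG", "bofA": "CAAGCGAAGGAGATGAGAAGATTCATG",
    "csfB": "GCTAACTGTACGGAGGTGGAGAAGATG", "xpaC": "ATAGACACAGGAGTCGATTATCTCATG",
    "metS": "ACATTCTGATTAGGAGGTTTCAAGATG", "gcaD": "AAAAGGGATATTGGAGGCCAATAAATG",
    "spoVC": "TATGTGACTAAGGGAGGATTCGCCATG", "ftsH": "GCTTACTGTGGGAGGAGGTAAGGAATG",
    "pabB": "AAAGAAAATAGAGGAATGATACAAATG", "rplJ": "CAAGAATCTACAGGAGGTGTAACCATG",
    "tufA": "AAAGCTCTTAAGGAGGATTTTAGAATG", "rpsJ": "TGTAGGCGAAAAGGAGGGAAAATAATG",
    "rpoA": "CGTTTTGAAGGAGGGTTTTAAGTAATG", "rplM": "AGATCATTTAGGAGGGGAAATTCAATG",
}

def most_frequent_word(seqs, k):
    """k-word present in the largest number of sequences (ties broken alphabetically)."""
    presence = Counter()
    for s in seqs:
        presence.update({s[i:i + k] for i in range(len(s) - k + 1)})
    word = min(presence, key=lambda w: (-presence[w], w))
    return word, presence[word]

def realign(seqs, word):
    """Pad each sequence on the left so that its last occurrence of word lines up."""
    hits = {name: s.rfind(word) for name, s in seqs.items()}
    hits = {name: h for name, h in hits.items() if h >= 0}
    offset = max(hits.values())
    return {name: " " * (offset - h) + seqs[name] for name, h in hits.items()}

word, n_seqs = most_frequent_word(GENE_STARTS.values(), 4)
print(word, n_seqs)  # GGAG 15
for name, row in realign(GENE_STARTS, word).items():
    print(f"{name:6}{row}")
```

With k = 4 the derived word is part of the Shine-Dalgarno box GGAGGA; only pabB lacks an exact copy.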

Deriving the signal II. Transcription regulation

• Transcription factors binding sites

• Usually longer (10-20 nts or more)

• Relatively small sample: only a few sites per genome, and very few examples are known

• Often have some symmetry

• Conserved among species

• Experimental studies are not sufficient: they define only the regulatory region

Why are TFBSs palindromes? Examples

Prokaryotes / Eukaryotes

Use of symmetry

• DNA-binding factors and their signals

Co-operative homogeneous

Palindromes

Repeats

Co-operative non-homogeneous

Cassettes

Others

RNA signals: special conserved secondary structure

Regulation of transcription in eukaryotes

Signal, consensus

codB CCCACGAAAACGATTGCTTTTT

purE GCCACGCAACCGTTTTCCTTGC

pyrD GTTCGGAAAACGTTTGCGTTTT

purT CACACGCAAACGTTTTCGTTTA

cvpA CCTACGCAAACGTTTTCTTTTT

purC GATACGCAAACGTGTGCGTCTG

purM GTCTCGCAAACGTTTGCTTTCC

purH GTTGCGCAAACGTTTTCGTTAC

purL TCTACGCAAACGGTTTCGTCGG

consensus ACGCAAACGTTTTCGT

Pattern


pattern aCGmAAACGtTTkCkT

Frequency matrix

j a C G m A A A C G t T T k C k T

A 6 0 0 2 9 9 8 0 0 1 0 0 0 0 0 0

C 1 8 0 7 0 0 1 9 0 0 0 0 0 9 1 0

G 1 1 9 0 0 0 0 0 9 1 1 0 5 0 5 0

T 1 0 0 0 0 0 0 0 0 7 8 9 4 0 3 9

$W(b,j) = \ln(N(b,j)+0.5) - 0.25 \sum_i \ln(N(i,j)+0.5)$

Information content: $I = \sum_j \sum_b f(b,j) \log\left[ f(b,j) / p(b) \right]$

Positional weight matrix (PWM)

j a C G m A A A C G t T T k C k T

A 6 0 0 2 9 9 8 0 0 1 0 0 0 0 0 0

C 1 8 0 7 0 0 1 9 0 0 0 0 0 9 1 0

G 1 1 9 0 0 0 0 0 9 1 1 0 5 0 5 0

T 1 0 0 0 0 0 0 0 0 7 8 9 4 0 3 9

A 1.1 –1.0 –0.7 0.5 2.2 2.2 1.9 –0.7 –0.7 –0.1 –1.0 –0.7 –1.1 –0.7 –1.4 –0.7

C –0.4 1.9 –0.7 1.6 –0.7 –0.7 0.1 2.2 –0.7 –1.2 –1.0 –0.7 –1.1 2.2 –0.3 –0.7

G –0.4 0.1 2.2 –1.1 –0.7 –0.7 –1.0 –0.7 2.2 –0.1 –0.1 –0.7 1.2 –0.7 1.0 –0.7

T –0.4 –1.0 –0.7 –1.1 –0.7 –0.7 –1.0 –0.7 –0.7 1.5 1.9 2.2 1.0 –0.7 0.6 2.2
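The weights in the table can be recomputed from the frequency matrix with the formula W(b,j) = ln(N(b,j)+0.5) - 0.25 Σ_i ln(N(i,j)+0.5). The sketch below does that and also scores a site and computes the information content. The function names are hypothetical; a uniform background p(b) = 0.25, natural logarithms and the absence of pseudocounts in the information content are assumptions.

```python
import math

COUNTS = {  # frequency matrix N(b, j) from the slide: 9 sites, 16 positions
    "A": [6, 0, 0, 2, 9, 9, 8, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    "C": [1, 8, 0, 7, 0, 0, 1, 9, 0, 0, 0, 0, 0, 9, 1, 0],
    "G": [1, 1, 9, 0, 0, 0, 0, 0, 9, 1, 1, 0, 5, 0, 5, 0],
    "T": [1, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 9, 4, 0, 3, 9],
}

def weight_matrix(counts):
    """W(b, j) = ln(N(b, j) + 0.5) - 0.25 * sum_i ln(N(i, j) + 0.5)."""
    length = len(counts["A"])
    W = {b: [0.0] * length for b in counts}
    for j in range(length):
        mean_log = 0.25 * sum(math.log(counts[i][j] + 0.5) for i in counts)
        for b in counts:
            W[b][j] = math.log(counts[b][j] + 0.5) - mean_log
    return W

def site_weight(site, W):
    """Weight of a candidate site: sum of the positional weights of its nucleotides."""
    return sum(W[b][j] for j, b in enumerate(site))

def information_content(counts, background=0.25):
    """I = sum_j sum_b f(b, j) * ln(f(b, j) / p(b)), with terms for f = 0 taken as 0."""
    n = sum(counts[b][0] for b in counts)
    total = 0.0
    for j in range(len(counts["A"])):
        for b in counts:
            f = counts[b][j] / n
            if f > 0:
                total += f * math.log(f / background)
    return total

W = weight_matrix(COUNTS)
print([round(W[b][0], 1) for b in "ACGT"])  # [1.1, -0.4, -0.4, -0.4]
print(round(site_weight("ACGCAAACGTTTTCGT", W), 1), round(information_content(COUNTS), 1))
```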

Sequence logo

Greedy algorithms (MEME)

Find a signal among all k-words (assuming that we know the signal length).

For all k-words it’s too time-consuming (k~16). So initially we consider only k-words that were present in the fragments.

For each k-word construct a matrix of “sites”: alignment of best “copies” of the k-word from every sequence fragment.

Select the best k-word. What is the measure for comparison of matrices? Information content!

Greedy algorithms. Cont’d

• Select the k-word with maximal information content

Problem. We considered only k-words from our sequences => may select not the signal (the consensus word), but only its best representative in our sample

Solution. For each k-word from the sample construct PWM and reconstruct the frequency matrix based on it. Repeat until stabilization of the matrix. Use the consensus of this matrix.
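The procedure of the last two slides can be sketched as follows, under several assumptions: the “best copy” of a k-word is the window with the most matching letters, the information content uses a pseudocount of 0.5 and a uniform background, and the PWM is the W(b,j) matrix defined earlier. Names and the toy sequences are hypothetical; this is not MEME itself.

```python
import math
from collections import Counter

BASES = "ACGT"

def column_counts(sites):
    return [Counter(column) for column in zip(*sites)]

def information_content(sites, background=0.25, pseudo=0.5):
    n = len(sites) + 4 * pseudo
    return sum(((c[b] + pseudo) / n) * math.log(((c[b] + pseudo) / n) / background)
               for c in column_counts(sites) for b in BASES)

def best_copy(seq, k, score):
    """Best-scoring window of length k in seq (leftmost on ties)."""
    return max((seq[i:i + k] for i in range(len(seq) - k + 1)), key=score)

def greedy_signal(seqs, k, max_rounds=20):
    # Step 1: every k-word of the sample -> matrix of its best copies -> information content
    words = sorted({s[i:i + k] for s in seqs for i in range(len(s) - k + 1)})
    def sites_for(word):
        return [best_copy(s, k, lambda w: sum(a == b for a, b in zip(word, w)))
                for s in seqs]
    sites = max((sites_for(w) for w in words), key=information_content)
    # Step 2: sites -> PWM -> rescan -> new sites, until the matrix stabilizes
    for _ in range(max_rounds):
        counts = column_counts(sites)
        def pwm_score(w):
            return sum(math.log(counts[j][b] + 0.5)
                       - 0.25 * sum(math.log(counts[j][i] + 0.5) for i in BASES)
                       for j, b in enumerate(w))
        new_sites = [best_copy(s, k, pwm_score) for s in seqs]
        if new_sites == sites:
            break
        sites = new_sites
    consensus = "".join(c.most_common(1)[0][0] for c in column_counts(sites))
    return consensus, sites

toy = ["ATTGGAGGTCA", "CCGGAGGATAC", "TAGGAGGCATT", "GACTGGAGGAA"]
print(greedy_signal(toy, 5))  # ('GGAGG', ['GGAGG', 'GGAGG', 'GGAGG', 'GGAGG'])
```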

Limitation of greedy algorithms

• Started from k-words in our sequences and increase the information content at each step => find a local (not global) maximum of the functional.

• We need an alternative algorithm that will not be “greedy”!

Gibbs sampler

Let A be a signal (set of sites), and I(A) be its information content.

At each step a new site is selected in one sequence with probability

$P \sim \exp[I(A_{new})]$

For each candidate site the total time of occupation is computed.

(Note that the signal changes all the time)
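A minimal sketch of such a sampler. Only the selection rule P ~ exp[I(A_new)] and the occupation-time bookkeeping come from the slide; the random initialization, the random choice of the sequence to update, the pseudocount and the number of steps are assumptions, and the names are hypothetical.

```python
import math
import random
from collections import Counter

def information_content(sites, background=0.25, pseudo=0.5):
    n = len(sites) + 4 * pseudo
    total = 0.0
    for column in zip(*sites):
        counts = Counter(column)
        for b in "ACGT":
            f = (counts[b] + pseudo) / n
            total += f * math.log(f / background)
    return total

def gibbs_sampler(seqs, k, steps=2000, seed=0):
    """Return, for each sequence, the site start with the longest total occupation time."""
    rng = random.Random(seed)
    positions = [rng.randrange(len(s) - k + 1) for s in seqs]
    occupation = [Counter() for _ in seqs]
    for _ in range(steps):
        i = rng.randrange(len(seqs))  # sequence whose site is re-sampled
        candidates = list(range(len(seqs[i]) - k + 1))
        weights = []
        for start in candidates:
            trial = positions[:i] + [start] + positions[i + 1:]
            sites = [s[p:p + k] for s, p in zip(seqs, trial)]
            weights.append(math.exp(information_content(sites)))  # P ~ exp[I(A_new)]
        positions[i] = rng.choices(candidates, weights=weights)[0]
        for index, p in enumerate(positions):
            occupation[index][p] += 1
    return [occ.most_common(1)[0][0] for occ in occupation]

toy = ["ATTGGAGGTCA", "CCGGAGGATAC", "TAGGAGGCATT", "GACTGGAGGAA"]
print(gibbs_sampler(toy, 5, steps=500))
```

Because the walk is random, different seeds can give different answers on small samples; in practice the run is repeated and the results compared.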

Recognition of signals I. Ubiquitous signals

• Consensus

• Pattern (consensus with degenerate positions)

• Positional weight matrix (PWM, or profile)

Weight of the site: the sum of the positional weights W(b,j) of its nucleotides, where

$W(b,j) = \ln(N(b,j)+0.5) - 0.25 \sum_i \ln(N(i,j)+0.5)$

• Logical rules

• Neural networks

Neural networks: architecture

• 4k input neurons (sensors), each responsible for observing a particular nucleotide at particular position

OR 2k neurons (one discriminates between purines and pyrimidines, the other, between A/T and G/C)

• One or more layers of hidden neurons

• One output neuron

• Each neuron is connected to all neurons of the next layer

• Each connection is ascribed a numerical weight

A neuron

• Sums the inputs at incoming connections

• Compares the total with the threshold (or transforms it according to a fixed function)

• If the threshold is passed, excites the outgoing connections (resp. sends the modified value)

Neural networks: architecture. II

Training of the neural network

• Sites and non-sites from the training sample are presented one by one.

• The output neuron produces the prediction.

• The connection weights increase if the prediction is correct and decrease if it’s incorrect.

Networks differ by architecture, particulars of the signal processing, the training schedule

• Neural networks don’t work: they need training, and there are too few examples

• PWM: ok, but too many false positive predictions => we need rules to select the true sites among the predicted ones.

• Many genomes are available => comparative approach:

– Consistency filtering

– Phylogenetic footprinting

– Phylogenetic shadowing

Recognition of signals II. Regulation of transcription

Definition of orthologs

• Orthologous genes:

– the result of speciation

– the “same” role in the cell

• Paralogous genes:

– the result of duplication

– keep common biochemical function

Example: gluconate and idonate kinases

[Figure: gene tree with Duplication and Speciation events; genes A1, B1 in Genome 1 and A2, B2 in Genome 2]

Consistency filtering

Basic assumption. Regulons (sets of co-regulated genes) are conserved =>

• True sites occur upstream of orthologous genes

• False sites are scattered at random

We need to check that the transcription factors are themselves true orthologs (BBH and COGs are not sufficient; check conservation of the DNA-binding domain and of the core pathway) and have exactly the same specificity (similar binding sites), and then compare the genes (and whole operons) downstream of the predicted sites

The basic procedure

[Figure: flow diagram with boxes labelled Set of known sites, Profile, Genome 1, Genome 2, ..., Genome N]

Accounting for the operon structure

[Figure: operons of the «Old» genome and the «New» genome; labels A, BC, D, XD, EF, E, F, with X marks]

Tryptophan operons

Closely related genomes: Phylogenetic footprinting

Regulatory sites are more conserved than non-coding regions in general and are often seen as conserved islands in alignments of gene upstream regions.
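A sketch of how conserved islands can be picked out of an alignment: report runs of columns that are identical in all rows. The three rows are one block of the purL upstream alignment (ST, EC, YP) from these slides; the function name and the 10-column minimum are hypothetical.

```python
def conserved_islands(rows, min_length=10):
    """Runs of at least min_length columns identical in all rows (gap columns excluded).
    Returns (0-based start column, island sequence) pairs."""
    islands, start = [], None
    columns = list(zip(*rows)) + [("x", "y")]  # sentinel column closes the last run
    for col, column in enumerate(columns):
        conserved = len(set(column)) == 1 and column[0] != "-"
        if conserved and start is None:
            start = col
        elif not conserved and start is not None:
            if col - start >= min_length:
                islands.append((start, rows[0][start:col]))
            start = None
    return islands

purl_upstream = [
    "GG-TGATT---------TTATTTCT-------ACGCAAACGGTTTCGTCGGCGCGTCAGATTCTTTATAATGACGGCCGT",  # ST
    "GG-TGATT---------TTATTTCC-------ACGCAAACGGTTTCGTCAGCGCATCAGATTCTTTATAATGACGCCCGT",  # EC
    "GGCTGTTTCTGACTGAATTATTAATAATAGATACGCAAACGGTTTCGTCGGCGGCTCAGATTCACTATAATGGCGCGCGT",  # YP
]
print(conserved_islands(purl_upstream))  # [(32, 'ACGCAAACGGTTTCGTC')]
```

The single island found lies within the purL site listed among the sites used for the signal ACGCAAACGTTTTCGT.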

Low conservation

yjcD

ST AAA-GCATAAAAAGCGGCAAAGTTCAGTTGAAAAAGCGTTGATGATCGCTGGATAATCGTTTGCTTTTTTTTG---CCAC
EC AAA-GAGAAAAAAGCAGCAAACTTCGGTTGAAAAAGCCGCTATGATCGCCGGATAATCGTTTGCTTTTTTTA----CCAC
YP AAATGTATTAAATGTCGCATTCGGGTGTTGATTAGTCACCACTGATGGCTAGATAATCGTTTGCCTTAAATGACATCTGC
   *** *    *** *  ***       *****  *  *     **** **  ************* **   *     *  *

ST CC--------GTTTTGT--------ATACGTG----GAGCTAAACGTTTGCTTTTTTGCGGCGCCCCG-G-TTGTCGTAA
EC CC--------GTTTTGT--------ATGCGCG----GAGCTAAACGTTTGCTTTTTTGCGACGCAGCA-AATTGTCGCAA
YP CCTAAACTTCGATTTTTTTTCAGTCATGCGTTCTCCCAGCTAATCGTTTGCTATTTTTCCCCGCTCTATGAGTCAGGGAG
   **        * *** *        ** **       ****** ******** **** *  ***        *   * *

ST ATGTAGC----------ACAAGGA-GATAACGTTGCGCTGTTAGTGGATTACCTCCCACGTATACCGACGAATAATAAAT
EC ACCTGGA----------GCAGGAA-GATAACGTTTCGCTGGCAGGGGATTGTCCGCCACGCATCTTGACGAAAATTAAAC
YP AGTTAGTGAGTTCATCGACAGGAACGGAAACGATTACGTAGAGAAGGGCGCTTGGCTTGGCATGCTATTTTAAAATGA-C
   *  * *            ** * * *  **** *    *      **        *   * **        * * * *

ST TCTCAGGGGATGTTTTCT-ATGTCT------ACGCCTTCAGCGCGTACCGGCGGTTCACTCGACGCCTGGTTTAAAATTT
EC TCTCAGGGGATGTTTTCTTATGTCT------ACGCCATCAGCGCGTACCGGCGGTTCACTCGACGCCTGGTTTAAAATTT
YP ACACAGGGGACATCACC--ATGTCTAGCAGCAACCCTCAAGCACAGCCAAAGGGCACGCTTGATGCATTCTTTAAGCTTA
    * *******  *   *  ******      *  **   *** *   *    **  * ** ** ** *  *****  **

High conservation

purL

ST AGCGGCATTTTGCGTAACAATGCGCCAGTTGGCAACTT-ATT-CGCAACGATAGCCGCACC--GTATGACAAGAAAAAGC
EC AGCGGCATTTTGCGTAAACCTGCGCCAGATGGCAACTT-ATT-ACAGCCATTGGCGGCACG--CGTTGCTAATTCACGAT
YP AGTGGCATTTTGCGCAACAAAACGCCAGTGTGCAACTTTATTGCGAGCTATTTGCTGAGTCTGCGTTACACACACATAGC
   ** *********** **     ******   ******* ***         * ** *         *    *   *

ST GG-TGATT---------TTATTTCT-------ACGCAAACGGTTTCGTCGGCGCGTCAGATTCTTTATAATGACGGCCGT
EC GG-TGATT---------TTATTTCC-------ACGCAAACGGTTTCGTCAGCGCATCAGATTCTTTATAATGACGCCCGT
YP GGCTGTTTCTGACTGAATTATTAATAATAGATACGCAAACGGTTTCGTCGGCGGCTCAGATTCACTATAATGGCGCGCGT
   ** ** **         *****          ***************** ***  ********  ******* **  ***

ST TTCCCCCC-------------------TTGCGCACACCAAA--------------GCTTAGAAGACGAGAGA--CTTA--
EC TTCCCCCCC------------------TTGGGTACACCGAAA-------------GCTTAGAAGACGAGAGA--CTTA--
YP TTTGCCCTGTTGTTGCGCCAATGAATGTTGCGCCCAATGAAGTGCTGTTCCAGCCGCTTCGAAGACGAGAGAAACTTAGA
   **  ***                    *** *  **   **              **** ************  ****

ST TGATGGAAATTCTGCGTGGTTCGCCTGCACTGTCTGCATTCCGTATCAATAAACTGCTGGCGCGCTTTCAGGCTGCCAAC
EC TGATGGAAATTCTGCGTGGTTCGCCTGCACTGTCGGCATTCCGAATCAACAAACTGCTGGCACGTTTTCAGGCTGCCAGG
YP TTATGGAAATACTGCGTGGTTCACCCGCTTTGTCGGCTTTTCGTATCACCAAACTGTTGTCCCGTTGCCAGGATGCTCAC
   * ******** *********** ** **  **** ** ** ** ****  ****** ** * ** *  **** ***

Another variation. Phylogenetic shadowing

Idea. Instead of distant orthologs use very close orthologs, but from multiple (very close) species. True sites would look like islands of strongly conserved columns on multiple alignment.

Need to sequence orthologous upstream regions from a series of close genomes (e.g., from many different primates) and analyze their multiple alignment

RNA regulation. Riboswitches

mRNA has two alternative conformations of its leader region: one of them blocks the expression.

Two main cases (prokaryotes): a terminator interrupts transcription or a special structure blocks the ribosome-binding site.

Eukaryotes: blocking of a splicing site

Riboswitches are RNA signals stabilized by a small molecule

Capitals: invariant (absolutely conserved) positions.

Lower case letters: strongly conserved positions.

Obligatory base pairs are set in bold.

Degenerate positions: R = A or G; Y = C or U; K = G or U; B= not A; V = not U. N: any nucleotide. X: any nucleotide

Example of the secondary structure of riboswitch

Importance of prediction of RNA regulation as bioinformatics problem

• Phenomenon was discovered by means of bioinformatics

• RNA signal is strongly conserved (on the sequence level, not only as the secondary structure) => well-predictable (no “false positive” predictions)

• A considerable share of regulation is of this type (~5% of all genes in some species)

Assignment of function based on homology

We want to characterize a new gene. What is the function of the product?

The first step: BlastP.

The best case: we obtain a hit with known function

Have we got functional information on our gene?

Similarity ≠ homology: the e-value is a measure of statistical significance (non-randomness) of similarity.


Orthologs or paralogs?

The best proof is a phylogenetic tree, but building it is too time-consuming.

We use BBH - Bidirectional Best Hit.

COGs – Clusters of Orthologous Groups (//www.ncbi.nlm.nih.gov/COGs/new) (prokaryotes) or KOGs (eukaryotes)

Search for orthologs (fast and dirty)

[Figure: Genome 1 with genes A and B, Genome 2 with genes A', B' and B"; symmetrical best hits are marked]
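A sketch of the bidirectional (symmetrical) best hit criterion. The input format, the function names and the similarity scores are hypothetical; in practice the scores would come from BlastP searches in both directions.

```python
def best_hits(scores):
    """scores: {(query, subject): similarity score}. Returns {query: best subject}."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > scores[(query, best[query])]:
            best[query] = subject
    return best

def bidirectional_best_hits(forward, backward):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    fwd, bwd = best_hits(forward), best_hits(backward)
    return sorted((a, b) for a, b in fwd.items() if bwd.get(b) == a)

# Genome 1 -> Genome 2 and Genome 2 -> Genome 1 (invented scores)
forward = {("A", "A'"): 900, ("A", "B'"): 200, ("B", "B'"): 800, ("B", "B''"): 750}
backward = {("A'", "A"): 900, ("B'", "B"): 800, ("B'", "A"): 200, ("B''", "B"): 750}
print(bidirectional_best_hits(forward, backward))  # [('A', "A'"), ('B', "B'")]
```

B'' finds B as its best hit, but B prefers B', so the pair is not symmetrical and B'' is left out as a likely paralog.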

Assignment of a new gene to specific functional system. I

• Positional clustering

Operon: co-transcription of several genes (usually for prokaryotes, rarely for eukaryotes - Caenorhabditis elegans). Genes are transcribed together and so, exactly under the same conditions => they are dependent functionally

Assignment of a new gene to specific functional system. II

• Genes are not in the same operon, but in the same locus: horizontal transfer

• Divergon: a regulatory signal influences both the direct and the complementary strand (usually with opposite effects)

regulatory site(s)

gene (operon) on (+) strand

gene (operon) on (-) strand

Measure of positional closeness

Let’s use a measure of positional neighborhood: the fraction of divergent genomes in which our genes are closely located
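One way to compute such a measure, as a sketch. The data layout, the function name and the 5000 bp threshold are hypothetical, and taking only genomes that contain both genes as the denominator is an assumption.

```python
def positional_closeness(gene_a, gene_b, genomes, max_distance=5000):
    """Fraction of genomes containing both genes in which they lie on the same
    replicon within max_distance bp of each other.
    genomes: {genome name: {gene name: (replicon, position)}}"""
    both = [g for g in genomes.values() if gene_a in g and gene_b in g]
    if not both:
        return 0.0
    close = sum(1 for g in both
                if g[gene_a][0] == g[gene_b][0]
                and abs(g[gene_a][1] - g[gene_b][1]) <= max_distance)
    return close / len(both)

genomes = {
    "g1": {"x": ("chr", 1000), "y": ("chr", 2500)},
    "g2": {"x": ("chr", 1000), "y": ("chr", 90000)},
    "g3": {"x": ("chr", 500), "y": ("chr", 4000)},
    "g4": {"x": ("chr", 10)},
}
print(round(positional_closeness("x", "y", genomes), 2))  # 0.67
```

The measure is informative only when the genomes are divergent: in close relatives gene order is conserved regardless of function.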

Servers that predict functional dependence: ERGO (//www.cordis.lu/ergo/ ),

STRING (//string.embl.de/, may be described at the proteomics day): implementation and visualization of ALL the techniques related to this area

Eukaryotic case: domain shuffling

Compression of biochemical functions into single molecules

Prokaryotes: all enzymatic activities carried out by separate proteins

Fungi: FAS1 gene encodes activities 3 and 4; FAS2 gene encodes activities 1, 2 and 5-7

Animals: All activities encoded by fatty-acid synthase

Genomic structure of fatty-acid synthase from rat

Protein domains

InterPro:

www.ebi.ac.uk/interpro/

Pfam:

http://www.sanger.ac.uk/Software/Pfam/




Co-regulation

Genes that are distant in the genome, but are regulated similarly.

Very similar to the case of operons

But it’s hard to work with computationally. A lot of manual analysis is necessary.

Co-expression

• If the expression of two genes changes consistently in response to changing conditions or in time => they are functionally related

Microarray data analysis: a special area of bioinformatics (Transcriptomics session)

Protein-protein interactions

• Evidence of physical interaction is direct proof that the proteins function together in one cellular system

Will be discussed in detail at the Proteomics session

Phylogenetic profiling

Usually a functional system is present or absent in a genome as a whole (or this holds for a separate subsystem) =>

If we have many distant complete genomes, we can compare patterns of occurrence (phylogenetic profile) for individual genes.

This is rather weak evidence, but useful in combination with other techniques.

The converse situation also is interesting: genes with complementary phylogenetic profiles may have identical function (non-orthologous displacement: paralogs, specificity changes or really different structure).
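A sketch of both comparisons: genes with identical phylogenetic profiles (candidates for the same functional system) and genes with fully complementary profiles (candidates for non-orthologous displacement). Function names and the toy genomes are hypothetical; real profiles are noisy, so exact identity is a simplification.

```python
from itertools import combinations

def profile(gene, genomes):
    """Presence/absence pattern of a gene across genomes (in the order of the dict)."""
    return tuple(gene in content for content in genomes.values())

def compare_profiles(genes, genomes):
    """Return (pairs with identical profiles, pairs with complementary profiles)."""
    same, complementary = [], []
    for a, b in combinations(sorted(genes), 2):
        pa, pb = profile(a, genomes), profile(b, genomes)
        if pa == pb:
            same.append((a, b))
        elif all(x != y for x, y in zip(pa, pb)):
            complementary.append((a, b))
    return same, complementary

genomes = {"g1": {"a", "b", "x"}, "g2": {"a", "b", "y"}, "g3": {"x"}, "g4": {"y"}}
print(compare_profiles(["a", "b", "x", "y"], genomes))
# ([('a', 'b')], [('x', 'y')])
```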

Combining of methods

Each individual type of evidence is rather weak => we need to combine methods in every case.

BlastP => general biochemical function

Positional clustering and/or domain shuffling and/or phylogenetic profiling => assignment to functional system

Metabolic reconstruction => gaps in this system

Try to place the product of our gene into each gap => (if we are lucky) exact biochemical function and exact position in the metabolic pathway

Archaeal shikimate kinase

Chorismate biosynthesis pathway (E. coli)

Pectin utilization

E. chrysanthemi

… and transport of oligogalacturonates

E. chrysanthemi

Y. pestis

K. pneumoniae

YpaA: riboflavin transport

• 5 predicted TM segments => potential transporter

• Regulatory RFN-element => co-regulation with genes from riboflavin metabolism => transport of riboflavin or one of its precursors

• S. pyogenes, E. faecalis, Listeria: have ypaA, no genes of riboflavin biosynthesis => transport of riboflavin

So, prediction: YpaA is a riboflavin transporter (Gelfand et al., 1999)

Verification:

• YpaA imports riboflavin (genetic analysis, Kreneva et al., 2000)

• YpaA is regulated by riboflavin (microarray expression analysis, Lee et al., 2001; direct verification, Winkler et al., 2002).

Genome evolution. Repeats

• More than 45% of the human genome is repetitive DNA

• A. Smith: “The best algorithm of gene prediction is to mask the repeats, and the rest will be genes!”

• Genome-specific classes of repeats are unique markers of genome post-speciation evolution (did humans appear due to special repeats?!)

• Too many repeats => this task is computational

• Influence on gene recognition, similarity search and other genomic analyses. Mask repeats first!

RepeatMasker

www.repeatmasker.org/

Duplications in genomes. Example of a locus with internal duplications

[Figure: map of the MAGE-A locus with genes MAGEA9a, LW-1a, FAM11a, LW-1b, MAGEA9b, … 2 Mb …, MAGEA4, GABRE, MAGEA5, MAGEA10, GABRA3, GABRQ, MAGEA6, TRAG3a, MAGEA2a, MAGEA12, CSAGE, MAGEA2b, TRAG3b, MAGEA3, … 6 genes …, MAGEA1, …, MAGE8; two copies each of repeat I and repeat II are marked]

MAGE-A locus, human X chromosome

Duplications

• The main problem of duplications: assembly of newly sequenced genomes

• No universal solution: every group uses its own algorithm and software

Human genome: the number of duplications changes from one release to another. The two initial versions (Int. consortium, Celera) differed significantly in their duplications

• Human chromosomes cut into > 100 pieces and reassembled become a reasonable facsimile of the mouse chromosome

Synteny groups

Rearrangements as a unit of genome evolution

rearrangement

Rearrangements of alfalfa and garden pea: transforming alfalfa into pea

Whole genome duplication in yeast

Kellis M, Birren BW, Lander ES. (2004) Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature. 428:617-24

Thank you!

The BioSapiens project is funded by the European Commission within its FP6 Programme, under the thematic area "Life sciences, genomics and biotechnology for health," contract number LHSG-CT-2003-503265.
