Complete genomes
[Figure: bar chart of the number of complete genomes, horizontal axis 1995-2002, vertical axis 0-90]
Brief calculation
Approximately 233 complete genomes, with about 3000 genes each on average.
Almost all genes are new and unstudied.
In a lab, investigating the function of one gene takes at least one postdoc-year.
Hurrah! We now have work for all molecular biologists for thousands of years!
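The arithmetic behind this estimate can be sketched as follows. The genome and gene counts come from the slide; the number of postdocs working in parallel is a purely hypothetical parameter added for illustration.

```python
# Figures taken from the slide
genomes = 233
genes_per_genome = 3000        # average
postdoc_years_per_gene = 1     # lower bound

total_postdoc_years = genomes * genes_per_genome * postdoc_years_per_gene


def calendar_years(n_postdocs):
    """Years needed if n_postdocs (a hypothetical number) work in parallel."""
    return total_postdoc_years / n_postdocs


print(total_postdoc_years)     # 699000
print(calendar_years(100))     # 6990.0
```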
We have a new “complete genome”. What can we do with it now (in silico)?
(outline of the lecture)
• Gene recognition
• Prediction of regulation of gene expression
• Functional annotation of proteins
• Metabolic reconstruction
• Study of genome evolution
Main differences:
Prokaryotes and Eukaryotes
Gene recognition I. Prokaryotes
• Projection of known genes
• Genome comparisons
• Finding long ORFs
• Using DNA statistics
• Identification of gene starts
Size of a prokaryotic genome:
Pathogenic bacteria: from < 1 Mb and 600 genes
Free-living bacteria: up to 6-9 Mb and 9000 genes
E.g., Escherichia coli: 4.6 Mb, 4400 genes
Mapping “known” genes
BLASTx: //www.ncbi.nlm.nih.gov/BLAST/
Very informative when a closely related genome is well studied, but this is rare.
In other cases, the problems are the choice of thresholds and the fine mapping of start positions. There are no perfect solutions.
Using long ORFs
–What minimal length is functional?
–Which Met is the start?
ORFs in a fragment of the K. pneumoniae genome
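A minimal sketch of a long-ORF scan over six frames, assuming the standard stop codons and ATG as the only start. The function name `find_orfs` and the threshold `min_codons` are hypothetical; the slides do not say which minimal length to use, which is exactly the first open question above. All in-frame ATGs are reported because the second question (which Met is the start) cannot be settled by the ORF alone.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=100):
    """Return (strand, start, end, candidate_starts) for each long ORF.

    Coordinates are 0-based on the scanned strand; `end` is just past the
    stop codon. `candidate_starts` lists every in-frame ATG in the ORF.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            atg_positions = []  # in-frame ATGs seen since the last stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon == "ATG":
                    atg_positions.append(i)
                elif codon in STOPS:
                    # length is measured from the first ATG to the stop
                    if atg_positions and (i - atg_positions[0]) // 3 >= min_codons:
                        orfs.append((strand, atg_positions[0], i + 3, atg_positions))
                    atg_positions = []
    return orfs
```

Lowering `min_codons` finds more short genes at the cost of more random ORFs.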
Frequencies of codons differ from frequencies of non-coding triplets:
• frequencies of amino acids (and their codons);
• frequencies of dipeptides;
• frequencies of synonymous codons (genome-specific, correlate with tRNA concentration).
Use of DNA statistics in gene recognition
Coding potential
A function measuring whether the genomic fragment is coding or non-coding based on its DNA statistics.
We can calculate the coding potential for ORFs or for a sliding window.
“Sliding window” technique:
• Scan the DNA sequence with a sliding window of fixed size
• Calculate the coding potential for each window position and plot it above the sequence (horizontal axis)
• Choose the window size so as to minimize random noise
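One possible coding potential, sketched under assumptions: the slides do not define the function, so the mean log-likelihood ratio of codon frequencies (coding versus non-coding) is used here as a common choice. The function names, the pseudocount and the window parameters are all hypothetical.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codon_frequencies(seqs, pseudocount=1.0):
    """Estimate in-frame codon frequencies from training sequences (ACGT only)."""
    counts = Counter()
    for s in seqs:
        for i in range(0, len(s) - 2, 3):
            counts[s[i:i + 3]] += 1
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def coding_potential(fragment, coding, noncoding):
    """Mean log-likelihood ratio per codon; positive values suggest coding DNA."""
    codons = [fragment[i:i + 3] for i in range(0, len(fragment) - 2, 3)]
    return sum(math.log(coding[c] / noncoding[c]) for c in codons) / len(codons)


def sliding_window(seq, coding, noncoding, window=30, step=3):
    """Coding potential for each window position (one reading frame only)."""
    return [
        (start, coding_potential(seq[start:start + window], coding, noncoding))
        for start in range(0, len(seq) - window + 1, step)
    ]
```

A larger `window` smooths random noise but blurs gene boundaries, which is the trade-off named in the last bullet. A full scan would repeat this for all six frames.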
Exact mapping of gene start positions
• Prokaryotes: the start methionine codon is preceded by a ribosome-binding site (the so-called Shine-Dalgarno box, any part of GGAGGA)
• Extension of the nucleotide alignment with the orthologous region from a related genome: mutation patterns in the coding region differ from those in the intergenic region
rbsD in enterobacteria
Sty AGGGTTACACTGCGGC-CAGCGAAACGTTTCGCTAGTGGAGCAGAAAAATGAAGAAAGGC
Sen AGGGTTACACTGCGGC-CAGCGAAACGTTTCGCTAGTGGAGCAGAAAAATGAAGAAAGGC
Stm GGGGTTACACTGCGGC-CAGCGAAACGTTTCGCTAGTGGAGCAGAAAAATGAAGAAAGGC
Eco AGGATTAAACTGTGGGTCAGCGAAACGTTTCGCTGATGGAGAA-AAAAATGAAAAAAGGC
Ype TTTTCTAAACTCCTTGTTAGCGAAACGTTTCGCTCTTGGAGTA-GATCATGAAAAAAGGT

Sty ACCGTACTCAACTCTGAAATCTCGTCGGTCATTTCCCGTCTGGGGCATACTGATACTCTG
Sen ACCGTACTCAACTCTGAAATCTCGTCGGTCATTTCCCGTCTGGGGCATACTGATACTCTG
Stm ACCGTACTCAACTCTGAAATCTCGTCGGTCATTTCCCGTCTGGGGCATACTGATACTCTG
Eco ACCGTTCTTAATTCTGATATTTCATCGGTGATCTCCCGTCTGGGACATACCGATACGCTG
Ype GTATTACTGAACGCTGATATTTCCGCGGTTATCTCCCGTCTGGGCCATACCGATCAGATT
Pattern of nucleotide changes in protein-coding regions
Sty TCGCTCG--CAGCGGAAAGAGGATTACGCCCTTCGCCTGGAGGCTGTGCAGGGGC---GCCGGAGATGGGATGCATAATT
Stm TCGCTCG--CAGCGGAAAGAGGATTACGCCCTTCGCCTGGAGGCTGTGCAGGGGC---GCCGGAGATGGGATGCATAATT
Sen TCGCTCG--CAGCGGAAAGAGGATTACGCCCTTCGCCTGGAGGCTGTGCAGGGGC---GCCGGAGATGGGATGCATAATT
Eco TTGCCCG--TGCCAGACGGCAGATTATCTCCCTGACCTGGTGGTTGCCCAGGAGGAGGGCCGGAAATAGGTTGTATCATT
Kpn ----CGG--TGGCGCAGTGCCTGATGGG-CCTCGCCCTGGAGGACGGTCTGGCAT---ATCAGCAAGGGGGTGCGTCATG
Ype TTGTTAGAACAGGGGAAAACGGTAAACAGTGTGGCATTAGATGTCGGTTATAGCT-----CCGCCTCTGCTTTTATCGCC

Sty AATTATCCTTTAAC----------CATAAATCTGAGCAATA-TATGCTTGGCGGCCAGATTATGGC--ACACTTGTCCGG
Stm AATTATCCTTTAAC----------CATAAATCTGAGCAATA-TATGCCTGGCGGCCAGATTATGGC--ACACTTGTCCGG
Sen AATTATCCTTTAAC----------CATAAATCTGAGCAATA-TATGCCTGGCGGCCAGATTATGGC--ACACTTGTCCGG
Eco ACGTATCCTTATAC----------CTGAAATCTTCGCAAG--TATGCCTGGCCGCGAGATTATGGC--ACACTTGTCCGG
Kpn ATTCATCCTTTCGATATCGCGGTGCTGGAACCAGGTGATGAGTATGCCTGGCGGCCAGATTATGGC--ACACTTCCCCAG
Ype ATGTTTCAGCAAATAT--------CGGGTACCA-CGCCTGAGCGTTTCCGGCGGGGCAATAGTGGCTTATACTAAGCCCC

Sty TTAACTCTCGTT-CTCAAACAG------GTACGACAGTC--GTGAAAATTCTCGTTGATGAAAATATGCCTTACGCCCGC
Stm TTAACTCTCGTT-CTCAAACAG------GTACGACAGTC--GTGAAAATTCTCGTTGATGAAAATATGCCTTACGCCCGC
Sen TTAACTCTCGTT-CTCAAACAG------GTACGACAGTC--GTGAAAATTCTCGTTGATGAAAATATGCCTTACGCCCGC
Eco TTAACTCTCGT--CTCATACAG------GTAACACAAAC--GTGAAAATCCTTGTTGATGAAAATATGCCTTATGCCCGC
Kpn TTAACTCTCGTT-CTCAGACAG------GTACTGAACT---GTGAAAATCCTCGTTGATGAAAATATGCCCTATGCCCGT
Ype CTGTTTTTCATCTGTATGGCAGTTCGCTGTCGGAGAGTAAAGTGAAAATTCTGGTTGATGAAAATATGCCGTACGCTGAG
                                             123123123123123123123123123123123123123
pdxB in enterobacteria
Operons
The majority of genes in prokaryotes are transcribed in operons. There are also some examples of operons in eukaryotes, e.g. in C. elegans.
Ideas for de novo prediction of operon structure are trivial:
• Small distance between adjacent genes
• Co-orientation (the genes lie on the same strand)
• Higher reliability when these features are conserved in different species
Additional arguments:
• Similar functional annotations of adjacent genes
• Observed co-expression
• Known average operon length
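The first two criteria can be sketched in a few lines. This is an illustration only: the gene tuple layout, the function name `predict_operons` and the 100 bp gap threshold are assumptions, not values from the slides, and conservation across species is not checked.

```python
def predict_operons(genes, max_gap=100):
    """Group adjacent co-oriented genes separated by at most max_gap bp.

    genes: iterable of (name, start, end, strand) tuples.
    Returns a list of operons, each a list of gene tuples.
    """
    operons = []
    for gene in sorted(genes, key=lambda g: g[1]):
        _, start, _, strand = gene
        if operons:
            prev = operons[-1][-1]
            # same strand and a short intergenic distance: extend the operon
            if prev[3] == strand and start - prev[2] <= max_gap:
                operons[-1].append(gene)
                continue
        operons.append([gene])
    return operons
```

For example, genes at 100-500 (+) and 520-900 (+) would be joined, while a gene starting 600 bp further on, or one on the opposite strand, would start a new group.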
Training for a completely new genome
All the methods discussed so far need some initial knowledge about genes in the genome (DNA statistics, minimal ORF length, etc.), taken from known genes or their very close orthologs.
When we have no information at all, we use an iterative process: the initial parameters are estimated from very long ORFs (and/or distant orthologs with reconstructed structure) taken as genes, and from regions with no ORFs taken as intergenic regions.
Gene recognition II. Eukaryotes
Specifics:
• Exon-intron structure
• 9-10 coding exons per gene on average (human), ~5 exons (insects)
• Average length of internal exons is 120-130 nucleotides
• Very long introns (> 10 kb) are frequent and may be longer than 1 Mb
• There are no Shine-Dalgarno sequences (the Kozak rule can be used instead, but it is much weaker)
=> ORFs and “sliding window” techniques are inapplicable!
The rat chymotrypsin gene
Inapplicability of “sliding window” technique for eukaryotic genomes
Nothing (intergenic region)
Search for “known” genes
BlastX is reliable only for large exons (short introns are treated as long deletions)
What can we use instead? Splicing signals!
“Spliced alignment” is an alignment of a DNA fragment with a sequence coding for a homologous protein. Unlike standard alignments, it may contain non-penalized long “deletions” flanked by splicing signals (that is, introns). Tools: BLAT, ProFrame, TWINSCAN
HMM (Hidden Markov Model)
Definition: An HMM is a 5-tuple $(Q, V, p, A, E)$, where:
$Q$ is a finite set of states, $|Q| = N$
$V$ is a finite set of observation symbols per state, $|V| = M$
$p$ is the vector of initial state probabilities
$A$ is the matrix of state transition probabilities, denoted $a_{st}$ for each $s, t \in Q$: $a_{st} \equiv P(x_i = t \mid x_{i-1} = s)$
$E$ is the emission probability matrix: $e_{sk} \equiv P(v_k \text{ emitted at step } i \mid x_i = s)$
Property: emissions and transitions depend only on the current state, not on the past.
Output: only the emitted symbols are observable, not the underlying random walk between states, hence “hidden”.
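To make the definition concrete, here is a toy two-state HMM and the joint probability of one state path with its emitted symbols. The state names and every probability value are invented for illustration; they are not parameters of any real gene finder.

```python
import math

# Toy model: Q = {coding, noncoding}, V = {A, C, G, T}; all numbers hypothetical
p = {"coding": 0.5, "noncoding": 0.5}
A = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
E = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def joint_log_probability(path, symbols):
    """log P(path, symbols): initial state, then alternating transition and emission."""
    logp = math.log(p[path[0]]) + math.log(E[path[0]][symbols[0]])
    for prev, cur, sym in zip(path, path[1:], symbols[1:]):
        logp += math.log(A[prev][cur]) + math.log(E[cur][sym])
    return logp
```

In practice only `symbols` is observed, so a gene finder searches over all paths for the most probable one rather than scoring a single given path.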
HMM-based Gene Finding
• GENSCAN (Burge 1997)
• FGENESH (Solovyev 1997)
• HMMgene (Krogh 1997)
• GENIE (Kulp 1996)
• GENMARK (Borodovsky & McIninch 1993)
• VEIL (Henderson, Salzberg, & Fasman 1997)
GenScan Overview
• Developed by Chris Burge (Burge 1997), in the research group of Samuel Karlin, Dept of Mathematics, Stanford Univ.
• Characteristics:
– Designed to predict complete gene structures
   • Introns and exons, promoter sites, polyadenylation signals
– Incorporates:
   • Descriptions of transcriptional, translational and splicing signals
   • Length distributions (explicit state duration HMMs)
   • Compositional features of exons, introns, intergenic and C+G regions
– Larger predictive scope
   • Deals with partial and complete genes
   • Multiple genes separated by intergenic DNA in a sequence
   • Consistent sets of genes on either/both DNA strands
• Based on a general probabilistic model of genomic sequence composition and gene structure
GenScan Architecture
• It is based on Generalized HMM (GHMM)
• Models both strands at once
– Other models: predict on one strand first, then on the other strand
– Avoids prediction of overlapping genes on the two strands (rare)
• Each state may output a string of symbols (according to some probability distribution).
• Explicit intron/exon length modeling
• Special sensors for Cap-site and TATA-box
• Advanced splice site sensors
Regulation
Less than 5% of the human genome sequence is protein-coding. What is the role of the remaining DNA?
It has been suggested that a much larger part of the human genome encodes the regulatory machinery.
Processes whose regulation we try to predict:
• Transcription (DNA → RNA)
• Splicing (pre-mRNA → mRNA)
• Translation (mRNA → protein)
Two types of analysis of regulation
Prediction of regulatory signal
Finding new sites
Identification of the signal
A signal is an ideal “site”, or the set of ALL observed sites.
A site is a representative of the signal in the genome.
Deriving the signal ab initio. I. Ubiquitous (necessary) signals
• Examples: transcription promoters, ribosome-binding signal, acceptor and donor splicing sites, stop codon, polyadenylation signal
• We know many examples and some biological characteristics (and landmarks)
• Often short (4-6 nucleotides)
Re-alignment approaches
• Initial alignment by a biological landmark
– start of transcription for promoters
– start codon for ribosome-binding sites
– exon-intron boundary for splicing sites
• Fix the width of the sliding window and the expected signal size
• Derive the signal (the most frequent word) within the sliding window
• Repeat for other parameters, select the best set
• Re-align, anchoring on the signal
• Identify the signal positions (those with non-uniform nucleotide frequencies)
Gene starts of Bacillus subtilis
dnaN ACATTATCCGTTAGGAGGATAAAAATG
gyrA GTGATACTTCAGGGAGGTTTTTTAATG
serS TCAATAAAAAAAGGAGTGTTTCGCATG
bofA CAAGCGAAGGAGATGAGAAGATTCATG
csfB GCTAACTGTACGGAGGTGGAGAAGATG
xpaC ATAGACACAGGAGTCGATTATCTCATG
metS ACATTCTGATTAGGAGGTTTCAAGATG
gcaD AAAAGGGATATTGGAGGCCAATAAATG
spoVC TATGTGACTAAGGGAGGATTCGCCATG
ftsH GCTTACTGTGGGAGGAGGTAAGGAATG
pabB AAAGAAAATAGAGGAATGATACAAATG
rplJ CAAGAATCTACAGGAGGTGTAACCATG
tufA AAAGCTCTTAAGGAGGATTTTAGAATG
rpsJ TGTAGGCGAAAAGGAGGGAAAATAATG
rpoA CGTTTTGAAGGAGGGTTTTAAGTAATG
rplM AGATCATTTAGGAGGGGAAATTCAATG
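cons.  aaagtatataagggagggttaataATG
num.   001000000000110110000000111
       760666658967228106888659666
(num.: number of sequences carrying the consensus nucleotide in each column, read vertically, tens digit above units digit)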
After re-alignment on the signal:
cons.  tacataaaggaggtttaaaaat
num.   0000000111111000000001
       5755779156663678679890
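The "most frequent word" step can be sketched on the Bacillus subtilis gene starts listed above. The function name, the word length `k = 5` and the window limits are hypothetical choices; the window simply excludes the start codon so that ATG-containing words are not counted.

```python
from collections import Counter

# Upstream regions from the slide, aligned by the start codon (last 3 nt)
UPSTREAM = [
    "ACATTATCCGTTAGGAGGATAAAAATG", "GTGATACTTCAGGGAGGTTTTTTAATG",
    "TCAATAAAAAAAGGAGTGTTTCGCATG", "CAAGCGAAGGAGATGAGAAGATTCATG",
    "GCTAACTGTACGGAGGTGGAGAAGATG", "ATAGACACAGGAGTCGATTATCTCATG",
    "ACATTCTGATTAGGAGGTTTCAAGATG", "AAAAGGGATATTGGAGGCCAATAAATG",
    "TATGTGACTAAGGGAGGATTCGCCATG", "GCTTACTGTGGGAGGAGGTAAGGAATG",
    "AAAGAAAATAGAGGAATGATACAAATG", "CAAGAATCTACAGGAGGTGTAACCATG",
    "AAAGCTCTTAAGGAGGATTTTAGAATG", "TGTAGGCGAAAAGGAGGGAAAATAATG",
    "CGTTTTGAAGGAGGGTTTTAAGTAATG", "AGATCATTTAGGAGGGGAAATTCAATG",
]


def most_frequent_word(seqs, k, window_start, window_end):
    """Return (word, n) for the k-word present in the most sequences."""
    hits = Counter()
    for s in seqs:
        window = s[window_start:window_end]
        # a set, so that each word is counted once per sequence
        hits.update({window[i:i + k] for i in range(len(window) - k + 1)})
    return hits.most_common(1)[0]


word, n_sequences = most_frequent_word(UPSTREAM, k=5, window_start=0, window_end=24)
print(word, n_sequences)
```

Re-aligning each sequence on the position of this word, and then recounting nucleotide frequencies per column, corresponds to the last two steps of the procedure.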
Deriving the signal. II. Transcription regulation
• Transcription factor binding sites
• Usually longer (10-20 nt or more)
• Small samples: only a few sites per genome, and very few examples are known
• Often have some symmetry
• Conserved among species
• Experimental studies are not sufficient: they define only the regulatory region
Use of symmetry
• DNA-binding factors and their signals
Co-operative homogeneous
Palindromes
Repeats
Co-operative non-homogeneous
Cassettes
Others
RNA signals: special conservative secondary structure
Signal, consensus
codB       CCCACGAAAACGATTGCTTTTT
purE       GCCACGCAACCGTTTTCCTTGC
pyrD       GTTCGGAAAACGTTTGCGTTTT
purT       CACACGCAAACGTTTTCGTTTA
cvpA       CCTACGCAAACGTTTTCTTTTT
purC       GATACGCAAACGTGTGCGTCTG
purM       GTCTCGCAAACGTTTGCTTTCC
purH       GTTGCGCAAACGTTTTCGTTAC
purL       TCTACGCAAACGGTTTCGTCGG
consensus     ACGCAAACGTTTTCGT
Pattern
consensus ACGCAAACGTTTTCGT
pattern aCGmAAACGtTTkCkT
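A pattern with degenerate positions can be searched with a regular expression. In this sketch the IUPAC codes are standard (m = A or C, k = G or T), but treating lower-case letters as strict matches is an assumption: in the slide they probably mark weaker conservation, and a real search might allow mismatches there.

```python
import re

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "M": "[AC]", "K": "[GT]",
    "S": "[CG]", "W": "[AT]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_sites(pattern, seq):
    """Start positions (0-based) of all, possibly overlapping, pattern matches."""
    body = "".join(IUPAC[ch] for ch in pattern.upper())
    # the lookahead lets overlapping occurrences be reported
    return [m.start() for m in re.finditer(f"(?=({body}))", seq.upper())]


print(find_sites("aCGmAAACGtTTkCkT", "CACACGCAAACGTTTTCGTTTA"))  # purT: [3]
print(find_sites("aCGmAAACGtTTkCkT", "CCCACGAAAACGATTGCTTTTT"))  # codB: []
```

The codB site is missed because of a single mismatch, which illustrates why a weight matrix is usually preferred to a strict pattern.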
Frequency matrix
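j   a  C  G  m  A  A  A  C  G  t  T  T  k  C  k  T
A   6  0  0  2  9  9  8  0  0  1  0  0  0  0  0  0
C   1  8  0  7  0  0  1  9  0  0  0  0  0  9  1  0
G   1  1  9  0  0  0  0  0  9  1  1  0  5  0  5  0
T   1  0  0  0  0  0  0  0  0  7  8  9  4  0  3  9
Positional weights: $W(b,j) = \ln(N(b,j) + 0.5) - 0.25 \sum_i \ln(N(i,j) + 0.5)$
Information content: $I = \sum_j \sum_b f(b,j) \log \frac{f(b,j)}{p(b)}$
Here $N(b,j)$ is the count of nucleotide $b$ at position $j$, $f(b,j)$ is its frequency, and $p(b)$ is the background frequency of $b$.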
Positional weight matrix (PWM)
j     a     C     G     m     A     A     A     C     G     t     T     T     k     C     k     T
A   1.1  -1.0  -0.7   0.5   2.2   2.2   1.9  -0.7  -0.7  -0.1  -1.0  -0.7  -1.1  -0.7  -1.4  -0.7
C  -0.4   1.9  -0.7   1.6  -0.7  -0.7   0.1   2.2  -0.7  -1.2  -1.0  -0.7  -1.1   2.2  -0.3  -0.7
G  -0.4   0.1   2.2  -1.1  -0.7  -0.7  -1.0  -0.7   2.2  -0.1   0.1  -0.7   1.2  -0.7   1.0  -0.7
T  -0.4  -1.0  -0.7  -1.1  -0.7  -0.7  -1.0  -0.7  -0.7   1.5   1.9   2.2   1.0  -0.7   0.6   2.2
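The count matrix, the positional weights and the information content can be recomputed from the nine sites with a short script. The formulas are the ones given in the slides; scoring a site as the sum of its positional weights and the uniform background `p(b) = 0.25` are assumptions made for this sketch.

```python
import math

ALIGNED = [
    "CCCACGAAAACGATTGCTTTTT", "GCCACGCAACCGTTTTCCTTGC", "GTTCGGAAAACGTTTGCGTTTT",
    "CACACGCAAACGTTTTCGTTTA", "CCTACGCAAACGTTTTCTTTTT", "GATACGCAAACGTGTGCGTCTG",
    "GTCTCGCAAACGTTTGCTTTCC", "GTTGCGCAAACGTTTTCGTTAC", "TCTACGCAAACGGTTTCGTCGG",
]
SITES = [s[3:19] for s in ALIGNED]  # the 16 columns covered by the consensus


def count_matrix(sites):
    """N(b, j): count of nucleotide b at position j."""
    return [{b: sum(s[j] == b for s in sites) for b in "ACGT"}
            for j in range(len(sites[0]))]


def weight_matrix(counts):
    """W(b, j) = ln(N(b, j) + 0.5) - 0.25 * sum_i ln(N(i, j) + 0.5)."""
    pwm = []
    for col in counts:
        mean_log = 0.25 * sum(math.log(col[i] + 0.5) for i in "ACGT")
        pwm.append({b: math.log(col[b] + 0.5) - mean_log for b in "ACGT"})
    return pwm


def information_content(counts, background=0.25):
    """I = sum_j sum_b f(b, j) * log(f(b, j) / p(b)), with 0 * log 0 taken as 0."""
    total = 0.0
    for col in counts:
        n = sum(col.values())
        for b in "ACGT":
            f = col[b] / n
            if f > 0:
                total += f * math.log(f / background)
    return total


def site_score(pwm, site):
    """Score of a candidate site: sum of its positional weights (assumed)."""
    return sum(col[b] for col, b in zip(pwm, site))


counts = count_matrix(SITES)
pwm = weight_matrix(counts)
```

Scanning a genome then amounts to computing `site_score` at every position and keeping the positions above a chosen threshold.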
Greedy algorithms (MEME)
Find a signal among all k-words (assuming that we know the signal length).
Checking all k-words is too time-consuming (k ~ 16), so initially we consider only the k-words that are present in the fragments.
For each k-word construct a matrix of “sites”: alignment of best “copies” of the k-word from every sequence fragment.
Select the best k-word. What is the measure for comparison of matrices? Information content!
Greedy algorithms. Cont’d
• Select the k-word with maximal information content
Problem. We considered only k-words from our sequences => may select not the signal (the consensus word), but only its best representative in our sample
Solution. For each k-word from the sample construct PWM and reconstruct the frequency matrix based on it. Repeat until stabilization of the matrix. Use the consensus of this matrix.
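A sketch of the first stage of this search (seed k-words from the sample, best copies, information content) is given below. It is not MEME itself: choosing the best copy by Hamming distance, the pseudocount and the uniform background are assumptions, and the iterative PWM refinement described in the solution is left out to keep the example short.

```python
import math
from collections import Counter


def info_content(sites, background=0.25, pseudo=0.5):
    """Information content of aligned sites, with pseudocounts."""
    total = 0.0
    for column in zip(*sites):
        counts = Counter(column)
        n = len(column) + 4 * pseudo
        for b in "ACGT":
            f = (counts[b] + pseudo) / n
            total += f * math.log(f / background)
    return total


def best_copy(word, seq):
    """The k-mer of seq closest to word by Hamming distance (assumed criterion)."""
    k = len(word)
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return min(kmers, key=lambda km: sum(a != b for a, b in zip(word, km)))


def greedy_signal(seqs, k):
    """Try every k-word present in the sample; keep the most informative site set."""
    seeds = {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}
    best_sites, best_ic = None, float("-inf")
    for seed in sorted(seeds):
        sites = [best_copy(seed, s) for s in seqs]
        ic = info_content(sites)
        if ic > best_ic:
            best_sites, best_ic = sites, ic
    return best_sites, best_ic


def consensus(sites):
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*sites))
```

The refinement step would build a PWM from `best_sites`, rescan each sequence for its best-scoring k-mer, and repeat until the site set stops changing.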
Limitation of greedy algorithms
• We start from k-words in our sequences and increase the information content at each step => we find a local (not global) maximum of the functional.
• We need an alternative algorithm that will not be “greedy”!
Gibbs sampler
Let A be a signal (a set of sites), and I(A) be its information content.
At each step a new site is selected in one sequence with probability
$P \sim \exp[I(A_{new})]$
For each candidate site the total time of occupation is computed.
(Note that the signal changes all the time)
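A minimal sampler along these lines might look as follows. The sampling rule P ~ exp[I(A_new)] and the occupation counts come from the slide; the random starting positions, the pseudocounts, the number of steps and the choice of reporting the most occupied position per sequence are assumptions.

```python
import math
import random
from collections import Counter


def info_content(sites, background=0.25, pseudo=0.5):
    """Information content of aligned sites, with pseudocounts."""
    total = 0.0
    for column in zip(*sites):
        counts = Counter(column)
        n = len(column) + 4 * pseudo
        for b in "ACGT":
            f = (counts[b] + pseudo) / n
            total += f * math.log(f / background)
    return total


def gibbs_sampler(seqs, k, steps=200, seed=0):
    """Return the most occupied site position per sequence and the occupation counts."""
    rng = random.Random(seed)
    positions = [rng.randrange(len(s) - k + 1) for s in seqs]
    occupancy = [Counter() for _ in seqs]
    for _ in range(steps):
        idx = rng.randrange(len(seqs))            # sequence to resample
        candidates = list(range(len(seqs[idx]) - k + 1))
        weights = []
        for pos in candidates:
            trial = positions[:idx] + [pos] + positions[idx + 1:]
            sites = [s[p:p + k] for s, p in zip(seqs, trial)]
            weights.append(math.exp(info_content(sites)))  # P ~ exp[I(A_new)]
        positions[idx] = rng.choices(candidates, weights=weights)[0]
        for i, pos in enumerate(positions):       # total time of occupation
            occupancy[i][pos] += 1
    return [occ.most_common(1)[0][0] for occ in occupancy], occupancy
```

Because moves are accepted with a probability rather than only when they improve I(A), the sampler can leave a local maximum that would trap the greedy search.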
Recognition of signals I. Ubiquitous signals
• Consensus
• Pattern (consensus with degenerate positions)
• Positional weight matrix (PWM, or profile)
Weight of the site: $W(b,j) = \ln(N(b,j) + 0.5) - 0.25 \sum_i \ln(N(i,j) + 0.5)$
• Logical rules
• Neural networks
Neural networks: architecture
• 4k input neurons (sensors), each responsible for observing a particular nucleotide at particular position
OR 2k neurons (one discriminates between purines and pyrimidines, the other, between A/T and G/C)
• One or more layers of hidden neurons
• One output neuron
• Each neuron is connected to all neurons of the next layer
• Each connection is ascribed a numerical weight
A neuron
• Sums the inputs at incoming connections
• Compares the total with the threshold (or transforms it according to a fixed function)
• If the threshold is passed, excites the outgoing connections (resp. sends the modified value)
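The input encoding and a single threshold neuron can be sketched as follows. The 4k one-hot encoding is the one described above; the weights and the threshold in the usage note are invented and not trained on anything.

```python
def one_hot(site):
    """4k inputs: one sensor per (position, nucleotide) pair."""
    return [1.0 if site[j] == b else 0.0 for j in range(len(site)) for b in "ACGT"]


def neuron(inputs, weights, threshold):
    """Sum the weighted inputs and fire (1) if the total passes the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0
```

For instance, with `weights = one_hot("GG")` and `threshold = 1.5`, the neuron fires for the input `one_hot("GG")` (total 2.0) but not for `one_hot("GA")` (total 1.0). A network stacks layers of such units and learns the weights from sites and non-sites.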
Neural networks: architecture. II
Training of the neural network
• Sites and non-sites from the training sample are presented one by one.
• The output neuron produces the prediction.
• The connection weights increase if the prediction is correct and decrease if it’s incorrect.
Networks differ by architecture, particulars of the signal processing, the training schedule
• Neural networks don’t work: they need training, and there are too few examples
• PWM is OK, but gives too many false positive predictions => we need rules to select the true sites among the predicted ones.
• Many genomes are available => comparative approach:
– Consistency filtering
– Phylogenetic footprinting
– Phylogenetic shadowing
Recognition of signals II. Regulation of transcription
Definition of orthologs
Duplication
Speciation
• Orthologous genes:
– the result of speciation
– the “same” role in the cell
• Paralogous genes:
– the result of duplication
– keep a common biochemical function
Example: gluconate and idonate kinases
[Figure labels: Genome 1, Genome 2; genes A1, B1, A2, B2]
Consistency filtering
Basic assumption. Regulons (sets of co-regulated genes) are conserved =>
• True sites occur upstream of orthologous genes
• False sites are scattered at random
We need to check that the transcription factors themselves are true orthologs (BBH and COGs are not sufficient; check conservation of the DNA-binding domain and of the core pathway) and have exactly the same specificity (similar binding sites), and then compare the genes (and whole operons) downstream of the predicted sites.
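The filtering step itself reduces to counting in how many genomes a candidate site precedes a member of the same ortholog group. In this sketch the input layout, the function name and the `min_genomes` threshold are hypothetical, and the ortholog assignment is taken as given.

```python
from collections import Counter


def consistency_filter(predicted, min_genomes=2):
    """Keep ortholog groups with a predicted site upstream in enough genomes.

    predicted: dict mapping genome name -> set of ortholog-group identifiers
    whose members have a candidate site upstream in that genome.
    """
    support = Counter(group for groups in predicted.values() for group in set(groups))
    return {group for group, n in support.items() if n >= min_genomes}
```

Candidate sites that appear upstream of a gene in one genome only are treated as probable false positives and dropped.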
The basic procedure
[Figure labels: Set of known sites; Profile; Genome 1; Genome 2; Genome N]
Closely related genomes: Phylogenetic footprinting
Regulatory sites are more conserved than non-coding regions in general and are often seen as conserved islands in alignments of gene upstream regions.
Low conservation
yjcD
ST AAA-GCATAAAAAGCGGCAAAGTTCAGTTGAAAAAGCGTTGATGATCGCTGGATAATCGTTTGCTTTTTTTTG---CCAC
EC AAA-GAGAAAAAAGCAGCAAACTTCGGTTGAAAAAGCCGCTATGATCGCCGGATAATCGTTTGCTTTTTTTA----CCAC
YP AAATGTATTAAATGTCGCATTCGGGTGTTGATTAGTCACCACTGATGGCTAGATAATCGTTTGCCTTAAATGACATCTGC

ST CC--------GTTTTGT--------ATACGTG----GAGCTAAACGTTTGCTTTTTTGCGGCGCCCCG-G-TTGTCGTAA
EC CC--------GTTTTGT--------ATGCGCG----GAGCTAAACGTTTGCTTTTTTGCGACGCAGCA-AATTGTCGCAA
YP CCTAAACTTCGATTTTTTTTCAGTCATGCGTTCTCCCAGCTAATCGTTTGCTATTTTTCCCCGCTCTATGAGTCAGGGAG

ST ATGTAGC----------ACAAGGA-GATAACGTTGCGCTGTTAGTGGATTACCTCCCACGTATACCGACGAATAATAAAT
EC ACCTGGA----------GCAGGAA-GATAACGTTTCGCTGGCAGGGGATTGTCCGCCACGCATCTTGACGAAAATTAAAC
YP AGTTAGTGAGTTCATCGACAGGAACGGAAACGATTACGTAGAGAAGGGCGCTTGGCTTGGCATGCTATTTTAAAATGA-C

ST TCTCAGGGGATGTTTTCT-ATGTCT------ACGCCTTCAGCGCGTACCGGCGGTTCACTCGACGCCTGGTTTAAAATTT
EC TCTCAGGGGATGTTTTCTTATGTCT------ACGCCATCAGCGCGTACCGGCGGTTCACTCGACGCCTGGTTTAAAATTT
YP ACACAGGGGACATCACC--ATGTCTAGCAGCAACCCTCAAGCACAGCCAAAGGGCACGCTTGATGCATTCTTTAAGCTTA
High conservation
purL
ST AGCGGCATTTTGCGTAACAATGCGCCAGTTGGCAACTT-ATT-CGCAACGATAGCCGCACC--GTATGACAAGAAAAAGC
EC AGCGGCATTTTGCGTAAACCTGCGCCAGATGGCAACTT-ATT-ACAGCCATTGGCGGCACG--CGTTGCTAATTCACGAT
YP AGTGGCATTTTGCGCAACAAAACGCCAGTGTGCAACTTTATTGCGAGCTATTTGCTGAGTCTGCGTTACACACACATAGC

ST GG-TGATT---------TTATTTCT-------ACGCAAACGGTTTCGTCGGCGCGTCAGATTCTTTATAATGACGGCCGT
EC GG-TGATT---------TTATTTCC-------ACGCAAACGGTTTCGTCAGCGCATCAGATTCTTTATAATGACGCCCGT
YP GGCTGTTTCTGACTGAATTATTAATAATAGATACGCAAACGGTTTCGTCGGCGGCTCAGATTCACTATAATGGCGCGCGT

ST TTCCCCCC-------------------TTGCGCACACCAAA--------------GCTTAGAAGACGAGAGA--CTTA--
EC TTCCCCCCC------------------TTGGGTACACCGAAA-------------GCTTAGAAGACGAGAGA--CTTA--
YP TTTGCCCTGTTGTTGCGCCAATGAATGTTGCGCCCAATGAAGTGCTGTTCCAGCCGCTTCGAAGACGAGAGAAACTTAGA

ST TGATGGAAATTCTGCGTGGTTCGCCTGCACTGTCTGCATTCCGTATCAATAAACTGCTGGCGCGCTTTCAGGCTGCCAAC
EC TGATGGAAATTCTGCGTGGTTCGCCTGCACTGTCGGCATTCCGAATCAACAAACTGCTGGCACGTTTTCAGGCTGCCAGG
YP TTATGGAAATACTGCGTGGTTCACCCGCTTTGTCGGCTTTTCGTATCACCAAACTGTTGTCCCGTTGCCAGGATGCTCAC
Another variation. Phylogenetic shadowing
Idea. Instead of distant orthologs, use very close orthologs, but from many (very close) species. True sites would look like islands of strongly conserved columns in the multiple alignment.
Need to sequence orthologous upstream regions from a series of close genomes (e.g., from many different primates) and analyze their multiple alignment
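Both footprinting and shadowing come down to finding runs of conserved columns in a multiple alignment of upstream regions. A minimal sketch, assuming that "conserved" means identical in all rows with no gaps, and with a hypothetical minimal island length:

```python
def conserved_islands(rows, min_len=5):
    """Return (start, end) column ranges where all aligned rows are identical."""
    islands, start = [], None
    for j, column in enumerate(zip(*rows)):
        conserved = len(set(column)) == 1 and column[0] != "-"
        if conserved and start is None:
            start = j
        elif not conserved and start is not None:
            if j - start >= min_len:
                islands.append((start, j))
            start = None
    # close an island that runs to the end of the alignment
    if start is not None and len(rows[0]) - start >= min_len:
        islands.append((start, len(rows[0])))
    return islands


toy_alignment = ["AT-ACGCAAACGTTA", "GTCACGCAAACGCTA", "A-TACGCAAACGATT"]
print(conserved_islands(toy_alignment))  # [(3, 12)]
```

With very close species almost every column is conserved, so shadowing needs many genomes and a statistical model of the expected substitution rate rather than this all-or-nothing rule.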
RNA regulation. Riboswitches
The mRNA leader region has two alternative conformations: one of them blocks expression.
Two main cases (prokaryotes): a terminator interrupts transcription, or a special structure blocks the ribosome-binding site.
Eukaryotes: a splicing site is blocked.
Riboswitches are RNA signals stabilized by a small molecule.
Capitals: invariant (absolutely conserved) positions.
Lower case letters: strongly conserved positions.
Obligatory base pairs are set in bold.
Degenerate positions: R = A or G; Y = C or U; K = G or U; B= not A; V = not U. N: any nucleotide. X: any nucleotide
Example of the secondary structure of riboswitch
Importance of predicting RNA regulation as a bioinformatics problem
• The phenomenon was discovered by means of bioinformatics
• The RNA signal is strongly conserved (at the sequence level, not only as a secondary structure) => well predictable (no “false positive” predictions)
• A considerable share of genes is regulated this way (~5% of all genes in some species)
Assignment of function based on homology
We want to characterize a new gene. What is the function of the product?
The first step: BlastP.
The best case: we obtain a hit with a known function.
Have we obtained functional information about our gene?
Similarity ≠ homology: the e-value is a measure of the statistical significance (non-randomness) of similarity.
Orthologs or paralogs?
The best proof is a phylogenetic tree, but building it is too time-consuming.
We use BBH (Bidirectional Best Hit).
COGs: Clusters of Orthologous Groups (//www.ncbi.nlm.nih.gov/COGs/new) (prokaryotes) or KOGs (eukaryotes)
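The BBH criterion is easy to state in code. In this sketch the similarity scores are assumed to be precomputed (for example, BlastP bit scores), and the dictionary layout and function names are hypothetical.

```python
def best_hits(scores):
    """scores: dict (query, subject) -> score. Returns query -> best subject."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    forward, backward = best_hits(a_vs_b), best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)
```

A gene whose best hit points to a protein that itself prefers another partner is not reported, which removes many, though not all, paralogs.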
Assignment of a new gene to specific functional system. I
• Positional clustering
Operon: co-transcription of several genes (common in prokaryotes, rare in eukaryotes, e.g. Caenorhabditis elegans). The genes are transcribed together, and so under exactly the same conditions => they are functionally related
Assignment of a new gene to specific functional system. II
• Genes are not in the same operon, but in the same locus: horizontal transfer
• Divergon: a regulatory signal affects genes on both the direct and the complementary strands (usually with opposite effects)
regulatory site(s)
gene (operon) on (+) strand
gene (operon) on (-) strand
Measure of positional closeness
Let’s use a measure of positional neighborhood: the fraction of divergent genomes in which our genes are located close together
Servers that predict functional dependence: ERGO (//www.cordis.lu/ergo/),
STRING (//string.embl.de/, may be described at the proteomics day): implementation and visualization of ALL the techniques related to this area
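One way to compute such a measure is sketched below. Several details are assumptions: genomes are represented as ordered gene lists, "close" means at most `max_separation` positions apart in that order, and the fraction is taken over the genomes that contain both genes.

```python
def positional_closeness(gene_a, gene_b, genomes, max_separation=2):
    """Fraction of genomes (among those with both genes) where they are neighbours.

    genomes: dict mapping genome name -> list of gene names in chromosomal order.
    """
    both = close = 0
    for order in genomes.values():
        if gene_a in order and gene_b in order:
            both += 1
            if abs(order.index(gene_a) - order.index(gene_b)) <= max_separation:
                close += 1
    return close / both if both else 0.0
```

Restricting the input to divergent genomes matters: in closely related genomes gene order is conserved by default and carries little functional signal.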
Eukaryotic case: domain shuffling
Compression of biochemical functions into single molecules
Prokaryotes: all enzymatic activities are carried out by separate proteins
Fungi: the FAS1 gene encodes activities 3 and 4; the FAS2 gene encodes activities 1, 2 and 5-7
Animals: all activities are encoded by fatty-acid synthase
Protein domains
InterPro: www.ebi.ac.uk/interpro/
Pfam: http://www.sanger.ac.uk/Software/Pfam/
Co-regulation
Genes that are distant in the genome, but are regulated similarly.
Very similar to the case of operons
But it’s hard to work with computationally. A lot of manual analysis is necessary.
Co-expression
• If the expression of two genes changes consistently in response to changing conditions or in time => they are functionally related
Microarray data analysis: a special area of bioinformatics (Transcriptomics session)
Protein-protein interactions
• Evidence of physical interaction directly proves that the proteins function together in one cellular system
Will be discussed in detail at the Proteomics session
Phylogenetic profiling
Usually a functional system is present or absent in a genome as a whole (or this holds for a separate subsystem) =>
If we have many distant complete genomes, we can compare patterns of occurrence (phylogenetic profile) for individual genes.
This is rather weak evidence, but useful in combination with other techniques.
The converse situation also is interesting: genes with complementary phylogenetic profiles may have identical function (non-orthologous displacement: paralogs, specificity changes or really different structure).
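Phylogenetic profiles are conveniently handled as 0/1 vectors over a fixed list of genomes. The two functions below are a sketch with hypothetical names; a real analysis would also correct for the phylogenetic relatedness of the genomes.

```python
def profile_agreement(p, q):
    """Fraction of genomes in which two genes are both present or both absent."""
    return sum(a == b for a, b in zip(p, q)) / len(p)


def are_complementary(p, q):
    """True if exactly one of the two genes occurs in every genome."""
    return all(a != b for a, b in zip(p, q))
```

For the profiles `(1, 1, 0, 0, 1)` and `(0, 0, 1, 1, 0)` the agreement is 0.0 and `are_complementary` returns True, the situation expected for non-orthologous displacement.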
Combining of methods
Each individual type of evidence is rather weak => we need to combine methods in every case.
BlastP => general biochemical function
Positional clustering and/or domain shuffling and/or phylogenetic profiling => assignment to a functional system
Metabolic reconstruction => gaps in this system
Try to place the product of our gene into each gap => (if we are lucky) exact biochemical function and exact position in the metabolic pathway
YpaA: riboflavin transport
• 5 predicted TM segments => potential transporter
• Regulatory RFN-element => co-regulation with genes of riboflavin metabolism => transport of riboflavin or one of its precursors
• S. pyogenes, E. faecalis, Listeria: have ypaA, no genes of riboflavin biosynthesis => transport of riboflavin
So, prediction: YpaA is a riboflavin transporter (Gelfand et al., 1999)
Verification:
• YpaA imports riboflavin (genetic analysis, Kreneva et al., 2000)
• YpaA is regulated by riboflavin (microarray expression analysis, Lee et al., 2001; direct verification, Winkler et al., 2002).
Genome evolution. Repeats
• More than 45% of the human genome is repetitive DNA
• A. Smith: “The best algorithm of gene prediction is to mask the repeats, and the rest will be genes!”
• Genome-specific classes of repeats are unique markers of post-speciation genome evolution (did humans appear due to special repeats?!)
• Too many repeats => this is a computational task
• Repeats influence gene recognition, similarity search and other genomic analyses. Mask repeats first!
Duplications in genomes. Example of a locus with internal duplications
[Figure: map of the MAGE-A locus on the human X chromosome, with duplicated blocks labelled repeat I and repeat II. Genes shown: MAGEA9a, LW-1a, FAM11a, LW-1b, MAGEA9b, (2 Mb), MAGEA4, GABRE, MAGEA5, MAGEA10, GABRA3, GABRQ, MAGEA6, TRAG3a, MAGEA2a, MAGEA12, CSAGE, MAGEA2b, TRAG3b, MAGEA3, (6 genes), MAGEA1, MAGE8]
Duplications
• The main problem of duplications: assembly of newly sequenced genomes
• No universal solution: every group uses its own algorithm and software
Human genome: the number of duplications changes from one release to another. The two initial versions (Int. consortium, Celera) differed significantly with respect to duplications
• Human chromosomes cut into > 100 pieces and reassembled become a reasonable facsimile of the mouse chromosomes
Synteny groups
Rearrangements as a unit of genome evolution
rearrangement
Rearrangements of alfalfa and garden pea
Transforming alfalfa into pea
Whole genome duplication in yeast
Kellis M, Birren BW, Lander ES. (2004) Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature. 428:617-24