Carsten Friis
Center for Biological Sequence Analysis
Technical University of Denmark
Genome Annotation and Gene-finding
27621 Prokaryotic Gene Discovery, Metagenomics and Pangenomics
Genome Sequencing
Annotation is the process of assigning biological meaning to segments of genomic DNA...
Outline
Some ‘trivial’ questions
− Why gene prediction?
− The problem of faster genomic sequencing
− What is a Gene?
The anatomy of a gene
Manual gene finding by you! (exercise)
Gene finder methods and performance
− NetGene2
− EasyGene
Why Look for Genes?
Genes are where the action is:
− Explain Basic Biological Functions
Protein kinases, Cyclins, etc.
− Explain Medical Conditions
Symptoms linked to certain genes
− Be Used for Treatment of Disease
− Contain commercial value
As enzymes (Lipases, Amylases, ’washing detergent’)
As drug targets (Ion channels, Receptors)
As therapeutic factors
Nobel Prizes & Genes
The history of genes and their analysis has produced several Nobel Prize winners:
− Richard J. Roberts and Phillip A. Sharp for their discoveries of split genes;
− Barbara McClintock for her discovery of mobile genetic elements;
− J. Michael Bishop and Harold E. Varmus for their discovery of the cellular origin of retroviral oncogenes;
− Francis Crick & James Watson for the DNA double helix structure.
− ….
Now... sequencing your entire genome in two months
It took longer than 10 years and $4 bn to sequence the three billion letters of the human genome, which was a composite made from dozens of different individuals.
454 Life Sciences makes an innovative DNA sequencing machine, which proved capable of decoding Dr. Watson’s genome in 2 months at a cost of less than $1 million. A copy of his genome, recorded on a pair of DVDs, was presented to Dr. Watson in a ceremony in Houston (2007 May 31).
More than 500 organisms sequenced to date
We have the Genome Sequence......now what?
Are there still novel genes to be discovered?
– Yes!
What is the challenge?
– We don’t know how many genes there are!
– We don’t know where they are!
– We don’t know what they do!
The cure lies in high-quality automated gene finders...
What is a gene?
“Most problems have either many answers or no answer. Only a few problems have a single answer.” – Edmund C. Berkeley
Helen Pearson; Nature 441, 398-401, May 2006
What is a gene?
Genes are regions of DNA sequence which hold information required by the cell to generate proteins
Proteins are folded chains of amino acids whose shape and electro-chemical characteristics determine their function in the cell
Gene definition
A number of genes with distinct structures have been discovered:
a) RNA genes, which encode RNAs rather than proteins;
b) Pseudogenes, which were considered nonfunctional copies of genes;
c) Nested genes, located inside introns of other genes;
d) Overlapping genes, where parts of two genes overlap; and
e) Assembled genes, where several sections can reassemble into other genes.
Identification of putative non-coding RNA genes in the Burkholderia cenocepacia J2315 genome
Tom Coenye, Pavel Drevinek, Eshwar Mahenthiralingam, Shiraz Ali Shah, Ryan T. Gill, Peter Vandamme and David W. Ussery
ABSTRACT Non-coding RNA (ncRNA) genes are not involved in the production of mRNA and proteins, but produce transcripts that function directly as structural or regulatory RNAs. In the present study, we evaluated the presence of ncRNA genes in the genome of Burkholderia cenocepacia J2315. We used an approach in which we combined a comparative genomics (alignment-based) approach and the use of secondary structure information for the identification of putative ncRNAs genes. 213 putative ncRNA genes were identified in the B. cenocepacia J2315 genome and we could confirm upregulated expression of four of these by microarray analysis. Most of the ncRNA gene transcripts have a marked secondary structure that may allow interaction with other molecules. Several B. cenocepacia J2315 ncRNAs seem related to previously characterised ncRNAs involved in regulation of various cellular processes, while the function of many others remains unknown. The presence of a large number of ncRNA genes in this organism may help to explain its complexity, phenotypic variability and ability to survive in a remarkably wide range of environments.
Finding ncRNAs
Gene definition
The origins of “Gene"
It was coined by the Danish geneticist Wilhelm Johannsen in 1909 as a unit of calculation. At that time it was only an abstract concept.
In the early 1920s, H. J. Muller predicted that genes carry genetic information and can replicate themselves as real material entities (Muller, 1922; Muller, 1947).
Gene definition
Loose definition of a gene:
“A locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions and/or other functional sequence regions”
Structure, function and regulation of genes are all extremely complicated, more so than we suspected, and always beyond our imagination.
The Intron
Manual gene finding
Can U spot Spot?
Manual gene finding
DNA SequenceAAGAGGTAATTAAAGCTAAATGAAGTTGTAAGAGTGGCCCTATCGCATAGGACTAGTGTCCCTATAAGAACACGAAGAAATCACCTTAGAAAGGCTGAGAAAGGGCTGCAGGGCAGTGGGAGTGCAGACTGAAAGATGCAGACCACTGGGCTTCTACTTCTGTTTCCATTTCTGATCCGGCCTGCATCTGCCTCCTTCCTGAACAGGCCAGAGAATTCATCTAAATAGCCTAAGCAGGCTGGGTGCTGTGGCTCACCTGTAATCCCAACACTTGGGAGGCCGAGGTGGGCAGATCACCTGAGGTCAGGAGTTCAAGGCTAGCCTAGCCAACATGACAAAACCCCATCTCTACTAAAAAAATACAAAAATTAGCCAGGCATAGTGGCGCCTATAGTTCCAGCTACTTGGGGGCTGAGGTAGGAAGATCGCTAGAGCCTGGGAGGTTAAGGCTGCGGTGAGCTGTGATTGTGCCACTGCACTCCAGCCTGGGTGACAGAGCAAGACCCTGCCTCAAAAATAAATAAATAAATAAATAAATAAAAATAAGAGTGCTTGGCAGCTTGATCAAGCTATGCCAGGAACCCATCTCTCAAGCAGCAGCTCTTCTCCTGTGCCATTGTCAGCTTTGTCCTGTCTGAGTCCATGGGACTCTTCTGTTTGATGGTGGTCTTCCTCATCCTCTTCATCATGTGAAGCTCCATGGAGATCACCTACCCATACCTGCTTCTGTGACCTCATGCCATTCCTGGTGTTGGAATGTGCCAAGGTTTGCCATTAAACACACATTTCTCATTTCATAATTTCATATATATTATATATATGTGTGTGTGTGTGTGTTTATATATGCGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTATATATATATATATATATATATATATATATATATATATATAAAATATATAGGAAGAGGCACCAGAGAGCTCTCTGCATAGTCACAGAGGAAAGGTCATGTGAGGACAGCCAGAAGGCAGATGTCACAAGCCTCACCAGCAACCTACCATACCCTGCTTGTACCTCCATCCTGGAAGTCCAGCTTCTAAAATTAGAAGAAAATAGTCGGGTGTGGTGGCTCGCACCTATAATCCCAGCACTTTGGGAGGCTGATGTGGGAGGATCATTTGAGGTCAAGAGTTTGAAACCAGCCTAGGCAACATAGGGAGACCCTGTCTTTAAAAAAAATTTTTTTTTGTTTTAATTAGCTGGGTGTGATGGTGCACACCTGAGTCCTAGCTACTTGGGAGGCTGAGGTAGGAGGATCCCCTGAGCCCAGGGAAGTGGAGGCTGCAGTGAGCCATGATCACACCACTGCAATACAGCCTGGGTGACAGAGCAAGACCTTATCTCAAAATAAACAAACAAACAAAAAAGATGACAAAATAAATGTCTGTCGTTTAAGTCACCCATTCTGTGATATCTTGTTACGGCAGCCTGAACTGACCAATACACTTCCTCACCCAGTTTAAATTCCATGCTCAATCATAATCAGCCATTGCAATTACCCTCAACTGTATTATCAACCCTCAATTTGTATTAGTTGCTTGGCAAAACCCAAACCCTTGTGAAATCCAGTTCTTCTATATCTACATCGATGCTGCCGAATATGGCTGAAGAAAAGCAACTGTGTTGACTGGACTGCTTTAAATTCATGACCACTTACCTCAAGTGGGCACTTAACTTCCTGGCAATTATTCTACATTTTTCTAGTCCATTAACTCTCCTCCTCTCTGAGTTAATTATTTCACAGCTTTTCCTCCCTCTTTATACATGTTCCATCCTAACTCTCTGCTGATGACCTTGTTTCTTATTTCACTAATGGAGGCCACCAGGAGAGAACTCCCACAGCCATCAAATTCACCAAGCCAACAGCATCCTTACACAAATCCTCTGCCTTCTCTCTGGGCTGGCTGTGCCCTCTCTTTGCTCCTGCAATTTCCCTAACTCTCCTATACTGTTGTTATTCA
CTCTCCAGTGGATAATCACCATCAGGATGCAAAGATGCTGTACTAGCTTCTGAACTCTCCAAAAACCCAGGAAACAAAAAGGCAAAGGCTAAGCTTTTTCTTATTCCCCCTTATATACATATATATATATAGTAGGCACTCAATAAACATTCACTGAATGAATGAACAGTAATGCTCACTTGCCCATAAATACAAGTACCTCATCTTTTACCACAAAGGGTATTTGTAAATATTTAGGTTGTTTCTACCCAGATTATGGCTTGGTAATTCTTTTTTTTTTTTTCTAATTTTTATTTTTTTTCTAGGGACAGGGTCTCACTATGTTGCCCAGGATGGTCTTGAACTCCTGGGCTCAAGCATTCTGCCTGCCTTGGCCTCCTAAAGTGCTGAGATTACAGGCATGAGCCACCGTGCCTGCCTTCATGTATGTTTTTAGAACACAGAGAAAATGTGTTCTAAATGTGCTCATTGCTCAGCAATGAGCAAAGGCTTATGCAGTCACCACCAATCAAAAACTTTTTTTTTTTTTTTTGAGACAAGATCTTGCTCTGTTGCCCAGGCTGGAGTGCAGTGGCAGGATCATAGCAAGCTGCAGTCTTGACCTCATAGGCCTAAATCATCCTCCCACCTCAGCCTCACAAGTAGCTAAGACCACAGGTACAAGCCACCGTATCTAGCTAACTTTCAAAATTTTTTGAATTTTTAAATTTAAAAATTTTGAGGCCAGGCTGGCCTCAAACTCCTGAGCTCAAGCAATCCTCCCACCTTGGCTTCCCAAAGTGCTGGGATTATAGGCGTGAGCAACTGTACCTGGCAAAAACTTTTTAAGAGCTTCGCTTCCAGATTTAGGTTGTTTCTACCCAGATTATGGCTTGGTAATTCTTTTTTTTTTTTTCTAATTTTTATTTTTTTTCTAGGGACAGGGTCTCACTATGTTGCCCAGGATGGTCTTGAACTCCTGGGCTCAAGCATTCTGCCTGCCTTGGCCTCCTAAAGTGCTGAGATTACAGGCATGAGCCACCGTGCCTGCCTTCATGTATGTTTTTAGAACACAGAGAAAATGTGTTCTAAATGTGCTCATTGCTCAGCAATGAGCAAAGGCTTATGCAGTCACCACCAATCAAAAACTTTTTTTTTTTTTTTTGAGACAAGATCTTGCTCTGTTGCCCAGGCTGGAGTGCAGTGGCAGGATCATAGCAAGCTGCAGTCTTGACCTCATAGGCCTAAATCATCCTCCCACCTCAGCCTCACAAGTAGCTAAGACCACAGGTACAAGCCACCGTATCTAGCTAACTTTCAAAATTTTTTGAATTTTTAAATTTAAAAATTTTGAGGCCAGGCTGGCCTCAGATTAGGCAACTTTAACCTTCAACAGTGATCATAACCCTTAGTTTTCAGATCCGATTAAGGGAAATGTGTAATGTCTTACTGACACACTAATCCCATCACTGCTCACACCACCCACAATTAGCTGAG
Start codon: ATG
Stop codons: TAA, TAG, TGA
>example (950 bp)
1 CTCCCTTAGA AGACTCCAGC AAGTTATTTG AAGAGGTCTT TGGAGACATG
51 GTGAGTTCTC TTTCCTTCCC AGAAGGTAAG TCTCACTGTA AGGTCTTTAT
101 GTCTTGTGTG TCCCCCAGCA GCCTTGTCAT CTCCGGCTGC CCTAGACCTG
151 CATAAGGACA GATTGAGTGT GCTGGGATAG ACTTTTGTTG ACAAAGGGGC
201 TGCTCTGCCC TTCTAAGAGG TTGAGTCTCA TCATAAGGCC TTTTGCAGCT
251 TGCATGTGTA GTGCCAGGAA AGAGTAGTCA TCCCCCAAAA CCAGACAGGA
301 ACTGACGAGA TGCAATCACT GTGTGGACTT TTTACCAGCT AGCTAGGGCA
351 CTACCATGAG CCACTGTCTA GCAGGGAGGC TTTGGGGATG GTGTGCCCCG
401 AATATCTCTC AGGGTAAGAG TTTACAGTAA GCAGCAAGCA GAGGGGTGTG
451 GGTGAGTGTG CAAGTATCTA ATTGGCTAGT TTTTGTGGCC TGTAACATAT
501 TGGTGGGTGT TGGGAGTCAT AAGCTAAATG TTTGCTTTCC TCTGCATTGG
551 TGGTCATTAG GGAGGGGGCA GATTATGAAC CTAGGTTGCA GATCTGTTGG
601 AGTAATAACA AGACACTGGT CTTGTTGGGG GTATAACCTA GAGACTCGAT
651 TTATGTTCAT GTTTGGTTTG GGATGGGTTT TATGTGAGTG TTTTCTTTTT
701 TGGGGAGGGG GTCGGTTAAC TTGGAAAGTA ATGCTAGGTA CTGTCCTGTT
751 CATTTCCCTG AGGTGAAAGT TAGGTCAGGT TTTCTAGAAT GGAGTCTGAA
801 GGTAAAACAT TTGGCCACTG GCATGCCCTA AAGTCTTTTT GTGTTCTTGT
851 CCCCTAGCAG ATCCAGCCCT ATCATCTCCT GGTGCCCAAC AGCTGCATCA
901 GGATGAAGCT CAGGTAGTGG TGGAGCTAAC TGCCAATGAC AAGCCCAGTC
Manual gene finding
Find, mark and count all ATGs
How many ATGs do you expect?
Start codon: ATG
p(ATG) = p(A) × p(T) × p(G) ≈ ¼ × ¼ × ¼ = 1/64 (in 950 bp: about 14.8 ATGs expected)
Manual gene finding
Observed in the 950 bp example: 17 ATGs (about 14.8 expected by chance)
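The expected and observed ATG counts can be checked with a few lines of code. The sketch below is only an illustration: it assumes independent, equally frequent bases (as the slide does), the function names are made up for this example, and the sequence is the 950 bp example copied from the slide.

```python
# The 950 bp example sequence from the slide; whitespace is stripped below.
EXAMPLE = """
CTCCCTTAGA AGACTCCAGC AAGTTATTTG AAGAGGTCTT TGGAGACATG
GTGAGTTCTC TTTCCTTCCC AGAAGGTAAG TCTCACTGTA AGGTCTTTAT
GTCTTGTGTG TCCCCCAGCA GCCTTGTCAT CTCCGGCTGC CCTAGACCTG
CATAAGGACA GATTGAGTGT GCTGGGATAG ACTTTTGTTG ACAAAGGGGC
TGCTCTGCCC TTCTAAGAGG TTGAGTCTCA TCATAAGGCC TTTTGCAGCT
TGCATGTGTA GTGCCAGGAA AGAGTAGTCA TCCCCCAAAA CCAGACAGGA
ACTGACGAGA TGCAATCACT GTGTGGACTT TTTACCAGCT AGCTAGGGCA
CTACCATGAG CCACTGTCTA GCAGGGAGGC TTTGGGGATG GTGTGCCCCG
AATATCTCTC AGGGTAAGAG TTTACAGTAA GCAGCAAGCA GAGGGGTGTG
GGTGAGTGTG CAAGTATCTA ATTGGCTAGT TTTTGTGGCC TGTAACATAT
TGGTGGGTGT TGGGAGTCAT AAGCTAAATG TTTGCTTTCC TCTGCATTGG
TGGTCATTAG GGAGGGGGCA GATTATGAAC CTAGGTTGCA GATCTGTTGG
AGTAATAACA AGACACTGGT CTTGTTGGGG GTATAACCTA GAGACTCGAT
TTATGTTCAT GTTTGGTTTG GGATGGGTTT TATGTGAGTG TTTTCTTTTT
TGGGGAGGGG GTCGGTTAAC TTGGAAAGTA ATGCTAGGTA CTGTCCTGTT
CATTTCCCTG AGGTGAAAGT TAGGTCAGGT TTTCTAGAAT GGAGTCTGAA
GGTAAAACAT TTGGCCACTG GCATGCCCTA AAGTCTTTTT GTGTTCTTGT
CCCCTAGCAG ATCCAGCCCT ATCATCTCCT GGTGCCCAAC AGCTGCATCA
GGATGAAGCT CAGGTAGTGG TGGAGCTAAC TGCCAATGAC AAGCCCAGTC
"""
SEQ = "".join(EXAMPLE.split())


def expected_count(length, motif_len=3, p_base=0.25):
    """Expected number of motif hits, assuming independent, uniform bases."""
    positions = length - motif_len + 1  # number of possible start positions
    return positions * p_base ** motif_len


def count_motif(seq, motif):
    """Count motif occurrences at every position, i.e. in all three frames."""
    width = len(motif)
    return sum(1 for i in range(len(seq) - width + 1) if seq[i:i + width] == motif)


expected = expected_count(len(SEQ))
observed = count_motif(SEQ, "ATG")
print(f"length {len(SEQ)} bp, expected about {expected:.1f}, observed {observed}")
```

The observed count is close to the chance expectation, which is the point of the exercise: an ATG on its own is a very weak signal for a gene start.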
Manual gene finding
Start codon: ATG
Stop codons: TAA, TAG, TGA
Mark codons until the first in-frame stop codon
Manual gene finding
ORF of 105 bp => a ‘protein’ of 35 aa
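The manual procedure (start at an ATG, read codon by codon, stop at the first in-frame stop codon) can be sketched as follows. This is a minimal illustration, not the method of any particular gene finder: the function names are hypothetical, only the first 200 bp of the example are included, the ORF length is counted without the stop codon (an assumption about how the slide counts), and translation uses the standard genetic code.

```python
# First 200 bp of the example sequence; whitespace is stripped below.
PREFIX = """
CTCCCTTAGA AGACTCCAGC AAGTTATTTG AAGAGGTCTT TGGAGACATG
GTGAGTTCTC TTTCCTTCCC AGAAGGTAAG TCTCACTGTA AGGTCTTTAT
GTCTTGTGTG TCCCCCAGCA GCCTTGTCAT CTCCGGCTGC CCTAGACCTG
CATAAGGACA GATTGAGTGT GCTGGGATAG ACTTTTGTTG ACAAAGGGGC
"""
SEQ = "".join(PREFIX.split())

STOPS = {"TAA", "TAG", "TGA"}

# Standard genetic code, built from the usual TCAG ordering of the codon table.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def orf_from(seq, start):
    """Coding sequence from `start` up to, not including, the first in-frame stop.

    Returns None when no in-frame stop codon is found before the sequence ends.
    """
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            return seq[start:i]
    return None


def translate(coding):
    """Translate an in-frame coding sequence into one-letter amino acids."""
    return "".join(CODON_TABLE[coding[i:i + 3]] for i in range(0, len(coding) - 2, 3))


start = SEQ.find("ATG")      # 0-based index of the first ATG
orf = orf_from(SEQ, start)
protein = translate(orf)
print(f"ATG at position {start + 1}: ORF of {len(orf)} bp, protein of {len(protein)} aa")
```

Starting from the first ATG of the example, this gives the 105 bp ORF and the 35 aa ‘protein’ quoted above. Whether that ORF is a real gene is a separate question that the ORF alone cannot answer.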
Take home messages 1/2
We have the book of life, but it is difficult to read
Amount of raw sequence is astronomical and growing
rRNA, tRNA genes, etc. are genes too
Many distinct gene structures, and far from every open reading frame is a gene
Gene Prediction
Prediction relies on integration of several gene features
Each gene feature carries a low signal
− E.g. ATG, donor/acceptor splice sites
− Combinatorial explosion
− Some are mutually exclusive (e.g. reading frame)
Gene Prediction
Codon frequency/bias
– Organism dependent
– Hexamer statistics
Transcriptional
– Promoters/enhancers
Exon/introns
– Length distributions
– ORFs
Splicing
– Donor/acceptor sites
– Branchpoints
Translational
– Start codon (ATG) context
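As an illustration of the codon frequency feature, the sketch below tabulates relative codon usage in an in-frame coding sequence. The function name and the toy sequence are hypothetical. A real gene finder would estimate such frequencies from many known genes of the organism in question, and hexamer statistics extend the same idea to windows of six bases.

```python
from collections import Counter


def codon_usage(coding):
    """Relative frequency of each codon in an in-frame coding sequence."""
    codons = [coding[i:i + 3] for i in range(0, len(coding) - 2, 3)]
    counts = Counter(codons)
    total = len(codons)
    return {codon: n / total for codon, n in counts.items()}


# Hypothetical toy sequence: ATG GCT GCT GCA GCT AAA AAG TAA
toy = "ATGGCTGCTGCAGCTAAAAAGTAA"
usage = codon_usage(toy)
for codon, freq in sorted(usage.items(), key=lambda item: -item[1]):
    print(codon, round(freq, 3))
```

In this toy case alanine is encoded by GCT three times and by GCA once. That kind of skew between synonymous codons is what is meant by codon bias, and it differs between organisms.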
Gene finders of the past...
GeneMark (Borodovsky & McIninch 1993)
Ecoparse (Krogh et al 1994)
GeneMark.hmm (Lukashin & Borodovsky 1998)
Glimmer (Salzberg et al 1998, Delcher et al 1999)
Orpheus (Frishman et al 1998)
Frame-by-frame (Shmatkov et al 1999)
GeneMark.hmm/S (Besemer et al 2001)
Since then...
GENEMARK.2, Ecgene
AUGUSTUS.7, EXONHUNTER.3, DOGFISH-CE.4
Genscan, GENEZILLA.2
Acembly, TWINSCAN-MARS.4
FGENESH++.1, SAGA.4, Geneid
SGP, ACEVIEW.3, AUGUSTUS.2
SPIDA.7, AUGUSTUS.4
N-SCAN.4, N-SCAN.5, Twinscan
AUGUSTUS.1, AUGUSTUS.3, EXOGEAN.3
PAIRAGON+N-SCAN.1, PAIRAGON+N-SCAN.3
JIGSAW.1, ENSEMBL.3
DOGFISH-CE.7
Gene Finders are often organism specific
Gene Prediction
Ab initio Gene Finders
”Integrated” methods
− Predict genes in context (Hidden Markov Model based)
− ”Grammar” of genes: certain elements in a specific order are required
− HMMgene www.cbs.dtu.dk/services/HMMgene/
− GenScan http://genes.mit.edu/GENSCAN.html
”Isolated” methods
− Predict individual features (Neural Network based), e.g. splice sites, coding regions
− NetGene2 www.cbs.dtu.dk/services/NetGene2/
− GRAIL http://compbio.ornl.gov/Grail-1.3/
Artificial Neural Network
Figure: schematic of an artificial neural network with pyrimidine (Pyr) inputs, numeric weights and a true/false (T/F) output
Hidden Markov Model
Gene Prediction
”Isolated” methods (e.g. NN):
HAPPYEUGENEAWASGUYFINDER
”Integrated” methods (e.g. HMM):
EUGENEFINDERWASAHAPPYGUY
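The difference between judging each element in isolation and reading it in context can be illustrated with a toy two-state Hidden Markov Model decoded with the Viterbi algorithm. Everything in this sketch is an assumption made for illustration: the two states, the probabilities (chosen by hand, not trained) and the function name. Real HMM gene finders such as those named above use far richer models.

```python
import math

STATES = ("noncoding", "coding")

# Hypothetical parameters chosen for illustration only, not trained values.
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},  # AT-rich
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},     # GC-rich
}


def viterbi(seq):
    """Most probable state path for `seq` under the toy model (log space)."""
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best previous state for arriving in state s at this position.
            best_prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            pointers[s] = best_prev
            new_scores[s] = (scores[best_prev]
                             + math.log(TRANS[best_prev][s])
                             + math.log(EMIT[s][base]))
        backpointers.append(pointers)
        scores = new_scores
    # Trace back from the best final state to recover the whole path.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


segmented = viterbi("ATATATATAT" + "GCGCGCGCGC" + "ATATATATAT")
isolated_g = viterbi("ATATAGATATA")
print("".join("C" if s == "coding" else "n" for s in segmented))
print("".join("C" if s == "coding" else "n" for s in isolated_g))
```

With these toy numbers the GC-rich block is labelled as one contiguous "coding" stretch, while a single G in an AT-rich region is not, because switching state twice costs more than one unlikely emission. That is the sense in which an integrated method reads each position in the context of its neighbours.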
EasyGene –Bacterial Gene Finder
Courtesy of T.S. Larsen & A. Krogh 2004
ORF distributions
Performance landscape: E. coli
Performance landscape, short ORFs: E. coli
Annotation remains a problem...
Courtesy of M. Skovgaard et al 2001
Annotation remains a problem...
EasyGene as of 2009
Take home messages 2/2
Genes may be predicted by computer programs
Most gene prediction programs only predict protein-coding genes
’Unusual’ genes are difficult to predict:
− Alternative/multiple start codons
− Non-native genes
− Lowly expressed genes
− Introns, alternatively spliced genes
HMM-based gene prediction programs are suitable for “Gene Grammar”
Prediction methods are not perfect!
No single method is always best
Take home message...
Use gene finders with caution!
...and Coffee Break!