Gene Prediction

Increase in the recent 12 months

Motivation

• Genome sequencing projects are progressing very fast: human, mouse, rat, rice, Anopheles…

• The identification of genes is the next important step in the analysis of genomes.

• How can we identify genes in the sequence?

http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?chr=rat.inf

Gene Identification

• Identify genes coding for known proteins.

• However, only a few proteins are known.

• Identify genes based on homology with other genomes.

• Identifying genes based on gene characteristics.


Gene Characteristics

• Differences between prokaryotic and eukaryotic gene characteristics:

• Prokaryotes - the genome is more compact. Several genes may reside on the same mRNA in different reading frames.

• Eukaryotes - a gene may contain introns. The human genome: average gene ~ 27,800 b; exon ~ 100 b; intron 100-30,000 b.

• Promoter regions are different: in prokaryotes the signals are more conserved.

• Differences between different types of genes in the same genome. Each type has its own characteristics.


Prediction Approaches

• The problem of gene prediction is very much open even in well studied genomes

• The number of genes in yeast keeps changing.

• The identification of promoter regions in E. coli is considered a great challenge of bioinformatics.

• Next we consider prediction of the following:

• Protein coding genes (ORFs).

• Functional RNA coding genes.


ORF Finding

Open Reading Frames – sequences that code for proteins.

• How can ORFs be detected?

All reading frames are checked. Search for initiation and termination codons within a sequence.
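The scan described above can be sketched in a few lines. This is a minimal illustration, not the method of any particular tool: it assumes the standard genetic code (ATG as the only start codon; TAA, TAG and TGA as stops), and the function names and the `min_codons` parameter are made up for the example.

```python
# Assumption: standard genetic code. Some organisms use other start/stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"

def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq))

def find_orfs(seq, min_codons=2):
    """Scan all six reading frames for ATG...stop stretches.

    Returns (strand, frame, start, end) tuples; coordinates are 0-based,
    end-exclusive, and refer to the strand that was scanned.
    min_codons counts the start and stop codons as well.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START_CODON:
                    start = i                      # first in-frame ATG opens an ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None                   # look for the next ORF in this frame
    return orfs

print(find_orfs("CCATGAAATAGCC"))  # [('+', 2, 2, 11)]
```

Real ORF finders usually add a much larger minimum length, since short ATG...stop stretches occur by chance in any sequence.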

Are these codons totally conserved?


http://www.ncbi.nlm.nih.gov/gorf/gorf.html

The genetic code is not totally conserved

Prediction of Protein-Coding Genes

• Three types of post-transcriptional events make prediction difficult:

• variations in the genetic code; alternative splicing; RNA editing.

• DNA is not a random choice of possible codons for each amino acid. It is an ordered list of codons that reflects evolutionary origin and constraints related to gene expression.

• Each species has its own coding preferences – codon usage.
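A codon usage table for a species can be estimated by counting codons in known coding sequences. The sketch below assumes the input sequences are already in frame (they start at the first base of a codon); `codon_usage` is an illustrative name, not part of any library.

```python
from collections import Counter

def codon_usage(coding_seqs):
    """Relative frequency of each codon across in-frame coding sequences."""
    counts = Counter()
    for seq in coding_seqs:
        usable = len(seq) - len(seq) % 3       # ignore a trailing partial codon
        for i in range(0, usable, 3):
            counts[seq[i:i + 3]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: n / total for codon, n in counts.items()}

usage = codon_usage(["ATGGCTGCTTAA"])
print(usage["GCT"])  # 0.5
```

A table built this way from one species' known genes can then be compared with the codon frequencies of a candidate ORF.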


The Genetic Code

• The genetic code - each amino acid is coded by 3 nucleotides, called a codon.

• Code redundancy - most amino acids are coded by several codons.

- 64 triplets code for 20 amino acids & 3 stop codons.

General Codon Preferences

• Codon usage differs between highly and weakly expressed genes.

• In E. coli, genes were divided into 3 groups based on their codon usage:
- regular genes (70%)
- highly expressed genes (15%)
- horizontally transferred genes (15%)

• There are strong preferences in ORFs for specific codon pairs and for specific codons near terminators.

• The base in the third position in each codon tends to repeat itself in the same ORF.
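One simple way to quantify the third-position tendency is the fraction of codons whose third base equals the most common third base in that ORF. This particular statistic is an assumption chosen for illustration; the slides do not specify how the repeat is measured.

```python
from collections import Counter

def third_base_bias(orf):
    """Fraction of codons sharing the most common third-position base.

    Assumes `orf` is in frame. Values near 0.25 suggest no preference;
    values near 1.0 mean a single base dominates the third position.
    """
    thirds = [orf[i + 2] for i in range(0, len(orf) - 2, 3)]
    if not thirds:
        return 0.0
    most_common_count = Counter(thirds).most_common(1)[0][1]
    return most_common_count / len(thirds)

print(third_base_bias("ATGGCGCTGTAG"))  # 1.0 (third bases: G, G, G, G)
print(third_base_bias("ATGGCTGCATAA"))  # 0.5 (third bases: G, T, A, A)
```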


Signal Based Identification

• Prokaryotes - signals such as the RBS (Ribosome Binding Site, Shine-Dalgarno) are conserved. Located at about -15, upstream of the AUG (in B. subtilis the RBS is AGGAGG).

• Eukaryotes

• Transcription signals: TATA (~ -30 from the TSS), cap signal, poly-adenylation site. Any signal may be missing.

• Translation signals: Kozak signal (immediately upstream of the ATG), termination codon.

• Splicing signals - the spliceosome recognizes donor and acceptor sites (introns usually start with GT and end with AG) and the branch point inside the intron.
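For the prokaryotic RBS signal, a candidate start codon can be checked by looking for the best match to the motif in a window upstream of it. The sketch below is only an illustration: the 20-base window and the mismatch counting are assumptions, and `best_rbs_match` is a hypothetical helper.

```python
def best_rbs_match(seq, start, motif="AGGAGG", window=20):
    """Find the best match to `motif` in the `window` bases upstream of `start`.

    `start` is the 0-based index of the first base of the start codon.
    Returns (mismatches, position) for the best-scoring site, or None if
    the upstream region is shorter than the motif.
    """
    lo = max(0, start - window)
    best = None
    for i in range(lo, start - len(motif) + 1):
        site = seq[i:i + len(motif)]
        mismatches = sum(a != b for a, b in zip(site, motif))
        if best is None or mismatches < best[0]:
            best = (mismatches, i)
    return best

seq = "TTT" + "AGGAGG" + "T" * 8 + "ATGAAA"
print(best_rbs_match(seq, seq.index("ATG")))  # (0, 3)
```

A start codon with a close match at a plausible distance upstream is a somewhat more convincing candidate than one without.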


Prediction Reliability Tests

Where no experimental verification is available, reliability of prediction can be measured by:

1. Third base repeat in an ORF - does not require any prior knowledge

2. Codon usage - requires prior knowledge per species.

3. Predicted-protein sequence comparison - if homologs are found, the prediction is more reliable. Homologs can be searched in protein databases, EST databases, cDNA databases, etc. The quality of the results depends on the quality of the database (ESTs are error prone).

Computational Approaches to Prediction

• Gene prediction is carried out by various computational methods including decision trees, neural nets, Markov models and Hidden Markov models (HMM).

• A model is trained on known genes, and then applied to genomic sequences.

• Each genome defines its own model.

Markov Models – Probabilistic Approach

Markov model - can be modeled by states and the probability of transition from one state to the next.

Markov chain - progresses in steps; each step corresponds to a move between states. The probability of being at state X in step i depends only on the state reached at step i-1.

It has been found that ORFs have a reading-frame specific hexamer (6-mer) composition.
=> The probability of the 6th base can be computed using the previous 5.
=> The probability that a sequence is an ORF in a specific reading frame can be computed.
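The hexamer idea can be sketched as a frame-specific 5th order Markov model: count which base follows each 5-base context at each codon position in known coding sequences, then score a new sequence in each frame. This is a minimal illustration under stated assumptions (training sequences start at codon position 0; add-one pseudocounts handle unseen contexts); the function names are hypothetical and this is not the implementation of any published tool.

```python
import math
from collections import defaultdict

BASES = "ACGT"

def train_periodic_markov(coding_seqs, order=5):
    """Count next-base occurrences per (codon position, previous `order` bases).

    Assumes each training sequence starts at codon position 0.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for seq in coding_seqs:
        for i in range(order, len(seq)):
            counts[(i % 3, seq[i - order:i])][seq[i]] += 1
    return counts

def log_prob(seq, counts, frame=0, order=5, pseudo=1.0):
    """Log-probability of seq[order:] given that seq[0] sits at codon position `frame`."""
    total = 0.0
    for i in range(order, len(seq)):
        context = counts.get(((i + frame) % 3, seq[i - order:i]), {})
        numerator = context.get(seq[i], 0) + pseudo
        denominator = sum(context.values()) + pseudo * len(BASES)
        total += math.log(numerator / denominator)
    return total

known = "ATGGCTGCTGCTGCTGCTTAA"
model = train_periodic_markov([known])
# The training sequence scores higher in its true frame than in a shifted one.
print(log_prob(known, model, frame=0) > log_prob(known, model, frame=1))  # True
```

Comparing the scores of the three frames (and of a non-coding background model) gives a way to decide whether, and in which frame, a stretch looks coding.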

Finding the genes in genomic DNA. Chris Burge and Samuel Karlin. Curr Opin Struct Biol. 1998 Jun;8(3):346-54. Review.

Markov Models – Figure Legend

Circles represent DNA bases or states. Numbers indicate codon positions. Arrows indicate dependency.

a) Three periodic 5th order Markov models. The next base is generated conditionally on the previous 5 bases and on the codon position.

b) Homogeneous 5th order Markov model.

c) Hidden Markov model. Upper circles represent hidden states, corresponding to whether the position is coding or non-coding; upper arrows indicate that the states are generated according to a first order Markov model. Lower circles correspond to DNA bases; lower arrows indicate that each base is generated conditionally on the identity of the hidden state.

d) As c) with variable lengths of the hidden states.


Prediction of Complete Gene Structures in Human Genomic DNA, Chris Burge and Samuel Karlin

J Mol Biol. 1997 Apr 25;268(1):78-94.

Gene Prediction Tools

• Glimmer at TIGR (The Institute for Genomic Research).

• GeneMark at Georgia Tech.

• Grail at Oak Ridge National Laboratory.

• Genefinder at Baylor College of Medicine.

• Genscan at MIT.

Prediction tools are compared using two criteria:

• Sensitivity - % of true predicted genes out of the true genes in the genome.

• Specificity - % of true predicted genes out of the total number of predicted genes.

Both need to be high (correlation tests ~ 0.7-0.9).
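The two criteria can be computed directly from a set of predicted genes and a set of true genes. The sketch below follows the definitions given above (note that "specificity" as defined here is what is often called precision elsewhere); gene identifiers are made up for the example.

```python
def sensitivity_specificity(predicted, true):
    """Return (sensitivity, specificity) as defined in the slide.

    sensitivity = correctly predicted genes / all true genes
    specificity = correctly predicted genes / all predicted genes
    """
    predicted, true = set(predicted), set(true)
    correct = len(predicted & true)
    sensitivity = correct / len(true) if true else 0.0
    specificity = correct / len(predicted) if predicted else 0.0
    return sensitivity, specificity

predicted = {"g1", "g2", "g3", "x"}         # hypothetical predictions
true = {"g1", "g2", "g3", "g4", "g5"}       # hypothetical annotated genes
print(sensitivity_specificity(predicted, true))  # (0.6, 0.75)
```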


The General Scheme

1. Obtain new genomic DNA sequence.

2. a) Translate in all 6 reading frames and compare to protein databases.

b) Perform a database similarity search against the expressed sequence tag (EST) database of the same organism, or cDNA sequences if available.

3. Use gene prediction program to locate genes.

4. Analyze regulatory sequences in the gene (signals).

Can help characterize putative genes.

Functional RNA Genes

• RNA genes are transcribed but are not translated, so no codon preference exists. How can rRNA, tRNA and small RNA genes be predicted?

• Promoter regions can be characterized, but remain a big challenge.

• RNA secondary structure is important. It can be predicted using RNA structure prediction tools (MFOLD tool).

Characteristics of E. coli promoters

[Figure: -35 hexamer (TTGACA), spacer of 15 to 19 bases, -10 hexamer (TATAAT), interval of 5 to 9 bases, transcription start site.]

Actual promoters exhibit large sequence variation. Upon predicting promoters, known ones are missed and many false ones emerge.
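A naive consensus scan for the -35 and -10 hexamers with a 15 to 19 base spacer can be sketched as follows. The mismatch threshold and function names are assumptions for illustration; as noted above, real promoters vary widely, so a scan like this both misses known promoters and reports false ones.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def find_promoters(seq, max_mismatch=2, spacer=range(15, 20)):
    """Return (pos35, pos10, total_mismatches) for consensus-like promoter sites.

    pos35 and pos10 are 0-based start indices of the -35 and -10 hexamers.
    """
    hits = []
    for i in range(len(seq) - 5):
        m35 = hamming(seq[i:i + 6], "TTGACA")
        if m35 > max_mismatch:
            continue
        for gap in spacer:
            j = i + 6 + gap
            if j + 6 > len(seq):
                break                      # -10 hexamer would run off the sequence
            m10 = hamming(seq[j:j + 6], "TATAAT")
            if m10 <= max_mismatch:
                hits.append((i, j, m35 + m10))
    return hits

seq = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GGGGGGG"
print(find_promoters(seq, max_mismatch=0))  # [(2, 25, 0)]
```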

Characteristics of rho-independent terminators

• Stem of 5-10 base pairs, at least 60% GC

• Loop of 3-8 bases

• At least 4 U residues following the stem-loop (3' side)

• Stem-loop structure free energy below -7 kcal/mole
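The stem, loop and U-run criteria can be checked with a simple search on the DNA sequence (T in place of U). This sketch makes strong simplifying assumptions: the stem must be perfectly complementary, the T run must start immediately after the stem, and the free energy criterion is not evaluated at all (that would need an RNA folding tool). The function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def revcomp(s):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return "".join(COMPLEMENT[b] for b in reversed(s))

def find_terminators(seq, min_t=4):
    """Return (start, end, stem_len, loop_len) for hairpins followed by a T run.

    Criteria used: stem of 5-10 bp with >= 60% GC, loop of 3-8 bases,
    at least `min_t` T residues directly after the hairpin.
    Overlapping alternative foldings of the same region are all reported.
    """
    hits = []
    for start in range(len(seq)):
        for stem in range(5, 11):
            for loop in range(3, 9):
                end = start + 2 * stem + loop          # hairpin end, exclusive
                if end + min_t > len(seq):
                    continue
                left = seq[start:start + stem]
                right = seq[end - stem:end]
                gc_fraction = sum(b in "GC" for b in left) / stem
                if (right == revcomp(left) and gc_fraction >= 0.6
                        and seq[end:end + min_t] == "T" * min_t):
                    hits.append((start, end, stem, loop))
    return hits

seq = "AA" + "GCCGCC" + "GAAA" + "GGCGGC" + "TTTTTT"
print((2, 18, 6, 4) in find_terminators(seq))  # True
```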

Identifying Small RNAs

• The role of small RNAs (sRNA) is a hot topic in current biology. sRNA genes fill many regulatory functions, e.g., regulating translation of mRNA (antisense). They are hard to find experimentally.

• A group of researchers from the Hebrew University and from Sweden combined bioinformatic predictions with experimental verification. Argaman et al., Current Biology 2001.

• Based on 10 known sRNAs in E. coli, they predicted 24 sRNAs, of which 14 were experimentally verified.

• Three subsequent studies identified ~ 20 more sRNA genes in E. coli.

Predictive scheme

1. Locate "empty" regions in the E. coli genome (regions between annotated ORFs).

2. Search for promoter DNA sequences recognized by sigma 70 (σ70) of RNA polymerase.

3. Identify rho-independent terminators.

4. Extract sequences in which the distance between the promoter and the terminator is 50 to 400 bases.

5. Check sequences for conservation in other bacteria.

Argaman et al., Current Biology 2001.
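The extraction step, pairing predicted promoters and terminators that fall in the same "empty" region and lie 50 to 400 bases apart, might look like the sketch below. It is only an illustration and not the authors' code: it handles one strand, measures the distance from the predicted transcription start (+1) to the terminator end, and all names and coordinates are hypothetical.

```python
def candidate_srnas(promoters, terminators, empty_regions, min_len=50, max_len=400):
    """Pair promoters with downstream terminators inside the same empty region.

    promoters:     predicted +1 positions (forward strand only)
    terminators:   predicted terminator end positions
    empty_regions: (start, end) intervals with no annotated ORF
    Returns (promoter, terminator) pairs whose distance is min_len..max_len bases.
    """
    candidates = []
    for region_start, region_end in empty_regions:
        for p in promoters:
            for t in terminators:
                in_region = region_start <= p < t <= region_end
                if in_region and min_len <= t - p <= max_len:
                    candidates.append((p, t))
    return candidates

print(candidate_srnas([100, 900], [250, 1000, 2000], [(50, 600), (850, 1500)]))
# [(100, 250), (900, 1000)]
```

Candidates that survive this filter would still need the conservation check and experimental verification described above.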
