Bioinformatics Splicing and gene prediction in eukaryotes Critical splice signals Coding statistics: DNA differences between exons and introns Discriminant

Bioinformatics

• Splicing and gene prediction in eukaryotes

• Critical splice signals

• Coding statistics: DNA differences between exons and introns

• Discriminant function and combined approach

Lecture 12

• Gene prediction of any type, and ab initio prediction in particular, is greatly complicated in eukaryotes by splicing.

• The difficult task is to predict the positions of exon-intron boundaries in eukaryotic genes that have multiple introns, and to predict the absence of introns in intronless genes.

• Eukaryotic genomes differ significantly in a number of ways, which calls for species-specific prediction programs.

• The major differences include: a) variation in GC content (e.g. mammalian genomes show large regional variation in GC content, referred to as isochores), b) variation in codon usage frequencies.

• All these factors, if not taken into account, reduce the quality of prediction.

Splicing and gene prediction in eukaryotes

[Figure: AT/GC ratios in coding regions in some eukaryotes. Series AT% and CG% (axis from 0 to 0.7) for A. thaliana, C. elegans, D. melanogaster and H. sapiens.]

[Figure: The number of correct and incorrect (in parentheses) whole-gene model predictions shared among the three programs GlimmerM (GA), GeneMark.hmm (GM) and Genscan+ (GS), from a test set of 1783 genes.]

An incorrect gene refers to a case in which all coding exons in the gene are in perfect agreement among the gene finders but not with the true gene.

mRNA splicing

Critical splice signals

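[Figure: EXON 1, INTRON, EXON 2, with the donor site (5’ splice junction), the branch site and the acceptor site (3’ splice junction) marked. Consensus nucleotides as labelled in the figure: G U A/G A G U ... U U A/G A U/C U/C ... A G ... G/A. Conservation labels: (100%), (62–68%), (100%).]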

[Figure: Frequencies of nucleotides at the ends of exons in C. elegans, D. melanogaster and H. sapiens. Two panels per species: the first 10 nucleotides of exons (5’ end) and the last 10 nucleotides of exons (3’ end).]

• At least three critical signals/motifs (the donor, acceptor and branch sites) must be recognised in order to predict the position of an intron and both of its splice junctions.

• Significant sequence variation in these sites between species and between genes reduces the quality of predictions.

• The best average error rate (false positives + false negatives) for either donor or acceptor site prediction is about 5%. This may be acceptable if the search is restricted to a short region. However, searching a large region leads to an unacceptable number of false positives, because for every true site there are hundreds of pseudo-sites.

• For example, if a large region has 40 true sites and 4000 pseudo-sites, one true site would be missed (2.5% false negatives) and 100 pseudo-sites would be predicted as true sites (2.5% false positives)!

Recognition of variable splice sites and gene prediction
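The arithmetic in the 40 true sites / 4000 pseudo-sites example can be checked with a few lines of Python. This is only an illustrative sketch: the function name `expected_errors` and the choice to report precision (the fraction of predicted sites that are real) are assumptions added here, not part of the lecture.

```python
def expected_errors(true_sites, pseudo_sites, fn_rate, fp_rate):
    """Expected error counts when scanning a region for splice sites.

    fn_rate: fraction of true sites that are missed.
    fp_rate: fraction of pseudo-sites wrongly called as true sites.
    """
    missed = true_sites * fn_rate          # false negatives
    false_hits = pseudo_sites * fp_rate    # false positives
    true_hits = true_sites - missed        # correctly predicted sites
    precision = true_hits / (true_hits + false_hits)
    return missed, false_hits, precision


# Numbers from the slide: 40 true sites, 4000 pseudo-sites, 2.5% each way.
missed, false_hits, precision = expected_errors(40, 4000, 0.025, 0.025)
print(missed, false_hits, round(precision, 2))
```

With these numbers only about 28% of the predicted sites are real, even though each individual error rate looks small.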

• Since adjacent donor and acceptor sites are not independent, this correlation can be exploited to eliminate further false positives.

• For short introns, which occur mostly in lower eukaryotes, an intron is recognised through the interaction of splicing factors bound across the intron ends (hence the 5’ss – 3’ss correlation).

• In vertebrates, exons are much shorter than introns, so the key is recognition of exons through the interaction of splicing factors bound across the exon ends (hence the 3’ss – 5’ss correlation).

• Therefore mammalian functional splice sites can only be identified effectively when both are found simultaneously, through exon recognition.

• There are also several additional signals/motifs essential for correct splicing, which are recognised by certain proteins involved in splicing. Identifying such sites and using them in prediction programs should increase the quality of eukaryotic gene predictions.

Recognition of variable splice sites and gene prediction

• Besides splicing signals and ORFs, several additional characteristics may help to discriminate between exons and introns.

• These features include DNA periodicity in exons, codon preferences, hexamer usage, the codon prototype, and compositional bias between codon positions.

Coding statistics: DNA differences between exons and introns

[Figure: Frequency of nucleotide A in phase 0 H. sapiens exons aligned at the 5' end. x-axis: Position (1 to 49); y-axis: Frequency (0 to 0.4).]

DNA periodicity in exons

DNA periodicity in exons, 3

[Figure: Curve of best fit in H. sapiens phase 0 exons, dinucleotide 'AG'. x-axis: Nucleotide position (1 to 96); y-axis: Frequency (0 to 0.16).]

Periodic structure in DNA sequences.

The absolute frequency of the A...A pair, with 0 to 5 nucleotides between the two A's, in the first 200 base pairs of the sequences in a set of 1761 human exons and 1753 human introns. A clear period-3 pattern appears in coding regions and is absent in non-coding regions. A similar periodic pattern appears in coding regions for the other fifteen possible pairs of nucleotides.
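The pair counts described in this caption could be gathered roughly as follows. This is a minimal sketch, not the procedure used to produce the figure; the function name `pair_spacing_counts`, its parameters and the toy sequence are all hypothetical.

```python
from collections import Counter


def pair_spacing_counts(seqs, first="A", second="A", max_gap=5, window=200):
    """Count occurrences of `first` ... `second` with 0..max_gap nucleotides
    in between, looking only at the first `window` bases of each sequence."""
    counts = Counter()
    for seq in seqs:
        s = seq.upper()[:window]
        for pos, base in enumerate(s):
            if base != first:
                continue
            for gap in range(max_gap + 1):
                j = pos + gap + 1  # position of the partner nucleotide
                if j < len(s) and s[j] == second:
                    counts[gap] += 1
    # Return a list indexed by the number of intervening nucleotides.
    return [counts[g] for g in range(max_gap + 1)]


# Toy sequence (not real data) with an A at every third position.
print(pair_spacing_counts(["ACGACGACGACG"]))  # [0, 0, 3, 0, 0, 2]
```

In the toy sequence the counts peak at gaps of 2 and 5 intervening nucleotides, i.e. at distances of 3 and 6, which is the kind of period-3 signal the caption describes for coding regions.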

• A coding statistic was introduced that measures solely the uneven usage of synonymous codons.

• From a codon usage table, we can compute the relative probability with which each synonymous codon is used to code for a given amino acid.

• For instance, GAG and GAA, the two codons for glutamic acid, are used in coding regions with probabilities 0.03882 and 0.02751, which gives relative probabilities of 0.59 and 0.41, respectively.

Codon Preference
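The glutamic acid example can be reproduced by normalising the codon probabilities within one amino acid. The sketch below assumes only the two figures quoted on the slide; the function name `relative_synonymous_probabilities` is hypothetical.

```python
def relative_synonymous_probabilities(codon_probs):
    """Normalise the usage probabilities of the codons for ONE amino acid
    so that they sum to 1 within that amino acid."""
    total = sum(codon_probs.values())
    return {codon: p / total for codon, p in codon_probs.items()}


# Usage probabilities for the two glutamic acid codons, from the slide.
glu = {"GAG": 0.03882, "GAA": 0.02751}
rel = relative_synonymous_probabilities(glu)
print({codon: round(p, 2) for codon, p in rel.items()})
```

Rounded to two decimals this gives 0.59 for GAG and 0.41 for GAA, matching the values above.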

• Bias in the distribution of oligonucleotides longer than codons can also be used to discriminate between coding and non-coding regions. Bias in the usage of hexamers may be the most discriminating (probably because of dependence between adjacent amino acids in proteins). Bias in hexamer usage can be computed in exactly the same way as bias in codon usage, since the background codon frequencies are known and the frequency of each of the 64^2 = 4096 hexamers can be determined.

• There are several ways to construct a frame-specific hexamer score, for example the log-odds score L_E(w,i) = log[f_E(w,i) / f_I(w)] and the preference score P_E(w,i) = f_E(w,i) / [f_E(w,i) + f_I(w)], where f_E(w,i) is the frequency of hexamer w in frame i, calculated from known exon training data, and f_I(w) is the frequency of w in known introns.

Hexamer usage correlation
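The two frame-specific hexamer scores can be sketched in Python as below. The training sequences are toy strings, not real exons or introns, and the small pseudocount that avoids division by zero for unseen hexamers is an assumption added here; the function names are hypothetical.

```python
import math
from collections import Counter


def hexamer_freqs(seqs, frame=None):
    """Relative hexamer frequencies. With frame=i only hexamers starting at
    positions p with p % 3 == i are counted; frame=None counts all positions."""
    counts = Counter()
    for seq in seqs:
        s = seq.upper()
        for p in range(len(s) - 5):
            if frame is None or p % 3 == frame:
                counts[s[p:p + 6]] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def log_odds(f_exon, f_intron, w, pseudo=1e-6):
    """L_E(w,i) = log[f_E(w,i) / f_I(w)], with an assumed pseudocount."""
    return math.log((f_exon.get(w, 0.0) + pseudo) / (f_intron.get(w, 0.0) + pseudo))


def preference(f_exon, f_intron, w, pseudo=1e-6):
    """P_E(w,i) = f_E(w,i) / [f_E(w,i) + f_I(w)], with an assumed pseudocount."""
    fe = f_exon.get(w, 0.0) + pseudo
    fi = f_intron.get(w, 0.0) + pseudo
    return fe / (fe + fi)


# Toy training data (illustrative only).
f_E = hexamer_freqs(["ATGGCCATGGCC"], frame=0)   # exon hexamers in frame 0
f_I = hexamer_freqs(["ATATATATATAT"])            # intron hexamers, all frames
print(log_odds(f_E, f_I, "ATGGCC"), preference(f_E, f_I, "ATGGCC"))
```

The two scores carry the same information: the preference score is the logistic transform of the log-odds score, P = 1 / (1 + exp(-L)), so it is bounded between 0 and 1 while the log-odds score is additive along a sequence.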

Codon position 1

      A     C     G     T
A    .36   .27   .35   .18
C    .21   .23   .24   .27
G    .19   .14   .23   .23
T    .24   .35   .19   .31

Codon position 2

      A     C     G     T
A    .16   .19   .15   .07
C    .28   .44   .41   .33
G    .40   .12   .27   .45
T    .16   .25   .17   .16

Codon position 3

      A     C     G     T
A    .22   .33   .24   .13
C    .21   .29   .27   .21
G    .44   .15   .37   .53
T    .13   .22   .12   .13

Probabilities of the four nucleotides at the different codon positions, conditioned on the nucleotide in the preceding codon position. Each column sums to approximately 1, so columns correspond to the preceding nucleotide and rows to the nucleotide at the indicated position. Estimated from a set of human exon and intron sequences.

• A measure can be introduced that shows how similar the observed distribution of base frequencies at the three codon positions in a sequence (exon or intron) is to the prototypical distribution (see the table).

• Dependencies between nucleotide positions in coding regions can be described explicitly by means of Markov models.

• Average Mutual Information measures the correlation between nucleotides separated by k positions; it is computed from the probability of finding the pair of nucleotides i and j at a distance of k nucleotides in the sequence.

Codon Prototype, Markov model measure and Average Mutual Information
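As an illustration of the last measure, the sketch below computes mutual information at distance k using the standard definition I(k) = sum over i,j of p_ij(k) * log2[p_ij(k) / (p_i * p_j)]. The slide does not give the exact formula or how it is averaged over k, so this form, the estimation of p_i and p_j from the same pair counts, and the function name are assumptions.

```python
import math
from collections import Counter


def mutual_information(seq, k):
    """Mutual information (in bits) between nucleotides k positions apart.

    p_ij(k) is estimated from all pairs (seq[p], seq[p + k]); the marginals
    p_i and p_j are taken from the same pair counts (an assumption)."""
    s = seq.upper()
    pairs = Counter((s[p], s[p + k]) for p in range(len(s) - k))
    n = sum(pairs.values())
    if n == 0:
        return 0.0
    left, right = Counter(), Counter()
    for (a, b), c in pairs.items():
        left[a] += c
        right[b] += c
    info = 0.0
    for (a, b), c in pairs.items():
        p_ab = c / n
        info += p_ab * math.log2(p_ab / ((left[a] / n) * (right[b] / n)))
    return info


# Toy periodic sequence: the base 4 positions away is fully determined.
print(mutual_information("ACGT" * 10, 4))
```

A profile of this quantity over k = 1, 2, 3, ... is where a period-3 signal would show up in coding sequence; averaging over a range of k gives a single number per sequence.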

Nucleotide    Codon position
               1      2      3
A             0.27   0.31   0.18
C             0.24   0.24   0.31
G             0.32   0.20   0.29
T             0.17   0.26   0.22

Statistic                  Exon sequence                   Intron sequence
                           Coding   Non-coding frames      Frame 1   Frame 2   Frame 3
                           frame
Codon Usage                 24.06   -16.13    -3.16         -14.36    -23.74    -19.67
Hexamer Usage               27.62   -11.64    -6.51         -20.90    -27.56    -22.07
(row label missing)         39.98   -14.58    -8.46         -26.73    -27.81    -25.87
Codon Preference            15.97    -1.32     7.24          -7.96    -12.70    -14.93
Amino Acid Usage             8.17   -14.87   -10.17          -6.15    -10.69     -4.57
Codon Prototype              9.87   -11.23   -10.30         -11.45    -17.44    -14.49
Markov Model, order 1       29.92    -2.69    -3.31         -35.44    -42.40    -41.73
Markov Model, order 2       34.73   -18.26    -7.77         -29.61    -41.76    -40.05
Markov Model, order 5       72.69   -21.38    13.56         -37.63    -30.99    -36.40

Statistic                     Exon sequence   Intron sequence
Position Asymmetry            0.0957          0.0211
Periodic Asymmetry Index      1.159           1.009
Average Mutual Information    0.00681         0.000344
Fourier Spectrum              2.278           0.892

Values of different coding statistics in the 223 bp long 2nd coding exon of the human -globin gene, and in a 223 bp long sequence from the middle of the 2nd intron of the same gene.

• A number of different sequence pattern features are used to discriminate coding (exon) from non-coding sequence. Linear and quadratic discriminant analyses are shown, the latter being more efficient. EPS is the 6-mer exon preference score and 3’SS (3’ splice site) is an example

Pattern discriminant analysis

[Figure: linear and quadratic discriminant analysis plot; axis label: EPS.]

• The next generation of computational methods for constructing gene models is currently being developed. These methods take as input, and combine, a genomic sequence and the locations of gene predictions from ab initio gene finders, protein sequence alignments, expressed sequence tag (EST) and cDNA alignments, splice site predictions, and other evidence.

• An example of such a program is COMBINER, which uses rigorous statistical assessment to evaluate candidate gene models and estimates probabilities using so-called decision trees.

COMBINER: computational gene prediction using multiple sources of evidence
