Translation initiation start prediction in human cDNAs with high accuracy A. G. Hatzigeorgiou Paper Presentation Introduction to Bioinformatics Anaxagoras Fotopoulos | Marina Adamou - Tzani 21/01/201

TIS prediction in human cDNAs with high accuracy


DESCRIPTION

Correct identification of the Translation Initiation Start (TIS) in cDNA is an important issue for genome annotation. The aim of this work is to improve upon current methods and provide a performance-guaranteed prediction.


Page 1: TIS prediction in human cDNAs with high accuracy

Translation initiation start prediction in human cDNAs with high accuracy

A. G. Hatzigeorgiou

Paper Presentation

Introduction to Bioinformatics

Anaxagoras Fotopoulos | Marina Adamou-Tzani

21/01/2014

Page 2: TIS prediction in human cDNAs with high accuracy


Introduction

• The primary objective of this research is to contribute to defining the coding part of a gene.

• The search is performed in cDNA sequences.

• Coding regions are surrounded by untranslated regions (UTRs).

• The focus is on finding the Translation Initiation Start (TIS), which defines the start of the coding region.

cDNA

Complementary DNA (cDNA) is DNA synthesized from a messenger RNA (mRNA) template in a reaction catalyzed by the enzymes reverse transcriptase and DNA polymerase.

Page 3: TIS prediction in human cDNAs with high accuracy


Previous Research

• Salzberg, 1997: Positional Conditional Probability matrix.

• Agarwal and Bafna, 1998a: Generalized Second Order Profiles.

• No significant differences were observed between the above methods and a weight matrix.

• Because of their high rate of false positives, the above methods are used together with an implementation of the Ribosome Scanning Model (Kozak, 1996).

• Ribosome Scanning Model: the ribosome first attaches to a specific region at the 5’ end of the mRNA and then scans the sequence for the first ATG.
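The scanning rule described above (take the first ATG from the 5’ end) can be sketched in a few lines. This is only an illustration of the rule as stated on the slide; the function name `first_atg` is hypothetical and not taken from any of the cited programs.

```python
def first_atg(mrna: str) -> int:
    """Return the 0-based index of the first ATG when scanning from the 5' end.

    Returns -1 if the sequence contains no ATG. RNA input (with U) is
    converted to the DNA alphabet first, since cDNA uses T.
    """
    sequence = mrna.upper().replace("U", "T")
    return sequence.find("ATG")
```

For the toy sequence `"GGCATGAAATGA"` the function returns 3, the position of the first ATG, and ignores the second one further downstream.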

Page 4: TIS prediction in human cDNAs with high accuracy


Previous Research

• Pedersen and Nielsen, 1997: use of ANNs to recognize local context and statistical properties around the TIS. A large region is analyzed: 100 bases before and 100 bases after the start codon.

• Salamov et al., 1998: six characteristics are applied to analyze the region around the TIS, including a weight matrix and hexanucleotide difference.

• Zien et al., 2000: use of Support Vector Machines (SVMs) for TIS prediction.

All of the above methods give up to 85% correct predictions.

Page 5: TIS prediction in human cDNAs with high accuracy


Methods – Suggested Model

[Diagram of the suggested model. Labels recovered from the figure: 475 cDNAs (Verified + Checked), Swissprot; Training Gene Pool (Training Set + Evaluation Set, Parameter estimation); Test Gene Pool (Test Set, TIS Prediction); Coding/Non-Coding Potential (Coding NN); Conserved Motif (Consensus NN); Score Multiplication.]

Page 6: TIS prediction in human cDNAs with high accuracy


Consensus Neural Network

• 12-nucleotide window

• 325 positive + 325 negative examples

• Binarization of the input

• Selection of the appropriate feed-forward NN: feed-forward with shortcut connections and two hidden units, trained with the cascade correlation algorithm

[Figure: Cascade Correlation Algorithm]

Page 7: TIS prediction in human cDNAs with high accuracy


Coding Neural Network

• 12-nucleotide window

• 54-nucleotide window

• The Smith-Waterman algorithm is used to eliminate homologies between training and test data

• 282 genes with less than 70% homology were used for training

• 700 positive + 700 negative sequence regions extracted for training

• 250 positive + 250 negative sequence regions extracted for testing
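The homology filtering above relies on Smith-Waterman local alignment. The following is a minimal sketch of the scoring recurrence only, with a linear gap penalty. The scoring values (match 2, mismatch -1, gap -2) are assumptions, and the slides do not say how the 70% homology figure was derived from the alignment.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b."""
    previous = [0] * (len(b) + 1)  # scores for the previous row of the matrix
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best
```

Two sequences sharing only the stretch `ACG` score 6 with these parameters, while sequences with no base in common score 0.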

• The codon usage statistic is applied: for every window, all non-overlapping codons are counted

• The sequence window is rescaled to 64 units

• Every unit gives the normalized frequency of the codon in the window

• The resilient back-propagation algorithm is applied to a feed-forward NN
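The rescaling of a window to 64 codon-frequency units can be sketched as follows. Normalizing by the number of codons counted in the window is an assumption, as is the alphabetical ordering of the 64 units; `codon_usage_vector` is a hypothetical name.

```python
from itertools import product

# All 64 codons in a fixed (alphabetical) order; the order is an assumption.
CODONS = ["".join(bases) for bases in product("ACGT", repeat=3)]
CODON_INDEX = {codon: index for index, codon in enumerate(CODONS)}


def codon_usage_vector(window: str) -> list[float]:
    """Map a sequence window to 64 normalized codon frequencies.

    Codons are read without overlap starting at the first base of the
    window; codons containing symbols other than A, C, G, T are skipped.
    """
    window = window.upper()
    counts = [0] * 64
    total = 0
    for start in range(0, len(window) - 2, 3):
        index = CODON_INDEX.get(window[start:start + 3])
        if index is not None:
            counts[index] += 1
            total += 1
    if total == 0:
        return [0.0] * 64
    return [count / total for count in counts]
```

A 54-nucleotide window contains 18 non-overlapping codons, so each unit takes a value that is a multiple of 1/18 and the 64 units sum to 1.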

Page 8: TIS prediction in human cDNAs with high accuracy


Integrated method: Analysis of full-length mRNA sequences

• 1st stage: calculation of a coding score for every nucleotide of the mRNA sequence

• 2nd stage: calculation of the coding evidence of the coding region included in the longest ORF of the sequence

• 3rd stage: for every in-frame ATG, a consensus score is calculated

• 4th stage: for the same in-frame ATG, a coding difference score is calculated

The final score is obtained by combining the output of the consensus ANN and the coding difference.
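The candidate selection in the stages above (longest ORF, then every in-frame ATG) can be sketched as below. The ORF is taken here as the longest stop-free stretch in any of the three forward frames, which is an assumption about the definition used; the function names are hypothetical and no scoring network is included.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str) -> tuple[int, int]:
    """Return (start, end) of the longest stop-free stretch, end exclusive."""
    seq = seq.upper()
    best = (0, 0)
    for frame in range(3):
        start = frame
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                if i - start > best[1] - best[0]:
                    best = (start, i)
                start = i + 3  # the next stretch begins after the stop codon
        # Close the stretch that is still open at the 3' end of the sequence.
        end = frame + ((len(seq) - frame) // 3) * 3
        if end - start > best[1] - best[0]:
            best = (start, end)
    return best


def in_frame_atgs(seq: str, orf: tuple[int, int]) -> list[int]:
    """Return the positions of all ATG codons in the frame of the given ORF."""
    start, end = orf
    seq = seq.upper()
    return [i for i in range(start, end - 2, 3) if seq[i:i + 3] == "ATG"]
```

For the toy sequence `"ATGCTGAAAATGCCTAACGGGTAA"` the longest stretch is `(0, 21)` and the in-frame ATG candidates are at positions 0 and 9; each of these would then receive a consensus score and a coding difference score.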

Page 9: TIS prediction in human cDNAs with high accuracy


Integrated method: Analysis of full-length mRNA sequences

Las Vegas

The use of the Las Vegas algorithm gives a confident decision. Incorporating this algorithm leads to a highly accurate recognition of the TIS in human cDNAs for 60% of the cases.

A Las Vegas algorithm provides a correct prediction in some cases and has a “no answer” option in the remaining cases. That is, it either produces the correct result or reports the failure.

• This method provides only one prediction for every ORF

• According to the results of the test group:

• 94% of the TIS were correctly predicted

• 6% of the predictions were false positives
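The “no answer” behaviour described above can be illustrated with a simple abstaining predictor. The decision rule (answer only when the best final score reaches a threshold) and the threshold value are assumptions; the slides do not state how the confident cases are selected.

```python
from typing import Optional


def confident_tis(final_scores: dict[int, float],
                  threshold: float = 0.5) -> Optional[int]:
    """Return the position of the best-scoring ATG, or None for "no answer".

    final_scores maps candidate ATG positions to their final scores.
    The threshold of 0.5 is a hypothetical value chosen for illustration.
    """
    if not final_scores:
        return None
    best_position = max(final_scores, key=lambda position: final_scores[position])
    if final_scores[best_position] < threshold:
        return None  # abstain instead of risking a wrong prediction
    return best_position
```

With hypothetical scores `{148: 0.76, 255: 0.2}` the function answers 148; with `{255: 0.2, 270: 0.18}` it returns `None`, so at most one prediction is made per ORF.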

Page 10: TIS prediction in human cDNAs with high accuracy


Results – Score Combination 1/3

Cod line: Score of coding ANN

Local line: Score and position of consensus ANN for all ATGs in coding frame

Nucleotide 255 : cod 0.98 – local 0.2

A score combination of coding ANN and consensus ANN gives low final score.

Page 11: TIS prediction in human cDNAs with high accuracy


Results – Score Combination 2/3

Cod line: Score of coding ANN

Local line: Score and position of consensus ANN for all ATGs in coding frame

Nucleotide 270: cod 0.44 – local 0.4

A score combination of coding ANN and consensus ANN gives low final score.

Page 12: TIS prediction in human cDNAs with high accuracy


Results – Score Combination 3/3

Cod line: Score of coding ANN

Local line: Score and position of consensus ANN for all ATGs in coding frame

Nucleotide 148: cod 0.95 – local 0.8

A score combination of coding ANN and consensus ANN gives high final score.

Correct TIS
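The three score combinations can be reproduced numerically. The sketch below assumes that the two scores are combined by multiplication, following the "Score Multiplication" label in the model diagram, and that the "cod" and "local" values shown on the slides are the two inputs; both points are assumptions.

```python
def combined_score(coding: float, consensus: float) -> float:
    """Combine the coding ANN score and the consensus ANN score by multiplication."""
    return coding * consensus


# (cod, local) values read from the three example slides, keyed by nucleotide position.
EXAMPLES = {255: (0.98, 0.2), 270: (0.44, 0.4), 148: (0.95, 0.8)}

FINAL_SCORES = {
    position: round(combined_score(cod, local), 3)
    for position, (cod, local) in EXAMPLES.items()
}
```

Under this assumption the final scores are 0.196 for nucleotide 255, 0.176 for nucleotide 270 and 0.76 for nucleotide 148, which matches the slides: only the correct TIS at position 148 receives a high final score.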

Page 13: TIS prediction in human cDNAs with high accuracy


Results – Methods Comparison

Correct TIS positions

Page 14: TIS prediction in human cDNAs with high accuracy


Results – Methods Comparison

Prediction for the 3 TIS positions with the highest scores

Page 15: TIS prediction in human cDNAs with high accuracy


Results – Methods Comparison

Consensus motif scores (only for DIANA-TIS)

Page 16: TIS prediction in human cDNAs with high accuracy


Results – Methods Comparison

Final scores

Page 17: TIS prediction in human cDNAs with high accuracy


Results – Methods Comparison

Correct predictions

Page 18: TIS prediction in human cDNAs with high accuracy


Results – Methods Comparison

High prediction score difference

Found TIS but other higher score exists

TIS correct position: 471

Did not find TIS

Prediction Analysis

Page 19: TIS prediction in human cDNAs with high accuracy


Results – Methods Comparison

Correct TIS positions

Performance of the three programs for TIS prediction along the mRNA with signal peptide sequences

Page 20: TIS prediction in human cDNAs with high accuracy


Results – Methods Comparison

Length of signal peptide

Page 21: TIS prediction in human cDNAs with high accuracy


Results – Methods Comparison

Prediction for the 2 TIS positions with the highest scores

Page 22: TIS prediction in human cDNAs with high accuracy


Results – Methods Comparison

Consensus motif scores (only for DIANA-TIS)

Page 23: TIS prediction in human cDNAs with high accuracy


Results – Methods Comparison

Final scores

Page 24: TIS prediction in human cDNAs with high accuracy


Results – Methods Comparison

Prediction example #1: DIANA-TIS is able to distinguish between the TIS and other ATGs better than other ANN-based programs such as NetStart:

• 2 suitable ATGs are 12 nucleotides apart

• Coding/non-coding information is similar

• Consensus motif is completely different

Page 25: TIS prediction in human cDNAs with high accuracy


Results – Methods Comparison

Prediction example #2: A favorable prediction does not work for all examples:

• Consensus motif is completely different

• Combined score is much lower

In some signal peptide sequences the coding potential score is relatively low and can thus affect the combined score.

Page 26: TIS prediction in human cDNAs with high accuracy


Results – Methods Comparison

TIS prediction program                 TIS prediction rate
DIANA-TIS (2001)                       94%
Agarwal & Bafna (1998)                 85%
ATGPred (Salamov et al., 1998)         79%
NetStart (Pedersen & Nielsen, 1997)    78%

These methods allow more than one prediction per gene.

Note: The results come from different datasets, so these numbers should not be directly compared.

Page 27: TIS prediction in human cDNAs with high accuracy


Thank you!

National & Kapodistrian University of Athens, Department of Informatics

Technological Education Institute of Athens, Department of Biomedical Engineering

Biomedical Research Foundation, Academy of Athens

Demokritos National Center for Scientific Research

Introduction to Bioinformatics, Information Technologies in Medicine and Biology