
The Department of Physics, Chemistry and Biology

Master of Science Thesis in Bioinformatics

Unsupervised hidden Markov model for automatic analysis of expressed sequence tags

Andrei T. Alexsson

LiTH-IFM-A-EX--11/2553--SE

Linköpings universitet

SE-581 83 Linköping, Sweden


Supervisor: Lars Arvestad, KTH Royal Institute of Technology

Examiner: Bengt Persson, Linköping University

Linköping, 17 June, 2011

Division, Department: The Department of Physics, Chemistry and Biology, Linköpings universitet, SE-581 83 Linköping, Sweden

Date: 2011-06-17

Language: English

Report category: Examensarbete (degree project)

URL for electronic version: http://www.ifm.liu.se/

ISRN: LiTH-IFM-A-EX--11/2553--SE

Title (Swedish): Oövervakad dold Markov modell för automatisk analys av EST-sekvenser

Title: Unsupervised hidden Markov model for automatic analysis of expressed sequence tags

Author: Andrei T. Alexsson

Abstract

This thesis provides an in-depth analysis of expressed sequence tags (ESTs), which represent pieces of eukaryotic mRNA, using an unsupervised hidden Markov model (HMM). ESTs are short nucleotide sequences used primarily for rapid identification of new genes with potential coding regions (CDS). ESTs are produced by sequencing double-stranded cDNA, and the resulting ESTs are stored in digital form, usually in FASTA format. Since sequencing is often randomized and parts of the mRNA are non-coding, some ESTs will not represent CDS. It is desirable to remove these unwanted ESTs if the purpose is to identify genes associated with CDS. A stochastic HMM allows the region contents of an EST to be identified. Software such as ESTScan uses an HMM that is trained by supervised learning on annotated data. However, because annotated data are not always at hand, this thesis focuses on the ability to train an HMM by unsupervised learning on data containing ESTs both with and without CDS. The training data are not annotated, i.e. the regions that an EST consists of are unknown. In this thesis a new HMM is introduced whose parameters are chosen to be reasonably consistent with biologically important regions of an mRNA, such as the Kozak sequence, poly(A)-signals and poly(A)-tails, in order to guide training and decoding of ESTs to the proper states in the HMM. The transition probabilities in the HMM have been adapted to represent the mean length and length distribution of the different regions in mRNA. The HMM's specificity and sensitivity have been tested with BLAST, by blasting each EST and comparing the BLAST results with the HMM predictions. A regression analysis shows that the length of the ESTs used when training the HMM is significant: the longer the better. The final results show that it is possible to train an HMM with unsupervised machine learning, but to be comparable to a supervised method such as ESTScan, further extension of the HMM is necessary, such as frame-shift correction of ESTs by improving the HMM's ability to choose correctly positioned start codons or nucleotides. False positives are usually caused by incorrectly positioned start codons, which lead to CDS lengths that are too short. Since no frame-shift correction is implemented, short predicted CDS lengths are not accepted and are hence not counted as coding regions during prediction. However, when supervised models are lacking, the unsupervised HMM is a potential replacement with stable performance that can be adapted to any eukaryotic organism.

Keywords: Machine learning, Markov Model, Hidden Markov Model, Expressed sequence tag, EST, Baum-Welch, 1-best, Unsupervised, Supervised, GHMM


Summary (Sammanfattning)

In this degree project, EST sequences, which represent pieces of a eukaryotic mRNA, have been analyzed using an unsupervised hidden Markov model (HMM). EST sequences are short nucleotide sequences used first and foremost for rapid identification of new genes with potentially coding regions (CDS). Sequencing is performed on whole double-stranded cDNA as starting material, and the synthesized EST sequences are stored in digital form, usually in FASTA format. Since sequencing is usually randomized and parts of the mRNA contain non-coding regions, some EST sequences do not represent the coding regions, i.e. the CDS regions. These unwanted EST sequences should be filtered out if the purpose is to map genes with associated CDS. Applying a stochastic HMM makes it possible to identify precisely these EST sequences. Programs such as ESTScan use an HMM that is trained by supervised machine learning on annotated data. However, annotated data are not always at hand, and in this thesis an HMM has instead been trained by unsupervised machine learning on data containing EST sequences both with and without CDS. Since the data are not annotated, nothing is known about the positions of the EST sequences in an mRNA, i.e. the regions that an EST sequence consists of are unknown. A new HMM has therefore been introduced, with focus on the HMM's parameters so that they agree reasonably well biologically with important regions of an mRNA, such as Kozak sequences, poly(A)-signals and poly(A)-tails, in order to facilitate training and decoding of the EST sequences over the correct states in the HMM. The transition probabilities in the HMM have been adapted to represent the mean length and the length distribution of the different regions in the mRNA.

The HMM's specificity and sensitivity have been tested with BLAST, by blasting each EST sequence and comparing the BLAST results with the prediction results of the unsupervised HMM. A regression analysis shows that the lengths of the EST sequences are significantly important during training of the HMM: the longer the better. The results show that it is possible to train an HMM with unsupervised machine learning, but further extension of the HMM is needed, such as error handling of EST sequences, for example improving the HMM's ability to choose correctly positioned start codons or nucleotides. The false positives are usually caused by incorrectly positioned start codons, which lead to far too short CDS lengths. Since error handling is not implemented, short predicted CDS lengths are not acceptable and are not counted as coding regions during a prediction. When precomputed models are lacking, however, an unsupervised HMM is a potential replacement with stable performance that is easy to adapt to all types of eukaryotic organisms.

Acknowledgments

I would like to thank Dr. Lars Arvestad for giving me the opportunity to do this thesis and for his support, and Prof. Bengt Persson for being my examiner. Thanks also to my family for their endless personal support.


Contents

1 Introduction
  1.1 Background
  1.2 Thesis objectives
  1.3 Methods
  1.4 Delimitations
  1.5 Abbreviations

2 Principles of EST
  2.1 Eukaryotic mRNA structure
  2.2 Nucleotide distributions in mRNA
    2.2.1 Start and stop codon
    2.2.2 Poly(A)-signal and poly(A)-tail
    2.2.3 Kozak sequence
    2.2.4 GC content
  2.3 EST production - basic overview
    2.3.1 cDNA synthesis
    2.3.2 cDNA library
    2.3.3 EST synthesis

3 Principles of Hidden Markov Model
  3.1 Discrete HMM
  3.2 HMM order and periodicity
  3.3 HMM training
    3.3.1 Forward algorithm
    3.3.2 Backward algorithm
    3.3.3 Baum-Welch algorithm
  3.4 HMM decoding - 1-best algorithm

4 Experimental design
  4.1 HMM of mRNA
    4.1.1 5’UTR HMM
    4.1.2 Start codon HMM
    4.1.3 CDS HMM
    4.1.4 Stop codon HMM
    4.1.5 3’UTR HMM
    4.1.6 The complete HMM
    4.1.7 Implement HMM in XML
  4.2 Choosing appropriate EST-data
  4.3 Training and decoding
  4.4 The designed softwares
  4.5 Methods of working and tests

5 Result and discussion
  5.1 Designed model
  5.2 Effects of parameter changes
    5.2.1 Emission order
    5.2.2 Null model
    5.2.3 HMM expanding
  5.3 Comparing the lengths and 1-best probabilities for true and false matches
  5.4 Multivariate analysis
  5.5 Final results
    5.5.1 Lack of supervised models

6 Conclusion and future works

Bibliography

A Details about each cDNA library

B Model differences

List of Figures

2.1 Eukaryotic mRNA structure.
2.2 Average GC-content of different organisms for 5’UTR, 3’UTR and third position in codons. Hum (Human). OM (Other mammal). Rod (Rodent). OV (Other vertebrate). Inv (Invertebrate). Pl (Plant). Fun (Fungi). Figure from [14].
2.3 Basic schematic of cDNA synthesis.
2.4 EST sequencing.

3.1 Hidden Markov model topology.
3.2 3-periodic Hidden Markov model topology.
3.3 Forward algorithm directions and paths when obtaining probabilities. Red circle is the current state, green the past. Blue circles define start and end state. Black circle is not yet reached.
3.4 Forward algorithm directions and paths when obtaining the full probability that sequence $\beta_1\beta_2$ is generated from the model.
3.5 Optional caption for list of figures
3.6 Illustration of how to obtain $A_{\xi_1\xi_1}$ from a given sequence $\beta_1\beta_2\beta_3$, i.e. $A_{\xi_1\xi_1} = (P_1 + P_2 + P_3)/P(\beta_1\beta_2\beta_3)$.

4.1 HMM of 5’UTR.
4.2 x-axis = length, y-axis = P(i).
4.3 HMM of start codon ATG.
4.4 HMM of CDS.
4.5 Negative binomial distribution. y-axis = P(c), x-axis = c.
4.6 Stop codon HMM.
4.7 3’UTR HMM.
4.8 x-axis = length, y-axis = P(i).
4.9 The complete HMM with all submodels and the two extra states N1 and N2.
4.10 5’read EST represent sense strand and 3’read EST represent antisense strand.
4.11 BLAST web interfaces.
4.12 CDS information and localization for a match.
4.13 FASTA format.

5.1 Differences between HMM and ESTScan according to specificity and sensitivity.
5.2 Green mark: start codon. Yellow mark: CDS. Red mark: stop codon. No mark: UTR.
5.3 Negative binomial distributions with different number of states for 5’UTR. Blue line: 2 states, p = 0.995. Red line: 3 states, p = 0.99. Green line: 4 states, p = 0.985. Pink line: 5 states, p = 0.98.
5.4 Histogram of 1-best log-probability for hits.
5.5 Histogram of length for hits.
5.6 Histogram of CDS length for hits.
5.7 Multivariate analysis results. Mean length variable significant at 5% level.
5.8 Mean length effect on sensitivity and specificity.
5.9 Residuals.
5.10 Graph showing the difference in parameters between two models with increasing number of generated sequences.
5.11 Mean values for specificity and sensitivity for the unsupervised HMM and ESTScan.

6.1

List of Tables

4.1 Emission probabilities for each state in 5’UTR HMM.
4.2 Emission probabilities for each state in start codon HMM.
4.3 Table of how the emission probabilities are achieved for a two order emission state. xy is any combination from the emission set.
4.4 Order emission in each state.
4.5 Emission probabilities for each state in CDS HMM.
4.6 Emission probabilities for each state in stop codon HMM. State 21-23 represent TAA, 24-26 represent TAG and 27-29 represent TGA.
4.7 Emission probabilities for each state in 3’UTR.
4.8 cDNA libraries from UniGene.
4.9 cDNA libraries from UniGene.
4.10 Variables used in regression analysis.

5.1 Results from BLAST. *: first five libraries ignored.
5.2 Comparing mean values for respective number of sequences.
5.3 Results given with different orders.
5.4 Results given with a null model with equalized probabilities with no fixed emissions.
5.5 Results when adding states to each region of HMM.
5.6 Results via BLAST. *: first five libraries ignored.
5.7 Results via BLAST.

Chapter 1

Introduction

1.1 Background

Understanding how different organisms function from a biological perspective requires insight into their genetic processes. The ability to identify genes with different high-throughput methods (microarrays, mass spectrometry) enables the characterization of gene functions and of how genes are expressed in various processes [3]. High-throughput technologies have evolved rapidly since the Human Genome Project, and the focus today is mainly on low-cost high-throughput methods [2]. Sequencing whole genomes is rather expensive and complex, and other techniques have therefore been considered, such as high-throughput EST sequencing. ESTs, or expressed sequence tags, are short biological sequences derived from partial or full-length cDNA. cDNA is mainly synthesized from mRNA, and because mature mRNA is built up of exons only, cDNA is important for gene discovery. Using ESTs, each of which is part of a cDNA, makes it possible to explore transcriptomes and genes rapidly and in a more practical way than using whole full-length cDNA [3].

Consequently, high-throughput methods have generated a huge and continuously increasing amount of EST data. During this master's thesis project, more than 67 million ESTs from different organisms were available as public entries in dbEST. This volume of public entries motivates powerful data mining utilities and machine learning methods that can quickly process and assemble the continuously expanding volume of ESTs. Data mining tools that can annotate unknown ESTs, which involves identifying protein-coding regions, are highly valued. Such tools already exist, for example ESTScan, which uses a supervised HMM to decode ESTs and find their region contents [4]. A supervised HMM is an HMM trained on annotated data, where one has knowledge about the sequences used in the training data, for example the regions that can be found in each sequence. In this thesis, however, the focus is on an unsupervised HMM trained on data that is not annotated. The data contain EST sequences derived from different mRNAs. An HMM is trained on those ESTs with the Baum-Welch algorithm, and a performance test then analyzes how well the trained HMM can distinguish between ESTs with and without coding regions when predicting with the 1-best decoding algorithm.

1.2 Thesis objectives

The objective of the thesis is to design an HMM in XML and software in Python that can train the HMM unsupervised from unannotated data, i.e. data containing ESTs derived from different mRNAs without any knowledge about the contents of the EST sequences. After training, the HMM shall be able to predict which ESTs in the unannotated data contain protein-coding regions.

Once a proper HMM and software have been developed, the model is evaluated in terms of specificity and sensitivity by comparing its predictions with those from BLAST and ESTScan.
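The evaluation described here can be illustrated with a minimal sketch. It assumes the common definitions sensitivity = TP/(TP+FN) and specificity = TN/(TN+FP), with the reference labels taken from, for example, BLAST hits; the function name and input layout are hypothetical and not taken from the thesis software.

```python
def sensitivity_specificity(predicted, reference):
    """Return (sensitivity, specificity) for two parallel lists of booleans.

    predicted[i] is True if the model reports a coding region in EST i.
    reference[i] is True if the reference (e.g. a BLAST hit) does.
    Assumed definitions: sensitivity = TP/(TP+FN), specificity = TN/(TN+FP).
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have equal length")
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p and r)
    tn = sum(1 for p, r in pairs if not p and not r)
    fp = sum(1 for p, r in pairs if p and not r)
    fn = sum(1 for p, r in pairs if not p and r)
    # Return NaN when a ratio is undefined (no positives or no negatives).
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity
```

Other fields sometimes define specificity as TP/(TP+FP); if that convention is intended, only the last ratio needs to change.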

1.3 Methods

Literature was studied for a better understanding of the various regions of an EST sequence. In particular, nucleotide distributions in important areas of ESTs were examined to find parts of regions with common nucleotide distributions, for example the common start codon ATG at the beginning of the CDS. Literature on Markov models, HMMs and the various algorithms used in parameter estimation and decoding was also studied.

The software was developed in Python, and the HMM design was established in XML. The GHMM library from the Algorithmics group at the Max Planck Institute for Molecular Genetics, a C library that comes with Python wrappers, was used to incorporate algorithms such as 1-best and Baum-Welch into the Python software, and for its support for HMM design.

The sequences predicted by the software were evaluated against BLAST results, which are used as reference. A comparison with ESTScan was also performed.

1.4 Delimitations

• The following two algorithms, which are implemented in the GHMM library, will be used:

– Baum-Welch

– 1-best

• No modifications will be made on the Baum-Welch and 1-best algorithms.

• HMM topology will be static, i.e. the topology is not learned during training.


• The model will not provide support for frame-shift error in EST sequences.

• The HMM shall consider the following features:

– 5’UTR sites

– Start codon

– CDS region, start codon and stop codon not included here.

– Stop codon

– 3’UTR sites

– Poly(A)-signal.

– Poly(A)-tail.

– The HMM is applied only to eukaryotic organisms.

• cDNA libraries were downloaded from the UniGene database and used as EST data.

• The developed software features and requirements:

– Linux is required as operating system.

– EST data need to be in FASTA format.

– The HMM must be in XML format and fully supported by the GHMM library functions.

– Shall train the HMM unsupervised.

– After training, a prediction can be made on any test data (FASTA format).

– The predicted EST sequences containing CDS can be saved in a new file.

– Able to sample sequences.

– Estimate parameter differences between two models.
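Since the software requires EST data in FASTA format, a minimal reader may help illustrate the expected input. This is a sketch under the assumption of plain nucleotide FASTA (a header line starting with ">" followed by one or more sequence lines); the function name `read_fasta` is hypothetical and is not part of the thesis software or of GHMM.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    The header is returned without the leading '>'; sequence lines are
    joined and upper-cased. Blank lines are ignored.
    """
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one if there is one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)
```

Because the function accepts any iterable of lines, it can be used on an open file object as well as on a list of strings.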

1.5 Abbreviations

3’UTR: Three prime untranslated region.

5’UTR: Five prime untranslated region.

A: Adenine.

BLAST: Basic local alignment search tool.

C: Cytosine.

cDNA: Complementary DNA.

CDS: Coding sequence.

db cDNA: Double-stranded complementary deoxyribonucleic acid.

dbEST: Expressed Sequence Tags database.

DNA: Deoxyribonucleic acid.

EST: Expressed sequence tag.

G: Guanine.

GHMM-library: The General Hidden Markov Model library.

HMM: Hidden Markov model.

poly(A): Polyadenosine.

RNA: Ribonucleic acid.

ss cDNA: Single-stranded complementary deoxyribonucleic acid.

T: Thymine.

U: Uracil.

XML: Extensible markup language.

Chapter 2

Principles of EST

2.1 Eukaryotic mRNA structure

mRNA is the biological template used for protein synthesis. mRNA is therefore a useful resource for exploring new genes with potential coding regions. Mature mRNA consists only of exons, with fundamental regions that have biological functions. Eukaryotic mRNA is basically composed of untranslated regions at both ends, i.e. the 5’UTR and the 3’UTR, and between them the region that is actually translated into a protein, i.e. the CDS region (see figure 2.1). Within those regions there are elements, or typical common nucleotide distributions, such as the start codon AUG, which is found at the beginning of a CDS region and marks the translation start of a protein.

Figure 2.1. Eukaryotic mRNA structure. (Schematic labels: 5’UTR, Kozak, Start, CDS, Stop, 3’UTR, poly(A) signal, poly(A).)

The average length of a CDS is typically 1500 nucleotides, or 500 codons. The average 5’UTR length ranges between 100 and 200 nucleotides and is usually shorter than the average 3’UTR length, which ranges between 200 and 800 nucleotides (800 in humans) [12]. The 5’UTR is involved in regulating the initiation of mRNA translation [13].

The other end, the 3’ end, basically consists of two important elements, a poly(A)-tail and a poly(A)-signal. The poly(A)-signal on the pre-mRNA (the primary transcript of DNA, consisting of exons and introns) interacts with cleavage factors, and together with the synthesis of the poly(A)-tail this forms the mature mRNA 3’ end with repeated adenosines [6]. One function of the poly(A)-tail is to stabilize the mRNA, especially against degradation, and to support subcellular localization. Indeed, the 3’UTR mainly regulates the stability and transport of the mRNA [12].

The description of mRNA elements so far has focused on generally conserved elements that occur in almost all eukaryotic organisms and are therefore relevant for this thesis. Other elements exist but vary too much in nucleotide distribution and location to be considered here.

2.2 Nucleotide distributions in mRNA

Each region in mRNA must have some overall nucleotide patterns in order to interact with different molecules, for example the ribosome units that initiate translation. To use unsupervised machine learning effectively to identify the different mRNA regions, knowledge of the common base compositions in each region is necessary. The nucleotide distributions described below can almost always be found in any eukaryotic organism and are hence important later on for the unsupervised HMM, whose emission parameters are initialized with the corresponding base compositions (see chapter 4).

2.2.1 Start and stop codon

Each organism requires a start codon and a stop codon to initiate and terminate translation, respectively. The start codon is the first codon in the CDS and the stop codon is the last.

The start codon in almost all eukaryotic mRNAs is AUG. However, the RNA nucleotide U is never written in EST data; an mRNA sequence is instead written as a regular DNA sequence, so U is written as T. The first codon in the CDS is thus ATG [7].

There are three different stop codons: UAA (written TAA), UAG (written TAG) and UGA (written TGA). Exactly one of them occurs as the last codon of the CDS [?].
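The conventions above (U written as T, ATG as start codon, TAA, TAG or TGA as stop codon) can be sketched as a naive scan for the first reading frame. This is only an illustration that assumes an error-free sequence; the function names are hypothetical, and the scan is not the HMM-based method developed in this thesis.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def to_dna(seq):
    """Write an mRNA sequence in the DNA alphabet used in EST data (U -> T)."""
    return seq.upper().replace("U", "T")


def first_orf(seq):
    """Return (start, end) of the first ATG ... stop reading frame, or None.

    start is the index of the A in ATG; end is the index one past the
    stop codon, so seq[start:end] covers start codon, CDS and stop codon.
    """
    dna = to_dna(seq)
    start = dna.find("ATG")
    while start != -1:
        # Walk codon by codon in the frame defined by this ATG.
        for i in range(start + 3, len(dna) - 2, 3):
            if dna[i:i + 3] in STOP_CODONS:
                return start, i + 3
        start = dna.find("ATG", start + 1)
    return None
```

A short EST often contains only part of a CDS, so a missing stop codon (result `None`) does not by itself mean that the read is non-coding.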

2.2.2 Poly(A)-signal and poly(A)-tail

The nucleotide distribution of the poly(A)-signal is one of the most conserved in eukaryotic organisms. The base sequence is often AAUAAA (written AATAAA).

The poly(A)-tail consists of adenosines only. The number of adenosine residues can be several hundred, around 250 nucleotides [18].

The region between the polyadenylation signal and the poly(A)-tail contains a variety of nucleotides, mostly adenosines, and is between 10 and 30 nucleotides long [5].
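The layout described in this section (AATAAA, then a spacer of 10 to 30 nucleotides, then a run of adenosines) can be expressed as a regular expression. The sketch below is illustrative only: the function name and the `min_tail` threshold are assumptions, and real ESTs often contain a truncated tail or a variant signal that this strict pattern would miss.

```python
import re


def find_polya(seq, min_tail=10):
    """Locate AATAAA + 10..30 nt spacer + poly(A) run of at least min_tail.

    Returns a dict with the signal start index, spacer length and tail
    length, or None if the pattern is not found. min_tail is an
    illustrative threshold, not a value taken from the thesis.
    """
    # Lazy spacer: the spacer is kept as short as possible, so adenosines
    # at the end of the spacer are counted as part of the tail.
    pattern = re.compile(r"AATAAA([ACGTN]{10,30}?)(A{%d,})" % min_tail)
    match = pattern.search(seq.upper())
    if match is None:
        return None
    return {
        "signal_start": match.start(),
        "spacer_length": len(match.group(1)),
        "tail_length": len(match.group(2)),
    }
```

Because the spacer itself is mostly adenosines, the boundary between spacer and tail is ambiguous for a pattern match, which is one reason a probabilistic model is a better fit for this region.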


2.2.3 Kozak sequence

It has been shown that a sequence flanking the start codon, the Kozak sequence, is largely conserved in eukaryotic mRNAs. The Kozak sequence strongly enhances translation initiation by signaling to the ribosome units that the flanked AUG is a true start codon. How the signaling works is unknown, but it is believed that the Kozak sequence slows the scanning of the ribosome unit, making it easier to recognize the start codon [11].

The Kozak sequence is mainly GCCRCCAUGG, where R is a purine (A or G), mostly A [9][10].
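As a small illustration of the consensus GCCRCCAUGG, the sketch below counts how many positions around a candidate ATG agree with it, treating R as A or G and writing U as T. The function name and the simple match count are assumptions made for illustration; the thesis encodes this information in emission probabilities rather than in a score of this kind.

```python
KOZAK = "GCCRCCATGG"  # consensus from the text, with U written as T
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG"}  # R = purine


def kozak_matches(seq, atg_index):
    """Count positions agreeing with the Kozak consensus around an ATG.

    atg_index is the index of the A in the candidate ATG. The window
    covers 6 nucleotides upstream, the ATG itself and 1 nucleotide
    downstream. Returns None if the window does not fit in the sequence.
    """
    start = atg_index - 6
    if start < 0:
        return None
    window = seq.upper().replace("U", "T")[start:start + len(KOZAK)]
    if len(window) < len(KOZAK):
        return None
    return sum(base in IUPAC[symbol] for base, symbol in zip(window, KOZAK))
```

A count of 10 means a perfect match to the consensus; lower counts indicate a weaker context for the candidate start codon.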

2.2.4 GC content

GC content exhibits significant correlations within regions of mRNA. The 5’UTR is particularly high in GC content; the average for human is more than 60%. The 3’UTR, in contrast, is richer in AT, i.e. depleted in GC, with less than 45% in human (see figure 2.2). Moreover, the third position in codons is also high in GC content, and this is often associated with the GC content in the 5’UTR: the lower the GC content in the 5’UTR, the lower the GC content at the third codon position.

Invertebrates, plants and fungi frequently show a low GC content in both UTRs but still a high one in the CDS, all over 50% [14]. It is therefore important to include emission parameters in the model that represent the GC statistics, leading to a trade-off of nucleotides G or C against A or T in the affected regions of the HMM.
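The GC statistics discussed above can be computed directly from a sequence, and a target GC fraction can be turned into a simple emission distribution. The sketch below is a hypothetical illustration: the function names are assumptions, and splitting the GC mass evenly between G and C (and the rest evenly between A and T) is a simplification, not the emission values used in the thesis.

```python
def gc_content(seq):
    """Return the fraction of G and C among the A, C, G, T symbols of seq."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        return float("nan")
    return sum(base in "GC" for base in counted) / len(counted)


def gc3_content(cds):
    """Return the GC fraction at third codon positions of an in-frame CDS."""
    return gc_content(cds[2::3])


def emission_from_gc(gc):
    """Build a zeroth-order emission distribution from a GC fraction.

    Simplifying assumption: G and C share the GC mass equally, and A and
    T share the remainder equally.
    """
    return {"A": (1 - gc) / 2, "C": gc / 2, "G": gc / 2, "T": (1 - gc) / 2}
```

For example, a 5’UTR state for human could be initialized from `emission_from_gc(0.6)` under this simplification, reflecting the average of more than 60% mentioned in the text.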

2.3 EST production - basic overview

ESTs are short biological sequences derived from partial or full-length cDNA. cDNA is synthesized from mRNA, and because mature mRNA is composed of exons only, cDNA is important for gene discovery. Using ESTs, each of which is part of a cDNA, allows transcriptomes and genes to be explored more rapidly and in a more practical way than using whole full-length cDNA [3].

Here follows a very basic description of the procedure from cDNA synthesis to EST production. For more details see [16] and [3].

2.3.1 cDNA synthesis

Synthesis of regular cDNA occurs in several steps. The first step is the synthesis of ss cDNA, using mRNA as template and the enzyme reverse transcriptase. Reverse transcriptase is an RNA-dependent DNA polymerase that acts on an RNA strand and transcribes RNA to DNA. Reverse transcriptase is found in retroviruses, which are therefore used as the source when isolating it. Transcription starts from the 3’ end of the mRNA template, which usually contains polyadenylation sites (if eukaryotic), and progresses toward the 5’ end.

Figure 2.2. Average GC-content of different organisms for 5’UTR, 3’UTR and third position in codons. Hum (Human). OM (Other mammal). Rod (Rodent). OV (Other vertebrate). Inv (Invertebrate). Pl (Plant). Fun (Fungi). Figure from [14].

Once the antisense ss cDNA is synthesized, it needs to be converted to db cDNA, because ss cDNA cannot be cloned directly and db cDNA is more stable against degradation [3][19]. The ss cDNA is therefore separated from the mRNA template and itself acts as template for another DNA polymerase, which produces a double-stranded cDNA (see figure 2.3). The cDNA can now be taken to the next step in the cloning procedure and to EST sequencing [16].

2.3.2 cDNA library

Since cDNA consists only of exons, cDNA libraries are an important tool for identifying genetic information. A set of mRNAs of a specific type (organism, tissue) is collected to prepare a set of cDNAs. These are then incorporated into a vector molecule within a host such as a bacterial cell. The host cell replicates, generating cDNAs on a larger scale and thereby producing a cDNA library [17].

2.3.3 EST synthesis

ESTs are sequenced from cDNA. Sequencing of a cDNA can start randomly from any position and from both directions; however, a full-length sequence is not obtained. Both strands of the cDNA can be sequenced, producing many 5’read EST sequences and 3’read EST sequences (see figure 2.4). A 5’read EST sequence represents the sense strand, i.e. the mRNA strand, and a 3’read sequence represents the antisense strand [3].

Figure 2.3. Basic schematic of cDNA synthesis. (Panels: a primer and reverse transcriptase are added to the mRNA template; single-stranded cDNA is synthesized on the template; the mRNA template is removed; DNA polymerase I produces double-stranded cDNA.)

EST data from databases such as UniGene contain collections of multiple ESTs from different mRNAs. Each sequence is written 5’ to 3’, which must be considered when a data set contains both 5’read and 3’read ESTs. Because the HMM represents the mRNA strand in the 5’ to 3’ direction, each 3’read EST must be reverse-complemented, i.e. converted to the corresponding 5’read orientation, before the data are applied to the HMM.
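The orientation step can be illustrated with a minimal Python sketch. It assumes that each EST is already known to be a 5’read or a 3’read; the function name `to_sense_orientation` and the flag `is_three_prime_read` are hypothetical and do not come from the thesis software.

```python
# Complement table for the five-symbol alphabet A, C, G, T, N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def to_sense_orientation(seq: str, is_three_prime_read: bool) -> str:
    """Return the EST written 5'->3' on the mRNA (sense) strand."""
    seq = seq.upper()
    if not is_three_prime_read:
        return seq  # 5' reads already represent the sense strand
    # Complement every base, then reverse the string (reverse complement).
    return seq.translate(COMPLEMENT)[::-1]

# A 3' read starting in the poly(A) region appears as a run of T.
print(to_sense_orientation("TTTTTTTTAT", True))
```

After this step all sequences can be treated as fragments of the mRNA strand, which is the orientation the HMM states are designed for.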

Figure 2.4. EST sequencing. (Schematic: ESTs are sequenced from a cDNA containing 5’UTR, CDS and 3’UTR.)

Chapter 3

Principles of Hidden Markov Model

A description of the mathematical background of HMMs is given here. Most of the formulas described are taken from [15] and [1].

3.1 Discrete HMM

When searching for biological patterns in the nucleotide sequences of organisms, the patterns will almost never be exact. Even within the same family of organisms there are variations in the nucleotide sequences of, for example, genes encoding the same protein. These variations must be taken into account when aligning sequences with the same origin or when determining which region of the mRNA a sequence belongs to. One way to do this is to use stochastic models, which account for the random variation in sequences of a common region or element, for example the poly(A) signal in the 3’UTR of the mRNA. Models with stochastic capacity are therefore very important for sequence prediction.

An HMM is a kind of stochastic model. An illustration of the HMM topology can be seen in figure 3.1. The model is composed of hidden states that are not observable; here the hidden states can be interpreted as the regions of the mRNA. A state may undergo a transition to another state, and each state emits observations called emissions (here, nucleotides). For example, when a stop codon has been passed, the next nucleotide lies in the 3’UTR, hence a state transition from CDS to 3’UTR has occurred. Each transition and each emission is governed by a probability distribution. When the emissions are discrete, as nucleotides are, the HMM is said to be discrete.

Let us assume that we have $S$ distinct states denoted $\xi_1, \xi_2, \ldots, \xi_S$, where each state emits one of $O$ discrete emissions denoted $\beta_1, \beta_2, \ldots, \beta_O$. A state transition from $\xi_k$ to $\xi_l$ occurs with a state transition probability $a_{\xi_k \xi_l}$, defined as a conditional probability where the current state at time $t$ (in a nucleotide sequence, time can be interpreted as the nucleotide position) is denoted $x_t$:

$$a_{\xi_k \xi_l} = P[x_{t+1} = \xi_l \mid x_t = \xi_k], \qquad 1 \le k, l \le S, \quad 1 \le t \tag{3.1}$$

Figure 3.1. Hidden Markov model topology. (State 1, State 2 and State 3 are linked by transitions, and each state emits an observation.)

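When the current state at $t$ is $\xi_k$, the probability $e_{\xi_k}(\beta_i)$ that emission $\beta_i$ is emitted is defined as:

$$e_{\xi_k}(\beta_i) = P[\beta_i \text{ at } t \mid x_t = \xi_k], \qquad 1 \le k \le S, \quad 1 \le i \le O, \quad 1 \le t \tag{3.2}$$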

An HMM must be given initial probabilities before training or sampling of emissions can proceed, i.e. an initial state distribution π, an emission distribution B and a transition distribution A [15].

3.2 HMM order and periodicity

An important property of an HMM is its order, i.e. how many previous states the outcome of the next state depends on. Equation 3.1 describes a typical first-order HMM: the next state depends only on the current state. A second-order HMM depends on the previous two states, a third-order HMM on the previous three states, and so on. There are also states with higher-order emissions, i.e. states that contain information about previous emissions. They follow the same principle as a higher-order state HMM: a second-order emission state depends on the previous two emissions, a third-order emission state on the previous three emissions, and so on. Normally a state is of zero-order emission, meaning that the state has no memory of previous emissions. In this thesis, however, the focus is on using an increased emission order for the states that model the coding region of the mRNA (more on this in chapter 4). For a state of emission order $n$, the probability of the next emission $\beta_i$ at $t + 1$ depends on the last $n$ emissions $\beta_{j_1}, \ldots, \beta_{j_n}$ and is defined as:

$$e_{\xi_k}(\beta_i \mid \beta_{j_1} \ldots \beta_{j_n}) = P[\beta_i \text{ at } t+1 \mid x_{t+1} = \xi_k,\ \beta_{j_1} \text{ at } t,\ \ldots,\ \beta_{j_n} \text{ at } t-n+1], \qquad 1 \le k \le S, \quad 1 \le i \le O, \quad n \le t \tag{3.3}$$

If a state $\xi_k$ can only be returned to after a multiple of $d$ steps, where $d$ is an integer and $d > 1$, the state $\xi_k$ is said to be periodic with period $d$. Figure 3.2 shows a typical 3-periodic hidden Markov model, in which the same state is reached again after three steps [15].

Figure 3.2. 3-periodic Hidden Markov model topology. (State 1, State 2 and State 3 are linked by transitions, with an additional transition closing the cycle; each state emits an observation.)

3.3 HMM training

The ability of an HMM to recognize biological sequences derived from any organism depends on the parameters used in the HMM. The parameters must be estimated so that they represent the sequences one is interested in finding. An HMM can be trained with specific sequences, say a specific mRNA from human brain. By training an HMM with that mRNA, the parameters of the HMM are estimated so that they represent the nucleotide distributions of that specific mRNA (if the HMM is built that way). An unknown sequence can then be applied to the trained HMM, and the HMM will predict whether or not the unknown sequence represents the mRNA from human brain. The sequences used for training the HMM are called the training set. New unknown sequences used for testing the HMM’s ability to predict new sequences are called the test set. There are different ways to train an HMM with sequences, but most of the time an HMM is trained automatically with a training algorithm. The parameters of the HMM (transition probabilities, emission probabilities, initial state probabilities) must be initialized with values before training is applied. During training the parameters are then re-estimated until some threshold is reached, often defined on the difference in model parameters between iteration i and i + 1: if the difference is sufficiently small, the iteration stops. In this thesis the focus is on a training algorithm called Baum-Welch. Baum-Welch is an iterative training algorithm in which emission sequences (here EST sequences) are used as training data. The algorithm is complex and itself builds on two fundamental algorithms, namely the forward and the backward algorithm [1]. Before explaining how Baum-Welch works, we need to understand the forward and backward algorithms together with some concepts and mathematical background, which are described below.

3.3.1 Forward algorithm

The full forward algorithm calculates the probability that a sequence $X$ is generated by a model $\lambda$, summed over all possible paths through the $S$ states, since several paths can emit the same sequence $X$. The model $\lambda$ consists of the parameters A, B and π. Let us assume that the sequence $X = x_1 x_2 \ldots x_L$ is of length $L$ and consists of emissions from the set of $O$ emissions, where an emission can occur more than once in $X$. The probability of each position in $X$, with emission $x_i$ emitted from a specific state $\xi_l$, is then calculated recursively until length $L$, or any desired length of $X$, is reached. The probability that the partial sequence $x_1 \ldots x_i$ ends in state $\xi_l$ with emission $x_i$ is obtained recursively with the equation:

$$f_{\xi_l}(i) = e_{\xi_l}(x_i) \sum_{k=1}^{S} f_{\xi_k}(i-1)\, a_{\xi_k \xi_l}, \qquad 1 \le l \le S \tag{3.4}$$

The recursive equation 3.4 on its own is used when one is interested in the probability of ending in a specific state with a specific emission. The recursion always starts from position one in $X$ and continues forward to a specified length, hence the name forward algorithm. The procedure is initialized with a state called the begin state, i.e. $k = 0$. The begin state does not emit anything, and neither does the end state $\xi_E$.

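The full algorithm calculates the probabilities of all possible paths emitting the whole sequence $X$ and sums them to obtain the probability that $X$ is generated by the model $\lambda$:

Initialization ($i = 0$):

$$f_{\xi_0}(0) = 1, \qquad f_{\xi_k}(0) = 0 \ \text{for } k > 0$$

Recursion ($i = 1, \ldots, L$; at $i = 1$ the sum also includes the begin state, $k = 0$):

$$f_{\xi_l}(i) = e_{\xi_l}(x_i) \sum_{k=1}^{S} f_{\xi_k}(i-1)\, a_{\xi_k \xi_l}$$

Termination:

$$P(X) = \sum_{k=1}^{S} f_{\xi_k}(L)\, a_{\xi_k \xi_E}, \qquad \xi_E = \text{end state}$$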

Figure 3.3 illustrates the paths and directions needed to obtain the probability that emission $\beta_2$ is emitted from state $\xi_1$. The emission sequence is $\beta_1 \beta_2$ and there are two hidden states, $\xi_1$ and $\xi_2$. As can be seen, the sum over the paths, $f_{\xi_1}(1)\, a_{\xi_1 \xi_1} + f_{\xi_2}(1)\, a_{\xi_2 \xi_1}$, multiplied by the emission probability $e_{\xi_1}(\beta_2)$ gives the probability in question (red circle).

Figure 3.3. Forward algorithm directions and paths when obtaining probabilities. The red circle is the current state, green circles are past states. Blue circles mark the start and end states. The black circle has not yet been reached.

Figure 3.4. Forward algorithm directions and paths when obtaining the full probabilitythat sequence β1β2 is generated from the model.


When the probability that the sequence $\beta_1 \beta_2$ belongs to the model is wanted, all states are reached and the paths are summed (figure 3.4), i.e. $P(\beta_1 \beta_2) = f_{\xi_1}(2)\, a_{\xi_1 \xi_E} + f_{\xi_2}(2)\, a_{\xi_2 \xi_E}$ [1][15].
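The two-state example can be traced with a small Python sketch of the forward algorithm. It is only an illustration under stated assumptions: the state names, emission names and all probabilities below are hypothetical, and the values are unscaled, so it is suitable for short sequences only.

```python
def forward(seq, states, a_begin, a, e, a_end):
    """Return (f, P(X)); f[i][k] is the forward value of state k at position i + 1."""
    # Position 1: leave the silent begin state and emit the first symbol.
    f = [{l: a_begin[l] * e[l][seq[0]] for l in states}]
    # Positions 2..L: the recursion of equation 3.4.
    for x in seq[1:]:
        prev = f[-1]
        f.append({l: e[l][x] * sum(prev[k] * a[k][l] for k in states)
                  for l in states})
    # Termination: transition from every state to the silent end state.
    p_x = sum(f[-1][k] * a_end[k] for k in states)
    return f, p_x

# Hypothetical two-state model over the emissions "b1" and "b2".
states = ["s1", "s2"]
a_begin = {"s1": 0.5, "s2": 0.5}                      # begin state -> state
a = {"s1": {"s1": 0.6, "s2": 0.3},                    # state -> state
     "s2": {"s1": 0.2, "s2": 0.7}}
a_end = {"s1": 0.1, "s2": 0.1}                        # state -> end state
e = {"s1": {"b1": 0.9, "b2": 0.1},
     "s2": {"b1": 0.2, "b2": 0.8}}

f, p = forward(["b1", "b2"], states, a_begin, a, e, a_end)
print(f[1]["s1"], p)
```

Here `f[1]["s1"]` corresponds to the red circle in figure 3.3 and `p` to the full probability in figure 3.4.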

3.3.2 Backward algorithm

While the forward algorithm starts the recursion from the beginning of a sequence, the backward algorithm does the opposite, i.e. the recursion starts from the end of the sequence. Here we are interested in the probability that the following states emit the remaining sequence $x_{i+1} \ldots x_L$ given that the current state is $\xi_k$ with emission $x_i$, i.e. we want to find $b_{\xi_k}(i)$. The total probability $P(X)$ is the same as the one obtained from the forward algorithm. The recursion is the following:

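Initialization ($i = L$):

$$b_{\xi_k}(L) = a_{\xi_k \xi_E} \quad \text{for all } k$$

Recursion ($i = L-1, \ldots, 1$):

$$b_{\xi_k}(i) = \sum_{l=1}^{S} a_{\xi_k \xi_l}\, e_{\xi_l}(x_{i+1})\, b_{\xi_l}(i+1)$$

Termination:

$$P(X) = \sum_{l=1}^{S} a_{\xi_0 \xi_l}\, e_{\xi_l}(x_1)\, b_{\xi_l}(1)$$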

Figure 3.5(a) shows the backward recursion used to obtain the probability that the following states emit $\beta_2$, and figure 3.5(b) shows the full probability $P(\beta_1 \beta_2)$ [1].
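A corresponding Python sketch of the backward recursion is given here. As before, the two-state model and its numbers are hypothetical and chosen only for illustration; no scaling is applied.

```python
def backward(seq, states, a_begin, a, e, a_end):
    """Return (b, P(X)); b[i][k] is the backward value of state k at position i + 1."""
    L = len(seq)
    b = [None] * L
    # Initialization at i = L: only the transition to the end state remains.
    b[L - 1] = {k: a_end[k] for k in states}
    # Recursion for i = L-1, ..., 1 (list indices L-2, ..., 0).
    for i in range(L - 2, -1, -1):
        b[i] = {k: sum(a[k][l] * e[l][seq[i + 1]] * b[i + 1][l] for l in states)
                for k in states}
    # Termination: leave the begin state and emit the first symbol.
    p_x = sum(a_begin[l] * e[l][seq[0]] * b[0][l] for l in states)
    return b, p_x

# Hypothetical two-state model over the emissions "b1" and "b2".
states = ["s1", "s2"]
a_begin = {"s1": 0.5, "s2": 0.5}
a = {"s1": {"s1": 0.6, "s2": 0.3},
     "s2": {"s1": 0.2, "s2": 0.7}}
a_end = {"s1": 0.1, "s2": 0.1}
e = {"s1": {"b1": 0.9, "b2": 0.1},
     "s2": {"b1": 0.2, "b2": 0.8}}

b, p = backward(["b1", "b2"], states, a_begin, a, e, a_end)
print(b[0]["s1"], p)
```

For the same model and sequence, the value of `p` agrees with the one obtained from a forward pass, which is a convenient check of an implementation.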

3.3.3 Baum-Welch algorithm

The Baum-Welch algorithm estimates the emission and transition parameters with the help of a training set containing a number of emission sequences. The transition parameter $a_{\xi_k \xi_l}$ is estimated by first summing the probabilities that the state transition $\xi_k$ to $\xi_l$ occurs, over every position $i$ in a sequence $X$ and over every sequence $X$ in the training data. This sum is then divided by the corresponding sum of the probabilities that $\xi_k$ makes a transition to any possible state $\xi_{l'}$ in $S$, again over every position and every sequence in the training data. Formally:

$$a_{\xi_k \xi_l} = \frac{A_{\xi_k \xi_l}}{\sum_{l'=1}^{S} A_{\xi_k \xi_{l'}}} \tag{3.5}$$

To obtain $A_{\xi_k \xi_l}$ we need to combine the forward and backward algorithms and then calculate posterior probabilities. First we are interested in the probability that an observed sequence $X$ is generated by the model with $x_i$ and $x_{i+1}$ emitted from states $\xi_k$ and $\xi_l$ respectively (written $x_i = \xi_k$, $x_{i+1} = \xi_l$):

$$P(X, x_i = \xi_k, x_{i+1} = \xi_l) = f_{\xi_k}(i)\, a_{\xi_k \xi_l}\, e_{\xi_l}(x_{i+1})\, b_{\xi_l}(i+1) \tag{3.6}$$

Figure 3.5. Backward algorithm directions. (a) The red circle corresponds to the probability $b_1(1)$. (b) Full probability $P(\beta_1 \beta_2)$.

The probability that the state transition $\xi_k$ to $\xi_l$ occurs at position $i$ in sequence $X$ is then given by the posterior probability, where $P(X)$ can be obtained through the forward or the backward algorithm:

$$P(x_i = \xi_k, x_{i+1} = \xi_l \mid X) = \frac{f_{\xi_k}(i)\, a_{\xi_k \xi_l}\, e_{\xi_l}(x_{i+1})\, b_{\xi_l}(i+1)}{P(X)} \tag{3.7}$$

Figure 3.6 illustrates the principle of obtaining $A_{\xi_k \xi_l}$. Here I have assumed a training set with only one sequence, $\beta_1 \beta_2 \beta_3$. The probability that this sequence is generated by the model with $\beta_1 \beta_2$ emitted along the transition from state $\xi_1$ to $\xi_1$ is obtained by summing all the related paths multiplied by the emission probability of $\beta_2$ in $\xi_1$.

Furthermore, to find the probability that $\beta_1 \beta_2$ is emitted from states $\xi_1$ and $\xi_1$ respectively, i.e. the probability of the transition $a_{\xi_1 \xi_1}$ at position one given the observed sequence $\beta_1 \beta_2 \beta_3$, we divide $P(\beta_1 \beta_2 \beta_3, \beta_1 = \xi_1, \beta_2 = \xi_1)$ by $P(\beta_1 \beta_2 \beta_3)$, which gives the posterior probability (see figure 3.6(a)).


We repeat those steps and calculate the posterior probability that the same state transition $\xi_1$ to $\xi_1$ occurs, but now from position two ($\beta_2$) (see figure 3.6(b) and 3.6(c)). The steps are repeated until the end of the emission sequence is reached. The contributions obtained are then summed, giving the sought $A_{\xi_1 \xi_1} = (P_1 + P_2 + P_3)/P(\beta_1 \beta_2 \beta_3)$. If we had more than one sequence we would also sum over all sequences, i.e. to obtain $A_{\xi_k \xi_l}$ we sum over all positions and over all sequences in the training set. $A_{\xi_k \xi_l}$ can therefore be defined as:

$$A_{\xi_k \xi_l} = \sum_{j} \frac{1}{P(X^j)} \sum_{i} f^j_{\xi_k}(i)\, a_{\xi_k \xi_l}\, e_{\xi_l}(x^j_{i+1})\, b^j_{\xi_l}(i+1), \tag{3.8}$$

where j is the sequence and i the position in j.

The emission parameter $e_{\xi_k}(x)$ is estimated in a similar way as in 3.5: the summed probability that emission $x$ is emitted from state $\xi_k$ in the training data is divided by the sum of the corresponding values for each possible emission $\beta_{i'}$ in state $\xi_k$:

$$e_{\xi_k}(x) = \frac{E_{\xi_k}(x)}{\sum_{i'=1}^{O} E_{\xi_k}(\beta_{i'})} \tag{3.9}$$

$E_{\xi_k}(x)$ is obtained analogously to 3.8. The definition is:

$$E_{\xi_k}(x) = \sum_{j} \frac{1}{P(X^j)} \sum_{\{i \mid x^j_i = x\}} f^j_{\xi_k}(i)\, b^j_{\xi_k}(i), \tag{3.10}$$

where only the positions $i$ that emit $x$ are of interest, as can be seen in the inner sum.

We are now ready to summarize the Baum-Welch algorithm in the following iteration steps:

1. Initialization:

Initialize the transition and emission parameters with values, i.e. set up the matrices A, B and π.

2. New parameter estimation:

Estimate all $a_{\xi_k \xi_l}$ and $e_{\xi_k}(\beta_i)$ through 3.5 and 3.9.

3. Log-likelihood of the new model:

Calculate the probability that the training data belong to the new model $\lambda$ with the new parameters in A, B and π. Logarithmic probabilities are used to avoid numerical problems. The log-likelihood is obtained with:

$$\sum_{j} \log P(X^j \mid \lambda), \tag{3.11}$$

where $j$ indexes the sequences in the training data.

4. Convergence criterion:

A convergence criterion is needed to end the iteration. The iteration terminates when the difference in the log-likelihood of the model between two iterations is sufficiently small.

In conclusion, the result of Baum-Welch depends strongly on the initial parameters, because different starting points lead Baum-Welch to different local maxima. When using an unsupervised HMM it is therefore important that the initial parameters are biologically reasonable. Furthermore, to avoid underflow the forward and backward variables are scaled so that their values stay within a reasonable interval [1].
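The four steps can be sketched in Python as follows. This is a simplified illustration and not the GHMM implementation used in the thesis: it assumes zero-order emissions, no explicit end state (so the backward values at position $L$ are 1), a fixed initial distribution, symbols encoded as integers, and unscaled forward and backward values, which restricts it to short sequences. All names and numbers are hypothetical.

```python
import math

def forward(seq, pi, A, B):
    S = len(pi)
    f = [[pi[k] * B[k][seq[0]] for k in range(S)]]
    for x in seq[1:]:
        f.append([B[l][x] * sum(f[-1][k] * A[k][l] for k in range(S))
                  for l in range(S)])
    return f

def backward(seq, A, B):
    S = len(A)
    b = [[1.0] * S]  # assumption: no explicit end state, so b(L) = 1
    for x in reversed(seq[1:]):
        b.insert(0, [sum(A[k][l] * B[l][x] * b[0][l] for l in range(S))
                     for k in range(S)])
    return b

def log_likelihood(seqs, pi, A, B):
    # Equation 3.11: sum of log P(X^j | model) over the training sequences.
    return sum(math.log(sum(forward(s, pi, A, B)[-1])) for s in seqs)

def baum_welch_step(seqs, pi, A, B):
    """One re-estimation of A and B (equations 3.5 and 3.8 to 3.10)."""
    S, O = len(A), len(B[0])
    A_cnt = [[0.0] * S for _ in range(S)]
    E_cnt = [[0.0] * O for _ in range(S)]
    for seq in seqs:
        f, b = forward(seq, pi, A, B), backward(seq, A, B)
        p_x = sum(f[-1])
        for i, x in enumerate(seq):
            for k in range(S):
                E_cnt[k][x] += f[i][k] * b[i][k] / p_x
                if i + 1 < len(seq):
                    for l in range(S):
                        A_cnt[k][l] += (f[i][k] * A[k][l] * B[l][seq[i + 1]]
                                        * b[i + 1][l] / p_x)
    new_A = [[v / sum(row) for v in row] for row in A_cnt]
    new_B = [[v / sum(row) for v in row] for row in E_cnt]
    return new_A, new_B

def train(seqs, pi, A, B, tol=1e-6, max_iter=200):
    ll = log_likelihood(seqs, pi, A, B)
    for _ in range(max_iter):
        A, B = baum_welch_step(seqs, pi, A, B)
        new_ll = log_likelihood(seqs, pi, A, B)
        if new_ll - ll < tol:      # step 4: convergence criterion
            ll = new_ll
            break
        ll = new_ll
    return A, B, ll

# Hypothetical initial parameters and training data (symbols 0 and 1).
pi = [0.5, 0.5]
A0 = [[0.7, 0.3], [0.4, 0.6]]
B0 = [[0.6, 0.4], [0.3, 0.7]]
seqs = [[0, 0, 1, 1, 0], [1, 1, 1, 0], [0, 1, 0, 0, 0, 1]]
A1, B1, ll1 = train(seqs, pi, A0, B0)
print(log_likelihood(seqs, pi, A0, B0), ll1)
```

Running the sketch from different initial values of `A0` and `B0` is a simple way to observe the dependence on the starting point.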

3.4 HMM decoding - 1-best algorithm

Given data containing ESTs, we are interested in which regions of the mRNA each EST sequence represents. The hidden states of the HMM represent the regions of the mRNA and cannot be observed directly, i.e. we cannot tell the region content of an EST sequence by looking at it. Instead we must decode each EST with the HMM and calculate the most likely regions that generate the specific EST sequence. For this purpose we can once again use an iterative algorithm, namely the 1-best algorithm. The 1-best algorithm evaluates the hidden state sequences that may produce the observed EST sequence and picks the one that gives the highest probability. It calculates the probabilities that an emission is emitted by a hidden state.

The 1-best algorithm is described as follows:

1. Start the recursion at $i = 0$ and continue to $i = L$:

$$\gamma_l(\mathrm{label}_j) = \Big( \sum_{k} a_{\xi_k \xi_l}\, \gamma_k(\mathrm{label}_i) \Big)\, e_l(x_{i+1})$$

$\gamma_l(\mathrm{label}_j)$ is the probability that the next emission $x_{i+1}$ is emitted from the hidden state $l$ with label $j$. $\gamma_k(\mathrm{label}_i)$ is the highest probability picked for state $k$ at the current emission $i$. The label here represents the region of the mRNA. $\gamma_l(\mathrm{label}_j)$ is calculated for each state in the HMM and for all labels. However, the states in the unsupervised HMM have only one label each, so there is only one estimated $\gamma_l(\mathrm{label}_j)$ for each state $l$.

2. Summing all probabilities:

Sum the probabilities of each finally estimated $\gamma_l(\mathrm{label}_j)$ over the states and save the one with the highest value.

ESTs can be several hundred nucleotides long, so the original 1-best algorithm can produce numerical underflow. To avoid this, the algorithm is carried out with logarithmic probabilities [8].
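The effect of working with logarithms can be illustrated with a short Python sketch. Note that it uses the Viterbi algorithm as a stand-in and is not the 1-best algorithm of [8]; the point is only that sums of log-probabilities stay finite where products of probabilities underflow. The two states and all probabilities are hypothetical.

```python
import math

def viterbi_log(seq, states, start, trans, emit):
    """Most probable state path and its log-probability (Viterbi, log space)."""
    log = lambda p: math.log(p) if p > 0 else float("-inf")
    v = {k: log(start[k]) + log(emit[k][seq[0]]) for k in states}
    back = []
    for x in seq[1:]:
        ptr, nxt = {}, {}
        for l in states:
            # Best predecessor of state l at this position.
            best = max(states, key=lambda k: v[k] + log(trans[k][l]))
            ptr[l] = best
            nxt[l] = v[best] + log(trans[best][l]) + log(emit[l][x])
        back.append(ptr)
        v = nxt
    last = max(states, key=lambda k: v[k])
    path = [last]
    for ptr in reversed(back):          # trace the pointers backwards
        path.append(ptr[path[-1]])
    return path[::-1], v[last]

# Hypothetical AT-rich "UTR" state and GC-rich "CDS" state.
states = ["UTR", "CDS"]
start = {"UTR": 0.5, "CDS": 0.5}
trans = {"UTR": {"UTR": 0.9, "CDS": 0.1}, "CDS": {"UTR": 0.1, "CDS": 0.9}}
emit = {"UTR": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
        "CDS": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}

path, score = viterbi_log("AATTGGCC", states, start, trans, emit)
print(path, score)

# A 2000-nucleotide sequence: the log score is finite although exp(score) underflows.
long_path, long_score = viterbi_log("ACGT" * 500, states, start, trans, emit)
print(long_score, math.exp(long_score))
```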


(a)

(b)

(c)

Figure 3.6. Illustration on how to achieve Aξ1ξ1from a given sequence β1β2β3, i.e.

Aξ1ξ1= (P1 + P2 + P3)/P (β1β2β3).

Chapter 4

Experimental design

4.1 HMM of mRNA

This section describes in detail the design of the HMM representing mRNA. Because mRNA contains several different regions, the HMM is constructed from several sub-models, each representing one region of the mRNA (see figure 2.1). The transitions between states and the initial values of all parameters are explained. A simple HMM with as few states as possible is set up first; later the HMM is expanded further.

4.1.1 5’UTR HMM

Figure 4.1 illustrates the proposed eight-state HMM for the entire 5’UTR section. States one and two model the 5’UTR with a length distribution. States three to eight represent the Kozak sequence before the start codon (see chapter 2).

The first two states each have a transition to themselves so that any length can be modeled. Two consecutive states are used in order to avoid the geometric distribution. With a single self-looping state the length follows a geometric distribution, in which the probability decreases monotonically with length: the longer the sequence, the lower the probability. Thus, when a long 5’UTR sequence is modeled with only one state, the risk is that the transition from that state to the Kozak sequence states, and subsequently to the CDS region, occurs too soon. The consequence is that part of the 5’UTR sequence is modeled as CDS instead of 5’UTR. Using two states instead induces a negative binomial distribution, which captures length distributions better.

The transition probabilities in states one and two are tied (equal in both states) and are chosen to give an approximately bell-shaped negative binomial distribution whose peak represents the mean length of the 5’UTR. The transition probabilities out of each state always sum to one, as do the emission probabilities. With only one state, the length follows a geometric distribution with probability $P(i) = a_{kk}^{\,i-1}(1 - a_{kk})$, where $i$ is the length of the sequence. Using two or more states gives instead the probability function

$$P(i) = \binom{i-1}{n-1}\, a_{kk}^{\,i-n}\, (1 - a_{kk})^{n},$$

where $n$ is the number of states and $a_{kk}$ is the same for all states.

After testing (by plotting different values of $a_{kk}$ in MATLAB), $a_{11} = a_{22} = 0.995$ was found to produce a reasonable negative binomial distribution, considering that the average length of the 5’UTR is set to 200 nucleotides (see figure 4.2(a)). Using only one state with the same transition probability gives the geometric distribution seen in figure 4.2(b), where the probability decreases exponentially with length. Reducing the value of $a_{11}$ and $a_{22}$ shifts the distribution toward shorter lengths, so that it looks more like a geometric distribution, whereas increasing the value captures longer lengths.

Figure 4.1. HMM of 5’UTR. (States 1 and 2 each have a self-transition of 0.995 and a forward transition of 0.005; the Kozak sequence states 3 to 8 are linked by transitions of 1.0.)

Figure 4.2. (a) Negative binomial distribution. (b) Geometrical distribution. x-axis = length, y-axis = P(i).
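The two length distributions can be compared numerically with a small Python sketch of the formula for $P(i)$. The function names are hypothetical and the code is not the MATLAB script used for the figures. Under this formula the peak for two tied states with self-transition 0.995 lies at about 200 nucleotides, while the expected length $n/(1 - a_{kk})$ is 400.

```python
from math import comb

def length_pmf(i, a, n):
    """P(i) for n consecutive states sharing the self-transition probability a."""
    if i < n:
        return 0.0  # at least one emission per state
    return comb(i - 1, n - 1) * a ** (i - n) * (1 - a) ** n

def mode(a, n, max_len=5000):
    """Length with the highest probability (first one found if there is a tie)."""
    return max(range(1, max_len + 1), key=lambda i: length_pmf(i, a, n))

def mean_length(a, n, max_len=20000):
    """Expected length, truncated at max_len."""
    return sum(i * length_pmf(i, a, n) for i in range(1, max_len + 1))

print(mode(0.995, 2), mean_length(0.995, 2))  # two tied states
print(mode(0.995, 1))                         # one state: geometric distribution
```

The same functions can be used to explore other values of `a` and `n`, for example when choosing transition probabilities for longer regions.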

I have chosen emission order 0 because the 5’UTR varies greatly between nucleotide positions and rarely follows a specific pattern, i.e. the nucleotide at a position does not depend on the preceding nucleotides. The Kozak sequence is also of emission order 0, with emission probabilities set high for the specific nucleotides that occur in the Kozak sequence. Using emission order 0 in the 5’UTR HMM avoids unnecessary parameters.

The initial transition probabilities for all states can be seen in figure 4.1.

The emission parameters of the first two states reflect the GC statistics mentioned in chapter 2. This means that the emission probability of nucleotide G or C is slightly higher than that of A and T. States three to eight reflect the Kozak sequence. The initial emission probabilities are shown in table 4.1. A matrix $B_{\xi_k}$ holds the initial probabilities for each state and is implemented in the XML file. The emission probabilities are chosen arbitrarily but follow the GC content and Kozak sequence statistics.

Table 4.1. Emission probabilities for each state in 5’UTR HMM.

          A       G       C       T       N
State 1   0.2     0.2995  0.2995  0.2     0.001
State 2   0.2     0.2995  0.2995  0.2     0.001
State 3   0.0003  0.999   0.0003  0.0003  0.0001
State 4   0.0003  0.0003  0.999   0.0003  0.0001
State 5   0.0003  0.0003  0.999   0.0003  0.0001
State 6   0.8     0.1     0.045   0.045   0.001
State 7   0.0003  0.0003  0.999   0.0003  0.0001
State 8   0.0003  0.0003  0.999   0.0003  0.0001

Emission N stands for "any nucleotide" or "unknown nucleotide" and is one of the many symbols used for nucleotides in the FASTA format. Some positions in a sequence are represented with an N for various reasons, for example sequencing errors. Additional ambiguity symbols are "R", "Y", "K", "M", "S", "W", "B", "D", "H", "V" and "X". When these are found in a sequence they are converted to an "N", which is simply interpreted as an unknown nucleotide at that position. Extending the emission parameters to represent all symbols would make the HMM too complex and lead to heavy calculations when estimating the parameters with Baum-Welch. The probability of N is set low in all states because N occurs only rarely at any given nucleotide position.
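The conversion of ambiguity symbols can be sketched in Python as below. The function name `normalize_est` is hypothetical and the sketch is not taken from the thesis software; it only maps the ambiguity symbols listed above to N and leaves A, C, G, T and N unchanged.

```python
# Ambiguity symbols that are collapsed into the single "unknown" emission N.
AMBIGUOUS = "RYKMSWBDHVX"
TO_N = str.maketrans(AMBIGUOUS, "N" * len(AMBIGUOUS))

def normalize_est(seq: str) -> str:
    """Upper-case the sequence and replace every ambiguity symbol with N."""
    return seq.upper().translate(TO_N)

print(normalize_est("acgtRYkmN"))
```

With this preprocessing the emission alphabet of every state stays at the five symbols A, G, C, T and N.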

4.1.2 Start codon HMM

The start codon model is designed as a three-state HMM representing the nucleotide sequence ATG. The emission parameters of this model are fixed, i.e. they are not modified during training with Baum-Welch, because I simply assume that ATG always starts the CDS. State 9 has a fixed high emission probability for A, state 10 for T and state 11 for G. Because the emission probabilities are fixed, the start codon HMM does not need a higher emission order, so the order is zero.

Figure 4.3. HMM of start codon ATG. (States 9, 10 and 11 are linked by transitions of 1.)

Table 4.2. Emission probabilities for each state in start codon HMM.

          A       G       C       T       N
State 9   0.999   0.0003  0.0003  0.0003  0.0001
State 10  0.0003  0.0003  0.0003  0.999   0.0001
State 11  0.0003  0.999   0.0003  0.0003  0.0001

Table 4.2 shows that the probabilities of the remaining emissions are not exactly zero. For example, if the probability of T in state 9 were exactly zero and a decoding proposed that a nucleotide T is emitted from state 9, the probability that the whole sequence belongs to the model would be zero: decoding with the 1-best algorithm would give a zero probability because of the multiplication by zero in the recursion (see chapter 3). To avoid such overfitting problems, I give the remaining emissions small non-zero initial probabilities.

4.1.3 CDS HMM

The yellow states in figure 4.4 have high emission probabilities for nucleotides G and C relative to A and T, because of the GC statistics mentioned in chapter 2. The blue states have equally distributed emission probabilities for the nucleotides A, C, G and T. State 12 represents the nucleotide that immediately follows the start codon and is the last nucleotide of the Kozak sequence, namely G, which therefore has a high emission probability relative to the other emissions.

States 15-17 and 18-20 represent codons. A codon is made up of three nucleotides and encodes a specific amino acid during translation. Thus states 15-17 reflect positions one to three in a codon, and the same applies to states 18-20. The two codon models are periodic with period three through the transitions from states 17 and 20. State 17 has a transition back to state 15, i.e. back into the same codon model, and a transition to the next codon model. Similarly, state 20 has a transition back to state 18 and a transition to a stop codon (next section). This three-periodic system therefore captures the codon structure of a coding region.

Three emission parameters in states 17 and 20 are chosen to avoid stop codons in the CDS HMM: the values of $P(G \mid TA)$, $P(A \mid TA)$ and $P(A \mid TG)$ are set to exactly zero, which excludes the stop codons TAG, TAA and TGA.

On the same principle as in the 5’UTR HMM, I used two codon models to avoid the geometric distribution. The average length of a CDS is approximately 1500 nucleotides, or 500 codons, so a length distribution centred on about 500 codons must be implemented. Using the formula

$$P(c) = \binom{c-1}{m-1}\, a_{kl}^{\,c-m}\, (1 - a_{kl})^{m},$$

where $c$ is the number of codons and $m$ is the number of codon models, I have found through testing that $a_{17,15} = a_{20,18} = 0.9981$ gives a reasonable negative binomial distribution whose peak lies at about 500 codons (see figure 4.5).

Figure 4.4. HMM of CDS. (States 12 to 20; the return transitions $a_{17,15}$ and $a_{20,18}$ are 0.9981, the transitions leaving each codon model are 0.0019, and the remaining transitions are 1.)

Figure 4.5. Negative binomial distribution. y-axis = P(c), x-axis = c.

The states in the codon models have higher-order emissions because the purpose is to capture the CDS structure; the emission order is therefore set to 2. A second-order emission means that each state depends on a history of the previous two emissions. With the emission set {A, G, C, T, N}, the previous two emissions can occur in $5^2 = 25$ different ways. This means that each emission in states 15 to 20 can be conditioned in 25 different ways, which gives a total of $25 \cdot 5 = 125$ different emission parameters per state (see table 4.3). For example, $P(A \mid GG)$ is the probability that A occurs given the preceding sequence GG, $P(A \mid GC)$ is another such probability, and so on, giving 25 different emission probabilities for emission A. The same applies to the remaining emissions.
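The parameter count can be made concrete with a short Python sketch that builds the emission table of one second-order state. The uniform starting values, the function names and the renormalization after excluding stop codons are assumptions made for illustration; the thesis only states that the corresponding entries are set to zero in the third codon position.

```python
from itertools import product

ALPHABET = "AGCTN"

def uniform_second_order():
    """One distribution over ALPHABET for each of the 25 two-symbol histories."""
    return {"".join(h): {x: 1 / len(ALPHABET) for x in ALPHABET}
            for h in product(ALPHABET, repeat=2)}

def forbid_stop_codons(table):
    """Zero the entries that would complete TAA, TAG or TGA, then renormalize."""
    for codon in ("TAA", "TAG", "TGA"):
        table[codon[:2]][codon[2]] = 0.0
    for dist in table.values():
        total = sum(dist.values())
        for x in dist:
            dist[x] /= total   # assumption: keep each row a probability distribution
    return table

table = forbid_stop_codons(uniform_second_order())
print(len(table), sum(len(d) for d in table.values()))
print(table["TA"])
```

The dictionary is keyed by the two preceding emissions, so `table["GG"]["A"]` plays the role of $P(A \mid GG)$.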

Table 4.3. How the emission probabilities are obtained for a second-order emission state. xy1, ..., xy25 are the 25 combinations of two symbols from the emission set.

A           G           C           T           N
P(A|xy1)    P(G|xy1)    P(C|xy1)    P(T|xy1)    P(N|xy1)
P(A|xy2)    P(G|xy2)    P(C|xy2)    P(T|xy2)    P(N|xy2)
...         ...         ...         ...         ...
P(A|xy25)   P(G|xy25)   P(C|xy25)   P(T|xy25)   P(N|xy25)

Total emission parameters: 125

States 12-14, however, have zero-order emissions. State 12 does not need to depend on previous emissions because the previous emissions are already known to correspond to the start codon. States 13 and 14 are also of zero-order emission, for simplicity and to reduce the number of parameters, but also so that the CDS model has at least one place that can serve directly as a start point during decoding or training, in order to reduce frame-shift errors. It is not possible to start directly in states 15 to 20 without an emission history of at least two nucleotides. The emission order of each state is given in table 4.4 and all emission probabilities for each state are given in table 4.5.

The transition probabilities are shown in figure 4.4.

Table 4.4. Emission order of each state.

State  Order
12     0
13     0
14     0
15     2
16     2
17     2
18     2
19     2
20     2

4.1.4 Stop codon HMM

The emission probabilities are fixed in the same way as in the start codon model, simply because I assume that these nucleotide sequences are always present as stop codons. The three sub-models in figure 4.6 represent the three stop codons TAA, TAG and TGA. The emissions in these states are of zero order. The emission probabilities for each state are given in table 4.6 below.

Figure 4.6. Stop codon HMM. (Three parallel sub-models between CDS and 3’UTR, with transitions of 1 between the states within each sub-model.)

4.1.5 3’UTR HMM

The first three states, 30-32 in figure 4.7, model the part of the 3’UTR right after the stop codon. The purpose of the selected topology is to reflect that the 3’UTR is longer on average than the 5’UTR. I might as well have used only two states, but the given topology was chosen to avoid confusion between the lengths of the 5’UTR and the 3’UTR. By testing transition probabilities in MATLAB, the value $a_{30,30} = a_{31,31} = a_{32,32} = 0.9964$ was found to give a negative binomial distribution with 550 nucleotides as mean length (see figure 4.8(a)). The 3’UTR tends to contain AT-rich elements, so the emission probabilities of A and T are initialized high.

States 33-38 cover the polyadenylation signal AATAAA and therefore have high emission probabilities for nucleotides A and T in their respective states. The emissions are not fixed, however, simply because the nucleotides may vary.

Between the polyadenylation signal and the poly(A) tail there is a spacer of 10 to 30 nucleotides. This spacer is modeled by state 39. A short length is better modeled by a geometric distribution than by a negative binomial distribution. I set the transition probability $a_{39,39}$ to 0.5, which gives Baum-Welch more control when estimating the value through training. The geometric distribution can be seen in figure 4.8(b). As in states 30-32, the emission probabilities in state 39 represent AT-rich elements.

States 40-41 model the poly(A) tail, which consists of repeated adenosine nucleotides and extends to around 250 adenosines. It is therefore once again important to have a negative binomial distribution, here with transition probability $a_{40,40} = a_{41,41} = 0.99595$ (see figure 4.8(c)). The emissions are not fixed because of variations; for example, an occasional T may appear instead of A.

State 42 has fixed emissions with A as the highest emission probability, because the model is intended to end with an adenosine. Its transition of 1.0 is only there to model the last adenosines of an emission sequence. As in the 5’UTR, all states are of zero-order emission. Table 4.7 shows all emission probabilities.

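Figure 4.7. 3’UTR HMM. (States 30-32: self-transitions of 0.9964 and forward transitions of 0.0036; polyadenylation signal states 33-38: transitions of 1.0; state 39: self-transition of 0.5 and forward transition of 0.5; poly(A) tail states 40-41: self-transitions of 0.99595 and forward transitions of 0.00405; state 42: transition of 1.0.)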

4.1.6 The complete HMM

The complete HMM is built from all the sub-models for the regions of the mRNA described above. Figure 4.9 shows two silent states, a begin state and an end state. A transition can occur from the begin state to all states in the sub-models, simply because EST sequences are sequenced randomly and can therefore start anywhere in the HMM. These transitions from the begin state are defined as the initial state probabilities and are equally distributed over the 42 states that can be used as start points in the HMM, i.e. $\pi_{\xi_k} = 1/42$ for all states except state 42 and the N2 state. There is no need to start in state 42, because a very short EST sequence consisting only of adenosines is very unlikely. Similarly, the end state illustrates that EST sequences can end anywhere in the HMM. However, no explicit transition to an end state exists, because the end state, like the begin state, is only an abstraction.

Starting in a state with higher-order emissions is not allowed, because the emission probabilities of such a state depend on the history of previous emissions, and the GHMM library does not support starting in such a state without an emission history. The HMM therefore needs two additional states before the CDS states with higher-order emissions; these are named N1 and N2. They have fixed, uniform emission probabilities. It is possible to start in N1 but not in N2, since the affected CDS states (15-20) require a history of two nucleotides.

Figure 4.8. Length distributions; x-axis = length, y-axis = P(i). (a) Negative binomial distribution over states 30, 31 and 32. (b) Geometric distribution in state 39. (c) Negative binomial distribution over states 40 and 41.
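The initial state probabilities described above can be sketched in a few lines of Python. This is only an illustration: the state names (numbered states "1" to "42" plus "N1" and "N2") and the function `initial_probabilities` are assumptions made for the example and are not taken from the GHMM library or from the thesis software.

```python
def initial_probabilities(state_names, excluded):
    """Spread the start probability uniformly over all states not in `excluded`."""
    allowed = [s for s in state_names if s not in excluded]
    p = 1.0 / len(allowed)
    return {s: (0.0 if s in excluded else p) for s in state_names}


# Hypothetical naming: 42 numbered states plus the two extra states N1 and N2.
states = [str(i) for i in range(1, 43)] + ["N1", "N2"]

# State 42 and N2 may not be used as starting points.
pi = initial_probabilities(states, excluded={"42", "N2"})

print(pi["1"], pi["N1"], pi["42"], pi["N2"])
```

With 44 states in total and two of them excluded, each of the remaining 42 states receives the probability 1/42.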

4.1.7 Implement HMM in XML

The designed HMM is implemented in XML, whose layout is defined by the GHMM library. The software loads the XML file using the GHMM library and stores it as a data type that is interpreted as an HMM. Once the data type has been created from the XML file, the Baum-Welch and 1-best algorithms can be applied to it, so that new parameters can be estimated directly in the data type or sequences can be decoded with it.

The code below shows a typical HMM implemented in XML. It defines a two-state HMM in which state 2 has emission order 1. The HMM is initialized by declaring the type of HMM to be used (line 1), in this case a discrete higher-order HMM. "Labeled" means that the emissions are named and not numbered. Lines 2-8 define the emission symbols and their names, and lines 9-15 the regions. Lines 16 and 20 define the two states. States are identified by consecutive numbers that always start at zero. Each state must be given an initial state probability, which is why an explicit begin state is unnecessary. Lines 17 and 21 initialize the emissions of each state: the emission probabilities, whether they are fixed or not, and the emission order of the state. Line 21 specifies that state 2 has emission order 1, which gives 25 emission parameters. The last lines specify the transitions between the states and the transition probabilities.

Figure 4.9. The complete HMM with all submodels (begin state, 5’UTR HMM, start codon HMM, CDS HMM, stop codon HMM, 3’UTR HMM, end state) and the two extra states N1 and N2.

XML code:

1 <HMM type="higher-order labeled discrete">
2 <alphabet id="0">
3 <symbol code="0">A</symbol>
4 <symbol code="1">C</symbol>
5 <symbol code="2">G</symbol>
6 <symbol code="3">T</symbol>
7 <symbol code="4">N</symbol>
8 </alphabet>
9 <classAlphabet>
10 <symbol code="0">5U</symbol>
11 <symbol code="1">Start</symbol>
12 <symbol code="2">CDS</symbol>
13 <symbol code="3">Stop</symbol>
14 <symbol code="4">3U</symbol>
15 </classAlphabet>
16 <state id="0" initial="0.03333333" desc="State 1">
17 <discrete id="0" fixed = "1" >0.1998, 0.1998, 0.1998, 0.1998, 0.01</discrete>
18 <class>0</class>
19 </state>
20 <state id="1" initial="0.03333333" desc="State 2">
21 <discrete id="0" order="1">0.1998, 0.1998, 0.1998, 0.1998, 0.01,
22 0.1998, 0.1998, 0.1998, 0.1998, 0.01,
23 0.1998, 0.1998, 0.1998, 0.1998, 0.01,
24 0.1998, 0.1998, 0.1998, 0.1998, 0.01,
25 0.1998, 0.1998, 0.1998, 0.1998, 0.01</discrete>
26 <class>0</class>
27 </state>
28 <transition source="0" target="0">
29 <probability>0.5</probability>
30 </transition>

33 </transition>

36 </transition>
37 </HMM>
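The number of emission parameters of a state follows from the alphabet size and the emission order: a state of order k needs one probability for every symbol in every possible history of k previous symbols. The following minimal sketch shows the arithmetic for the five-symbol alphabet (A, C, G, T, N) used above; the function name is chosen only for this illustration.

```python
def emission_parameter_count(alphabet_size: int, order: int) -> int:
    """Number of emission probabilities of one state with the given emission order.

    There are alphabet_size ** order possible histories, and each history has
    alphabet_size emission probabilities.
    """
    return alphabet_size ** (order + 1)


for order in range(3):
    print(order, emission_parameter_count(5, order))
```

For order 1 this gives the 25 values listed on lines 21-25 of the XML code, and for order 2 it gives 125 values, i.e. 25 rows of five probabilities per state.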

4.2 Choosing appropriate EST-data

EST data were downloaded from the UniGene database. The data consist of EST sequences from cDNA libraries. A cDNA library is derived from a specific tissue and usually consists of different mRNA sequences. The downloaded EST sequences will thus represent parts of different mRNA sequences, so the HMM should be able to model different mRNA sequences at once.

I have chosen 20 EST data sets from various organs of Homo sapiens. The HMM is therefore adapted to human mRNA, because the chosen transition probabilities represent the average lengths of the specific regions of human mRNA. The 20 data sets are grouped by number of EST sequences: 5 with approximately 500 sequences, 5 with approximately 2000 sequences, 5 with approximately 5000 sequences, and 5 with approximately 10 000 sequences each. The selected data give a measure of how well the unsupervised model performs relative to a supervised model. ESTScan will hence use a supervised HMM for Homo sapiens. More details about the EST data can be found in table 4.8 and appendix A.

Another task is to compare the results of the unsupervised HMM with those of ESTScan when no supervised model exists for an organism. Seven distinct organisms that lack supervised models are chosen (see table 4.9). ESTScan will still use a supervised HMM for Homo sapiens, both because the initial unsupervised HMM is more or less adapted to Homo sapiens and because ESTScan needs a model in order to run, even if the model does not represent the desired organism. The seven organisms are chosen at random, but with more than 6000 sequences each to make sure that there are sufficient data to train the unsupervised HMM. The point of this task is to show the unsupervised model’s ability to adapt to a data set, and how well it performs relative to a supervised model for another organism when no supervised model is available.

4.3 Training and decoding

The data from table 4.8 are used both as training data and as test data. The HMM is first trained on the EST data, and the trained HMM is then used to predict the content regions of each EST sequence in the same data; the EST data thus validate the HMM’s ability to predict correctly. Training of the HMM is unsupervised, i.e. it is not known which part of the HMM an EST sequence belongs to, so the Baum-Welch algorithm estimates all parameters of the model at each iteration using all EST sequences. Since learning is unsupervised, it is important to have emission probabilities that represent important elements of an mRNA, which will hopefully direct the EST sequences to the right part of the HMM during training.

Since the unsupervised HMM models the mRNA strand in the 5’ to 3’ direction, it is important to know that a 3’read EST in the data represents the strand complementary to the mRNA. A 3’read must therefore be converted so that it represents the mRNA strand. The anti-sense strand in figure 4.10 is written 5’ CGCATCGC 3’ in the EST data. It must therefore be converted to 5’ GCGATGCG 3’ (its reverse complement) before the sequence can be trained correctly on the HMM, which represents the sense strand.

5' G C G A T G C G 3' sense strand (mRNA)
   | | | | | | | |
3' C G C T A C G C 5' anti-sense strand

Figure 4.10. A 5’read EST represents the sense strand and a 3’read EST represents the anti-sense strand.
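The conversion of a 3’read to the sense strand amounts to taking the reverse complement of the sequence. A minimal sketch in Python is given below; the function name is chosen for this illustration, and treating N as its own complement is an assumption.

```python
# Translation table for complementary nucleotides; N is kept as N (assumption).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a nucleotide sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


# The 3'read from figure 4.10 converted to the sense strand.
print(reverse_complement("CGCATCGC"))  # GCGATGCG
```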

When the parameters of the HMM are estimated with the Baum-Welch algorithm, the iterative training stops automatically when the difference in log-likelihood of the HMM between two consecutive iterations is smaller than 0.001 ∗ log(P), where log(P) is obtained through equation 3.11. This stopping criterion is defined by the GHMM library.
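The stopping criterion can be illustrated with a small sketch. It is not the GHMM implementation: the function below is hypothetical, and it assumes that the absolute difference is compared with 0.001 times the absolute value of the current log-likelihood, since log(P) itself is negative.

```python
def has_converged(prev_log_p: float, curr_log_p: float, rel_tol: float = 0.001) -> bool:
    """Relative stopping rule on the log-likelihood of two consecutive iterations.

    Assumption: the threshold is rel_tol * |log(P)| of the current iteration.
    """
    return abs(curr_log_p - prev_log_p) < rel_tol * abs(curr_log_p)


print(has_converged(-1000.0, -999.5))  # small improvement: training would stop
print(has_converged(-1000.0, -900.0))  # large improvement: training would continue
```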

When training of the HMM is done, each sequence in the data is decoded by applying the 1-best algorithm to the trained HMM. We want to know which part of the mRNA an EST sequence is derived from. Suppose we have an EST sequence such as 5’ ATGCGCCCCTGA 3’. The decoding of the sequence may then look as follows: Start-Start-Start-CDS-CDS-CDS-CDS-CDS-CDS-Stop-Stop-Stop. The decoding shows that the EST sequence contains a CDS region. The remaining EST sequences are decoded in the same way, and the designed software creates a new file containing the ESTs with a predicted CDS.
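How a decoded label path can be turned into a decision about CDS content is sketched below. The helper functions are hypothetical and only illustrate the idea; the label names follow the decoding example above.

```python
from itertools import groupby


def summarize_path(labels):
    """Collapse a label path into (label, run length) pairs in order of appearance."""
    return [(label, len(list(run))) for label, run in groupby(labels)]


def has_cds(labels):
    """True if at least one position of the sequence is decoded as CDS."""
    return "CDS" in labels


# Decoding of 5' ATGCGCCCCTGA 3' from the example above.
path = "Start-Start-Start-CDS-CDS-CDS-CDS-CDS-CDS-Stop-Stop-Stop".split("-")

print(summarize_path(path))  # [('Start', 3), ('CDS', 6), ('Stop', 3)]
print(has_cds(path))         # True
```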

BLAST is a program that aligns query sequences against a database (see figure 4.11(a)). Each EST sequence in the training data is investigated by blasting it against a database of mRNA sequences. One such database is RefSeq RNA (reference mRNA sequences), see figure 4.11(b). The search is restricted further by aligning only to mRNA derived from the UniGene database, because the EST data come from UniGene and the mRNAs present in the EST data are identified through UniGene. This is done by adding a filter in "Entrez Query". To increase the proportion of true hits, the "Expect threshold", which indicates how often a match is expected to occur by chance, is lowered to 0.001 (see figure 4.11(c)). It is also desirable to reduce the maximum number of matches in a query range to one, because EST sequences are generally short and partial, and a distinct sequence region (e.g. the CDS) may otherwise be obscured by numerous matches to another region of the same sequence. All other BLAST settings are left at their defaults. When the results are presented, the hits are reformatted by changing the "formatting options" to show CDS information for each query match (see figure 4.11(d)). It is then possible to see which parts of the EST sequence belong to the CDS region, see figure 4.12. Since thousands of sequences are aligned against the database, reviewing each sequence manually would take too much time, so all hits for all sequences are instead downloaded as a file. I have created a program that counts the number of EST sequences with CDS in that file and can also filter them out to a new file (see next section).

All sequences are in FASTA format. Each entry consists of a description part and the actual sequence. The description part begins with the character ">". The sequence is written on a new line after the description, see figure 4.13.
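A FASTA file of this kind can be read with a few lines of Python. The sketch below is a minimal reader written for illustration, not the loader used in the designed software; the example records are invented.

```python
def parse_fasta(lines):
    """Parse FASTA lines into a list of (description, sequence) tuples."""
    records = []
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Invented example input; a sequence may span several lines.
example = [">est_1 example 5' read", "ATGCGC", "CCCTGA", "", ">est_2", "gcgatgcg"]
records = parse_fasta(example)
print(records)
```

To read from disk, the function can be given an open file object, since it only iterates over lines.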

4.4 The designed software

Two programs have been created: one that is used to obtain information about blasted sequences, and the main program, which can train and decode an HMM. Both are written in Python, and the lists below describe them in detail.

HMM program

1. Load EST-data: Load a file in FASTA format before training an HMM.

2. Train HMM: Train an HMM with the loaded file.

3. Decode and create new file: Create a new file containing the ESTs with predicted CDS from the loaded file.

4. Sample sequences: Sample sequences from an HMM. The generated sequences are saved in a separate file.

5. Test model differences: Test model differences according to their parameters by summing all the parameter differences.

BLAST software

The BLAST software extracts the following from a file containing information about alignment hits against a database. The file that was blasted is the one containing the EST sequences with predicted CDS from the HMM software. The software is hence used for evaluating the predicted EST sequences:

1. Total number of identified EST sequences

2. Total number of EST sequences with corresponding CDS

3. Total number of EST sequences with no CDS

4. Create new file with EST that contains CDS

5. Create new file containing 1-best probabilities for each true hit

6. Create new file containing 1-best probabilities for each false hit

7. Create new file containing lengths for each true hit

8. Create new file containing lengths for each false hit

9. Create new file containing CDS lengths for each true hit

10. Create new file containing CDS lengths for each false hit

11. Evaluate specificity for the HMM

12. Evaluate sensitivity for the HMM


4.5 Methods of working and tests

The HMM will be tested in several ways, and the results are reported in terms of the model’s specificity and sensitivity.

Specificity indicates how well the model correctly classifies sequences without CDS, i.e. true predicted negative sequences. The specificity is obtained from the equation below:

$$\text{Specificity} = \frac{\text{true predicted negative sequences}}{\text{total number of true negative sequences}} \qquad (4.1)$$

Sensitivity indicates the ability of the model to correctly classify sequences with CDS, i.e. true predicted positive sequences. The sensitivity is obtained from the equation below:

$$\text{Sensitivity} = \frac{\text{true predicted positive sequences}}{\text{total number of true positive sequences}} \qquad (4.2)$$

A high sensitivity does not necessarily imply a high specificity; more often, high specificity comes with lower sensitivity and vice versa. A good model should not behave that way, and should instead show no such inverse relationship between sensitivity and specificity.
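Equations 4.1 and 4.2 can be computed directly from counts of correctly and incorrectly classified sequences. The sketch below is an illustration with invented counts; the function and parameter names are not taken from the designed software.

```python
def specificity(true_negatives: int, false_positives: int) -> float:
    """Fraction of the sequences without CDS that are predicted to lack CDS."""
    total_negatives = true_negatives + false_positives
    return true_negatives / total_negatives if total_negatives else float("nan")


def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of the sequences with CDS that are predicted to contain CDS."""
    total_positives = true_positives + false_negatives
    return true_positives / total_positives if total_positives else float("nan")


# Invented counts: 40 sequences without CDS, 50 sequences with CDS.
print(100 * specificity(true_negatives=30, false_positives=10))  # 75.0
print(100 * sensitivity(true_positives=45, false_negatives=5))   # 90.0
```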

The following tests will be made on the HMM (with Homo sapiens EST data), and for each test its effect on the model’s specificity and sensitivity is considered:

1. Changing the emission order of the states in CDS.

2. Equalize all the parameters in the model, i.e. a null model.

3. Remove the implementation of forbidden stop codons in the CDS.

4. Increase the number of states in each region by one at the same time, except for the CDS model, which is increased by three states corresponding to one additional codon. Each expansion must still follow the negative binomial distributions described above, so changing the number of states requires changing the transition probabilities as well. If the results improve, continue to expand the model.

Once an HMM giving the best sensitivity and specificity has been found, a multivariate and regression analysis will be performed on the results to look for correlations between distinct variables. The variables are listed in table 4.10.

A further test examines how the model’s training performance depends on the number of sequences in the training set. Model M0 is trained with a training data set, and the trained HMM is called M1. M1 then generates sets of 500, 2000, and 5000...20000 sequences, 10 times each. The generated sequences are used as training data to train the initial M0 again, giving the models M500, M2000, and M5000...M20000. The estimated parameters of each model trained on generated sequences are compared with M1 by summing the parameter differences. The smaller the sum, the more similar the models. A good model should give similar parameters across these models.
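The comparison of two models by summing their parameter differences can be sketched as follows. The sketch assumes that the parameters of each model (transition and emission probabilities) are flattened into lists in the same order and that absolute differences are summed; both are assumptions made for this illustration.

```python
def parameter_distance(params_a, params_b):
    """Sum of absolute differences between two equally ordered parameter lists."""
    if len(params_a) != len(params_b):
        raise ValueError("models must have the same number of parameters")
    return sum(abs(a - b) for a, b in zip(params_a, params_b))


# Invented parameter vectors of two small models.
m1 = [0.5, 0.5, 0.9, 0.1]
m500 = [0.4, 0.6, 0.8, 0.2]

print(parameter_distance(m1, m1))    # identical models give 0
print(parameter_distance(m1, m500))  # larger values mean less similar models
```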

Finally, the HMM’s ability to adapt to other eukaryotic organisms will be tested.


Table 4.5. Emission probabilities for each state in CDS HMM.

State      Row   A        G          C          T          N
State 12         0.003    0.99       0.003      0.003      0.001
State 13         0.2497   0.2497     0.2497     0.2497     0.001
State 14         0.2      0.2995     0.2995     0.2        0.001
State 15   1     0.2497   0.2497     0.2497     0.2497     0.001
           ...   ...      ...        ...        ...        ...
           25    0.2497   0.2497     0.2497     0.2497     0.001
State 16   1     0.2497   0.2497     0.2497     0.2497     0.001
           ...   ...      ...        ...        ...        ...
           25    0.2497   0.2497     0.2497     0.2497     0.001
State 17   1     0.2      0.2995     0.2995     0.2        0.001
           ...   ...      ...        ...        ...        ...
           15    0.2      0.2995     0.2995     0.2        0.001
           16    0.0      0.0        0.5984016  0.3996004  0.001998002
           17    0.2      0.2995     0.2995     0.2        0.001
           18    0.0      0.374375   0.374375   0.25       0.00125
           19    0.2      0.2995     0.2995     0.2        0.001
           ...   ...      ...        ...        ...        ...
           25    0.2      0.2995     0.2995     0.2        0.001
State 18   1     0.2497   0.2497     0.2497     0.2497     0.001
           ...   ...      ...        ...        ...        ...
           25    0.2497   0.2497     0.2497     0.2497     0.001
State 19   1     0.2497   0.2497     0.2497     0.2497     0.001
           ...   ...      ...        ...        ...        ...
           25    0.2497   0.2497     0.2497     0.2497     0.001
State 20   1     0.2      0.2995     0.2995     0.2        0.001
           ...   ...      ...        ...        ...        ...
           15    0.2      0.2995     0.2995     0.2        0.001
           16    0.0      0.0        0.5984016  0.3996004  0.001998002
           17    0.2      0.2995     0.2995     0.2        0.001
           18    0.0      0.374375   0.374375   0.25       0.00125
           19    0.2      0.2995     0.2995     0.2        0.001
           ...   ...      ...        ...        ...        ...
           25    0.2      0.2995     0.2995     0.2        0.001


Table 4.6. Emission probabilities for each state in stop codon HMM. States 21-23 represent TAA, 24-26 represent TAG and 27-29 represent TGA.

State      A        G        C        T        N
State 21   0.0003   0.0003   0.0003   0.999    0.0001
State 22   0.999    0.0003   0.0003   0.0003   0.0001
State 23   0.999    0.0003   0.0003   0.0003   0.0001
State 24   0.0003   0.0003   0.0003   0.999    0.0001
State 25   0.999    0.0003   0.0003   0.0003   0.0001
State 26   0.0003   0.999    0.0003   0.0003   0.0001
State 27   0.0003   0.0003   0.0003   0.999    0.0001
State 28   0.0003   0.999    0.0003   0.0003   0.0001
State 29   0.999    0.0003   0.0003   0.0003   0.0001

Table 4.7. Emission probabilities for each state in 3’UTR.

State      A        G        C        T        N
State 30   0.2995   0.2      0.2      0.2995   0.001
State 31   0.2995   0.2      0.2      0.2995   0.001
State 32   0.2995   0.2      0.2      0.2995   0.001
State 33   0.999    0.0003   0.0003   0.0003   0.0001
State 34   0.999    0.0003   0.0003   0.0003   0.0001
State 35   0.0003   0.0003   0.0003   0.999    0.0001
State 36   0.999    0.0003   0.0003   0.0003   0.0001
State 37   0.999    0.0003   0.0003   0.0003   0.0001
State 38   0.999    0.0003   0.0003   0.0003   0.0001
State 39   0.2995   0.2      0.2      0.2995   0.001
State 40   0.999    0.0003   0.0003   0.0003   0.0001
State 41   0.999    0.0003   0.0003   0.0003   0.0001


Table 4.8. cDNA libraries from UniGene.

cDNA library   Organ                  Total sequences   Total different mRNAs
515            Cervix                 502               135
7364           Bladder tumor          515               157
8287           Nervous tumor          503               186
6137           Amnion normal          507               216
18390          Cord blood             506               279
12639          Mandible               2018              510
6815           Pheochromocytoma       2460              948
18407          Blood                  2074              1004
10310          Stomach                2020              1042
8613           Blood                  2114              1591
16395          Kidney                 5178              1691
8975           Blood                  5182              1701
6992           Teratocarcinoma        5186              1628
742            Cerebellum             5235              2646
16443          Uterus                 5146              2309
13039          Embryonic stem cells   10055             1132
5383           Bladder                10515             2456
18302          Adrenal gland          10385             2979
9724           Leukocyte              10545             4235
9725           Medulla                10788             4946

Table 4.9. cDNA libraries from UniGene.

cDNA library   Organism
14436          Acyrthosiphon pisum
23834          Caenorhabditis elegans
23325          Drosophila simulans
16895          Equus caballus
23730          Cryptococcus neoformans
21849          Ixodes scapularis
14248          Silurana tropicalis


Figure 4.11. BLAST web interfaces. (a) One can choose to upload a file or type in sequences manually. (b) Database environment: match against mRNA sequences and match against UniGene. (c) Parameters for hit accuracy. (d) Retrieve CDS information.


Figure 4.12. CDS information and localization for a match.

Figure 4.13. FASTA format.


Table 4.10. Variables used in regression analysis.

Variables
Number of mRNA
Identified EST through BLAST
Number of EST with CDS
Number of EST without CDS
Number of 5’Read EST
Number of 3’Read EST
Number of EST with unknown nucleotides
Mean length
Number of not identified EST

Chapter 5

Result and discussion

5.1 Designed model

Table 5.1 and figure 5.1 show the BLAST results for the designed model and for ESTScan. A good model should have both high specificity and high sensitivity. The unsupervised HMM has a fairly high sensitivity but very poor specificity. This means that the model has problems distinguishing sequences with CDS from sequences without CDS, unlike ESTScan, which shows more even results for both sensitivity and specificity.

Table 5.1. Results from BLAST. *: first five libraries ignored.

cDNA library   Specificity   Sensitivity   ESTScan: specificity   ESTScan: sensitivity
515            28.75         65.38         95.24                  66.22
7364           36.11         80.32         86.11                  72.29
8287           90.63         18.63         78.13                  85.95
6137           67.74         84.48         82.26                  92.24
18390          79.17         81.46         83.33                  97.35
12639          90.36         66.00         97.79                  78.16
6815           61.47         100.00        94.30                  88.12
18407          59.09         99.73         71.82                  97.31
10310          47.55         90.09         84.31                  95.22
8613           24.52         96.94         92.18                  83.51
16395          44.84         98.02         51.33                  98.16
8975           24.36         89.56         95.25                  84.16
6992           49.75         83.99         89.09                  89.17
742            46.34         79.56         89.17                  87.13
16443          69.73         97.78         90.33                  94.61
13039          59.62         98.46         85.33                  95.66
5383           65.86         95.10         89.56                  87.86
18302          58.84         96.06         79.57                  95.15
9724           35.49         97.17         81.67                  91.70
9725           34.97         97.73         77.51                  91.95
Mean           53.76         85.82         84.71                  88.60
*Mean          51.52         92.41         84.61                  90.52


Table 5.2. Comparing mean values for respective number of sequences.

Number of sequences   Mean sensitivity   Mean specificity

Unsupervised HMM
500                   60.48              66.05
2000                  56.60              90.56
5000                  47.00              89.78
10000                 50.96              96.9

ESTScan
500                   82.81              86.57
2000                  88.46              92.24
5000                  90.65              89.95
10000                 92.46              91.68

Table 5.2 summarizes how specificity and sensitivity vary with the number of sequences used for training. For ESTScan there is no significant difference. The specificity seems to increase with the number of sequences, but this is just a coincidence, because the HMM in ESTScan is trained independently on annotated sequence data.

For the unsupervised model, some differences can be seen. Data sets that contain approximately 500 sequences appear to differ from the rest. This may indicate that 500 sequences are insufficient to train the model. The model needs more than 2000 sequences for training in order to adapt well enough to the data.

5.2 Effects of parameter changes

The effect of the parameters on the model has been studied by changing the parameter values. Only a subset of all cDNA libraries is chosen for testing, partly because blasting them all over the Internet to obtain the BLAST results takes time. The subset should nevertheless be representative of the remaining libraries; if different subsets gave different results, it would indicate that the model behaves more or less randomly.

The libraries are chosen strategically in the sense that they contain many sequences without CDS (see appendix A). The task is to check how well the model distinguishes between sequences with and without CDS. One library is selected from each category (a category being the number of sequences in a data set).

5.2.1 Emission order

As expected, an HMM with 0-order emissions gives the worst results, as can be seen in table 5.3: both specificity and sensitivity are low. For libraries 515 and 12639 the model could not find any sequence with CDS, although they contain 78 and 403 sequences with CDS, respectively (see appendix A). 0-order emissions have no "memory" of previous emissions, so training and decoding of the model are more arbitrary (the model is, however, still of first order with respect to the states).

Figure 5.1. Differences between HMM and ESTScan according to specificity and sensitivity. Panels (a) and (b) plot the values in % (y-axis) per cDNA library (x-axis) for the HMM and ESTScan.

When the emission order increases, the sensitivity increases as well. The specificity is very low at order one but increases slowly with higher emission order. One might ask why the sensitivity is higher than the specificity when the order is increased. One reason could be that higher-order emissions provide more parameters in the CDS part of the HMM. This affects decoding in the sense that there are more probability combinations to choose from in the CDS part than in the rest of the HMM, which may favour 1-best decoding paths through the CDS part as those with the highest probability. However, a higher order also requires a larger training set, and if the training set is not large enough the sensitivity slowly decreases. The resulting sensitivity and specificity are thus a compromise between these two factors.

Another very important aspect is the use of stop codon restrictions in the CDS model. The specificity decreases dramatically when order two is used without any stop codon restrictions. Decoding a sequence from library 5383 with the 1-best algorithm (see figure 5.2(a)) shows that the three different stop codons are decoded as CDS multiple times, which shows that the model has problems training and decoding correctly without the proposed restrictions. One reason for this behaviour is that a sequence is permitted to start and stop anywhere in the HMM; it is therefore important to forbid stop codons in the emission parameters. Figure 5.2(b) illustrates a decoding when the CDS model has the restrictions: the first occurrence of a stop codon (the TGA marked in red) is now decoded as a stop codon, as it should be. The CDS in the sequence is nevertheless not predicted correctly, because it is too short (according to BLAST, the first occurrence of ATG is a true start codon, giving a different, longer CDS region). The crucial point, however, was to show the importance of restricting stop codons in the CDS model.

Table 5.3. Results given with different orders.

cDNA library   Order         Specificity   Sensitivity
515            0             0             0
12639          0             0             0
742            0             46.68         79.17
5383           0             75.4          68.21
Mean                         30.52         36.85

515            1             1.25          93.59
12639          1             12.02         97.77
742            1             0.64          99.22
5383           1             14.55         98.9
Mean                         7.12          97.37

515            2             28.75         65.38
12639          2             90.36         66.00
742            2             46.34         79.56
5383           2             65.86         95.10
Mean                         57.83         76.51

515            2 (no stop)   1.88          91.03
12639          2 (no stop)   64.81         95.29
742            2 (no stop)   0.76          99.18
5383           2 (no stop)   24.78         90.30
Mean                         23.06         93.95

515            3             33.13         57.69
12639          3             86.30         65.76
742            3             44.95         83.12
5383           3             73.62         81.00
Mean                         59.5          71.9

Increasing the order further does not give significantly better specificity or sensitivity. In fact, the mean specificity increases by a negligible 1.67 percentage points and the mean sensitivity decreases by 4.61 percentage points. This is due to the increased number of emission parameters: more parameters need a larger training set to recognize sequences with CDS (note that stop codons are restricted in the emission parameters for order 3 as well). The 2-order emission model is therefore retained.

Figure 5.2. (a) No restriction of stop codons in the model. (b) Restrictions of stop codons in the model. Green mark: start codon. Yellow mark: CDS. Red mark: stop codon. No mark: UTR.

5.2.2 Null model

Table 5.4 clearly shows how important it is to specify the probabilities: a null model with unspecified or equalized probabilities gives disastrous results. Unless the transition and emission probabilities are initialized to give a biological representation, the model cannot know how it should be trained and decoded, since training and decoding can then occur anywhere in the model.

Just as in the previous section on emission order, the results show high sensitivity because of the number of possible probability combinations in the CDS part of the HMM.

5.2.3 HMM expanding

The model was expanded by adding a state to each region, i.e. an additional state in the 5’UTR, three new states in the CDS (representing an additional codon), a state in the 3’UTR and a new state in the poly(A) tail. However, an additional state in the intermediate region between the poly(A) signal and the poly(A) tail is not needed, because it would give that region a negative binomial length distribution and risk modelling incorrectly long lengths at that position. Only one state is therefore used there.

Table 5.4. Results given with a null model with equalized probabilities and no fixed emissions.

cDNA library   Null model: specificity   Null model: sensitivity
515            0                         100
12639          6.81                      75.68
742            0                         99.96
5383           33.02                     94.57
Mean           9.96                      92.55

The expanded model should still preserve the average length of each region, i.e. the maximum of the negative binomial distribution is still located at the same average length as before. Figure 5.3 illustrates how the distribution changes as states are added, in this case for the 5’UTR. The probability at the average length increases with the number of states, as the distribution narrows around the average length.

Figure 5.3. Negative binomial distributions with different numbers of states for the 5’UTR. Blue line: 2 states, p = 0.995. Red line: 3 states, p = 0.99. Green line: 4 states, p = 0.985. Pink line: 5 states, p = 0.98.
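The effect of adding states on the length distribution can be reproduced numerically. The sketch below assumes that a region is a chain of n states, each with the same self-transition probability p and exit probability 1 - p, so that the region length follows a negative binomial distribution; the values of n and p are taken from the caption of figure 5.3, while the function names are chosen for this illustration.

```python
from math import comb


def length_pmf(length: int, n_states: int, p_stay: float) -> float:
    """P(region length = length) for a chain of n_states states with self-loop p_stay.

    The sequence spends at least one position in each state, so lengths below
    n_states are impossible.
    """
    if length < n_states:
        return 0.0
    extra = length - n_states  # number of self-transitions taken in total
    return comb(length - 1, n_states - 1) * p_stay ** extra * (1.0 - p_stay) ** n_states


def most_probable_length(n_states: int, p_stay: float, max_length: int = 2000) -> int:
    """Length with the highest probability, searched over 1..max_length."""
    return max(range(1, max_length + 1), key=lambda n: length_pmf(n, n_states, p_stay))


for n_states, p_stay in [(2, 0.995), (3, 0.99), (4, 0.985), (5, 0.98)]:
    peak = most_probable_length(n_states, p_stay)
    print(n_states, p_stay, peak, length_pmf(peak, n_states, p_stay))
```

Under these assumptions the four parameter settings share the same most probable length, about 200, while the probability at that length grows as states are added.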

Table 5.5 shows a small deterioration of both specificity and sensitivity compared with the original model. Expanding the model in this way hence provides no improvement. There are other ways of expanding the model, for example by adding a state to one region at a time rather than to all regions simultaneously. One could then continue to test new structure expansions, but this manual way of expanding the HMM is too cumbersome, takes far too long between tests, and makes the expanded HMM more dependent on the data used for training. Ideally one would want to find the optimal HMM structure automatically via some clever training algorithm, but the GHMM library used did not have such an algorithm, and the original HMM structure is therefore retained.

Table 5.5. Results when adding states to each region of HMM.

cDNA library   Specificity   Sensitivity
515            25.00         62.82
12639          90.54         67.00
742            48.82         81.87
5383           65.43         92.05
Mean           57.45         75.94

5.3 Comparing the lengths and 1-best probabilities for true and false matches

Here I try to identify differences between true hits and false hits by looking at histograms of their 1-best probability, length, and predicted CDS length. cDNA libraries 6815 and 742 have been chosen as examples because of their specificity and sensitivity results: 6815 has the maximum sensitivity and 742 has low values of both specificity and sensitivity (see figure 5.1), i.e. one library with a good result and one with a bad result. The libraries with approximately 500 sequences are ignored, since the results of the HMM trained on them are not trustworthy due to the low number of sequences.

Figure 5.4 shows the histograms of 1-best probabilities for the two libraries. There seems to be no difference between true hits and false hits; the histograms are almost identical in each case. Furthermore, figure 5.5 shows no noticeable difference in sequence length either. The average lengths are almost the same for true and false hits in each cDNA library, which may explain why the histograms of 1-best probabilities are indistinguishable. The 1-best probability depends more or less on the length of a sequence: short sequences give high 1-best probabilities while longer sequences give lower ones, because the 1-best algorithm multiplies probabilities along the sequence when calculating the probability. The 1-best probability therefore does not separate true hits from false hits.

However, a noticeable dissimilarity between the CDS lengths of true and false hits can be seen in figure 5.6. Predicted sequences with a short CDS are less likely to be true hits than sequences with a longer CDS. Most CDS lengths of the false matches are under 200, with averages of 124 and 143 for the two tested libraries (see figure 5.6). The average CDS lengths of all libraries are listed in appendix A. This behaviour can be seen for all libraries: for most false hits, the CDS is simply too short. There is thus an apparent correlation between false hits and CDS length. A minimum CDS length is therefore required during decoding in order to accept a sequence as a match. The overall average CDS length over all libraries is 182, which is used as the minimum CDS length in the final testing.
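The minimum CDS length restriction can be expressed as a simple filter on the decoded label path. The sketch below is an illustration only: the function names are hypothetical, and it assumes that the CDS length is the number of positions decoded as CDS, not counting start and stop codon positions.

```python
MIN_CDS_LENGTH = 182  # overall average CDS length of all libraries


def cds_length(labels) -> int:
    """Number of positions in a decoded path that are labelled CDS (assumption)."""
    return sum(1 for label in labels if label == "CDS")


def accept_as_match(labels, min_cds_length: int = MIN_CDS_LENGTH) -> bool:
    """Accept a sequence only if its predicted CDS is long enough."""
    return cds_length(labels) >= min_cds_length


# Invented decoded paths: one with a long CDS and one with a short CDS.
long_path = ["5U"] * 20 + ["Start"] * 3 + ["CDS"] * 300
short_path = ["Start"] * 3 + ["CDS"] * 120 + ["Stop"] * 3 + ["3U"] * 50

print(accept_as_match(long_path))   # True
print(accept_as_match(short_path))  # False
```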

Figure 5.4. Histogram of 1-best log-probability for hits. (a) cDNA library 6815: true hits. (b) cDNA library 6815: false hits. (c) cDNA library 742: true hits. (d) cDNA library 742: false hits.

5.4 Multivariate analysis

Figure 5.7 shows the final results of the multivariate analysis. Initially, all the variables from table 4.10 were included in the multivariate model. Variables with a significance value greater than 0.05 were then removed step by step, i.e. a variable is not significant at the 5% level if its significance value is above 0.05. It turns out that the only significant variable is the mean length, which of the chosen explanatory variables best describes the model’s behaviour. Figure 5.8 shows the effect of mean length on sensitivity and specificity through the estimated B parameter. The B parameter is positive for both sensitivity and specificity, which means that the longer the average length of the EST sequences in the data, the better the sensitivity and specificity. However, specificity increases very little with increasing average length, so mean length is not very important for specificity, i.e. there is no strong correlation between mean length and specificity. The residual plot can be seen in figure 5.9(a). The residuals show no clear pattern and seem to be randomly distributed, suggesting that the regression model with mean length as explanatory variable and sensitivity as response variable is acceptable. The residuals also appear to be normally distributed, as can be seen in figures 5.9(b) and 5.9(c). The Shapiro-Wilk test supports this, since its significance value is greater than 0.05.

Figure 5.5. Histogram of length for hits. (a) cDNA library 6815: true hits. Mean length: 566. (b) cDNA library 6815: false hits. Mean length: 558. (c) cDNA library 742: true hits. Mean length: 314. (d) cDNA library 742: false hits. Mean length: 317.


Figure 5.6. Histogram of CDS length for hits. (a) cDNA library 6815: true hits, mean length 327. (b) cDNA library 6815: false hits, mean length 124. (c) cDNA library 742: true hits, mean length 219. (d) cDNA library 742: false hits, mean length 143.

Figure 5.7. Multivariate analysis results. The mean length variable is significant at the 5% level.

A value below 0.05 would suggest that the residuals are not normally distributed. However, the adjusted R-square value (from a simple regression analysis with sensitivity as the only response variable and mean length as explanatory variable) is only 45.3%, which means that mean length explains only 45.3% of the variance in sensitivity. This is a fairly low value and indicates the need for other explanatory variables in the regression model that could explain the variance better. The effect of mean length should therefore be taken with a grain of salt; further analysis of the model is required to understand the variance of specificity and sensitivity. The specificity did not show any correlation with the explanatory variables chosen in table 4.9. However, a multiple regression analysis showed a weakly significant correlation with mean length and number of mRNA. The estimated B coefficient was positive for mean length but negative for number of mRNA, meaning that an increasing number of mRNA decreases the specificity. Increasing the mean length has the same effect as for sensitivity: it increases specificity as well (as the multivariate analysis indicated). The adjusted R-square value was only about 20%, which is considered far too low: 80% of the variance in specificity is still unexplained. Even if the number of mRNA is significant, nothing can be done about it; it only indicates that too many mRNA sequences can be troublesome for the model’s specificity. However, in previous testing I found that the length of the CDS has a great impact on specificity. I did not choose CDS length as an explanatory variable, because I wanted to focus on variables that are known before training the model.
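As a rough illustration of the adjusted R-square figures quoted above, the following Python sketch fits a least-squares line with one explanatory variable and computes R-square and adjusted R-square. The function name `simple_regression` and the example pairs are assumptions made for illustration: the pairs combine mean lengths from appendix A with sensitivities from table 5.6 for eight libraries, which is not the data behind the 45.3% value, so the output is not expected to reproduce it.

```python
def simple_regression(x, y):
    """Least-squares fit y = a + b*x; returns (a, b, r2, adjusted r2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx                      # slope (the "B" coefficient)
    a = mean_y - b * mean_x            # intercept
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot
    p = 1                              # number of explanatory variables
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return a, b, r2, r2_adj

# Illustrative pairs only (assumption): mean length from appendix A and
# sensitivity from table 5.6 for the first eight libraries.
mean_length = [218, 301, 329, 364, 550, 467, 563, 564]
sensitivity = [20.51, 18.88, 13.07, 61.79, 71.52, 53.35, 81.52, 95.15]

a, b, r2, r2_adj = simple_regression(mean_length, sensitivity)
print(f"B = {b:.4f}, R2 = {r2:.3f}, adjusted R2 = {r2_adj:.3f}")
```

The adjusted value penalizes R-square for the number of explanatory variables, which is why it is the figure usually reported when variables are added to or removed from a model.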

Given these results, only sequences of at least 182 nucleotides will be used when training the model, simply because a training sequence should be able to cover the minimum CDS length (which is 182 nucleotides).
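The length limit could be applied to a FASTA file before training along the following lines. This is a minimal sketch, not the thesis software: the helper names `read_fasta` and `filter_by_length` and the example records are hypothetical, and only the threshold of 182 nucleotides comes from the text.

```python
MIN_LENGTH = 182  # minimum length stated in the text

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def filter_by_length(records, min_length=MIN_LENGTH):
    """Keep only records whose sequence has at least min_length nucleotides."""
    return [(h, s) for h, s in records if len(s) >= min_length]

# Hypothetical records: one of 40 nt and one of 220 nt split over two lines
example = [">est_short", "ACGT" * 10, ">est_long", "ACGT" * 50, "ACGT" * 5]
kept = filter_by_length(read_fasta(example))
print([(h, len(s)) for h, s in kept])
```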

Figure 5.8. Mean length effect on sensitivity and specificity.

5.5 Final results

After introducing the requirement that training sequences must be at least 182 nucleotides long and that predicted sequences must have a CDS length of at least 182 nucleotides, the model was tested once again with all the libraries. The results are collected in table 5.6. There is a significant increase in specificity, but a decrease in sensitivity because several sequences with a CDS shorter than 182 nucleotides were ignored. If the first five libraries are ignored, because of their short average lengths compared to the rest, the model is better balanced overall, with little difference between specificity and sensitivity. The mean values of specificity and sensitivity are now fairly high, around 80% each (see *Mean in table 5.6).

Table 5.6. Results via BLAST. *: first five libraries ignored.

cDNA library   Specificity (%)   Sensitivity (%)
515            80.63             20.51
7364           91.67             18.88
8287           98.44             13.07
6137           85.48             61.79
18390          87.50             71.52
12639          97.52             53.35
6815           92.49             81.52
18407          80.91             95.15
10310          88.73             72.54
8613           70.34             73.85
16395          52.51             95.56
8975           88.02             53.25
6992           73.04             82.60
742            83.60             53.49
16443          88.91             91.23
13039          86.18             87.51
5383           91.81             61.63
18302          80.30             90.52
9724           67.75             83.97
9725           72.38             86.2
Mean           82.91             67.407
*Mean          80.96             77.49
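The two mean rows of table 5.6 can be recomputed from the per-library values as a consistency check. The sketch below is only an illustration; the variable and function names are hypothetical, and the values are copied from the table.

```python
# (library, specificity %, sensitivity %) as listed in table 5.6
table_5_6 = [
    ("515", 80.63, 20.51), ("7364", 91.67, 18.88), ("8287", 98.44, 13.07),
    ("6137", 85.48, 61.79), ("18390", 87.50, 71.52), ("12639", 97.52, 53.35),
    ("6815", 92.49, 81.52), ("18407", 80.91, 95.15), ("10310", 88.73, 72.54),
    ("8613", 70.34, 73.85), ("16395", 52.51, 95.56), ("8975", 88.02, 53.25),
    ("6992", 73.04, 82.60), ("742", 83.60, 53.49), ("16443", 88.91, 91.23),
    ("13039", 86.18, 87.51), ("5383", 91.81, 61.63), ("18302", 80.30, 90.52),
    ("9724", 67.75, 83.97), ("9725", 72.38, 86.2),
]

def column_means(rows):
    """Return (mean specificity, mean sensitivity) over the given rows."""
    n = len(rows)
    return sum(r[1] for r in rows) / n, sum(r[2] for r in rows) / n

mean_all = column_means(table_5_6)
mean_without_first_five = column_means(table_5_6[5:])  # the *Mean row
print("Mean : %.3f %.3f" % mean_all)
print("*Mean: %.3f %.3f" % mean_without_first_five)
```

The output should agree with the Mean and *Mean rows up to rounding in the last digit.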

Library 18407 was chosen as training data for an HMM because of its results, with high sensitivity and specificity. After training, different numbers of sequences were generated from the trained HMM and used in turn to train a new HMM, in order to examine how the parameters differ between the HMM trained with library 18407 and the HMM trained with the generated sequences. The average length of the generated sequences was equal to that of library 18407, which is 564 nucleotides. It turns out that the difference between the models’ parameters decreases with the number of generated sequences, as it should, i.e. the training is not random (see figure 5.10). If the training were random, the difference in parameters would not decrease with the number of generated sequences but would vary independently of it. With library 18407 as training data, the model needed 6000 generated sequences before the variation and difference of the parameters became steadier. On this basis the conclusion is that the more parameters a model consists of, the more generated sequences are needed to get close to the original model. Details of the results can be seen in appendix B.
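The idea behind this generate-and-retrain test can be sketched on a much smaller model. In the Python sketch below, a toy first-order Markov chain over A, C, G and T stands in for the HMM, counting transitions stands in for Baum-Welch training, and the sum of absolute parameter differences is an assumed measure of model difference; none of these choices is taken from the thesis software.

```python
import random

STATES = "ACGT"

def random_chain(rng):
    """Random first-order transition matrix over A, C, G, T."""
    chain = {}
    for s in STATES:
        weights = [rng.random() + 0.1 for _ in STATES]
        total = sum(weights)
        chain[s] = {t: w / total for t, w in zip(STATES, weights)}
    return chain

def generate(chain, length, rng):
    """Sample one sequence of the given length from the chain."""
    seq = [rng.choice(STATES)]
    for _ in range(length - 1):
        row = chain[seq[-1]]
        seq.append(rng.choices(STATES, weights=[row[t] for t in STATES])[0])
    return "".join(seq)

def estimate(sequences):
    """Re-estimate transition probabilities by counting (pseudocount 1)."""
    counts = {s: {t: 1.0 for t in STATES} for s in STATES}
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1.0
    return {s: {t: c / sum(row.values()) for t, c in row.items()}
            for s, row in counts.items()}

def difference(m1, m2):
    """Sum of absolute differences between all transition probabilities."""
    return sum(abs(m1[s][t] - m2[s][t]) for s in STATES for t in STATES)

rng = random.Random(0)
original = random_chain(rng)
diffs = {}
for n in (10, 100, 1000):
    sample = [generate(original, 100, rng) for _ in range(n)]
    diffs[n] = difference(original, estimate(sample))
    print(n, round(diffs[n], 4))
```

With a consistent estimator the printed difference should shrink as the number of generated sequences grows, which is the behavior described for the HMM. A model with more parameters would be expected to need more sequences for the same closeness.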

5.5.1 Lack of supervised models

Table 5.7 shows the results for selected organisms that have no corresponding supervised model. When ESTScan uses a supervised model for an organism other than the one needed, its predictions become less accurate. In this case, with Homo sapiens as the supervised model, the specificity of ESTScan fails. It has high sensitivity, but its predictions also include a lot of junk, i.e. ESTScan has problems distinguishing between ESTs with and without CDS. It is reasonable, though, that ESTScan has prediction problems given that it uses a supervised model that is not suited to the data that was tested.

Table 5.7. Results via BLAST.

Organism                  Specificity   Sensitivity   ESTScan: specificity   ESTScan: sensitivity
Acyrthosiphon pisum       87.72         79.45         75.44                  91.88
Caenorhabditis elegans    66.67         56.23         33.33                  85.00
Drosophila simulans       100           75.93         100                    95.13
Equus caballus            88.56         82.28         83.15                  94.44
Cryptococcus neoformans   97.72         40.96         30.48                  91.42
Ixodes scapularis         97.70         65.89         14.95                  93.02
Silurana tropicalis       80.36         90.90         64.08                  92.37

However, the unsupervised HMM, which can be adapted to the data through training, shows more or less unchanged performance (compared with the results in table 5.6), even though I used various eukaryotic organisms and the initial parameters are more or less suited to Homo sapiens. There is high specificity and fairly good sensitivity, with averages of 88.39% and 70.23% respectively (see figure 5.11). The specificity remains in the 80% range and the sensitivity in the 70% range, as when I applied the data containing Homo sapiens ESTs. ESTScan’s average specificity, however, has dropped from 84.61% (table 5.1) to 57.35%. This shows the potential of the unsupervised HMM when supervised models are lacking, i.e. the actual purpose of the designed unsupervised HMM is to be used when no supervised model is available.
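The averages quoted above, and the organisms for which ESTScan loses most specificity, can be derived from table 5.7 as in the following sketch. The variable names are hypothetical; the values are copied from the table.

```python
# organism: (HMM specificity, HMM sensitivity,
#            ESTScan specificity, ESTScan sensitivity), in percent
table_5_7 = {
    "Acyrthosiphon pisum": (87.72, 79.45, 75.44, 91.88),
    "Caenorhabditis elegans": (66.67, 56.23, 33.33, 85.00),
    "Drosophila simulans": (100.0, 75.93, 100.0, 95.13),
    "Equus caballus": (88.56, 82.28, 83.15, 94.44),
    "Cryptococcus neoformans": (97.72, 40.96, 30.48, 91.42),
    "Ixodes scapularis": (97.70, 65.89, 14.95, 93.02),
    "Silurana tropicalis": (80.36, 90.90, 64.08, 92.37),
}

labels = ["HMM specificity", "HMM sensitivity",
          "ESTScan specificity", "ESTScan sensitivity"]
columns = list(zip(*table_5_7.values()))
means = [sum(col) / len(col) for col in columns]
for label, value in zip(labels, means):
    print(f"{label}: {value:.2f}")

# Specificity gap per organism (unsupervised HMM minus ESTScan)
gaps = {name: round(v[0] - v[2], 2) for name, v in table_5_7.items()}
largest_gap = max(gaps, key=gaps.get)
print(largest_gap, gaps[largest_gap])
```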


Figure 5.9. Residuals. (a) Residual plot of sensitivity. (b) Residual Shapiro-Wilk normality test. (c) Residual normality plot.


Figure 5.10. Graph showing the difference in parameters between two models with increasing number of generated sequences.

Figure 5.11. Mean values (percentage) for specificity and sensitivity for the unsupervised HMM and ESTScan. Sensitivity: HMM 70.23, ESTScan 91.90. Specificity: HMM 88.39, ESTScan 57.35.

Chapter 6

Conclusion and future works

The model’s final results, with an average specificity of 80.96% and sensitivity of 77.49% when using Homo sapiens data, show that it may be possible to train an HMM with unsupervised learning and that its performance is not far from supervised performance. However, the model is still unfinished, because a few major problems remain that need to be considered and, if possible, resolved in order to increase the unsupervised performance. First and foremost, a frame-shift correction of EST sequences during training and decoding must be implemented. A careful analysis of predicted sequences and their decoding shows incorrectly positioned start codons, which lead to a different CDS than the one proposed by BLAST because of the displacement. Such errors would decrease the model’s measured accuracy considerably if they were taken into account. Unfortunately, I did not account for this error among the correctly predicted sequences with CDS, simply because no frame-shift correction is implemented. If the error were taken into account, it would be possible to calculate how many of the correctly predicted sequences with CDS also had a correctly positioned CDS according to BLAST. I did check for this error manually and found that many sequences were perfectly predicted, but I sometimes also found predicted sequences with an incorrectly positioned CDS, as can be seen in figure 6.1(a). The model predicted a shorter CDS because of a wrongly positioned start codon; instead, the first occurrence of ATG should be the start codon, as proposed by BLAST, leading in this case to a longer CDS (see figure 6.1(b)). This is a major problem, and a solution needs to be found that directs the model to predict correctly shifted CDS, for example by allowing the HMM to read three frames.
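What reading three frames and taking the first ATG means can be illustrated outside the HMM with a plain open reading frame scan. The sketch below is not the thesis model: it only scans the forward strand, the function names are hypothetical, and the example sequence is made up so that a second in-frame ATG would give a shorter CDS than the first one.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def orfs_in_frame(seq, frame):
    """Yield (start, end) of ATG..stop ORFs in one forward frame.

    `end` is exclusive and includes the stop codon; an ORF that runs off
    the end of the sequence (common in ESTs) ends at the last full codon.
    """
    start = None
    last = frame
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i              # first ATG in this frame opens the ORF
        elif start is not None and codon in STOP_CODONS:
            yield start, i + 3
            start = None
        last = i + 3
    if start is not None:
        yield start, last

def longest_orf(seq):
    """Return (frame, start, end) of the longest ORF over the three frames."""
    best = None
    for frame in range(3):
        for start, end in orfs_in_frame(seq, frame):
            if best is None or end - start > best[2] - best[1]:
                best = (frame, start, end)
    return best

# Made-up example: ATG at index 2 and a second in-frame ATG at index 8
example = "GGATGGCTATGGCATAACC"
print(longest_orf(example))
```

Starting at the second ATG (index 8) would shorten the CDS by six nucleotides, which is the kind of displacement described for figure 6.1.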

Figure 6.1. (a) Predicted by model. CDS is displaced. (b) Predicted by BLAST.

Another problem is the ability to start in a state with higher-order emission. Such states require some form of emission history, so it is not possible to start directly in such a state. Starting in the CDS model with order two requires that the first two nucleotides in the sequence are modeled by the first two states described in figure 4.9. The parameters of these states are fixed, but they are included when the probability parameters of the model are estimated during Baum-Welch and in the 1-best algorithm. The question is how these parameters influence, for example, the 1-best algorithm when it picks out the best decoding path. The first two nucleotides in the sequence may be important for choosing the correct CDS frame in the CDS model, but because those nucleotides must go through the first two states before the order-two CDS states, the result may be an incorrect CDS path.
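Why an order-two emission needs two nucleotides of history can be seen in a small scoring function. In the sketch below the first nucleotide is scored with an order-zero table and the second with an order-one table, which roughly corresponds to the role of the two start states; the function name and the uniform parameter tables are hypothetical and chosen only to make the example runnable.

```python
import math

ALPHABET = "ACGT"

def log_prob_order2(seq, p0, p1, p2):
    """Log-probability of seq under a second-order emission model.

    p0[x]        : probability of x with no history (first nucleotide)
    p1[a][x]     : probability of x after a (second nucleotide)
    p2[a + b][x] : probability of x after the pair ab (all later ones)
    """
    total = 0.0
    for i, x in enumerate(seq):
        if i == 0:
            total += math.log(p0[x])
        elif i == 1:
            total += math.log(p1[seq[0]][x])
        else:
            total += math.log(p2[seq[i - 2:i]][x])
    return total

# Hypothetical uniform parameters, only to make the sketch runnable
p0 = {x: 0.25 for x in ALPHABET}
p1 = {a: dict(p0) for a in ALPHABET}
p2 = {a + b: dict(p0) for a in ALPHABET for b in ALPHABET}

print(log_prob_order2("ATGGCC", p0, p1, p2))
```

Whatever values are fixed in the two lower-order tables enter every sequence score, which is why they can influence which decoding path is picked.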

The HMM structure is currently static. Building an HMM topology is more of an art, and finding the optimal topology manually is very difficult. It would be interesting if the model had the ability to change its topology during training. I think this would lead to better prediction results. A changed topology may also lead to new transition probabilities, i.e. transitions from one state to a more varied set of states, which would improve the length distributions so that they capture the modality of the data better than the classical negative binomial distribution.
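For reference, the classical negative binomial length distribution arises when a region is modeled by a chain of r identical states, each with self-transition probability p. The sketch below assumes this standard construction; the function names and the target mean of 300 nucleotides are hypothetical and not taken from the thesis parameters.

```python
from math import comb

def length_pmf(length, r, p):
    """P(region length = length) for r chained states, each with
    self-transition probability p (negative binomial, length >= r)."""
    if length < r:
        return 0.0
    return comb(length - 1, r - 1) * p ** (length - r) * (1 - p) ** r

def mean_length(r, p):
    """Expected region length for the chain of r states."""
    return r / (1 - p)

def self_transition_for_mean(mean, r):
    """Choose p so that r chained states give the wanted mean length."""
    return 1 - r / mean

r = 3
p = self_transition_for_mean(300, r)   # hypothetical target mean of 300 nt
total = sum(length_pmf(L, r, p) for L in range(r, 5000))
approx_mean = sum(L * length_pmf(L, r, p) for L in range(r, 5000))
print(p, mean_length(r, p), round(total, 6), round(approx_mean, 2))
```

This distribution has a single mode, so a region whose observed lengths have several modes cannot be captured by it, which is the limitation a more flexible topology would address.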

The issues discussed above should be considered in future work in this area. Unsupervised learning feels much more complicated than supervised learning because there is no annotated data, which means that more attention must be focused on the model and the algorithms applied. The classical training algorithm Baum-Welch is far too dependent on the initial parameter values and does not guarantee finding the global optimum; instead it ends up in local optima that do not always describe the data properly. This can be problematic for unsupervised learning, which gives totally different parameter estimates depending on the initial parameters. It would therefore be desirable to use, if possible, another algorithm that is less dependent on the initial parameters and that may find local optima closer to the global optimum, or simply an algorithm that guarantees the global optimum. Many difficulties must therefore be solved when using unsupervised learning before a reasonably acceptable unsupervised HMM, relative to a supervised HMM, is obtained.

However, when ESTScan or other supervised software lacks a supervised model for a specific organism, the unsupervised HMM is a potential replacement. The unsupervised HMM can be adapted to other eukaryotic organisms, and the performance of the model is more or less unchanged compared to when Homo sapiens data were used. The essential purpose of the designed software and the unsupervised HMM is therefore to be used when supervised models are lacking, adapting to the applied data through training.


Appendix A

Details about each cDNA library

cDNA library   Identified EST   EST with CDS   EST without CDS   5’ read EST   3’ read EST
515            238              78             160               502           0
7364           321              249            72                515           0
8287           370              306            64                503           0
6137           397              335            62                507           0
18390          477              453            24                497           9
12639          1534             403            1131              974           1044
6815           1796             1077           719               2460          0
18407          1967             1857           110               2045          29
10310          1919             1715           204               2020          0
8613           1783             849            934               7             2107
16395          5026             4348           678               5178          0
8975           4210             3200           1010              5182          0
6992           4935             3166           1769              2603          2583
742            4121             2548           1573              5194          41
16443          5046             4415           631               5146          0
13039          9503             9134           369               10055         0
5383           9192             7384           1808              10255         260
18302          9376             8584           792               10114         271
9724           9147             8034           1113              9639          906
9725           9329             8471           858               10362         426

cDNA library   EST with ’N’   Mean length   Not identified EST   Mean length (true hits)
515            90             218           281                  229
7364           58             301           194                  337
8287           70             329           133                  370
6137           10             364           110                  381
18390          62             550           29                   554
12639          70             467           484                  503
6815           904            563           664                  566
18407          222            564           107                  565
10310          124            526           101                  537
8613           191            472           331                  502
16395          1109           579           152                  579
8975           106            348           972                  357
6992           5146           668           251                  719
742            4650           314           1211                 314
16443          1659           571           100                  572
13039          3260           697           880                  700
5383           1517           751           1536                 744
18302          1734           552           1188                 555
9724           1837           834           1398                 838
9725           2338           815           1459                 834

cDNA library   Mean length (false hits)   CDS mean length (true hits)   CDS mean length (false hits)
515            225                        155                           135
7364           329                        191                           143
8287           243                        298                           170
6137           362                        257                           186
18390          557                        424                           260
12639          443                        303                           137
6815           558                        327                           124
18407          546                        431                           199
10310          513                        384                           142
8613           460                        281                           177
16395          580                        470                           388
8975           353                        218                           94
6992           579                        525                           201
742            317                        219                           143
16443          570                        415                           170
13039          749                        429                           225
5383           769                        259                           150
18302          538                        415                           197
9724           795                        436                           209
9725           801                        462                           190

Appendix B

Model differences

Model    It. 1   It. 2   It. 3   It. 4   It. 5   It. 6   It. 7   It. 8   It. 9   It. 10
M500     55.57   71.2    63.37   60      60.6    69.57   74.64   65.42   65.5    62.15
M2000    30.35   48.07   58.27   31.3    67.21   61.03   34.75   74.68   43.28   26.12
M5000    25.17   23.12   19.31   16.34   22.31   19.06   23.92   67.64   26.87   31.02
M6000    17.05   30.06   18.87   23.18   20.58   19.48   20.08   17.7    17.76   25.8
M7000    26.41   20.54   17.91   32.39   16.34   18.25   21.38   25.58   20.59   16.39
M8000    19.15   14.4    17.84   18.99   25.64   23.23   21.36   25.57   20.66   20.71
M9000    21.78   19.91   12.57   15.57   15.93   17.31   27.08   16.27   20.23   38.45
M10000   27.17   15.69   22.34   23.48   12.38   18.46   12.58   12.99   17.25   17.83
M11000   16.43   14.94   17.71   24.77   13.61   18.04   14.43   14.82   18.75   18.11
M12000   17.53   17.14   13.94   13.33   15.38   23.39   12.4    15.18   15.04   11.24
M13000   16.34   12.65   13.82   30.55   14.89   18.02   15.19   16.88   21.58   15.19
M14000   16.56   17.1    18.41   12.63   16.69   11.64   14.3    14.59   12.21   12.27
M15000   12.4    17.05   16.66   14.8    16.01   16.73   17.52   15.78   11.89   16.35
M16000   15.99   20.24   12.43   15.54   17.17   12.62   13.7    16.05   12.54   21.22
M17000   14.05   17.27   16.35   16.15   14.36   17.01   18.22   17      17.62   17.82
M18000   17.43   16.67   18.2    11.31   14.04   17.78   21.12   14.95   20.17   13.07
M19000   11.34   13.63   13.62   15.88   11.08   16.83   17.57   13.77   14.98   13.27
M20000   8.16    13.13   12.97   11.33   11.94   14.22   16.09   14.8    13.87   14.52
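The trend in the table can be summarized by the mean and spread of each row over the ten iterations. The sketch below does this for a subset of the rows; the selection of rows and the variable names are choices made for illustration, and the values are copied from the table.

```python
from statistics import mean, pstdev

# Model differences for selected rows of the table (iterations 1 to 10)
differences = {
    "M500": [55.57, 71.2, 63.37, 60, 60.6, 69.57, 74.64, 65.42, 65.5, 62.15],
    "M2000": [30.35, 48.07, 58.27, 31.3, 67.21, 61.03, 34.75, 74.68, 43.28, 26.12],
    "M5000": [25.17, 23.12, 19.31, 16.34, 22.31, 19.06, 23.92, 67.64, 26.87, 31.02],
    "M6000": [17.05, 30.06, 18.87, 23.18, 20.58, 19.48, 20.08, 17.7, 17.76, 25.8],
    "M10000": [27.17, 15.69, 22.34, 23.48, 12.38, 18.46, 12.58, 12.99, 17.25, 17.83],
    "M20000": [8.16, 13.13, 12.97, 11.33, 11.94, 14.22, 16.09, 14.8, 13.87, 14.52],
}

row_means = {model: mean(values) for model, values in differences.items()}
for model, values in differences.items():
    # pstdev: spread over the ten iterations of this row
    print(f"{model}: mean {row_means[model]:.2f}, spread {pstdev(values):.2f}")
```

For these rows the mean difference falls steadily as the number of generated sequences grows, and the spread over iterations is much smaller from M6000 onwards than for M2000 or M5000.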
