BioInformatics Consultation Presentation 5 Gábor Pauler, Ph.D. Tax.reg.no: 63673852-3-22 Bank account: 50400113-11065546 Location: 1st Széchenyi str., 7666 Pogány, Hungary


BioInformatics Consultation
Presentation 5

Gábor Pauler, Ph.D.

Tax.reg.no: 63673852-3-22
Bank account: 50400113-11065546

Location: 1st Széchenyi str., 7666 Pogány, Hungary
Tel: +36-309-015-488

E-mail: [email protected]


Content of the Presentation

Gene Search in Eukaryotes
  Hidden Markov Models (HMM):
    Basic definition
    STEP1: Weight matrix
    STEP2: Probability scores
    STEP3: Gene syntax
    STEP4: Assembling Bayesian Network
    STEP5: Markov Chain Models
    STEP6: Maximum Likelihood
      Dynamic Programming
      Genetic Algorithm
    HMM Software
      GeneMark
  Codon Usage Statistics:
    Basic terms
      Codon Usage optimization
      Codon Usage Tables
      Codon Usage in ORFs
      Frameshift detection
      Alternative start codon detection
      Problems in Codon Usage statistics
    Codon Usage software:
      Codon Usage databases: Kazusa
      Codon Usage software: EBI, GCUA
References


Gene Search in Eukaryotes: Hidden Markov Models: Weight matrix, Probability score

- This is a more complex task than simple ORF analysis in prokaryotes and in cDNA, because here exons and introns must be separated to determine the coding parts of the analyzed sequence.

Hidden Markov Model (HMM): It can predict the location of genes and their introns/exons in the 6 possible reading frames of an analyzed sequence, using statistical/optimization methods to look for probabilistic signals:

- Donor/acceptor splice sites
- Start and stop codons
- Transcription termination sequences
- Polyadenylation sites
- Ribosome binding sites
- Transcription factor binding sites
- Elements of the promoter, TATA box

STEP1: Set up a database of Signal Sensor Weight Matrices:

- A weight matrix is a statistical summary of the probability of occurrence of the A, T, C, G bases at each position of a signal (eg. intron start). It can be drawn as a diagram where the size of each nucleotide symbol is proportional to its probability.
- Weight matrices of signals are stored for each species and continuously refined with learning algorithms:
  - A first estimate of the weight matrices can be set up from known sequences of similar species. Therefore specifying the species of origin (or a related species) of the analyzed sequence greatly helps gene prediction
  - Moreover, a longer (10 Mbase) unknown sequence is a sizeable enough sample for statistical analysis that the weight matrices can be learned from the sequence itself
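As a minimal sketch of how such a weight matrix could be estimated, the Python snippet below counts the bases at each position of a few aligned signal instances. The example instances and the function name `weight_matrix` are invented for illustration; real gene finders learn from far larger samples and usually add pseudocounts.

```python
from collections import Counter

BASES = "ACGT"

def weight_matrix(aligned_sites):
    """Return one {base: probability} dict per position of the signal."""
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all signal instances must have the same length")
    matrix = []
    for pos in range(length):
        # Count how often each base occurs at this position of the signal.
        counts = Counter(site[pos] for site in aligned_sites)
        matrix.append({b: counts[b] / len(aligned_sites) for b in BASES})
    return matrix

# HYPOTHETICAL intron-start instances, invented for this example.
sites = ["GTAAGT", "GTGAGT", "GTAAGA", "GTCAGG", "GTAAGT"]
pwm = weight_matrix(sites)
for pos, column in enumerate(pwm, start=1):
    print(pos, column)
```

Each column sums to 1, so a column can be read directly as the probability of seeing each base at that position.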

STEP2: Determine Probability Scores: Based on the rules of Bayesian conditional probability, the probability that the analyzed sequence fits a signal starting at a hypothetical position can be determined:

- The probability of the joint occurrence (A∩B) of two independent events (A, B) equals the product of their individual probabilities: P(A∩B) = P(A)×P(B)
- We consider the nucleotide positions of the unknown sequence independent of each other in those parts where no signal has been identified yet
- Therefore the probability of a signal fit is the product of the position fit probabilities. Eg. the probability that the sequence part GTAAGT is an intron start = 100% (G) × 100% (T) × 50% (A) × 60% (A) × 70% (G) × 40% (T) = 8.4% (probabilities are taken from the intron start weight matrix diagram)
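The GTAAGT calculation can be reproduced with a few lines of Python. This is only a sketch: the six consensus probabilities come from the example in the text, while the probabilities of the non-consensus bases are not given in the slides, so the code assumes the remainder is split evenly among the other three bases.

```python
from math import prod

BASES = "ACGT"
CONSENSUS = "GTAAGT"
# Probability of the consensus base at each position, as quoted in the text.
CONSENSUS_PROB = [1.0, 1.0, 0.5, 0.6, 0.7, 0.4]

def build_matrix():
    """Weight matrix for the intron start signal."""
    matrix = []
    for base, p in zip(CONSENSUS, CONSENSUS_PROB):
        # ASSUMPTION: the slides give no values for the other bases,
        # so the remaining probability is divided evenly among them.
        rest = (1.0 - p) / 3
        matrix.append({b: (p if b == base else rest) for b in BASES})
    return matrix

def signal_fit(window, matrix):
    """Product of the per-position probabilities (positions assumed independent)."""
    if len(window) != len(matrix):
        raise ValueError("window and signal must have the same length")
    return prod(column[base] for column, base in zip(matrix, window))

matrix = build_matrix()
print(round(signal_fit("GTAAGT", matrix), 4))  # 0.084
```

A window that does not start with GT scores 0 here, because the example matrix gives G and T a probability of 100% at the first two positions.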


Gene Search in Eukaryotes: Hidden Markov Models: Assembling Bayesian networks

STEP3: Set up a database of Gene Syntax probabilities:

- Probabilities are assigned to specific successions of signals (eg. after an intron start, it is much more likely that an intron end will follow than another intron start)

STEP4: Assemble a Bayesian Network: The data from the weight matrices, the probability-of-fit scores and the follow-up probabilities are summarized on a directed graph where:

- Nodes denote states. States can be:
  - Observable outputs: we can observe the nucleotides at the positions of the analyzed sequence (see state y1 on the graph = there is a T at the 41st position)
  - Hidden states: they cannot be observed; we just assume that a given type of signal starts from a given hypothetical position in the sequence (see state x1 on the graph = last position of the intron start, most likely T). Hence the model is called "Hidden"
- Edges denote transition probabilities between 2 given states, which can be computed from:
  - The matching probability of a given signal at a given position of the analyzed sequence (see b11 on the graph: the T at the 41st position (y1) quite probably results from the last position of an intron start (x1))
  - Syntactic rules between signals (see a12 on the graph: the last character of an intron start (x1) is unlikely to be followed directly by the first character of an intron end (x2), as this would result in a zero-length intron)
  - The sum of the probabilities of the edges departing from one node is 1 (the intron end can be at many places (y1, y2, y3, y4), but it is certain to be somewhere, so b11 + b12 + b13 + b14 = 1)
- The graph can have any type of network structure

Example sequence shown on the graph: TCCTTTAAATCCCTTACATGATCTGAGTTCAGACCGGCGTGAGCCAGGTCGGTTTCT
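One way to hold such a network in code is a pair of nested dictionaries, one for the edges between hidden states (a) and one for the edges from hidden states to observable outputs (b). The sketch below only illustrates the "edges leaving a node sum to 1" rule; the node names follow the graph described above, but every number is invented.

```python
# ASSUMPTION: all probabilities below are invented for illustration;
# only the names x1, x2, y1..y4, a12 and b11 come from the slide's graph.
a = {  # a[i][j]: probability that hidden state i is followed by hidden state j
    "x1": {"x1": 0.95, "x2": 0.05},  # a12 is small: no zero-length intron
    "x2": {"x1": 0.40, "x2": 0.60},
}
b = {  # b[i][k]: probability that hidden state i is matched to output k
    "x1": {"y1": 0.70, "y2": 0.10, "y3": 0.10, "y4": 0.10},  # b11 is large
    "x2": {"y1": 0.25, "y2": 0.25, "y3": 0.25, "y4": 0.25},
}

def rows_sum_to_one(table, tol=1e-9):
    """Check that the edges of each type leaving every node sum to 1."""
    return all(abs(sum(row.values()) - 1.0) <= tol for row in table.values())

print(rows_sum_to_one(a), rows_sum_to_one(b))
```

Checking this property whenever the tables are edited is a cheap guard against typing errors in the probabilities.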


Gene Search in Eukaryotes: Hidden Markov Models: Markov Chain Models

STEP5: Markov chain models:

- They assume that, from the graph network of probabilistic transitions, only a tree-like subset can be valid at any one moment, called a decision tree:
  - In the tree, the probability of each state is influenced by at most 1 ascendant (parent) state, so only 1 edge can arrive at a node (eg. there cannot be both T and G at the same position of the analyzed sequence at the same time)
  - But 1 state can have several descendant (child) states, each with its given probability (eg. a given T can be part of either an intron start or a TATA box). This is repeated from the root state of the tree down to the leaves
  - Multiplying the probabilities along the path leading from the root to a given leaf, we can compute the aggregate probability that the analyzed sequence contains the given types of signals at the given positions
  - Only 1 leaf (and 1 possible chain of edges leading to it) can be valid at a time
- EXAMPLE5-1: To show how the decision tree and the probability aggregation work, here is a sample application computing the probability that your partner will shop out the whole plaza if you leave him/her alone with your credit card:
  - A general partner is the root event
  - He/she cannot have multiple values of Sex, Hair color and IQ at the same time, so the descendant events (Brown, Blonde, Stupid, Clever) form a decision tree
  - At a given leaf partner, only one chain of edges leading from the root is valid, and its probability can be computed by multiplying the probabilities of its edges: 0.4 (Female) × 0.5 (Blonde) × 0.4 (Stupid) = 0.08

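Decision tree of EXAMPLE5-1 (values recovered from the figure):

Level 1, Sex:
  IF SEX=FEMALE THEN X=1                    edge 0.4
  IF SEX=MALE   THEN X=2                    edge 0.6

Level 2, Hair:
  IF X=1 AND HAIR=BROWN  THEN Y=1           edge 0.5    chain 0.4×0.5 = 0.2
  IF X=1 AND HAIR=BLONDE THEN Y=2           edge 0.5    chain 0.4×0.5 = 0.2
  IF X=2 AND HAIR=BROWN  THEN Y=3           edge 0.3    chain 0.6×0.3 = 0.18
  IF X=2 AND HAIR=BLONDE THEN Y=4           edge 0.7    chain 0.6×0.7 = 0.42

Level 3, IQ (drawn for one branch only; the figure labels it X=2, but the chain products shown start with 0.4×0.5, so they belong to a SEX=FEMALE branch):
  IF IQ=STUPID THEN SHOPPING=800$           edge 0.4    chain 0.4×0.5×0.4 = 0.08
  IF IQ=CLEVER THEN SHOPPING=400$           edge 0.6    chain 0.4×0.5×0.6 = 0.12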
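The chain probabilities of EXAMPLE5-1 can be checked with a short Python sketch. The edge probabilities are the ones recovered from the figure; applying the same IQ probabilities under every hair-color node is an assumption, since the figure draws the IQ level for one branch only.

```python
# Edge probabilities of the decision tree in EXAMPLE5-1.
P_SEX = {"Female": 0.4, "Male": 0.6}
P_HAIR = {
    "Female": {"Brown": 0.5, "Blonde": 0.5},
    "Male": {"Brown": 0.3, "Blonde": 0.7},
}
# ASSUMPTION: the figure shows the IQ edges for one branch only;
# here the same values are reused under every branch.
P_IQ = {"Stupid": 0.4, "Clever": 0.6}

def chain_probability(sex, hair, iq):
    """Multiply the edge probabilities along the single path root -> leaf."""
    return P_SEX[sex] * P_HAIR[sex][hair] * P_IQ[iq]

print(round(chain_probability("Female", "Blonde", "Stupid"), 2))  # 0.08
print(round(chain_probability("Female", "Blonde", "Clever"), 2))  # 0.12

# Exactly one leaf is valid at a time, so all leaf probabilities sum to 1.
total = sum(
    chain_probability(s, h, q)
    for s in P_SEX for h in P_HAIR[s] for q in P_IQ
)
print(round(total, 6))
```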



Gene Search in Eukaryotes: Hidden Markov Models: Maximum Likelihood (ML)

STEP6: Maximum Likelihood (ML):

We use an optimization algorithm on the analyzed DNA sequence to:

- Change the starting points of possible signals as discrete-valued decision variables (if a signal is considered absent from the sequence, its starting point is pushed to the end of the sequence)
- In all 6 possible reading frames of the analyzed sequence
- To get the maximal aggregate probability, which combines:
  - Signal nucleotide weights (T, A, C, G)
  - Signal matching probabilities (large or small)
  - Signal syntactic probabilities (large or small)

This goal function is a nasty stepped, multimodal, nonlinear monster with tons of local sub-optima.

Using the signal starting positions of the optimal solution, the analyzed sequence can be translated into gene structure and exons/introns, and the coding parts can be translated into proteins. The protein products can be further analyzed with other tools, to get not just the location of the gene but also its function.

Figure: the example sequence TCCTTTAAATCCCTTACATGATCTGAGTTCAGACCGGCGTGAGCCAGGTCGGTTTCT with a promoter marked and positions numbered 1 to 15.
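For a single signal, the maximization can be illustrated by simply trying every start position and keeping the best fit. The sketch below does this for the intron start signal on the example sequence of the slides. It is a deliberate simplification: the real problem optimizes many signals at once over all 6 reading frames, and the non-consensus probabilities in the matrix are an assumption (remainder split evenly), as the slides do not list them.

```python
from math import prod

BASES = "ACGT"
CONSENSUS = "GTAAGT"
CONSENSUS_PROB = [1.0, 1.0, 0.5, 0.6, 0.7, 0.4]  # from the GTAAGT example
# ASSUMPTION: probability not assigned to the consensus base is split
# evenly among the other three bases.
MATRIX = [
    {b: (p if b == c else (1.0 - p) / 3) for b in BASES}
    for c, p in zip(CONSENSUS, CONSENSUS_PROB)
]

SEQUENCE = "TCCTTTAAATCCCTTACATGATCTGAGTTCAGACCGGCGTGAGCCAGGTCGGTTTCT"

def fit(window):
    """Probability that the window matches the signal."""
    return prod(column[base] for column, base in zip(MATRIX, window))

def best_start(sequence, width=len(MATRIX)):
    """Exhaustive search: the start position with the highest signal fit."""
    starts = range(len(sequence) - width + 1)
    return max(starts, key=lambda i: fit(sequence[i:i + width]))

start = best_start(SEQUENCE)
window = SEQUENCE[start:start + len(MATRIX)]
print(start, window, fit(window))
```

Exhaustive search is fine for one signal and 57 bases, but the number of combinations grows multiplicatively with every additional signal, which is why the optimization method matters.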


Gene Search in Eukaryotes: Hidden Markov Models: Optimization methods 1

- Which optimization methods can we use from the collection we learned earlier?
  - You can forget analytic optimization, as it cannot handle discrete variables and stepped functions
  - You can also forget gradient descent (hill climbing) and simulated annealing algorithms, as the probability function is so multimodal that they will get stuck in the first sub-optimum
  - So, older software uses linear programming with branch and bound, which can handle discrete variables, and a nonlinear goal function through linearization. But this explodes the model size, and the computational requirement is so huge that the examined species will be quite extinct by the time it finds the optimal solution and finishes the gene search
  - But it has a special variant called dynamic programming, which uses the principle of a series of "bottlenecks":
    - IF a system has a consecutive series of states (eg. time periods)
    - AND it can transition only into the next neighbouring state (it cannot jump over several states in one step, or go back; see stages 3, 2, 1, 0 on the figure) (eg. time goes forward continuously, except for the pyjama-clad guys in Star Trek)
    - THEN optimizing the neighbouring state transitions individually, in a series of small, easy-to-compute models, will optimize the whole system (over all time)
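The slides do not name a specific dynamic programming algorithm. A standard one for Hidden Markov Models is the Viterbi algorithm, shown below as an illustrative sketch rather than as the method of any particular gene finder. It applies the bottleneck principle position by position: the best path into a state at one position depends only on the best paths at the previous position. The two-state exon/intron model and all of its probabilities are invented for the example.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most probable hidden-state path, its probability).

    All probabilities must be non-zero because logarithms are used
    to avoid numerical underflow on long sequences.
    """
    # best[s] = (log-probability of the best path ending in s, that path)
    best = {
        s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), [s])
        for s in states
    }
    for obs in observations[1:]:
        step = {}
        for s in states:
            # Small sub-problem: which previous state is the best way into s?
            score, path = max(
                (best[p][0] + math.log(trans_p[p][s]), best[p][1])
                for p in states
            )
            step[s] = (score + math.log(emit_p[s][obs]), path + [s])
        best = step
    score, path = max(best.values())
    return path, math.exp(score)

# HYPOTHETICAL toy model: every number below is invented.
STATES = ("exon", "intron")
START = {"exon": 0.5, "intron": 0.5}
TRANS = {"exon": {"exon": 0.9, "intron": 0.1},
         "intron": {"exon": 0.1, "intron": 0.9}}
EMIT = {"exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

path, prob = viterbi("GGGGGGAAAAAA", STATES, START, TRANS, EMIT)
print(path, prob)
```

The work grows linearly with the sequence length (times the square of the number of states), instead of exponentially as it would if every possible path were enumerated.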


Gene Search in Eukaryotes: Hidden Markov Models: Optimization methods 2

- EXAMPLE5-2: To make this math blah-blah more understandable: the principle can hold not only in time but also in space. Eg. if you have to cross 3 rivers consecutively, and there is only 1 bridge over each river, it is enough to optimize your route BETWEEN NEIGHBOURING BRIDGES, and your whole route will be optimal.
  - And this is exactly what went badly wrong in Operation Market Garden in 1944…
- EXAMPLE5-3: We can also exploit the fact that on the chromosome a gene is coded in one consecutive direction (5'-3' on the upper strand or 3'-5' on the reverse strand), not back and forth. So, if we have already found a promoter with a quite high matching probability in the starting part of the sequence, it is not worth searching further for upstream elements (eg. expression factors), which usually reside BEFORE the promoter. Instead, we search only for intron starts/ends
- This way, "leapfrogging" forward only, with a window of the sequence in each of the 3 reading frames of the 2 DNA strands, and using the output of the last model run (the recognized start positions of signals) as the input of the next model, tremendously reduces the computational requirement!
- In newer software a genetic algorithm is used to solve the discrete, multimodal, nonlinear optimization problem directly, without the need for dynamic programming, because its computational requirement is much lower than that of linear programming with branch and bound


Gene Search in Eukaryotes: Hidden Markov Models: Software

Hidden Markov Model-based gene search software: GeneMark (http://opal.biology.gatech.edu/GeneMark/)

- At the start screen:
  - Give a Title
  - Give the Sequence in FASTA format
  - Select the Species for the signal weight matrices
  - Press the Start GeneMark.hmm button
- It gives the estimated positions of genes and their introns/exons:
  - The Start frame / End frame columns give the reading frame at the start and at the end of each probable exon
  - Start and end frame differ whenever the exon length is not a multiple of 3, because an intron can split a codon; in the output table the frames of consecutive exons of a gene continue each other (eg. gene 2: exon 1 ends in frame 1, exon 2 starts in frame 2). A break in this continuity would point to a frameshift mutation: 1 or 2 bases inserted into or deleted from the sequence
  - It also splices the exons (unfortunately only in one version, corresponding to their original order) and translates them into a protein sequence in FASTA format
  - This can be analyzed later with protein structure software

Gene  Exon  Strand  Exon Type  From  To    Length  Start Frame  End Frame
1     4     -       Internal   36    849   814     3            3
1     3     -       Internal   940   1119  180     2            3
1     2     -       Internal   1286  1400  115     2            2
1     1     -       Initial    1712  1769  58      1            1
2     1     +       Initial    2796  2898  103     1            1
2     2     +       Internal   2980  3095  116     2            3
2     3     +       Terminal   3205  4260  56      1            3
3     8     -       Terminal   4546  4690  145     3            3
3     7     -       Internal   4813  4889  77      2            1
3     6     -       Internal   4985  5038  54      3            1
3     5     -       Internal   5107  5193  87      3            1
3     4     -       Internal   5272  5370  99      3            1
3     3     -       Internal   5457  5528  72      3            1
3     2     -       Internal   5688  5777  90      3            1
3     1     -       Initial    5882  5980  99      3            1
4     1     +       Initial    6130  6144  15      1            3
4     2     +       Internal   6212  6265  54      1            3
4     3     +       Terminal   6365  6469  105     1            3

>gene_1|GeneMark.hmm|389_aa
MSAPAKRSSTDTQDKDLMLAADKDMEKDTWNFKSMTDDDPMDFGFGSPAKNKKNAFKLDMGFDLDGDFGSSFKMDMPDFDFSSPAKKTTKTKETSDDKPSGNSKQKKNPFAFSYDFDALDDFDLGSSPPKKGSKTTTKSMDCEEICASSKSDKSDDLDFGLDLPITRQVPSKANTDVQAKASAEKESQNYKTTDTLVVNKSKNSNQAALESMGDFEAVESPQGSRKKASQTHTMCVQPQSVDTSPLKTSCSKVEEKNEPCPSNETIAPSPLHASEIAHIAVNRETSPDIHELCRSGTKEDCPIDPENANKKMITTMESSYEKIEQTSPSISSHLCSDKIEHQQEEMGTDTQAEIQDNTKGALYNSDAGHSLTTLSGKISPGTRTSQTAK
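The exon table can be sanity-checked with a few lines of Python. The sketch below uses the forward-strand rows (genes 2 and 4) exactly as printed and makes one labeled assumption: that the frame columns give the codon position (1, 2 or 3) of the first and last base of each exon. The function names are invented for the example.

```python
# (gene, exon, strand, type, from, to, length, start_frame, end_frame)
# Forward-strand rows copied from the GeneMark.hmm output table as printed.
ROWS = [
    (2, 1, "+", "Initial", 2796, 2898, 103, 1, 1),
    (2, 2, "+", "Internal", 2980, 3095, 116, 2, 3),
    (2, 3, "+", "Terminal", 3205, 4260, 56, 1, 3),
    (4, 1, "+", "Initial", 6130, 6144, 15, 1, 3),
    (4, 2, "+", "Internal", 6212, 6265, 54, 1, 3),
    (4, 3, "+", "Terminal", 6365, 6469, 105, 1, 3),
]

def length_mismatches(rows):
    """Exons whose Length column disagrees with To - From + 1."""
    return [(g, e) for g, e, _, _, start, end, length, _, _ in rows
            if end - start + 1 != length]

def expected_end_frame(start_frame, span):
    # ASSUMPTION: a frame value is the codon position (1..3) of a base,
    # so the last base of a span of n bases is n - 1 positions further on.
    return (start_frame - 1 + span - 1) % 3 + 1

def frame_breaks(rows):
    """Forward-strand exons whose end frame does not follow from the start frame."""
    return [(g, e) for g, e, strand, _, start, end, _, sf, ef in rows
            if strand == "+" and expected_end_frame(sf, end - start + 1) != ef]

print(length_mismatches(ROWS))  # [(2, 3)]
print(frame_breaks(ROWS))       # []
```

With the rows as printed, the check flags gene 2, exon 3, where the Length column (56) does not equal To - From + 1 (1056), while the frame columns are consistent with the coordinates for every row tested.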




Gene Search in Eukaryotes: Codon Usage Statistics: Basic terms 1

- There is considerable evolutionary pressure for codons in coding parts to be much more organized than in non-coding parts, as badly coded proteins (eg. ones riddled with frameshift mutations) will not work and reduce survival
- Redundant coding: 1 amino acid is usually coded by several codons, for two reasons:
  - There are several tRNAs with different anticodons that still carry the same amino acid
  - In some tRNAs the spatial structure of the anticodon allows the tRNA to "wobble" when binding to the mRNA during protein synthesis, so it can deliver the same amino acid for several codons. The point is that the probability of the alternative "wobbled" matches varies considerably across species!

Codon Usage Optimization:

- In rapidly reproducing organisms (eg. yeast), where fast protein synthesis is the basis of survival, a preference for some of the codon alternatives has formed during evolution
- Optimized codon usage probably also had a synergistic effect on tRNA evolution, but little is known about this yet
- In slowly reproducing organisms (eg. humans) codon usage optimization is less important

Codon Usage Table:

- A database broken down by species/sub-species:
  - What is the percentage share of each amino acid in the coding parts
  - What is the percentage share of the alternative codons coding the same amino acid

AmAcid  Codon  Number  /1000  Fraction    AmAcid  Codon  Number  /1000  Fraction
Ala     GCG    0       0.00   0.00        Pro     CCG    0       0.00   0.00
Ala     GCA    3       20.13  0.33        Pro     CCA    5       33.56  0.71
Ala     GCT    5       33.56  0.56        Pro     CCT    2       13.42  0.29
Ala     GCC    1       6.71   0.11        Pro     CCC    0       0.00   0.00
Cys     TGT    6       40.27  0.86        Gln     CAG    1       6.71   0.20
Cys     TGC    1       6.71   0.14        Gln     CAA    4       26.85  0.80
Asp     GAT    6       40.27  1.00        Arg     AGG    0       0.00   0.00
Asp     GAC    0       0.00   0.00        Arg     AGA    5       33.56  0.36
Glu     GAG    7       46.98  0.64        Arg     CGG    0       0.00   0.00
Glu     GAA    4       26.85  0.36        Arg     CGA    3       20.13  0.21
Phe     TTT    2       13.42  0.67        Arg     CGT    6       40.27  0.43
Phe     TTC    1       6.71   0.33        Arg     CGC    0       0.00   0.00
Gly     GGG    0       0.00   0.00        Ser     AGT    2       13.42  0.13
Gly     GGA    3       20.13  0.43        Ser     AGC    3       20.13  0.19
Gly     GGT    4       26.85  0.57        Ser     TCG    0       0.00   0.00
Gly     GGC    0       0.00   0.00        Ser     TCA    5       33.56  0.31
His     CAT    2       13.42  1.00        Ser     TCT    6       40.27  0.38
His     CAC    0       0.00   0.00        Ser     TCC    0       0.00   0.00
Ile     ATA    1       6.71   0.11        Thr     ACG    1       6.71   0.10
Ile     ATT    7       46.98  0.78        Thr     ACA    5       33.56  0.50
Ile     ATC    1       6.71   0.11        Thr     ACT    4       26.85  0.40
Lys     AAG    1       6.71   0.14        Thr     ACC    0       0.00   0.00
Lys     AAA    6       40.27  0.86        Val     GTG    0       0.00   0.00
Leu     TTG    2       13.42  0.18        Val     GTA    2       13.42  0.20
Leu     TTA    3       20.13  0.27        Val     GTT    6       40.27  0.60
Leu     CTG    0       0.00   0.00        Val     GTC    2       13.42  0.20
Leu     CTA    3       20.13  0.27        Trp     TGG    4       26.85  1.00
Leu     CTT    2       13.42  0.18        Tyr     TAT    1       6.71   0.50
Leu     CTC    1       6.71   0.09        Tyr     TAC    1       6.71   0.50
Met     ATG    5       33.56  1.00        End     TGA    0       0.00   0.00
Asn     AAT    2       13.42  0.67        End     TAG    0       0.00   0.00
Asn     AAC    1       6.71   0.33        End     TAA    1       6.71   1.00
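The three numeric columns of such a table can be derived from a coding sequence as in the sketch below: Number is the codon count, /1000 is the count per thousand codons, and Fraction is the share among the codons of the same amino acid (in the table above, a count of 1 gives 6.71 per thousand, which corresponds to 149 codons in total). The function name and the short test sequence are invented; the standard genetic code is written in the usual compact TCAG order.

```python
from collections import Counter

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def codon_usage(cds):
    """Return {codon: (number, per_thousand, fraction)} for an in-frame sequence.

    Only codons that occur in the sequence are listed.
    """
    usable = len(cds) - len(cds) % 3  # ignore an incomplete trailing codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    per_amino = Counter()
    for codon, n in counts.items():
        per_amino[CODON_TABLE[codon]] += n
    total = len(codons)
    return {
        codon: (n, 1000 * n / total, n / per_amino[CODON_TABLE[codon]])
        for codon, n in counts.items()
    }

# HYPOTHETICAL mini coding sequence: ATG GCT GCT GCA TAA
usage = codon_usage("ATGGCTGCTGCATAA")
for codon, (number, per_thousand, fraction) in sorted(usage.items()):
    print(CODON_TABLE[codon], codon, number, round(per_thousand, 2), round(fraction, 2))
```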


Gene Search in Eukaryotes: Codon Usage Statistics: Basic terms 2

Codon usage analysis in ORFs:

- It examines all ORFs from all alternative ATG start codons in all 6 possible reading frames
- It computes which are the preferred and the scarce codons (p<10%) in a given ORF
- It computes the frequency of AT/GC pairs in the middle (non-wobbling) codon positions in a given ORF
- It compares the data with the frequencies stored in the database (preferably for the matching sub-species) and estimates which of the ORFs can be coding ORFs

Frameshift mutation detection:

- If the positions of 2 coding ORFs are almost consecutive, but they are in different reading frames, then it is very likely that there was an insertion/deletion frameshift mutation at their border
- This can locate a frameshift mutation more exactly than the Hidden Markov Model, where we only get an indirect warning that a consecutive intron start and end are not in the same reading frame

Alternative start codon detection:

- ATG is the most frequent start codon, but not the only possible one
- If the part of the actual ORF before the first ATG seems to be coding, it is likely that the start codon is one of the scarcer alternatives, GTG or TTG
- Additional signals help to decide which can be the real start codon:
  - The start codon is usually preceded, about 10 base pairs upstream, by the ca. 3-5 base pair wide Shine-Dalgarno ribosome binding site (RBS), which is complementary to part of a ribosomal RNA
  - Or, even before this, we can find the TATA box of the promoter

Problems in codon preference analysis:

- Lack of species-specific data on scarce codons
- In eukaryotes, intron content can blur the statistics, so introns should first be removed using an HMM
- mRNA editing: in eukaryotes, editing or alternative splicing of the mRNA can put an unexpected stop codon into the mRNA, and this leaves no trace in the DNA
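The first step of the ORF analysis, listing the ORFs in all 6 reading frames, can be sketched in Python as follows. This is a simplified illustration with invented function names: it reports one ORF per start codon up to the next in-frame stop codon, and coordinates on the reverse strand are given relative to the reverse-complemented sequence. Passing `start_codons=("ATG", "GTG", "TTG")` would include the alternative start codons.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def orfs_in_frame(seq, frame, start_codons=("ATG",)):
    """Yield (start, end) of ORFs in one frame; 0-based, end exclusive."""
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon in start_codons:
            start = i                  # open an ORF at the first start codon
        elif start is not None and codon in STOPS:
            yield start, i + 3         # close it at the first in-frame stop
            start = None

def six_frame_orfs(seq, start_codons=("ATG",)):
    """Return (strand, frame, start, end, orf) for all 6 reading frames.

    For strand "-" the coordinates refer to the reverse complement.
    """
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, start_codons):
                found.append((strand, frame + 1, start, end, s[start:end]))
    return found

# HYPOTHETICAL test sequence with a single short ORF in frame 3.
print(six_frame_orfs("AAATGGCCTAAGG"))
```

Two ORFs from this list that are nearly adjacent but lie in different frames of the same strand would be candidates for the frameshift check described above.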



Gene Search in Eukaryotes: Codon Usage Statistics: Software 1

Database of codon usage tables by species: Kazusa (http://www.kazusa.or.jp/codon/)

Simple codon usage statistics software: EBI (http://www.ebioinfogen.com/biotools/codon-usage.htm):

- At the start screen:
  - Upload the analyzed DNA sequence in FASTA format
  - Select the code table: Standard for vertebrates, genomic
  - Press the Submit button
- In the output table, the amino acids and their codon usage frequencies can be seen (the same table as shown under Codon Usage Table in Basic terms 1)



Gene Search in Eukaryotes: Codon Usage Statistics: Software 2

Complex codon usage analysis software: GCUA (http://gcua.schoedl.de/):

- At the start screen we can select 2 modes
- The first mode analyzes the triplets in all ORFs of the analyzed sequence, based on the codon usage of a selected species:
  - Give the Sequence name
  - Give the organism of the analyzed sequence
  - Give the analyzed sequence in FASTA format
  - Select the Codon usage table by species
  - Press the Submit button
- It shows the relative adaptiveness score of the triplet codons in all ORFs on bar charts:

  Relative adaptiveness = (frequency of the given codon) / (frequency of the most preferred codon coding the same amino acid)

- Scarce codons are marked on the charts with grey and red, signalling non-coding regions
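The relative adaptiveness ratio is easy to compute from a codon usage table. The sketch below applies it to the alanine rows of the example codon usage table shown earlier; the function name is invented, and since all synonymous codons share the same denominator, the ratio of counts equals the ratio of frequencies.

```python
def relative_adaptiveness(codon_counts):
    """Scale the usage of each synonymous codon by that of the most used one."""
    top = max(codon_counts.values())
    if top == 0:
        raise ValueError("the amino acid does not occur in the sample")
    return {codon: n / top for codon, n in codon_counts.items()}

# Alanine rows (Number column) of the example codon usage table.
ala = {"GCG": 0, "GCA": 3, "GCT": 5, "GCC": 1}
w = relative_adaptiveness(ala)
for codon, value in sorted(w.items(), key=lambda item: -item[1]):
    print(codon, f"{100 * value:.0f}%")
```

Here GCT is the most preferred alanine codon (100%), GCA reaches 60%, GCC 20% and GCG 0%.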



Gene Search in Eukaryotes: Codon Usage Statistics: Software 3

- The second mode compares the codon usage table of the analyzed sequence with that of a known species:
  - Black bars are the relative adaptiveness of codons in the analyzed sequence
  - Red bars are the relative adaptiveness of codons in the selected species
  - It also provides the mean percentage difference between them


References

Gene search in Eukaryotes
  Hidden Markov Models
    Bayesian probability: http://en.wikipedia.org/wiki/Bayesian_probability
    Bayesian network: http://www.cs.columbia.edu/~sal/notes/AISP05/m14-bayesian.ppt
                      http://www.niedermayer.ca/papers/bayesian/
    Decision trees: http://en.wikipedia.org/wiki/Decision_tree
    Markov-chains: http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/Chapter11.pdf
    Hidden Markov Models (HMM): http://jedlik.phy.bme.hu/~gerjanos/HMM/node2.html
    Optimization algorithm:
      Dynamic Programming: http://en.wikipedia.org/wiki/Dynamic_programming
    HMM Software:
      GeneMark: http://opal.biology.gatech.edu/GeneMark/
  Codon Usage/Preference:
    Codon Usage databases:
      Kazusa: http://www.kazusa.or.jp/codon/
    Codon Usage software:
      EBI: http://www.ebioinfogen.com/biotools/codon-usage.htm
      GCUA: http://gcua.schoedl.de/