120

2015 bioinformatics go_hmm_wim_vancriekinge

Embed Size (px)

Citation preview

Page 1: 2015 bioinformatics go_hmm_wim_vancriekinge
Page 2: 2015 bioinformatics go_hmm_wim_vancriekinge

FBW1-12-2015

Wim Van Criekinge

Page 3: 2015 bioinformatics go_hmm_wim_vancriekinge
Page 4: 2015 bioinformatics go_hmm_wim_vancriekinge
Page 5: 2015 bioinformatics go_hmm_wim_vancriekinge

BPC 2015

*** ERGRO ***

Page 6: 2015 bioinformatics go_hmm_wim_vancriekinge

BPC 2015

*** ERGRO *** 1. Longest English word where first three letters are identical to the last three2. English word where longest stretch of letters are identical at beginning and at the end3. In Dutch ?4. Any other language5. Biological relevance ?

Send before 1st of december to [email protected] one wins, if same size first to submit

Page 7: 2015 bioinformatics go_hmm_wim_vancriekinge

Dries Godderis

1. Langste engels woord waar 3 eerste letters = 3 laatste letters: antipredeterminant (18)2. Langste engels woord met langste gelijke stretch = benzeneazobenzene (17)3. In nederlands langste woord met eerste 3 letters = 3 laatste letters: tentoonstellingsprojecten (25)  In nederlands langste woord met langste gelijke stretch = dierentuindieren (16)4. In portugees langste woord met eerste 3 letters = 3 laatste letters: desconstitucionalizardes (24)  In portugees langste woord met langste gelijke stretch = reassenhoreasse (15) (=vervoegd werkwoord van reassenhorear)5. Biologische relevantie: bij een hairpin loop (stem-loop) model wordt de stabiliteit en vorming van deze structuur bepaalt door de stabiliteit van de helix en de gevormde loopregio's. Een gelijke stretch aan begin en eind van de sequentie zullen een belangrijke rol spelen in een goede basepaarvorming

Page 8: 2015 bioinformatics go_hmm_wim_vancriekinge

Exams

• Dates ?

• 1st question

Page 9: 2015 bioinformatics go_hmm_wim_vancriekinge

Gene Prediction, HMM & ncRNA

What to do with an unknown sequence ?

Gene OntologiesGene PredictionComposite Gene PredictionNon-coding RNAHMM

Page 10: 2015 bioinformatics go_hmm_wim_vancriekinge

UNKNOWN PROTEIN SEQUENCE

LOOK FOR:• Similar sequences in databases ((PSI)

BLAST)• Distinctive patterns/domains associated

with function• Functionally important residues• Secondary and tertiary structure• Physical properties (hydrophobicity, IEP

etc)

Page 11: 2015 bioinformatics go_hmm_wim_vancriekinge

BASIC INFORMATION COMES FROM SEQUENCE

• One sequence- can get some information eg amino acid properties

• More than one sequence- get more info on conserved residues, fold and function

• Multiple alignments of related sequences- can build up consensus sequences of known families, domains, motifs or sites.

• Sequence alignments can give information on loops, families and function from conserved regions

Page 12: 2015 bioinformatics go_hmm_wim_vancriekinge

Additional analysis of protein sequences

• transmembrane regions

• signal sequences• localisation

signals• targeting

sequences• GPI anchors• glycosylation sites

• hydrophobicity• amino acid

composition• molecular weight• solvent accessibility• antigenicity

Page 13: 2015 bioinformatics go_hmm_wim_vancriekinge

FINDING CONSERVED PATTERNS IN PROTEIN SEQUENCES

• Pattern - short, simplest, but limited• Motif - conserved element of a sequence

alignment, usually predictive of structural or functional region

To get more information across whole alignment:

• Profile• HMM

Page 14: 2015 bioinformatics go_hmm_wim_vancriekinge

PATTERNS

• Small, highly conserved regions• Shown as regular expressions

Example:[AG]-x-V-x(2)-x-{YW}– [] shows either amino acid– X is any amino acid– X(2) any amino acid in the next 2 positions– {} shows any amino acid except these BUT- limited to near exact match in small

region

Page 15: 2015 bioinformatics go_hmm_wim_vancriekinge

PROFILES

• Table or matrix containing comparison information for aligned sequences

• Used to find sequences similar to alignment rather than one sequence

• Contains same number of rows as positions in sequences

• Row contains score for alignment of position with each residue

Page 16: 2015 bioinformatics go_hmm_wim_vancriekinge

HIDDEN MARKOV MODELS (HMM)

• An HMM is a large-scale profile with gaps, insertions and deletions allowed in the alignments, and built around probabilities

• Package used HMMER (http://hmmer.wusd.edu/)• Start with one sequence or alignment -HMMbuild,

then calibrate with HMMcalibrate, search database with HMM

• E-value- number of false matches expected with a certain score

• Assume extreme value distribution for noise, calibrate by searching random seq with HMM build up curve of noise (EVD)

HMM

Page 18: 2015 bioinformatics go_hmm_wim_vancriekinge

Gene Prediction, HMM & ncRNA

What to do with an unknown sequence ?

Gene OntologiesGene PredictionHMMComposite Gene PredictionNon-coding RNA

Page 19: 2015 bioinformatics go_hmm_wim_vancriekinge

What is an ontology?

• An ontology is an explicit specification of a conceptualization.

• A conceptualization is an abstract, simplified view of the world that we want to represent.

• If the specification medium is a formal representation, the ontology defines the vocabulary.

Page 20: 2015 bioinformatics go_hmm_wim_vancriekinge

Why Create Ontologies?

• to enable data exchange among programs

• to simplify unification (or translation) of disparate representations

• to employ knowledge-based services • to embody the representation of a

theory • to facilitate communication among

people

Page 21: 2015 bioinformatics go_hmm_wim_vancriekinge

Summary

• Ontologies are what they do: artifacts to help people and their programs communicate, coordinate, collaborate.

• Ontologies are essential elements in the technological infrastructure of the Knowledge Age

• http://www.geneontology.org/

Page 22: 2015 bioinformatics go_hmm_wim_vancriekinge

• Molecular Function — elemental activity or task

nuclease, DNA binding, transcription factor• Biological Process — broad objective or goal

mitosis, signal transduction, metabolism • Cellular Component — location or complex

nucleus, ribosome, origin recognition complex

The Three Ontologies

Page 23: 2015 bioinformatics go_hmm_wim_vancriekinge

DAG Structure

Directed acyclic graph: each child may have one or more

parents

Page 24: 2015 bioinformatics go_hmm_wim_vancriekinge

Example - Molecular Function

Page 25: 2015 bioinformatics go_hmm_wim_vancriekinge

Example - Biological Process

Page 26: 2015 bioinformatics go_hmm_wim_vancriekinge

Example - Cellular Location

Page 27: 2015 bioinformatics go_hmm_wim_vancriekinge

AmiGO browser

Page 28: 2015 bioinformatics go_hmm_wim_vancriekinge

GO: Applications

• Eg. chip-data analysis: Overrepresented item can provide functional clues

• Overrepresentation check: contingency table– Chi-square test (or Fisher is frequency < 5)

Page 29: 2015 bioinformatics go_hmm_wim_vancriekinge

Gene Prediction, HMM & ncRNA

What to do with an unknown sequence ?

Web applicationsGene OntologiesGene PredictionHMMComposite Gene PredictionNon-coding RNA

Page 30: 2015 bioinformatics go_hmm_wim_vancriekinge

Problem:

Given a very long DNA sequence, identify coding regions (including intron splice sites) and their predicted protein sequences

Computational Gene Finding

Page 31: 2015 bioinformatics go_hmm_wim_vancriekinge

Eukaryotic gene structure

Computational Gene Finding

Page 32: 2015 bioinformatics go_hmm_wim_vancriekinge

• There is no (yet known) perfect method for finding genes. All approaches rely on combining various “weak signals” together

• Find elements of a gene– coding sequences (exons)– promoters and start signals– poly-A tails and downstream signals

• Assemble into a consistent gene model

Computational Gene Finding

Page 33: 2015 bioinformatics go_hmm_wim_vancriekinge

Gen

efin

der

Page 34: 2015 bioinformatics go_hmm_wim_vancriekinge
Page 35: 2015 bioinformatics go_hmm_wim_vancriekinge

GENE STRUCTURE INFORMATION - POSITION ON PHYSICAL MAP This gene structure corresponds to the position on the physical map

Page 36: 2015 bioinformatics go_hmm_wim_vancriekinge

GENE STRUCTURE INFORMATION - ACTIVE ZONE This gene structure shows the Active Zone

The Active Zone limits the extent of analysis, genefinder & fasta dumps

A blue line within the yellow box indicates regions outside of the active zone

The active zone is set by entering coordinates in the active zone (yellow box)

Page 37: 2015 bioinformatics go_hmm_wim_vancriekinge

GENE STRUCTURE INFORMATION - POSITION This gene structure relates to the Position:

Change origin of this scale by entering a number in the green 'origin' box

Page 38: 2015 bioinformatics go_hmm_wim_vancriekinge

GENE STRUCTURE INFORMATION - PREDICTED GENE STRUCTURE This gene structure relates to the predicted gene structures

Boxes are Exons, thin lines (or springs) are Introns

Page 39: 2015 bioinformatics go_hmm_wim_vancriekinge

Find the open reading frames

GAAAAAGCTCCTGCCCAATCTGAAATGGTTAGCCTATCTTTCCACCGT

Any sequence has 3 potential reading frames (+1, +2, +3)

Its complement also has three potential reading frames (-1, -2, -3)

6 possible reading frames

The triplet, non-punctuated nature of the genetic code helps us out

64 potential codons61 true codons3 stop codons (TGA, TAA, TAG)

Random distribution app. 1/21 codons will be a stop

E K A P A Q S E M V S L S F H R

K K L L P N L K W L A Y L S T

K S S C P I * N G * P I F P P

Page 40: 2015 bioinformatics go_hmm_wim_vancriekinge

GENE STRUCTURE INFORMATION - OPEN READING FRAMES This gene structure relates to Open reading Frames

There is one column for each frame

Small horizontal lines represent stop codons

Page 41: 2015 bioinformatics go_hmm_wim_vancriekinge

They have one column for each frame The size indicates relative score for the particular start site

GENE STRUCTURE INFORMATION - START CODONS This gene structure represents Start Codons

Page 42: 2015 bioinformatics go_hmm_wim_vancriekinge

• Amino acid distributions are biased e.g. p(A) > p(C)

• Pairwise distributions also biasede.g. p(AT)/[p(A)*p(T)] > p(AC)/[p(A)*p(C)]

• Nucleotides that code for preferred amino acids (and AA pairs) occur more frequently in coding regions than in non-coding regions.

• Codon biases (per amino acid)• Hexanucleotide distributions that reflect those

biases indicate coding regions.

Computational Gene Finding: Hexanucleotide frequencies

Page 43: 2015 bioinformatics go_hmm_wim_vancriekinge

Gene predictionGeneration of datasets (Ensmart@Ensembl):

Dataset 1 (http://biobix.ugent.be/txt/coding.txt) consists of >900 coding regions (DNA):

Dataset 2 (http://biobix.ugent.be/txt/noncoding.txt) consists of >900 non-coding regions

Distance Array: Calculate for every base all the distances (in bp) to the same nucleotide (focus on the first 1000 bp of the coding region and limit the distance array to a window of 1000 bp)

Do you see a difference in this “distance array” between coding and noncoding sequence ?

Could it be used to predict genes ?Write a program to predict genes in the following genomic

sequence (http://biobix.ugent.be/txt/genomic.txt)What else could help in finding genes in raw genomic

sequences ?

Page 44: 2015 bioinformatics go_hmm_wim_vancriekinge

GENE STRUCTURE INFORMATION - CODING POTENTIAL This gene structure corresponds to the Coding Potential

The grey boxes indicate regions where the codon frequencies match those of known C. elegans genes.

the larger the grey box the more this region resembles a C. elegans coding element

Page 45: 2015 bioinformatics go_hmm_wim_vancriekinge

blastn (EST)

For raw DNA sequence analysis blastx is extremely useful

Will probe your DNA sequence against the protein database

A match (homolog) gives you some ideas regarding function

One problem are all of the genome sequences

Will get matches to genome databases that are strictly identified by sequence homology – often you need some experimental evidence

Page 46: 2015 bioinformatics go_hmm_wim_vancriekinge

GENE STRUCTURE INFORMATION - SEQUENCE SIMILARITY This feature shows protein sequence similarity

The blue boxes indicate regions of sequence which when translated have similarity to previously characterised proteins.

To view the alignment, select the right mouse button whilst over the blue box.

Page 47: 2015 bioinformatics go_hmm_wim_vancriekinge

GENE STRUCTURE INFORMATION - EST MATCHES This gene structure relates to Est Matches

The yellow boxes represent DNA matches (Blast) to C. elegans Expressed Sequence Tags (ESTS) To view the alignment use the right mouse button whilst over the yellow box to invoke Blixem

Page 48: 2015 bioinformatics go_hmm_wim_vancriekinge

Borodovsky et al., 1999, Organization of the Prokaryotic Genome (Charlebois, ed) pp. 11-34

New generation of programs to predict gene coding sequences based on a non-random repeat pattern(eg. Glimmer, GeneMark) – actually pretty good

Page 49: 2015 bioinformatics go_hmm_wim_vancriekinge

• CpG islands are regions of sequence that have a high proportion of CG dinucleotide pairs (p is a phoshodiester bond linking them)– CpG islands are present in the promoter and

exonic regions of approximately 40% of mammalian genes

– Other regions of the mammalian genome contain few CpG dinucleotides and these are largely methylated

• Definition: sequences of >500 bp with – G+C > 55% – Observed(CpG)/Expected(CpG) > 0.65

Computational Gene Finding

Page 50: 2015 bioinformatics go_hmm_wim_vancriekinge

GENE STRUCTURE INFORMATION - REPEAT FAMILIES This gene structure corresponds to Repeat Families

This column shows matches to members of a number of repeat families Currently a hidden markov model is used to detect these

Page 51: 2015 bioinformatics go_hmm_wim_vancriekinge

GENE STRUCTURE INFORMATION - REPEATS This gene structure relates to Repeats

This column shows regions of localised repeats both tandem and inverted Clicking on the boxes will show the complete repeat information in the blue line at the top end of the screen

Page 52: 2015 bioinformatics go_hmm_wim_vancriekinge

Exon/intron boundaries

Page 53: 2015 bioinformatics go_hmm_wim_vancriekinge

• Most Eukaryotic introns have a consensus splice signal: GU at the beginning (“donor”), AG at the end (“acceptor”).

• Variation does occur in the splice sites• Many AGs and GTs are not splice sites.• Database of experimentally validated

human splice sites: http://www.ebi.ac.uk/~thanaraj/splice.html

Computational Gene Finding: Splice junctions

Page 54: 2015 bioinformatics go_hmm_wim_vancriekinge

GENE STRUCTURE INFORMATION - PUTATIVE SPLICE SITES This gene structure shows putative splice sites

The Splice Sites are shown 'Hooked' The Hook points in the direction of splicing, therefore 3' splice sites point up and 5' Splice sites point down The colour of the Splice Site indicates the position at which it interrupts the Codon The height of the Splices is proportional to the Genefinder score of the Splice Site

Page 55: 2015 bioinformatics go_hmm_wim_vancriekinge

Gene Prediction, HMM & ncRNA

What to do with an unknown sequence ?

Web applicationsGene OntologiesGene PredictionHMMComposite Gene PredictionNon-coding RNA

Page 56: 2015 bioinformatics go_hmm_wim_vancriekinge
Page 57: 2015 bioinformatics go_hmm_wim_vancriekinge

• Recall that profiles are matrices that identify the probability of seeing an amino acid at a particular location in a motif.

• What about motifs that allow insertions or deletions (together, called indels)?

• Patterns and regular expressions can handle these easily, but profiles are more flexible.

• Can indels be integrated into profiles?

Towards profiles (PSSM) with indels – insertions and/or deletions

Page 58: 2015 bioinformatics go_hmm_wim_vancriekinge

• Need a representation that allows specification of the probability of introducing (and/or extending) a gap in the profile.

A .1C .05D .2E .08F .01

Gap A .04C .1D .01E .2F .02

Gap A .2C .01D .05E .1F .06

in s e r t

delete

continue

Hidden Markov Models: Graphical models of sequences

Page 59: 2015 bioinformatics go_hmm_wim_vancriekinge

• A sequence is said to be Markovian if the probability of the occurrence of an element in a particular position depends only on the previous elements in the sequence.

• Order of a Markov chain depends on how many previous elements influence probability– 0th order: uniform probability at every position– 1st order: probability depends only on immediately

previous position.• 1st order Markov chains are good for proteins.

Hidden Markov Chain

Page 60: 2015 bioinformatics go_hmm_wim_vancriekinge

Marchov Chain for DNA

Page 61: 2015 bioinformatics go_hmm_wim_vancriekinge

Markov chain with begin and end

Page 62: 2015 bioinformatics go_hmm_wim_vancriekinge

• Consists of states (boxes) and transitions (arcs) labeled with probabilities

• States have probability(s) of “emitting” an element of a sequence (or nothing).

• Arcs have probability of moving from one state to another. – Sum of probabilities of all out arcs must be 1– Self-loops (e.g. gap extend) are OK.

Markov Models: Graphical models of sequences

Page 63: 2015 bioinformatics go_hmm_wim_vancriekinge

• Simplest example: Each state emits (or, equivalently, recognizes) a particular element with probability 1, and each transition is equally likely.

Example sequences: 1234 234 14 121214 2123334

Begin

Emit 1

Emit 2

Emit 4

Emit 3End

Markov Models

Page 64: 2015 bioinformatics go_hmm_wim_vancriekinge

• Now, add probabilities to each transition (let emission remain a single element)

• We can calculate the probability of any sequence given this model by multiplying

0.5

0.50.25 0.75

0.9

0.10.2

0.81.0Begi

nEmit 1

Emit 2

Emit 4

Emit 3End

p(1234) = 0.5 * 0.1 * 0.75 * 0.8 = 0.03p(14) = 0.5 * 0.9 = 0.45p(2334)= 0.5 * 0.75 * 0.2 * 0.8 = 0.06

Hidden Markov Models: Probabilistic Markov Models

Page 65: 2015 bioinformatics go_hmm_wim_vancriekinge

• If we let the states define a set of emission probabilities for elements, we can no longer be sure which state we are in given a particular element of a sequence

BCCD or BCCD ?

0.5

0.50.25 0.75

0.9

0.10.2

0.81.0Begi

nA (0.8) B(0.2)

B (0.7) C(0.3)

C (0.1) D (0.9)

C (0.6) A(0.4)End

Hidden Markov Models: Probablistic Emmision

Page 66: 2015 bioinformatics go_hmm_wim_vancriekinge

• Emission uncertainty means the sequence doesn't identify a unique path. The states are “hidden”

• Probability of a sequence is sum of all paths that can produce it:

0.5

0.50.25 0.75

0.9

0.10.2

0.81.0Begi

nA (0.8) B(0.2)

B (0.7) C(0.3)

C (0.1) D (0.9)

C (0.6) A(0.4)End

p(bccd) = 0.5 * 0.2 * 0.1 * 0.3 * 0.75 * 0.6 * 0.8 * 0.9 + 0.5 * 0.7 * 0.75 * 0.6 * 0.2 * 0.6 * 0.8 * 0.9 = 0.000972 + 0.013608 = 0.01458

Hidden Markov Models

Page 67: 2015 bioinformatics go_hmm_wim_vancriekinge

Hidden Markov Models

Page 68: 2015 bioinformatics go_hmm_wim_vancriekinge

Hidden Markov Models: The occasionally dishonest casino

Page 69: 2015 bioinformatics go_hmm_wim_vancriekinge

Hidden Markov Models: The occasionally dishonest casino

Page 70: 2015 bioinformatics go_hmm_wim_vancriekinge

• The HMM must first be “trained” using a training set– Eg. database of known genes. – Consensus sequences for all signal sensors are needed.– Compositional rules (i.e., emission probabilities) and

length distributions are necessary for content sensors.• Transition probabilities between all connected

states must be estimated.

• Estimate the probability of sequence s, given model m, P(s|m)– Multiply probabilities along most likely path

(or add logs – less numeric error)

Use of Hidden Markov Models

Page 71: 2015 bioinformatics go_hmm_wim_vancriekinge

• HMMs are effectively profiles with gaps, and have applications throughout Bioinformatics

• Protein sequence applications:– MSAs and identifying distant homologs

E.g. Pfam uses HMMs to define its MSAs– Domain definitions– Used for fold recognition in protein structure

prediction • Nucleotide sequence applications:

– Models of exons, genes, etc. for gene recognition.

Applications of Hidden Markov Models

Page 72: 2015 bioinformatics go_hmm_wim_vancriekinge

• UC Santa Cruz (David Haussler group)– SAM-02 server. Returns alignments, secondary

structure predictions, HMM parameters, etc. etc.– SAM HMM building program

(requires free academic license) • Washington U. St. Louis (Sean Eddy group)

– Pfam. Large database of precomputed HMM-based alignments of proteins

– HMMer, program for building HMMs• Gene finders and other HMMs (more later)

Hidden Markov Models Resources

Page 73: 2015 bioinformatics go_hmm_wim_vancriekinge

Example TMHMM

Beyond Kyte-Doolitlle …

Page 74: 2015 bioinformatics go_hmm_wim_vancriekinge

HMM in protein analysis

• http://www.cse.ucsc.edu/research/compbio/ismb99.handouts/KK185FP.html

Page 75: 2015 bioinformatics go_hmm_wim_vancriekinge
Page 76: 2015 bioinformatics go_hmm_wim_vancriekinge

Hidden Markov model for gene structure

• A representation of the linguistic rules for what features might follow what other features when parsing a sequence consisting of a multiple exon gene.

• A candidate gene structure is created by tracing a path from B to F.• A hidden Markov model (or hidden semi-Markov model) is defined by

attaching stochastic models to each of the arcs and nodes.

Signals (blue nodes):• begin sequence (B)• start translation (S)• donor splice site (D)• acceptor splice site (A)• stop translation (T)• end sequence (F)

Contents (red arcs):• 5’ UTR (J5’)• initial exon (EI)• exon (E)• intron (I)• final exon (EF)• single exon (ES)• 3’ UTR (J3’)

Page 77: 2015 bioinformatics go_hmm_wim_vancriekinge

Classic Programs for gene finding

Some of the best programs are HMM based:• GenScan – http://genes.mit.edu/GENSCAN.html• GeneMark – http://opal.biology.gatech.edu/GeneMark/

Other programs• AAT, EcoParse, Fexeh, Fgeneh, Fgenes, Finex, GeneHacker, GeneID-3,

GeneParser 2, GeneScope, Genie, GenLang, Glimmer, GlimmerM, Grail II, HMMgene, Morgan, MZEF, Procrustes, SORFind, Veil, Xpound

Page 78: 2015 bioinformatics go_hmm_wim_vancriekinge

GENSCANnot to be confused with GeneScan, a commercial product

• A Semi-Markov Model– Explicit model of how long

to stay in a state (ratherthan just self-loops, whichmust be exponentiallydecaying)

• Tracks “phase” of exon orintron (0 coincides with codonboundary, or 1 or 2)

• Tracks strand (and direction)

Hidden Markov Models: Gene Finding Software

Page 79: 2015 bioinformatics go_hmm_wim_vancriekinge

Conservation of Gene Features

Conservation pattern across 3165 mappings of human RefSeq mRNAs to the genome. A program sampled 200 evenly spaced bases across 500 bases upstream of transcription, the 5’ UTR, the first coding exon, introns, middle coding exons, introns, the 3’ UTR and 500 bases after polyadenylatoin. There are peaks of conservation at the transition from one region to another.

50%55%

60%65%

70%75%

80%85%

90%95%

100%

aligning identity

Page 80: 2015 bioinformatics go_hmm_wim_vancriekinge

Composite Approaches

• Use EST info to constrain HMMs (Genie)• Use protein homology info on top of HMMs

(fgenesh++, GenomeScan)• Use cross species genomic alignments on top

of HMMs (twinscan, fgenesh2, SLAM, SGP)

Page 81: 2015 bioinformatics go_hmm_wim_vancriekinge

Gene Prediction: more complex …

1. Species specific 2. Splicing enhancers found in coding regions3. Trans-splicing 4. …

Page 82: 2015 bioinformatics go_hmm_wim_vancriekinge

Length preference

5’ ss intcomp branch 3’ ss

Page 83: 2015 bioinformatics go_hmm_wim_vancriekinge
Page 84: 2015 bioinformatics go_hmm_wim_vancriekinge

Con

tent

s-S

ched

ule

RNA genes

Besides the 6000 protein coding-genes, there is:

140 ribosomal RNA genes275 transfer RNA gnes40 small nuclear RNA genes>100 small nucleolar genes

?

pRNA in 29 rotary packaging motor (Simpson et el. Nature 408:745-750,2000)Cartilage-hair hypoplasmia mapped to an RNA (Ridanpoa et al. Cell 104:195-203,2001)The human Prader-Willi ciritical region (Cavaille et al. PNAS 97:14035-7, 2000)

Page 85: 2015 bioinformatics go_hmm_wim_vancriekinge
Page 86: 2015 bioinformatics go_hmm_wim_vancriekinge
Page 87: 2015 bioinformatics go_hmm_wim_vancriekinge
Page 88: 2015 bioinformatics go_hmm_wim_vancriekinge
Page 89: 2015 bioinformatics go_hmm_wim_vancriekinge

RNA genes can be hard to detects

UGAGGUAGUAGGUUGUAUAGU

C.elegans let-27; 21 nt (Pasquinelli et al. Nature 408:86-89,2000)

Often smallSometimes multicopy and redundantOften not polyadenylated (not represented in ESTs)Immune to frameshift and nonsense mutationsNo open reading frame, no codon biasOften evolving rapidly in primary sequence

miRNA genes

Page 90: 2015 bioinformatics go_hmm_wim_vancriekinge

• Lin-4 identified in a screen for mutations that affect timing and sequence of postembryonic development in C.elegans. Mutants re-iterate L1 instead of later stages of development

• Gene positionally cloned by isolating a 693-bp DNA fragment that can rescue the phenotype of mutant animals

• No protein found but 61-nucleotide precursor RNA with stem-loop structure which is processed to 22-mer ncRNA

• Genetically lin-4 acts as negative regulator of lin-14 and lin-28• The 3’ UTR of the target genes have short stretches of

complementarity to lin-4• Deletion of these lin-4 target seq causes unregulated gof phenotype• Lin-4 RNA inhibits accumulation of LIN-14 and LIN-28 proteins

although the target mRNA

Lin-4

Page 91: 2015 bioinformatics go_hmm_wim_vancriekinge

Let-7 (lethal-7) was also mapped to a ncRNA gene with a 21-nucleotide productThe small let-7 RNA is also thought to be a post-transcriptional negative regulator for lin-41 and lin-42100% conserved in all bilaterally symmetrical animals (not jellyfish and sponges)Sometimes called stRNAs, small temporal RNAs

Let-7(Pasquinelli et al. Nature 408:86-89,2000)

Page 92: 2015 bioinformatics go_hmm_wim_vancriekinge
Page 93: 2015 bioinformatics go_hmm_wim_vancriekinge

Two computational analysis problems

• Similarity search (eg BLAST), I give you a query, you find sequences in a database that look like the query (note: SW/Blat)– For RNA, you want to take the secondary structure of

the query into account• Genefinding. Based solely on a priori knowledge

of what a “gene” looks like, find genes in a genome sequence– For RNA, with no open reading frame and no codon

bias, what do you look for ?

Page 94: 2015 bioinformatics go_hmm_wim_vancriekinge
Page 95: 2015 bioinformatics go_hmm_wim_vancriekinge
Page 96: 2015 bioinformatics go_hmm_wim_vancriekinge
Page 97: 2015 bioinformatics go_hmm_wim_vancriekinge
Page 98: 2015 bioinformatics go_hmm_wim_vancriekinge
Page 99: 2015 bioinformatics go_hmm_wim_vancriekinge
Page 100: 2015 bioinformatics go_hmm_wim_vancriekinge

Basic CFG“production rules”

S -> aSS -> SaS -> aSuS -> SS

Context-free grammers

A CFG “derivation”

S -> aS

Page 101: 2015 bioinformatics go_hmm_wim_vancriekinge

Basic CFG“production rules”

S -> aSS -> SaS -> aSuS -> SS

Context-free grammers

A CFG “derivation”

S -> aSS -> aaS

Page 102: 2015 bioinformatics go_hmm_wim_vancriekinge

Basic CFG“production rules”

S -> aSS -> SaS -> aSuS -> SS

Context-free grammers

A CFG “derivation”

S -> aSS -> aaSS -> aaSS

Page 103: 2015 bioinformatics go_hmm_wim_vancriekinge

Basic CFG“production rules”

S -> aSS -> SaS -> aSuS -> SS

Context-free grammers

A CFG “derivation”

S -> aSS -> aaSS -> aaSSS -> aagScuS

Page 104: 2015 bioinformatics go_hmm_wim_vancriekinge

Basic CFG“production rules”

S -> aSS -> SaS -> aSuS -> SS

Context-free grammers

A CFG “derivation”

S -> aSS -> aaSS -> aaSSS -> aagScuS

Page 105: 2015 bioinformatics go_hmm_wim_vancriekinge

Basic CFG“production rules”

S -> aSS -> SaS -> aSuS -> SS

Context-free grammers

A CFG “derivation”

S -> aSS -> aaSS -> aaSSS -> aagScuSS -> aagaSucugSc

Page 106: 2015 bioinformatics go_hmm_wim_vancriekinge

Basic CFG“production rules”

S -> aSS -> SaS -> aSuS -> SS

Context-free grammers

A CFG “derivation”

S -> aSS -> aaSS -> aaSSS -> aagScuSS -> aagaSucugScS -> aagaSaucuggSccS -> aagacSgaucuggcgSccc

Page 107: 2015 bioinformatics go_hmm_wim_vancriekinge

Basic CFG“production rules”

S -> aSS -> SaS -> aSuS -> SS

Context-free grammers

A CFG “derivation”

S -> aSS -> aaSS -> aaSSS -> aagScuSS -> aagaSucugScS -> aagaSaucuggSccS -> aagacSgaucuggcgScccS -> aagacuSgaucuggcgScccS -> aagacuuSgaucuggcgaScccS -> aagacuucSgaucuggcgacScccS -> aagacuucgSgaucuggcgacaScccS -> aagacuucggaucuggcgacaccc

Page 108: 2015 bioinformatics go_hmm_wim_vancriekinge

Basic CFG“production rules”

S -> aSS -> SaS -> aSuS -> SS

Context-free grammers

A CFG “derivation”

S -> aSS -> aaSS -> aaSSS -> aagScuSS -> aagaSucugScS -> aagaSaucuggSccS -> aagacSgaucuggcgScccS -> aagacuSgaucuggcgScccS -> aagacuuSgaucuggcgaScccS -> aagacuucSgaucuggcgacScccS -> aagacuucgSgaucuggcgacaScccS -> aagacuucggaucuggcgacaccc

Page 109: 2015 bioinformatics go_hmm_wim_vancriekinge

Basic CFG“production rules”

S -> aSS -> SaS -> aSuS -> SS

Context-free grammers

A CFG “derivation”

S -> aSS -> aaSS -> aaSSS -> aagScuSS -> aagaSucugScS -> aagaSaucuggSccS -> aagacSgaucuggcgScccS -> aagacuSgaucuggcgScccS -> aagacuuSgaucuggcgaScccS -> aagacuucSgaucuggcgacScccS -> aagacuucgSgaucuggcgacaScccS -> aagacuucggaucuggcgacaccc

A

C

G

U

*

A

AA

A

A

GG

G G GC

C

CC

CCC

U

U

U

*

*

* * *

Page 110: 2015 bioinformatics go_hmm_wim_vancriekinge
Page 111: 2015 bioinformatics go_hmm_wim_vancriekinge
Page 112: 2015 bioinformatics go_hmm_wim_vancriekinge

The power of comparative analysis• Comparative genome analysis is an indispensable means

of inferring whether a locus produces a ncRNA as opposed to encoding a protein.

• For a small gene to be called a protein-coding gene, one excellent line of evidence is that the ORF is significantly conserved in another related species.

• It is more difficult to positively corroborate a ncRNA by comparative analysis but, in at least some cases, a ncRNA might conserve an intramolecular secondary structure and comparative analysis can show compensatory base substitutions.

• With comparative genome sequence data now accumulating in the public domain for most if not all important genetic systems, comparative analysis can (and should) become routine.

Page 113: 2015 bioinformatics go_hmm_wim_vancriekinge

Compensatory substitutions that maintain the structure

U U

C G U A A UG CA UCGAC 3’

G C

5’

Page 114: 2015 bioinformatics go_hmm_wim_vancriekinge

Evolutionary conservation of RNA molecules can be revealed by identification of compensatory substitutions

Page 115: 2015 bioinformatics go_hmm_wim_vancriekinge

…………

Page 116: 2015 bioinformatics go_hmm_wim_vancriekinge

• Manual annotation of 60,770 full-length mouse complementary DNA sequences, clustered into 33,409 ‘transcriptional units’, contributing 90.1% of a newly established mouse transcriptome database.

• Of these transcriptional units, 4,258 are new protein-coding and 11,665 are new non-coding messages, indicating that non-coding RNA is a major component of the transcriptome.

Page 117: 2015 bioinformatics go_hmm_wim_vancriekinge

Function on ncRNAs

Page 118: 2015 bioinformatics go_hmm_wim_vancriekinge

ncRNAs & RNAi

Page 119: 2015 bioinformatics go_hmm_wim_vancriekinge

Therapeutic Applications

• Shooting millions of tiny RNA molecules into a mouse’s bloodstream can protect its liver from the ravages of hepatitis, a new study shows. In this case, they blunt the liver’s selfdestructive inflammatory response, which can be triggered by agents such as the hepatitis B or C viruses. (Harvard University immunologists Judy Lieberman and Premlata Shankar)

• In a series of experiments published online this week by Nature Medicine, Lieberman’s team gave mice injections of siRNAs designed to shut down a gene called Fas. When overactivated during an inflammatory response, it induces liver cells to self-destruct. The next day, the animals were given an antibody that sends Fas into hyperdrive. Control mice died of acute liver failure within a few days, but 82% of the siRNA-treated mice remained free of serious disease and survived. Between 80% and 90% of their liver cells had incorporated the siRNAs.

Page 120: 2015 bioinformatics go_hmm_wim_vancriekinge