CS5263 Bioinformatics RNA Secondary Structure Prediction


Page 1: CS5263 Bioinformatics RNA Secondary Structure Prediction

CS5263 Bioinformatics

RNA Secondary Structure Prediction

Page 2: CS5263 Bioinformatics RNA Secondary Structure Prediction

Outline

• Biological roles for RNA

• RNA secondary structure
– What’s “secondary structure”?
– How is it represented?
– Why is it important?

• How to predict?

Page 3: CS5263 Bioinformatics RNA Secondary Structure Prediction

Central dogma

The flow of genetic information

DNA → RNA (transcription) → Protein (translation)

DNA → DNA (replication)

Page 4: CS5263 Bioinformatics RNA Secondary Structure Prediction

Classical Roles for RNA

• mRNA

• tRNA

• rRNA

Ribosome

Page 5: CS5263 Bioinformatics RNA Secondary Structure Prediction

“Semi-classical” RNA

• snRNA - small nuclear RNA (60-300 nt), involved in splicing (removing introns), etc.
• RNaseP - tRNA processing (~300 nt)
• SRP RNA - signal recognition particle RNA: membrane targeting (~100-300 nt)
• tmRNA - resetting stalled ribosomes, destroying aberrant mRNA
• Telomerase - (200-400 nt)
• snoRNA - small nucleolar RNA (many varieties; 80-200 nt)

Page 6: CS5263 Bioinformatics RNA Secondary Structure Prediction

Non-coding RNAs

Dramatic discoveries in the last 10 years
• 100s of new families
• Many roles: regulation, transport, stability, catalysis, …
• siRNA: small interfering RNA (Nobel prize 2006) and miRNAs: both are ~21-23 nt
– Regulating gene expression
– Evidence of disease association
• 1% of DNA codes for protein, but 30% of it is copied into RNA, i.e. ncRNA >> mRNA

Page 7: CS5263 Bioinformatics RNA Secondary Structure Prediction

Take-home message

• RNAs play many important roles in the cell beyond the classical roles
– Many of which are yet to be discovered

• RNA functions are determined by structures

Page 8: CS5263 Bioinformatics RNA Secondary Structure Prediction

Example: Riboswitch

• Riboswitch: an mRNA regulates its own activity

Page 9: CS5263 Bioinformatics RNA Secondary Structure Prediction

RNA structure

• Primary: sequence

• Secondary: base-pairing

• Tertiary: 3D shape

Page 10: CS5263 Bioinformatics RNA Secondary Structure Prediction

RNA base-pairing

• Watson-Crick pairing
– C-G ~3 kcal/mole
– A-U ~2 kcal/mole

• “Wobble pair” G-U ~1 kcal/mole

• Non-canonical Pairs

Page 11: CS5263 Bioinformatics RNA Secondary Structure Prediction

tRNA structure

Page 12: CS5263 Bioinformatics RNA Secondary Structure Prediction

Secondary structure prediction

• Given: CAUUUGUGUACCU…
• Goal: [Figure: secondary structure of the sequence]

• How can we compute that?

Page 13: CS5263 Bioinformatics RNA Secondary Structure Prediction

Terminology

[Figure: secondary structure with labeled elements]
• Hairpin loops
• Stems
• Bulge loop
• Interior loops
• Multi-branched loop

Page 14: CS5263 Bioinformatics RNA Secondary Structure Prediction

Pseudoknot

• Makes structure prediction hard. Not considered in most algorithms.

[Figure: pseudoknot example on the sequence 5’-ucgacuguaaaaaagcgggcgacuuucagucgcucuuuuugucgcgcgc-3’]

Page 15: CS5263 Bioinformatics RNA Secondary Structure Prediction

The Nussinov algorithm

• Goal: maximizing the number of base-pairs

• Idea: Dynamic programming
– Loop matching
– Nussinov, Pieczenik, Griggs, Kleitman ’78

• Too simple for accurate prediction, but stepping-stone for later algorithms

Page 16: CS5263 Bioinformatics RNA Secondary Structure Prediction

The Nussinov algorithm

Problem:

Find the RNA structure with the maximum (weighted) number of nested pairings

Nested: no pseudoknot

[Figure: nested secondary structure of the sequence ACCACGCUUAAGACACCUAGCUUGUGUCCUGGAGGUCUAUAAGUCAGACCGCGAGAGGGAAGACUCGUAUAAGCG]

Page 17: CS5263 Bioinformatics RNA Secondary Structure Prediction

The Nussinov algorithm

• Given sequence $X = x_1 \dots x_N$,

• Define DP matrix: $F(i, j)$ = maximum number of base-pairs if $x_i \dots x_j$ folds optimally
– Matrix is symmetric, so let i < j

Page 18: CS5263 Bioinformatics RNA Secondary Structure Prediction

The Nussinov algorithm

• Can be summarized into two cases:
– (i, j) paired: optimal score is $1 + F(i+1, j-1)$
– (i, j) unpaired: optimal score is $\max_{k = i..j-1} \left[ F(i, k) + F(k+1, j) \right]$

Page 19: CS5263 Bioinformatics RNA Secondary Structure Prediction

The Nussinov algorithm

• $F(i, i) = 0$

• $$F(i, j) = \max \begin{cases} F(i+1, j-1) + S(x_i, x_j) \\ \max_{k = i..j-1} \left[ F(i, k) + F(k+1, j) \right] \end{cases}$$

• $S(x_i, x_j) = 1$ if $x_i$, $x_j$ can form a base-pair, and 0 otherwise
– Generalize: S(A, U) = 2, S(C, G) = 3, S(G, U) = 1
– Or other types of scores (later)

• F(1, N) gives the optimal score for the whole seq
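The recurrence can be tried out directly as a memoized recursion. This is only a minimal sketch: the names `PAIR_SCORE`, `S` and `nussinov_score` are illustrative, indices are 0-based, and no minimum loop length is enforced yet, as on this slide.

```python
from functools import lru_cache

# Generalized pair scores from the slide: S(A,U) = 2, S(C,G) = 3, S(G,U) = 1.
PAIR_SCORE = {frozenset("AU"): 2, frozenset("CG"): 3, frozenset("GU"): 1}


def S(a, b):
    """Score for pairing bases a and b; 0 if they cannot pair."""
    return PAIR_SCORE.get(frozenset((a, b)), 0)


def nussinov_score(x):
    """Return F(1, N) for sequence x, evaluated top-down (0-based indices)."""

    @lru_cache(maxsize=None)
    def F(i, j):
        if i >= j:  # F(i, i) = 0; empty spans also score 0
            return 0
        paired = F(i + 1, j - 1) + S(x[i], x[j])
        split = max(F(i, k) + F(k + 1, j) for k in range(i, j))
        return max(paired, split)

    return F(0, len(x) - 1) if x else 0
```

For instance, `nussinov_score("GCAU")` combines a G-C pair (3) and an A-U pair (2) through the split case.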

Page 20: CS5263 Bioinformatics RNA Secondary Structure Prediction

How to fill in the DP matrix?

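• $$F(i, j) = \max \begin{cases} F(i+1, j-1) + S(x_i, x_j) \\ \max_{k = i..j-1} \left[ F(i, k) + F(k+1, j) \right] \end{cases}$$

[Figure: upper-triangular DP matrix with zeros on the main diagonal; cell (i, j) is computed from cell (i+1, j-1) and from pairs of cells in row i and column j]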

Page 21: CS5263 Bioinformatics RNA Secondary Structure Prediction

How to fill in the DP matrix?

Same recurrence for F(i, j); fill the diagonal j – i = 1.

[Figure: DP matrix with zeros on the main diagonal and the diagonal j – i = 1 highlighted]

Page 22: CS5263 Bioinformatics RNA Secondary Structure Prediction

How to fill in the DP matrix?

Same recurrence for F(i, j); fill the diagonal j – i = 2.

[Figure: DP matrix with zeros on the main diagonal and the diagonal j – i = 2 highlighted]

Page 23: CS5263 Bioinformatics RNA Secondary Structure Prediction

How to fill in the DP matrix?

Same recurrence for F(i, j); fill the diagonal j – i = 3.

[Figure: DP matrix with zeros on the main diagonal and the diagonal j – i = 3 highlighted]

Page 24: CS5263 Bioinformatics RNA Secondary Structure Prediction

How to fill in the DP matrix?

Same recurrence for F(i, j); continue until the last diagonal, j – i = N - 1, which holds only F(1, N).

[Figure: DP matrix with zeros on the main diagonal and the corner cell j – i = N - 1 highlighted]

Page 25: CS5263 Bioinformatics RNA Secondary Structure Prediction

Minimum Loop length

• Sharp turns unlikely
• Let minimum length of hairpin loop be 1 (3 in real preds)
• F(i, i+1) = 0

[Figure: DP matrix with zeros on the main diagonal and on the diagonal j – i = 1, next to a small hairpin example]

Page 26: CS5263 Bioinformatics RNA Secondary Structure Prediction

Algorithm

Initialization:
  F(i, i) = 0, for i = 1 to N
  F(i, i+1) = 0, for i = 1 to N-1

Iteration:
  For L = 2 to N-1
    For i = 1 to N – L
      j = min(i + L, N)
      F(i, j) = max { F(i+1, j-1) + s(xi, xj),  max{ i ≤ k < j } F(i, k) + F(k+1, j) }

Termination:
  Best score is given by F(1, N)
  (For traceback, refer to the Durbin book)

Note: the loop starts at L = 2 because the cells with j – i = 1 are fixed to 0 by the initialization (minimum hairpin loop length 1).
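The algorithm above can be sketched bottom-up as follows. This is a hedged illustration rather than a reference implementation: `can_pair`, `nussinov_fill` and the `min_loop` parameter are names chosen here, indices are 0-based, and each allowed pair (A-U, G-C, G-U) counts 1.

```python
def can_pair(a, b):
    """True for the pairs counted in these slides: A-U, G-C and G-U."""
    return {a, b} in ({"A", "U"}, {"G", "C"}, {"G", "U"})


def nussinov_fill(x, min_loop=1):
    """Fill the DP matrix F diagonal by diagonal and return it (0-based)."""
    n = len(x)
    # Cells with j - i <= min_loop stay 0 (initialization).
    F = [[0] * n for _ in range(n)]
    for L in range(min_loop + 1, n):        # L = j - i
        for i in range(n - L):
            j = i + L
            # Case 1: i and j are paired (adds 1 only if they can pair).
            best = F[i + 1][j - 1] + (1 if can_pair(x[i], x[j]) else 0)
            # Case 2: split the interval at k.
            for k in range(i, j):
                best = max(best, F[i][k] + F[k + 1][j])
            F[i][j] = best
    return F
```

The best score for the whole sequence is then `F[0][len(x) - 1]`, the 0-based counterpart of F(1, N).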

Page 27: CS5263 Bioinformatics RNA Secondary Structure Prediction

Complexity

For L = 2 to N-1
  For i = 1 to N – L
    j = min(i + L, N)
    F(i, j) = max { F(i+1, j-1) + s(xi, xj),  max{ i ≤ k < j } F(i, k) + F(k+1, j) }

• Time complexity: O(N^3)

• Memory: O(N^2)
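The cubic bound can be checked by counting split points. As a sketch, assume every span L = j – i from 1 to N-1 is visited (starting at a larger L only lowers the count): there are N – L cells on diagonal L, and each one scans L values of k.

$$\sum_{L=1}^{N-1} (N - L)\,L \;=\; N \cdot \frac{N(N-1)}{2} - \frac{(N-1)N(2N-1)}{6} \;=\; \frac{N^3 - N}{6} \;=\; O(N^3)$$

The matrix itself has $N(N+1)/2$ cells with $i \le j$, which gives the $O(N^2)$ memory bound.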

Page 28: CS5263 Bioinformatics RNA Secondary Structure Prediction

Example

• RNA sequence: GGGAAAUCC

• Only count # of base-pairs
– A-U = 1
– G-C = 1
– G-U = 1

• Minimum hairpin loop length = 1

Page 29: CS5263 Bioinformatics RNA Secondary Structure Prediction

DP matrix after initialization (main diagonal and j – i = 1):

  G G G A A A U C C
G 0 0
G   0 0
G     0 0
A       0 0
A         0 0
A           0 0
U             0 0
C               0 0
C                 0

Page 30: CS5263 Bioinformatics RNA Secondary Structure Prediction

DP matrix after filling j – i = 2:

  G G G A A A U C C
G 0 0 0
G   0 0 0
G     0 0 0
A       0 0 0
A         0 0 1
A           0 0 0
U             0 0 0
C               0 0
C                 0

Page 31: CS5263 Bioinformatics RNA Secondary Structure Prediction

DP matrix after filling j – i = 3:

  G G G A A A U C C
G 0 0 0 0
G   0 0 0 0
G     0 0 0 0
A       0 0 0 1
A         0 0 1 1
A           0 0 0 0
U             0 0 0
C               0 0
C                 0

Page 32: CS5263 Bioinformatics RNA Secondary Structure Prediction

DP matrix after filling j – i = 4:

  G G G A A A U C C
G 0 0 0 0 0
G   0 0 0 0 0
G     0 0 0 0 1
A       0 0 0 1 1
A         0 0 1 1 1
A           0 0 0 0
U             0 0 0
C               0 0
C                 0

Page 33: CS5263 Bioinformatics RNA Secondary Structure Prediction

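Completed DP matrix; the best score is F(1, 9) = 3:

  G G G A A A U C C
G 0 0 0 0 0 0 1 2 3
G   0 0 0 0 0 1 2 3
G     0 0 0 0 1 2 2
A       0 0 0 1 1 1
A         0 0 1 1 1
A           0 0 0 0
U             0 0 0
C               0 0
C                 0

[Figure: three alternative optimal structures, each with 3 base pairs, obtained by tracing back from F(1, 9)]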
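The slides defer traceback to the Durbin book. The following is a rough sketch of one possible traceback, not the book's exact procedure: it returns a single optimal structure in dot-bracket notation, and `can_pair`, `fill` and `traceback` are illustrative names. Ties are broken by preferring the "paired" case, so only one of the alternative optima is reported.

```python
def can_pair(a, b):
    """True for the pairs counted in the example: A-U, G-C and G-U."""
    return {a, b} in ({"A", "U"}, {"G", "C"}, {"G", "U"})


def fill(x, min_loop=1):
    """Nussinov matrix (0-based), counting 1 per base pair."""
    n = len(x)
    F = [[0] * n for _ in range(n)]
    for L in range(min_loop + 1, n):
        for i in range(n - L):
            j = i + L
            best = F[i + 1][j - 1] + int(can_pair(x[i], x[j]))
            for k in range(i, j):
                best = max(best, F[i][k] + F[k + 1][j])
            F[i][j] = best
    return F


def traceback(x, F, min_loop=1):
    """Return one optimal structure as a dot-bracket string."""
    pairs = []
    stack = [(0, len(x) - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:          # too short to contain a pair
            continue
        if can_pair(x[i], x[j]) and F[i][j] == F[i + 1][j - 1] + 1:
            pairs.append((i, j))       # case 1: i pairs with j
            stack.append((i + 1, j - 1))
            continue
        for k in range(i, j):          # case 2: find a split that explains F[i][j]
            if F[i][j] == F[i][k] + F[k + 1][j]:
                stack.append((i, k))
                stack.append((k + 1, j))
                break
    structure = ["."] * len(x)
    for i, j in pairs:
        structure[i], structure[j] = "(", ")"
    return "".join(structure)


seq = "GGGAAAUCC"
print(traceback(seq, fill(seq)))  # (((...)))
```

Enumerating all optimal structures would require following every case that attains F(i, j) instead of stopping at the first one.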


Page 37: CS5263 Bioinformatics RNA Secondary Structure Prediction

Energy minimization

For L = 2 to N-1
  For i = 1 to N – L
    j = min(i + L, N)
    E(i, j) = min { E(i+1, j-1) + e(xi, xj),  min{ i ≤ k < j } E(i, k) + E(k+1, j) }

e(xi, xj) represents the energy of xi forming a base pair with xj

• Energies are negative values, so we minimize rather than maximize.

• More complex energy rules: energy depends on neighboring bases
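The same loop with `min` in place of `max` can be sketched as below. The pair energies are an assumption made for illustration: they are simply the negated rough base-pair strengths quoted earlier in these slides (C-G ~3, A-U ~2, G-U ~1 kcal/mole), not real nearest-neighbor parameters, and the function names are hypothetical.

```python
# Illustrative pair energies in kcal/mol (assumption: negated rough values
# from the base-pairing slide, not measured thermodynamic parameters).
PAIR_ENERGY = {frozenset("GC"): -3.0, frozenset("AU"): -2.0, frozenset("GU"): -1.0}


def e(a, b):
    """Energy of pairing a with b; 0.0 if they cannot pair."""
    return PAIR_ENERGY.get(frozenset((a, b)), 0.0)


def min_energy(x, min_loop=1):
    """Return E(1, N): the minimum total pair energy over nested structures."""
    n = len(x)
    E = [[0.0] * n for _ in range(n)]
    for L in range(min_loop + 1, n):
        for i in range(n - L):
            j = i + L
            best = E[i + 1][j - 1] + e(x[i], x[j])   # i pairs with j
            for k in range(i, j):                    # or split at k
                best = min(best, E[i][k] + E[k + 1][j])
            E[i][j] = best
    return E[0][n - 1] if n else 0.0


print(min_energy("GGGAAAUCC"))  # -8.0
```

With these toy energies the optimum for GGGAAAUCC uses two G-C pairs and one A-U pair, which differs from some of the structures that merely maximize the pair count.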

Page 38: CS5263 Bioinformatics RNA Secondary Structure Prediction

More realistic energy rules

[Figure: hairpin structure with a 1nt bulge, annotated with energy contributions in kcal/mol]

• 4nt hairpin: +5.9
• Terminal mismatch of hairpin: -1.1
• Stacking: -2.9
• Stacking (special for 1nt bulge): -2.9
• 1nt bulge: +3.3
• Stack: -1.8
• Stack: -0.9
• Stack: -1.8
• Stack: -2.1
• 5’-dangle: -0.3
• Unstructured: 0

Overall ΔG = -4.6 kcal/mol

Complete energy rules at http://www.bioinfo.rpi.edu/zukerm/cgi-bin/efiles.cgi
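The overall value is the sum of the individual contributions listed on this slide. A small check, with the labels copied from the slide and the list layout chosen here for illustration:

```python
# Energy contributions (kcal/mol) as annotated on the slide.
contributions = [
    ("4nt hairpin", 5.9),
    ("terminal mismatch of hairpin", -1.1),
    ("stacking", -2.9),
    ("stacking (special for 1nt bulge)", -2.9),
    ("1nt bulge", 3.3),
    ("stack", -1.8),
    ("stack", -0.9),
    ("stack", -1.8),
    ("stack", -2.1),
    ("5'-dangle", -0.3),
    ("unstructured", 0.0),
]

# The loop terms are positive (destabilizing), the stacking terms negative.
total = sum(value for _, value in contributions)
print(f"Overall dG = {total:.1f} kcal/mol")  # Overall dG = -4.6 kcal/mol
```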

Page 39: CS5263 Bioinformatics RNA Secondary Structure Prediction

The Zuker algorithm – main ideas

1. Instead of base pairs, pairs of base pairs (more accurate)
2. Separate score for bulges
3. Separate score for different sizes & compositions of loops
4. Separate score for interactions between stem & beginning of loop
5. Use an additional matrix to remember the current state, e.g., to model stacking energy:
• W(i, j): energy of the best structure on i, j
• V(i, j): energy of the best structure on i, j given that i, j are paired
• Similar to affine-gap alignment.

Page 40: CS5263 Bioinformatics RNA Secondary Structure Prediction

Two popular implementations

• mfold (Zuker): http://mfold.bioinfo.rpi.edu/

• RNAfold in the Vienna package (Hofacker): http://www.tbi.univie.ac.at/~ivo/RNA/

Page 41: CS5263 Bioinformatics RNA Secondary Structure Prediction

Accuracy

• 50-70% for sequences up to 300 nt
• Not perfect, but useful
• Possible reasons:
– Energy rule not perfect: 5-10% error
– Many alternative structures within this error range
– Alternative structures do exist
– Structure may change in presence of other molecules

Page 42: CS5263 Bioinformatics RNA Secondary Structure Prediction

Comparative structure prediction

To maintain structure, two nucleotides that form a base-pair tend to mutate together

Given K homologous aligned RNA sequences:

Human aagacuucggaucuggcgacaccc
Mouse uacacuucggaugacaccaaagug
Worm  aggucuucggcacgggcaccauuc
Fly   ccaacuucggauuuugcuaccaua
Orc   aagccuucggagcgggcguaacuc

If the bases at the ith and jth positions can always form a base pair and covary across the sequences, then the two positions are likely to be paired in the structure

Page 43: CS5263 Bioinformatics RNA Secondary Structure Prediction

Mutual information

$f_{ab}(i, j)$: probability for a, b to be in positions i, j

$f_a(i)$: probability for a to be in position i

$$M(i, j) = \sum_{a, b \in \{A, C, G, T\}} f_{ab}(i, j) \log_2 \frac{f_{ab}(i, j)}{f_a(i)\, f_b(j)} = H(i) - H(i \mid j)$$

Example alignment:

aagacuucggaucuggcgacaccc
uacacuucggaugacaccaaagug
aggucuucggcacgggcaccauuc
ccaacuucggauuuugcuaccaua
aagccuucggagcgggcguaacuc

$f_{gc}(3,13) = 3/5$, $f_{cg}(3,13) = 1/5$, $f_{au}(3,13) = 1/5$

$f_g(3) = 3/5$, $f_c(3) = 1/5$, $f_a(3) = 1/5$

$f_c(13) = 3/5$, $f_g(13) = 1/5$, $f_u(13) = 1/5$

$$M(3, 13) = 0.6 \log_2 \frac{0.6}{0.6 \times 0.6} + 0.2 \log_2 \frac{0.2}{0.2 \times 0.2} + 0.2 \log_2 \frac{0.2}{0.2 \times 0.2} = 1.37$$
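The worked example can be reproduced with a few lines of code. This is a minimal sketch: `mutual_information` is a name chosen here, columns are 1-based as on the slide, and frequencies are plain counts divided by the number of sequences (no pseudocounts or gap handling).

```python
from collections import Counter
from math import log2

# The five aligned sequences from the slide (Human, Mouse, Worm, Fly, Orc).
ALIGNMENT = [
    "aagacuucggaucuggcgacaccc",
    "uacacuucggaugacaccaaagug",
    "aggucuucggcacgggcaccauuc",
    "ccaacuucggauuuugcuaccaua",
    "aagccuucggagcgggcguaacuc",
]


def mutual_information(seqs, i, j):
    """M(i, j) in bits for 1-based alignment columns i and j."""
    n = len(seqs)
    col_i = [s[i - 1] for s in seqs]
    col_j = [s[j - 1] for s in seqs]
    f_i = Counter(col_i)                 # single-column counts
    f_j = Counter(col_j)
    f_ij = Counter(zip(col_i, col_j))    # joint counts of (a, b)
    return sum(
        (c / n) * log2((c / n) / ((f_i[a] / n) * (f_j[b] / n)))
        for (a, b), c in f_ij.items()
    )


print(round(mutual_information(ALIGNMENT, 3, 13), 2))  # 1.37
```

Only pairs (a, b) that actually occur contribute to the sum, which matches the convention that terms with $f_{ab} = 0$ are taken as 0.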

Page 44: CS5263 Bioinformatics RNA Secondary Structure Prediction

Mutual information

• Also called covariance score
• M is high if base a in position i is always followed by base b in position j
– Does not require a to base-pair with b
– Advantage: can detect non-canonical base-pairs

• However, M = 0 if there is no mutation at all, even if the two positions form perfect base-pairs

$$M(i, j) = \sum_{a, b \in \{A, C, G, T\}} f_{ab}(i, j) \log_2 \frac{f_{ab}(i, j)}{f_a(i)\, f_b(j)}$$

One way around this is to combine covariance and energy scores
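To see why fully conserved columns score zero, take a hypothetical pair of columns where position i is always G and position j is always C, so that $f_G(i) = f_C(j) = f_{GC}(i, j) = 1$ and every other frequency is 0:

$$M(i, j) = f_{GC}(i, j) \log_2 \frac{f_{GC}(i, j)}{f_G(i)\, f_C(j)} = 1 \cdot \log_2 \frac{1}{1 \cdot 1} = 0$$

The two positions may well form a perfect G-C pair in every sequence, but without variation there is no covariation to measure, which is what motivates adding an energy term.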

Page 45: CS5263 Bioinformatics RNA Secondary Structure Prediction

Comparative structure prediction

• Given a multiple alignment, can infer structure that maximizes the sum of mutual information, by DP

• However, alignment is hard, since structure often more important than sequence

Page 46: CS5263 Bioinformatics RNA Secondary Structure Prediction

Comparative structure prediction

In practice:
1. Get multiple alignment
2. Find covarying bases – deduce structure
3. Improve multiple alignment (by hand)
4. Go to 2

A manual EM process!!

Page 47: CS5263 Bioinformatics RNA Secondary Structure Prediction

Comparative structure prediction

• Align then fold

• Fold then align

• Align and fold

Page 48: CS5263 Bioinformatics RNA Secondary Structure Prediction

Context-free Grammar for RNA Secondary Structure

• S = SS | aSu | cSg | uSa | gSc | L

• L = aL | cL | gL | uL | ε

[Figure: parse tree built from S and L nonterminals, deriving the sequence a c g g a g u g c c c g u]

Page 49: CS5263 Bioinformatics RNA Secondary Structure Prediction

Stochastic Context-free Grammar (SCFG)

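• Probabilistic context-free grammar
• Probabilities can be converted into weights
• CFG vs SCFG is similar to RG vs HMM

Productions and their weights:
• S = SS (weight 0)
• S = aSu | uSa (weight 2)
• S = cSg | gSc (weight 3)
• S = uSg | gSu (weight 1)
• S = L (weight 0)
• L = aL | cL | gL | uL | ε (weight 0)

$$S(i, j) = \max \begin{cases} e(x_i, x_j) + S(i+1, j-1) \\ L(i, j) \\ \max_k \left[ S(i, k) + S(k+1, j) \right] \end{cases} \qquad L(i, j) = 0$$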
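The weighted grammar translates into a small CYK-style table fill. The sketch below is an illustration under stated assumptions: `cyk_score` and `PAIR_WEIGHT` are names chosen here, spans are 0-based and inclusive, and, like the grammar as written, it enforces no minimum loop length.

```python
# Weights of the pairing productions: aSu/uSa = 2, cSg/gSc = 3, uSg/gSu = 1.
PAIR_WEIGHT = {frozenset("au"): 2, frozenset("cg"): 3, frozenset("gu"): 1}


def cyk_score(x):
    """Best total weight of a parse of x starting from S."""
    x = x.lower()
    n = len(x)
    S = [[0] * n for _ in range(n)]   # S[i][i] = 0 via S -> L -> xL -> x

    def inner(i, j):
        # The empty span is derived by S -> L -> epsilon with weight 0.
        return S[i][j] if i <= j else 0

    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = 0                                   # S -> L, L(i, j) = 0
            w = PAIR_WEIGHT.get(frozenset((x[i], x[j])))
            if w is not None:                          # S -> aSu, cSg, ...
                best = max(best, w + inner(i + 1, j - 1))
            for k in range(i, j):                      # S -> SS
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S[0][n - 1] if n else 0
```

For example, `cyk_score("gauc")` nests an a-u pair (2) inside a g-c pair (3), and `cyk_score("gcau")` reaches the same total through the S = SS production.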

Page 50: CS5263 Bioinformatics RNA Secondary Structure Prediction

SCFG Decoding

• Decoding: given a grammar (SCFG/HMM) and a sequence, find the best parse (highest probability or score)
– Cocke-Younger-Kasami (CYK) algorithm (analogous to Viterbi in HMM)
– The Nussinov and Zuker algorithms are essentially special cases of CYK
– CYK and SCFG are also used in other domains (NLP, compilers, etc.)

Page 51: CS5263 Bioinformatics RNA Secondary Structure Prediction

SCFG Evaluation

• Given a sequence and an SCFG model
– Estimate P(seq is generated by model), summing over all possible paths (analogous to the forward algorithm in HMM)

• Inside-outside algorithm
– Analogous to forward-backward
– Inside: bottom-up parsing (P(xi..xj))
– Outside: top-down parsing (P(x1..xi-1 xj+1..xN))

• Can calculate base-pairing probability
– Analogous to posterior decoding
– Essentially the same idea implemented in the Vienna RNAfold package
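The step from "best parse" to "sum over all parses" can be illustrated in a simplified setting. The sketch below is not the inside algorithm itself: it counts nested structures, which is what summing gives when every structure has weight 1. It uses an unambiguous decomposition (the last base is either unpaired or paired with some k) so that no structure is counted twice; `count_structures` and `min_loop` are names assumed for this example.

```python
def can_pair(a, b):
    """True for A-U, G-C and G-U."""
    return {a, b} in ({"A", "U"}, {"G", "C"}, {"G", "U"})


def count_structures(x, min_loop=1):
    """Number of nested structures on x (including the unpaired one)."""
    n = len(x)
    # C[i][j] counts structures on the half-open interval x[i:j];
    # empty and too-short intervals have exactly one structure.
    C = [[1] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            total = C[i][j - 1]                    # last base x[j-1] unpaired
            for k in range(i, j - 1 - min_loop):   # last base paired with x[k]
                if can_pair(x[k], x[j - 1]):
                    total += C[i][k] * C[k + 1][j - 1]
            C[i][j] = total
    return C[0][n]
```

Replacing the unit weights with rule probabilities (or Boltzmann factors) turns the same sum-of-products recursion into an inside-style computation.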

Page 52: CS5263 Bioinformatics RNA Secondary Structure Prediction

SCFG Learning

• Covariance model: similar to profile HMMs
– Given a set of sequences with common structures, simultaneously learn SCFG parameters and optimally parse sequences into states
– EM on SCFG
– Inside-outside algorithm
– Efficiency is a bottleneck

• Has been successfully applied to predict tRNA genes and structures
– tRNAScan

Page 53: CS5263 Bioinformatics RNA Secondary Structure Prediction

Summary: SCFG and HMM algorithms

GOAL               HMM algorithm       SCFG algorithm
Optimal parse      Viterbi             CYK
Estimation         Forward, Backward   Inside, Outside
Learning           EM: Fw/Bck          EM: Ins/Outs
Memory complexity  O(N K)              O(N^2 K)
Time complexity    O(N K^2)            O(N^3 K^3)

Where K: # of states in the HMM, or # of nonterminal symbols in the SCFG

Page 54: CS5263 Bioinformatics RNA Secondary Structure Prediction

Open research problems

• ncRNA gene prediction
• ncRNA regulatory networks
• Structure prediction
– Secondary, including pseudoknots
– Tertiary
• Structural comparison tools
– Structural alignment
• Structure search tools
– “RNA-BLAST”
• Structural motif finding
– “RNA-MEME”