
RNA secondary structure prediction

RNA Basics

• RNA bases A,C,G,U

• Canonical Base Pairs

– A-U

– G-C

– G-U (“wobble” pairing)

– Bases can only pair with one other base.
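The pairing rules listed above can be written as a small lookup. This is a minimal sketch; the names `CANONICAL_PAIRS`, `can_pair` and the `allow_wobble` switch are illustrative assumptions, not part of the slides.

```python
# Canonical pairs from the list above: Watson-Crick A-U and G-C, plus the G-U wobble.
CANONICAL_PAIRS = {
    ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"),
    ("G", "U"), ("U", "G"),
}


def can_pair(x: str, y: str, allow_wobble: bool = True) -> bool:
    """Return True if bases x and y form a canonical pair (case-insensitive)."""
    pair = (x.upper(), y.upper())
    # Optionally exclude the G-U wobble and keep only Watson-Crick pairs.
    if not allow_wobble and set(pair) == {"G", "U"}:
        return False
    return pair in CANONICAL_PAIRS


print(can_pair("A", "U"), can_pair("G", "U"), can_pair("A", "G"))
```

Setting `allow_wobble=False` restricts the check to A-U and G-C, which is the simplification used later for base-pair maximization.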

Image: http://www.bioalgorithms.info/ (A-U: 2 hydrogen bonds; G-C: 3 hydrogen bonds, more stable)

wobble base pair

A wobble base pair is a pairing between two nucleotides in RNA molecules that does not follow Watson-Crick base pair rules. The four main wobble base pairs are:

• guanine-uracil (G-U)

• hypoxanthine-uracil (I-U)

• hypoxanthine-adenine (I-A)

• hypoxanthine-cytosine (I-C)

https://en.wikipedia.org/wiki/Nucleotides

https://en.wikipedia.org/wiki/RNA

https://en.wikipedia.org/wiki/Base_pair



https://en.wikipedia.org/wiki/Guanine

https://en.wikipedia.org/wiki/Uracil

https://en.wikipedia.org/wiki/Hypoxanthine

https://en.wikipedia.org/wiki/Uracil


https://en.wikipedia.org/wiki/Adenine


https://en.wikipedia.org/wiki/Cytosine

RNA Basics

• transfer RNA (tRNA)

• messenger RNA (mRNA)

• ribosomal RNA (rRNA)

• small interfering RNA (siRNA)

• micro RNA (miRNA)

• small nucleolar RNA (snoRNA)

http://www.genetics.wustl.edu/eddy/tRNAscan-SE/




Non-coding RNAs

• A non-coding RNA (ncRNA) is a functional RNA molecule that is not translated into a protein; small RNA (sRNA) is often used for bacterial ncRNAs.

• tRNA (transfer RNA), rRNA (ribosomal RNA), snoRNA (small RNA molecules that guide chemical modifications of other RNAs)

• microRNAs (miRNA, μRNA, single-stranded RNA molecules of 21-23 nucleotides in length, regulate gene expression)

• siRNAs (short interfering RNA or silencing RNA, double-stranded, 20-25 nucleotides in length, involved in the RNA interference (RNAi) pathway, where they interfere with the expression of a specific gene)

• piRNAs (expressed in animal cells, form RNA-protein complexes through interactions with Piwi proteins, which have been linked to transcriptional gene silencing of retrotransposons and other genetic elements in germ line cells)

• long ncRNAs (non-protein coding transcripts longer than 200 nucleotides)

RNA Secondary Structure

Hairpin loop

Junction (Multiloop)

Bulge Loop

Single-Stranded

Interior Loop

Stem

Image– Wuchty

Pseudoknot

Sequence Alignment as a method to determine structure

• Bases pair in order to form backbones and determine the secondary structure

• Aligning bases based on their ability to pair with each other gives an algorithmic approach to determining the optimal structure

Complex folds

Pseudoknots

[Figure: pseudoknot diagram with positions i, j, j’, i’]

RNA secondary structure representation

• 2D

• Circle plot

• Dot plot

• Mountain

• Parentheses, e.g. (((…)))..((….))

• Tree model
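The parentheses notation can be turned back into an explicit list of base pairs with a stack. The following is a minimal sketch; the function name `dot_bracket_to_pairs` is hypothetical, and it assumes plain "." characters for unpaired positions and no pseudoknots.

```python
def dot_bracket_to_pairs(structure: str) -> list[tuple[int, int]]:
    """Convert parentheses (dot-bracket) notation to sorted 1-based (i, j) pairs."""
    stack = []   # positions of "(" still waiting for their partner
    pairs = []
    for pos, symbol in enumerate(structure, start=1):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            # The most recent unmatched "(" closes here, so pairs are nested.
            pairs.append((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


print(dot_bracket_to_pairs("(((...)))..((....))"))
# [(1, 9), (2, 8), (3, 7), (12, 19), (13, 18)]
```

Because each ")" closes the most recent open "(", this representation can only describe nested pairs, which is one reason pseudoknots need a different notation.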

Main approaches to RNA secondary structure prediction

• Energy minimization

– dynamic programming approach

– does not require prior sequence alignment

– requires estimation of energy terms contributing to secondary structure

• Comparative sequence analysis

– using sequence alignment to find conserved residues and covariant base pairs.

– most trusted

• Simultaneous folding and alignment (structural alignment)
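The idea of covariant base pairs in comparative sequence analysis can be illustrated with a toy check on two alignment columns. This is only a sketch: the alignment below is made up for illustration, and the function name `covariation_summary` and its two summary values are assumptions, not an established scoring method.

```python
from collections import Counter

# Pairs counted as compatible: Watson-Crick plus the G-U wobble.
PAIRING = {"AU", "UA", "GC", "CG", "GU", "UG"}


def covariation_summary(alignment, col_i, col_j):
    """Summarize how well two 0-based alignment columns pair across sequences."""
    observed = Counter(seq[col_i] + seq[col_j] for seq in alignment)
    pairing = {p: n for p, n in observed.items() if p in PAIRING}
    return {
        # Fraction of sequences in which the two columns can pair.
        "fraction_paired": sum(pairing.values()) / len(alignment),
        # Several distinct pair types at the same columns suggest compensatory changes.
        "distinct_pair_types": len(pairing),
    }


# Hypothetical toy alignment (not real data): columns 1 and 7 change together.
toy_alignment = [
    "GGGAAAUCC",
    "GCGAAAUGC",
    "GAGAAAUUC",
]
print(covariation_summary(toy_alignment, 1, 7))
```

Columns that stay pairable while the bases themselves change (G-C, C-G, A-U in the toy data) are the kind of evidence this approach relies on.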

Assumptions in energy minimization approaches

• Most likely structure similar to energetically most stable structure

• Energy associated with any position is only influenced by local sequence and structure

• Neglect pseudoknots

Base-pair maximization

• Find structure with the most base pairs

– Only consider A-U and G-C and do not distinguish them

• Nussinov algorithm (1970s)

– Too simple to be accurate, but stepping-stone for later algorithms

• Problem definition

– Given sequence X = x1x2…xL, compute a structure that has the maximum (weighted) number of base pairings

• How can we solve this problem?

– Remember: RNA folds back to itself!

– S(i,j) is the maximum score when xi..xj folds optimally

– S(1,L)?

– S(i,i)?

Nussinov algorithm

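“Grow” from substructures: four cases, (1) to (4)

[Figure: S(i,j) for the interval i..j within the sequence 1..L, built from smaller intervals marked with i+1, j-1 and a split point k]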

Dynamic programming

• Compute S(i,j) recursively (dynamic programming)

– Compares a sequence against itself in a dynamic programming matrix

• Three steps: initialization, recursion, traceback

Nussinov RNA Folding Algorithm

• Initialization (γ(i, j) denotes the score S(i, j) defined earlier):

γ(i, i-1) = 0 for i = 2 to L;

γ(i, i) = 0 for i = 1 to L.

Example sequence (L = 9): G G G A A A U C C. In the matrix, rows are indexed by i and columns by j.

Image Source: Durbin et al. (2002) “Biological Sequence Analysis”


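After γ(i, i-1) = 0 for i = 2 to L (a dot marks a cell that is not filled in):

i\j   1  2  3  4  5  6  7  8  9
      G  G  G  A  A  A  U  C  C
1 G   .  .  .  .  .  .  .  .  .
2 G   0  .  .  .  .  .  .  .  .
3 G   .  0  .  .  .  .  .  .  .
4 A   .  .  0  .  .  .  .  .  .
5 A   .  .  .  0  .  .  .  .  .
6 A   .  .  .  .  0  .  .  .  .
7 U   .  .  .  .  .  0  .  .  .
8 C   .  .  .  .  .  .  0  .  .
9 C   .  .  .  .  .  .  .  0  .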



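After γ(i, i) = 0 for i = 1 to L (a dot marks a cell that is not filled in):

i\j   1  2  3  4  5  6  7  8  9
      G  G  G  A  A  A  U  C  C
1 G   0  .  .  .  .  .  .  .  .
2 G   0  0  .  .  .  .  .  .  .
3 G   .  0  0  .  .  .  .  .  .
4 A   .  .  0  0  .  .  .  .  .
5 A   .  .  .  0  0  .  .  .  .
6 A   .  .  .  .  0  0  .  .  .
7 U   .  .  .  .  .  0  0  .  .
8 C   .  .  .  .  .  .  0  0  .
9 C   .  .  .  .  .  .  .  0  0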



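• Recursive relation:

• For all subsequences from length 2 to length L:

\[
\gamma(i,j) = \max
\begin{cases}
\gamma(i+1,\, j) & \text{Case 1} \\
\gamma(i,\, j-1) & \text{Case 2} \\
\gamma(i+1,\, j-1) + \delta(i,j) & \text{Case 3} \\
\max_{i<k<j} \left[ \gamma(i,k) + \gamma(k+1,\, j) \right] & \text{Case 4}
\end{cases}
\]

where δ(i, j) = 1 if x_i and x_j can form a base pair, and 0 otherwise.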
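The initialization and the four-case recurrence can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `nussinov_fill`, the dictionary layout, and the default pair set (A-U and G-C only, scored 1 each) are assumptions chosen to match the slides' simplification.

```python
def nussinov_fill(seq: str, pairs=frozenset({"AU", "UA", "GC", "CG"})):
    """Return gamma as a dict keyed by 1-based (i, j)."""
    L = len(seq)
    gamma = {}
    # Initialization: gamma(i, i) = 0 for i = 1..L and gamma(i, i-1) = 0 for i = 2..L.
    for i in range(1, L + 1):
        gamma[(i, i)] = 0
        if i >= 2:
            gamma[(i, i - 1)] = 0
    # Fill by increasing subsequence length so all smaller intervals are already known.
    for length in range(2, L + 1):
        for i in range(1, L - length + 2):
            j = i + length - 1
            delta = 1 if seq[i - 1] + seq[j - 1] in pairs else 0
            best = max(
                gamma[(i + 1, j)],              # case 1: i unpaired
                gamma[(i, j - 1)],              # case 2: j unpaired
                gamma[(i + 1, j - 1)] + delta,  # case 3: i pairs with j
            )
            for k in range(i + 1, j):           # case 4: bifurcation at k
                best = max(best, gamma[(i, k)] + gamma[(k + 1, j)])
            gamma[(i, j)] = best
    return gamma


gamma = nussinov_fill("GGGAAAUCC")
print(gamma[(1, 9)])  # 3
```

The inner loop over k makes the fill cubic in the sequence length, while the table itself is quadratic.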


The matrix is filled one diagonal at a time (a dot marks a cell that is not yet filled in).

After subsequences of length 2:

i\j   1  2  3  4  5  6  7  8  9
      G  G  G  A  A  A  U  C  C
1 G   0  0  .  .  .  .  .  .  .
2 G   0  0  0  .  .  .  .  .  .
3 G   .  0  0  0  .  .  .  .  .
4 A   .  .  0  0  0  .  .  .  .
5 A   .  .  .  0  0  0  .  .  .
6 A   .  .  .  .  0  0  1  .  .
7 U   .  .  .  .  .  0  0  0  .
8 C   .  .  .  .  .  .  0  0  0
9 C   .  .  .  .  .  .  .  0  0

After subsequences of length 3:

i\j   1  2  3  4  5  6  7  8  9
      G  G  G  A  A  A  U  C  C
1 G   0  0  0  .  .  .  .  .  .
2 G   0  0  0  0  .  .  .  .  .
3 G   .  0  0  0  0  .  .  .  .
4 A   .  .  0  0  0  0  .  .  .
5 A   .  .  .  0  0  0  1  .  .
6 A   .  .  .  .  0  0  1  1  .
7 U   .  .  .  .  .  0  0  0  0
8 C   .  .  .  .  .  .  0  0  0
9 C   .  .  .  .  .  .  .  0  0

After subsequences of length 4:

i\j   1  2  3  4  5  6  7  8  9
      G  G  G  A  A  A  U  C  C
1 G   0  0  0  0  .  .  .  .  .
2 G   0  0  0  0  0  .  .  .  .
3 G   .  0  0  0  0  0  .  .  .
4 A   .  .  0  0  0  0  1  .  .
5 A   .  .  .  0  0  0  1  1  .
6 A   .  .  .  .  0  0  1  1  1
7 U   .  .  .  .  .  0  0  0  0
8 C   .  .  .  .  .  .  0  0  0
9 C   .  .  .  .  .  .  .  0  0


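Example Computation

Computing γ(4, 7), the cell marked "?" (a dot marks a cell that is not filled in):

i\j   1  2  3  4  5  6  7  8  9
      G  G  G  A  A  A  U  C  C
1 G   0  0  0  0  .  .  .  .  .
2 G   0  0  0  0  0  .  .  .  .
3 G   .  0  0  0  0  0  .  .  .
4 A   .  .  0  0  0  0  ?  .  .
5 A   .  .  .  0  0  0  1  1  .
6 A   .  .  .  .  0  0  1  1  1
7 U   .  .  .  .  .  0  0  0  0
8 C   .  .  .  .  .  .  0  0  0
9 C   .  .  .  .  .  .  .  0  0

\[
\gamma(4,7) = \max
\begin{cases}
\gamma(5,7) \\
\gamma(4,6) \\
\gamma(5,6) + \delta(4,7) \\
\max_{4<k<7} \left[ \gamma(4,k) + \gamma(k+1,7) \right]
\end{cases}
\]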

Evaluating the four cases for γ(4, 7), with i = 4 (A) and j = 7 (U):

• Case 1: γ(5, 7) = 1. Position i is left unpaired and positions i+1..j fold optimally.

• Case 2: γ(4, 6) = 0. Position j is left unpaired.

• Case 3: γ(5, 6) + δ(4, 7) = 0 + 1 = 1. A at position i pairs with U at position j (δ(4, 7) = 1 because A and U can pair), enclosing positions i+1..j-1 (A A).

• Case 4: for k = 5, γ(4, 5) + γ(6, 7) = 0 + 1 = 1; for k = 6, γ(4, 6) + γ(7, 7) = 0 + 0 = 0.

The maximum over the four cases is 1.

γ(4, 7) = 1 is entered in the matrix:

i\j   1  2  3  4  5  6  7  8  9
      G  G  G  A  A  A  U  C  C
1 G   0  0  0  0  .  .  .  .  .
2 G   0  0  0  0  0  .  .  .  .
3 G   .  0  0  0  0  0  .  .  .
4 A   .  .  0  0  0  0  1  .  .
5 A   .  .  .  0  0  0  1  1  .
6 A   .  .  .  .  0  0  1  1  1
7 U   .  .  .  .  .  0  0  0  0
8 C   .  .  .  .  .  .  0  0  0
9 C   .  .  .  .  .  .  .  0  0

Completed Matrix

i\j   1  2  3  4  5  6  7  8  9
      G  G  G  A  A  A  U  C  C
1 G   0  0  0  0  0  0  1  2  3
2 G   0  0  0  0  0  0  1  2  3
3 G   .  0  0  0  0  0  1  2  2
4 A   .  .  0  0  0  0  1  1  1
5 A   .  .  .  0  0  0  1  1  1
6 A   .  .  .  .  0  0  1  1  1
7 U   .  .  .  .  .  0  0  0  0
8 C   .  .  .  .  .  .  0  0  0
9 C   .  .  .  .  .  .  .  0  0

(Rows are i, columns are j; a dot marks a cell that is not used.)


Traceback

• value at γ(1, L) is the total base pair count in the maximally base-paired structure

• as in other DP, traceback from γ(1, L) is necessary to recover the final secondary structure

• pushdown stack is used to deal with bifurcated structures
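The stack-based traceback described above can be sketched as follows. The block is self-contained, so it repeats a compact fill step; all names (`fill`, `traceback`, `to_dot_bracket`, `PAIRS`) are illustrative assumptions, and the order in which the cases are tested is one possible choice. When several structures reach the same maximum, that order decides which one is returned.

```python
PAIRS = {"AU", "UA", "GC", "CG"}  # assumption: only A-U and G-C, each scoring 1


def delta(seq, i, j):
    """1 if the 1-based positions i and j can pair, else 0."""
    return 1 if seq[i - 1] + seq[j - 1] in PAIRS else 0


def fill(seq):
    """Nussinov fill; g[i][j] is 1-based and the zero padding covers initialization."""
    n = len(seq)
    g = [[0] * (n + 2) for _ in range(n + 2)]
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            best = max(g[i + 1][j], g[i][j - 1], g[i + 1][j - 1] + delta(seq, i, j))
            for k in range(i + 1, j):
                best = max(best, g[i][k] + g[k + 1][j])
            g[i][j] = best
    return g


def traceback(seq, g):
    """Recover one maximally paired structure using a pushdown stack."""
    pairs = []
    stack = [(1, len(seq))]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue  # empty or single-base interval: nothing to pair
        if g[i + 1][j] == g[i][j]:
            stack.append((i + 1, j))          # i unpaired
        elif g[i][j - 1] == g[i][j]:
            stack.append((i, j - 1))          # j unpaired
        elif g[i + 1][j - 1] + delta(seq, i, j) == g[i][j]:
            pairs.append((i, j))              # i pairs with j
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):         # bifurcation: push both halves
                if g[i][k] + g[k + 1][j] == g[i][j]:
                    stack.append((k + 1, j))
                    stack.append((i, k))
                    break
    return sorted(pairs)


def to_dot_bracket(n, pairs):
    """Render 1-based pairs in parentheses notation."""
    symbols = ["."] * n
    for i, j in pairs:
        symbols[i - 1], symbols[j - 1] = "(", ")"
    return "".join(symbols)


seq = "GGGAAAUCC"
pairs = traceback(seq, fill(seq))
print(pairs, to_dot_bracket(len(seq), pairs))
```

For GGGAAAUCC this sketch recovers three base pairs. With this particular order of checks it pairs A6 with U7 rather than A4 with U7; both choices give a structure with the same maximal count, which illustrates that the optimum is not unique.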

Retrieving the Structure

Traceback on the completed matrix for G G G A A A U C C, starting from (1,9). Each step pops one cell from the stack (CURRENT), pushes the next cell (STACK), and may record a base pair (PAIRS):

Step  CURRENT  STACK  PAIRS
0     -        (1,9)  -
1     (1,9)    (2,9)  -
2     (2,9)    (3,8)  (2,9)
3     (3,8)    (4,7)  (2,9) (3,8)
4     (4,7)    (5,6)  (2,9) (3,8) (4,7)
5     (5,6)    (6,6)  (2,9) (3,8) (4,7)
6     (6,6)    -      (2,9) (3,8) (4,7)

Final structure: G2-C9, G3-C8 and A4-U7 are paired; G1, A5 and A6 are unpaired. In parentheses notation: .(((..)))


Evaluation of Nussinov

• unfortunately, although this maximizes the number of base pairs, it does not necessarily produce viable secondary structures

• in Zuker’s algorithm, the correct structure is assumed to have the lowest equilibrium free energy (ΔG) (Zuker and Stiegler, 1981; Zuker 1989a)

Limitation of Nussinov Algorithm

Pseudo code for Nussinov Algorithm
