Upload
others
View
7
Download
0
Embed Size (px)
Citation preview
RNA secondary structure prediction
RNA Basics
• RNA bases A,C,G,U
• Canonical Base Pairs
– A-U
– G-C
– G-U
“wobble” pairing
– Bases can only pair with one other base.
Image: http://www.bioalgorithms.info/
2 Hydrogen Bonds 3 Hydrogen Bonds – more stable
wobble base pair
A wobble base pair is a pairing between two nucleotides in RNA molecules that does not follow Watson-Crick base pair rules. The four main wobble base pairs are • guanine-uracil (G-U), • hypoxanthine-uracil(I-U), • hypoxanthine-adenine (I-A), • hypoxanthine-cytosine (I-C).
RNA Basics
• transfer RNA (tRNA)
• messenger RNA (mRNA)
• ribosomal RNA (rRNA)
• small interfering RNA (siRNA)
• micro RNA (miRNA)
• small nucleolar RNA (snoRNA)
http://www.genetics.wustl.edu/eddy/tRNAscan-SE/
Non-coding RNAs • A non-coding RNA (ncRNA) is a functional RNA molecule that is not
translated into a protein; small RNA (sRNA) is often used for bacterial ncRNAs.
• tRNA (transfer RNA), rRNA (ribosomal RNA), snoRNA (small RNA molecules that guide chemical modifications of other RNAs)
• microRNAs (miRNA, μRNA, single-stranded RNA molecules of 21-23 nucleotides in length, regulate gene expression)
• siRNAs (short interfering RNA or silencing RNA, double-stranded, 20-25 nucleotides in length, involved in the RNA interference (RNAi) pathway, where it interferes with the expression of a specific gene. )
• piRNAs (expressed in animal cells, forms RNA-protein complexes through interactions with Piwi proteins, which have been linked to transcriptional gene silencing of retrotransposons and other genetic elements in germ line cells)
• long ncRNAs (non-protein coding transcripts longer than 200 nucleotides)
RNA Secondary Structure
Hairpin loop
Junction (Multiloop)
Bulge Loop
Single-Stranded
Interior Loop
Stem
Image– Wuchty
Pseudoknot
Sequence Alignment as a method to determine structure
• Bases pair in order to form backbones and determine the secondary structure
• Aligning bases based on their ability to pair with each other gives an algorithmic approach to determining the optimal structure
Complex folds
Pseudoknots
i
j
j’
i’ i j j’ i’
?
RNA secondary structure representation
• 2D
• Circle plot
• Dot plot
• Mountain
• Parentheses
• Tree model (((…)))..((
….))
Main approaches to RNA secondary structure prediction
• Energy minimization
– dynamic programming approach
– does not require prior sequence alignment
– require estimation of energy terms contributing to secondary structure
• Comparative sequence analysis
– using sequence alignment to find conserved residues and covariant base pairs.
– most trusted
• Simultaneous folding and alignment (structural alignment)
Assumptions in energy minimization approaches
• Most likely structure similar to energetically most stable structure
• Energy associated with any position is only influenced by local sequence and structure
• Neglect pseudoknots
Base-pair maximization • Find structure with the most base pairs
– Only consider A-U and G-C and do not distinguish them
• Nussinov algorithm (1970s)
– Too simple to be accurate, but stepping-stone for later algorithms
• Problem definition – Given sequence X=x1x2…xL,compute a structure that has
maximum (weighted) number of base pairings
• How can we solve this problem? – Remember: RNA folds back to itself!
– S(i,j) is the maximum score when xi..xj folds optimally
– S(1,L)?
– S(i,i)?
Nussinov algorithm
1 L i j
S(i,j)
“Grow” from substructures (1) (2) (4) (3)
1 L i j i+1 j-1 k
Dynamic programming
• Compute S(i,j) recursively (dynamic programming) – Compares a sequence against itself in a dynamic
programming matrix
• Three steps
Nussinov RNA Folding Algorithm
• Initialization: γ(i, i-1) = 0 for I = 2 to L;
γ(i, i) = 0 for I = 2 to L.
1 2 3 4 5 6 7 8 9
G G G A A A U C C
1 G
2 G
3 G
4 A
5 A
6 A
7 U
8 C
9 C
i
j
Image Source: Durbin et al. (2002) “Biological Sequence Analysis”
Nussinov RNA Folding Algorithm
1 2 3 4 5 6 7 8 9
G G G A A A U C C
1 G
2 G 0
3 G 0
4 A 0
5 A 0
6 A 0
7 U 0
8 C 0
9 C 0
j
i
• Initialization: γ(i, i-1) = 0 for I = 2 to L;
γ(i, i) = 0 for I = 2 to L.
Image Source: Durbin et al. (2002) “Biological Sequence Analysis”
Nussinov RNA Folding Algorithm
1 2 3 4 5 6 7 8 9
G G G A A A U C C
1 G 0
2 G 0 0
3 G 0 0
4 A 0 0
5 A 0 0
6 A 0 0
7 U 0 0
8 C 0 0
9 C 0 0
j
i
• Initialization: γ(i, i-1) = 0 for I = 2 to L;
γ(i, i) = 0 for I = 2 to L.
Image Source: Durbin et al. (2002) “Biological Sequence Analysis”
Nussinov RNA Folding Algorithm
• Recursive Relation:
• For all subsequences from length 2 to length L:
)],1(),([max
),()1,1(
)1,(
),1(
max),(
jkki
jiji
ji
ji
ji
jki
Case 1
Case 2
Case 3
Case 4
Nussinov RNA Folding Algorithm
1 2 3 4 5 6 7 8 9
G G G A A A U C C
1 G 0 0
2 G 0 0 0
3 G 0 0 0
4 A 0 0 0
5 A 0 0 0
6 A 0 0 1
7 U 0 0 0
8 C 0 0 0
9 C 0 0
)],1(),([max
),()1,1(
)1,(
),1(
max),(
jkki
jiji
ji
ji
ji
jki
j
i
Image Source: Durbin et al. (2002) “Biological Sequence Analysis”
Nussinov RNA Folding Algorithm
1 2 3 4 5 6 7 8 9
G G G A A A U C C
1 G 0 0 0
2 G 0 0 0 0
3 G 0 0 0 0
4 A 0 0 0 0
5 A 0 0 0 1
6 A 0 0 1 1
7 U 0 0 0 0
8 C 0 0 0
9 C 0 0
)],1(),([max
),()1,1(
)1,(
),1(
max),(
jkki
jiji
ji
ji
ji
jki
j
i
Image Source: Durbin et al. (2002) “Biological Sequence Analysis”
Nussinov RNA Folding Algorithm
1 2 3 4 5 6 7 8 9
G G G A A A U C C
1 G 0 0 0 0
2 G 0 0 0 0 0
3 G 0 0 0 0 0
4 A 0 0 0 0 1
5 A 0 0 0 1 1
6 A 0 0 1 1 1
7 U 0 0 0 0
8 C 0 0 0
9 C 0 0
)],1(),([max
),()1,1(
)1,(
),1(
max),(
jkki
jiji
ji
ji
ji
jki
j
i
Image Source: Durbin et al. (2002) “Biological Sequence Analysis”
Example Computation
1 2 3 4 5 6 7 8 9
G G G A A A U C C
1 G 0 0 0 0
2 G 0 0 0 0 0
3 G 0 0 0 0 0
4 A 0 0 0 0
5 A 0 0 0 1 1
6 A 0 0 1 1 1
7 U 0 0 0 0
8 C 0 0 0
9 C 0 0
j
i
)]7,1(),4([max
)7,4()6,5(
)6,4(
)7,5(
max)7,4(
74 kkk
Image Source: Durbin et al. (2002) “Biological Sequence Analysis”
Example Computation
1 2 3 4 5 6 7 8 9
G G G A A A U C C
1 G 0 0 0 0
2 G 0 0 0 0 0
3 G 0 0 0 0 0
4 A 0 0 0 0
5 A 0 0 0 1 1
6 A 0 0 1 1 1
7 U 0 0 0 0
8 C 0 0 0
9 C 0 0
j
i
)]7,1(),4([max
)7,4()6,5(
)6,4(
)7,5(
max)7,4(
74 kkk
A U
A
A
i
i+1 j
Image Source: Durbin et al. (2002) “Biological Sequence Analysis”
Example Computation
1 2 3 4 5 6 7 8 9
G G G A A A U C C
1 G 0 0 0 0
2 G 0 0 0 0 0
3 G 0 0 0 0 0
4 A 0 0 0 0
5 A 0 0 0 1 1
6 A 0 0 1 1 1
7 U 0 0 0 0
8 C 0 0 0
9 C 0 0
j
i
)]7,1(),4([max
)7,4()6,5(
)6,4(
)7,5(
max)7,4(
74 kkk
Image Source: Durbin et al. (2002) “Biological Sequence Analysis”
Example Computation
1 2 3 4 5 6 7 8 9
G G G A A A U C C
1 G 0 0 0 0
2 G 0 0 0 0 0
3 G 0 0 0 0 0
4 A 0 0 0 0
5 A 0 0 0 1 1
6 A 0 0 1 1 1
7 U 0 0 0 0
8 C 0 0 0
9 C 0 0
j
i
)]7,1(),4([max
)7,4()6,5(
)6,4(
)7,5(
max)7,4(
74 kkk
i+1 j-1
i j A U
A A
Image Source: Durbin et al. (2002) “Biological Sequence Analysis”
Example Computation
1 2 3 4 5 6 7 8 9
G G G A A A U C C
1 G 0 0 0 0
2 G 0 0 0 0 0
3 G 0 0 0 0 0
4 A 0 0 0 0
5 A 0 0 0 1 1
6 A 0 0 1 1 1
7 U 0 0 0 0
8 C 0 0 0
9 C 0 0
j
i
)]7,1(),4([max
)7,4()6,5(
)6,4(
)7,5(
max)7,4(
74 kkk
Image Source: Durbin et al. (2002) “Biological Sequence Analysis”
Example Computation
1 2 3 4 5 6 7 8 9
G G G A A A U C C
1 G 0 0 0 0
2 G 0 0 0 0 0
3 G 0 0 0 0 0
4 A 0 0 0 0 1
5 A 0 0 0 1 1
6 A 0 0 1 1 1
7 U 0 0 0 0
8 C 0 0 0
9 C 0 0
j
i
)]7,1(),4([max
)7,4()6,5(
)6,4(
)7,5(
max)7,4(
74 kkk
Image Source: Durbin et al. (2002) “Biological Sequence Analysis”
Completed Matrix
1 2 3 4 5 6 7 8 9
G G G A A A U C C
1 G 0 0 0 0 0 0 1 2 3
2 G 0 0 0 0 0 0 1 2 3
3 G 0 0 0 0 0 1 2 2
4 A 0 0 0 0 1 1 1
5 A 0 0 0 1 1 1
6 A 0 0 1 1 1
7 U 0 0 0 0
8 C 0 0 0
9 C 0 0
)],1(),([max
),()1,1(
)1,(
),1(
max),(
jkki
jiji
ji
ji
ji
jki
j
i
Image Source: Durbin et al. (2002) “Biological Sequence Analysis”
Traceback
• value at γ(1, L) is the total base pair count in the maximally base-paired structure
• as in other DP, traceback from γ(1, L) is necessary to recover the final secondary structure
• pushdown stack is used to deal with bifurcated structures
Retrieving the Structure
1 2 3 4 5 6 7 8 9
G G G A A A U C C
1 G 0 0 0 0 0 0 1 2 3
2 G 0 0 0 0 0 0 1 2 3
3 G 0 0 0 0 0 1 2 2
4 A 0 0 0 0 1 1 1
5 A 0 0 0 1 1 1
6 A 0 0 1 1 1
7 U 0 0 0 0
8 C 0 0 0
9 C 0 0
j
i
Image Source: Durbin et al. (2002) “Biological Sequence Analysis”
STACK
(1,9)
CURRENT PAIRS
Retrieving the Structure
1 2 3 4 5 6 7 8 9
G G G A A A U C C
1 G 0 0 0 0 0 0 1 2 3
2 G 0 0 0 0 0 0 1 2 3
3 G 0 0 0 0 0 1 2 2
4 A 0 0 0 0 1 1 1
5 A 0 0 0 1 1 1
6 A 0 0 1 1 1
7 U 0 0 0 0
8 C 0 0 0
9 C 0 0
j
i
Image Source: Durbin et al. (2002) “Biological Sequence Analysis”
STACK
(2,9)
CURRENT
(1,9)
PAIRS
Retrieving the Structure
1 2 3 4 5 6 7 8 9
G G G A A A U C C
1 G 0 0 0 0 0 0 1 2 3
2 G 0 0 0 0 0 0 1 2 3
3 G 0 0 0 0 0 1 2 2
4 A 0 0 0 0 1 1 1
5 A 0 0 0 1 1 1
6 A 0 0 1 1 1
7 U 0 0 0 0
8 C 0 0 0
9 C 0 0
j
i
Image Source: Durbin et al. (2002) “Biological Sequence Analysis”
STACK
(3,8)
CURRENT
(2,9)
C
G
G
PAIRS
(2,9)
Retrieving the Structure
1 2 3 4 5 6 7 8 9
G G G A A A U C C
1 G 0 0 0 0 0 0 1 2 3
2 G 0 0 0 0 0 0 1 2 3
3 G 0 0 0 0 0 1 2 2
4 A 0 0 0 0 1 1 1
5 A 0 0 0 1 1 1
6 A 0 0 1 1 1
7 U 0 0 0 0
8 C 0 0 0
9 C 0 0
j
i
Image Source: Durbin et al. (2002) “Biological Sequence Analysis”
STACK
(4,7)
CURRENT
(3,8)
C
G
G
C G
PAIRS
(2,9)
(3,8)
Retrieving the Structure
1 2 3 4 5 6 7 8 9
G G G A A A U C C
1 G 0 0 0 0 0 0 1 2 3
2 G 0 0 0 0 0 0 1 2 3
3 G 0 0 0 0 0 1 2 2
4 A 0 0 0 0 1 1 1
5 A 0 0 0 1 1 1
6 A 0 0 1 1 1
7 U 0 0 0 0
8 C 0 0 0
9 C 0 0
j
i
Image Source: Durbin et al. (2002) “Biological Sequence Analysis”
STACK
(5,6)
CURRENT
(4,7) U
C
G
A
G
C G
PAIRS
(2,9)
(3,8)
(4,7)
Retrieving the Structure
1 2 3 4 5 6 7 8 9
G G G A A A U C C
1 G 0 0 0 0 0 0 1 2 3
2 G 0 0 0 0 0 0 1 2 3
3 G 0 0 0 0 0 1 2 2
4 A 0 0 0 0 1 1 1
5 A 0 0 0 1 1 1
6 A 0 0 1 1 1
7 U 0 0 0 0
8 C 0 0 0
9 C 0 0
j
i
Image Source: Durbin et al. (2002) “Biological Sequence Analysis”
STACK
(6,6)
CURRENT
(5,6)
A
U
C
G
A
G
C G
PAIRS
(2,9)
(3,8)
(4,7)
Retrieving the Structure
1 2 3 4 5 6 7 8 9
G G G A A A U C C
1 G 0 0 0 0 0 0 1 2 3
2 G 0 0 0 0 0 0 1 2 3
3 G 0 0 0 0 0 1 2 2
4 A 0 0 0 0 1 1 1
5 A 0 0 0 1 1 1
6 A 0 0 1 1 1
7 U 0 0 0 0
8 C 0 0 0
9 C 0 0
j
i
Image Source: Durbin et al. (2002) “Biological Sequence Analysis”
STACK
-
CURRENT
(6,6)
A
U
C
G
A
G
C G
A PAIRS
(2,9)
(3,8)
(4,7)
Retrieving the Structure
1 2 3 4 5 6 7 8 9
G G G A A A U C C
1 G 0 0 0 0 0 0 1 2 3
2 G 0 0 0 0 0 0 1 2 3
3 G 0 0 0 0 0 1 2 2
4 A 0 0 0 0 1 1 1
5 A 0 0 0 1 1 1
6 A 0 0 1 1 1
7 U 0 0 0 0
8 C 0 0 0
9 C 0 0
j
i
A
U
C
G
A
G
C G
A
Image Source: Durbin et al. (2002) “Biological Sequence Analysis”
Evaluation of Nussinov
• unfortunately, while this does maximize the base pairs, it does not create viable secondary structures
• in Zuker’s algorithm, the correct structure is assumed to have the lowest equilibrium free energy (ΔG) (Zuker and Stiegler, 1981; Zuker 1989a)
Limitation of Nussinov Algorithm
Pseudo code for Nussinov Algorithm