Applied Bioinformatics Week 11

Applied Bioinformatics Week 11. Topics Protein Secondary Structure RNA Secondary Structure

Applied Bioinformatics

Week 11

Topics

• Protein Secondary Structure

• RNA Secondary Structure

Theory I

Recall Domains

• Functional region of a protein sequence

• Proteins may have several domains

• Generally identified by MSA

Domains

• Convey function

• Function derives from 3D structure

• How to determine 3D structure of proteins?

• First step secondary structure

Four levels of protein structure

Structure

Secondary Structure

• Local three dimensional structure

• Elements
– Helix
– Sheet
– Coil

G = 3-turn helix (3₁₀ helix). Min length 3 residues.
H = 4-turn helix (α helix). Min length 4 residues.
I = 5-turn helix (π helix). Min length 5 residues.
T = hydrogen bonded turn (3, 4 or 5 turn)
E = extended strand in parallel and/or anti-parallel β-sheet conformation. Min length 2 residues.
B = residue in isolated β-bridge (single pair β-sheet hydrogen bond formation)
S = bend (the only non-hydrogen-bond based assignment)

Secondary Structure

8 different categories (DSSP):
H: α-helix
G: 3₁₀-helix
I: π-helix (extremely rare)
E: β-strand
B: β-bridge
T: turn
S: bend
L: the rest

Protein Secondary Structure [3]

Alpha Helix-

Structure repeats itself every 5.4 Angstroms along the helix axis

Every main chain CO and NH group is hydrogen bonded to a peptide bond 4 residues away

Beta Sheet – Two or more polypeptide chains run alongside each other and are linked by hydrogen bonds

Yuchun Tang, Preeti Singh, Yanqing Zhang, Chung-Dar Lu and Irene Weber, Georgia State University

Simplification

• 20 amino acids

• 5 - 11 groups of amino acids
– Amino acids with similar chemical properties
– Depends on the study

• 3 secondary structures

Secondary Structure Prediction

• Sheet/helix forming tendency of amino acids
– Up to 60% accurate

• MSA -> neighborhood exploitation
– Words of several aa are formed
– Hydrophobicity is included
– Up to 80% accurate

Propensities

Generation of Prediction Methods

• 1st generation: single residue statistics
– Based on single amino acid propensity

• 2nd generation: segment statistics
– Propensity for segments of 3-51 adjacent residues

• 3rd generation: evolution to better predictions
– The use of evolutionary information (evolutionary profile)

Assignment to Structure

• Sliding window of 7 amino acids
– Why 7? About 2 helix turns (3.6 residues per turn)

• Middle amino acid is assigned average propensity
– Helix, Sheet

• Long stretches of similar assignments
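
The sliding-window assignment can be sketched in a few lines of Python. This is only an illustration: the function name `window_average` is made up for this example, and the propensity table is a hypothetical toy scale, not published values. A real analysis would plug in a published propensity scale.

```python
def window_average(seq, propensity, width=7):
    """Average a per-residue propensity over a centred window.

    Returns one value per residue. Positions too close to either end
    to hold a full window are left as None. `width` is assumed odd.
    """
    half = width // 2
    scores = [None] * len(seq)
    for mid in range(half, len(seq) - half):
        window = seq[mid - half: mid + half + 1]
        # The middle residue receives the mean propensity of its window.
        scores[mid] = sum(propensity[aa] for aa in window) / width
    return scores


# Hypothetical toy helix propensities (NOT a published scale).
TOY_HELIX = {"A": 1.4, "E": 1.5, "L": 1.2, "G": 0.6, "P": 0.6}

scores = window_average("AAAAAAAGGGG", TOY_HELIX, width=7)
```

With a window of 7 residues and 3.6 residues per turn, each average covers 7 / 3.6 ≈ 1.9 helix turns. Long runs of high averages would then be read as candidate helices.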

Example: Window

• Consider a secondary structure (x, e) and the window of length 5 with the special position in the middle (shown in bold on the slide)

• First position of the window is:

x = A R N S T V V S T A A . . .

e = ? ? H H C C C E E E . . . .

• Window returns instance: A R N S T H

Example: Window

• Second position of the window is:

x = A R N S T V V S T A A . . .

e = ? ? H H C C C E E E . . . .

• Window returns instance: R N S T V H

• Next instances are:

N S T V V C

S T V V S C

T V V S T C
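
The window example can be reproduced with a short Python sketch. The function name `window_instances` is hypothetical. Each instance pairs the residues in the window with the structure label of the middle residue, which is the form a learning method would be trained on.

```python
def window_instances(seq, labels, width=5):
    """Slide a window over seq and label it with the middle residue's state."""
    half = width // 2
    instances = []
    for start in range(len(seq) - width + 1):
        mid = start + half
        if mid >= len(labels):
            break  # no label available for this middle position
        instances.append((seq[start:start + width], labels[mid]))
    return instances


x = "ARNSTVVSTAA"
e = "??HHCCCEEE"
instances = window_instances(x, e, width=5)
for residues, state in instances[:5]:
    print(" ".join(residues), state)
```

The first five instances printed are the ones listed on the slides, starting with `A R N S T H`.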

Practical Secondary Structure Prediction

• Can aid in MSA
– If structures are not more similar than the aligned sequences, there is a problem

• Step towards three dimensional structure

• Clue about architecture
– 28 regular protein architectures

PSIPRED Example

Secondary structure prediction methods

PSI-pred (PSI-BLAST profiles used for prediction; David Jones, Warwick)

JPRED Consensus prediction (includes many of the methods given below; Cuff & Barton, EBI)

DSC (King & Sternberg)

PREDATOR (Frishman & Argos, EMBL)

PHD home page (Rost & Sander, EMBL, Germany)

ZPRED server (Zvelebil et al., Ludwig, U.K.)

nnPredict (Cohen et al., UCSF, USA)

BMERC PSA Server (Boston University, USA)

SSP (Nearest-neighbor) (Solovyev and Salamov, Baylor College, USA)

http://speedy.embl-heidelberg.de/gtsp/secstrucpred.html

Andrew CR Martin, UCL

Consensus prediction method

Figure legend: hydrophobic; highly conserved; b = buried, e = exposed

Andrew CR Martin, UCL

Consensus prediction method -JPRED

Figure legend: hydrophobic; highly conserved; b = buried, e = exposed

Figure annotations: amphipathic, hydrophobic

Andrew CR Martin, UCL

Neural network prediction - PHD

Multiple alignment of protein family

SS profile for window of adjacent residues

Andrew CR Martin, UCL

Hidden Markov Models-HMMSTR

amino acid

secondary structure element

structural context

Markov state

• Recurrent local features of protein sequences

• Accuracy of 74%

Bystroff et al., 2000

Andrew CR Martin, UCL

Consensus/ Meta Prediction Method

• Uses more than one existing method

• Learns how to combine the results

• Produces a result which is on average better than the single methods

• E.g.: http://gor.bb.iastate.edu/cdm/
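
As a rough illustration of combining several predictors, the Python sketch below takes a per-position majority vote. This is a deliberately simpler stand-in: the meta methods described above learn how to weight their inputs, whereas a plain vote treats every method equally. The function name `consensus` and the tie rule (fall back to coil) are assumptions made for this example.

```python
from collections import Counter


def consensus(predictions):
    """Per-position majority vote over equal-length prediction strings."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have the same length")
    result = []
    for column in zip(*predictions):
        counts = Counter(column)
        top = max(counts.values())
        winners = [state for state in counts if counts[state] == top]
        # Assumed tie rule: if no single state wins, call the position coil.
        result.append(winners[0] if len(winners) == 1 else "C")
    return "".join(result)


combined = consensus(["HHHCC", "HHCCC", "HHHCE"])
```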

Prediction Accuracy Assessment

• Protein Structure Prediction Center
– http://predictioncenter.org/

• CASP
– Critical Assessment of protein Structure Prediction

Hydrophobicity

Assignment to Structure

• Sliding window of 5-7 or 19-21 amino acids
– Why?

• Otherwise same idea as for secondary structure forming propensities
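
A hydropathy profile can be sketched the same way as the propensity average. The slides do not name a scale; the Kyte-Doolittle hydropathy values are used here as an assumption, and the function name `hydropathy_profile` is hypothetical. A common rationale for the two window sizes is that short windows (5-7) pick up surface-exposed versus buried stretches, while windows of about 19-21 residues match the length of a membrane-spanning helix.

```python
# Kyte-Doolittle hydropathy values (assumed scale; the slides name none).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, width=19):
    """Return (middle index, mean hydropathy) for every full window."""
    half = width // 2
    profile = []
    for mid in range(half, len(seq) - half):
        window = seq[mid - half: mid + half + 1]
        mean = sum(KYTE_DOOLITTLE[aa] for aa in window) / width
        profile.append((mid, mean))
    return profile


profile = hydropathy_profile("L" * 19, width=19)
```

Stretches whose mean stays strongly positive over a 19-residue window would be candidates for transmembrane segments; the exact cutoff depends on the scale and is not fixed here.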

End Theory I

Mindmapping

10 min break

Practice I

Sec Struct Prediction

http://bioinf.cs.ucl.ac.uk/psipred/psiform.html
http://compbio.soe.ucsc.edu/HMM-apps/T02-query.html
http://distill.ucd.ie/porter/
http://sable.cchmc.org/
http://www.compbio.dundee.ac.uk/www-jpred/advanced.html
http://genamics.com/expression/strucpred.htm
http://www.predictprotein.org/
http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_phd.html
http://www.chemie.uni-erlangen.de/lanig/PMII/sek_str.html
http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_sopma.html
http://molbiol-tools.ca/Protein_secondary_structure.htm
http://mobyle.pasteur.fr/cgi-bin/portal.py?form=predator
http://www.aber.ac.uk/~phiwww/prof/
http://www.expasy.ch/tools/
http://gor.bb.iastate.edu/

In class assignment

• Choose a protein sequence
– Not too short!

• Perform secondary structure predictions with as many tools as possible
– Google at least one more than given in the slides

• Retrieve and rewrite the predictions such that they use the 3 letter code (H,C,S; Helix, Coil, Sheet)
– Use search and replace functionality of your word processor

• Make an MSA with the predicted secondary structures to compare the results
– Are there gaps?
– Are they within the transition from one secondary structure to the next?
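
The rewriting step can also be scripted instead of done by search and replace. The Python sketch below assumes the tool output uses DSSP-style letters and reduces them to the three letters used in this assignment. The mapping itself is an assumption (grouping all helices as H and strands or bridges as S is a common reduction, but tools differ), and the name `to_three_state` is hypothetical.

```python
# Assumed reduction to the class code: H = Helix, S = Sheet, C = Coil.
# Note that in DSSP output the letter S means "bend", which maps to coil here.
TO_THREE = {
    "H": "H", "G": "H", "I": "H",   # all helix types -> Helix
    "E": "S", "B": "S",             # strand and bridge -> Sheet
    "T": "C", "S": "C", "L": "C",   # turn, bend, rest -> Coil
    "C": "C", "-": "C",
}


def to_three_state(prediction):
    """Translate a prediction string letter by letter; unknown letters become coil."""
    return "".join(TO_THREE.get(ch.upper(), "C") for ch in prediction)


reduced = to_three_state("HHGGEEBTSL-")
```

Once every tool's output uses the same three letters, the strings can be aligned and compared position by position.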

Try to predict TMDs

• Find a protein with TMDs

• Expasy will provide you with prediction methods
– DAS - Prediction of transmembrane regions in prokaryotes using the Dense Alignment Surface method (Stockholm University)
– HMMTOP - Prediction of transmembrane helices and topology of proteins (Hungarian Academy of Sciences)
– PredictProtein - Prediction of transmembrane helix location and topology (Columbia University)
– SOSUI - Prediction of transmembrane regions (Nagoya University, Japan)
– TMHMM - Prediction of transmembrane helices in proteins (CBS; Denmark)
– TMpred - Prediction of transmembrane regions and protein orientation (EMBnet-CH)
– TopPred - Topology prediction of membrane proteins (France)

End Practice I

Theory II

RNA

• Coding RNA
– Results in protein

• Non Coding RNA
– Structural
– Regulatory
– Catalytic
– …

RNA Basics

transfer RNA (tRNA)

messenger RNA (mRNA)

ribosomal RNA (rRNA)

small interfering RNA (siRNA)

micro RNA (miRNA)

small nucleolar RNA (snoRNA)

http://www.genetics.wustl.edu/eddy/tRNAscan-SE/

RNA Secondary Structure

• Just like amino acids interact to form a secondary structure, nucleotides do the same

• Here base pairing is the driving force

• Generally the structure of RNA molecules is projected onto 2 dimensions

Chemical Structure of RNA

Four base types.

Distinguishable ends.

Partial Tertiary Structure

One illustration

Yet Another Tertiary Structure

Found via google

Our Final Tertiary Picture

Very complex

A Partial RNA Secondary Structure

Pure Secondary Structure

RNA Folding

• Single stranded RNA
– Unstable
– Base pairs with complementary sequences
– Base pair stacking
– Favorable loop sizes

• Highest Stability
– Lowest energy model

• Folding process
– Not known in detail
– Extremely fast

RNA Secondary Structure Prediction

Dynamic Programming Approaches

Sarah Aerni

http://www.tbi.univie.ac.at/

Outline

RNA folding

Dynamic programming for RNA secondary structure prediction

Covariance model for RNA structure prediction

RNA Secondary Structure

Hairpin loop

Junction (Multiloop)

Bulge Loop

Single-Stranded

Interior Loop

Stem

Image – Wuchty

Pseudoknot

Sequence Alignment as a method to determine structure

Bases pair in order to form backbones and determine the secondary structure

Aligning bases based on their ability to pair with each other gives an algorithmic approach to determining the optimal structure

Base Pair Maximization – Dynamic Programming Algorithm

Simple Example: Maximizing Base Pairing

Base pair at i and j

Unmatched at i

Unmatched at j

Bifurcation

Images – Sean Eddy

S(i,j) is the folding of the subsequence of the RNA strand from index i to index j which results in the highest number of base pairs

Base Pair Maximization – Dynamic Programming Algorithm

Alignment Method
– Align RNA strand to itself
– Score increases for feasible base pairs

Each score independent of overall structure

Bifurcation adds extra dimension

Initialize first two diagonal arrays to 0

Fill in squares sweeping diagonally

Images – Sean Eddy

Dynamic Programming – possible paths

Bases cannot pair, similar to unmatched alignment: S(i + 1, j) or S(i, j – 1)

Bases can pair, similar to matched alignment: S(i + 1, j – 1) + 1

Base Pair Maximization – Dynamic Programming Algorithm

Alignment Method
– Align RNA strand to itself
– Score increases for feasible base pairs

Each score independent of overall structure

Bifurcation adds extra dimension

Initialize first two diagonal arrays to 0

Fill in squares sweeping diagonally

Images – Sean Eddy

Dynamic Programming – possible paths

Bases cannot pair, similar to unmatched alignment

Bases can pair, similar to matched alignment

Bifurcation – add values for all k

Reminder: for all k, S(i, k) + S(k + 1, j)

k = 0 : Bifurcation max in this case
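
The recurrence built up over these slides (unmatched at i, unmatched at j, pair i with j, bifurcation at k) is commonly known as the Nussinov algorithm. The Python sketch below fills the matrix diagonal by diagonal and returns the maximum number of base pairs. The function name, the `min_loop` parameter, and the decision to allow G-U wobble pairs alongside Watson-Crick pairs are assumptions made for this example.

```python
# Allowed pairs: Watson-Crick plus G-U wobble (the wobble pair is an assumption).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(rna, min_loop=0):
    """Maximum number of nested base pairs in rna.

    min_loop is the minimum number of unpaired bases enclosed by a pair;
    0 reproduces the simple example, 3 is a more realistic hairpin limit.
    """
    n = len(rna)
    if n == 0:
        return 0
    # S[i][j] = best score for subsequence i..j; everything starts at 0,
    # which covers the main diagonal and the one below it.
    S = [[0] * n for _ in range(n)]
    for span in range(1, n):               # sweep diagonals outward
        for i in range(n - span):
            j = i + span
            best = max(S[i + 1][j], S[i][j - 1])      # i or j unmatched
            if (rna[i], rna[j]) in PAIRS and span > min_loop:
                best = max(best, S[i + 1][j - 1] + 1)  # i pairs with j
            for k in range(i + 1, j):                  # bifurcation
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S[0][n - 1]


pairs_simple = max_base_pairs("GGGAAACCC")
pairs_realistic = max_base_pairs("GGGAAACCC", min_loop=3)
```

The three nested loops over i, j and k give a running time that grows with the cube of the sequence length. Recovering the structure itself would need a traceback through the matrix, which is omitted here.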

Base Pair Maximization - Drawbacks

Base pair maximization will not necessarily lead to the most stable structure

May create structure with many interior loops or hairpins which are energetically unfavorable

Comparable to aligning sequences with scattered matches – not biologically reasonable

Energy Minimization

Thermodynamic Stability

Estimated using experimental techniques

Theory: Most Stable is the Most likely

No Pseudoknots due to algorithm limitations

Uses Dynamic Programming alignment technique

Attempts to maximize the score taking into account thermodynamics

MFOLD and ViennaRNA

Energy Minimization Results

Linear RNA strand folded back on itself to create secondary structure

Circularized representation uses this requirement

Arcs represent base pairing

Images – David Mount

All loops must have at least 3 bases in them

Equivalent to having 3 base pairs between all arcs

Exception: Location where the beginning and end of RNA come together in circularized representation

Trouble with Pseudoknots

Pseudoknots cause a breakdown in the Dynamic Programming Algorithm.

In order to form a pseudoknot, checks must be made to ensure base is not already paired – this breaks down the recurrence relations

Images – David Mount
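
In the arc picture, a pseudoknot shows up as two arcs that cross. A minimal Python check for this, assuming the structure is given as a list of (i, j) index pairs and using the hypothetical name `has_pseudoknot`:

```python
def has_pseudoknot(pairs):
    """Return True if any two base pairs cross, i.e. i < k < j < l."""
    # Normalise so each pair is (smaller index, larger index), sorted by start.
    arcs = sorted((min(i, j), max(i, j)) for i, j in pairs)
    for a, (i, j) in enumerate(arcs):
        for k, l in arcs[a + 1:]:
            if i < k < j < l:
                return True
    return False


nested = has_pseudoknot([(0, 8), (1, 7), (2, 6)])
crossing = has_pseudoknot([(0, 5), (3, 8)])
```

Nested and side-by-side arcs can be split into independent subproblems, which is what the recurrence relies on; crossing arcs cannot.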

Energy Minimization Drawbacks

Compute only one optimal structure

Usual drawbacks of purely mathematical approaches

Similar difficulties in other algorithms

Protein structure

Exon finding

Alternative Algorithms - Covariation

Incorporates Similarity-based method

Evolution maintains sequences that are important

Change in sequence coincides to maintain structure through base pairs (Covariance)

Cross-species structure conservation example – tRNA

Manual and automated approaches have been used to identify covarying base pairs

Models for structure based on results

Ordered Tree Model

Stochastic Context Free Grammar

Expect areas of base pairing in tRNA to be covarying between various species

Base pairing creates same stable tRNA structure in organisms

Mutation in one base makes pairing impossible and breaks down structure

Covariation ensures ability to base pair is maintained and RNA structure is conserved
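
One common automated way to spot covarying positions is to compute the mutual information between two columns of a multiple alignment. The slides do not say which measure was used, so the choice of mutual information is an assumption, as is the function name `mutual_information`. Each argument is one alignment column written as a string, one character per sequence.

```python
from collections import Counter
from math import log2


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns."""
    if len(col_i) != len(col_j) or not col_i:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_i)
    freq_i = Counter(col_i)
    freq_j = Counter(col_j)
    freq_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in freq_ij.items():
        p_ab = count / n
        # Compare the observed joint frequency with what independence predicts.
        mi += p_ab * log2(p_ab / ((freq_i[a] / n) * (freq_j[b] / n)))
    return mi


covarying = mutual_information("GACU", "CUGA")   # every change is compensated
conserved = mutual_information("GGGG", "CCCC")   # no variation at all
```

Perfectly compensated columns over four bases reach 2 bits, while fully conserved columns score 0 even though they may still pair. This is why covariation only detects pairs at positions that actually vary.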

Binary Tree Representation of RNA Secondary Structure

Representation of RNA structure using Binary tree

Nodes represent

Base pair if two bases are shown

Loop if base and “gap” (dash) are shown

Pseudoknots still not represented

Tree does not permit varying sequences

Mismatches

Insertions & Deletions

Images – Eddy et al.

Covariance Model

HMM which permits flexible alignment to an RNA structure – emission and transition probabilities

Model trees based on finite number of states

Match states – sequence conforms to the model:

MATP – State in which bases are paired in the model and sequence

MATL & MATR – State in which either right or left bulges in the sequence and the model

Deletion – State in which there is deletion in the sequence when compared to the model

Insertion – State in which there is an insertion relative to model

Transitions have probabilities

Varying probability – Enter insertion, remain in current state, etc.

Bifurcation – no probability, describes path

Covariance Model (CM) Training Algorithm

S(i,j) = Score at indices i and j in RNA when aligned to the Covariance Model

Independent frequency of seeing the symbols (A, C, G, T) in locations i or j depending on symbol.

Frequencies obtained by aligning model to “training data” – consists of sample sequences

Reflect values which optimize alignment of sequences to model

Frequency of seeing the symbols (A, C, G, T) together in locations i and j depending on symbol.

Alignment to CM Algorithm

Calculate the probability score of aligning RNA to CM

Three dimensional matrix – O(n³)

Align sequence to given subtrees in CM

For each subsequence calculate all possible states

Subtrees evolve from Bifurcations

For simplicity Left singlet is default

Images – Eddy et al.

Alignment to CM Algorithm

• For each calculation take into account the
– Transition (T) to next state
– Emission probability (P) in the state as determined by training data

Bifurcation – does not have a probability associated with the state

Deletion – does not have an emission probability (P) associated with it

Images – Eddy et al.

Covariance Model Drawbacks

Needs to be well trained

Not suitable for searches of large RNA

Structural complexity of large RNA cannot be modeled

Runtime

Memory requirements

End Theory II

Mindmapping

10 min break

Practice II

RNA Secondary Structure

• Online
– http://compbio.cs.sfu.ca/taverna/alterna/
– http://www.bioinfo.rpi.edu/applications/mfold/

• Download
– RNAshapes
– RNAfold

• Get RNAs
– http://www.ncrna.org/frnadb/search.html
