Swiss Institute of Bioinformatics Protein Structure Bioinformatics Introduction Secondary Structure & Protein Disorder Prediction EMBnet course Lausanne, February 21, 2007 Lorenza Bordoli Overview Introduction Secondary Structure Prediction Solvent Accessibility Prediction Disorder Prediction

Protein Structure Bioinformatics Introduction · 1D-Structure prediction Secondary Structure Prediction • As starting point for 3D modeling • Improve sequence alignments • Use in


Swiss Institute of Bioinformatics

Protein Structure Bioinformatics Introduction

Secondary Structure & Protein Disorder Prediction

EMBnet course Lausanne, February 21, 2007

Lorenza Bordoli

Overview

Introduction

Secondary Structure Prediction

Solvent Accessibility Prediction

Disorder Prediction


Principles of protein structure

Primary Structure

Secondary Structure

Tertiary Structure (Fold)

Quaternary Structure

Principles of protein structure

Protein structures include:

Core region:

• Secondary structure elements packed in close proximity in a hydrophobic environment

• Limited amino acid substitution

Outside the core:

• Loops and structural elements in contact with water, membrane or other proteins

• Amino acid substitution: not as restricted as in the core


Protein Structures:

Solvent Accessibility

• Buried

• Solvent exposed

Primary Structure

Secondary Structure

Tertiary Structure (Fold)

Quaternary Structure

Secondary Structures:

• α Helix

• β Sheet

Secondary structure assignment

DSSP

Dictionary of Secondary Structure of Proteins (Kabsch & Sander, 1983)

Based on recognition of hydrogen-bonding patterns in known structures

Automated assignment of secondary structure

Interprets backbone hydrogen bonds

Uses a Coulomb approximation for the hydrogen bond energy (-0.5 kcal/mol cut-off)

Secondary structures are assigned to consecutive segments of residues with hydrogen bonds
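The Coulomb approximation can be sketched in a few lines. The slides give only the -0.5 kcal/mol cut-off; the partial charges (0.42 e and 0.20 e) and the factor 332 are taken from the original DSSP publication rather than from these slides, and the example distances are purely illustrative assumptions.

```python
# Sketch of a DSSP-style electrostatic hydrogen-bond energy.
# Constants are assumed from Kabsch & Sander (1983), not from the slides.
Q1 = 0.42      # partial charge on the C=O group, in units of e
Q2 = 0.20      # partial charge on the N-H group, in units of e
F = 332.0      # dimensional factor, kcal * Angstrom / (mol * e^2)
CUTOFF = -0.5  # kcal/mol, the cut-off quoted on the slide


def hbond_energy(r_on, r_ch, r_oh, r_cn):
    """Energy (kcal/mol) from the four interatomic distances in Angstrom."""
    return Q1 * Q2 * F * (1.0 / r_on + 1.0 / r_ch - 1.0 / r_oh - 1.0 / r_cn)


def is_hbond(r_on, r_ch, r_oh, r_cn):
    """A backbone hydrogen bond is counted when the energy is below the cut-off."""
    return hbond_energy(r_on, r_ch, r_oh, r_cn) < CUTOFF


# Hypothetical geometry of a well-formed N-H...O=C bond (distances in Angstrom).
print(round(hbond_energy(2.9, 3.2, 1.9, 4.1), 2))
print(is_hbond(2.9, 3.2, 1.9, 4.1))
```

A more negative energy means a stronger bond; groups that are far apart give an energy near zero and are not counted.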


Secondary structure assignment

DSSP secondary structure elements: 8 secondary structure classes, reduced to 3 states

– H (α-helix) → H

– G (3₁₀-helix) → H

– I (π-helix) → H

– E (extended strand) → E

– B (residue in isolated β-bridge) → E

– T (turn) → L

– S (bend) → L

– " " (blank = other) → L
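The reduction from eight DSSP classes to three states is a plain lookup. A minimal sketch, with the dictionary and function names chosen here for illustration only:

```python
# Mapping of the 8 DSSP classes onto the 3-state alphabet listed above.
DSSP_TO_3STATE = {
    "H": "H", "G": "H", "I": "H",  # helices
    "E": "E", "B": "E",            # strand and isolated bridge
    "T": "L", "S": "L", " ": "L",  # turn, bend, other
}


def reduce_dssp(assignment):
    """Convert a DSSP string (8 classes) into a 3-state string (H, E, L)."""
    return "".join(DSSP_TO_3STATE[state] for state in assignment)


print(reduce_dssp("HGIEBTS "))
```

Other reduction schemes exist (for example treating B as loop); the one shown is the one on the slide.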

How many structures do we know?

http://www.wwpdb.org/


How many structures do we know?

PDB: http://www.pdb.org

X-Ray, NMR => atom coordinates of the proteins are deposited in PDB: worldwide repository for 3-D biological macromolecular structure data.

EBI-MSD: http://www.ebi.ac.uk/msd/ (2003)

Suite of web-based search and retrieval interfaces for macromolecular structure research.

How many structures do we know?


[ PDB: http://www.pdb.org ]

[Figure: growth of the Protein Data Bank (PDB), total and yearly entries.]

[Figure: number of entries in TrEMBL, SwissProt and PDB, 1986 to 2006, on a logarithmic axis from 100 to 10,000,000. Sources: PDB, EBI, SIB.]

No experimental structure for most protein sequences

How many structures do we know?


[Figure: Genome view (B. Rost, Columbia, New York). Labels: 3D homology modeling, fold recognition, some model, 1D, ... ?]

1D-Structure prediction

Secondary Structure Prediction

As starting point for 3D modeling

Improve sequence alignments

Use in fold recognition

Definition of loops / core regions

Solvent Accessibility Prediction

Identify exposed residues, e.g. for mutation studies, epitopes, etc.


Secondary Structure prediction

What is protein secondary structure prediction?

Simplification of prediction problem

3D → 1D

Secondary Structure prediction

Reduction to secondary structure “3-state” model: β-Strand (S or E), α-Helix (H), Loop (L)

Projection onto strings of structural assignments

SEQ MRIILLGAPGAGKGTQAQFIMEKYGIPQISTGDMLRAAVKSGSELGKQAK
SS  SSSSSSLLLLLLHHHHHHHHHHHLLLSSSLHHHHHHHHHHHLLLLLLHHH
SS  SSSSSS      HHHHHHHHHHH   SSS HHHHHHHHHHH      HHH


Secondary Structure prediction

Assumption: there should be a correlation between amino acid sequence and secondary structure

Conformational Preferences

[Figure: conformational preferences of the amino acids; panels labeled α, β and RT. Source: Biochimica et Biophysica Acta 916: 200-204 (1987).]


1st Generation secondary structure prediction

1st generation: based on single amino acid propensities

• Chou and Fasman, 1974

• Robson, 1976

• GOR-1: Garnier, Osguthorpe, and Robson, 1978

Preference of particular residues for certain secondary structure elements:

Single-residue statistics: analysis of the frequency of each of the 20 amino acids in α helices, β strands or coils

Structure databases were of very limited size

Chou-Fasman tables

Name           P(H)  P(E)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Alanine         142    83     66    0.060  0.076   0.035   0.058
Arginine         98    93     95    0.070  0.106   0.099   0.085
Aspartic Acid   101    54    146    0.147  0.110   0.179   0.081
Asparagine       67    89    156    0.161  0.083   0.191   0.091
Cysteine         70   119    119    0.149  0.050   0.117   0.128
Glutamic Acid   151    37     74    0.056  0.060   0.077   0.064
Glutamine       111   110     98    0.074  0.098   0.037   0.098
Glycine          57    75    156    0.102  0.085   0.190   0.152
Histidine       100    87     95    0.140  0.047   0.093   0.054
Isoleucine      108   160     47    0.043  0.034   0.013   0.056
Leucine         121   130     59    0.061  0.025   0.036   0.070
Lysine          114    74    101    0.055  0.115   0.072   0.095
Methionine      145   105     60    0.068  0.082   0.014   0.055
Phenylalanine   113   138     60    0.059  0.041   0.065   0.065
Proline          57    55    152    0.102  0.301   0.034   0.068
Serine           77    75    143    0.120  0.139   0.125   0.106
Threonine        83   119     96    0.086  0.108   0.065   0.079
Tryptophan      108   137     96    0.077  0.013   0.064   0.167
Tyrosine         69   147    114    0.082  0.065   0.114   0.125
Valine          106   170     50    0.062  0.048   0.028   0.053


Chou-Fasman

How it works:

a. Assign all of the residues the appropriate set of parameters.

b. Identify α-helix and β-sheet regions. Extend the regions in both directions.

c. If structures overlap, compare the average values for P(H) and P(E) and assign the secondary structure with the best score.

d. Turns are modeled as tetra-peptides using 2 different probability values.

Assign Pij values

1. Assign all of the residues the appropriate set of parameters (one-letter code; T = threonine)

           T    S    P    T    A    E    L    M    R    S    T    G
P(H)      83   77   57   83  142  151  121  145   98   77   83   57
P(E)     119   75   55  119   83   37  130  105   93   75  119   75
P(turn)   96  143  152   96   66   74   59   60   95  143   96  156


Scan peptide for α-helix regions

2. Identify regions where 4 of 6 contiguous residues have P(H) > 100 (“alpha-helix nucleus”)

           T    S    P    T    A    E    L    M    R    S    T    G
P(H)      83   77   57   83  142  151  121  145   98   77   83   57

Extend α-helix nucleus

3. Extend the helix in both directions until a set of four contiguous residues has an average P(H) < 100.

Repeat steps 1 – 3 for the entire peptide
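The nucleation and extension steps for helices can be sketched as follows. This is one plausible reading of the rule, not a reference implementation: the P(H) values come from the Chou-Fasman table (T = threonine), the function names are hypothetical, and the extension slides a four-residue window outward one residue at a time.

```python
# P(H) values from the Chou-Fasman table for the residues of the example peptide.
P_H = {"T": 83, "S": 77, "P": 57, "A": 142, "E": 151,
       "L": 121, "M": 145, "R": 98, "G": 57}


def helix_nuclei(seq, window=6, needed=4, threshold=100):
    """Start indices of windows in which at least `needed` residues have P(H) > threshold."""
    starts = []
    for i in range(len(seq) - window + 1):
        hits = sum(1 for aa in seq[i:i + window] if P_H[aa] > threshold)
        if hits >= needed:
            starts.append(i)
    return starts


def mean_ph(segment):
    return sum(P_H[aa] for aa in segment) / len(segment)


def extend_helix(seq, start, end, threshold=100):
    """Extend the half-open region [start, end) while the outermost four residues average >= threshold."""
    while start > 0 and mean_ph(seq[start - 1:start + 3]) >= threshold:
        start -= 1
    while end < len(seq) and mean_ph(seq[end - 3:end + 1]) >= threshold:
        end += 1
    return start, end


peptide = "TSPTAELMRSTG"
nuclei = helix_nuclei(peptide)
print(nuclei)                                   # 0-based start positions of helix nuclei
print(extend_helix(peptide, nuclei[0], nuclei[0] + 6))
```

The final acceptance test (segment length and average P(H) versus average P(E)) is deliberately left out; it follows the same pattern with the P(E) column.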


Scan peptide for β-sheet regions

4. Identify regions where 3 of 5 contiguous residues have P(E) > 100 (“β-sheet nucleus”)

Extend the β-sheet until 4 contiguous residues have an average P(E) < 100

If the region average P(E) > 105 and the average P(E) > average P(H), then assign “β-sheet”

           T    S    P    T    A    E    L    M    R    S    T    G
P(H)      83   77   57   83  142  151  121  145   98   77   83   57
P(E)     119   75   55  119   83   37  130  105   93   75  119   75

Chou-Fasman

1. Assign all of the residues in the peptide the appropriate set of parameters.

2. Scan through the peptide and identify regions where 4 out of 6 contiguous residues have P(a-helix) > 100. That region is declared an alpha-helix. Extend the helix in both directions until a set of four contiguous residues with an average P(a-helix) < 100 is reached. That is declared the end of the helix. If the segment defined by this procedure is longer than 5 residues and the average P(a-helix) > average P(b-sheet) for that segment, the segment can be assigned as a helix.

3. Repeat this procedure to locate all of the helical regions in the sequence.

4. Scan through the peptide and identify a region where 3 out of 5 contiguous residues have P(b-sheet) > 100. That region is declared a beta-sheet. Extend the sheet in both directions until a set of four contiguous residues with an average P(b-sheet) < 100 is reached. That is declared the end of the beta-sheet. Any segment of the region located by this procedure is assigned as a beta-sheet if the average P(b-sheet) > 105 and the average P(b-sheet) > average P(a-helix) for that region.

5. Any region containing overlapping alpha-helical and beta-sheet assignments is taken to be helical if the average P(a-helix) > average P(b-sheet) for that region. It is a beta-sheet if the average P(b-sheet) > average P(a-helix) for that region.

6. To identify a bend (turn) at residue number j, calculate the following value: p(t) = f(j) · f(j+1) · f(j+2) · f(j+3), where the f(i) column is used for residue j, the f(i+1) column for residue j+1, the f(i+2) column for residue j+2 and the f(i+3) column for residue j+3. If (1) p(t) > 0.000075; (2) the average value of P(turn) over the tetra-peptide is > 100 (1.00 on the unscaled propensity scale); and (3) the averages for the tetra-peptide obey the inequality P(a-helix) < P(turn) > P(b-sheet), then a beta-turn is predicted at that location.
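The turn rule in step 6 can be checked numerically. The sketch below assumes the table scale (propensities around 100) and copies the parameters of eight residues from the Chou-Fasman table; the names `PARAMS`, `turn_probability` and `is_turn` are illustrative.

```python
# Per residue: (P(H), P(E), P(turn), (f(i), f(i+1), f(i+2), f(i+3))), from the Chou-Fasman table.
PARAMS = {
    "N": (67, 89, 156, (0.161, 0.083, 0.191, 0.091)),
    "P": (57, 55, 152, (0.102, 0.301, 0.034, 0.068)),
    "D": (101, 54, 146, (0.147, 0.110, 0.179, 0.081)),
    "G": (57, 75, 156, (0.102, 0.085, 0.190, 0.152)),
    "A": (142, 83, 66, (0.060, 0.076, 0.035, 0.058)),
    "E": (151, 37, 74, (0.056, 0.060, 0.077, 0.064)),
    "L": (121, 130, 59, (0.061, 0.025, 0.036, 0.070)),
    "M": (145, 105, 60, (0.068, 0.082, 0.014, 0.055)),
}


def turn_probability(tetra):
    """p(t): product of the position-specific turn frequencies of four residues."""
    p = 1.0
    for position, aa in enumerate(tetra):
        p *= PARAMS[aa][3][position]
    return p


def is_turn(tetra):
    """Apply the three conditions of step 6 to a tetra-peptide."""
    def average(column):
        return sum(PARAMS[aa][column] for aa in tetra) / 4.0
    p_helix, p_sheet, p_turn = average(0), average(1), average(2)
    return turn_probability(tetra) > 0.000075 and p_turn > 100 and p_helix < p_turn > p_sheet


print(is_turn("NPDG"))  # turn-forming residues
print(is_turn("AELM"))  # helix-forming residues
```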


Chou-Fasman Results

CHOFAS predicts protein secondary structure
version 2.0u61 September 1998
Please cite: Chou and Fasman (1974) Biochem., 13:222-245
Chou-Fasman plot of @, 12 aa; SEQ1 sequence.

TSPTAELMRSTG
helix <>
sheet EEEEEEE
turns T

Residue totals: H: 2 E: 7 T: 1
percent: H: 16.7 E: 58.3 T: 8.3

Performance Evaluation

Assumption: there should be a correlation* between amino acid sequence and secondary structure

Systematic performance testing is a pre-requisite for the reliability of a method

Dataset: PDB

• Training set: PDB sub-set, used to derive the correlation*

• Test set: PDB sub-set, used to measure Q3


Accuracy of prediction

3-state-per-residue accuracy:

Gives % of correctly predicted residues in α, β or other state

Q3 = 100 · Σ ci / N

• N = total number of residues

• ci = number of correctly predicted residues in state i (H, E, L)

1st Generation secondary structure prediction

3-state per residue accuracy assessment

Typical 1st generation prediction result (TYP) compared with the secondary structure (SS):

SEQ KELVLALYDYQEKSPREVTMKKGDILTLLNSTNKDWWKVEVNDRQGFVPAAYVKKLD
SS  EEEE E E E EEEEEE EEEEEE EEEEEEHHHEEEE
TYP EHHHH EE EEEE EE HHHEE EEEHH

3-state per residue accuracy:

Q3 = 100 · Σ ci / N

ci = number of correctly predicted residues in state i (H, E, L)

N = number of all residues

50 – 55 % Q3 accuracy

Performance is overestimated!

Q3 (random) = 35.2 %
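Q3 is simple to compute once observed and predicted states are available as equal-length strings. A minimal sketch with made-up strings (not the example from the slide, whose column alignment is not preserved here):

```python
def q3(observed, predicted):
    """Percentage of residues whose predicted state (H, E or L) matches the observed one."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty strings of equal length")
    correct = sum(1 for o, p in zip(observed, predicted) if o == p)
    return 100.0 * correct / len(observed)


# Hypothetical 10-residue example with a single wrong residue.
print(q3("HHHHEEEELL", "HHHEEEEELL"))
```

Summing the per-state counts ci and counting matches over all residues are the same thing, which is why the function needs only one pass.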


2nd Generation secondary structure prediction

Improvements

Larger database of protein structures

Segment-based statistics (11-21 residue window)

Basic idea:

"How likely is it that the central residue in a window adopts a particular secondary structure state?"
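Segment-based methods look at each residue through a window centred on it. The sketch below produces one window per residue; the window size and the "." spacer used to pad the chain ends are assumptions made for illustration.

```python
def windows(seq, size=13, pad="."):
    """Return one window of `size` residues per position, centred on that position."""
    if size % 2 == 0:
        raise ValueError("window size must be odd so that there is a central residue")
    half = size // 2
    padded = pad * half + seq + pad * half
    return [padded[i:i + size] for i in range(len(seq))]


for window in windows("ACDEF", size=3):
    print(window, "-> central residue", window[1])
```

Each window then becomes one input to the predictor, and the output is the state of its central residue.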

Algorithm used:

Presumably all conceivable algorithms on this planet have been applied to the secondary structure prediction problem.

E.g. statistical information, physicochemical properties, sequence patterns, neural networks, graph theory, expert rules

[Figure: (H) α-Helix, local interactions]

Neural Network

Artificial intelligence: computer programs are trained to recognize amino acid patterns that are located in known secondary structures and to distinguish them from other patterns not located in these structures.

NN can detect interactions between amino acids in a sequence window.


Artificial Neural Networks (ANN)

Excursion: Introduction to Artificial Neural Networks

Thanks to C. Pellegrini & P. Palagi (SIB) for slides about ANNs.

Inspiration - The brain

• Capable of remembering, recognizing patterns and associating. Main characteristics:

  • massively parallel

  • non-linear

  • huge number of slow units, highly connected

  • self-organizing and self-adapting

• Some statistics about the brain:

  • 10^11 neurons

  • 10^15 connections

• and about neurons:

  • 1 neuron is connected with 10^3 to 10^5 other neurons

  • slow: 10^-3 sec (silicon logic gates: 10^-9 sec)

C. Pellegrini (SIB)


The nervous system

The brain continually receives information, perceives it and makes appropriate decisions.

The human nervous system is a three-stage system:

Stimulus → Receptors → Neural networks (brain) → Effectors → Responses

[Figure: arrows between the stages indicate forward information flow and feedback.]

C. Pellegrini (SIB)

An artificial neural network

• An artificial neural network (ANN) is a “machine”:

  • assembly of artificial neurons

  • created to model the way the brain executes tasks by mathematically simulating the neurons and their connections

• Requirements to achieve a good performance:

  • a huge number of neurons

  • massive interconnection among them

C. Pellegrini (SIB)


Artificial neuron model

• Introduced by McCulloch & Pitts (1943):

$v = \sum_i w_i x_i$

if $v > \theta$ then output = +1, else output = -1

[Figure: inputs x1 and x2 with weights w1 and w2 feed the summation v; the threshold θ is drawn as the weight on a constant input -1; one output.]

Quite simple: all signals can be 1 or -1. The neuron calculates a weighted sum of inputs and compares it to a threshold. If the sum is higher than the threshold, the output is set to 1, otherwise to -1.

P.Palagi (SIB)
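The McCulloch & Pitts neuron described above fits in a few lines. As an illustration, the weights and threshold below are chosen (an assumption, not taken from the slides) so that the neuron computes a logical AND on ±1 signals.

```python
def mp_neuron(inputs, weights, theta):
    """Weighted sum of the inputs compared against the threshold theta; output is +1 or -1."""
    v = sum(w * x for w, x in zip(weights, inputs))
    return 1 if v > theta else -1


# Hypothetical parameters: both weights 1, threshold 1 -> fires only if both inputs are +1.
for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2, mp_neuron((x1, x2), (1, 1), 1))
```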

Artificial neuron model

• This simple neuron model consists of:

  • A set of connections, called synapses, which link the neuron to other neurons to create a network. Each synapse has a synaptic weight which represents the strength of the connection.

  • One unit which multiplies each incoming activity by the weight on the connection and adds together all these weighted inputs to get a total input.

  • An activation function that transforms the total input into an outgoing activity (to constrain the output amplitude).

$v = \sum_i w_i x_i$

if $v > \theta$ then output = +1, else output = -1

P.Palagi (SIB)


Artificial neuron model

Modern McCulloch & Pitts neuron:

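[Figure: the input signals x1, x2, x3, ..., xp are multiplied by the synaptic weights wk1, wk2, wk3, ..., wkp and added in the summation unit Σ to give vk; together with the threshold θk, vk passes through the activation function φ(·) to give the output yk.]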

P.Palagi (SIB)

Artificial neuron model

The model can be mathematically described:

$v_k = \sum_{j=1}^{p} w_{kj} x_j$ and $y_k = \varphi(v_k - \theta_k)$

Where:

$x_1, x_2, \ldots, x_p$ are the inputs,

$w_{k1}, w_{k2}, \ldots, w_{kp}$ are the synaptic weights of neuron $k$,

$v_k$ is the linear combiner output,

$\theta_k$ is the threshold,

$\varphi(\cdot)$ is the activation function,

$y_k$ is the output signal of the neuron.

P.Palagi (SIB)


Types of activation functions

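The activation function defines the output of a neuron in terms of the activity level at its inputs. There are 3 basic types of activation functions.

• threshold function

$y_k = \varphi(v_k) = \begin{cases} 1 & \text{if } v_k \ge 0 \\ 0 & \text{if } v_k < 0 \end{cases}$

• piecewise-linear function

$y_k = \varphi(v_k) = \begin{cases} 1 & \text{if } v_k \ge \alpha \\ v_k & \text{if } \alpha > v_k > \beta \\ 0 & \text{if } v_k \le \beta \end{cases}$

• sigmoid function

$\varphi(v) = \frac{1}{1 + e^{-av}}$ or $\varphi(v) = \tanh\left(\frac{v}{2}\right) = \frac{1 - e^{-v}}{1 + e^{-v}}$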
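The three basic activation function types can be written down directly. In this sketch the break points of the piecewise-linear function (alpha = 1, beta = 0) and the slope a = 1 of the logistic sigmoid are assumed defaults, chosen only so that the example runs.

```python
import math


def threshold(v):
    """Threshold function: 1 for v >= 0, otherwise 0."""
    return 1.0 if v >= 0 else 0.0


def piecewise_linear(v, alpha=1.0, beta=0.0):
    """Saturates at 1 above alpha and at 0 below beta; linear in between."""
    if v >= alpha:
        return 1.0
    if v > beta:
        return v
    return 0.0


def sigmoid(v, a=1.0):
    """Logistic sigmoid with slope parameter a."""
    return 1.0 / (1.0 + math.exp(-a * v))


def tanh_sigmoid(v):
    """Hyperbolic-tangent form, tanh(v / 2), with outputs between -1 and 1."""
    return math.tanh(v / 2.0)


for v in (-2.0, 0.0, 0.5, 2.0):
    print(v, threshold(v), piecewise_linear(v), round(sigmoid(v), 3), round(tanh_sigmoid(v), 3))
```

The smooth sigmoid forms are the ones used with gradient-based training, because unlike the threshold function they are differentiable everywhere.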

Activation functions - interpretation

An activation function is a decision function:

• it defines a threshold below which the activation value will not fire any output,

• it allows selecting, linearly or not, among different activation values,

• the highest output value comes from the highest activation value, i.e. the greatest similarity between the input values and the synaptic weights.


Network architectures

The power of neural networks comes from their collective behavior in a network where all neurons are interconnected. The network starts evolving: neurons continuously evaluate their output by looking at their inputs, calculating the weighted sum and comparing it to a threshold to decide if they should fire. This is a highly complex parallel process whose features cannot be reduced to phenomena taking place in individual neurons.

Network architectures

Neural networks are formed by an assembly of many artificial neurons. An artificial neural network may be seen as a massively parallel distributed processor.

The basic operation of a neural network is determined by learning. The memorized information is retained in the synaptic weights.

⇒ Knowledge is represented by the free parameters of the neural network, i.e. synaptic weights and thresholds.


Single-layer Feed-forward Network

[Figure: single-layer perceptron. The input signals x1, x2, x3, ..., xm are connected to a single layer of neurons.]

Learning methods

An artificial neural network learning method is a procedure which adjusts the free parameters of the neural network, i.e. synaptic weights and thresholds.

Supervised: We feed the neural network with k inputs (entries) and their corresponding desired outputs. The learning algorithm modifies (little by little) the synaptic weights to bring the obtained output closer to the desired output. Only the synaptic weights which produce an error are modified.

Non-supervised: We feed the neural network with the inputs (entries) only. The neural network will organise itself in order to represent the input data.
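Supervised learning for a single threshold neuron can be illustrated with the classic perceptron rule. This is a sketch under assumptions: the learning rate, the number of epochs and the AND training set on ±1 signals are chosen for illustration and do not come from the slides.

```python
def predict(x, weights, theta):
    """Threshold neuron: +1 if the weighted sum exceeds theta, otherwise -1."""
    v = sum(w * xi for w, xi in zip(weights, x))
    return 1 if v > theta else -1


def train_perceptron(samples, epochs=20, rate=0.1):
    """Adjust weights and threshold only when a sample is misclassified."""
    weights = [0.0] * len(samples[0][0])
    theta = 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(x, weights, theta)
            if error != 0:
                weights = [w + rate * error * xi for w, xi in zip(weights, x)]
                theta -= rate * error
    return weights, theta


# Hypothetical training set: logical AND on +1/-1 signals.
AND_SAMPLES = [((-1, -1), -1), ((-1, 1), -1), ((1, -1), -1), ((1, 1), 1)]
weights, theta = train_perceptron(AND_SAMPLES)
print([predict(x, weights, theta) for x, _ in AND_SAMPLES])
```

Because AND is linearly separable, the rule settles on a solution after a few passes; a problem that is not linearly separable needs hidden neurons.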


Multilayer Feed-forward Network

[Figure: multi-layer perceptron. The input signals x1, x2, x3, ..., xm feed a layer of hidden neurons, which in turn feeds the output layer.]

Training a neural network

Supervised Learning

We feed the neural network with the input (entries) and the corresponding desired output.

The learning algorithm modifies (step by step) the synaptic weights to adapt the obtained output according to the desired output. Only the synaptic weights which produce an error are modified.

The error back-propagation algorithm consists of two phases:

the forward phase where the activations are propagated from the input to the output layer, and

the backward phase, where the error between the observed actual and the requested nominal value in the output layer is propagated backwards in order to modify the weights and bias values.
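The two phases can be made concrete with a very small network. The sketch below assumes one hidden layer, sigmoid units, a squared-error measure and plain gradient descent; the network size, initial weights and learning rate are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def forward(x, w_hidden, w_out):
    """Forward phase: propagate activations from input to output (last weight in each row is the bias)."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x + [1.0]))) for row in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden + [1.0])))
    return hidden, output


def backprop_step(x, target, w_hidden, w_out, rate=0.5):
    """Backward phase: propagate the output error back and update all weights and biases."""
    hidden, output = forward(x, w_hidden, w_out)
    delta_out = (output - target) * output * (1.0 - output)
    delta_hidden = [delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                    for j in range(len(hidden))]
    new_w_out = [w - rate * delta_out * h for w, h in zip(w_out, hidden + [1.0])]
    new_w_hidden = [[w - rate * delta_hidden[j] * xi for w, xi in zip(row, x + [1.0])]
                    for j, row in enumerate(w_hidden)]
    return new_w_hidden, new_w_out


# Hypothetical 2-2-1 network trained on a single example.
w_hidden = [[0.1, -0.2, 0.0], [0.3, 0.4, 0.0]]
w_out = [0.2, -0.1, 0.0]
x, target = [1.0, 0.0], 1.0
print("before:", round(forward(x, w_hidden, w_out)[1], 3))
for _ in range(50):
    w_hidden, w_out = backprop_step(x, target, w_hidden, w_out)
print("after: ", round(forward(x, w_hidden, w_out)[1], 3))
```

Real predictors train on many windows from many proteins and hold out a test set, but the update for each example has this shape.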


[Figure: (H) α-Helix, local interactions]

Neural Networks for Secondary Structure Prediction

Artificial intelligence: computer programs are trained to recognize amino acid patterns that are located in known secondary structures and to distinguish them from other patterns not located in these structures.

NN can detect interactions between amino acids in a sequence window.

[Figure (B. Rost, Columbia, New York): training a NN. A 13-residue window with its observed states, D (L), R (E), Q (E), G (E), F (E), V (E), P (E), A (H), A (H), Y (H), V (E), K (E), K (E), is presented to the input layer (one unit per symbol of ACDEFGHIKLMNPQRSTVWY.); weights connect it to the hidden layer and to the output layer with units H, E and L.]

Neural Networks for Secondary Structure Prediction


[Figure (B. Rost, Columbia, New York): prediction for the same window D R Q G F V P A A Y V K K. Output units: H = 0.19, E = 0.61, L = 0.17. The winner is E, which becomes the prediction.]
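Picking the winner is an arg-max over the output units. A minimal sketch using the three output values shown in the figure; assigning them to H, E and L in that order is an assumption based on the order in which they appear.

```python
def predict_state(outputs):
    """Return the state whose output unit has the highest value."""
    return max(outputs, key=outputs.get)


outputs = {"H": 0.19, "E": 0.61, "L": 0.17}
print(predict_state(outputs))
```

The margin between the two highest outputs is what reliability indices are commonly derived from.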

Neural Networks for Secondary Structure Prediction

Neural Networks

Benefits:

• Generally applicable

• Can capture higher-order correlations

• Inputs other than sequence information

Drawbacks:

• Need many data points (solved structures)

• Risk of overtraining


2nd Generation secondary structure prediction

Methods:

GORIII

COMBINE

Q3 accuracy < 70%

Problems with first and second generation methods

Q3 accuracy < 70%

β-strands predicted at only 28 - 48 % (slightly better than random)

Predicted helices and strands are too short


The Dinosaurs are still alive …

Bad example:

PeptideStructure makes predictions of the following features of an amino acid sequence:

- Secondary structure according to the Chou-Fasman method

- Secondary structure according to the Garnier-Osguthorpe-Robson method

-…

From the GCG Manual © 1982-2002 Accelrys

3rd Generation secondary structure prediction

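[Figure: a window (IKEEHVI IQAE) slides along the sequence IKEEHVIIQAEFYLNPDQSGEF….. and is presented to the input layer; weights connect the input layer to the hidden layer and the hidden layer to the output layer with units H, E and C.]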

Graphics: C. Lundegaard, CBS


3rd Generation secondary structure prediction

Graphics: C. Lundegaard, CBS

Inp Neuron 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

AAcid

A 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

R 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

N 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

D 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

C 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Q 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0

E 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0

Sparse encoding

3rd Generation secondary structure prediction

Graphics: C. Lundegaard, CBS

IKEEHVI IQAE

00000010000000000000
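Sparse encoding turns every residue into 20 input values with a single 1. In the sketch below the first seven amino acids follow the table above; the order of the remaining thirteen is assumed to follow the BLOSUM column order, and the function names are illustrative.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # A..E as in the table; the rest is an assumed order


def sparse_encode(residue):
    """20 input values: 1 for the unit of this amino acid, 0 for all others."""
    return [1 if aa == residue else 0 for aa in AMINO_ACIDS]


def encode_window(window):
    """Concatenate the sparse codes of all residues in a window."""
    values = []
    for residue in window:
        values.extend(sparse_encode(residue))
    return values


print("".join(str(bit) for bit in sparse_encode("E")))
print(len(encode_window("IKEEHVI")))  # 7 residues x 20 units
```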


3rd Generation secondary structure prediction

Graphics: C. Lundegaard, CBS

BLOSUM 62

   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4

3rd Generation secondary structure prediction

Graphics: C. Lundegaard, CBS

IKEEHVI IQAE

E encoded by its BLOSUM 62 row (20 values):

-1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2


3rd Generation secondary structure prediction

Breakthrough: Using evolutionary information

           1                                                  50
fyn_human  VTLFVALYDY EARTEDDLSF HKGEKFQILN SSEGDWWEAR SLTTGETGYI
yrk_chick  VTLFIALYDY EARTEDDLSF QKGEKFHIIN NTEGDWWEAR SLSSGATGYI
fgr_human  VTLFIALYDY EARTEDDLTF TKGEKFHILN NTEGDWWEAR SLSSGKTGCI
yes_chick  VTVFVALYDY EARTTDDLSF KKGERFQIIN NTEGDWWEAR SIATGKTGYI
src_avis2  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_aviss  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_avisr  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
src_chick  VTTFVALYDY ESRTETDLSF KKGERLQIVN NTEGDWWLAH SLTTGQTGYI
stk_hydat  VTIFVALYDY EARISEDLSF KKGERLQIIN TADGDWWYAR SLITNSEGYI
src_rsvpa  .......... ESRIETDLSF KKRERLQIVN NTEGTWWLAH SLTTGQTGYI
hck_human  ..IVVALYDY EAIHHEDLSF QKGDQMVVLE ES.GEWWKAR SLATRKEGYI
blk_mouse  ..FVVALFDY AAVNDRDLQV LKGEKLQVLR .STGDWWLAR SLVTGREGYV
hck_mouse  .TIVVALYDY EAIHREDLSF QKGDQMVVLE .EAGEWWKAR SLATKKEGYI
lyn_human  ..IVVALYPY DGIHPDDLSF KKGEKMKVLE .EHGEWWKAK SLLTKKEGFI
lck_human  ..LVIALHSY EPSHDGDLGF EKGEQLRILE QS.GEWWKAQ SLTTGQEGFI
ss81_yeast .....ALYPY DADDDdeISF EQNEILQVSD .IEGRWWKAR R.ANGETGII
abl_mouse  ..LFVALYDF VASGDNTLSI TKGEKLRVLG YnnGEWCEAQ ..TKNGQGWV
abl1_human ..LFVALYDF VASGDNTLSI TKGEKLRVLG YnnGEWCEAQ ..TKNGQGWV
src1_drome ..VVVSLYDY KSRDESDLSF MKGDRMEVID DTESDWWRVV NLTTRQEGLI
mysd_dicdi .....ALYDF DAESSMELSF KEGDILTVLD QSSGDWWDAE L..KGRRGKV
yfj4_yeast ....VALYSF AGEESGDLPF RKGDVITILK ksQNDWWTGR V..NGREGIF
abl2_human ..LFVALYDF VASGDNTLSI TKGEKLRVLG YNQNGEWSEV RSKNG.QGWV
tec_human  .EIVVAMYDF QAAEGHDLRL ERGQEYLILE KNDVHWWRAR D.KYGNEGYI
abl1_caeel ..LFVALYDF HGVGEEQLSL RKGDQVRILG YNKNNEWCEA RlrLGEIGWV
txk_human  .....ALYDF LPREPCNLAL RRAEEYLILE KYNPHWWKAR D.RLGNEGLI
yha2_yeast VRRVRALYDL TTNEPDELSF RKGDVITVLE QVYRDWWKGA L..RGNMGIF
abp1_sacex .....AEYDY EAGEDNELTF AENDKIINIE FVDDDWWLGE LETTGQKGLF

3rd Generation secondary structure prediction

[Figure (B. Rost, Columbia, New York): profile-based neural network. Labels: input layer, first or hidden layer, second or output layer; s0, s1, s2; J1, J2; output units H, E, L; pick maximal unit => current prediction. Legend fragment: "corresponds to the 21*3 bits coding for the profile of one residue".]

Protein:    :GYIY DPEDGDPDDGVNP GTDF:
Alignments: :GYIY DPAVGDPDNGVEP GTEF:
            :GYIY DPEVGDPTQNIPP GTKF:
            :GYEY DPAEGDPDNGVKP GTSF:
            :GYEY DPAEGDPDNGVKP GTAF:

Profile table (counts per alignment position over the 5 sequences; "." = 0):

     G S A P D  N T E K Q  C V H I R  L M Y F W
G    5 . . . .  . . . . .  . . . . .  . . . . .
Y    . . . . .  . . . . .  . . . . .  . . 5 . .
I    . . . . .  . . 2 . .  . . . 3 .  . . . . .
Y    . . . . .  . . . . .  . . . . .  . . 5 . .
D    . . . . 5  . . . . .  . . . . .  . . . . .
P    . . . 5 .  . . . . .  . . . . .  . . . . .
E    . . 3 . .  . . 2 . .  . . . . .  . . . . .
D    . . . . 1  . . 2 . .  . 2 . . .  . . . . .
G    5 . . . .  . . . . .  . . . . .  . . . . .
D    . . . . 5  . . . . .  . . . . .  . . . . .
P    . . . 5 .  . . . . .  . . . . .  . . . . .
D    . . . . 4  . 1 . . .  . . . . .  . . . . .
D    . . . . 1  3 . . . 1  . . . . .  . . . . .
G    4 . . . .  1 . . . .  . . . . .  . . . . .
V    . . . . .  . . . . .  . 4 . 1 .  . . . . .
N    . . . 1 .  1 . 1 2 .  . . . . .  . . . . .
P    . . . 5 .  . . . . .  . . . . .  . . . . .
G    5 . . . .  . . . . .  . . . . .  . . . . .
T    . . . . .  . 5 . . .  . . . . .  . . . . .
D    . 1 1 . 1  . . 1 1 .  . . . . .  . . . . .
F    . . . . .  . . . . .  . . . 5 .  . . . . .

(B. Rost, Columbia, New York)
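A profile table of this kind is just a count of residues per alignment column. The sketch below uses the protein fragment and its four aligned sequences from the figure, with the window separators removed; the function name is illustrative.

```python
from collections import Counter

# Protein fragment followed by its four aligned homologues (21 columns each).
ALIGNED = [
    "GYIYDPEDGDPDDGVNPGTDF",
    "GYIYDPAVGDPDNGVEPGTEF",
    "GYIYDPEVGDPTQNIPPGTKF",
    "GYEYDPAEGDPDNGVKPGTSF",
    "GYEYDPAEGDPDNGVKPGTAF",
]


def profile(sequences):
    """One Counter per alignment column, giving how often each residue occurs there."""
    return [Counter(column) for column in zip(*sequences)]


table = profile(ALIGNED)
for position in (0, 2, 6, 19):
    print(position, dict(table[position]))
```

Dividing the counts by the number of sequences gives per-column frequencies; conserved columns have a single dominant residue, variable columns spread the counts over several.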


3rd Generation secondary structure prediction

PHD method (Rost and Sander)

Combine neural networks with MAXHOM multiple sequence profiles

6-8 percentage points increase in prediction accuracy over standard neural networks

Use second layer “Structure to structure” network to filter predictions

Jury of predictors

3rd generation secondary structure prediction

PHD (Rost et al.) Q3 72 - 76 %

[ B.Rost (2001) J.Struct.Biol. 134, 204 ]

[Figure labels: Q3 values 59 %, 65 %, 72 %; prediction reliability (0 = weak, 9 = strong).]


3rd generation secondary structure prediction

PSI-Pred (Jones, DT)

Use alignments from iterative sequence searches (PSI-Blast) as input to a neural network

Better predictions due to better sequence profiles

Available as stand-alone program and via the web

How accurate are predictions today?

[Figure (B. Rost, Columbia, New York): histogram of the number of protein chains (axis 0 to 70) against per-residue accuracy Q3 (axis 0 to 100); <Q3> = 72.3 %, sigma = 10.5 %. Labeled chains: 1spf, 1bct, 1stu, 3ifm, 1psm.]


How accurate are predictions today?

Q3 = 72-77 % ± 11 % (on average)

• I.e. 30 % of predicted assignments are wrong

• I.e. for 2/3 of all proteins, between 60% - 80% of residues are predicted correctly

• I.e. for your protein, accuracy can be lower than 60% or higher than 80%
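The "2/3 of all proteins" statement can be checked with a short calculation. The sketch assumes that per-protein Q3 is roughly normally distributed, which is an assumption made here and not something the slides claim, and uses the mean 72.3 % and sigma 10.5 % quoted with the accuracy histogram.

```python
from statistics import NormalDist

# Assumed normal model of per-protein accuracy.
q3_distribution = NormalDist(mu=72.3, sigma=10.5)

between_60_and_80 = q3_distribution.cdf(80) - q3_distribution.cdf(60)
below_60 = q3_distribution.cdf(60)
above_80 = 1 - q3_distribution.cdf(80)

print(f"60-80 %: {between_60_and_80:.2f}")
print(f"< 60 %:  {below_60:.2f}")
print(f"> 80 %:  {above_80:.2f}")
```

Under this model roughly one protein in three falls outside the 60 to 80 % band, which is why a single prediction should be read together with its reliability index.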

Secondary Structure Prediction

META-PredictProtein Server

• http://www.predictprotein.org

• Simultaneous submission tool to several other servers, e.g. JPRED, PHD, PROF, PSIPRED, SAM-T99, APSSP2, SSpro

• Includes also motif searches, domain assignments, TM predictions, etc.

1D-Structure prediction

Secondary Structure Prediction

Solvent Accessibility Prediction

Identify exposed residues, e.g. for mutation studies, epitopes, etc.

1D-Structure prediction

Projection onto strings of structural assignments

E.g. “Solvent Accessibility” (buried or exposed?)

A B C D E F G …
| | | | | | |
e e b b e e e …
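The projection onto a two-state string can be sketched as a simple threshold on relative solvent accessibility. The 16 % cutoff and the input values below are assumptions for illustration; different methods use different thresholds.

```python
def two_state_accessibility(relative_accessibility, threshold=16.0):
    """Map per-residue relative accessibility (%) to 'e' (exposed) or 'b' (buried).

    threshold: assumed cutoff in percent; residues at or above it count as exposed.
    """
    return "".join("e" if value >= threshold else "b"
                   for value in relative_accessibility)


if __name__ == "__main__":
    # Toy accessibility values for residues A..G, chosen to mimic the slide
    print(two_state_accessibility([40, 25, 5, 10, 30, 60, 22]))
```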

Accuracy of two-state prediction: 75% ± 10 %

PHDacc: solvent accessibility prediction

[http://www.predictprotein.org]

1D-Structure Prediction

Introduction

Secondary Structure Prediction

Solvent Accessibility Prediction

Disorder Prediction

Native Disorder in Proteins

Structural biology tenet: “The function of a protein is determined by its 3D structure”

However: disordered proteins or regions of proteins have no fixed secondary or tertiary structure under physiological conditions and/or in the absence of a binding partner/ligand:

Ensemble of structural states leading to dynamic flexibility

Non-globular structures that are extended in the solvent

[Figure: example structures 2hfv and 2hfq]

Experimental Detection of Disordered regions

A protein region is defined as disordered if it is devoid of stable secondary structure and adopts a large number of conformations. Experimental evidence comes from:

X-Ray crystallography: lack of electron density

NMR: dynamics of sizeable disordered regions

CD (Circular dichroism)

SAXS (Small-angle X-ray scattering)

Hydrodynamic measurements

Traditional biochemical studies: proteolytic susceptibility

DisProt: Database of protein disorder (www.disprot.org)

Role of Protein Disorder

Disordered regions participate in many biological processes:

Regulation of transcription and translation

Cellular signal transduction

Cell-cycle control

Regulation of the self-assembly of large multiprotein complexes (e.g. bacterial flagellum and the ribosome)

Role?

Form larger contact areas with other proteins

Flexibility allows binding of multiple ligands

Protein easily regulated by post-translational modifications (PTMs)

Relative instability of the intrinsically disordered proteins involved in transcription and signaling provides a further level of control through proteolytic degradation: concentration is easily regulated by protease digestion.

The continuum of protein structure

ACTR: interaction domain of activator (p160) for retinoid receptor

NCBD: nuclear-receptor co-activator domain of CBP

TFIIIA: 3 zinc fingers of transcription factor

eIF4E: translation-initiation factor

Thermodynamic consequences of coupled folding and binding

There is an entropic cost associated with the disorder-to-order transition that accompanies binding of an intrinsically unstructured protein to its target.

The key thermodynamic driving force for the binding reaction is a favorable enthalpic contribution: enthalpy-entropy compensation.

Coupled folding and binding gives rise to a complex with high specificity and relatively low affinity: appropriate for signal-transduction proteins.
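The compensation can be made explicit with the standard Gibbs relation. The subscript "bind" and the symbol $K_d$ for the dissociation constant are notation introduced here for illustration.

```latex
\Delta G_{\mathrm{bind}} = \Delta H_{\mathrm{bind}} - T\,\Delta S_{\mathrm{bind}},
\qquad
\Delta S_{\mathrm{bind}} < 0 \;\Rightarrow\; -T\,\Delta S_{\mathrm{bind}} > 0

\Delta G_{\mathrm{bind}} < 0
\;\Longleftrightarrow\;
|\Delta H_{\mathrm{bind}}| > T\,|\Delta S_{\mathrm{bind}}|
\quad (\Delta H_{\mathrm{bind}} < 0),
\qquad
K_d = \exp\!\left(\frac{\Delta G_{\mathrm{bind}}}{RT}\right)
```

Part of the favorable enthalpy is spent paying the entropic cost of folding, so $\Delta G_{\mathrm{bind}}$ is less negative and $K_d$ larger than for a pre-folded partner forming the same contacts. This is consistent with the combination of high specificity and relatively low affinity.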

Characteristics of Disorder regions

Clear patterns that characterize disordered regions:

Low sequence complexity (biased composition, overrepresentation of a few residues)

Amino acid compositional bias

• Low content of bulky hydrophobic amino acids (Val, Leu, Ile, Met, Phe, Trp and Tyr)

• High proportion of polar and charged amino acids (Gln, Ser, Pro, Glu, Lys and sometimes Gly and Ala)

High sequence variability (high flexibility)

Training of NN
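The sequence patterns listed above can be turned into simple numerical features of the kind a predictor could be trained on. The sketch below computes Shannon entropy (as a proxy for sequence complexity) and the fraction of bulky hydrophobic residues in a sliding window. The function names, the window width of 21, and the toy sequence are assumptions; real disorder predictors use richer inputs.

```python
import math
from collections import Counter

# Bulky hydrophobic residues as listed above: Val, Leu, Ile, Met, Phe, Trp, Tyr
BULKY_HYDROPHOBIC = set("VLIMFWY")


def shannon_entropy(segment):
    """Shannon entropy (bits) of the residue composition; low = low complexity."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def hydrophobic_fraction(segment):
    """Fraction of residues in the segment that are bulky hydrophobic."""
    return sum(aa in BULKY_HYDROPHOBIC for aa in segment) / len(segment)


def window_scan(sequence, width=21):
    """Return (start, entropy, hydrophobic fraction) for each full window."""
    rows = []
    for start in range(len(sequence) - width + 1):
        segment = sequence[start:start + width]
        rows.append((start, shannon_entropy(segment), hydrophobic_fraction(segment)))
    return rows


if __name__ == "__main__":
    toy = "MKVLAAGIVFW" + "SEPSQEKSPEQSSEEKPSQE" * 2  # toy sequence, not a real protein
    for start, entropy, hydro in window_scan(toy)[::10]:
        print(start, round(entropy, 2), round(hydro, 2))
```

Windows with low entropy and a low hydrophobic fraction are candidates for disorder under the criteria on this slide; on their own these two features are only a rough indicator.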

Role of Prediction of Disordered regions

The prediction of disordered regions would provide:

First step in the identification of functionally relevant disordered regions

• Design of laboratory experiments for the identification of binding sites within disordered regions. [1]

Identification of regions that hinder successful crystallization of the protein: a bottleneck in structural proteomics (high-throughput structure determination pipelines) [2]

[1] Longhi S. et al. (2003), J. Biol. Chem., 278, 18638

[2] Linding R. et al. (2003), Structure, 11, 1453

Programs & Servers

Obradovic & Dunker: PONDR, http://www.pondr.com/

Jones: Disopred2, http://bioinf.cs.ucl.ac.uk/disopred/
