Protein Secondary Structures Assignment and prediction Pernille Haste Andersen 17.05.2006


Outline

• What is protein secondary structure?

• How can it be used?

• Different prediction methods
– Alignment to homologues
– Propensity methods
– Neural networks

• Evaluation of prediction methods

• Links to prediction servers

Secondary Structure Elements

β-strand

Helix

Turn

Bend

Use of secondary structure

• Classification of protein structures

• Definition of loops (active sites)

• Use in fold recognition methods

• Improvements of alignments

• Definition of domain boundaries

Classification of secondary structure

• Defining features
– Dihedral angles
– Hydrogen bonds
– Geometry

• Assigned manually by crystallographers or
• Automatic
– DSSP (Kabsch & Sander, 1983)
– STRIDE (Frishman & Argos, 1995)
– DSSPcont (Andersen et al., 2002)

Dihedral Angles

phi - dihedral angle of the N-Calpha bond
psi - dihedral angle of the Calpha-C bond
omega - dihedral angle of the C-N (peptide) bond

From http://www.imb-jena.de
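The dihedral angles defined above can be computed from four consecutive backbone atom positions. The following is a minimal sketch, not the code of any assignment program; the function name `dihedral` and the tuple-based coordinates are assumptions made for illustration.

```python
import math

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) around the p1-p2 bond.

    Atom quadruples for the backbone angles of residue i:
      phi:   C(i-1), N(i),  CA(i),  C(i)
      psi:   N(i),   CA(i), C(i),   N(i+1)
      omega: CA(i),  C(i),  N(i+1), CA(i+1)
    """
    def sub(a, b):
        return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

    def dot(a, b):
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    b0 = sub(p1, p0)          # first bond vector
    b1 = sub(p2, p1)          # central bond (rotation axis)
    b2 = sub(p3, p2)          # last bond vector
    n1 = cross(b0, b1)        # normal of the plane p0, p1, p2
    n2 = cross(b1, b2)        # normal of the plane p1, p2, p3
    x = dot(n1, n2)
    y = dot(cross(n1, n2), b1) / math.sqrt(dot(b1, b1))
    return math.degrees(math.atan2(y, x))
```

With the four points in one plane, a cis arrangement gives 0 degrees and a trans arrangement gives 180 degrees, which matches omega = 180 deg for the usual trans peptide bond.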

Helices        phi(deg)   psi(deg)   H-bond pattern
---------------------------------------------------
alpha-helix     -57.8      -47.0     i+4
pi-helix        -57.1      -69.7     i+5
3₁₀ helix       -74.0       -4.0     i+3

(omega = 180 deg)
From http://www.imb-jena.de

Beta Strands   phi(deg)   psi(deg)   omega(deg)
-----------------------------------------------
beta strand     -120        120        180

From http://broccoli.mfn.ki.se/pps_course_96/

Antiparallel

Parallel


Secondary Structure Type Descriptions

* H = alpha helix
* G = 3₁₀-helix
* I = 5-helix (pi helix)
* E = extended strand, participates in beta ladder
* B = residue in isolated beta-bridge
* T = hydrogen bonded turn
* S = bend
* C = coil

Automatic assignment programs

• DSSP ( http://www.cmbi.kun.nl/gv/dssp/ )
• STRIDE ( http://www.hgmp.mrc.ac.uk/Registered/Option/stride.html )
• DSSPcont ( http://cubic.bioc.columbia.edu/services/DSSPcont/ )

• The Protein Data Bank visualizes DSSP assignments on the structures in the database

# RESIDUE AA STRUCTURE BP1 BP2 ACC N-H-->O O-->H-N N-H-->O O-->H-N TCO KAPPA ALPHA PHI PSI X-CA Y-CA Z-CA
1 4 A E 0 0 205 0, 0.0 2,-0.3 0, 0.0 0, 0.0 0.000 360.0 360.0 360.0 113.5 5.7 42.2 25.1
2 5 A H - 0 0 127 2, 0.0 2,-0.4 21, 0.0 21, 0.0 -0.987 360.0-152.8-149.1 154.0 9.4 41.3 24.7
3 6 A V - 0 0 66 -2,-0.3 21,-2.6 2, 0.0 2,-0.5 -0.995 4.6-170.2-134.3 126.3 11.5 38.4 23.5
4 7 A I E -A 23 0A 106 -2,-0.4 2,-0.4 19,-0.2 19,-0.2 -0.976 13.9-170.8-114.8 126.6 15.0 37.6 24.5
5 8 A I E -A 22 0A 74 17,-2.8 17,-2.8 -2,-0.5 2,-0.9 -0.972 20.8-158.4-125.4 129.1 16.6 34.9 22.4
6 9 A Q E -A 21 0A 86 -2,-0.4 2,-0.4 15,-0.2 15,-0.2 -0.910 29.5-170.4 -98.9 106.4 19.9 33.0 23.0
7 10 A A E +A 20 0A 18 13,-2.5 13,-2.5 -2,-0.9 2,-0.3 -0.852 11.5 172.8-108.1 141.7 20.7 31.8 19.5
8 11 A E E +A 19 0A 63 -2,-0.4 2,-0.3 11,-0.2 11,-0.2 -0.933 4.4 175.4-139.1 156.9 23.4 29.4 18.4
9 12 A F E -A 18 0A 31 9,-1.5 9,-1.8 -2,-0.3 2,-0.4 -0.967 13.3-160.9-160.6 151.3 24.4 27.6 15.3
10 13 A Y E -A 17 0A 36 -2,-0.3 2,-0.4 7,-0.2 7,-0.2 -0.994 16.5-156.0-136.8 132.1 27.2 25.3 14.1
11 14 A L E >> -A 16 0A 24 5,-3.2 4,-1.7 -2,-0.4 5,-1.3 -0.929 11.7-122.6-120.0 133.5 28.0 24.8 10.4
12 15 A N T 45S+ 0 0 54 -2,-0.4 -2, 0.0 2,-0.2 0, 0.0 -0.884 84.3 9.0-113.8 150.9 29.7 22.0 8.6
13 16 A P T 45S+ 0 0 114 0, 0.0 -1,-0.2 0, 0.0 -2, 0.0 -0.963 125.4 60.5 -86.5 8.5 32.0 21.6 6.8
14 17 A D T 45S- 0 0 66 2,-0.1 -2,-0.2 1,-0.1 3,-0.1 0.752 89.3-146.2 -64.6 -23.0 33.0 25.2 7.6

Secondary Structure Prediction

• What to predict?
– All 8 types or pool types into groups

DSSP states pooled into three classes (H, E, C; accuracy reported as Q3):

H <- H = alpha helix, G = 3₁₀-helix, I = 5-helix (pi helix)
E <- E = extended strand, B = beta-bridge
C <- T = hydrogen bonded turn, S = bend, C = coil

Straight HEC


• What to predict?
– All 8 types or pool types into groups

Three classes (H, E, C; accuracy reported as Q3):

H <- H = alpha helix
E <- E = extended strand
C <- T = hydrogen bonded turn, S = bend, C = coil, G = 3₁₀-helix, I = 5-helix (pi helix), B = beta-bridge
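The two ways of pooling the eight DSSP states into H, E and C can be written as simple lookup tables. This is a minimal sketch: the names `DSSP_POOLED`, `STRAIGHT_HEC` and `reduce_states` are made up for illustration, the scheme name "straight" assumes that the "Straight HEC" label refers to the stricter pooling, and mapping any unlisted symbol to coil is also an assumption.

```python
# Pooling where the helix-like states and the strand-like states are merged.
DSSP_POOLED = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "C", "S": "C", "C": "C",
}

# Stricter pooling: only H and E keep their class, everything else is coil.
STRAIGHT_HEC = {"H": "H", "E": "E"}

def reduce_states(ss8, scheme="pooled"):
    """Reduce an 8-state assignment string to three states (H, E, C)."""
    table = DSSP_POOLED if scheme == "pooled" else STRAIGHT_HEC
    # Symbols missing from the table are treated as coil (assumption).
    return "".join(table.get(state, "C") for state in ss8)
```

For example, `reduce_states("HGIEBTSC")` pools the helix and strand variants, while `reduce_states("HGIEBTSC", scheme="straight")` turns G, I and B into coil. The choice matters when comparing methods, because reported three-state accuracies depend on which pooling was used.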


• Simple alignments
– Align to a close homolog for which the structure has been experimentally solved.

• Heuristic Methods (e.g., Chou-Fasman, 1974)
– Apply scores for each amino acid and sum up over a window.

• Neural Networks
– Raw Sequence (late 80’s)
– Blosum matrix (e.g., PHD, early 90’s)
– Position specific alignment profiles (e.g., PsiPred, late 90’s)
– Multiple networks balloting, probability conversion, output expansion (Petersen et al., 2000).

Improvement of accuracy

Year   Authors             Accuracy
1974   Chou & Fasman       ~50-53%
1978   Garnier             63%
1987   Zvelebil            66%
1988   Qian & Sejnowski    64.3%
1993   Rost & Sander       70.8-72.0%
1997   Frishman & Argos    <75%
1999   Cuff & Barton       72.9%
1999   Jones               76.5%
2000   Petersen et al.     77.9%
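Assuming the percentages above are three-state per-residue accuracies (Q3), the measure is the share of residues whose predicted class (H, E or C) equals the observed class. A minimal sketch, with the function name `q3` chosen for illustration:

```python
def q3(predicted, observed):
    """Percentage of residues with the correct three-state class."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("empty assignment")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)
```

For instance, a prediction that gets 5 of 7 residues right scores 5/7, about 71.4%.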

Simple Alignments

• A solved structure of a homolog of the query is needed
• Homologous proteins have ~88% identical (3 state) secondary structure
• If no close homologue can be identified, alignments will give almost random results

Propensities: Amino acid preferences in α-Helix

Propensities: Amino acid preferences in β-Strand

Propensities: Amino acid preferences in coil

Chou-Fasman propensities

Name  P(a)  P(b)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Ala   142    83     66     0.06   0.076   0.035   0.058
Arg    98    93     95     0.070  0.106   0.099   0.085
Asp   101    54    146     0.147  0.110   0.179   0.081
Asn    67    89    156     0.161  0.083   0.191   0.091
Cys    70   119    119     0.149  0.050   0.117   0.128
Glu   151    37     74     0.056  0.060   0.077   0.064
Gln   111   110     98     0.074  0.098   0.037   0.098
Gly    57    75    156     0.102  0.085   0.190   0.152
His   100    87     95     0.140  0.047   0.093   0.054
Ile   108   160     47     0.043  0.034   0.013   0.056
Leu   121   130     59     0.061  0.025   0.036   0.070
Lys   114    74    101     0.055  0.115   0.072   0.095
Met   145   105     60     0.068  0.082   0.014   0.055
Phe   113   138     60     0.059  0.041   0.065   0.065
Pro    57    55    152     0.102  0.301   0.034   0.068
Ser    77    75    143     0.120  0.139   0.125   0.106
Thr    83   119     96     0.086  0.108   0.065   0.079
Trp   108   137     96     0.077  0.013   0.064   0.167
Tyr    69   147    114     0.082  0.065   0.114   0.125
Val   106   170     50     0.062  0.048   0.028   0.053
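The idea of applying a score to each amino acid and combining the scores over a window can be sketched with the P(a), P(b) and P(turn) columns of the table. This is a deliberately simplified illustration and not the full Chou-Fasman procedure, which also has nucleation and extension rules and uses the f(i) turn frequencies. The function name, the window size of 5 and the rule "pick the class with the highest window average" are all assumptions.

```python
# (P(a), P(b), P(turn)) per residue, copied from the table above.
PROPENSITY = {
    "A": (142, 83, 66),   "R": (98, 93, 95),    "D": (101, 54, 146),
    "N": (67, 89, 156),   "C": (70, 119, 119),  "E": (151, 37, 74),
    "Q": (111, 110, 98),  "G": (57, 75, 156),   "H": (100, 87, 95),
    "I": (108, 160, 47),  "L": (121, 130, 59),  "K": (114, 74, 101),
    "M": (145, 105, 60),  "F": (113, 138, 60),  "P": (57, 55, 152),
    "S": (77, 75, 143),   "T": (83, 119, 96),   "W": (108, 137, 96),
    "Y": (69, 147, 114),  "V": (106, 170, 50),
}

def window_propensity_prediction(seq, window=5):
    """Label each residue H, E or T by the highest window-averaged propensity."""
    half = window // 2
    labels = "HET"  # helix, strand, turn (same order as the tuples)
    result = []
    for i in range(len(seq)):
        # The window is truncated at the sequence ends.
        residues = seq[max(0, i - half):i + half + 1]
        means = [sum(PROPENSITY[r][k] for r in residues) / len(residues)
                 for k in range(3)]
        result.append(labels[means.index(max(means))])
    return "".join(result)
```

A run of glutamate comes out as helix and a run of valine as strand, which simply reflects their table values. Each call sees only a few neighbouring residues, which is the lack of structural context named as the method's main weakness.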

Chou-Fasman

• Generally applicable

• Works for sequences with no solved homologs

• But the accuracy is low!

• The problem is that the method does not use enough information about the structural context of a residue

Neural Networks

• Benefits
– Generally applicable
– Can capture higher order correlations
– Inputs other than sequence information

• Drawbacks
– Need a large amount of data (different solved structures). However, today nearly 2500 structures with low sequence identity and high resolution have been solved
– Complex method with several pitfalls

Architecture

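[Figure: feed-forward network. A window (IKEEHVI IQAE) slides along the sequence IKEEHVIIQAEFYLNPDQSGEF….. and feeds the input layer, which is connected through weights to a hidden layer and an output layer with the classes H, E and C.]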
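The architecture above (input layer, weights, hidden layer, output layer with H, E and C) can be sketched as a small feed-forward pass. This is only an illustration under stated assumptions: the weights are random and untrained, sigmoid units are assumed for both layers, and the names `make_layer`, `layer_forward` and `predict` are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def make_layer(n_in, n_out, rng):
    """One weight row per unit: n_in weights followed by one bias."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)]
            for _ in range(n_out)]

def layer_forward(weights, inputs):
    """Activity of each unit: sigmoid(bias + weighted sum of the inputs)."""
    return [sigmoid(row[-1] + sum(w * x for w, x in zip(row, inputs)))
            for row in weights]

def predict(inputs, hidden_weights, output_weights):
    """Return the three output activities keyed by class."""
    hidden = layer_forward(hidden_weights, inputs)
    output = layer_forward(output_weights, hidden)
    return dict(zip("HEC", output))

# Example: window of 11 residues, 20 inputs per residue, 10 hidden units.
rng = random.Random(0)
hidden_w = make_layer(11 * 20, 10, rng)
output_w = make_layer(10, 3, rng)
activities = predict([0.0] * (11 * 20), hidden_w, output_w)
```

In a trained network the weights would be fitted to solved structures, and the class with the highest activity would be the prediction for the residue at the centre of the window.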

Sparse encoding

Inp Neuron  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
AAcid
A           1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
R           0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
N           0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
D           0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
C           0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
Q           0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0
E           0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0

[Figure: input layer. Each residue in the window IKEEHVI IQAE is presented to the network as its sparse vector of 0s with a single 1.]
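The sparse encoding table can be turned into network input as follows. A minimal sketch: the amino acid order follows the columns of the table (A, R, N, D, C, Q, E, ...), while the function names and the choice to use all-zero vectors for window positions beyond the sequence ends are assumptions.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def sparse_encode(residue):
    """20 input values: a single 1 at the position of the amino acid."""
    vector = [0] * len(AMINO_ACIDS)
    index = AMINO_ACIDS.find(residue)
    if index >= 0:          # unknown symbols stay all-zero (assumption)
        vector[index] = 1
    return vector

def encode_window(seq, center, window=11):
    """Concatenate the sparse vectors of all residues in the window."""
    half = window // 2
    values = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(seq):
            values.extend(sparse_encode(seq[pos]))
        else:
            values.extend([0] * len(AMINO_ACIDS))  # padding outside the sequence
    return values
```

A window of 11 residues gives 11 x 20 = 220 input values, of which at most 11 are 1.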

BLOSUM 62

   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4

[Figure: input layer. Each residue in the window IKEEHVI IQAE is presented to the network as its row of BLOSUM 62 scores.]
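With BLOSUM encoding, each residue is presented as its row of substitution scores rather than as a sparse vector. The sketch below copies only the rows needed for the example window IKEEHVIIQAE (first 20 columns of the BLOSUM 62 table above); the names `BLOSUM62_ROWS` and `blosum_encode` are hypothetical, and using the raw scores without rescaling is an assumption.

```python
# Rows of BLOSUM 62 for the residues in the example window,
# columns in the order A R N D C Q E G H I L K M F P S T W Y V.
BLOSUM62_ROWS = {
    "A": " 4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0",
    "Q": "-1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2",
    "E": "-1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2",
    "H": "-2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3",
    "I": "-1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3",
    "K": "-1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2",
    "V": " 0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4",
}
SCORES = {aa: [int(v) for v in row.split()] for aa, row in BLOSUM62_ROWS.items()}

def blosum_encode(window):
    """Concatenate the 20 BLOSUM 62 scores of every residue in the window."""
    values = []
    for residue in window:
        values.extend(SCORES[residue])  # KeyError for residues not copied above
    return values

encoded = blosum_encode("IKEEHVIIQAE")
```

The input has the same size as with sparse encoding (20 values per residue), but similar amino acids now get similar input vectors.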

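Secondary networks (Structure-to-Structure)

[Figure: second network. A window slides over the H, E, C outputs (HECHECHEC) produced by the first network for the sequence IKEEHVIIQAEFYLNPDQSGEF…..; these outputs feed the input layer, which is connected through weights to a hidden layer and an output layer with the classes H, E and C.]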

PHD method (Rost and Sander)

• Combine neural networks with sequence profiles
– 6-8 percentage points increase in prediction accuracy over standard neural networks

• Use second layer “Structure to structure” network to filter predictions

• Jury of predictors

• Set up as mail server

PSI-Pred (Jones)

• Use alignments from iterative sequence searches (PSI-Blast) as input to a neural network

• Better predictions due to better sequence profiles

• Available as stand alone program and via the web

Position specific scoring matrices (PSI-BLAST profiles)

      A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
 1 I -2 -4 -5 -5 -2 -4 -4 -5 -5  6  0 -4  0 -2 -4 -4 -2 -4 -3  4
 2 K -1 -1 -2 -2 -3 -1  3 -3 -2 -2 -3  4 -2 -4 -3  1  1 -4 -3  2
 3 E  5 -3 -3 -3 -3  3  1 -2 -3 -3 -3 -2 -2 -4 -3 -1 -2 -4 -3  1
 4 E -4 -3  2  5 -6  1  5 -4 -3 -6 -6 -2 -5 -6 -4 -2 -3 -6 -5 -5
 5 H -4  2  1  1 -5  1 -2 -4  9 -5 -2 -3 -4 -4 -5 -3 -4 -5  1 -5
 6 V -3  0 -4 -5 -4 -4 -2 -3 -5  1 -2  1  0  1 -4 -3  3 -5 -3  5
 7 I  0 -2 -4  1 -4 -2 -4 -4 -5  1  0 -2  0  2 -5  1 -1 -5 -3  4
 8 I -3  0 -5 -5 -4 -2 -5 -6  1  2  4 -4 -1  0 -5 -2  0 -3  5 -1
 9 Q -2 -3 -2 -3 -5  4 -1  3  5 -5 -3 -3 -4 -2 -4  2 -1 -4  2 -2
10 A  2 -4 -4 -3  2 -3 -1 -4 -2  1 -1 -4 -3 -4  1  2  3 -5 -1  1
11 E -1  3  1  1 -1  0  1 -4 -3 -1 -3  0  3 -5  4 -1 -3 -6 -3 -1
12 F -3 -5 -5 -5 -4 -4 -4 -1 -1  1  1 -5  2  5 -1 -4 -4 -3  5  2
13 Y  3 -5 -5 -6  3 -4 -5 -2 -1  0 -4 -5 -3  3 -5 -2 -2 -2  7  1
14 L -1 -3 -4 -2  1  5  1 -1 -1 -1  1 -3 -3  1 -5 -1 -1 -2  3 -2
15 N -1 -4  4  1  5 -3 -4  2 -4 -4 -4 -3 -2 -4 -5  2  0 -5  0  0
16 P -2  4 -4 -4 -5  0 -3  3  2 -5 -4  0 -4 -3  0  1 -2 -1  5 -3
17 D -3 -2  1  5 -6 -2  2  2 -1 -2 -2 -3 -5 -4 -5 -1  2 -6 -3 -4

Several different architectures

• Sequence-to-structure
– Window sizes 15, 17, 19 and 21
– Hidden units 50 and 75
– 10-fold cross validation => 80 predictions

• Structure-to-structure
– Window size 17
– Hidden units 40
– 10-fold cross validation => 800 predictions
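The counts of 80 and 800 predictions follow from multiplying out the options. The sketch below reproduces the arithmetic; it assumes that every one of the 80 sequence-to-structure outputs is passed through each of the 10 cross-validated structure-to-structure networks, which is one reading of "=> 800" and is not spelled out above.

```python
from itertools import product

window_sizes = (15, 17, 19, 21)
hidden_units = (50, 75)
folds = range(10)  # 10-fold cross validation

# 4 window sizes x 2 hidden layer sizes x 10 folds
seq_to_struct = list(product(window_sizes, hidden_units, folds))

# Each first-level network combined with each of 10 second-level
# networks (window 17, 40 hidden units); this pairing is an assumption.
struct_to_struct = [(first, fold) for first in seq_to_struct for fold in folds]

print(len(seq_to_struct), len(struct_to_struct))
```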

Output:

C C H H C C C

Output:

C C C C C C C

• Combining predictions from several networks improves the prediction

• Combinations of 800 different networks were used in the method described by Petersen TN et al. 2000, Prediction of protein secondary structure at 80% accuracy. Proteins 41:17-20

The majority rules
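Letting the majority rule can be sketched as a per-residue vote over the predictions of several networks. This is a simplified illustration and not necessarily the balloting scheme of the published method; the function name `majority_vote` is hypothetical, and ties go to the class that appears first among the predictions for that residue.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine equally long H/E/C strings by per-residue majority."""
    if not predictions:
        raise ValueError("no predictions given")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")
    # zip(*predictions) yields, for each residue, the votes of all networks.
    return "".join(Counter(votes).most_common(1)[0][0]
                   for votes in zip(*predictions))
```

For example, if two networks predict C C H H C C C and a third predicts C C C C C C C, the combined prediction keeps the two helix residues.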

Activities to probabilities

[Table: coil probabilities (calculated) looked up from helix activities (output) and strand activities (output), each binned in steps of 0.05 from 0.05 to 1.0; legible cell values include 0.99, 0.9, 0.83 and 0.75.]

Coil conversion

Benchmarking secondary structure predictions

• EVA
– Newly solved structures are sent to prediction servers.
– Every week

http://cubic.bioc.columbia.edu/eva/sec/res_sec.html

EVA results (Rost et al., 2001)

• PROFphd 77.0%

• PSIPRED 76.8%

• SAM-T99sec 76.1%

• SSpro 76.0%

• Jpred2 75.5%

• PHD 71.7%

– Cubic.columbia.edu/eva

Links to servers

• Several links: http://cubic.bioc.columbia.edu/eva/doc/explain_methods.html#type_sec

• ProfPHD http://www.predictprotein.org/

• PSIPRED http://bioinf.cs.ucl.ac.uk/psipred/

• JPred http://www.compbio.dundee.ac.uk/~www-jpred/

Practical Conclusions

• If you need a secondary structure prediction, use one of the newer methods based on advanced machine learning, such as:
– ProfPHD
– PSIPRED
– JPred

• And not one of the older ones, such as:
– Chou-Fasman
– Garnier
