Protein Secondary Structures
Assignment and prediction
Pernille Haste Andersen
17.05.2006
Outline
• What is protein secondary structure
• How can it be used?
• Different prediction methods
– Alignment to homologues
– Propensity methods
– Neural networks
• Evaluation of prediction methods
• Links to prediction servers
Secondary Structure Elements
β-strand
Helix
Turn
Bend
Use of secondary structure
• Classification of protein structures
• Definition of loops (active sites)
• Use in fold recognition methods
• Improvements of alignments
• Definition of domain boundaries
Classification of secondary structure
• Defining features
– Dihedral angles
– Hydrogen bonds
– Geometry
• Assigned manually by crystallographers or
• Automatic
– DSSP (Kabsch & Sander, 1983)
– STRIDE (Frishman & Argos, 1995)
– DSSPcont (Andersen et al., 2002)
Dihedral Angles
phi - dihedral angle of the N-Calpha bond
psi - dihedral angle of the Calpha-C bond
omega - dihedral angle of the C-N (peptide) bond
From http://www.imb-jena.de
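A dihedral angle can be computed from the coordinates of four consecutive backbone atoms. The sketch below is a minimal illustration, not part of the original slides: the function name `dihedral` and the helper names are made up here, and the atom selections in the comments (for example C(i-1), N(i), Calpha(i), C(i) for phi) are the usual textbook definitions.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees around the p1-p2 bond (hypothetical helper).

    Usual atom choices (assumed, not taken from the slides):
      phi:   C(i-1), N(i),      Calpha(i), C(i)
      psi:   N(i),   Calpha(i), C(i),      N(i+1)
      omega: Calpha(i), C(i),   N(i+1),    Calpha(i+1)
    """
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)          # unit vector along the central bond
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))

# Toy coordinates: a trans arrangement gives 180 degrees, as for omega.
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
```

With real data the four points would be atom coordinates read from a PDB file; the toy points here only check the geometry.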
Helices        phi(deg)   psi(deg)   H-bond pattern
---------------------------------------------------
alpha-helix     -57.8      -47.0     i+4
pi-helix        -57.1      -69.7     i+5
3-10 helix      -74.0       -4.0     i+3
(omega = 180 deg)
From http://www.imb-jena.de
Beta Strands   phi(deg)   psi(deg)   omega(deg)
-----------------------------------------------
beta strand     -120       120        180
From http://broccoli.mfn.ki.se/pps_course_96/
Antiparallel
Parallel
Secondary Structure Type Descriptions
* H = alpha helix
* G = 3-10 helix
* I = 5 helix (pi helix)
* E = extended strand, participates in beta ladder
* B = residue in isolated beta-bridge
* T = hydrogen bonded turn
* S = bend
* C = coil
Automatic assignment programs
• DSSP ( http://www.cmbi.kun.nl/gv/dssp/ )
• STRIDE ( http://www.hgmp.mrc.ac.uk/Registered/Option/stride.html )
• DSSPcont ( http://cubic.bioc.columbia.edu/services/DSSPcont/ )
• The Protein Data Bank visualizes DSSP assignments on the structures in the database
# RESIDUE AA STRUCTURE BP1 BP2 ACC N-H-->O O-->H-N N-H-->O O-->H-N TCO KAPPA ALPHA PHI PSI X-CA Y-CA Z-CA
1 4 A E 0 0 205 0, 0.0 2,-0.3 0, 0.0 0, 0.0 0.000 360.0 360.0 360.0 113.5 5.7 42.2 25.1
2 5 A H - 0 0 127 2, 0.0 2,-0.4 21, 0.0 21, 0.0 -0.987 360.0-152.8-149.1 154.0 9.4 41.3 24.7
3 6 A V - 0 0 66 -2,-0.3 21,-2.6 2, 0.0 2,-0.5 -0.995 4.6-170.2-134.3 126.3 11.5 38.4 23.5
4 7 A I E -A 23 0A 106 -2,-0.4 2,-0.4 19,-0.2 19,-0.2 -0.976 13.9-170.8-114.8 126.6 15.0 37.6 24.5
5 8 A I E -A 22 0A 74 17,-2.8 17,-2.8 -2,-0.5 2,-0.9 -0.972 20.8-158.4-125.4 129.1 16.6 34.9 22.4
6 9 A Q E -A 21 0A 86 -2,-0.4 2,-0.4 15,-0.2 15,-0.2 -0.910 29.5-170.4 -98.9 106.4 19.9 33.0 23.0
7 10 A A E +A 20 0A 18 13,-2.5 13,-2.5 -2,-0.9 2,-0.3 -0.852 11.5 172.8-108.1 141.7 20.7 31.8 19.5
8 11 A E E +A 19 0A 63 -2,-0.4 2,-0.3 11,-0.2 11,-0.2 -0.933 4.4 175.4-139.1 156.9 23.4 29.4 18.4
9 12 A F E -A 18 0A 31 9,-1.5 9,-1.8 -2,-0.3 2,-0.4 -0.967 13.3-160.9-160.6 151.3 24.4 27.6 15.3
10 13 A Y E -A 17 0A 36 -2,-0.3 2,-0.4 7,-0.2 7,-0.2 -0.994 16.5-156.0-136.8 132.1 27.2 25.3 14.1
11 14 A L E >> -A 16 0A 24 5,-3.2 4,-1.7 -2,-0.4 5,-1.3 -0.929 11.7-122.6-120.0 133.5 28.0 24.8 10.4
12 15 A N T 45S+ 0 0 54 -2,-0.4 -2, 0.0 2,-0.2 0, 0.0 -0.884 84.3 9.0-113.8 150.9 29.7 22.0 8.6
13 16 A P T 45S+ 0 0 114 0, 0.0 -1,-0.2 0, 0.0 -2, 0.0 -0.963 125.4 60.5 -86.5 8.5 32.0 21.6 6.8
14 17 A D T 45S- 0 0 66 2,-0.1 -2,-0.2 1,-0.1 3,-0.1 0.752 89.3-146.2 -64.6 -23.0 33.0 25.2 7.6
Secondary Structure Prediction
• What to predict?
– All 8 types or pool types into groups
[Figure: the DSSP types pooled into the three Q3 classes]
H: H = alpha helix, G = 3-10 helix, I = 5 helix (pi helix)
E: E = extended strand, B = beta-bridge
C: T = hydrogen bonded turn, S = bend, C = coil
Straight HEC
Secondary Structure Prediction
• What to predict?
– All 8 types or pool types into groups
[Figure: alternative pooling into the three Q3 classes]
H: H = alpha helix
E: E = extended strand
C: T = hydrogen bonded turn, S = bend, C = coil, G = 3-10 helix, I = 5 helix (pi helix), B = beta-bridge
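The two ways of pooling the 8 DSSP types into H, E and C shown on these slides can be written as lookup tables. This is a small sketch; the names `DSSP_POOLED`, `STRICT_HEC` and `reduce_states` are made up for illustration, and treating any unlisted symbol (such as the blank DSSP uses for unassigned residues) as coil is an assumption.

```python
# First pooling: helix-like and strand-like types are merged.
DSSP_POOLED = {
    "H": "H", "G": "H", "I": "H",   # alpha, 3-10 and pi helix -> H
    "E": "E", "B": "E",             # extended strand, beta-bridge -> E
    "T": "C", "S": "C", "C": "C",   # turn, bend, coil -> C
}

# Second pooling: only H and E keep their class, everything else becomes C.
STRICT_HEC = {"H": "H", "E": "E"}

def reduce_states(dssp_string, scheme="pooled"):
    """Map an 8-state DSSP string to 3 states (hypothetical helper).

    Symbols missing from the table are treated as coil (assumption).
    """
    if scheme == "pooled":
        table = DSSP_POOLED
    elif scheme == "strict":
        table = STRICT_HEC
    else:
        raise ValueError("scheme must be 'pooled' or 'strict'")
    return "".join(table.get(state, "C") for state in dssp_string)

pooled = reduce_states("HGIEBTSC", "pooled")   # "HHHEECCC"
strict = reduce_states("HGIEBTSC", "strict")   # "HCCECCCC"
```

The choice of pooling changes the reference assignment, so accuracies are only comparable between methods evaluated with the same scheme.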
Secondary Structure Prediction
• Simple alignments
– Align to a close homolog for which the structure has been experimentally solved.
• Heuristic methods (e.g., Chou-Fasman, 1974)
– Apply scores for each amino acid and sum up over a window.
• Neural networks
– Raw sequence (late 80's)
– BLOSUM matrix (e.g., PHD, early 90's)
– Position specific alignment profiles (e.g., PSIPRED, late 90's)
– Multiple networks balloting, probability conversion, output expansion (Petersen et al., 2000).
Improvement of accuracy
1974  Chou & Fasman       ~50-53%
1978  Garnier             63%
1987  Zvelebil            66%
1988  Qian & Sejnowski    64.3%
1993  Rost & Sander       70.8-72.0%
1997  Frishman & Argos    <75%
1999  Cuff & Barton       72.9%
1999  Jones               76.5%
2000  Petersen et al.     77.9%
Simple Alignments
• A solved structure of a homolog to the query is needed
• Homologous proteins have ~88% identical (3 state) secondary structure
• If no close homolog can be identified, alignments will give almost random results
Propensities: Amino acid preferences in α-Helix
Propensities: Amino acid preferences in β-Strand
Propensities: Amino acid preferences in coil
Chou-Fasman propensities
Name  P(a)  P(b)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Ala   142    83     66     0.06   0.076   0.035   0.058
Arg    98    93     95     0.070  0.106   0.099   0.085
Asp   101    54    146     0.147  0.110   0.179   0.081
Asn    67    89    156     0.161  0.083   0.191   0.091
Cys    70   119    119     0.149  0.050   0.117   0.128
Glu   151    37     74     0.056  0.060   0.077   0.064
Gln   111   110     98     0.074  0.098   0.037   0.098
Gly    57    75    156     0.102  0.085   0.190   0.152
His   100    87     95     0.140  0.047   0.093   0.054
Ile   108   160     47     0.043  0.034   0.013   0.056
Leu   121   130     59     0.061  0.025   0.036   0.070
Lys   114    74    101     0.055  0.115   0.072   0.095
Met   145   105     60     0.068  0.082   0.014   0.055
Phe   113   138     60     0.059  0.041   0.065   0.065
Pro    57    55    152     0.102  0.301   0.034   0.068
Ser    77    75    143     0.120  0.139   0.125   0.106
Thr    83   119     96     0.086  0.108   0.065   0.079
Trp   108   137     96     0.077  0.013   0.064   0.167
Tyr    69   147    114     0.082  0.065   0.114   0.125
Val   106   170     50     0.062  0.048   0.028   0.053
Chou-Fasman
• Generally applicable
• Works for sequences with no solved homologs
• But the accuracy is low!
• The problem is that the method does not use enough information about the structural context of a residue
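The idea of scoring each amino acid and combining the scores over a window can be sketched as follows. This is a deliberately simplified illustration and not the full Chou-Fasman procedure (which has nucleation and extension rules): it averages P(a), P(b) and P(turn) from the table over a window and picks the largest. The function name `predict_by_propensity`, the window size of 5, and the use of P(turn) as a stand-in for coil are all assumptions.

```python
# (P(a), P(b), P(turn)) per residue, copied from the Chou-Fasman table.
PROPENSITY = {
    "A": (142, 83, 66),   "R": (98, 93, 95),    "D": (101, 54, 146),
    "N": (67, 89, 156),   "C": (70, 119, 119),  "E": (151, 37, 74),
    "Q": (111, 110, 98),  "G": (57, 75, 156),   "H": (100, 87, 95),
    "I": (108, 160, 47),  "L": (121, 130, 59),  "K": (114, 74, 101),
    "M": (145, 105, 60),  "F": (113, 138, 60),  "P": (57, 55, 152),
    "S": (77, 75, 143),   "T": (83, 119, 96),   "W": (108, 137, 96),
    "Y": (69, 147, 114),  "V": (106, 170, 50),
}

def predict_by_propensity(sequence, window=5):
    """Assign H, E or C per residue from window-averaged propensities.

    Simplified sketch: the window is truncated at the sequence ends and
    P(turn) is used as the coil score (both are assumptions).
    """
    half = window // 2
    states = []
    for i in range(len(sequence)):
        residues = sequence[max(0, i - half): i + half + 1]
        means = [sum(PROPENSITY[r][k] for r in residues) / len(residues)
                 for k in range(3)]
        best = max(range(3), key=lambda k: means[k])
        states.append("HEC"[best])
    return "".join(states)

prediction = predict_by_propensity("IKEEHVIIQAEFYLNPDQSGEF")
```

Because each residue only sees the average composition of its immediate neighbours, a scheme like this has no access to the wider structural context, which is the limitation named on the slide.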
Neural Networks
• Benefits
– Generally applicable
– Can capture higher order correlations
– Inputs other than sequence information
• Drawbacks
– Needs a large amount of data (different solved structures). However, today nearly 2500 structures with low sequence identity/high resolution are solved
– Complex method with several pitfalls
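Architecture
[Figure: a window (IKEEHVIIQAE) slides along the sequence IKEEHVIIQAEFYLNPDQSGEF…; the window is presented to the input layer, which is connected by weights to the hidden layer and then to the output layer with the three outputs H, E and C]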
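The flow from input layer through hidden layer to the three outputs can be sketched as a single forward pass. Nothing below is taken from a specific published network: the sigmoid hidden units, the softmax output, the absence of bias terms, the random weights and all names (`forward`, `random_weights`) are assumptions chosen to keep the example short.

```python
import math
import random

def forward(x, w_hidden, w_output):
    """One forward pass: input vector -> hidden layer -> H/E/C scores."""
    # Hidden layer: weighted sum of the inputs through a sigmoid (assumed).
    hidden = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, x))))
              for row in w_hidden]
    # Output layer: weighted sum of hidden activities, then softmax (assumed).
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in w_output]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def random_weights(n_rows, n_cols, rng):
    """Untrained weights, only so that the example runs."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_cols)]
            for _ in range(n_rows)]

rng = random.Random(0)
n_inputs, n_hidden = 11 * 20, 50      # 11-residue window, 20 inputs per residue
w_hidden = random_weights(n_hidden, n_inputs, rng)
w_output = random_weights(3, n_hidden, rng)

x = [0.0] * n_inputs
x[6] = 1.0                            # toy input with a single active neuron
probs = forward(x, w_hidden, w_output)
prediction = "HEC"[probs.index(max(probs))]
```

In a real predictor the weights are fitted to solved structures by training; with random weights the output is meaningless and only shows the data flow.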
Sparse encoding
Inp Neuron 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
AAcid
A 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
R 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
N 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
D 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
C 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Q 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
E 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
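Input Layer
[Figure: the window IKEEHVIIQAE presented to the input layer in sparse encoding; each residue is a column of 0s with a single 1, e.g. a 1 at input neuron 7 for E]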
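Sparse encoding of a sequence window can be sketched in a few lines. The residue order follows the sparse encoding table and the BLOSUM 62 column order (A R N D C Q E …); the function names, the default window of 11 and the all-zero padding for positions outside the sequence are assumptions.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"   # input neuron 1 = A, 2 = R, ..., 20 = V

def one_hot(residue):
    """20 inputs, all 0 except a single 1 at the residue's position."""
    return [1 if residue == aa else 0 for aa in AMINO_ACIDS]

def encode_window(sequence, center, window=11):
    """Concatenate the sparse vectors of the residues around `center`.

    Positions that fall outside the sequence are filled with zeros
    (an assumption; real implementations often add an extra input for this).
    """
    half = window // 2
    vector = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(sequence):
            vector.extend(one_hot(sequence[pos]))
        else:
            vector.extend([0] * len(AMINO_ACIDS))
    return vector

seq = "IKEEHVIIQAEFYLNPDQSGEF"
inputs = encode_window(seq, center=5)   # window IKEEHVIIQAE, centered on V
```

Each residue switches on exactly one input, so the network receives no information about which amino acids are chemically similar; that is what the BLOSUM and profile encodings add.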
BLOSUM 62
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
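Input Layer
[Figure: the window IKEEHVIIQAE presented to the input layer with each residue encoded by its BLOSUM 62 row, e.g. -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 for E]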
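Secondary networks (Structure-to-Structure)
[Figure: a window slides along the sequence IKEEHVIIQAEFYLNPDQSGEF…; the H, E, C outputs of the first network for the window positions (HEC HEC HEC …) form the input layer of a second network, connected by weights to a hidden layer and an output layer with the three outputs H, E and C]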
PHD method (Rost and Sander)
• Combine neural networks with sequence profiles
– 6-8 percentage points increase in prediction accuracy over standard neural networks
• Use second layer "Structure to structure" network to filter predictions
• Jury of predictors
• Set up as mail server
PSI-Pred (Jones)
• Use alignments from iterative sequence searches (PSI-Blast) as input to a neural network
• Better predictions due to better sequence profiles
• Available as stand alone program and via the web
Position specific scoring matrices (PSI-BLAST profiles)
      A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
 1 I -2 -4 -5 -5 -2 -4 -4 -5 -5  6  0 -4  0 -2 -4 -4 -2 -4 -3  4
 2 K -1 -1 -2 -2 -3 -1  3 -3 -2 -2 -3  4 -2 -4 -3  1  1 -4 -3  2
 3 E  5 -3 -3 -3 -3  3  1 -2 -3 -3 -3 -2 -2 -4 -3 -1 -2 -4 -3  1
 4 E -4 -3  2  5 -6  1  5 -4 -3 -6 -6 -2 -5 -6 -4 -2 -3 -6 -5 -5
 5 H -4  2  1  1 -5  1 -2 -4  9 -5 -2 -3 -4 -4 -5 -3 -4 -5  1 -5
 6 V -3  0 -4 -5 -4 -4 -2 -3 -5  1 -2  1  0  1 -4 -3  3 -5 -3  5
 7 I  0 -2 -4  1 -4 -2 -4 -4 -5  1  0 -2  0  2 -5  1 -1 -5 -3  4
 8 I -3  0 -5 -5 -4 -2 -5 -6  1  2  4 -4 -1  0 -5 -2  0 -3  5 -1
 9 Q -2 -3 -2 -3 -5  4 -1  3  5 -5 -3 -3 -4 -2 -4  2 -1 -4  2 -2
10 A  2 -4 -4 -3  2 -3 -1 -4 -2  1 -1 -4 -3 -4  1  2  3 -5 -1  1
11 E -1  3  1  1 -1  0  1 -4 -3 -1 -3  0  3 -5  4 -1 -3 -6 -3 -1
12 F -3 -5 -5 -5 -4 -4 -4 -1 -1  1  1 -5  2  5 -1 -4 -4 -3  5  2
13 Y  3 -5 -5 -6  3 -4 -5 -2 -1  0 -4 -5 -3  3 -5 -2 -2 -2  7  1
14 L -1 -3 -4 -2  1  5  1 -1 -1 -1  1 -3 -3  1 -5 -1 -1 -2  3 -2
15 N -1 -4  4  1  5 -3 -4  2 -4 -4 -4 -3 -2 -4 -5  2  0 -5  0  0
16 P -2  4 -4 -4 -5  0 -3  3  2 -5 -4  0 -4 -3  0  1 -2 -1  5 -3
17 D -3 -2  1  5 -6 -2  2  2 -1 -2 -2 -3 -5 -4 -5 -1  2 -6 -3 -4
Several different architectures
• Sequence-to-structure
– Window sizes 15, 17, 19 and 21
– Hidden units 50 and 75
– 10-fold cross validation => 80 predictions
• Structure-to-structure
– Window size 17
– Hidden units 40
– 10-fold cross validation => 800 predictions
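The counts of 80 and 800 follow from multiplying out the options. The enumeration below is one reading of the slide, assuming that each sequence-to-structure network is followed by ten cross-validated structure-to-structure networks; the variable names are made up.

```python
from itertools import product

window_sizes = (15, 17, 19, 21)
hidden_units = (50, 75)
folds = range(10)                      # 10-fold cross validation

# 4 window sizes x 2 hidden layer sizes x 10 folds = 80 first-level networks.
sequence_to_structure = list(product(window_sizes, hidden_units, folds))

# Assumed: every first-level network feeds 10 second-level networks
# (window 17, 40 hidden units, one per fold), giving 80 x 10 = 800.
structure_to_structure = list(product(sequence_to_structure, folds))

n_first = len(sequence_to_structure)    # 80
n_second = len(structure_to_structure)  # 800
```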
Output:
C C H H C C C
Output:
C C C C C C C
• Combining predictions from several networks improves the prediction
• Combinations of 800 different networks were used in the method described by Petersen TN et al. 2000, Prediction of protein secondary structure at 80% accuracy. Proteins 41: 17-20
The majority rules
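A plain majority vote over the outputs of several networks can be sketched as below. This illustrates the voting idea only; the published method may combine networks differently (for example by averaging probabilities), and the function name `majority_vote` and the third example prediction are made up.

```python
from collections import Counter

def majority_vote(predictions):
    """Per-residue majority vote over equally long H/E/C strings.

    Ties go to the state that appears first among the predictions at
    that position (a consequence of Counter ordering, not a rule from
    the slides).
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have the same length")
    consensus = []
    for column in zip(*predictions):
        consensus.append(Counter(column).most_common(1)[0][0])
    return "".join(consensus)

# The two outputs shown on the slide plus one hypothetical third network.
consensus = majority_vote(["CCHHCCC", "CCCCCCC", "CCCHCCC"])
```

With these three inputs only the fourth residue keeps its helix assignment, because it is the only position where two of the three networks agree on H.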
Activities to probabilities
[Table: coil probabilities (calculated), looked up from the helix activity (output) and the strand activity (output), each in steps of 0.05 from 0.05 to 1.0; visible entries include 0.99, 0.9, 0.83 and 0.75]
Coil conversion
Benchmarking secondary structure predictions
• EVA
– Newly solved structures are sent to prediction servers.
– Every week
http://cubic.bioc.columbia.edu/eva/sec/res_sec.html
EVA results (Rost et al., 2001)
• PROFphd 77.0%
• PSIPRED 76.8%
• SAM-T99sec 76.1%
• SSpro 76.0%
• Jpred2 75.5%
• PHD 71.7%
– Cubic.columbia.edu/eva
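Assuming the percentages above are three-state per-residue accuracies (Q3), the measure is simply the share of residues whose predicted class matches the observed one. A minimal sketch, with a made-up function name and made-up example strings:

```python
def q3(predicted, observed):
    """Percentage of residues with the correct H/E/C class."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("empty input")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

# Hypothetical example: 7 of 8 residues agree.
score = q3("HHHEECCC", "HHHEECCH")   # 87.5
```

Q3 depends on how the observed 8-state assignment was pooled into three classes, so scores are only comparable when the same pooling is used for every method.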
Links to servers
• Several links: http://cubic.bioc.columbia.edu/eva/doc/explain_methods.html#type_sec
• ProfPHD http://www.predictprotein.org/
• PSIPRED http://bioinf.cs.ucl.ac.uk/psipred/
• JPred http://www.compbio.dundee.ac.uk/~www-jpred/
Practical Conclusions
• If you need a secondary structure prediction, use one of the newer methods based on advanced machine learning, such as:
– ProfPHD
– PSIPRED
– JPred
• And not one of the older ones, such as:
– Chou-Fasman
– Garnier