Upload
dangnhu
View
220
Download
4
Embed Size (px)
Citation preview
1
Statistical Methods for Protein Structure Prediction
2
Outline
• Review statistical methods – KNN– Logistic regression
• Introduce neural networks
• Protein secondary structure prediction• Protein disorder prediction
3
Logistic Regression
Given: D = {(xi, yi), i = 1…n} – dataset of labeled examples
x ∈ Rk , where k is the number of features
y ∈ {0, 1}
Task: find a line in the space of features such that positives (y = 1) andnegatives (y = 0) are best separated
k = 2
4
Logistic Regression
← Form of solution
Solution →Prediction
↓
5
Problem with Linear Methods
← Linearly separable?
6
Extend Logistic Regression…
… to a neural network
7
Non-linear Decision Boundaries
speech recognition example
8
Problems in Bioinformatics?
Secondary Structure Prediction
9
Why Predict Protein Structure?Many protein sequences
Few 3-D structures
0
50000
100000
150000
200000
250000
300000
1984 1986 1988 1990 1992 1994 1996 1998 2000
Year
Num
ber
of e
ntri
es
Sequences (PIR)
Structures (PDB)
http://bioinf.cs.ucl.ac.uk
10
Why Predict Protein Structure?
• Experimental methods are expensive, time consuming and difficult to automate
• Predictive methods are easily automated, fast and cheap
• Can be used to improve alignment accuracy
• Can be used to detect domain boundaries within proteins with remote sequence homology
• Predicted structure gives clues about function
• Useful for mutagenesis studies
• Often the first step towards fold recognition
http://bioinf.cs.ucl.ac.uk
11
An Example
12
Protein StructurePrimary (Sequence)
Secondary (Helix/Strand/Coil)and lack of structure (disorder)
Quaternary (Complexes)Domain and Tertiary (Fold)
IVGGYTCAANSIPYQVSLNSGSHFCGGSLINSQWVVSAAHCYKSRIQVRLGEHNIDVLEGNEQFINAAKIITHPNFNGNTL...
http://bioinf.cs.ucl.ac.uk
13
Protein Sequence
Database Searching Domain AssignmentMultiple SequenceAlignment
Homologuein PDB
ComparativeModelling
SecondaryStructure
and Disorder
Prediction
No
Yes
3-D Protein Model
FoldRecognition
PredictedFold
Sequence-StructureAlignment
Ab-initioStructurePrediction
No
Yes
Overall Approach
modified from http://bioinf.cs.ucl.ac.uk
14
Secondary Structure Prediction
15
Protein Secondary Structure
STRAND
HELIX
COIL
http://bioinf.cs.ucl.ac.uk
16
1st Generation MethodsBased on single amino acid propertiesExamples:
• Chou & Fasman (1974)
• Lim (1974)
• Garnier, Osguthorpe & Robson (1978)
The Q3 accuracy gives the percentage of residuescorrectly predicted as Coil/Helix/Strand.
These methods had Q3 accuracies around 50-55%
http://bioinf.cs.ucl.ac.uk
17
Prediction Accuracy
Qindex: (Qhelix, Qstrand, Qcoil, Q3) - percentage of residues correctly predicted as α-
helix, β-strand, coil or for all 3 conformations.
Drawbacks:- even a random assignment of structure can
achieve a high score (Holley & Karplus 1991)
1003 ⋅=residuestotal
predictedcorrectly
NN
Q
http://bioinf.cs.ucl.ac.uk
18
Correlation coefficient
False negative
uα
True negative
nα
False positive
oα
True positive
pα
])][][[]([ αααααααα
αααα
α opuponunounpC ++++
= ⋅−⋅
Cα = 1 (=100%)
http://bioinf.cs.ucl.ac.uk
19
Statistical Methods
From PDB database, calculate the propensity for a given amino acid to adopt a certain structural type (H, S, C)
( | ) ( , )( ) ( ) ( )
i i i
i
P aa p aaPp p p aaαα αα α
= =
Example:#Ala=2,000, #residues=20,000, #helix=4,000, #Ala in helix=500P(α,aa) = 500/20,000, p(α) = 4,000/20,000, p(aa) = 2,000/20,000
P = 500 / (4,000/10) = 1.25
Used in Chou-Fasman algorithm (1974)http://bioinf.cs.ucl.ac.uk
20
http://bioinf.cs.ucl.ac.uk
21
Chou-Fasman: Initiation
T S P T A E L M R S T GP(H) 0.69 0.77 0.57 0.69 1.42 1.51 1.21 1.45 0.98 0.77 0.69 0.57
T S P T A E L M R S T GP(H) 0.69 0.77 0.57 0.69 1.42 1.51 1.21 1.45 0.98 0.77 0.69 0.57
Identify regions where 4/6 residues have a P(H) >1.00 “alpha-helix nucleus”
http://bioinf.cs.ucl.ac.uk
22
Chou-Fasman: Propagation
T S P T A E L M R S T GP(H) 0.69 0.77 0.57 0.69 1.42 1.51 1.21 1.45 0.98 0.77 0.69 0.57
Extend helix in both directions until a set of four residues have an average P(H) <1.00
http://bioinf.cs.ucl.ac.uk
23
Scan peptide for β-sheet regions
Identify regions where 3/5 have a P(E) >1.00 “β-sheet nucleus”
Extend b-sheet until 4 continuous residues have an average P(E) < 1.00
If region average > 1.05 and the average P(E) > average P(H) then “b-sheet”
T S P T A E L M R S T GP(H) 0.69 0.77 0.57 0.69 1.42 1.51 1.21 1.45 0.98 0.77 0.69 0.57P(E) 1.47 0.75 0.55 1.47 0.83 0.37 1.3 1.05 0.93 0.75 1.47 0.75
http://bioinf.cs.ucl.ac.uk
24
Chou-Fasman Prediction
25
Chou-Fasman Prediction
• Predict as α-helix segment with – E[Pα] > 1.03– E[Pα] > E[Pβ]– Not including proline
• Predict as β -strand segment with – E[Pβ] > 1.05– E[Pβ] > E[Pα]
• Others are labeled as turns.
(Various extensions appeared in the literature)http://bioinf.cs.ucl.ac.uk
26
Chou-Fasman Prediction
• To identify a bend at residue number j, calculate the following value p(t) = f(j) · f(j+1) · f(j+2) · f(j+3)
• where the f(j+1) value for the j+1 residue is used, the f(j+2) value for the j+2 residue is used and the f(j+3) value for the j+3 residue is used.
• If: (1) p(t) > 0.000075; (2) the average value for P(turn) > 1.00 in the tetrapeptide; and (3) the averages for the tetrapeptide obey the inequality P(a-helix) < P(turn) > P(b-sheet), then a beta-turn is predicted at that location.
27
• Achieved accuracy: around 52%• Shortcoming of this method: ignoring the context of the
sequence when predicting from amino-acid sequence
• We would like to use the sequence context as an input to a classifier
• There are many ways to address this.• The most successful to date are based on neural
networks
Chou-Fasman Prediction
http://bioinf.cs.ucl.ac.uk
28
The “Chameleon” sequence
TEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTEK
TEAVDAWTVEKAFKTFANDNGVDGAWTVEKAFKTFTVTEK
sequence 1 sequence 2
Replace both chameleon sequences with engineered peptide (“chameleon”)
Source: Minor and Kim 1996, Nature, 380, 730-734
α -helix β-strand
University of Wyoming
29
2nd Generation Methods
Based on peptide segments / residue pairsExamples:
• GOR III (1987)
• Neural Networks: Qian & Sejnowski (1988)among others
These methods had Q3 accuracies around 60-65%
http://bioinf.cs.ucl.ac.uk
30
Qian-Sejnowski Architecture
......
...
...
oαoβoo
HiddenInput Output
Si
Si-w
Si+w
31
3rd Generation Methods
Exploit evolutionary information. Based on conservation analysis of multiple sequence alignments.
• PHD (Q3 ~ 70%)Rost B, Sander, C. (1993) J. Mol. Biol. 232, 584-599.
• PSIPRED (Q3 ~ 77%)Jones, D. T. (1999) J. Mol. Biol. 292, 195-202.Arguably remains the top secondary structure prediction method(won all CASP competitions since 1998).
http://bioinf.cs.ucl.ac.uk
32
What Patterns ofConservation Are Used?
Given a multiple sequence alignment:
• Regions of low conservation COIL
• Regions of conservation– 1,4,5,8 pattern ALPHA HELIX– All hydrophobic BURIED BETA STRAND– Alternating residues SURFACE BETA STRAND– Glycine/Proline TURN
http://bioinf.cs.ucl.ac.uk
33
PSIPRED - Detecting Patterns..VQIVGGPYTCAANSI... Cascaded neural networks
structure resembles PHD
ΣCOIL
HELIX
STRAND
http://bioinf.cs.ucl.ac.uk
34
Profile
AA Sequence
Position-Specific Scoring Matrix
(PSSM)
-1 -1 -1 0 -1 -2 -2 0 -2 2 -1 1 -1 4 -2 -1 2 0 -2 -1 -1 1 -2 0 -2 -2 -2-2 -1 2 -3 1 -2 -3 -1 4 -1 -1 -1 5 -2 -3 -3 -1 -1 6 5 -3 -2 -3 -1 -3 -1 -3-3 0 0 -3 0 -2 -4 0 5 0 -2 0 -1 -2 -4 -2 -1 0 -1 -1 -3 -2 -4 0 -4 7 -4-4 4 -1 -4 0 -4 -4 -1 1 -1 -2 -1 -2 -2 -4 -2 -1 -1 -2 -2 -4 -3 -4 –1 -4 1 -4-2 -4 -4 -1 -3 -3 -1 -1 -3 -1 -3 -1 4 -1 -2 -3 -1 -1 -4 -3 -1 -1 -2 –2 -2 -3 -1-1 1 1 -3 6 -2 -3 -1 0 -1 2 -1 4 -1 -3 -2 0 -1 1 0 -3 -2 -3 –1 -2 0 -3-2 5 1 -3 2 -2 -3 -1 0 -1 -1 -1 0 -1 -4 -2 -1 -1 0 -1 -3 -2 -3 –1 … -3 0 -4-3 -2 -2 -4 -2 -4 -4 -2 -1 -1 -3 -1 -3 0 -4 -3 -1 -2 -3 -3 -4 -2 -4 4 -4 -1 -4-2 -1 -1 -4 0 2 -3 -2 0 -2 -2 -1 -1 -2 -4 -3 -1 -2 -1 -1 -4 3 -3 4 -3 1 -41 -4 -3 3 -3 -2 3 -1 -4 -2 1 -2 -3 -2 5 -3 0 -1 -3 -1 2 1 4 -3 2 -4 52 -4 -3 1 -2 -1 4 -1 -3 -2 -1 -3 -2 -2 2 -3 0 -2 -3 -2 2 1 3 -3 5 -4 2-2 0 5 -3 1 -2 -3 -1 1 -1 1 -1 1 -1 -3 -1 -1 -1 3 2 -3 -2 -3 1 -3 0 -36 -3 -2 1 -1 -1 2 -1 -2 -1 -1 -2 -1 -2 1 -3 3 -1 -2 -1 1 0 1 –2 2 -3 10 -4 -4 -1 -4 3 0 -2 -3 -3 -3 -3 -3 -3 0 -4 -1 -3 -3 -3 -1 -1 0 –3 … 0 -3 0-3 -1 -1 -3 -2 -3 -3 -1 -2 -1 5 -1 -2 3 -3 8 0 -1 -2 -2 -3 -2 -3 –2 -3 -2 -3-2 0 0 -2 0 -2 -3 2 0 3 -1 4 -1 1 -3 -1 0 2 -1 -1 -2 -1 -3 2 -3 1 -3-1 -1 -1 0 -1 -2 -1 5 -1 3 -1 3 -1 -1 -1 -1 1 5 -1 1 -1 -1 -1 –1 -1 0 -1-2 -4 -3 -3 -2 2 -2 -3 -4 -3 -3 -3 -3 -3 -3 -4 -1 -3 -3 -3 -3 -3 -2 –3 -2 -4 -3-1 -3 -2 -1 -2 8 -1 -2 -2 -2 -2 -2 -2 -2 -2 -3 -1 -2 -2 -2 -2 -1 -2 –2 -1 -2 -21 -3 -3 4 -3 -1 2 0 -3 -1 -1 -2 -3 -1 2 -3 0 -1 -3 0 4 3 1 –3 1 -3 3
M D K V Q Y L T N T P S R A I P A T R R V V L G … L N I
AR N D C Q E G H I L K M F P S T W Y V
Input WindowSize = Win
Current Position
35
Improvement over PHD is mainly due to using PSI-BLAST profiles
PSIPRED - Using PSI-BLAST
……………………………………………………
0.10.20.10.10.20.20.10.00.10.10.20.10.10.20.70.20.30.10.20.1
0.70.30.10.10.10.20.10.00.10.20.20.10.20.30.30.10.20.30.30.2
……………………………………………………
VYWTSPFMKLIHGEQCDNRA..VQIVGGPYTCAANSI...
1st Network315 Inputs75 Hidden Units3 Outputs
2nd Network60 Inputs60 Hidden Units3 Outputs
H/E/C3-stateprediction
http://bioinf.cs.ucl.ac.uk
36
PSIPRED – David Jones JMB 1999
37
38
39
PSIPRED Example Output
40
Measures of Secondary Structure Prediction Accuracy
• Q3 scores give the percentage of correctly predicted residues across 3 states (H,E,C)
• SOV scores (Segment OVerlap) give the percentage of correctly predicted SEGMENTS across 3 states
• SSEA scores (Secondary Structure Element Alignment) give a better idea of usefulness of secondary structure prediction for use in fold recognition
http://bioinf.cs.ucl.ac.uk
41
SOV• SOV scores (Segment OVerlap) give the percentage of correctly
predicted SEGMENTS across 3 states
Zemla et al. Proteins 34, 1999
42
SOV• SOV scores (Segment OVerlap) give the percentage of correctly
predicted SEGMENTS across 3 states
Zemla et al. Proteins 34, 1999
43
SOV
Zemla et al. Proteins 34, 1999
s1 – observed segment; s2 – predicted segment; Sα - number of all segment pairs (s1, s2) with at least 1 α-residue in common; minOV is the overlap between s1and s2; maxOV is the length in α-state of union of positions of s1 and s2; Nα - the total number of residues in α-state.
44Method
Acc
urac
y S c
o re
http://bioinf.cs.ucl.ac.uk
45
Secondary Structure PredictionSummary
1st Generation - 1970s• Q3 = 50-55%• Chou & Fausman, GOR
2nd Generation -1980s• Q3 = 60-65%• Qian & Sejnowski, GORIII
3rd Generation - 1990s• Q3 = 70-80%• PHD variants, PSIPRED, GOR V
4th Generation - 2000s?• Upper limit 88%?• Higher accuracy linked to database size?• Is this problem now solved?
http://bioinf.cs.ucl.ac.uk
46
Bob MacCallum
47
Bob MacCallum
48
State of the Art
• Both PHD and Nearest neighbor get about 72%-74% accuracy– Both predicted well in CASP2 (1996)
• PSI-PRED slightly better (around 76%)• Recent trend: combining classification methods
– Best predictions in CASP3 (1998)
• Failures:– Long term effects: S-S bonds, parallel strands– Chemical patterns – Wrong prediction at the ends of helices/strands
http://bioinf.cs.ucl.ac.uk
49
Decision Tree
H / ~H
E / CYes
Yes
No
No
H E C
E / ~E
C/ HYes
Yes
No
No
E C H
C / ~C
H / EYes
Yes
No
No
C H E
http://bioinf.cs.ucl.ac.uk
50
Prediction of Protein Disorder
51
Prediction of Protein Disorder • Relatively new field
• Implications for fold recognition
• Functionally important
• First prediction methods by Romero et al. (1997)
• Many other predictors (about 15 servers in 2006)
http://bioinf.cs.ucl.ac.uk
52
Dataset construction
disordered region ordered region
Dataset consists of:152 disordered proteins (~22,000 residues) – from literature and database search
290 ordered proteins (~67,000 residues) – from the Protein Data Bank
53
Disorder predictor
protein sequence
Post-Processing
Base Predictor
Attribute Construction
Predicted Disordered Regions
PSI-BLASTNRDB
• 79% on long disordered regions and 91% on ordered regions ⇒ overall prediction accuracy is 85%
profiles
20 profile attributes
54
Data representation
W C Y L A M A H Q F AA A G K L K T S A L S C T
class: (0/1)(disordered/ordered)
Input Window(size = Win)
Sequence
Calculate over window:20 CompositionsK2 entropy
14Å Contact NumberHydropathyFlexibility
Coordination NumberBulkinessCFYWVolumeNet Charge
55
Prediction of Disorder
sn = 76%
sp = 91%
accuracy > 80%
56
Comparisons of methods
OLS: Ordinary Least Squares Regression
LR: Logistic Regression
NN: Neural Networks
57
VSL model - Background
58
VSL model
59
VSL model
60
VSL model