Statistical Methods for Protein Structure Predictionpredrag/classes/2008springi619/week13.pdf• Protein secondary structure prediction ... Statistical Methods ... M D K V Q Y L T

1

Statistical Methods for Protein Structure Prediction

2

Outline

• Review statistical methods – KNN– Logistic regression

• Introduce neural networks

• Protein secondary structure prediction• Protein disorder prediction

3

Logistic Regression

Given: D = {(xi, yi), i = 1…n} – dataset of labeled examples

x ∈ Rk , where k is the number of features

y ∈ {0, 1}

Task: find a line in the space of features such that positives (y = 1) andnegatives (y = 0) are best separated

k = 2

4

Logistic Regression

← Form of solution

Solution →Prediction

↓

5

Problem with Linear Methods

← Linearly separable?

6

Extend Logistic Regression…

… to a neural network

7

Non-linear Decision Boundaries

speech recognition example

8

Problems in Bioinformatics?

Secondary Structure Prediction

9

Why Predict Protein Structure?Many protein sequences

Few 3-D structures

0

50000

100000

150000

200000

250000

300000

1984 1986 1988 1990 1992 1994 1996 1998 2000

Year

Num

ber

of e

ntri

es

Sequences (PIR)

Structures (PDB)

http://bioinf.cs.ucl.ac.uk

10

Why Predict Protein Structure?

• Experimental methods are expensive, time consuming and difficult to automate

• Predictive methods are easily automated, fast and cheap

• Can be used to improve alignment accuracy

• Can be used to detect domain boundaries within proteins with remote sequence homology

• Predicted structure gives clues about function

• Useful for mutagenesis studies

• Often the first step towards fold recognition


11

An Example

12

Protein StructurePrimary (Sequence)

Secondary (Helix/Strand/Coil)and lack of structure (disorder)

Quaternary (Complexes)Domain and Tertiary (Fold)

IVGGYTCAANSIPYQVSLNSGSHFCGGSLINSQWVVSAAHCYKSRIQVRLGEHNIDVLEGNEQFINAAKIITHPNFNGNTL...


13

Protein Sequence

Database Searching Domain AssignmentMultiple SequenceAlignment

Homologuein PDB

ComparativeModelling

SecondaryStructure

and Disorder

Prediction

No

Yes

3-D Protein Model

FoldRecognition

PredictedFold

Sequence-StructureAlignment

Ab-initioStructurePrediction

No

Yes

Overall Approach

modified from http://bioinf.cs.ucl.ac.uk

14

Secondary Structure Prediction

15

Protein Secondary Structure

STRAND

HELIX

COIL


16

1st Generation MethodsBased on single amino acid propertiesExamples:

• Chou & Fasman (1974)

• Lim (1974)

• Garnier, Osguthorpe & Robson (1978)

The Q3 accuracy gives the percentage of residuescorrectly predicted as Coil/Helix/Strand.

These methods had Q3 accuracies around 50-55%


17

Prediction Accuracy

Qindex: (Qhelix, Qstrand, Qcoil, Q3) - percentage of residues correctly predicted as α-

helix, β-strand, coil or for all 3 conformations.

Drawbacks:- even a random assignment of structure can

achieve a high score (Holley & Karplus 1991)

1003 ⋅=residuestotal

predictedcorrectly

NN

Q


18

Correlation coefficient

False negative

uα

True negative

nα

False positive

oα

True positive

pα

])][][[]([ αααααααα

αααα

α opuponunounpC ++++

= ⋅−⋅

Cα = 1 (=100%)


19

Statistical Methods

From PDB database, calculate the propensity for a given amino acid to adopt a certain structural type (H, S, C)

( | ) ( , )( ) ( ) ( )

i i i

i

P aa p aaPp p p aaαα αα α

= =

Example:#Ala=2,000, #residues=20,000, #helix=4,000, #Ala in helix=500P(α,aa) = 500/20,000, p(α) = 4,000/20,000, p(aa) = 2,000/20,000

P = 500 / (4,000/10) = 1.25

Used in Chou-Fasman algorithm (1974)http://bioinf.cs.ucl.ac.uk

20


21

Chou-Fasman: Initiation

T S P T A E L M R S T GP(H) 0.69 0.77 0.57 0.69 1.42 1.51 1.21 1.45 0.98 0.77 0.69 0.57

T S P T A E L M R S T GP(H) 0.69 0.77 0.57 0.69 1.42 1.51 1.21 1.45 0.98 0.77 0.69 0.57

Identify regions where 4/6 residues have a P(H) >1.00 “alpha-helix nucleus”


22

Chou-Fasman: Propagation

T S P T A E L M R S T GP(H) 0.69 0.77 0.57 0.69 1.42 1.51 1.21 1.45 0.98 0.77 0.69 0.57

Extend helix in both directions until a set of four residues have an average P(H) <1.00


23

Scan peptide for β-sheet regions

Identify regions where 3/5 have a P(E) >1.00 “β-sheet nucleus”

Extend b-sheet until 4 continuous residues have an average P(E) < 1.00

If region average > 1.05 and the average P(E) > average P(H) then “b-sheet”

T S P T A E L M R S T GP(H) 0.69 0.77 0.57 0.69 1.42 1.51 1.21 1.45 0.98 0.77 0.69 0.57P(E) 1.47 0.75 0.55 1.47 0.83 0.37 1.3 1.05 0.93 0.75 1.47 0.75


24

Chou-Fasman Prediction

25


• Predict as α-helix segment with – E[Pα] > 1.03– E[Pα] > E[Pβ]– Not including proline

• Predict as β -strand segment with – E[Pβ] > 1.05– E[Pβ] > E[Pα]

• Others are labeled as turns.

(Various extensions appeared in the literature)http://bioinf.cs.ucl.ac.uk

26


• To identify a bend at residue number j, calculate the following value p(t) = f(j) · f(j+1) · f(j+2) · f(j+3)

• where the f(j+1) value for the j+1 residue is used, the f(j+2) value for the j+2 residue is used and the f(j+3) value for the j+3 residue is used.

• If: (1) p(t) > 0.000075; (2) the average value for P(turn) > 1.00 in the tetrapeptide; and (3) the averages for the tetrapeptide obey the inequality P(a-helix) < P(turn) > P(b-sheet), then a beta-turn is predicted at that location.

27

• Achieved accuracy: around 52%• Shortcoming of this method: ignoring the context of the

sequence when predicting from amino-acid sequence

• We would like to use the sequence context as an input to a classifier

• There are many ways to address this.• The most successful to date are based on neural

networks



28

The “Chameleon” sequence

TEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTEK

TEAVDAWTVEKAFKTFANDNGVDGAWTVEKAFKTFTVTEK

sequence 1 sequence 2

Replace both chameleon sequences with engineered peptide (“chameleon”)

Source: Minor and Kim 1996, Nature, 380, 730-734

α -helix β-strand

University of Wyoming

29

2nd Generation Methods

Based on peptide segments / residue pairsExamples:

• GOR III (1987)

• Neural Networks: Qian & Sejnowski (1988)among others

These methods had Q3 accuracies around 60-65%


30

Qian-Sejnowski Architecture

......

...

...

oαoβoo

HiddenInput Output

Si

Si-w

Si+w

31

3rd Generation Methods

Exploit evolutionary information. Based on conservation analysis of multiple sequence alignments.

• PHD (Q3 ~ 70%)Rost B, Sander, C. (1993) J. Mol. Biol. 232, 584-599.

• PSIPRED (Q3 ~ 77%)Jones, D. T. (1999) J. Mol. Biol. 292, 195-202.Arguably remains the top secondary structure prediction method(won all CASP competitions since 1998).


32

What Patterns ofConservation Are Used?

Given a multiple sequence alignment:

• Regions of low conservation COIL

• Regions of conservation– 1,4,5,8 pattern ALPHA HELIX– All hydrophobic BURIED BETA STRAND– Alternating residues SURFACE BETA STRAND– Glycine/Proline TURN


33

PSIPRED - Detecting Patterns..VQIVGGPYTCAANSI... Cascaded neural networks

structure resembles PHD

ΣCOIL

HELIX

STRAND


34

Profile

AA Sequence

Position-Specific Scoring Matrix

(PSSM)

-1 -1 -1 0 -1 -2 -2 0 -2 2 -1 1 -1 4 -2 -1 2 0 -2 -1 -1 1 -2 0 -2 -2 -2-2 -1 2 -3 1 -2 -3 -1 4 -1 -1 -1 5 -2 -3 -3 -1 -1 6 5 -3 -2 -3 -1 -3 -1 -3-3 0 0 -3 0 -2 -4 0 5 0 -2 0 -1 -2 -4 -2 -1 0 -1 -1 -3 -2 -4 0 -4 7 -4-4 4 -1 -4 0 -4 -4 -1 1 -1 -2 -1 -2 -2 -4 -2 -1 -1 -2 -2 -4 -3 -4 –1 -4 1 -4-2 -4 -4 -1 -3 -3 -1 -1 -3 -1 -3 -1 4 -1 -2 -3 -1 -1 -4 -3 -1 -1 -2 –2 -2 -3 -1-1 1 1 -3 6 -2 -3 -1 0 -1 2 -1 4 -1 -3 -2 0 -1 1 0 -3 -2 -3 –1 -2 0 -3-2 5 1 -3 2 -2 -3 -1 0 -1 -1 -1 0 -1 -4 -2 -1 -1 0 -1 -3 -2 -3 –1 … -3 0 -4-3 -2 -2 -4 -2 -4 -4 -2 -1 -1 -3 -1 -3 0 -4 -3 -1 -2 -3 -3 -4 -2 -4 4 -4 -1 -4-2 -1 -1 -4 0 2 -3 -2 0 -2 -2 -1 -1 -2 -4 -3 -1 -2 -1 -1 -4 3 -3 4 -3 1 -41 -4 -3 3 -3 -2 3 -1 -4 -2 1 -2 -3 -2 5 -3 0 -1 -3 -1 2 1 4 -3 2 -4 52 -4 -3 1 -2 -1 4 -1 -3 -2 -1 -3 -2 -2 2 -3 0 -2 -3 -2 2 1 3 -3 5 -4 2-2 0 5 -3 1 -2 -3 -1 1 -1 1 -1 1 -1 -3 -1 -1 -1 3 2 -3 -2 -3 1 -3 0 -36 -3 -2 1 -1 -1 2 -1 -2 -1 -1 -2 -1 -2 1 -3 3 -1 -2 -1 1 0 1 –2 2 -3 10 -4 -4 -1 -4 3 0 -2 -3 -3 -3 -3 -3 -3 0 -4 -1 -3 -3 -3 -1 -1 0 –3 … 0 -3 0-3 -1 -1 -3 -2 -3 -3 -1 -2 -1 5 -1 -2 3 -3 8 0 -1 -2 -2 -3 -2 -3 –2 -3 -2 -3-2 0 0 -2 0 -2 -3 2 0 3 -1 4 -1 1 -3 -1 0 2 -1 -1 -2 -1 -3 2 -3 1 -3-1 -1 -1 0 -1 -2 -1 5 -1 3 -1 3 -1 -1 -1 -1 1 5 -1 1 -1 -1 -1 –1 -1 0 -1-2 -4 -3 -3 -2 2 -2 -3 -4 -3 -3 -3 -3 -3 -3 -4 -1 -3 -3 -3 -3 -3 -2 –3 -2 -4 -3-1 -3 -2 -1 -2 8 -1 -2 -2 -2 -2 -2 -2 -2 -2 -3 -1 -2 -2 -2 -2 -1 -2 –2 -1 -2 -21 -3 -3 4 -3 -1 2 0 -3 -1 -1 -2 -3 -1 2 -3 0 -1 -3 0 4 3 1 –3 1 -3 3

M D K V Q Y L T N T P S R A I P A T R R V V L G … L N I

AR N D C Q E G H I L K M F P S T W Y V

Input WindowSize = Win

Current Position

35

Improvement over PHD is mainly due to using PSI-BLAST profiles

PSIPRED - Using PSI-BLAST

……………………………………………………

0.10.20.10.10.20.20.10.00.10.10.20.10.10.20.70.20.30.10.20.1

0.70.30.10.10.10.20.10.00.10.20.20.10.20.30.30.10.20.30.30.2

……………………………………………………

VYWTSPFMKLIHGEQCDNRA..VQIVGGPYTCAANSI...

1st Network315 Inputs75 Hidden Units3 Outputs

2nd Network60 Inputs60 Hidden Units3 Outputs

H/E/C3-stateprediction


36

PSIPRED – David Jones JMB 1999

37

38

39

PSIPRED Example Output

40

Measures of Secondary Structure Prediction Accuracy

• Q3 scores give the percentage of correctly predicted residues across 3 states (H,E,C)

• SOV scores (Segment OVerlap) give the percentage of correctly predicted SEGMENTS across 3 states

• SSEA scores (Secondary Structure Element Alignment) give a better idea of usefulness of secondary structure prediction for use in fold recognition


41

SOV• SOV scores (Segment OVerlap) give the percentage of correctly

predicted SEGMENTS across 3 states

Zemla et al. Proteins 34, 1999

42

SOV• SOV scores (Segment OVerlap) give the percentage of correctly

predicted SEGMENTS across 3 states


43

SOV


s1 – observed segment; s2 – predicted segment; Sα - number of all segment pairs (s1, s2) with at least 1 α-residue in common; minOV is the overlap between s1and s2; maxOV is the length in α-state of union of positions of s1 and s2; Nα - the total number of residues in α-state.

44Method

Acc

urac

y S c

o re


45

Secondary Structure PredictionSummary

1st Generation - 1970s• Q3 = 50-55%• Chou & Fausman, GOR

2nd Generation -1980s• Q3 = 60-65%• Qian & Sejnowski, GORIII

3rd Generation - 1990s• Q3 = 70-80%• PHD variants, PSIPRED, GOR V

4th Generation - 2000s?• Upper limit 88%?• Higher accuracy linked to database size?• Is this problem now solved?


46

Bob MacCallum

47

Bob MacCallum

48

State of the Art

• Both PHD and Nearest neighbor get about 72%-74% accuracy– Both predicted well in CASP2 (1996)

• PSI-PRED slightly better (around 76%)• Recent trend: combining classification methods

– Best predictions in CASP3 (1998)

• Failures:– Long term effects: S-S bonds, parallel strands– Chemical patterns – Wrong prediction at the ends of helices/strands


49

Decision Tree

H / ~H

E / CYes

Yes

No

No

H E C

E / ~E

C/ HYes

Yes

No

No

E C H

C / ~C

H / EYes

Yes

No

No

C H E


50

Prediction of Protein Disorder

51

Prediction of Protein Disorder • Relatively new field

• Implications for fold recognition

• Functionally important

• First prediction methods by Romero et al. (1997)

• Many other predictors (about 15 servers in 2006)


52

Dataset construction

disordered region ordered region

Dataset consists of:152 disordered proteins (~22,000 residues) – from literature and database search

290 ordered proteins (~67,000 residues) – from the Protein Data Bank

53

Disorder predictor

protein sequence

Post-Processing

Base Predictor

Attribute Construction

Predicted Disordered Regions

PSI-BLASTNRDB

• 79% on long disordered regions and 91% on ordered regions ⇒ overall prediction accuracy is 85%

profiles

20 profile attributes

54

Data representation

W C Y L A M A H Q F AA A G K L K T S A L S C T

class: (0/1)(disordered/ordered)

Input Window(size = Win)

Sequence

Calculate over window:20 CompositionsK2 entropy

14Å Contact NumberHydropathyFlexibility

Coordination NumberBulkinessCFYWVolumeNet Charge

55

Prediction of Disorder

sn = 76%

sp = 91%

accuracy > 80%

56

Comparisons of methods

OLS: Ordinary Least Squares Regression

LR: Logistic Regression

NN: Neural Networks

57

VSL model - Background

58

VSL model

59

VSL model

60

VSL model

Documents

Statistical Methods for Protein Structure Predictionpredrag/classes/2008springi619/week13.pdf• Protein secondary structure prediction ... Statistical Methods ... M D K V Q Y L T