
Statistical Methods for Protein Structure Predictionpredrag/classes/2008springi619/week13.pdf• Protein secondary structure prediction ... Statistical Methods ... M D K V Q Y L T


Statistical Methods for Protein Structure Prediction


Outline

• Review statistical methods
  – KNN
  – Logistic regression

• Introduce neural networks

• Protein secondary structure prediction

• Protein disorder prediction


Logistic Regression

Given: D = {(x_i, y_i), i = 1…n} – dataset of labeled examples

x ∈ R^k, where k is the number of features

y ∈ {0, 1}

Task: find a line in the space of features (a hyperplane when k > 2) such that positives (y = 1) and negatives (y = 0) are best separated

k = 2
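
The task above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the toy dataset, the learning rate, the number of epochs and the helper names (`sigmoid`, `predict_proba`, `fit_logistic`) are hypothetical and do not come from the slides, and batch gradient descent is just one of several ways to fit the model.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Estimated probability that y = 1 for feature vector x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_logistic(data, lr=0.5, epochs=2000):
    """Fit weights w and bias b by batch gradient descent on the log-loss."""
    k = len(data[0][0])
    w, b = [0.0] * k, 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for x, y in data:
            err = predict_proba(w, b, x) - y  # gradient of log-loss w.r.t. the score
            for j in range(k):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Hypothetical toy dataset with k = 2 features (not from the slides).
D = [((0.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 0.0), 0),
     ((2.0, 2.0), 1), ((2.0, 3.0), 1), ((3.0, 2.0), 1)]

w, b = fit_logistic(D)
# The decision boundary is the line w[0]*x1 + w[1]*x2 + b = 0.
predictions = [int(predict_proba(w, b, x) >= 0.5) for x, _ in D]
print(w, b, predictions)
```

Thresholding the predicted probability at 0.5 is the same as asking on which side of the fitted line a point falls.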


Logistic Regression

← Form of solution

Solution → Prediction


Problem with Linear Methods

← Linearly separable?


Extend Logistic Regression…

… to a neural network


Non-linear Decision Boundaries

speech recognition example


Problems in Bioinformatics?

Secondary Structure Prediction


Why Predict Protein Structure?

Many protein sequences

Few 3-D structures

[Figure: number of entries per year, 1984 to 2000 (y-axis 0 to 300,000), for Sequences (PIR) and Structures (PDB)]

http://bioinf.cs.ucl.ac.uk


Why Predict Protein Structure?

• Experimental methods are expensive, time-consuming and difficult to automate

• Predictive methods are easily automated, fast and cheap

• Can be used to improve alignment accuracy

• Can be used to detect domain boundaries within proteins with remote sequence homology

• Predicted structure gives clues about function

• Useful for mutagenesis studies

• Often the first step towards fold recognition

http://bioinf.cs.ucl.ac.uk


An Example


Protein Structure

Primary (Sequence)

IVGGYTCAANSIPYQVSLNSGSHFCGGSLINSQWVVSAAHCYKSRIQVRLGEHNIDVLEGNEQFINAAKIITHPNFNGNTL...

Secondary (Helix/Strand/Coil) and lack of structure (disorder)

Domain and Tertiary (Fold)

Quaternary (Complexes)

http://bioinf.cs.ucl.ac.uk


Overall Approach

[Flowchart: starts from Protein Sequence and ends at 3-D Protein Model. Steps shown: Database Searching, Multiple Sequence Alignment, Domain Assignment, Secondary Structure and Disorder Prediction, Comparative Modelling, Fold Recognition, Sequence-Structure Alignment, Ab-initio Structure Prediction. Decision points with Yes/No branches: Homologue in PDB, Predicted Fold.]

modified from http://bioinf.cs.ucl.ac.uk


Secondary Structure Prediction


Protein Secondary Structure

STRAND

HELIX

COIL

http://bioinf.cs.ucl.ac.uk


1st Generation Methods

Based on single amino acid properties

Examples:

• Chou & Fasman (1974)

• Lim (1974)

• Garnier, Osguthorpe & Robson (1978)

The Q3 accuracy gives the percentage of residues correctly predicted as Coil/Helix/Strand.

These methods had Q3 accuracies around 50-55%

http://bioinf.cs.ucl.ac.uk


Prediction Accuracy

Q index: (Q_helix, Q_strand, Q_coil, Q3) - percentage of residues correctly predicted as α-helix, β-strand, coil, or for all 3 conformations.

$$Q_3 = \frac{N_{\text{correctly predicted}}}{N_{\text{total residues}}} \cdot 100$$

Drawbacks:
- even a random assignment of structure can achieve a high score (Holley & Karplus 1991)
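
A minimal sketch of the Q scores, assuming the observed and predicted structures are given as equal-length strings over the letters H, E and C. The function name `q_scores` and the two example strings are hypothetical illustrations, not data from the slides.

```python
def q_scores(observed, predicted):
    """Per-state Q scores and Q3, each as a percentage."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    scores = {}
    for state in "HEC":
        total = sum(1 for o in observed if o == state)
        correct = sum(1 for o, p in zip(observed, predicted)
                      if o == state and p == state)
        # A state that never occurs in the observed string has no defined score.
        scores["Q_" + state] = 100.0 * correct / total if total else float("nan")
    matches = sum(1 for o, p in zip(observed, predicted) if o == p)
    scores["Q3"] = 100.0 * matches / len(observed)
    return scores


# Hypothetical example: two helix residues are predicted as coil.
observed = "CCHHHHHHCCEEEECC"
predicted = "CCHHHHCCCCEEEECC"
scores = q_scores(observed, predicted)
print(scores)
```

Because Q3 counts residues only, a prediction can score well while still breaking up or shortening segments, which is one reason other measures are used alongside it.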

http://bioinf.cs.ucl.ac.uk


Correlation coefficient

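$$C_\alpha = \frac{p_\alpha n_\alpha - u_\alpha o_\alpha}{\sqrt{[n_\alpha + u_\alpha][n_\alpha + o_\alpha][p_\alpha + u_\alpha][p_\alpha + o_\alpha]}}$$

p_α: true positives; n_α: true negatives; u_α: false negatives (underpredicted); o_α: false positives (overpredicted)

Perfect prediction: C_α = 1 (= 100%)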
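
The correlation coefficient for a single state can be computed from the four counts as sketched below. The function name `state_correlation`, the convention of returning 0.0 when the denominator vanishes, and the example strings are assumptions made for illustration.

```python
import math


def state_correlation(observed, predicted, state):
    """Correlation coefficient for one state (e.g. 'H') from residue counts."""
    p = n = u = o = 0
    for obs, pred in zip(observed, predicted):
        if obs == state and pred == state:
            p += 1  # true positive
        elif obs != state and pred != state:
            n += 1  # true negative
        elif obs == state:
            u += 1  # false negative (underpredicted)
        else:
            o += 1  # false positive (overpredicted)
    denominator = math.sqrt((n + u) * (n + o) * (p + u) * (p + o))
    # Assumed convention: report 0.0 when the coefficient is undefined.
    return (p * n - u * o) / denominator if denominator else 0.0


# Hypothetical example strings (not from the slides).
observed = "CCHHHHHHCCEEEECC"
predicted = "CCHHHHCCCCEEEECC"
c_helix = state_correlation(observed, predicted, "H")
c_perfect = state_correlation(observed, observed, "H")
print(c_helix, c_perfect)
```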

http://bioinf.cs.ucl.ac.uk


Statistical Methods

From PDB database, calculate the propensity for a given amino acid to adopt a certain structural type (H, S, C)

$$P_\alpha^i = \frac{p(\alpha \mid aa_i)}{p(\alpha)} = \frac{p(\alpha, aa_i)}{p(\alpha)\, p(aa_i)}$$

Example: #Ala = 2,000, #residues = 20,000, #helix = 4,000, #Ala in helix = 500

p(α, aa) = 500/20,000, p(α) = 4,000/20,000, p(aa) = 2,000/20,000

P = 500 / (4,000/10) = 1.25

Used in Chou-Fasman algorithm (1974)

http://bioinf.cs.ucl.ac.uk
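
The propensity arithmetic from the alanine example can be checked with a short function. The name `propensity` and its parameter names are hypothetical; the counts are the ones given in the example.

```python
def propensity(n_aa_in_state, n_aa, n_state, n_total):
    """Propensity of an amino acid for a structural state: p(state, aa) / (p(state) * p(aa))."""
    p_joint = n_aa_in_state / n_total
    p_state = n_state / n_total
    p_aa = n_aa / n_total
    return p_joint / (p_state * p_aa)


# Counts from the example: 500 Ala in helix, 2,000 Ala, 4,000 helix residues, 20,000 residues.
p_ala_helix = propensity(n_aa_in_state=500, n_aa=2000, n_state=4000, n_total=20000)
print(round(p_ala_helix, 2))
```

A value above 1 means the amino acid occurs in that state more often than expected if amino acid identity and structure were independent.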



Chou-Fasman: Initiation

         T    S    P    T    A    E    L    M    R    S    T    G
P(H)  0.69 0.77 0.57 0.69 1.42 1.51 1.21 1.45 0.98 0.77 0.69 0.57

Identify regions where 4/6 residues have P(H) > 1.00: an “alpha-helix nucleus”

http://bioinf.cs.ucl.ac.uk


Chou-Fasman: Propagation

         T    S    P    T    A    E    L    M    R    S    T    G
P(H)  0.69 0.77 0.57 0.69 1.42 1.51 1.21 1.45 0.98 0.77 0.69 0.57

Extend the helix in both directions until a set of four contiguous residues has an average P(H) < 1.00
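
The initiation and propagation rules can be sketched as follows, using the P(H) values shown for the peptide TSPTAELMRSTG. The function names and the exact stopping test (checking the four-residue span that would include the next residue) are assumptions; the slides do not fix these details, and published implementations differ.

```python
def find_helix_nuclei(p_h, window=6, min_formers=4, threshold=1.00):
    """Return half-open (start, end) windows with at least min_formers values above threshold."""
    nuclei = []
    for start in range(len(p_h) - window + 1):
        formers = sum(1 for value in p_h[start:start + window] if value > threshold)
        if formers >= min_formers:
            nuclei.append((start, start + window))
    return nuclei


def extend_helix(p_h, start, end, span=4, threshold=1.00):
    """Grow the half-open segment [start, end) while the span average stays >= threshold."""
    def mean(values):
        return sum(values) / len(values)

    # Extend right: test the span that would end at the next residue.
    while end < len(p_h) and mean(p_h[end - span + 1:end + 1]) >= threshold:
        end += 1
    # Extend left: test the span that would start at the previous residue.
    while start > 0 and mean(p_h[start - 1:start - 1 + span]) >= threshold:
        start -= 1
    return start, end


sequence = "TSPTAELMRSTG"
p_h = [0.69, 0.77, 0.57, 0.69, 1.42, 1.51, 1.21, 1.45, 0.98, 0.77, 0.69, 0.57]

nuclei = find_helix_nuclei(p_h)
segment = extend_helix(p_h, *nuclei[0])
print(nuclei, segment, sequence[segment[0]:segment[1]])
```

Under these assumptions three overlapping nuclei are found, and extending the first one stops on the right once the four-residue average drops below 1.00.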

http://bioinf.cs.ucl.ac.uk


Scan peptide for β-sheet regions

Identify regions where 3/5 residues have P(E) > 1.00: a “β-sheet nucleus”

Extend the β-sheet until 4 contiguous residues have an average P(E) < 1.00

If the region average P(E) > 1.05 and the average P(E) > average P(H), then predict “β-sheet”

         T    S    P    T    A    E    L    M    R    S    T    G
P(H)  0.69 0.77 0.57 0.69 1.42 1.51 1.21 1.45 0.98 0.77 0.69 0.57
P(E)  1.47 0.75 0.55 1.47 0.83 0.37 1.30 1.05 0.93 0.75 1.47 0.75

http://bioinf.cs.ucl.ac.uk


Chou-Fasman Prediction

• Predict as α-helix segment with
  – E[Pα] > 1.03
  – E[Pα] > E[Pβ]
  – Not including proline

• Predict as β-strand segment with
  – E[Pβ] > 1.05
  – E[Pβ] > E[Pα]

• Others are labeled as turns.

(Various extensions appeared in the literature)

http://bioinf.cs.ucl.ac.uk
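
The segment-level decision rules can be written as one small function. This is a sketch under assumptions: `classify_segment` is a hypothetical name, the segment boundaries are taken as given, and the propensities are the P(H) and P(E) values shown earlier for the peptide TSPTAELMRSTG.

```python
def classify_segment(residues, p_alpha, p_beta):
    """Label a candidate segment 'H' (helix), 'E' (strand) or 'T' (turn)."""
    mean_alpha = sum(p_alpha) / len(p_alpha)
    mean_beta = sum(p_beta) / len(p_beta)
    if mean_alpha > 1.03 and mean_alpha > mean_beta and "P" not in residues:
        return "H"
    if mean_beta > 1.05 and mean_beta > mean_alpha:
        return "E"
    return "T"  # everything else is labeled as a turn


# Sub-segments of TSPTAELMRSTG with the propensities listed for those residues.
label_aelm = classify_segment("AELM", [1.42, 1.51, 1.21, 1.45], [0.83, 0.37, 1.30, 1.05])
label_rstg = classify_segment("RSTG", [0.98, 0.77, 0.69, 0.57], [0.93, 0.75, 1.47, 0.75])
print(label_aelm, label_rstg)
```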


Chou-Fasman Prediction

• To identify a bend at residue number j, calculate the following value: p(t) = f(j) · f(j+1) · f(j+2) · f(j+3)

• Here each factor is the bend frequency value for the residue at that position: f(j) for residue j, f(j+1) for residue j+1, f(j+2) for residue j+2 and f(j+3) for residue j+3.

• If (1) p(t) > 0.000075; (2) the average value of P(turn) over the tetrapeptide is > 1.00; and (3) the tetrapeptide averages satisfy P(α-helix) < P(turn) > P(β-sheet), then a β-turn is predicted at that location.
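
The three turn conditions translate directly into code. In this sketch, `is_beta_turn` is a hypothetical helper, each residue is assumed to carry four position-specific bend frequencies (one per position in the tetrapeptide), and all numbers in the example are made-up placeholders rather than published Chou-Fasman parameters.

```python
def is_beta_turn(bend_freqs, p_turn, p_helix, p_sheet, j, cutoff=0.000075):
    """Apply the three turn conditions to the tetrapeptide starting at index j.

    bend_freqs[i] is a 4-tuple: the bend frequency of residue i when it sits at
    position 0, 1, 2 or 3 of a turn (an assumed data layout).
    """
    if j + 4 > len(bend_freqs):
        return False  # not enough residues left for a tetrapeptide
    p_t = 1.0
    for offset in range(4):
        p_t *= bend_freqs[j + offset][offset]

    def window_mean(values):
        return sum(values[j:j + 4]) / 4

    mean_turn = window_mean(p_turn)
    return (p_t > cutoff
            and mean_turn > 1.00
            and window_mean(p_helix) < mean_turn > window_mean(p_sheet))


# Placeholder inputs for a four-residue peptide (illustrative values only).
bend_freqs = [(0.10, 0.10, 0.10, 0.10)] * 4
p_turn = [1.2, 1.3, 1.1, 1.2]
p_helix = [0.9, 0.8, 1.0, 0.9]
p_sheet = [0.8, 0.7, 0.9, 0.8]
turn_here = is_beta_turn(bend_freqs, p_turn, p_helix, p_sheet, j=0)
print(turn_here)
```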


Chou-Fasman Prediction

• Achieved accuracy: around 52%

• Shortcoming of this method: it ignores the context of the sequence when predicting from the amino-acid sequence

• We would like to use the sequence context as an input to a classifier

• There are many ways to address this.

• The most successful to date are based on neural networks

http://bioinf.cs.ucl.ac.uk


The “Chameleon” sequence

TEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTEK

TEAVDAWTVEKAFKTFANDNGVDGAWTVEKAFKTFTVTEK

sequence 1 sequence 2

Replace both chameleon sequences with engineered peptide (“chameleon”)

Source: Minor and Kim 1996, Nature, 380, 730-734

α-helix β-strand

University of Wyoming


2nd Generation Methods

Based on peptide segments / residue pairs

Examples:

• GOR III (1987)

• Neural Networks: Qian & Sejnowski (1988), among others

These methods had Q3 accuracies around 60-65%

http://bioinf.cs.ucl.ac.uk


Qian-Sejnowski Architecture

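[Figure: feed-forward network with Input, Hidden and Output layers. The input is a window of residues S(i-w) ... S(i) ... S(i+w) centered on residue S(i); the output units include oα and oβ (the remaining output label is garbled in the extraction).]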


3rd Generation Methods

Exploit evolutionary information. Based on conservation analysis of multiple sequence alignments.

• PHD (Q3 ~ 70%)
  Rost, B., Sander, C. (1993) J. Mol. Biol. 232, 584-599.

• PSIPRED (Q3 ~ 77%)
  Jones, D. T. (1999) J. Mol. Biol. 292, 195-202.
  Arguably remains the top secondary structure prediction method (won all CASP competitions since 1998).

http://bioinf.cs.ucl.ac.uk


What Patterns of Conservation Are Used?

Given a multiple sequence alignment:

• Regions of low conservation → COIL

• Regions of conservation
  – 1,4,5,8 pattern → ALPHA HELIX
  – All hydrophobic → BURIED BETA STRAND
  – Alternating residues → SURFACE BETA STRAND
  – Glycine/Proline → TURN

http://bioinf.cs.ucl.ac.uk


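PSIPRED - Detecting Patterns

Cascaded neural networks; structure resembles PHD

[Figure: the input sequence ..VQIVGGPYTCAANSI... is fed to the networks; the diagram shows a summation (Σ) node and three outputs: COIL, HELIX, STRAND]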

http://bioinf.cs.ucl.ac.uk


Profile

Position-Specific Scoring Matrix (PSSM). Columns: positions of the AA sequence; rows: the 20 amino acids.

    M   D   K   V   Q   Y   L   T   N   T   P   S   R   A   I   P   A   T   R   R   V   V   L   G   …   L   N   I
A  -1  -1  -1   0  -1  -2  -2   0  -2   2  -1   1  -1   4  -2  -1   2   0  -2  -1  -1   1  -2   0   …  -2  -2  -2
R  -2  -1   2  -3   1  -2  -3  -1   4  -1  -1  -1   5  -2  -3  -3  -1  -1   6   5  -3  -2  -3  -1   …  -3  -1  -3
N  -3   0   0  -3   0  -2  -4   0   5   0  -2   0  -1  -2  -4  -2  -1   0  -1  -1  -3  -2  -4   0   …  -4   7  -4
D  -4   4  -1  -4   0  -4  -4  -1   1  -1  -2  -1  -2  -2  -4  -2  -1  -1  -2  -2  -4  -3  -4  -1   …  -4   1  -4
C  -2  -4  -4  -1  -3  -3  -1  -1  -3  -1  -3  -1   4  -1  -2  -3  -1  -1  -4  -3  -1  -1  -2  -2   …  -2  -3  -1
Q  -1   1   1  -3   6  -2  -3  -1   0  -1   2  -1   4  -1  -3  -2   0  -1   1   0  -3  -2  -3  -1   …  -2   0  -3
E  -2   5   1  -3   2  -2  -3  -1   0  -1  -1  -1   0  -1  -4  -2  -1  -1   0  -1  -3  -2  -3  -1   …  -3   0  -4
G  -3  -2  -2  -4  -2  -4  -4  -2  -1  -1  -3  -1  -3   0  -4  -3  -1  -2  -3  -3  -4  -2  -4   4   …  -4  -1  -4
H  -2  -1  -1  -4   0   2  -3  -2   0  -2  -2  -1  -1  -2  -4  -3  -1  -2  -1  -1  -4   3  -3   4   …  -3   1  -4
I   1  -4  -3   3  -3  -2   3  -1  -4  -2   1  -2  -3  -2   5  -3   0  -1  -3  -1   2   1   4  -3   …   2  -4   5
L   2  -4  -3   1  -2  -1   4  -1  -3  -2  -1  -3  -2  -2   2  -3   0  -2  -3  -2   2   1   3  -3   …   5  -4   2
K  -2   0   5  -3   1  -2  -3  -1   1  -1   1  -1   1  -1  -3  -1  -1  -1   3   2  -3  -2  -3   1   …  -3   0  -3
M   6  -3  -2   1  -1  -1   2  -1  -2  -1  -1  -2  -1  -2   1  -3   3  -1  -2  -1   1   0   1  -2   …   2  -3   1
F   0  -4  -4  -1  -4   3   0  -2  -3  -3  -3  -3  -3  -3   0  -4  -1  -3  -3  -3  -1  -1   0  -3   …   0  -3   0
P  -3  -1  -1  -3  -2  -3  -3  -1  -2  -1   5  -1  -2   3  -3   8   0  -1  -2  -2  -3  -2  -3  -2   …  -3  -2  -3
S  -2   0   0  -2   0  -2  -3   2   0   3  -1   4  -1   1  -3  -1   0   2  -1  -1  -2  -1  -3   2   …  -3   1  -3
T  -1  -1  -1   0  -1  -2  -1   5  -1   3  -1   3  -1  -1  -1  -1   1   5  -1   1  -1  -1  -1  -1   …  -1   0  -1
W  -2  -4  -3  -3  -2   2  -2  -3  -4  -3  -3  -3  -3  -3  -3  -4  -1  -3  -3  -3  -3  -3  -2  -3   …  -2  -4  -3
Y  -1  -3  -2  -1  -2   8  -1  -2  -2  -2  -2  -2  -2  -2  -2  -3  -1  -2  -2  -2  -2  -1  -2  -2   …  -1  -2  -2
V   1  -3  -3   4  -3  -1   2   0  -3  -1  -1  -2  -3  -1   2  -3   0  -1  -3   0   4   3   1  -3   …   1  -3   3

[Figure annotations: an Input Window (Size = Win) is centered on the Current Position]


PSIPRED - Using PSI-BLAST

Improvement over PHD is mainly due to using PSI-BLAST profiles

[Figure: the sequence ..VQIVGGPYTCAANSI... is converted to a PSI-BLAST profile with one row of 20 values per residue (columns labeled V Y W T S P F M K L I H G E Q C D N R A), for example:
0.1 0.2 0.1 0.1 0.2 0.2 0.1 0.0 0.1 0.1 0.2 0.1 0.1 0.2 0.7 0.2 0.3 0.1 0.2 0.1
0.7 0.3 0.1 0.1 0.1 0.2 0.1 0.0 0.1 0.2 0.2 0.1 0.2 0.3 0.3 0.1 0.2 0.3 0.3 0.2]

• 1st Network: 315 inputs, 75 hidden units, 3 outputs

• 2nd Network: 60 inputs, 60 hidden units, 3 outputs

• Output: H/E/C 3-state prediction
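
The input sizes quoted for the two networks are consistent with a sliding window of 15 positions: 15 × (20 profile values + 1 end-of-chain flag) = 315 and 15 × (3 state scores + 1 flag) = 60. The slides state only the totals, so this decomposition, the helper name `window_features` and the zero-padding scheme below are assumptions offered as a sketch.

```python
def window_features(rows, center, win=15):
    """Flatten a window of per-residue rows around `center` into one input vector.

    Each position contributes its row plus one flag that is 1.0 when the
    position lies beyond either end of the chain (the row is then all zeros).
    """
    half = win // 2
    width = len(rows[0])
    features = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(rows):
            features.extend(rows[pos])
            features.append(0.0)
        else:
            features.extend([0.0] * width)
            features.append(1.0)
    return features


# Placeholder profile: 30 residues with 20 values each (all zeros, illustration only).
profile = [[0.0] * 20 for _ in range(30)]
first_stage_input = window_features(profile, center=0)

# Placeholder first-stage outputs: 3 state scores (H, E, C) per residue.
state_scores = [[0.2, 0.1, 0.7] for _ in range(30)]
second_stage_input = window_features(state_scores, center=10)

print(len(first_stage_input), len(second_stage_input))
```

At `center=0` the seven positions to the left of the chain start are padded, so their flags are set to 1.0.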

http://bioinf.cs.ucl.ac.uk


PSIPRED – David Jones JMB 1999


PSIPRED Example Output


Measures of Secondary Structure Prediction Accuracy

• Q3 scores give the percentage of correctly predicted residues across 3 states (H,E,C)

• SOV scores (Segment OVerlap) give the percentage of correctly predicted SEGMENTS across 3 states

• SSEA scores (Secondary Structure Element Alignment) give a better idea of usefulness of secondary structure prediction for use in fold recognition

http://bioinf.cs.ucl.ac.uk


SOV

• SOV scores (Segment OVerlap) give the percentage of correctly predicted SEGMENTS across 3 states

Zemla et al. Proteins 34, 1999



SOV

Zemla et al. Proteins 34, 1999

s1 – observed segment; s2 – predicted segment; Sα – number of all segment pairs (s1, s2) with at least 1 α-residue in common; minOV is the overlap between s1 and s2; maxOV is the length in α-state of the union of positions of s1 and s2; Nα – the total number of residues in α-state.

[Figure: accuracy score (y-axis) by method (x-axis)]

http://bioinf.cs.ucl.ac.uk


Secondary Structure Prediction Summary

1st Generation - 1970s
• Q3 = 50-55%
• Chou & Fasman, GOR

2nd Generation - 1980s
• Q3 = 60-65%
• Qian & Sejnowski, GOR III

3rd Generation - 1990s
• Q3 = 70-80%
• PHD variants, PSIPRED, GOR V

4th Generation - 2000s?
• Upper limit 88%?
• Higher accuracy linked to database size?
• Is this problem now solved?

http://bioinf.cs.ucl.ac.uk


Bob MacCallum



State of the Art

• Both PHD and Nearest neighbor get about 72%-74% accuracy
  – Both predicted well in CASP2 (1996)

• PSIPRED slightly better (around 76%)

• Recent trend: combining classification methods
  – Best predictions in CASP3 (1998)

• Failures:
  – Long-range effects: S-S bonds, parallel strands
  – Chemical patterns
  – Wrong prediction at the ends of helices/strands

http://bioinf.cs.ucl.ac.uk


Decision Tree

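[Figure: three two-level decision trees built from binary classifiers with Yes/No branches: (1) H / ~H, then E / C; (2) E / ~E, then C / H; (3) C / ~C, then H / E. Each tree ends in the three leaves H, E and C.]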

http://bioinf.cs.ucl.ac.uk


Prediction of Protein Disorder


Prediction of Protein Disorder

• Relatively new field

• Implications for fold recognition

• Functionally important

• First prediction methods by Romero et al. (1997)

• Many other predictors (about 15 servers in 2006)

http://bioinf.cs.ucl.ac.uk


Dataset construction

[Figure legend: disordered region, ordered region]

Dataset consists of:

• 152 disordered proteins (~22,000 residues) – from literature and database search

• 290 ordered proteins (~67,000 residues) – from the Protein Data Bank


Disorder predictor

[Figure: predictor pipeline. Components shown: protein sequence; PSI-BLAST search against NRDB producing profiles (20 profile attributes); Attribute Construction; Base Predictor; Post-Processing; Predicted Disordered Regions]

• 79% on long disordered regions and 91% on ordered regions ⇒ overall prediction accuracy is 85%
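
The quoted overall accuracy of 85% equals the unweighted mean of the two per-class accuracies, (79% + 91%) / 2. Reading "overall accuracy" as this balanced average is an assumption that matches the numbers; the helper names and the toy labels below are hypothetical.

```python
def per_class_accuracy(true_labels, predicted_labels):
    """Fraction of correctly predicted residues within each true class."""
    totals, correct = {}, {}
    for truth, guess in zip(true_labels, predicted_labels):
        totals[truth] = totals.get(truth, 0) + 1
        if truth == guess:
            correct[truth] = correct.get(truth, 0) + 1
    return {label: correct.get(label, 0) / totals[label] for label in totals}


def balanced_accuracy(class_accuracies):
    """Unweighted mean of the per-class accuracies."""
    values = list(class_accuracies.values())
    return sum(values) / len(values)


# Per-class figures quoted on the slide.
slide_accuracy = balanced_accuracy({"disordered": 0.79, "ordered": 0.91})

# Toy labels (illustration only): D = disordered, O = ordered.
truth = list("DDDDOOOOOO")
guess = list("DDDOOOOOOD")
toy = per_class_accuracy(truth, guess)
print(round(slide_accuracy, 2), toy, balanced_accuracy(toy))
```

Averaging per class keeps the larger ordered set from dominating the score, which matters here because the dataset has roughly three times as many ordered residues as disordered ones.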


Data representation

Sequence: W C Y L A M A H Q F A A A G K L K T S A L S C T

Input window (size = Win)

class: 0/1 (disordered/ordered)

Calculate over window:
• 20 compositions
• K2 entropy
• 14Å contact number
• Hydropathy
• Flexibility
• Coordination number
• Bulkiness
• CFYW
• Volume
• Net charge
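
A few of the window attributes can be computed directly from the sequence, as sketched below. Several choices here are assumptions: K2 entropy is taken to be the Shannon entropy (in bits) of the window composition, net charge is counted as (K + R) minus (D + E), CFYW is the fraction of those four residues, and windows are truncated at the chain ends. The remaining attributes need lookup tables that the slides do not give, so they are omitted.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def window_attributes(sequence, center, win=21):
    """Composition-based attributes for the window centered on `center`."""
    half = win // 2
    window = sequence[max(0, center - half):center + half + 1]  # truncated at the ends
    counts = Counter(window)
    length = len(window)
    composition = [counts.get(aa, 0) / length for aa in AMINO_ACIDS]
    entropy = -sum(f * math.log2(f) for f in composition if f > 0)
    net_charge = counts["K"] + counts["R"] - counts["D"] - counts["E"]
    cfyw = sum(counts[aa] for aa in "CFYW") / length
    return {
        "composition": composition,  # 20 values, one per amino acid
        "k2_entropy": entropy,       # assumed: Shannon entropy of the composition
        "net_charge": net_charge,    # assumed: simple residue count difference
        "cfyw": cfyw,
    }


sequence = "WCYLAMAHQFAAAGKLKTSALSCT"
attrs = window_attributes(sequence, center=10, win=5)  # window is "QFAAA"
print(attrs["k2_entropy"], attrs["net_charge"], attrs["cfyw"])
```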


Prediction of Disorder

sn = 76%

sp = 91%

accuracy > 80%


Comparisons of methods

OLS: Ordinary Least Squares Regression

LR: Logistic Regression

NN: Neural Networks


VSL model - Background


VSL model

