
Protein Secondary Structure Prediction: A New Improved Knowledge-Based Method

Wen-Lian Hsu

Institute of Information Science

Academia Sinica, Taiwan

2/29

Outline

• Introduction
– PSSP
– Motivation
• Knowledge-Based Method
– PROSP
• An Improved Hybrid Method
– PROSP II
– HYPROSP II+
• Conclusion

3/29

Protein Structures

Primary sequence: MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE

Secondary structures: helices, strands, loops

Tertiary structures: three-dimensional packing of secondary structures

4/29

Introduction to PSSP

Protein Secondary Structure Prediction (PSSP) is the task of predicting a protein's secondary structure from its amino acid sequence alone.

Each amino acid is assigned a secondary structure element (SSE): Helix (H), Strand (E), or Coil (C or L).

5/29

Motivation

PSSP plays an important role in tertiary structure prediction.

Fischer (1996) improved tertiary structure prediction accuracy from 59.0 to 71.0 by using PHD to predict SSEs.

Yang (2003) improved tertiary structure prediction accuracy from 71.9 to 79.0 by using PSIPRED to predict SSEs.

Predicted SSEs can also be used as features in other prediction algorithms to improve their performance.

6/29


7/29

Treat PSSP as a Translation Problem

Secondary structure prediction translates a language with a 20-letter alphabet (the amino acid sequence) into a language with a 3-letter alphabet (the secondary structure elements).

8/29

Treating Genomic/Proteomic Sequences as a Language

For proteomic data:

Amino acid → motif → protein

Alphabet → word → sentence → paragraph

Protein structure or function ↔ sentence meaning

Finding the interrelationships of data: data mining, knowledge discovery

9/29

Matching by Semantics (prediction based on evolutionary information)

• Existing sentences in database (understood):
– His old father gave me a book.
– Joan loves Andy.

• Understanding a new sentence:
– Mary’s lovely daughter does not like John.

• Techniques:
– Corpus analysis
– Pattern discovery and matching

• Sequence, semantics (classification, transformation)
– Structure prediction

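Speech Recognition Example (Sense Disambiguation in English)

Selection of homonyms (or senses) in speech recognition

Spoken sentence: 台北市一位小孩走失了 ("A child went missing in Taipei City")

Candidate words that match parts of the sound sequence: 台北市 (Taipei City), 小孩 (child), 台北 (Taipei), 適宜 (suitable), 走失 (to go missing), 事宜 (matters), 一位 (one person), 一味 (blindly), 移位 (to shift position)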

11/29

How do we represent the context in a protein sequence (or sentence)?

• Using motifs as words?
– Motifs could be too specific and may not provide enough coverage.

• What about using k-mers?
– We can build (k-mer, structure) pairs.
– How many k-mers can we get?
– How do we define similar k-mers (under the context)?
– How do we combine the structural information from the k-mers?
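To make the idea of (k-mer, structure) pairs concrete, here is a minimal sketch. The function name `kmer_structure_pairs` and the toy structure string are illustrative assumptions, not taken from the slides.

```python
def kmer_structure_pairs(sequence, structure, k):
    """Return (k-mer, structure fragment) pairs from one protein of known structure."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    # Slide a window of length k over the sequence and its structure in parallel.
    return [(sequence[i:i + k], structure[i:i + k])
            for i in range(len(sequence) - k + 1)]


# Toy input: the structure string is invented for illustration only.
pairs = kmer_structure_pairs("MYKKILYPTD", "CCHHHHHHCC", k=7)
for kmer, fragment in pairs:
    print(kmer, fragment)
```

A protein of length n yields n - k + 1 pairs. With 20 amino acids there are 20**7 (1.28 billion) possible 7-mers, which is why coverage becomes a concern.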

12/29

PROSP: our knowledge-based method for PSSP

• Construct a peptide Sequence-Structure Knowledge Base (SSKB).

• Use PSI-BLAST to find all peptides similar to those of the target protein.

• Use the similar peptides found in the SSKB to vote for the dominant structure of each amino acid in the target protein.
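The voting step can be sketched roughly as follows. This is a simplification under stated assumptions: exact k-mer lookup stands in for the PSI-BLAST similarity search, the dictionary layout of `sskb` is hypothetical, and the coil default for residues without votes is an arbitrary choice.

```python
from collections import Counter


def vote(sequence, sskb, k):
    """Accumulate H/E/C votes per residue from k-mers found in the knowledge base.

    sskb maps a k-mer to the list of structure fragments recorded for it
    (hypothetical layout). Exact lookup replaces the similarity search here.
    """
    scores = [Counter() for _ in sequence]
    for i in range(len(sequence) - k + 1):
        for fragment in sskb.get(sequence[i:i + k], []):
            for offset, state in enumerate(fragment):
                scores[i + offset][state] += 1
    return scores


def predict(scores, default="C"):
    """Assign each residue its highest-voted state; ties resolve in H, E, C order."""
    labels = []
    for counter in scores:
        if counter:
            labels.append(max("HEC", key=lambda s: counter[s]))
        else:
            labels.append(default)  # assumption: no votes means coil
    return "".join(labels)


# Toy knowledge base with invented entries.
toy_sskb = {"MYKKI": ["CHHHH", "CHHHH"], "YKKIL": ["HHHHE"]}
toy_scores = vote("MYKKIL", toy_sskb, k=5)
toy_prediction = predict(toy_scores)
print(toy_prediction)
```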

13/29

Using PSI-BLAST to Amplify the Effect of the DSSP Database (create more synonyms)

The number of peptide words is still small (~5 million).

Identify similar peptides: for each protein p in the NR database, apply PSI-BLAST to find its HSPs (high-scoring segment pairs).

HSP: an alignment of a subsequence of protein p with a subsequence of another protein q of unknown structure.

Assign the structure of “selected” peptides of p to those of q. These peptides comprise our dictionary (~100 million).
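One way to picture the structure transfer across an HSP is the sketch below. It assumes the HSP is given as two gapped rows of equal length with '-' for gaps, and it marks residues of q that face a gap in p with '?'. The slides do not say how peptides are "selected", so that step is left out.

```python
def transfer_structure(aligned_p, aligned_q, structure_p):
    """Copy structure labels from known protein p onto the aligned residues of q.

    aligned_p, aligned_q: the two rows of one HSP, with '-' marking gaps.
    structure_p: structure labels for the ungapped residues of aligned_p.
    Returns (q_peptide, q_structure); q residues aligned to a gap in p get '?'.
    """
    if len(aligned_p) != len(aligned_q):
        raise ValueError("alignment rows must have the same length")
    q_residues, q_labels = [], []
    p_index = 0
    for a, b in zip(aligned_p, aligned_q):
        label = "?"
        if a != "-":
            label = structure_p[p_index]
            p_index += 1
        if b != "-":
            q_residues.append(b)
            q_labels.append(label)
    return "".join(q_residues), "".join(q_labels)


# Toy alignment; the structure string is invented for illustration.
peptide, labels = transfer_structure("KTYQ-CQY", "KPY-RCQY", "CHHHHHC")
print(peptide, labels)
```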

14/29

SSKB construction (synonyms)

[Figure: an example of a high-scoring segment pair (HSP) from a PSI-BLAST search result, aligning a peptide of known structure with a peptide of unknown structure]

15/29

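Prediction at a position x

[Figure: peptides found by PSI-BLAST in the SSKB each vote with their structure at position x. The votes are accumulated into the voting scores H(x), E(x), and C(x). In the example shown, x is assigned as helix.]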

16/29


17/29

Two problems of searching for homologous peptides in protein sequence databases

• Redundant information generated by duplicate peptides: the voting bias problem in PROSP.

• Poor prediction accuracy due to insufficient knowledge base matching: coverage needs to be boosted.

18/29

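The voting bias problem

[Figure: the query KTYQCQY… is searched with PSI-BLAST. The subject peptides returned are KPYQCQY (five copies), KVYQCQY, and QPYRCKY. In the SSKB, the duplicated peptide contributes the same all-helix (H) structure once per copy, while the other structure shown is CCHHHC. The repeated votes from the duplicates produce the dominant result.]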

19/29

Clustering HSPs

Query: …MYKKILYPTDFSETAEIALK…

Similar HSPs are grouped together: MYSKILL (×3), MYKKIYL (×4), MYSSILY (×2)
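A small sketch of why collapsing duplicate HSPs reduces voting bias. It groups HSPs only by exact identity, which is a simplifying assumption (the slides cluster similar HSPs, and the similarity criterion is not given). The structure strings are invented.

```python
from collections import Counter


def cluster_identical(hsps):
    """Group HSPs given as (peptide, structure) tuples; values are cluster sizes."""
    return Counter(hsps)


def votes_at(hsps, offset, deduplicate):
    """Tally structure votes at one peptide offset, optionally one vote per cluster."""
    items = cluster_identical(hsps).keys() if deduplicate else hsps
    return Counter(structure[offset] for _, structure in items)


# Toy data: five copies of one hit plus two distinct hits.
toy_hsps = [("KPYQCQY", "HHHHHHH")] * 5 + [
    ("KVYQCQY", "CCHHHCC"),
    ("QPYRCKY", "CCHHHCC"),
]
raw = votes_at(toy_hsps, offset=0, deduplicate=False)
clustered = votes_at(toy_hsps, offset=0, deduplicate=True)
print(raw, clustered)
```

In the toy data the raw vote at offset 0 favors helix 5 to 2, while the clustered vote favors coil 2 to 1.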

20/29

Measuring the amount of structural information

[Figure: HSPs aligned along the query, with regions marked as found or unfound in SSKB7. Where the local match rate is low, there is no information from SSKB7 for that region.]
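The slides do not give an exact definition of the local match rate (LMR). The sketch below assumes one plausible reading: for each residue, the fraction of query k-mers covering that residue that are found in the knowledge base. Both the definition and the name `local_match_rate` are assumptions for illustration.

```python
def local_match_rate(sequence, known_kmers, k):
    """Per-residue fraction of covering k-mers that occur in known_kmers.

    Assumed definition, not taken from the slides. Returns values in [0, 1].
    """
    n = len(sequence)
    found = [sequence[i:i + k] in known_kmers for i in range(n - k + 1)]
    rates = []
    for x in range(n):
        first = max(0, x - k + 1)   # earliest window start that covers x
        last = min(x, n - k)        # latest window start that covers x
        covering = found[first:last + 1]
        rates.append(sum(covering) / len(covering) if covering else 0.0)
    return rates


# Toy example with k = 3 and an invented set of known k-mers.
toy_rates = local_match_rate("MYKKI", {"MYK", "KKI"}, k=3)
print(toy_rates)
```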

21/29

Construct SSKBs with different window lengths (to boost coverage)

[Figure: training protein → PSI-BLAST search → HSPs → SSKB construction. The pipeline is run once with window length = 7 to build SSKB7 and once with window length = 5 to build SSKB5.]

22/29

Boost the match rate using peptide records of different lengths

Protein: MYKKILYPTDFSETAEIALK…

Voting scores from HSPs in SSKB7 (window length = 7):
H  1 2 1 3 6 7 8 …
E  1 2 2 0 0 0 1 …
C  2 3 8 8 5 4 2 …

Voting scores from HSPs in SSKB5 (window length = 5):
H  1 3 2 5 5 5 2 …
E  1 3 2 0 0 0 1 …
C  2 4 7 7 6 6 7 …

23/29

New PROSP system

Protein: MYKKILYPTDFSETAEIALK…

Voting scores from SSKB7 (window length = 7):
H  1 2 1 3 6 7 8 …
E  1 2 2 0 0 0 1 …
C  2 3 8 8 5 4 2 …

Voting scores from SSKB5 (window length = 5):
H  1 3 2 5 5 5 2 …
E  1 3 2 0 0 0 1 …
C  2 4 7 7 6 6 7 …

Combination rule:

$H_{\text{PROSPII}}(x) \leftarrow \text{LMR}_{7\text{mer}}(x) \times H_7(x) + (1 - \text{LMR}_{7\text{mer}}(x)) \times H_5(x)$

$E_{\text{PROSPII}}(x) \leftarrow \text{LMR}_{7\text{mer}}(x) \times E_7(x) + (1 - \text{LMR}_{7\text{mer}}(x)) \times E_5(x)$

$C_{\text{PROSPII}}(x) \leftarrow \text{LMR}_{7\text{mer}}(x) \times C_7(x) + (1 - \text{LMR}_{7\text{mer}}(x)) \times C_5(x)$

Combined PROSP II scores:
H  1 3 2 5 7 6 7 …
E  1 3 2 0 0 0 1 …
C  2 4 8 8 4 5 6 …
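The PROSP II combination weights the SSKB7 scores by the local match rate of 7-mers and the SSKB5 scores by its complement. A minimal sketch, assuming the LMR is expressed as a fraction in [0, 1] and the scores are stored as per-state lists (both are assumptions about representation, and the numbers are toy values):

```python
def combine_scores(lmr7, scores7, scores5):
    """Blend SSKB7 and SSKB5 voting scores per residue, weighted by LMR of 7-mers.

    lmr7: LMR values in [0, 1], one per residue (assumed scaling).
    scores7, scores5: dicts mapping 'H', 'E', 'C' to per-residue score lists.
    """
    return {
        state: [w * s7 + (1.0 - w) * s5
                for w, s7, s5 in zip(lmr7, scores7[state], scores5[state])]
        for state in "HEC"
    }


# Toy values for three residues.
toy_lmr = [1.0, 0.0, 0.5]
toy_s7 = {"H": [6, 7, 8], "E": [0, 0, 1], "C": [5, 4, 2]}
toy_s5 = {"H": [5, 5, 2], "E": [0, 0, 1], "C": [6, 6, 7]}
combined = combine_scores(toy_lmr, toy_s7, toy_s5)
print(combined)
```

Where the 7-mer match rate is high the result follows SSKB7, and where it is low the result falls back to the better-covered SSKB5.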

24/29

Hybrid by Neural Network

[Figure: the query protein is processed by PSIPRED, PROSP, and PSI-BLAST. PSIPRED provides H, E, and C scores (3 features), PROSP provides H, E, and C scores (3 features), and PSI-BLAST provides the PSSM (20 features). A neural network combines these features into the final result.]
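The input to the hybrid network can be pictured as a concatenation of the three feature groups, giving 3 + 3 + 20 = 26 values per residue. The sketch below shows only this assembly step. The function name and toy values are hypothetical, and the network architecture, any windowing over neighboring residues, and any normalization are not described in the slides.

```python
def residue_features(psipred_scores, prosp_scores, pssm_row):
    """Concatenate per-residue inputs into one 26-value feature vector."""
    if len(psipred_scores) != 3 or len(prosp_scores) != 3:
        raise ValueError("expected H, E, C scores (3 values) from each predictor")
    if len(pssm_row) != 20:
        raise ValueError("expected one PSSM value per amino acid (20 values)")
    return list(psipred_scores) + list(prosp_scores) + list(pssm_row)


# Toy values for a single residue.
features = residue_features([0.7, 0.1, 0.2], [6.0, 0.0, 4.5], [0] * 20)
print(len(features))
```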

25/29

Data Sets

Two widely used test sets: CB513 and EVAc4.

Derivation of the training sets:

• Get 4,572 unique protein chains (with less than 25% mutual sequence identity) from the DSSP database.

• Further remove protein chains with over 25% sequence identity to the respective test dataset to obtain the respective training dataset.

• The final training datasets consist of 4,395 and 4,055 protein chains for EVAc4 and CB513, respectively.

26/29

The respective performance improvement using SSKB5 and SSKB7

[Figure: Q3 (%) on the y-axis (55 to 85) versus LMR7mer (%) on the x-axis, in bins [0,10), [10,20), [20,30), [30,40), [40,50), for the 7-mer SSKB, the 5-mer SSKB, and PROSP II.]

Performance of prediction on CB513 by SSKB5, SSKB7, and PROSP II for LMR7mer lower than 50%.
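Q3 is the standard three-state accuracy: the percentage of residues whose predicted state matches the observed state. A minimal sketch, with the function name `q3` and the toy strings chosen for illustration:

```python
def q3(predicted, observed):
    """Percentage of residues whose predicted state (H, E, C) equals the observed one."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Toy example: five of six residues agree.
toy_q3 = q3("HHHECC", "HHHCCC")
print(round(toy_q3, 2))
```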

27/29

Performance of HYPROSP II+

Method       Q3     SOV    QH_o   QH_p   QE_o   QE_p   QC_o   QC_p   Info
HYPROSPII+   80.35  78.66  78.65  83.85  61.10  71.27  81.79  76.35  0.44
Errsig        0.84   1.20   1.87   1.75   2.33   2.15   1.05   1.15  0.02
PROFsec      76.54  75.39  67.30  74.00  43.70  43.20  76.80  73.50  0.38
PSIPRED      77.62  76.05  72.90  71.50  38.60  42.30  73.50  76.40  0.38
SAM-T99sec   77.64  75.05  75.50  69.60  38.80  47.30  72.40  75.70  0.39
YASSPP       79.34  78.65  --     --     --     --     --     --     0.42
HYPROSPII    79.32  76.51  81.49  77.85  60.91  68.83  76.98  77.78  0.41

28/29

Conclusion

HYPROSP II+

• Uses a more robust knowledge-based algorithm, PROSP II.

• More structural information, better prediction.

• Incremental learning.

The general strategy developed in this paper could be used to enhance the performance of similar approaches in other prediction problems.

People

Wen-Lian Hsu, Ting-Yi Sung, Hsin-Nan Lin, Jia-Ming Chang, Ei-Wen Yang
