Protein Secondary Structure Prediction: A New Improved Knowledge-Based Method
Wen-Lian Hsu
Institute of Information Science
Academia Sinica, Taiwan
2/29
Outline
Introduction
  PSSP
  Motivation
Knowledge-Based Method
  PROSP
An Improved Hybrid Method
  PROSP II
  HYPROSP II+
Conclusion
3/29
Protein Structures
Primary sequence: MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
Secondary structures: helices, strands, loops
Tertiary structures: three-dimensional packing of secondary structures
4/29
Introduction to PSSP
Protein Secondary Structure Prediction (PSSP) aims to predict a protein's secondary structure from its sequence alone.
Each amino acid is assigned a secondary structure element (SSE): Helix (H), Strand (E), or Coil (C or L).
5/29
Motivation
PSSP plays an important role in tertiary structure prediction.
Fischer (1996) improved tertiary structure prediction accuracy from 59.0 to 71.0 by using PHD to predict SSEs.
Yang (2003) improved tertiary structure prediction accuracy from 71.9 to 79.0 by using PSIPRED to predict SSEs.
Predicted SSEs can also be used as features in other prediction algorithms to improve their performance.
7/29
Treat PSSP as a Translation Problem
Secondary structure prediction translates from a language with a 20-letter alphabet (the amino acids) into a language with a 3-letter alphabet (the structure elements).
8/29
Treating Genomic/Proteomic Sequences as a Language
For proteomic data:
Amino acid, motif, protein
correspond to
Alphabet, word, sentence, paragraph
Protein structure or function corresponds to sentence meaning
Finding the interrelationships of data: Data Mining, Knowledge Discovery
9/29
Matching by Semantics (prediction based on evolutionary information)
• Existing sentences in database (understood):
– His old father gave me a book.
– Joan loves Andy
• Understanding a new sentence
– Mary’s lovely daughter does not like John
• Techniques
– Corpus analysis
– Pattern discovery and matching
• Sequence, semantics (classification, transformation)
– Structure prediction
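Speech Recognition: Example
Sense Disambiguation in English
Selection of homonyms (or senses) in speech recognition
Utterance: 台北市一位小孩走失了 ("A child has gone missing in Taipei City")
Candidate words with overlapping pronunciations: 台北市 (Taipei City), 小孩 (child), 台北 (Taipei), 適宜 (suitable), 走失 (go missing), 事宜 (matters), 一位 (one person), 一味 (blindly), 移位 (shift)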
11/29
How do we represent the context in a protein sequence (or sentence)?
Using motifs as words?
  Motifs can be too specific and do not provide enough coverage.
What about using k-mers?
  We can build (k-mer, structure) pairs.
  How many k-mers can we get?
  How do we define similar k-mers (under the context)?
  How do we combine the structural information from the k-mers?
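The (k-mer, structure) pairing can be illustrated with a sliding window. This is a minimal sketch, not the authors' implementation: the function name `kmer_structure_pairs` and the toy structure string are assumptions made for illustration. The last line shows the simple upper bound on distinct k-mers over a 20-letter alphabet.

```python
def kmer_structure_pairs(sequence, structure, k):
    """Slide a window of length k over the sequence and pair each
    k-mer with the structure labels of the same positions."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    return [
        (sequence[i:i + k], structure[i:i + k])
        for i in range(len(sequence) - k + 1)
    ]


# Toy input: the structure string is made up for illustration only.
pairs = kmer_structure_pairs("MTYKLILNGK", "CEEEEEECCC", 7)

# Upper bound on the number of distinct 7-mers over 20 amino acids.
max_distinct_7mers = 20 ** 7
```

A protein of length n yields n - k + 1 overlapping k-mers, so the number of observed k-mers grows with the database size while staying far below the theoretical bound.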
12/29
PROSP: Our knowledge-based method for PSSP
Construct a peptide Sequence-Structure Knowledge Base (SSKB).
Use PSI-BLAST to find all peptides similar to those of the target protein.
Use the similar peptides found in the SSKB to vote for the dominant structure of each amino acid in the target protein.
13/29
Using PSI-BLAST to Amplify the Effect of the DSSP Database (create more synonyms)
The number of peptide words is still small (~5 million).
Identify similar peptides:
  For each protein p in the NR database, apply PSI-BLAST to find its HSPs (high-scoring segment pairs).
  HSP: an alignment of a subsequence of protein p with a subsequence of another protein q of unknown structure.
  Assign the structure of “selected” peptides of p to those of q.
  These peptides comprise our dictionary (~100 million).
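The structure transfer across an HSP might look roughly like the following. This is a hedged sketch: the slides do not say how peptides are "selected", so the rule used here (keep only gap-free windows) is an assumption, and `transfer_structure` and its parameters are hypothetical names.

```python
def transfer_structure(aligned_p, aligned_q, structure_p, k=7):
    """Copy structure labels of p onto aligned peptides of q.

    aligned_p, aligned_q: equal-length alignment rows ('-' marks a gap).
    structure_p: one label per residue of p inside the alignment.
    Assumption: only windows without gaps in either row are kept.
    """
    if len(aligned_p) != len(aligned_q):
        raise ValueError("alignment rows must have the same length")
    if len(structure_p) != len(aligned_p.replace("-", "")):
        raise ValueError("structure_p must label every residue of p")

    # Spread the labels of p over the alignment columns.
    labels, pos = [], 0
    for residue in aligned_p:
        if residue == "-":
            labels.append("-")
        else:
            labels.append(structure_p[pos])
            pos += 1

    records = []
    for i in range(len(aligned_q) - k + 1):
        window_p = aligned_p[i:i + k]
        window_q = aligned_q[i:i + k]
        if "-" in window_p or "-" in window_q:
            continue  # assumed selection rule: skip gapped windows
        records.append((window_q, "".join(labels[i:i + k])))
    return records


# Toy HSP: sequences and labels are illustrative, not real data.
records = transfer_structure("MYKKILYP", "MYSKILLP", "CCHHHHHC")
```

Each returned record is a (peptide, structure) entry that could be added to the dictionary as a "synonym" of the original peptide.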
14/29
SSKB construction (synonyms)
An example of a high-scoring segment pair (HSP) from a PSI-BLAST search result
[Figure: HSP alignment between a sequence of known structure and a sequence of unknown structure]
15/29
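Prediction at a position x
[Figure: peptides of the query covering position x are matched against the SSKB via PSI-BLAST; the structure labels of the matches are accumulated into voting scores H(x), E(x), C(x). In the example, x is assigned as helix.]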
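The voting step can be sketched as a per-position tally. This is an illustrative sketch under stated assumptions: each hit is represented as a hypothetical `(start, structure)` tuple giving where a matched peptide aligns on the query, every vote has weight 1, ties are broken in the order H, E, C, and positions without votes are marked `-`. None of these details are given in the slides.

```python
def vote_structure(query_length, hits):
    """Tally H/E/C votes per query position and pick the top label.

    hits: iterable of (start, structure) tuples, where `structure` is
    the label string of a matched peptide aligned at offset `start`.
    """
    scores = [{"H": 0, "E": 0, "C": 0} for _ in range(query_length)]
    for start, structure in hits:
        for offset, label in enumerate(structure):
            scores[start + offset][label] += 1

    prediction = []
    for position_scores in scores:
        if sum(position_scores.values()) == 0:
            prediction.append("-")  # no information at this position
        else:
            # max() returns the first maximum, so ties resolve as H, E, C.
            prediction.append(max("HEC", key=lambda lab: position_scores[lab]))
    return scores, "".join(prediction)


# Three toy hits covering a query of length 5.
scores, predicted = vote_structure(5, [(0, "HHH"), (1, "HHC"), (2, "ECC")])
```

At position 2 the tally is H = 2, E = 1, C = 0, so that residue is assigned as helix.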
17/29
Two problems of searching for homologous peptides in protein sequence databases
Redundant information generated by duplicate peptides: the voting bias problem in PROSP
Poor prediction accuracy due to insufficient knowledge-base matching: coverage needs to be boosted
18/29
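The voting bias problem
Query: KTYQCQY…
Subject peptides in the PSI-BLAST results (from the SSKB): KPYQCQY (five identical copies), KVYQCQY, QPYRCKY
Structures recorded for these peptides: HHHHHH (five copies), CCHHHC, CCHHHC
Dominant result: the duplicated peptide's structure outvotes the others.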
19/29
Clustering HSPs
Query: …MYKKILYPTDFSETAEIALK…
Similar HSPs, grouped: MYSKILL (3 copies), MYKKIYL (4 copies), MYSSILY (2 copies)
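One way to reduce the voting bias is to let each cluster vote once. The sketch below groups only identical peptides, which is a simplification: the slides speak of "similar" HSPs without giving the similarity criterion. The function names and the structure labels in the toy data are assumptions.

```python
from collections import Counter


def cluster_identical(hsps):
    """Group (peptide, structure) records by identical peptide,
    preserving first-seen order."""
    clusters = {}
    for peptide, structure in hsps:
        clusters.setdefault(peptide, []).append(structure)
    return clusters


def one_vote_per_cluster(clusters):
    """Keep one representative record per cluster: the most common
    structure observed for that peptide."""
    return [
        (peptide, Counter(structures).most_common(1)[0][0])
        for peptide, structures in clusters.items()
    ]


# Peptides follow the slide; the structure labels are made up.
hsps = (
    [("MYSKILL", "HHHHHHH")] * 3
    + [("MYKKIYL", "CCEEEEC")] * 4
    + [("MYSSILY", "CEEEEEC")] * 2
)
clusters = cluster_identical(hsps)
representatives = one_vote_per_cluster(clusters)
```

With this reduction, nine raw HSPs contribute three votes instead of nine, so a heavily duplicated peptide no longer dominates the tally.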
20/29
Measuring the amount of structural information
[Figure: HSPs laid along the query, with regions marked Found or Unfound. A region with a low local match rate has no information from SSKB7.]
21/29
Construct SSKB with different lengths (to boost coverage)
Training protein → PSI-BLAST search → HSPs → SSKB construction (window length = 7) → SSKB, window length = 7
Training protein → PSI-BLAST search → HSPs → SSKB construction (window length = 5) → SSKB, window length = 5
22/29
Boost match rate using peptide records of different lengths
Protein: MYKKILYPTDFSETAEIALK…
HSPs from SSKB7 (window length = 7):
H 1 2 1 3 6 7 8…
E 1 2 2 0 0 0 1…
C 2 3 8 8 5 4 2…
HSPs from SSKB5 (window length = 5):
H 1 3 2 5 5 5 2…
E 1 3 2 0 0 0 1…
C 2 4 7 7 6 6 7…
23/29
NEW PROSP system
Protein: MYKKILYPTDFSETAEIALK…
Voting scores from SSKB, window length = 7:
H 1 2 1 3 6 7 8…
E 1 2 2 0 0 0 1…
C 2 3 8 8 5 4 2…
Voting scores from SSKB, window length = 5:
H 1 3 2 5 5 5 2…
E 1 3 2 0 0 0 1…
C 2 4 7 7 6 6 7…
Combination at each position x:
$H_{\mathrm{PROSPII}}(x) \leftarrow \mathrm{LMR}_{7\mathrm{mer}}(x) \times H_7(x) + (1 - \mathrm{LMR}_{7\mathrm{mer}}(x)) \times H_5(x)$
$E_{\mathrm{PROSPII}}(x) \leftarrow \mathrm{LMR}_{7\mathrm{mer}}(x) \times E_7(x) + (1 - \mathrm{LMR}_{7\mathrm{mer}}(x)) \times E_5(x)$
$C_{\mathrm{PROSPII}}(x) \leftarrow \mathrm{LMR}_{7\mathrm{mer}}(x) \times C_7(x) + (1 - \mathrm{LMR}_{7\mathrm{mer}}(x)) \times C_5(x)$
PROSP II scores:
H 1 3 2 5 7 6 7…
E 1 3 2 0 0 0 1…
C 2 4 8 8 4 5 6…
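The LMR-weighted combination can be written as a short function. This is a sketch that assumes LMR7mer(x) is expressed as a fraction in [0, 1]; the slides do not define how LMR is computed, so it is taken here as an input. The name `combine_scores` and the toy values are illustrative.

```python
def combine_scores(lmr7, scores7, scores5):
    """Mix 7-mer and 5-mer voting scores at one position.

    lmr7: local match rate of the 7-mer SSKB at this position,
          assumed to be a fraction between 0 and 1.
    scores7, scores5: dicts with keys 'H', 'E', 'C'.
    """
    if not 0.0 <= lmr7 <= 1.0:
        raise ValueError("lmr7 must lie in [0, 1]")
    return {
        label: lmr7 * scores7[label] + (1.0 - lmr7) * scores5[label]
        for label in "HEC"
    }


# Toy values: a low match rate shifts the weight toward the 5-mer scores.
combined = combine_scores(
    0.25,
    {"H": 6, "E": 0, "C": 5},
    {"H": 5, "E": 0, "C": 6},
)
```

When the 7-mer knowledge base covers a region well (LMR close to 1), its scores dominate; where coverage is poor, the shorter and more frequently matched 5-mers take over.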
24/29
Hybrid by Neural Network
Query protein → PSIPRED → H score, E score, C score (3 features)
Query protein → PROSP → H score, E score, C score (3 features)
Query protein → PSI-BLAST → PSSM (20 features)
All features → Neural Network → Final Result
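The input to the network can be pictured as a concatenation of the three feature groups for one residue. This sketch covers only the feature assembly: the network architecture, any input windowing, and score normalization are not described in the slides, and `hybrid_features` is a hypothetical helper name.

```python
def hybrid_features(psipred_scores, prosp_scores, pssm_row):
    """Concatenate per-residue features: 3 PSIPRED scores (H, E, C),
    3 PROSP scores (H, E, C), and the 20 PSSM values."""
    if len(psipred_scores) != 3 or len(prosp_scores) != 3:
        raise ValueError("expected 3 scores (H, E, C) from each predictor")
    if len(pssm_row) != 20:
        raise ValueError("expected 20 PSSM values, one per amino acid")
    return list(psipred_scores) + list(prosp_scores) + list(pssm_row)


# Placeholder values for a single residue.
features = hybrid_features([0.7, 0.1, 0.2], [5.25, 0.0, 5.75], [0] * 20)
```

This gives 26 features per residue before any further processing.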
25/29
Data Sets
Two widely used test sets: CB513 and EVAc4
Derivation of the training sets:
  Get 4,572 unique protein chains (with less than 25% mutual sequence identity) from the DSSP database.
  Further remove protein chains with over 25% sequence identity to the respective test dataset to obtain the corresponding training dataset.
  The final training datasets consist of 4,395 and 4,055 protein chains for EVAc4 and CB513, respectively.
26/29
The respective performance improvement using SSKB5 and SSKB7
[Figure: Q3 (%), axis from 55 to 85, plotted against LMR7mer (%) bins [0,10), [10,20), [20,30), [30,40), [40,50) for the 7-mer SSKB, the 5-mer SSKB, and PROSP II]
Performance of prediction on CB513 by SSKB5, SSKB7 and PROSP II with respect to LMR7mer lower than 50%.
27/29
Performance of HYPROSP II+
Method        Q3     SOV    QH_o   QH_p   QE_o   QE_p   QC_o   QC_p   Info
HYPROSP II+   80.35  78.66  78.65  83.85  61.10  71.27  81.79  76.35  0.44
Errsig         0.84   1.20   1.87   1.75   2.33   2.15   1.05   1.15  0.02
PROFsec       76.54  75.39  67.30  74.00  43.70  43.20  76.80  73.50  0.38
PSIPRED       77.62  76.05  72.90  71.50  38.60  42.30  73.50  76.40  0.38
SAM-T99sec    77.64  75.05  75.50  69.60  38.80  47.30  72.40  75.70  0.39
YASSPP        79.34  78.65  --     --     --     --     --     --     0.42
HYPROSP II    79.32  76.51  81.49  77.85  60.91  68.83  76.98  77.78  0.41
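For reference, Q3 is the percentage of residues whose three-state label (H, E, or C) is predicted correctly. A minimal sketch, with made-up label strings:

```python
def q3(predicted, observed):
    """Percentage of residues whose predicted label equals the
    observed label, over the three states H, E, C."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Toy strings: 7 of 8 residues agree.
score = q3("HHHCCEEC", "HHHHCEEC")
```

The other columns in the table (SOV, the per-state Q values, Info) use different definitions and are not covered by this sketch.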
28/29
Conclusion
HYPROSP II+
  Uses a more robust knowledge-based algorithm, PROSP II.
  More structural information, better prediction.
  Incremental Learning
The general strategy developed in this paper could be used to enhance the performance of similar approaches in other prediction problems.
People
Wen-Lian Hsu, Ting-Yi Sung, Hsin-Nan Lin, Jia-Ming Chang, Ei-Wen Yang