Protein Secondary Structure Prediction: A New Improved Knowledge-Based Method Wen-Lian Hsu Institute...

Protein Secondary Structure Prediction: A New Improved Knowledge-Based Method

Wen-Lian Hsu

Institute of Information Science

Academia Sinica, Taiwan

Outline

Introduction: PSSP, Motivation

Knowledge-Based Method: PROSP

An Improved Hybrid Method: PROSP II, HYPROSP II+

Conclusion

Protein Structures

Primary sequence: MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE

Secondary structures: helices, strands, loops

Tertiary structures: three-dimensional packing of secondary structures

Introduction to PSSP

Protein Secondary Structure Prediction (PSSP) aims to predict a protein's secondary structure based only on its sequence.

Each amino acid is assigned a secondary structure element (SSE): Helix (H), Strand (E), or Coil (C or L).
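A minimal sketch of this per-residue labelling, assuming plain Python strings for both the sequence and the SSE labels. The function names `normalize_sse` and `pair_residues` are hypothetical and are not part of PROSP.

```python
# Three-state alphabet from the slide: Helix, Strand, Coil.
VALID_STATES = {"H", "E", "C"}


def normalize_sse(labels: str) -> str:
    """Map the alternative coil symbol L to C and check the alphabet."""
    normalized = labels.upper().replace("L", "C")
    unknown = set(normalized) - VALID_STATES
    if unknown:
        raise ValueError(f"unexpected SSE symbols: {sorted(unknown)}")
    return normalized


def pair_residues(sequence: str, labels: str) -> list:
    """Pair each amino acid with its SSE label (one label per residue)."""
    labels = normalize_sse(labels)
    if len(sequence) != len(labels):
        raise ValueError("sequence and SSE string must have equal length")
    return list(zip(sequence, labels))
```

The output of a predictor is then simply a string over H, E, C with the same length as the input sequence.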

Motivation

PSSP plays an important role in tertiary structure prediction.

Fischer (1996) improved the tertiary structure prediction accuracy from 59.0 to 71.0 by using PHD to predict SSE.

Yang (2003) improved the tertiary structure prediction accuracy from 71.9 to 79.0 by using PSIPRED to predict SSE.

Predicted SSE can also be employed as features in other prediction algorithms to improve their performance.

Treat PSSP as a Translation Problem

Secondary structure prediction translates a language with a 20-letter alphabet (amino acids) into a language with a 3-letter alphabet (structure elements).

Treating Genomic/Proteomic Sequences as a Language

For proteomic data:

Biology: amino acid, motif, protein

Language: alphabet, word, sentence, paragraph

Protein structure or function corresponds to sentence meaning.

Finding the interrelationships of data: Data Mining, Knowledge Discovery

Matching by Semantics (prediction based on evolutionary information)

• Existing sentences in database (understood):
– His old father gave me a book.
– Joan loves Andy.

• Understanding a new sentence:
– Mary's lovely daughter does not like John.

• Techniques:
– Corpus analysis
– Pattern discovery and matching

• Sequence, semantics (classification, transformation):
– Structure prediction

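Speech Recognition: Example

Sense Disambiguation in English

Selection of homonyms (or senses) in speech recognition

Spoken sentence: 台北市一位小孩走失了 ("A child has gone missing in Taipei City")

Competing candidate words for the same sounds: 台北市 (Taipei City), 小孩 (child), 台北 (Taipei), 適宜 (suitable), 走失 (to go missing), 事宜 (matters), 一位 (one person), 一味 (blindly), 移位 (to shift position)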

How do we represent the context in a protein sequence (or sentence)?

Using motifs as words? Motifs could be too specific and do not provide enough coverage.

What about using k-mers?

– We can build (k-mer, structure) pairs.
– How many k-mers can we get?
– How do we define similar k-mers (under the context)?
– How do we combine the structural information from the k-mers?
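The (k-mer, structure) pairs mentioned above can be sketched with a sliding window. This is only an illustration: the function names are hypothetical, and the structure string in the usage example is made up rather than taken from DSSP.

```python
def kmer_structure_pairs(sequence: str, structure: str, k: int) -> list:
    """Slide a window of length k and pair each k-mer with its structure."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    return [
        (sequence[i:i + k], structure[i:i + k])
        for i in range(len(sequence) - k + 1)
    ]


def possible_kmers(k: int, alphabet_size: int = 20) -> int:
    """Upper bound on distinct k-mers over the 20 amino acids."""
    return alphabet_size ** k


# Illustrative (made-up) structure labels for a 5-residue fragment.
pairs = kmer_structure_pairs("MTYKL", "CEEEC", 3)
```

The bound grows quickly: 20^5 is 3.2 million and 20^7 is 1.28 billion, which is one reason a notion of "similar" k-mers is needed for coverage.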

PROSP: Our knowledge-based method for PSSP

Construct a peptide Sequence-Structure Knowledge Base (SSKB).

Use PSI-BLAST to find all peptides similar to those of the target protein.

Use the similar peptides found in the SSKB to vote for the dominant structure of each amino acid in the target protein.

Using PSI-BLAST to amplify the effect of the DSSP database (create more synonyms)

The number of peptide words is still small (~5 million).

Identify similar peptides:

– For each protein p (with known structure), apply PSI-BLAST against the NR database to find its HSPs (high-scoring segment pairs).
– HSP: an alignment of a subsequence of protein p with a subsequence of another protein q of unknown structure.
– Assign the structure of the "selected" peptides of p to those of q.

These peptides comprise our dictionary (~100 million).

SSKB construction (synonyms)

[Figure: an example of a high-scoring segment pair (HSP) from a PSI-BLAST search result.]

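Prediction at a position x

[Figure: PSI-BLAST matches contribute voting scores H(x), E(x), and C(x) for the residue of unknown structure at position x; in the example shown, x is assigned as helix.]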
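The voting step can be sketched as follows. This is a simplified illustration, not the PROSP implementation: the SSKB is assumed to be a plain dictionary from peptide to observed structure strings, lookup is by exact match (the slides use PSI-BLAST to find similar peptides), and every match casts one unweighted vote. All names are hypothetical.

```python
from collections import Counter


def vote(target: str, sskb: dict, k: int) -> list:
    """Accumulate H/E/C votes per position from matching k-mers.

    sskb maps a peptide of length k to a list of structure strings
    of length k (assumed layout, for illustration only).
    """
    tallies = [Counter() for _ in target]
    for start in range(len(target) - k + 1):
        peptide = target[start:start + k]
        for structure in sskb.get(peptide, []):
            for offset, state in enumerate(structure):
                tallies[start + offset][state] += 1
    return tallies


def predict(tallies: list, default: str = "C") -> str:
    """Pick the state with the highest vote; ties resolve in H, E, C order."""
    return "".join(
        max("HEC", key=lambda s: t[s]) if t else default
        for t in tallies
    )


toy_sskb = {"MTY": ["HHH", "HHC"], "TYK": ["HCC"]}
prediction = predict(vote("MTYK", toy_sskb, 3))
```

Positions with no votes fall back to coil here, which is an arbitrary choice for the sketch.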

Two problems of searching for homologous peptides in protein sequence databases

1. Redundant information generated by duplicate peptides: the voting bias problem in PROSP

2. Poor prediction accuracy due to insufficient knowledge base matching: coverage needs to be boosted

The voting bias problem

The PSI-BLAST results for the query KTYQCQY…

Subject    Structure
KPYQCQY    HHHHHH
KPYQCQY    HHHHHH
KPYQCQY    HHHHHH
KPYQCQY    HHHHHH
KPYQCQY    HHHHHH
KVYQCQY    CCHHHC
QPYRCKY    CCHHHC

Dominant result: the duplicated subject KPYQCQY (HHHHHH)

Clustering HSPs

Query: …MYKKILYPTDFSETAEIALK…

Similar HSPs:

MYSKILL  MYSKILL  MYSKILL
MYKKIYL  MYKKIYL  MYKKIYL  MYKKIYL
MYSSILY  MYSSILY
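One simple way to see the effect of duplicates on voting is to count each distinct (peptide, structure) hit only once. This is an assumption made for illustration; the slides do not spell out how HSPs are clustered or how clusters are weighted. The hit list reuses the peptides and structure strings from the voting bias slide as read from the extracted text.

```python
from collections import Counter


def position_votes(hits: list, collapse_duplicates: bool) -> list:
    """Tally structure votes per position from (peptide, structure) hits."""
    if not hits:
        return []
    if collapse_duplicates:
        # Keep one representative per identical hit, preserving order.
        hits = list(dict.fromkeys(hits))
    tallies = [Counter() for _ in hits[0][1]]
    for _, structure in hits:
        for i, state in enumerate(structure):
            tallies[i][state] += 1
    return tallies


hits = [("KPYQCQY", "HHHHHH")] * 5 + [
    ("KVYQCQY", "CCHHHC"),
    ("QPYRCKY", "CCHHHC"),
]
raw = position_votes(hits, collapse_duplicates=False)
deduplicated = position_votes(hits, collapse_duplicates=True)
```

With the raw hits, the first position gets five helix votes against two coil votes; after collapsing duplicates it gets one helix vote against two coil votes, so the repeated peptide no longer dominates.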

Measuring the amount of structural information

[Figure: a region of the query with a low local match rate (LMR); no information from SSKB7 is found for this region.]

Construct SSKB with different lengths (to boost coverage)

[Figure: training proteins go through a PSI-BLAST search and SSKB construction twice, once with window length = 7 (SSKB7) and once with window length = 5 (SSKB5).]

Boost match rate using peptide records of different lengths

Protein: MYKKILYPTDFSETAEIALK…

HSPs from SSKB7 (window length = 7):

H  1 2 1 3 6 7 8…
E  1 2 2 0 0 0 1…
C  2 3 8 8 5 4 2…

HSPs from SSKB5 (window length = 5):

H  1 3 2 5 5 5 2…
E  1 3 2 0 0 0 1…
C  2 4 7 7 6 6 7…

New PROSP system (PROSP II)

Protein: MYKKILYPTDFSETAEIALK…

Scores from SSKB7 (window length = 7):

H  1 2 1 3 6 7 8…
E  1 2 2 0 0 0 1…
C  2 3 8 8 5 4 2…

Scores from SSKB5 (window length = 5):

H  1 3 2 5 5 5 2…
E  1 3 2 0 0 0 1…
C  2 4 7 7 6 6 7…

The two score sets are combined at each position x, weighted by the local match rate of the 7-mers:

\[
\begin{aligned}
H_{\mathrm{PROSPII}}(x) &\leftarrow \mathrm{LMR}_{7\mathrm{mer}}(x)\times H_{7}(x) + \bigl(1-\mathrm{LMR}_{7\mathrm{mer}}(x)\bigr)\times H_{5}(x)\\
E_{\mathrm{PROSPII}}(x) &\leftarrow \mathrm{LMR}_{7\mathrm{mer}}(x)\times E_{7}(x) + \bigl(1-\mathrm{LMR}_{7\mathrm{mer}}(x)\bigr)\times E_{5}(x)\\
C_{\mathrm{PROSPII}}(x) &\leftarrow \mathrm{LMR}_{7\mathrm{mer}}(x)\times C_{7}(x) + \bigl(1-\mathrm{LMR}_{7\mathrm{mer}}(x)\bigr)\times C_{5}(x)
\end{aligned}
\]

Combined PROSP II scores:

H  1 3 2 5 7 6 7…
E  1 3 2 0 0 0 1…
C  2 4 8 8 4 5 6…
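The PROSP II combination rule is a per-position weighted average of the SSKB7 and SSKB5 scores. The sketch below assumes the local match rate is given as a fraction in [0, 1] for each position; how LMR is computed is not defined on the slides, so it is taken as an input here. Function names and the toy numbers are hypothetical.

```python
def combine_scores(score7: list, score5: list, lmr7: list) -> list:
    """Blend SSKB7 and SSKB5 scores position by position.

    Implements S(x) = LMR7(x) * S7(x) + (1 - LMR7(x)) * S5(x).
    """
    if not (len(score7) == len(score5) == len(lmr7)):
        raise ValueError("all inputs must have one value per position")
    if any(w < 0.0 or w > 1.0 for w in lmr7):
        raise ValueError("LMR values are assumed to lie in [0, 1]")
    return [w * s7 + (1.0 - w) * s5 for s7, s5, w in zip(score7, score5, lmr7)]


def combine_states(scores7: dict, scores5: dict, lmr7: list) -> dict:
    """Apply the same rule to the H, E, and C score tracks."""
    return {s: combine_scores(scores7[s], scores5[s], lmr7) for s in "HEC"}


# Toy numbers, not taken from the slides.
blended = combine_scores([8.0, 2.0], [4.0, 6.0], [0.5, 0.25])
```

Where the 7-mer match rate is high the SSKB7 scores dominate, and where it is low the prediction leans on the better-covered SSKB5 scores.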

Hybrid by Neural Network

[Figure: the query protein is passed to PSIPRED and PSI-BLAST, yielding two sets of H, E, and C scores (3 features each) and a PSSM (20 features); these are fed to a neural network that produces the final result.]
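One plausible way to assemble the network input is to concatenate the per-residue features named in the figure. This is an assumption: the slides give the feature counts (3 and 20) but not the exact layout, any windowing over neighbouring residues, or the network architecture. The names below are hypothetical.

```python
def residue_features(scores_a: list, scores_b: list, pssm_row: list) -> list:
    """Concatenate two H/E/C score triples and one 20-value PSSM row.

    The ordering of the blocks is an arbitrary choice for this sketch.
    """
    if len(scores_a) != 3 or len(scores_b) != 3:
        raise ValueError("each score set must hold H, E, and C values")
    if len(pssm_row) != 20:
        raise ValueError("a PSSM row has one value per amino acid (20)")
    return list(scores_a) + list(scores_b) + list(pssm_row)


# Toy values for a single residue.
features = residue_features([0.7, 0.1, 0.2], [6.0, 0.0, 4.0], [0] * 20)
```

Under this layout each residue is described by 26 numbers, which a classifier then maps to one of the three states.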

Data Sets

Two broadly used test sets: CB513 and EVAc4

Derivation of the training sets:

– Get 4,572 unique protein chains (with less than 25% mutual sequence identity) from the DSSP database.
– Further remove protein chains with sequence identity over 25% to the respective test datasets to obtain their respective training datasets.

The final training datasets consist of 4,395 and 4,055 protein chains for EVAc4 and CB513, respectively.

The respective performance improvement using SSKB5 and SSKB7

[Figure: performance of prediction on CB513 by SSKB5 (5-mer SSKB), SSKB7 (7-mer SSKB), and PROSP II with respect to LMR7mer lower than 50%. Horizontal axis: LMR7mer (%), in bins [0,10), [10,20), [20,30), [30,40), [40,50).]

Performance of HYPROSP II+

Method        Q3     SOV    QH_o   QH_p   QE_o   QE_p   QC_o   QC_p   Info
HYPROSP II+   80.35  78.66  78.65  83.85  61.10  71.27  81.79  76.35  0.44
Errsig         0.84   1.20   1.87   1.75   2.33   2.15   1.05   1.15  0.02
PROFsec       76.54  75.39  67.30  74.00  43.70  43.20  76.80  73.50  0.38
PSIPRED       77.62  76.05  72.90  71.50  38.60  42.30  73.50  76.40  0.38
SAM-T99sec    77.64  75.05  75.50  69.60  38.80  47.30  72.40  75.70  0.39
YASSPP        79.34  78.65  --     --     --     --     --     --     0.42
HYPROSP II    79.32  76.51  81.49  77.85  60.91  68.83  76.98  77.78  0.41
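For readers unfamiliar with the column headers, the sketch below computes Q3 and the per-state scores using their commonly used definitions: Q3 is the percentage of residues whose state is predicted correctly, QX_o is the percentage of observed X residues predicted as X, and QX_p is the percentage of predicted X residues that are observed as X. These definitions are assumed here rather than stated on the slides, and SOV and Info are not covered. Function names are hypothetical.

```python
def q3(observed: str, predicted: str) -> float:
    """Percentage of residues with a correctly predicted state."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


def per_state(observed: str, predicted: str, state: str) -> tuple:
    """Return (Q_state_o, Q_state_p); None where the denominator is zero."""
    both = sum(o == state and p == state for o, p in zip(observed, predicted))
    n_obs = observed.count(state)
    n_pred = predicted.count(state)
    q_obs = 100.0 * both / n_obs if n_obs else None
    q_pred = 100.0 * both / n_pred if n_pred else None
    return q_obs, q_pred


# Toy strings, not data from the benchmark.
observed = "HHHHEECC"
predicted = "HHHCEECC"
overall = q3(observed, predicted)
helix = per_state(observed, predicted, "H")
```

In the toy example one helix residue is mispredicted as coil, so helix recall drops while helix precision stays perfect.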

Conclusion

HYPROSP II+

Using a more robust knowledge-based algorithm, PROSP II

More structural information, better prediction.

Incremental learning

The general strategy developed in this paper could be used to enhance the performance of similar approaches in other prediction problems.

People

Wen-Lian Hsu, Ting-Yi Sung, Hsin-Nan Lin, Jia-Ming Chang, Ei-Wen Yang
