Musite: Prediction of Protein Phosphorylation Sites
Jianjiong Gao
University of Missouri, Columbia
http://musite.sourceforge.net/
Background: Protein Phosphorylation
Protein phosphorylation is one of the most important post-translational modifications.
It has been estimated that up to 50% of proteins are phosphorylated in some cellular state.
Abnormal phosphorylation is a cause or consequence of many diseases:
  Cancer
  Diabetes
  Parkinson's disease
  Hepatitis B
  …
Background: Protein Phosphorylation
Phosphorylation-dephosphorylation is a biochemical switch system that regulates various cellular processes.
Catalyzed by various specific protein kinases.
[Diagram: Kinase → ON; Phosphatase → OFF]
Phosphorylation Site Prediction: Problem Formulation
Phosphorylation site: a phosphorylated amino acid in a protein (determined by protein sequence)
General phosphorylation site prediction: to predict whether an amino acid can be phosphorylated
Kinase-specific phosphorylation site prediction: to predict whether an amino acid can be phosphorylated by a specific kinase
Based on protein sequence only
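Because the prediction uses protein sequence only, every candidate residue can be represented by the local sequence window around it. The sketch below is a minimal illustration of that idea, not Musite's own code: the function name `candidate_windows`, the flank size, and the `-` padding character are assumptions made for the example.

```python
def candidate_windows(sequence, residues="STY", flank=6, pad="-"):
    """Yield (position, residue, window) for each candidate site.

    position is 1-based. The window covers `flank` residues on each side
    of the site and is padded with `pad` where it runs past a terminus.
    All parameter defaults are illustrative assumptions.
    """
    padded = pad * flank + sequence + pad * flank
    for i, aa in enumerate(sequence):
        if aa in residues:
            # Residue i of `sequence` sits at index i + flank in `padded`,
            # so the slice below is centred on it.
            yield i + 1, aa, padded[i:i + 2 * flank + 1]


if __name__ == "__main__":
    for pos, aa, window in candidate_windows("MKRSPEYLT", flank=3):
        print(pos, aa, window)
```

Restricting `residues` to "ST" or "Y" would correspond to separate Ser/Thr and Tyr predictions.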
Limitations of Current Methods
Current prediction tools have limitations when applied to whole proteomes:
  Prediction accuracy could be improved
  Most were released as web servers and restrict the data that users can upload
  Training data are out of date
  Stringency adjustment is not fully supported
Our tool Musite is unique
Novel method with better accuracy
First open-source tool in the field that meets the OSI Open Standards Requirement
Standalone program designed for proteome-scale prediction
Supports both general and kinase-specific phosphorylation site prediction
Supports customized model training
Supports continuous stringency adjustment
Phosphorylation Site Prediction: Flowchart
[Flowchart, in order:]
  Data collection from high-quality sources, such as UniProt/Swiss-Prot, Phospho.ELM, PhosphoPep, and PhosPhAt
  Non-redundant datasets built by BLASTclust
  Training data (phosphorylation sites and non-phosphorylation sites) and control data
  Feature extraction: KNN scores, disorder scores, amino acid frequencies
  Features from positive set and features from negative set
  Bootstrap sample 1 ... bootstrap sample m
  Training: classifier 1 ... classifier m
  Aggregating
  Specificity estimation on control data
  Phosphorylation prediction model
  Making predictions on new data
Phosphorylation Site Prediction: Data Extraction
[Flowchart, data extraction stage: data collection from high-quality sources, such as UniProt/Swiss-Prot, Phospho.ELM, PhosphoPep, and PhosPhAt; non-redundant datasets built by BLASTclust; training data and control data]
Phosphorylation Site Prediction: Feature Extraction
[Flowchart, feature extraction stage: KNN scores, disorder scores, and amino acid frequencies; features from positive set and features from negative set]
KNN Features: Motivation
Rationale for using KNN features: local sequence clusters exist around phosphorylation sites, since
  Each phosphorylation site is a substrate of a specific protein kinase
  Substrates of the same kinase or kinase family usually share similar patterns in local sequences
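If local sequences around phosphorylation sites cluster, a query window should find more known phosphosites than non-phosphosites among its nearest neighbors. The sketch below illustrates one way to turn that into a score: the fraction of positives among the k nearest labelled windows. The Hamming distance and the function names are assumptions chosen to keep the example dependency-free; the slides do not state which sequence distance Musite uses.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length windows."""
    return sum(x != y for x, y in zip(a, b))


def knn_score(query, positives, negatives, k):
    """Fraction of the k nearest labelled windows that are phosphosites.

    Hypothetical helper: distance is plain Hamming distance, and ties are
    broken by list order (positives first) because the sort is stable.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    labelled = [(hamming(query, w), 1) for w in positives]
    labelled += [(hamming(query, w), 0) for w in negatives]
    labelled.sort(key=lambda pair: pair[0])
    nearest = labelled[:k]
    return sum(label for _, label in nearest) / len(nearest)


if __name__ == "__main__":
    # Toy windows invented for illustration; not data from the study.
    pos = ["RRASPEE", "KRASPED", "RKGSPEE"]
    neg = ["LLVSCWF", "IVMSFYC", "LIVSWCF"]
    print(knn_score("RRGSPEE", pos, neg, k=3))
```

With equally sized positive and negative sets, a window with no preference for either class would score about 0.5, which is the scale on which the reported averages can be read.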
KNN Features: Result
Overall, phosphosites have larger KNN scores than non-phosphosites
Average KNN scores
  0.7~0.8 for phosphosites
  ≈0.5 for non-phosphosites
[Figure: Boxplot of KNN features (Human Ser/Thr), phospho vs. non-phospho. x-axis: size of nearest neighbors (% of sample size): 0.25, 0.5, 1, 2, 4; y-axis: KNN score]
Disorder Features: Concept & Rationale
Disordered region (structure)
  Some parts of a protein have a rigid structure, such as α-helix and β-sheet.
  Other parts, disordered regions, do not have well-defined conformations.
The conformational flexibility of disordered regions may facilitate protein phosphorylation.
[Dunker, 2008]: protein phosphorylation sites are frequently located within disordered regions.
Disorder Features: Result
For phosphosites
  Occurrence increases exponentially when disorder score increases
For non-phosphosites
  Significantly different distribution
Disorder score > 0.5
  Phosphosites: ~91%
  Non-phosphosites: ~55%
Phosphosites are significantly over-represented in disordered regions
[Figure: Histogram of disorder features (Human Ser/Thr). (A) Phospho-S/T in H. sapiens; (B) Non-phospho-S/T in H. sapiens. x-axis: disorder score; y-axis: occurrence]
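The "disorder score > 0.5" percentages are simple proportions over per-site disorder scores. The sketch below shows that calculation under the assumption that each site already has a disorder score between 0 and 1 from some external predictor; the function name and the toy numbers are invented for illustration and are not data from the study.

```python
def fraction_disordered(disorder_scores, cutoff=0.5):
    """Fraction of sites whose disorder score is strictly above `cutoff`."""
    if not disorder_scores:
        raise ValueError("need at least one score")
    return sum(s > cutoff for s in disorder_scores) / len(disorder_scores)


if __name__ == "__main__":
    # Made-up scores, only to show the call; not results from the slides.
    toy_phospho = [0.92, 0.81, 0.67, 0.55, 0.40]
    toy_nonphospho = [0.75, 0.52, 0.45, 0.30, 0.12]
    print(fraction_disordered(toy_phospho))
    print(fraction_disordered(toy_nonphospho))
```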
Amino Acid Frequencies: Result
P, R, D, E, S, K, and G are enriched around phosphosites
C, W, Y, F, I, M, L, H, T, and V are depleted
[Figure: Log2(ratio of frequency) per amino acid, in the order P R D E S K G A Q N V T H L M I F Y W C, for H. sapiens (S/T), M. musculus (S/T), D. melanogaster (S/T), C. elegans (S/T), S. cerevisiae (S/T), and A. thaliana (S/T)]
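Enrichment and depletion of this kind can be expressed as a log2 ratio of amino acid frequencies in the two sets of windows. The sketch below assumes the ratio is the frequency around phosphosites divided by the frequency around non-phosphosites, so positive values mean enrichment. The pseudocount and the function names are assumptions added to keep the example well defined, and the windows passed in are expected to exclude the central residue.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def frequencies(windows, pseudocount=1.0):
    """Relative frequency of each amino acid over all windows.

    The pseudocount (an assumption of this sketch) avoids division by
    zero for residues that never occur in one of the sets.
    """
    counts = Counter(aa for w in windows for aa in w if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def log2_ratio(pos_windows, neg_windows):
    """log2(frequency near phosphosites / frequency near non-phosphosites)."""
    fp, fn = frequencies(pos_windows), frequencies(neg_windows)
    return {aa: math.log2(fp[aa] / fn[aa]) for aa in AMINO_ACIDS}


if __name__ == "__main__":
    # Toy windows invented for illustration; not data from the study.
    ratios = log2_ratio(["PPRR"], ["CCWW"])
    for aa in "PRCWA":
        print(aa, round(ratios[aa], 3))
```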
Phosphorylation Site Prediction: Classifier Training
[Flowchart, classifier training stage: bootstrap sample 1 ... bootstrap sample m drawn from the features of the positive and negative sets; training of classifier 1 ... classifier m; aggregating; specificity estimation on control data; phosphorylation prediction model]
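The bootstrap-train-aggregate part of the flowchart can be sketched as follows. This is a minimal illustration under stated assumptions: the base learner is a toy nearest-centroid scorer chosen to avoid external dependencies and is not claimed to be the classifier Musite uses, aggregation is assumed to be a plain average of the m scores, and all function names are hypothetical.

```python
import random


def train_centroid(pos, neg):
    """Toy base learner: scores a feature vector by its relative closeness
    to the positive centroid versus the negative centroid (0 to 1)."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    cp, cn = centroid(pos), centroid(neg)

    def score(x):
        dp = sum((a - b) ** 2 for a, b in zip(x, cp))
        dn = sum((a - b) ** 2 for a, b in zip(x, cn))
        return dn / (dp + dn) if dp + dn else 0.5

    return score


def train_bagged(pos, neg, m=10, seed=0):
    """Train m classifiers on bootstrap samples and average their scores."""
    rng = random.Random(seed)
    models = []
    for _ in range(m):
        # Bootstrap: draw with replacement, same size as the original set.
        boot_pos = [rng.choice(pos) for _ in pos]
        boot_neg = [rng.choice(neg) for _ in neg]
        models.append(train_centroid(boot_pos, boot_neg))
    # Aggregating step: mean of the m classifier scores (an assumption).
    return lambda x: sum(model(x) for model in models) / len(models)


if __name__ == "__main__":
    # Toy feature vectors (for example KNN score, disorder score); invented.
    pos = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.95]]
    neg = [[0.4, 0.2], [0.5, 0.3], [0.45, 0.1]]
    model = train_bagged(pos, neg, m=5, seed=1)
    print(model([0.85, 0.9]), model([0.45, 0.2]))
```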
Results: Trained Models
General Prediction
  Human ser/thr
  Human tyr
  Mouse ser/thr
  Mouse tyr
  Fruit fly ser/thr
  Worm ser/thr
  Yeast ser/thr
  Arabidopsis ser/thr
Kinase-Specific Prediction
  ATM
  CDK/CDK1/CDK2
  CK1/CK2
  MAPK1/MAPK3
  PKA
  PKB
  PKC
  Src
Results: Cross validation
[Figure: ROC curves, sensitivity vs. 1 - specificity, shown for 1 - specificity from 0 to 1 and from 0 to 0.1. Curves: C. elegans (S/T), A. thaliana (S/T), H. sapiens (S/T), M. musculus (S/T), S. cerevisiae (S/T), D. melanogaster (S/T), M. musculus (Y), H. sapiens (Y), and random guess]
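Each point on a curve of sensitivity against 1 - specificity comes from one score threshold, and choosing a threshold for a desired specificity is one way to adjust prediction stringency. The sketch below shows both calculations on prediction scores; the function names, the "score >= threshold" decision rule, and the toy scores are assumptions for illustration, not Musite's implementation.

```python
def roc_point(pos_scores, neg_scores, threshold):
    """Return (1 - specificity, sensitivity) when a site is called
    positive if its score is >= threshold."""
    tp = sum(s >= threshold for s in pos_scores)
    fp = sum(s >= threshold for s in neg_scores)
    return fp / len(neg_scores), tp / len(pos_scores)


def threshold_for_specificity(control_scores, specificity):
    """Lowest observed control score that, used as the threshold, reaches
    at least the requested specificity on the control (negative) scores."""
    n = len(control_scores)
    for t in sorted(set(control_scores)):
        false_positives = sum(s >= t for s in control_scores)
        if 1 - false_positives / n >= specificity:
            return t
    # No observed score is strict enough; nothing would be called positive.
    return float("inf")


if __name__ == "__main__":
    # Toy scores invented for illustration; not results from the slides.
    pos = [0.9, 0.8, 0.6, 0.3]
    neg = [0.7, 0.4, 0.2, 0.1]
    t = threshold_for_specificity(neg, 0.75)
    print(t, roc_point(pos, neg, t))
```

Raising the requested specificity moves the threshold up and trades sensitivity for fewer false positives, which matters most in the low false-positive range when predicting over whole proteomes.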
Results: Comparison to other tools
[Figure: ROC curves, sensitivity vs. 1 - specificity, shown for 1 - specificity from 0 to 1 and from 0 to 0.1. Curves: Musite, Scan-x, DISPHOS, and NetPhos]
Phosphorylation Site Prediction: Software Implementation - Musite
Open source
  License: GNU General Public License (GPL)
  http://musite.sourceforge.net/
Stand-alone application
  Based on Java
  Supports Windows, Linux, and Mac OS X
A web server is also being developed
  http://musite.net/
Implementation: User Interface
Implementation: Customized Model Training
A unique utility for users to train prediction models from their own data
  Take advantage of latest data
  Train disease-specific models
  Train organ-specific models
  Integrate into experimental procedure in an iterative way
Summary
Musite predicts general and kinase-specific phosphosites with better accuracy
Musite is an open-source standalone program capable of performing proteome-wide predictions
Acknowledgements
Dr. Dong Xu (University of Missouri)
Dr. Jay Thelen (University of Missouri)
Dr. Keith Dunker (Indiana University)
Curtis Bollinger (University of Missouri)
Funding
  NSF [# DBI-0604439]
  NIH [# R21/R33 GM078601]
Visit us at
  http://musite.sourceforge.net
  http://musite.net
  Poster R09 at ISMB