Ensemble Learning Model for MS/MS

Ensemble Learning Model to Estimate the Accuracy of Peptide Identifications Made by MS/MS

Qiang Kou Shan Xiao Xiaohui Yao Zongliang Yue

[email protected]

April 29, 2014

Qiang Kou ([email protected]) Ensemble Learning 1/25 April 29, 2014 1 / 25

Background

Mass spectrometry has become the most widely used tool for the characterization of proteins

Many database search tools and algorithms have been developed, including SEQUEST, MASCOT, X!Tandem, InsPecT, and MS-Align+

Their scores typically overlap substantially between correct and incorrect identifications


[Figure: histogram of the number of spectra in each bin (y-axis, 0 to 200) against the discriminant score D (x-axis, -3.9 to 7.3), with overlapping "Correct" and "Incorrect" distributions. From Brian C. Searle.]

Available Software

PeptideProphet [1]

$F(x_1, x_2, \ldots, x_n) = c_0 + \sum_{i=1}^{n} c_i x_i$

$p(+ \mid F) = \frac{p(F \mid +)\,p(+)}{p(F \mid +)\,p(+) + p(F \mid -)\,p(-)}$
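As a minimal sketch of the two PeptideProphet formulas above, the following Python computes a linear discriminant score and then applies Bayes' rule. The function names, coefficients, and likelihood values are hypothetical; in practice p(F|+) and p(F|−) come from distributions fitted to the observed scores, which is not shown here.

```python
def discriminant_score(x, c0, c):
    """F(x_1..x_n) = c0 + sum_i c_i * x_i."""
    return c0 + sum(ci * xi for ci, xi in zip(c, x))


def posterior_correct(lik_pos, lik_neg, prior_pos):
    """p(+|F) from p(F|+), p(F|-), and the prior p(+), with p(-) = 1 - p(+)."""
    numerator = lik_pos * prior_pos
    return numerator / (numerator + lik_neg * (1.0 - prior_pos))


# Hypothetical numbers, for illustration only.
F = discriminant_score(x=[2.0, 0.5], c0=-1.0, c=[1.5, 2.0])
p = posterior_correct(lik_pos=0.30, lik_neg=0.05, prior_pos=0.4)
print(F, round(p, 3))
```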

Percolator [2, 3]

$F(x) = \sum_i w_i h_i(x) + b$, where $h_k(x) = \tanh((w^k)^T x + b^k)$

Sigmoid loss function: $L(F(x), y) = \frac{1}{1 + \exp(y F(x))}$
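The Percolator scoring function above is a small neural network with tanh hidden units. The sketch below evaluates it for given weights; all names are hypothetical, no training is shown, and the loss is assumed to take the form 1 / (1 + exp(y F(x))) with labels y in {-1, +1}.

```python
import math


def hidden_unit(x, w_k, b_k):
    """h_k(x) = tanh(w_k . x + b_k)."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w_k, x)) + b_k)


def network_score(x, hidden_weights, hidden_biases, out_weights, out_bias):
    """F(x) = sum_i w_i * h_i(x) + b."""
    hidden = [hidden_unit(x, w_k, b_k)
              for w_k, b_k in zip(hidden_weights, hidden_biases)]
    return sum(w * h for w, h in zip(out_weights, hidden)) + out_bias


def sigmoid_loss(score, y):
    """Assumed form 1 / (1 + exp(y * F(x))), with y in {-1, +1}."""
    return 1.0 / (1.0 + math.exp(y * score))
```

With this form, a confidently correct prediction (large y F(x)) has a loss near 0 and a confidently wrong one has a loss near 1, so the loss is bounded for any single example.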


Trans-Proteomic Pipeline

[Figure: pipeline diagram. mzXML, X!Tandem, then PeptideProphet, Percolator, or Ensemble Learning as alternative validation steps, then ProteinProphet, then Proteins.]


Ensemble Learning

Homogeneous: learners from the same category

boosting, bagging, random forest

Heterogeneous: learners from different categories


Example of Ensemble Learning

Two real variables

Two random pseudo variables

Three methods: linear model, SVM, and random forest


Results of Three Methods


Average of Three Methods
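Averaging the three methods can be sketched as an equal-weight mean of their per-sample predictions. The function name and the prediction values below are made up for illustration and are not the values behind the original figure.

```python
def average_predictions(*model_outputs):
    """Average the per-sample predictions of several models (equal weights)."""
    n_models = len(model_outputs)
    return [sum(values) / n_models for values in zip(*model_outputs)]


# Made-up predictions from three models on four samples.
linear_model = [0.2, 0.7, 0.9, 0.4]
svm = [0.1, 0.6, 0.8, 0.5]
random_forest = [0.3, 0.8, 0.7, 0.6]
ensemble = average_predictions(linear_model, svm, random_forest)
print(ensemble)
```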


Ensemble Strategy

Non-negative Least Squares and Logistic Regression

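Non-negative least squares regression

$f_e(X) = \sum_{i=1}^{k} \alpha_i f_i(X), \quad \sum_{i=1}^{k} \alpha_i = 1, \quad \alpha_i \ge 0$

Non-negative logistic regression

$f_e(X) = \frac{1}{1 + \exp\left(-\sum_{i=1}^{k} \alpha_i f_i(X)\right)}, \quad \alpha_i \ge 0$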
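One way to find weights that satisfy the least squares constraints above (non-negative, summing to 1) is projected gradient descent on the mean squared error. This is a sketch of that swapped-in technique, not a dedicated NNLS solver and not necessarily what the authors used; the function names, step size, and iteration count are assumptions.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {a : a_i >= 0, sum_i a_i = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumulative += uj
        t = (cumulative - 1.0) / j
        if uj - t > 0:
            theta = t  # keep the threshold from the last index that qualifies
    return [max(x - theta, 0.0) for x in v]


def fit_ensemble_weights(model_preds, y, steps=2000, lr=0.1):
    """Minimize mean((sum_i alpha_i f_i(X) - y)^2) over the simplex.

    model_preds: list of k prediction lists, one per base model.
    lr must be small enough for the scale of the predictions.
    """
    k, n = len(model_preds), len(y)
    alpha = [1.0 / k] * k
    for _ in range(steps):
        residual = [sum(a * f[j] for a, f in zip(alpha, model_preds)) - y[j]
                    for j in range(n)]
        grad = [2.0 / n * sum(r * fj for r, fj in zip(residual, f))
                for f in model_preds]
        alpha = project_to_simplex([a - lr * g for a, g in zip(alpha, grad)])
    return alpha
```

For example, if one base model reproduces the labels exactly and the others do not, nearly all of the weight should end up on that model.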


Greedy Strategy

1 Start with the empty ensemble;

2 Add the model that maximizes the ensemble's classification performance on the training dataset;

3 Repeat Step 2 for a fixed number of iterations;

4 Return the final ensemble.
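The four steps above can be sketched as follows. Several details are assumptions not stated in the slides: the ensemble prediction is the plain average of the selected models, performance is measured as accuracy at a 0.5 threshold, and a model may be selected more than once. All names are hypothetical.

```python
def accuracy(probs, labels, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    hits = sum((p >= threshold) == bool(y) for p, y in zip(probs, labels))
    return hits / len(labels)


def greedy_ensemble(model_preds, labels, n_iter):
    """Greedy forward selection of models on the training set.

    model_preds: dict mapping model name -> list of predicted probabilities.
    Returns the list of selected model names, in selection order.
    """
    chosen = []                      # Step 1: start with the empty ensemble
    total = [0.0] * len(labels)      # running sum of selected predictions
    for _ in range(n_iter):          # Step 3: fixed number of iterations
        best_name, best_score, best_total = None, -1.0, None
        for name, preds in model_preds.items():
            cand_total = [t + p for t, p in zip(total, preds)]
            cand_avg = [t / (len(chosen) + 1) for t in cand_total]
            score = accuracy(cand_avg, labels)
            if score > best_score:   # Step 2: keep the best candidate
                best_name, best_score, best_total = name, score, cand_total
        chosen.append(best_name)
        total = best_total
    return chosen                    # Step 4: return the final ensemble
```

Because selection is scored on the training data, a separate validation set is a sensible guard against overfitting.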


Application in MS/MS




Available Features

Symbol                                   Description
mass                                     precursor neutral mass
time                                     retention time
∆M                                       mass difference
#match                                   number of matched ions
pepLen                                   peptide length
charge                                   charge state
exp                                      E-value
#missed                                  number of missed cleavages
enzN                                     whether preceded by an enzymatic site
enzC                                     whether there is an enzymatic C-terminus
#consistent                              number of peptide termini consistent with cleavage
#ions                                    number of fragment ions predicted for the peptide
#proteins                                number of proteins containing the peptide
Arg, ..., Val                            count of each kind of amino acid
Hyperscore, Nextscore, BScore, YScore    scoring functions in X!Tandem



Weights in Regularized Generalized Linear Model

Description    Weight     Description    Weight
#missed        -1.923     Arg            -1.321
charge          1.246     Cys            -1.062
Lys            -0.990     His             0.790
Trp             0.726     #consistent    -0.494
Pro             0.407     Asp             0.388
Met            -0.369     Val             0.350
bscore         -0.347     Tyr             0.238
#ions           0.210



Model Used

Algorithm       Description            R Package
glm             linear model           stats
randomForest    random forest          randomForest
knn             k-nearest neighbour    stats
glmnet          elastic net            glmnet
svm             SVM                    e1071
step            stepwise glm           stats



Training and Testing Dataset

Paola Picotti, et al. Nature 494:266-270, 2013 [4]



ROC Curves

[Figure: ROC curves, true positive rate against false positive rate (both 0.0 to 1.0). Legend: Ensemble Learning 0.873, Percolator 0.821, PeptideProphet 0.789.]
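ROC curves are commonly summarized by the area under the curve (AUC); reading the legend values as AUCs is an assumption. A minimal sketch of the AUC as a pairwise ranking statistic, with a hypothetical function name:

```python
def roc_auc(scores, labels):
    """Probability that a random correct identification outscores a random
    incorrect one; ties count as half. labels use 1 = correct, 0 = incorrect."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect label")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The double loop is quadratic in the number of identifications, which is fine for a sketch but slow for large datasets.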



Relation between FDR and Ensemble Score with LOESS

[Figure: scatter plot of FDR (y-axis, 0.00 to 1.00) against ensemble score (x-axis, 0.00 to 1.00), with a LOESS fit.]




Number of Correct/Incorrect Identifications with 0.05 FDR

[Figure: bar chart of the number of correct and incorrect identifications (y-axis, 0 to 1000) for each method: PeptideProphet, Percolator, and Ensemble.]
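Counting identifications at a fixed FDR can be sketched as below, assuming each identification has a known correct/incorrect label (as in a benchmark dataset) and that FDR is the fraction of incorrect identifications among those accepted. The slides do not state how FDR was estimated, so this is an assumption; the function name is hypothetical and ties in score are not treated specially.

```python
def counts_at_fdr(scores, is_correct, max_fdr=0.05):
    """Return (n_correct, n_incorrect) for the largest set of top-scoring
    identifications whose FDR does not exceed max_fdr."""
    ranked = sorted(zip(scores, is_correct), key=lambda t: t[0], reverse=True)
    best = (0, 0)
    correct = incorrect = 0
    for _, ok in ranked:
        if ok:
            correct += 1
        else:
            incorrect += 1
        # FDR among everything accepted so far
        if incorrect / (correct + incorrect) <= max_fdr:
            best = (correct, incorrect)
    return best
```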



Some Conclusions

Ensemble learning methods often give better results than individual models

Very easy to overfit the training data

Time-consuming for model training



References

[1] Andrew Keller, Alexey I Nesvizhskii, Eugene Kolker, and Ruedi Aebersold. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Analytical Chemistry, 74(20), 2002.

[2] Lukas Käll, Jesse D Canterbury, Jason Weston, William Stafford Noble, and Michael J MacCoss. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nature Methods, 4(11), 2007.

[3] Marina Spivak, Jason Weston, Léon Bottou, Lukas Käll, and William Stafford Noble. Improvements to the percolator algorithm for peptide identification from shotgun proteomics data sets. Journal of Proteome Research, 8(7):3737–3745, 2009.

[4] Paola Picotti, Mathieu Clément-Ziza, Henry Lam, David S Campbell, Alexander Schmidt, Eric W Deutsch, Hannes Röst, Zhi Sun, Oliver Rinner, Lukas Reiter, Qin Shen, Jacob J Michaelson, Andreas Frei, Simon Alberti, Ulrike Kusebauch, Bernd Wollscheid, Robert L Moritz, Andreas Beyer, and Ruedi Aebersold. A complete mass-spectrometric map of the yeast proteome applied to quantitative trait analysis. Nature, 494(7436):266–270, 2013.


Thank you!

http://qkou.info/sl.pdf