Ensemble Learning Model to Estimate the Accuracy of Peptide Identifications Made by MS/MS

Course project for B529

Page 1: Ensemble Learning Model for MS/MS

Ensemble Learning Model to Estimate the Accuracy of Peptide Identifications Made by MS/MS

Qiang Kou Shan Xiao Xiaohui Yao Zongliang Yue

[email protected]

April 29, 2014

Qiang Kou ([email protected]) Ensemble Learning 1/25 April 29, 2014 1 / 25

Page 2: Ensemble Learning Model for MS/MS

Background

Mass spectrometry has become the most widely used tool for the characterization of proteins

Many database search programs and algorithms have been developed, including SEQUEST, MASCOT, X!Tandem, InsPecT, and MS-Align+

Score distributions for correct and incorrect identifications overlap substantially

Page 3: Ensemble Learning Model for MS/MS

Background

[Figure: histogram of the number of spectra in each bin (0 to 200) against the discriminant score (D) (-3.9 to 7.3), with two distributions labeled "Correct" and "Incorrect". From Brian C. Searle.]

Page 4: Ensemble Learning Model for MS/MS

Background

Available Software

PeptideProphet [1]

$F(x_1, x_2, \ldots, x_n) = c_0 + \sum_{i=1}^{n} c_i x_i$

$p(+ \mid F) = \dfrac{p(F \mid +)\,p(+)}{p(F \mid +)\,p(+) + p(F \mid -)\,p(-)}$

Percolator [2, 3]

$F(x) = \sum_i w_i h_i(x) + b$, where $h_k(x) = \tanh\left((w^k)^T x + b^k\right)$

Sigmoid loss function: $L(F(x), y) = \dfrac{1}{1 + \exp(y F(x))}$
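The PeptideProphet formulas above can be illustrated with a minimal Python sketch. It assumes Gaussian class-conditional densities purely for illustration; the helper names, the densities, and all numeric values are hypothetical and are not the distributions or parameters fitted by PeptideProphet itself.

```python
import math


def discriminant_score(features, coefficients, intercept):
    """F(x_1..x_n) = c_0 + sum_i c_i * x_i."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))


def gaussian_pdf(value, mean, std):
    """Density of a normal distribution (illustrative choice of p(F|+) and p(F|-))."""
    z = (value - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posterior_correct(f, prior_correct, pdf_correct, pdf_incorrect):
    """p(+|F) = p(F|+)p(+) / (p(F|+)p(+) + p(F|-)p(-)), with p(-) = 1 - p(+)."""
    numerator = pdf_correct(f) * prior_correct
    denominator = numerator + pdf_incorrect(f) * (1.0 - prior_correct)
    return numerator / denominator


# Hypothetical densities: correct scores centered at 2, incorrect at -2.
def pdf_plus(f):
    return gaussian_pdf(f, 2.0, 1.0)


def pdf_minus(f):
    return gaussian_pdf(f, -2.0, 1.0)
```

With these hypothetical densities, a score midway between the two means carries no evidence either way, so the posterior equals the prior; higher scores push the posterior toward 1.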

Page 5: Ensemble Learning Model for MS/MS

Background

Trans-Proteomic Pipeline

[Diagram: mzXML → X!Tandem → PeptideProphet / Percolator / Ensemble Learning → ProteinProphet → Proteins]

Page 6: Ensemble Learning Model for MS/MS

Ensemble Learning


Page 7: Ensemble Learning Model for MS/MS

Ensemble Learning

Homogeneous: learners from the same category

boosting

bagging

random forest

Heterogeneous: learners from different categories

Page 8: Ensemble Learning Model for MS/MS

Ensemble Learning

Example of Ensemble Learning

Two real variables

Two random pseudo variables

Three methods: linear model, SVM, and random forest

Page 9: Ensemble Learning Model for MS/MS

Ensemble Learning

Results of Three Methods


Page 10: Ensemble Learning Model for MS/MS

Ensemble Learning

Average of Three Methods

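Averaging the outputs of several learners can be sketched in a few lines of Python. The prediction values below are hypothetical and stand in for the outputs of the three methods named in the example; the figures from the slides are not reproduced here.

```python
def average_predictions(model_predictions):
    """Element-wise mean of the predictions from several models."""
    return [sum(column) / len(column) for column in zip(*model_predictions)]


def mean_squared_error(predictions, labels):
    """Mean squared difference between predictions and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical labels and per-model predicted probabilities.
labels = [1, 0, 1, 0]
linear_model = [0.9, 0.6, 0.4, 0.2]
svm = [0.6, 0.1, 0.8, 0.7]
random_forest = [0.7, 0.3, 0.9, 0.4]

models = [linear_model, svm, random_forest]
ensemble = average_predictions(models)

individual_errors = [mean_squared_error(m, labels) for m in models]
ensemble_error = mean_squared_error(ensemble, labels)
```

Because squared error is convex, the error of the averaged prediction is never larger than the mean of the individual errors, which is one reason simple averaging is a common baseline for ensembles.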

Page 11: Ensemble Learning Model for MS/MS

Ensemble Strategy


Page 12: Ensemble Learning Model for MS/MS

Ensemble Strategy

Non-negative Least Squares and Logistic Regression

Non-negative least squares regression

$f_e(X) = \sum_{i=1}^{k} \alpha_i f_i(X), \quad \sum_{i=1}^{k} \alpha_i = 1, \quad \alpha_i \ge 0$

Non-negative logistic regression

$f_e(X) = \dfrac{1}{1 + \exp\left(-\sum_{i=1}^{k} \alpha_i f_i(X)\right)}, \quad \alpha_i \ge 0$
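One way to obtain weights that satisfy the constraints above is projected gradient descent onto the probability simplex. The following is a sketch under stated assumptions, not the fitting procedure used in the project: the function names, the fixed step size, and the iteration count are all hypothetical choices, and the step size assumes every model prediction lies in [0, 1].

```python
import math


def project_to_simplex(v):
    """Euclidean projection onto {a : a_i >= 0, sum_i a_i = 1}."""
    cumulative, theta = 0.0, 0.0
    for j, u in enumerate(sorted(v, reverse=True), start=1):
        cumulative += u
        t = (cumulative - 1.0) / j
        if u - t > 0:
            theta = t  # the last index where this holds defines the shift
    return [max(x - theta, 0.0) for x in v]


def fit_simplex_weights(model_predictions, labels, n_iter=2000):
    """Minimize the mean squared error of sum_i alpha_i f_i(X) over the simplex."""
    k, n = len(model_predictions), len(labels)
    weights = [1.0 / k] * k
    step = 1.0 / (2.0 * k)  # assumption: predictions are in [0, 1]
    for _ in range(n_iter):
        combined = [
            sum(w * preds[j] for w, preds in zip(weights, model_predictions))
            for j in range(n)
        ]
        residuals = [c - y for c, y in zip(combined, labels)]
        gradient = [
            2.0 / n * sum(r * p for r, p in zip(residuals, preds))
            for preds in model_predictions
        ]
        weights = project_to_simplex(
            [w - step * g for w, g in zip(weights, gradient)]
        )
    return weights


def logistic_ensemble(weights, scores):
    """f_e(X) = 1 / (1 + exp(-sum_i alpha_i f_i(X))) for one observation."""
    return 1.0 / (1.0 + math.exp(-sum(w * s for w, s in zip(weights, scores))))
```

The projection keeps every weight non-negative and makes them sum to one after each gradient step, so the combined prediction stays a convex combination of the base models.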

Page 13: Ensemble Learning Model for MS/MS

Ensemble Strategy

Greedy Strategy

1 Start with the empty ensemble;

2 Add the model that most improves the ensemble’s classification performance on the training dataset;

3 Repeat Step 2 for a fixed number of iterations;

4 Return the final ensemble.
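The four steps above can be sketched as follows. Several details are assumptions that the slides do not specify: the ensemble prediction is taken as the plain average of the selected models, performance is measured as accuracy at a 0.5 threshold, a model may be selected more than once, and ties are broken by model name. The model names and prediction values in any usage are hypothetical.

```python
def accuracy(predictions, labels, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    hits = sum((p >= threshold) == bool(y) for p, y in zip(predictions, labels))
    return hits / len(labels)


def ensemble_average(names, model_predictions):
    """Average the predictions of the named models (repeats count twice)."""
    columns = zip(*(model_predictions[name] for name in names))
    return [sum(column) / len(column) for column in columns]


def greedy_ensemble(model_predictions, labels, n_iter):
    """Greedy forward selection of models, allowing repeated selection."""
    selected = []  # Step 1: start with the empty ensemble
    for _ in range(n_iter):  # Step 3: fixed number of iterations
        best_name, best_score = None, -1.0
        for name in sorted(model_predictions):  # sorted for deterministic ties
            candidate = ensemble_average(selected + [name], model_predictions)
            score = accuracy(candidate, labels)
            if score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)  # Step 2: add the best model
    return selected  # Step 4: return the final ensemble
```

Since every candidate is scored on the training data, this procedure can overfit; scoring on a held-out set is a common variation.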

Page 14: Ensemble Learning Model for MS/MS

Application in MS/MS


Page 15: Ensemble Learning Model for MS/MS

Application in MS/MS

Available Features

Symbol                                   Description

mass                                     precursor neutral mass
time                                     retention time
∆M                                       mass difference
#match                                   number of matched ions
pepLen                                   peptide length
charge                                   charge state
exp                                      E-value
#missed                                  number of missed cleavages
enzN                                     whether the peptide is preceded by an enzymatic site
enzC                                     whether the peptide has an enzymatic C-terminus
#consistent                              number of peptide termini consistent with cleavage
#ions                                    number of fragment ions predicted for the peptide
#proteins                                number of proteins containing the peptide
Arg, ..., Val                            count of each kind of amino acid
Hyperscore, Nextscore, BScore, YScore    scoring functions in X!Tandem

Page 16: Ensemble Learning Model for MS/MS

Application in MS/MS

Weights in Regularized Generalized Linear Model

Description    Weights    Description    Weights

#missed        -1.923     Arg            -1.321
charge          1.246     Cys            -1.062
Lys            -0.990     His             0.790
Trp             0.726     #consistent    -0.494
Pro             0.407     Asp             0.388
Met            -0.369     Val             0.350
bscore         -0.347     Tyr             0.238
#ions           0.210

Page 17: Ensemble Learning Model for MS/MS

Application in MS/MS

Model Used

Algorithm       Description            R Package

glm             linear model           stats
randomForest    random forest          randomForest
knn             k-nearest neighbour    stats
glmnet          elastic net            glmnet
svm             SVM                    e1071
step            stepwise glm           stats

Page 18: Ensemble Learning Model for MS/MS

Application in MS/MS

Training and Testing Dataset

Paola Picotti, et al. Nature 494:266-270, 2013 [4]


Page 19: Ensemble Learning Model for MS/MS

Application in MS/MS

ROC Curves

[Figure: ROC curves, true positive rate against false positive rate, both from 0.0 to 1.0. Legend values: Ensemble Learning 0.873, Percolator 0.821, PeptideProphet 0.789]
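For readers who want to reproduce this kind of comparison, here is a minimal sketch of how ROC points and the area under the curve can be computed from scores and known labels. The function names are hypothetical, and the sketch assumes that both correct and incorrect identifications are present in the data.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per distinct score."""
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            true_pos += 1
        else:
            false_pos += 1
        # Emit a point only once all items sharing this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((false_pos / negatives, true_pos / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as sorted (x, y) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

The area equals the probability that a randomly chosen correct identification scores higher than a randomly chosen incorrect one, so 0.5 corresponds to random ranking and 1.0 to perfect separation.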

Page 20: Ensemble Learning Model for MS/MS

Application in MS/MS

Relation between FDR and Ensemble Score with LOESS

[Figure: scatter plot of FDR (0.00 to 1.00) against ensemble score (0.00 to 1.00) with a LOESS fit]
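The relation between score and FDR can be estimated directly when the correctness of each identification is known. The sketch below computes the empirical FDR at a set of score thresholds and picks a cutoff for a target FDR; it omits the LOESS smoothing shown in the figure, and the function names are hypothetical.

```python
def fdr_at_thresholds(scores, is_correct, thresholds):
    """For each threshold, FDR = incorrect accepted / all accepted (score >= threshold)."""
    result = []
    for threshold in thresholds:
        accepted = [c for s, c in zip(scores, is_correct) if s >= threshold]
        if accepted:
            fdr = sum(1 for c in accepted if not c) / len(accepted)
        else:
            fdr = 0.0  # convention used here when nothing is accepted
        result.append((threshold, fdr))
    return result


def score_cutoff(scores, is_correct, target_fdr):
    """Lowest observed score whose empirical FDR does not exceed the target."""
    candidates = sorted(set(scores))
    for threshold, fdr in fdr_at_thresholds(scores, is_correct, candidates):
        if fdr <= target_fdr:
            return threshold
    return None  # no threshold reaches the target
```

Empirical FDR is not guaranteed to decrease monotonically with the threshold, which is one reason a smoothed curve is useful before choosing a cutoff.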


Page 22: Ensemble Learning Model for MS/MS

Application in MS/MS

Number of Correct/Incorrect Identifications with 0.05 FDR

[Figure: bar chart of the number of correct and incorrect identifications (0 to 1000) for three methods: PeptideProphet, Percolator, and Ensemble]

Page 23: Ensemble Learning Model for MS/MS

Application in MS/MS

Some Conclusions

Ensemble learning methods often give better results

Ensembles overfit the training data very easily

Model training is time-consuming


Page 26: Ensemble Learning Model for MS/MS

Application in MS/MS

References

[1] Andrew Keller, Alexey I Nesvizhskii, Eugene Kolker, and Ruedi Aebersold. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Analytical Chemistry, 74(20), 2002.

[2] Lukas Käll, Jesse D Canterbury, Jason Weston, William Stafford Noble, and Michael J MacCoss. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nature Methods, 4(11), 2007.

[3] Marina Spivak, Jason Weston, Léon Bottou, Lukas Käll, and William Stafford Noble. Improvements to the percolator algorithm for peptide identification from shotgun proteomics data sets. Journal of Proteome Research, 8(7):3737–3745, 2009.

[4] Paola Picotti, Mathieu Clément-Ziza, Henry Lam, David S Campbell, Alexander Schmidt, Eric W Deutsch, Hannes Röst, Zhi Sun, Oliver Rinner, Lukas Reiter, Qin Shen, Jacob J Michaelson, Andreas Frei, Simon Alberti, Ulrike Kusebauch, Bernd Wollscheid, Robert L Moritz, Andreas Beyer, and Ruedi Aebersold. A complete mass-spectrometric map of the yeast proteome applied to quantitative trait analysis. Nature, 494(7436), 2013.

Page 27: Ensemble Learning Model for MS/MS

Thank you!

http://qkou.info/sl.pdf