Ensemble Learning Model for MS/MS

Ensemble Learning Model to Estimate the Accuracy of Peptide Identifications Made by MS/MS. Course project for B529.


Ensemble Learning Model to Estimate the Accuracy of Peptide Identifications Made by MS/MS

Qiang Kou, Shan Xiao, Xiaohui Yao, Zongliang Yue

qkou@umail.iu.edu

April 29, 2014

Qiang Kou (qkou@umail.iu.edu) Ensemble Learning 1/25 April 29, 2014 1 / 25

Background


Mass spectrometry has become the most widely used tool for the characterization of proteins

Many database search tools and algorithms have been developed, including SEQUEST, MASCOT, X!Tandem, InsPecT and MS-Align+

Scores of correct and incorrect identifications always overlap significantly


[Figure: histogram of the number of spectra in each bin (y-axis, 0 to 200) against the discriminant score D (x-axis, -3.9 to 7.3), showing overlapping "Correct" and "Incorrect" distributions. From Brian C. Searle.]


Available Software

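PeptideProphet [1]

$$ F(x_1, x_2, \ldots, x_n) = c_0 + \sum_{i=1}^{n} c_i x_i $$

$$ p(+ \mid F) = \frac{p(F \mid +)\,p(+)}{p(F \mid +)\,p(+) + p(F \mid -)\,p(-)} $$

Percolator [2, 3]

$$ F(x) = \sum_i w_i h_i(x) + b, \quad \text{where } h_k(x) = \tanh\left((w^k)^T x + b_k\right) $$

Sigmoid loss function:

$$ L(F(x), y) = \frac{1}{1 + \exp(y F(x))} $$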
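The PeptideProphet formulas above can be sketched in a few lines of Python. This is only an illustration: the coefficients, the prior and the use of normal densities for both p(F|+) and p(F|-) are assumptions made here, not values or distribution choices taken from the slides or from the PeptideProphet software.

```python
import math

def discriminant(features, coefficients, intercept):
    """Linear discriminant score F = c0 + sum_i c_i * x_i."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))

def gaussian_pdf(value, mean, sd):
    """Normal density; an assumed stand-in for p(F|+) and p(F|-)."""
    z = (value - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_correct(score, prior_correct, correct_params, incorrect_params):
    """Bayes' rule: p(+|F) = p(F|+)p(+) / (p(F|+)p(+) + p(F|-)p(-))."""
    weight_pos = gaussian_pdf(score, *correct_params) * prior_correct
    weight_neg = gaussian_pdf(score, *incorrect_params) * (1.0 - prior_correct)
    return weight_pos / (weight_pos + weight_neg)

# Hypothetical coefficients and (mean, sd) parameters, for illustration only.
score = discriminant([2.0, 0.5], coefficients=[1.5, -1.0], intercept=-0.5)
p = posterior_correct(score, prior_correct=0.3,
                      correct_params=(3.0, 1.5), incorrect_params=(-1.0, 1.0))
```

With these made-up parameters a score near the "correct" distribution receives a posterior close to 1, while a score midway between two equal-width distributions with equal priors receives exactly 0.5.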


Trans-Proteomic Pipeline

[Diagram: mzXML → X!Tandem → (PeptideProphet | Percolator | Ensemble Learning) → ProteinProphet → Proteins]


Ensemble Learning


Homogeneous: learners from the same category

boosting, bagging, random forest

Heterogeneous: learners from different categories


Example of Ensemble Learning

Two real variables

Two random pseudo variables

Three methods: linear model, SVM and random forest


Results of Three Methods


Average of Three Methods
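A minimal sketch of what averaging three methods means in practice. The predicted probabilities below are made up for illustration (they are not the results shown on the slides) and were chosen so that the three models err on different cases; averaging does not always help this much.

```python
def accuracy(probabilities, labels):
    """Fraction of cases where thresholding at 0.5 reproduces the label."""
    hits = sum((p >= 0.5) == bool(y) for p, y in zip(probabilities, labels))
    return hits / len(labels)

# Made-up predicted probabilities on eight cases (illustration only).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = {
    "linear model": [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.6, 0.1],
    "SVM": [0.6, 0.3, 0.7, 0.8, 0.4, 0.7, 0.2, 0.3],
    "random forest": [0.4, 0.9, 0.8, 0.2, 0.6, 0.1, 0.3, 0.2],
}

# Unweighted average of the three predictions, case by case.
averaged = [sum(column) / len(column) for column in zip(*predictions.values())]

individual = {name: accuracy(p, labels) for name, p in predictions.items()}
ensemble_accuracy = accuracy(averaged, labels)
```

In this toy data each model misclassifies two or three cases, but no case is misclassified by a majority of the models, so the average classifies all eight correctly.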


Ensemble Strategy


Non-negative Least Squares and Logistic Regression

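Non-negative least squares regression

$$ f_e(X) = \sum_{i=1}^{k} \alpha_i f_i(X), \quad \sum_{i=1}^{k} \alpha_i = 1, \quad \alpha_i \ge 0 $$

Non-negative logistic regression

$$ f_e(X) = \frac{1}{1 + \exp\left(-\sum_{i=1}^{k} \alpha_i f_i(X)\right)}, \quad \alpha_i \ge 0 $$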
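One way to obtain non-negative weights for the logistic combination is projected gradient descent: take a gradient step on the log-loss, then clip negative weights to zero. The sketch below assumes this fitting procedure, the learning rate and the toy base-model scores; the slides do not say how the weights were actually fitted, and the least-squares variant with its sum-to-one constraint is not covered here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ensemble_logistic(alphas, base_scores):
    """f_e(X) = 1 / (1 + exp(-sum_i alpha_i * f_i(X)))."""
    return sigmoid(sum(a * f for a, f in zip(alphas, base_scores)))

def fit_nonneg_logistic(score_rows, labels, steps=500, rate=0.1):
    """Projected gradient descent on the mean log-loss, keeping alpha_i >= 0."""
    k = len(score_rows[0])
    n = len(labels)
    alphas = [0.0] * k
    for _ in range(steps):
        grads = [0.0] * k
        for row, y in zip(score_rows, labels):
            error = ensemble_logistic(alphas, row) - y
            for i in range(k):
                grads[i] += error * row[i] / n
        # Gradient step followed by projection onto the non-negative orthant.
        alphas = [max(0.0, a - rate * g) for a, g in zip(alphas, grads)]
    return alphas

# Toy scores from two hypothetical base models: the first agrees with the
# labels, the second points the wrong way.
rows = [(1.0, -0.5), (0.8, -0.4), (0.6, -0.3),
        (-0.7, 0.35), (-0.9, 0.45), (-0.4, 0.2)]
labels = [1, 1, 1, 0, 0, 0]
alphas = fit_nonneg_logistic(rows, labels)
```

Here the constraint does the work of model selection: an unconstrained fit would give the second model a negative weight, whereas the projection pins it at zero and leaves only the informative model in the ensemble.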


Greedy Strategy

1. Start with the empty ensemble;

2. Add the model that maximizes the ensemble's classification performance on the training dataset;

3. Repeat Step 2 for a fixed number of iterations;

4. Return the final ensemble.
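The four steps above can be sketched as follows. Several details are assumptions made for this example, since the slides do not specify them: the ensemble prediction is the unweighted average of its members, performance is accuracy at a 0.5 threshold, a model may be selected more than once, and ties go to the first model in alphabetical order. The model names and probabilities are hypothetical.

```python
def accuracy(probabilities, labels):
    """Fraction of cases where thresholding at 0.5 reproduces the label."""
    hits = sum((p >= 0.5) == bool(y) for p, y in zip(probabilities, labels))
    return hits / len(labels)

def average_predictions(prediction_sets):
    """Case-by-case mean of several prediction vectors."""
    return [sum(column) / len(column) for column in zip(*prediction_sets)]

def greedy_ensemble(library, labels, iterations):
    """library maps a model name to its predicted probabilities on the training set."""
    chosen = []  # Step 1: start with the empty ensemble.
    for _ in range(iterations):  # Step 3: fixed number of iterations.
        best_name, best_score = None, -1.0
        for name in sorted(library):
            candidate = [library[m] for m in chosen] + [library[name]]
            score = accuracy(average_predictions(candidate), labels)
            if score > best_score:
                best_name, best_score = name, score
        chosen.append(best_name)  # Step 2: add the best model.
    return chosen  # Step 4: return the final ensemble.

# Hypothetical training-set predictions, for illustration only.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
library = {
    "glm": [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.6, 0.1],
    "svm": [0.6, 0.3, 0.7, 0.8, 0.4, 0.7, 0.2, 0.3],
    "rf": [0.4, 0.9, 0.8, 0.2, 0.6, 0.1, 0.3, 0.2],
}
chosen = greedy_ensemble(library, labels, iterations=2)
```

Because every step is scored on the training data, this procedure can overfit; scoring candidates on a held-out set instead is a common variation.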


Application in MS/MS


Available Features

Symbol                                   Description
mass                                     precursor neutral mass
time                                     retention time
∆M                                       mass difference
#match                                   number of matched ions
pepLen                                   peptide length
charge                                   charge state
exp                                      E-value
#missed                                  number of missed cleavages
enzN                                     whether the peptide is preceded by an enzymatic site
enzC                                     whether the peptide has an enzymatic C-terminus
#consistent                              number of peptide termini consistent with cleavage
#ions                                    number of fragment ions predicted for the peptide
#proteins                                number of proteins containing the peptide
Arg, ..., Val                            count of each kind of amino acid
Hyperscore, Nextscore, BScore, YScore    scoring functions in X!Tandem
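Some of these features can be derived from the peptide sequence alone. The sketch below computes pepLen, #missed and the amino-acid counts. It assumes trypsin as the enzyme (cleavage after K or R unless the next residue is P) and uses one-letter residue codes; neither choice is stated on the slide, and the function name is hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes

def peptide_features(sequence):
    """Sequence-derived features for one peptide (assumes tryptic digestion)."""
    residues = sequence.upper()
    # An internal K or R not followed by P is a site trypsin failed to cut.
    missed = sum(1 for i, aa in enumerate(residues[:-1])
                 if aa in "KR" and residues[i + 1] != "P")
    features = {"pepLen": len(residues), "#missed": missed}
    counts = Counter(residues)
    for aa in AMINO_ACIDS:
        features[aa] = counts.get(aa, 0)
    return features

features = peptide_features("AKRPLSVK")
```

For the example peptide, the K at position 2 counts as a missed cleavage, the R before P does not, and the C-terminal K is the expected cleavage site.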


Weights in Regularized Generalized Linear Model

Description    Weight     Description    Weight
#missed        -1.923     Arg            -1.321
charge          1.246     Cys            -1.062
Lys            -0.990     His             0.790
Trp             0.726     #consistent    -0.494
Pro             0.407     Asp             0.388
Met            -0.369     Val             0.350
bscore         -0.347     Tyr             0.238
#ions           0.210


Models Used

Algorithm       Description                 R Package
glm             generalized linear model    stats
randomForest    random forest               randomForest
knn             k-nearest neighbour         class
glmnet          elastic net                 glmnet
svm             SVM                         e1071
step            stepwise glm                stats


Training and Testing Dataset

Paola Picotti, et al. Nature 494:266-270, 2013 [4]


ROC Curves

[Figure: ROC curves, true positive rate (y-axis, 0.0 to 1.0) against false positive rate (x-axis, 0.0 to 1.0). Legend: Ensemble Learning 0.873, Percolator 0.821, PeptideProphet 0.789.]
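For reference, an ROC curve and the area under it can be computed from scores and known labels as sketched below. This is a generic illustration on made-up data, not the code behind the figure, and whether the legend values above are areas under the curve is not stated on the slide.

```python
def roc_points(scores, labels):
    """(false positive rate, true positive rate) pairs as the threshold is lowered."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    positives = sum(labels)
    negatives = len(labels) - positives
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for index, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last item in a group of tied scores.
        if index + 1 == len(ranked) or ranked[index + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up scores; label 1 marks a correct identification, 0 an incorrect one.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
area = auc(roc_points(scores, labels))
```

The area equals the fraction of (correct, incorrect) pairs in which the correct identification has the higher score, which is 8 of 9 pairs in this example.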


Relation between FDR and Ensemble Score with LOESS

[Figure: scatter plot of FDR (y-axis, 0.00 to 1.00) against ensemble score (x-axis, 0.00 to 1.00) with a LOESS fit.]


Number of Correct/Incorrect Identifications with 0.05 FDR

[Figure: bar chart of the number of correct and incorrect identifications (y-axis, 0 to 1000) for PeptideProphet, Percolator and the ensemble method.]
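When the true status of each identification is known, counts like those in the chart can be obtained by walking down the score-ranked list and keeping the largest accepted set whose FDR does not exceed 0.05. The sketch below assumes FDR is defined as incorrect / accepted and ignores tied scores; the data are synthetic and the function names are hypothetical.

```python
def fdr_by_rank(scores, is_correct):
    """Return (score, n_correct, n_incorrect, fdr) for each cutoff in ranked order."""
    ranked = sorted(zip(scores, is_correct), key=lambda pair: -pair[0])
    rows, incorrect = [], 0
    for accepted, (score, correct) in enumerate(ranked, start=1):
        if not correct:
            incorrect += 1
        rows.append((score, accepted - incorrect, incorrect, incorrect / accepted))
    return rows

def counts_at_fdr(scores, is_correct, max_fdr=0.05):
    """(n_correct, n_incorrect) for the largest accepted set with FDR <= max_fdr."""
    best = (0, 0)
    for _, n_correct, n_incorrect, fdr in fdr_by_rank(scores, is_correct):
        if fdr <= max_fdr:
            best = (n_correct, n_incorrect)
    return best

# Synthetic example: 30 identifications with strictly decreasing scores.
scores = [1 - i / 30 for i in range(30)]
is_correct = [True] * 19 + [False] + [True] * 5 + [False] * 5
result = counts_at_fdr(scores, is_correct)
```

The same per-cutoff rows give the raw points for an FDR-versus-score plot; the smoothing step used on the slides (LOESS) is not reproduced here.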


Conclusions

Ensemble learning methods often give better results than a single model

Ensembles overfit the training data very easily

Model training is time-consuming


References

[1] Andrew Keller, Alexey I Nesvizhskii, Eugene Kolker, and Ruedi Aebersold. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Analytical Chemistry, 74(20), 2002.

[2] Lukas Käll, Jesse D Canterbury, Jason Weston, William Stafford Noble, and Michael J MacCoss. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nature Methods, 4(11), 2007.

[3] Marina Spivak, Jason Weston, Léon Bottou, Lukas Käll, and William Stafford Noble. Improvements to the Percolator algorithm for peptide identification from shotgun proteomics data sets. Journal of Proteome Research, 8(7):3737–3745, 2009.

[4] Paola Picotti, Mathieu Clément-Ziza, Henry Lam, David S Campbell, Alexander Schmidt, Eric W Deutsch, Hannes Röst, Zhi Sun, Oliver Rinner, Lukas Reiter, Qin Shen, Jacob J Michaelson, Andreas Frei, Simon Alberti, Ulrike Kusebauch, Bernd Wollscheid, Robert L Moritz, Andreas Beyer, and Ruedi Aebersold. A complete mass-spectrometric map of the yeast proteome applied to quantitative trait analysis. Nature, 494(7436), 2013.


Thank you!

http://qkou.info/sl.pdf
