Scoring functions and their use in the identification of peptides in mass spectrometry Eugene A Kapp Bioinformatics Walter & Eliza Hall Institute of Medical Research Bioinformatics Summer Course Dec 2012


Page 1: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)

Scoring functions and their use in the identification of peptides in mass spectrometry

Eugene A Kapp

Bioinformatics, Walter & Eliza Hall Institute of Medical Research

Bioinformatics Summer Course, Dec 2012

Page 2: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)

Shotgun proteomics: peptide identification methods

Page 3: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)

BLAST / FASTA

• Sequence against sequence
• Can be used to find weak / distant similarity
• Can make gapped alignments

MS/MS-based ID

• Mass & intensity values against sequence
• Looking for identity or near identity
• Generally, short peptides

Page 4: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)

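[Figure: backbone of a four-residue peptide, H2N-CH(R1)-CO-NH-CH(R2)-CO-NH-CH(R3)-CO-NH-CH(R4)-CO-OH, cleaved at the amide bond between residues 3 and 4.]

b ion formation: the charge is retained on the N-terminal fragment (b3, residues R1 to R3) + neutral pumped away by vacuum system

and/or

y ion formation: the charge is retained on the C-terminal fragment (y1, residue R4, +H) + neutral pumped away by vacuum system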

Proton Mobility

Mobile: zpre > #Arg + #Lys + #His
Partially mobile: zpre < #Arg + #Lys + #His and > #Arg
Non-mobile: zpre < #Arg

For peptides with non-mobile protons, fragmentation tends to proceed via charge-remote mechanisms. MS/MS spectra will be dominated by a few ions, typically:

C-term side of D, E
N-term side of P

Peptide Information Content
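The proton mobility rules on this slide can be sketched as a small classifier. This is a minimal illustration, not code from the talk: the function name `proton_mobility` is hypothetical, and because the slide leaves the boundary cases (charge exactly equal to a count) unassigned, the sketch assumes they fall into the less mobile class.

```python
def proton_mobility(sequence: str, precursor_charge: int) -> str:
    """Classify a peptide ion by the proton mobility rules on the slide.

    Assumption (not stated on the slide): when the precursor charge equals
    a residue count exactly, the ion is placed in the less mobile class.
    """
    n_arg = sequence.count("R")
    n_basic = n_arg + sequence.count("K") + sequence.count("H")
    if precursor_charge > n_basic:
        return "mobile"
    if precursor_charge > n_arg:
        return "partially mobile"
    return "non-mobile"


# The two example peptides shown on the following slides, both 2+.
print(proton_mobility("VFIMDNCEELIPEYLNFIR", 2))   # mobile
print(proton_mobility("RVFIMDNCEELIPEYLNFIR", 2))  # non-mobile
```

Under this boundary assumption the two peptides are classified the same way as the slide titles label them ("Mobile" and "Non-mobile").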

Page 5: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)

[Figure: MS/MS spectrum of VFIMDNCEELIPEYLNFIR (++), relative abundance (0 to 100) against m/z (400 to 2000). Sequence annotations: ox, Pe. Labelled fragment ions: y4, y5, y6, y8, y9, b10, b11.]

Spectral information content: “Mobile” Proton

• nP cleavage
• metox loss
• cP cleavage

Page 6: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)

[Figure: MS/MS spectrum of RVFIMDNCEELIPEYLNFIR (++), relative abundance (0 to 100) against m/z (400 to 2000). Sequence annotations: ox, Pe; highlighted region MDNCE. Labelled peaks: b6, y6, y8, y11, y14, -CH3SOH, -Pe, -(CH3SOH + Pe).]

Spectral information content: “Non-mobile” proton

• metox loss
• Pe loss
• cD cleavage
• cE cleavage
• nP cleavage

Page 7: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)

[Figure: histogram of NXCorr (x-axis, -0.1 to 1.3) against number of spectra (y-axis, 0 to 250) for "Correct sequence" and "Randomised" matches; d' = 0.55.]

Mobile Proton 2+ (1268 unique peptides)

Aims of MS/MS Scoring Functions (1)

Page 8: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)

[Figure: histogram of NXCorr (x-axis, -0.2 to 1.2) against number of spectra (y-axis, 0 to 340) for "Correct sequence" and "Randomised" matches; d' = 0.50.]

Partially Mobile Proton 2+ (2223 unique peptides)

Aims of MS/MS Scoring Functions (1)

Page 9: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)

[Figure: histogram of NXCorr (x-axis, -0.1 to 1.1) against number of spectra (y-axis, 0 to 30) for "Correct sequence" and "Randomised" matches; d' = 0.32.]

Non-Mobile Proton 2+ (264 unique peptides)

Aims of MS/MS Scoring Functions (1)
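The slides report a d' value for each pair of score distributions but do not define it. As a rough sketch, assuming d' is the usual sensitivity index (difference of the means divided by the root-mean-square of the two standard deviations), the separation between "Correct sequence" and "Randomised" scores could be computed as below. The definition, the function name `d_prime`, and the toy scores are assumptions, not taken from the talk.

```python
import math
from statistics import mean, variance


def d_prime(correct_scores, randomised_scores):
    """Sensitivity index d' between two score samples.

    Assumed definition: (mean_correct - mean_randomised) divided by
    sqrt of the average of the two sample variances.
    """
    rms_sd = math.sqrt(0.5 * (variance(correct_scores) + variance(randomised_scores)))
    return (mean(correct_scores) - mean(randomised_scores)) / rms_sd


# Toy scores only; these are not values from the slides.
print(d_prime([0.6, 0.8, 1.0], [0.1, 0.3, 0.5]))
```

A larger d' means the scoring function separates correct from randomised matches more cleanly, which is the trend the three histograms illustrate as proton mobility decreases.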

Page 10: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)

[Figure: ROC curves, sensitivity against 1-specificity (both axes 0.0 to 1.0), panel A.]

Tryptic search

Mascot Ion score (AUC=0.98)
PeptideProphet (AUC=0.96)
Sonar (AUC=0.94)
Tandem (AUC=0.93)
Spectrummill (tag) (AUC=0.91)
Sequest XCorr (AUC=0.91)
Spectrummill (AUC=0.86)

Kapp et al. Proteomics 2005

Aims of MS/MS Scoring Functions (2)
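The AUC values quoted for each search engine summarise how well a score ranks correct matches above incorrect ones. A minimal sketch of that calculation, using the rank (Mann-Whitney) formulation of the area under the ROC curve, is given below. The function name `roc_auc` and the toy scores are illustrative assumptions; this is not necessarily how the values in Kapp et al. were computed.

```python
def roc_auc(correct_scores, incorrect_scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen correct match scores
    higher than a randomly chosen incorrect match (ties count one half).
    """
    wins = 0.0
    for c in correct_scores:
        for w in incorrect_scores:
            if c > w:
                wins += 1.0
            elif c == w:
                wins += 0.5
    return wins / (len(correct_scores) * len(incorrect_scores))


# Toy scores only: perfect separation gives an AUC of 1.0.
print(roc_auc([5, 6, 7], [1, 2, 3]))
```

An AUC of 0.5 corresponds to a score that cannot distinguish correct from incorrect matches at all.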

Page 11: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)

• Statistical: peptide or fragment ion frequency statistics (Mascot/Andromeda) OR Bayesian model (xxx)

• Non-statistical: Correlation, dot-product -> raw score (SEQUEST/Comet)

• Blend: raw score -> E-value (X!Tandem)

SpectrumMill, GutenTag, MyriMatch, Digger, ProteinPilot, Sorcerer, pFind, Peaks, ProteinLynxGS, MSGF, Inspect, OMSSA etc...

Types of MS/MS Scoring Functions

Page 12: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)

[Figure: the acquired spectrum (spectrum intensities, 0% to 100%) is multiplied peak by peak with the model spectrum (predicted? 1 or 0), leaving only the similar peaks (y/b ions).]

$$\mathrm{Score}_{y/b} = \sum_{i=0}^{n} I_i \cdot P_i$$

where I_i are the spectrum intensities and P_i marks whether peak i is predicted (1 or 0).

X!Tandem’s preliminary score is a dot product of the acquired and model spectra. Because only similar peaks are considered, this is the sum of the intensities of the matched y and b ions.

X!Tandem MS/MS Scoring

Image courtesy of Proteome Software
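The dot product described on this slide can be sketched in a few lines. This is an illustration under stated assumptions rather than X!Tandem's actual code: the function name `preliminary_score`, the peak-list layout, and the 0.5 Da matching tolerance are all hypothetical.

```python
def preliminary_score(peaks, predicted_mz, tol=0.5):
    """Sum of intensities I_i weighted by P_i (1 if the peak is predicted, else 0).

    peaks        : list of (m/z, intensity) pairs from the acquired spectrum
    predicted_mz : m/z values of the model spectrum's b and y ions
    tol          : matching tolerance in Da (an assumed value)
    """
    score = 0.0
    for mz, intensity in peaks:
        # P_i is 1 when any predicted ion lies within the tolerance of this peak.
        p_i = 1 if any(abs(mz - p) <= tol for p in predicted_mz) else 0
        score += intensity * p_i
    return score


# Toy spectrum: only the 200.2 peak matches a predicted ion.
peaks = [(100.0, 10.0), (200.2, 50.0), (300.0, 5.0)]
print(preliminary_score(peaks, predicted_mz=[200.0, 400.0]))  # 50.0
```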

Page 13: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)

[Figure: the acquired spectrum (spectrum intensities, 0% to 100%) is multiplied peak by peak with the model spectrum (predicted? 1 or 0), leaving only the similar peaks (y/b ions).]

$$\mathrm{HyperScore} = \left( \sum_{i=0}^{n} I_i \cdot P_i \right) \cdot N_b! \cdot N_y!$$

where N_b and N_y are the numbers of b and y ions assigned.

X!Tandem modifies the preliminary score by multiplying by N factorial for the number of b and y ions assigned. The use of factorials is based on the hypergeometric distribution.

X!Tandem Hyperscore

Image courtesy of Proteome Software
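As a minimal sketch of the formula on this slide, the hyperscore multiplies the summed matched intensities by the factorials of the b and y ion counts. The function name `hyperscore` and the input layout are assumptions, and any further scaling applied by X!Tandem itself is not modelled here.

```python
from math import factorial


def hyperscore(matched_b_intensities, matched_y_intensities):
    """Preliminary dot-product score times N_b! times N_y!.

    Each argument lists the intensities of the matched ions of one series,
    so its length is the number of assigned ions (N_b or N_y).
    """
    dot_product = sum(matched_b_intensities) + sum(matched_y_intensities)
    n_b = len(matched_b_intensities)
    n_y = len(matched_y_intensities)
    return dot_product * factorial(n_b) * factorial(n_y)


# Toy example: 2 matched b ions and 3 matched y ions.
print(hyperscore([10, 20], [5, 5, 10]))  # 50 * 2! * 3! = 600
```

The factorial terms reward matches that explain many ions in both series, so two candidates with the same summed intensity can receive very different hyperscores.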

Page 14: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)

[Figure: histogram of hyperscores, # results (0 to 60) against hyperscore (0 to 100); the bulk of the distribution is labelled "incorrect IDs".]

Next, X!Tandem makes a histogram of all the hyperscores for all the peptides in the database that might match this spectrum.

For example, in this figure, 52 peptides were found with a hyperscore of 19, and one peptide with a hyperscore of 83.

X!Tandem assumes that the peptide with the highest hyperscore is correct, and all others are incorrect.

Image courtesy of Proteome Software

Histogram of hyperscores

Page 15: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)

[Figure: upper panel, histogram of hyperscores, # results (0 to 60) against hyperscore (0 to 100), with the right side colored; lower panel, log(# results) (0 to 4) against hyperscore (20 to 50).]

If the data on the right side of the histogram (colored in upper figure) is taken and log-transformed, the data fall on a straight line.

A straight line is the expected result from a statistical argument that assumes the incorrect results are random.

Note: this histogram is calculated independently for each spectrum.

Image courtesy of Proteome Software

Log histogram
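The log-histogram step can be sketched as follows. This is a simplified illustration, not X!Tandem's implementation: it assumes integer hyperscore bins, treats "the right side" as every bin from the most populated one upward, uses log base 10, and fits the line by ordinary least squares. The function name `fit_log_histogram` is hypothetical.

```python
import math
from collections import Counter


def fit_log_histogram(hyperscores):
    """Fit log10(# results) = slope * hyperscore + intercept to the right side
    of the hyperscore histogram for one spectrum.

    The single highest hyperscore is left out, following the assumption that
    the top hit is correct and all others are incorrect.
    """
    binned = sorted(int(round(s)) for s in hyperscores)[:-1]  # drop the top hit
    counts = Counter(binned)
    mode = max(counts, key=counts.get)            # most populated bin
    xs = [s for s in sorted(counts) if s >= mode]  # right side of the histogram
    ys = [math.log10(counts[s]) for s in xs]

    # Ordinary least-squares line through (xs, ys).
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data: counts fall tenfold per unit of hyperscore, plus one top hit at 50.
scores = [10] * 100 + [11] * 10 + [12] * 1 + [50]
print(fit_log_histogram(scores))  # slope close to -1, intercept close to 12
```

The fit needs at least two populated bins on the right side; a real implementation would have to handle spectra with too few candidates.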

Page 16: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)

[Figure: upper panel, histogram of hyperscores, # results (0 to 60) against hyperscore (0 to 100); lower panel, log(# results) (0 to 4) against hyperscore (20 to 50), with the region beyond the intersection of the fitted line with log(# results)=0 labelled "significant".]

X!Tandem has already assumed that the top hyperscore is the only possible correct match.

This match is significant if it is greater than the point at which the straight line through the log data intersects the log(#results)=0 line.

Any hyperscores greater than this are unlikely to have arisen by chance.

Image courtesy of Proteome Software

Significant scores

Page 17: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)

[Figure: upper panel, histogram of hyperscores, # results (0 to 60) against hyperscore (0 to 100); lower panel, log(# results) (-10 to 6) against hyperscore (0 to 100), with the red line extrapolated to the top hyperscore; E-value=e-8.2.]

The E-value expresses just how unlikely a greater hyperscore is.

X!Tandem calculates the E-value by extrapolating the red line of the log histogram.

For the example shown, a hyperscore of 83 would occur by chance where the red line crosses 83. The log of this value (the E-value) is -8.2, as shown.

X!Tandem E-value
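Given a straight line through the log histogram, both the significance threshold and the E-value follow from simple extrapolation. The sketch below takes a slope and intercept as inputs; the function names are hypothetical, and the numeric slope and intercept are made-up values chosen only so that a hyperscore of 83 lands at -8.2 as in the figure.

```python
def log_e_value(slope, intercept, hyperscore):
    """Log of the expected number of chance results at this hyperscore,
    read off the extrapolated line log(# results) = slope * x + intercept."""
    return slope * hyperscore + intercept


def significance_threshold(slope, intercept):
    """Hyperscore at which the fitted line crosses log(# results) = 0."""
    return -intercept / slope


# Hypothetical line parameters, not taken from the slides.
slope, intercept = -0.2, 8.4
print(significance_threshold(slope, intercept))  # hyperscores above this are significant
print(log_e_value(slope, intercept, 83))         # approximately -8.2
```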

Page 18: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)

• Human (even expert) judgment is subjective and can be unreliable

Why is Probability based scoring important?

Page 19: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)
Page 20: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)
Page 21: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)

• Human (even expert) judgment is subjective and can be unreliable

• Standard, statistical tests of significance can be applied to the results

• Arbitrary scoring schemes are susceptible to false positives.

Why is Probability based scoring important?

Page 22: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)

• Yes, if it is a test sample and you know what the answer should be
  – Matches to the expected protein sequences are defined to be correct
  – Matches to other sequences are defined to be wrong

• If the sample is an unknown, then you have to define “correct” very carefully:
  – The best match in the database?
  – The best match out of all possible peptides?
  – The peptide sequence that is uniquely and completely defined by the MS data?

Can we calculate a probability that a match is correct?

Page 23: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)

$$P = \sum_{k=n}^{N} \binom{N}{k} \, p^{k} (1-p)^{N-k}$$

N is the # of possible fragment ion matches (peplen * 2),
n is the # of observed fragment ion matches,
k is the # of matches

p (probability of a match) = peak depth / numbins

where peak depth = # of peaks per 100 Da window (max 10)
and numbins = 100 / (2 * frag_ion_tol)

Score = -10 * log10(P)

Binomial, Hypergeometric, Poisson, or EVD?

Probability model: Andromeda (MaxQuant) – Theoretical model
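The binomial model on this slide can be sketched in Python. The sketch reads P as the upper-tail probability of observing at least n matches out of N possible, which is an interpretation of the slide's formula rather than a confirmed detail of Andromeda; the function names and the example inputs are hypothetical.

```python
import math


def match_probability(peak_depth, frag_ion_tol):
    """p = peak depth / numbins, with numbins = 100 / (2 * frag_ion_tol)."""
    numbins = 100 / (2 * frag_ion_tol)
    return peak_depth / numbins


def binomial_score(peptide_length, n_matched, peak_depth, frag_ion_tol):
    """Score = -10 * log10(P), where P is the chance of at least n_matched
    random matches among N = 2 * peptide_length possible fragment ions."""
    n_possible = 2 * peptide_length
    p = match_probability(peak_depth, frag_ion_tol)
    tail = sum(
        math.comb(n_possible, k) * p**k * (1 - p) ** (n_possible - k)
        for k in range(n_matched, n_possible + 1)
    )
    return -10 * math.log10(tail)


# Hypothetical inputs: 10-residue peptide, 8 matched ions,
# 5 peaks kept per 100 Da window, 0.5 Da fragment tolerance.
print(binomial_score(peptide_length=10, n_matched=8, peak_depth=5, frag_ion_tol=0.5))
```

More matched ions or a sparser spectrum (smaller p) both push the score up. For very small tail probabilities the sum can underflow to zero, which a production implementation would need to guard against.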

Page 24: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)

for ONE candidate (decoy) peptide: number of matches per ion-series (columns) at each level (rows)

Level  a  b  b-98  b-18  b-17  b++  b++-98  b++-18  b++-17  y  y-98  y-18  y-17  y++  y++-98  y++-18  y++-17
L1     0  2  0     1     0     0    0       0       0       3  0     1     0     0    0       0       0
L2     0  3  0     1     0     0    0       0       0       3  0     1     0     0    0       0       0
L3     1  3  0     2     0     0    0       0       0       3  0     1     0     0    0       0       0
L4     1  3  0     2     1     0    0       0       0       4  0     1     1     1    0       0       0
L5     1  3  0     2     1     0    0       0       0       4  0     2     1     1    0       0       1
L6     1  3  0     2     1     0    0       0       0       5  0     2     1     1    0       0       1
L7     1  3  0     2     1     0    0       0       0       5  0     2     1     1    0       1       1
L8     1  3  0     2     1     0    0       0       0       5  0     2     2     1    0       1       2
L9     1  3  0     2     2     0    0       0       0       5  0     2     2     1    0       1       2
L10    1  3  0     2     2     0    0       0       0       5  0     2     2     1    0       1       3

1) for ALL candidate (decoy) peptides: for each cell, calculate slope & intercept based on all decoy peptides

[Figure: Score (-10*LgP), -10 to 40, against # of fragment ion matches, 0 to 7.]

2) Extrapolate for more matches…

Probability model: Digger – empirical NULL model

Page 25: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)

Limitations of E-values or P-values

1) P-values or E-values are not well suited for the analysis of large-scale datasets: they do not allow estimation of global error rates (FDR) as a function of filtering threshold (need for multiple testing correction)

2) Do not directly incorporate additional useful information (e.g., # of missed cleavages, mass accuracy, retention time etc.)
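To illustrate the multiple testing point, the sketch below applies the Benjamini-Hochberg procedure to a list of per-spectrum p-values. The slide does not name a specific correction, so the choice of Benjamini-Hochberg, the function name, and the toy p-values are all assumptions.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order.

    Each adjusted value is the smallest FDR level at which that test
    would be accepted.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy p-values for four spectra.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Filtering on the adjusted values rather than on the raw p-values ties the threshold to an expected proportion of false identifications across the whole dataset.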

Page 26: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)
Page 27: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)
Page 28: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)
Page 29: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)
Page 30: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)
Page 31: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)
Page 32: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)
Page 33: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)
Page 34: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)
Page 35: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)

1) For all target “real” peptides generate decoys “on-the-fly” – easy, built-in.

2) Reverse internal peptide residues – if palindrome then randomise

LGEDTLISYR -> LYSILTDEGR

3) I/L residues taken into account.

4) PTMs are kept constant but shifted internally within peptide.

5) Similar implementation in Crux (Univ. of Washington – Noble, MacCoss).

What makes a good decoy?
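Point 2 above can be sketched as follows: keep the terminal residues, reverse everything in between, and shuffle instead when the internal sequence is a palindrome. This is only an illustration; the function name `make_decoy` and the seeded shuffle are assumptions, and the I/L and PTM handling from points 3 and 4 is not modelled.

```python
import random


def make_decoy(peptide, rng=None):
    """Reverse the internal residues of a peptide, keeping both termini fixed.

    If the internal sequence is a palindrome, reversal would reproduce the
    target, so the internal residues are shuffled instead.
    """
    if len(peptide) < 3:
        return peptide  # nothing internal to reverse
    internal = peptide[1:-1]
    decoy_internal = internal[::-1]
    if decoy_internal == internal:
        rng = rng or random.Random(0)  # fixed seed is an assumption, for repeatability
        residues = list(internal)
        for _ in range(10):  # retry a few times in case the shuffle is a no-op
            rng.shuffle(residues)
            if "".join(residues) != internal:
                break
        decoy_internal = "".join(residues)
    return peptide[0] + decoy_internal + peptide[-1]


print(make_decoy("LGEDTLISYR"))  # LYSILTDEGR, as on the slide
```

Keeping the termini fixed preserves the precursor mass and the tryptic C-terminal residue, so the decoy competes with the target in the same mass window.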

Page 36: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)

What makes a good decoy?

Page 37: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)

1) separate or concatenated target decoy sequence database?

Decoy strategies?

Page 38: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)

sequest

High scores for short (random) peptides
High scores for larger search space

Decoy strategies: Imperfection of scoring functions

Page 39: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)

A) Covariance dependency – some spectra score well regardless

B) Reduce co-varying features by using post-processing tools (e.g. PeptideProphet, Percolator, q-ranker etc.) which combine multiple different features.

Decoy strategies: Imperfection of scoring functions

Page 40: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)

Single spectrum analyses (individual probabilities)

• Expectation values. Similar to sequence similarity searching (BLAST)

Global analysis (individual and global error rates)

• Target-decoy strategy for global FDR
• Distribution modeling (e.g. PeptideProphet, Percolator) for local and global FDR estimation.

Summary: Statistical approaches
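The target-decoy strategy for global FDR can be sketched as follows. The estimator used here (decoys divided by targets above each score threshold, then converted to q-values) is one common convention and is an assumption, as are the function name and the input layout; other estimators are in use, particularly for concatenated searches.

```python
def target_decoy_qvalues(psms):
    """Estimate q-values for target matches from a target-decoy search.

    psms : list of (score, is_decoy) pairs, higher score = better match.
    Returns (score, q_value) for each target, best score first.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = decoys = 0
    fdrs = []
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))  # estimated FDR at this threshold

    # q-value: lowest FDR at which this match would still be accepted.
    q_values = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values.append(running_min)
    q_values.reverse()

    return [(score, q) for (score, is_decoy), q in zip(ranked, q_values) if not is_decoy]


# Toy search results: three targets and two decoys.
psms = [(10, False), (9, False), (8, True), (7, False), (6, True)]
for score, q in target_decoy_qvalues(psms):
    print(score, round(q, 3))
```

Filtering at a chosen q-value then gives a score cut-off with an estimated global error rate, which a single-spectrum E-value alone does not provide.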

Page 41: Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)

Multiplexed spectra

Middle-down proteomics

X-linked spectra

PTMs: Phosphorylation - multiple modifications and sites

PTM cross-talk elucidation

Summary: Challenges…