Scoring Functions and Their Use in the Identification of Peptides in Mass Spectrometry - BioinfoSummer 2012 (Eugene Kapp)

Scoring functions and their use in the identification of peptides in mass spectrometry

Eugene A Kapp

Bioinformatics, Walter & Eliza Hall Institute of Medical Research

Bioinformatics Summer Course, Dec 2012

Shotgun proteomics: peptide identification methods

BLAST / FASTA

• Sequence against sequence
  • Can be used to find weak / distant similarity
  • Can make gapped alignments

MS/MS-based ID

• Mass & intensity values against sequence
  • Looking for identity or near identity
  • Generally, short peptides

[Figure: fragmentation of a four-residue peptide (side chains R1 to R4) at the amide bond between residues 3 and 4.
b ion formation: the N-terminal fragment retains the charge (b3); the neutral is pumped away by the vacuum system.
and/or
y ion formation: the C-terminal fragment retains the charge (+H, y1); the neutral is pumped away by the vacuum system.]
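The b and y ion series can be sketched numerically as below. This is a minimal illustration, assuming an unmodified peptide, monoisotopic residue masses, and the usual conventions that a b ion is the sum of its residues plus the charge-carrying protons and a y ion additionally carries one water. The function name `fragment_mz` is hypothetical.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276   # mass of a proton (Da)
WATER = 18.010565   # monoisotopic mass of H2O (Da)


def fragment_mz(peptide, charge=1):
    """Return (b_ions, y_ions): m/z lists for b1..b(n-1) and y1..y(n-1).

    Assumption: no modifications, all fragments at the same charge.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        b_neutral = sum(masses[:i])            # first i residues
        y_neutral = sum(masses[-i:]) + WATER   # last i residues + H2O
        b_ions.append((b_neutral + charge * PROTON) / charge)
        y_ions.append((y_neutral + charge * PROTON) / charge)
    return b_ions, y_ions


b, y = fragment_mz("PEPTIDE")
print([round(m, 4) for m in b])
print([round(m, 4) for m in y])
```

For singly charged fragments, each complementary pair b(i) and y(n-i) sums to the peptide mass plus two protons, which is a handy sanity check.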

Proton Mobility

(zpre = precursor charge state)

Mobile: zpre > #Arg + #Lys + #His
Partially mobile: zpre ≤ #Arg + #Lys + #His and zpre > #Arg
Non-mobile: zpre ≤ #Arg

For peptides with non-mobile protons, fragmentation tends to proceed via charge-remote mechanisms. MS/MS spectra will be dominated by a few ions, typically:

• C-terminal side of D, E
• N-terminal side of P
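The three classes can be written as a small function. This is a sketch only: the name `proton_mobility` is hypothetical, and the handling of the boundary cases (precursor charge exactly equal to a residue count goes to the less mobile class) is an assumption, chosen so that the two example peptides in these slides fall into the classes they are shown under.

```python
def proton_mobility(sequence, charge):
    """Classify a peptide ion as mobile / partially mobile / non-mobile.

    sequence: one-letter amino acid string
    charge:   precursor charge state (zpre)
    Assumption: equality with a residue count maps to the less mobile class.
    """
    n_arg = sequence.count("R")
    n_basic = n_arg + sequence.count("K") + sequence.count("H")
    if charge > n_basic:
        return "mobile"
    if charge > n_arg:
        return "partially mobile"
    return "non-mobile"


print(proton_mobility("VFIMDNCEELIPEYLNFIR", 2))   # one Arg, no Lys/His
print(proton_mobility("RVFIMDNCEELIPEYLNFIR", 2))  # two Arg
```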

Peptide Information Content

Spectral information content: “Mobile” proton

[Figure: MS/MS spectrum of VFIMDNCEELIPEYLNFIR (++; modifications labelled ox and Pe), relative abundance (0 to 100) vs m/z (400 to 2000). Labelled fragment ions: b10, b11, y4, y5, y6, y8, y9.]

• nP cleavage
• metox loss
• cP cleavage

Spectral information content: “Non-mobile” proton

[Figure: MS/MS spectrum of RVFIMDNCEELIPEYLNFIR (++; modifications labelled ox and Pe), relative abundance (0 to 100) vs m/z (400 to 2000). Labelled peaks: b6, y6, y8, y11, y14, MDNCE, and the neutral losses -CH3SOH, -Pe and -(CH3SOH + Pe).]

• metox loss
• Pe loss
• cD cleavage
• cE cleavage
• nP cleavage

Aims of MS/MS Scoring Functions (1)

[Figure: three histograms of number of spectra vs NXCorr, each comparing the "Correct sequence" and "Randomised" score distributions:
  Mobile Proton 2+ (1268 unique peptides): d' = 0.55
  Partially Mobile Proton 2+ (2223 unique peptides): d' = 0.50
  Non-Mobile Proton 2+ (264 unique peptides): d' = 0.32]


[Figure, panel A: ROC curves (sensitivity vs 1 - specificity) for a tryptic search. Kapp et al., Proteomics 2005.]

Scoring function       AUC
Mascot Ion score       0.98
PeptideProphet         0.96
Sonar                  0.94
Tandem                 0.93
Spectrummill (tag)     0.91
Sequest XCorr          0.91
Spectrummill           0.86


Types of MS/MS Scoring Functions

• Statistical: peptide or fragment ion frequency statistics (Mascot/Andromeda) OR Bayesian model (xxx)

• Non-statistical: correlation, dot-product -> raw score (SEQUEST/Comet)

• Blend: raw score -> E-value (X!Tandem)

SpectrumMill, GutenTag, MyriMatch, Digger, ProteinPilot, Sorcerer, pFind, Peaks, ProteinLynxGS, MSGF, Inspect, OMSSA etc.

X!Tandem MS/MS Scoring

[Figure: the spectrum intensities (0% to 100%) multiplied by the predicted peaks (1 or 0) give the similar peaks (y/b ions).]

\mathrm{Score} = \sum_{i=0}^{n} I_i \, P_i

where I_i are the spectrum intensities and P_i indicates whether peak i is predicted (1 or 0) as a y or b ion.

X!Tandem’s preliminary score is a dot product of the acquired and model spectra. Because only similar peaks are considered, this is the sum of the intensities of the matched y and b ions.

Image courtesy of Proteome Software
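A minimal sketch of this preliminary score, assuming peaks are given as (m/z, intensity) pairs and that a peak counts as "similar" when it lies within a fixed tolerance of any predicted y or b ion. The function name, the tolerance value and the example numbers are illustrative assumptions, not X!Tandem's actual implementation.

```python
def preliminary_score(observed_peaks, predicted_mz, tol=0.5):
    """Dot product of an acquired spectrum with a 0/1 model spectrum.

    observed_peaks: list of (mz, intensity) tuples
    predicted_mz:   m/z values of the predicted y and b ions
    tol:            matching tolerance in m/z units (assumed value)
    """
    score = 0.0
    for mz, intensity in observed_peaks:
        # P_i is 1 when the peak matches a predicted ion, otherwise 0.
        if any(abs(mz - p) <= tol for p in predicted_mz):
            score += intensity
    return score


# Made-up example: two of the three peaks match predicted ions.
peaks = [(175.1, 40.0), (288.2, 100.0), (500.0, 25.0)]
predicted = [175.119, 288.203, 401.287]
print(preliminary_score(peaks, predicted))
```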

X!Tandem Hyperscore

[Figure: the spectrum intensities (0% to 100%) multiplied by the predicted peaks (1 or 0) give the similar peaks (y/b ions).]

\mathrm{HyperScore} = \left( \sum_{i=0}^{n} I_i \, P_i \right) \cdot N_b! \cdot N_y!

X!Tandem modifies the preliminary score by multiplying it by the factorials of the number of b ions assigned (N_b!) and the number of y ions assigned (N_y!). The use of factorials is based on the hypergeometric distribution.
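As a sketch of the formula, assuming the intensities of the matched b and y ions have already been collected into two lists (the function name and the example values are made up for illustration):

```python
from math import factorial


def hyperscore(matched_b, matched_y):
    """Preliminary score times N_b! times N_y!.

    matched_b, matched_y: intensities of the assigned b and y ions.
    """
    preliminary = sum(matched_b) + sum(matched_y)
    return preliminary * factorial(len(matched_b)) * factorial(len(matched_y))


# 2 matched b ions and 3 matched y ions: 150 * 2! * 3!
print(hyperscore([10.0, 20.0], [30.0, 40.0, 50.0]))
```

The factorial terms reward having many assigned ions in both series far more strongly than a few intense matches do.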


Histogram of hyperscores

[Figure: histogram of # results vs hyperscore (0 to 100); the bulk of the distribution is labelled incorrect IDs.]

Next, X!Tandem makes a histogram of all the hyperscores for all the peptides in the database that might match this spectrum.

For example, in this figure, 52 peptides were found with a hyperscore of 19, and one peptide with a hyperscore of 83.

X!Tandem assumes that the peptide with the highest hyperscore is correct, and all others are incorrect.

Log histogram

[Figure: upper panel, histogram of # results vs hyperscore (0 to 100) with the right side colored; lower panel, log(# results) (0 to 4) vs hyperscore (20 to 50).]

If the data on the right side of the histogram (colored in the upper figure) are log-transformed, they fall on a straight line.

A straight line is the expected result from a statistical argument that assumes the incorrect results are random.

Note: this histogram is calculated independently for each spectrum.

Significant scores

[Figure: upper panel, histogram of # results vs hyperscore (0 to 100); lower panel, log(# results) (0 to 4) vs hyperscore (20 to 50), with the region beyond the intersection marked significant.]

X!Tandem has already assumed that the top hyperscore is the only possible correct match.

This match is significant if it is greater than the point at which the straight line through the log data intersects the log(#results)=0 line.

Any hyperscores greater than this are unlikely to have arisen by chance.
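The histogram, log transform, and significance threshold can be sketched as follows. This is an illustration under stated assumptions, not X!Tandem's code: hyperscores are binned to integers, the "right side" is taken to be every bin at or above the most populated one, the log is base 10, and the fit is ordinary least squares. All function names and the example data are hypothetical.

```python
from collections import Counter
from math import log10


def fit_log_histogram(hyperscores):
    """Fit log10(count) = slope * score + intercept to the right-hand side
    of the hyperscore histogram, excluding the top hit (assumed correct)."""
    incorrect = sorted(hyperscores)[:-1]          # drop the best match
    counts = Counter(round(s) for s in incorrect)  # integer bins
    mode = max(counts, key=counts.get)             # most populated bin
    xs = [s for s in sorted(counts) if s >= mode]
    ys = [log10(counts[s]) for s in xs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def significance_threshold(slope, intercept):
    """Hyperscore at which the fitted line crosses log(#results) = 0."""
    return -intercept / slope


# Made-up scores: counts fall tenfold every 5 units, plus one top hit at 83.
scores = [20] * 1000 + [25] * 100 + [30] * 10 + [35] + [83]
slope, intercept = fit_log_histogram(scores)
threshold = significance_threshold(slope, intercept)
print(slope, intercept, threshold)
print("top hit significant:", max(scores) > threshold)
```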

X!Tandem E-value

[Figure: upper panel, histogram of # results vs hyperscore (0 to 100); lower panel, log(# results) (-10 to 6) vs hyperscore (0 to 100) with the red line extrapolated to the top hyperscore; E-value = e-8.2.]

The E-value expresses just how unlikely a greater hyperscore is.

X!Tandem calculates the E-value by extrapolating the red line of the log histogram.

For the example shown, the number of matches expected by chance at a hyperscore of 83 is read off where the extrapolated red line reaches 83. The log of this value is -8.2, which gives the E-value shown.
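The extrapolation step is just the fitted line evaluated at the top hyperscore. In the sketch below the slope and intercept are hypothetical values (they are not read from the figure), the log is assumed to be base 10, and the function names are made up.

```python
def log_e_value(hyperscore, slope, intercept):
    """log10 of the expected number of random matches at this hyperscore,
    obtained by extrapolating the line fitted to the log histogram."""
    return slope * hyperscore + intercept


def e_value(hyperscore, slope, intercept):
    """E-value corresponding to the extrapolated log value (base 10 assumed)."""
    return 10.0 ** log_e_value(hyperscore, slope, intercept)


# Hypothetical fit: log10(#results) = -0.2 * hyperscore + 7.0
print(log_e_value(83, -0.2, 7.0))
print(e_value(83, -0.2, 7.0))
```

A more negative log value means the top hyperscore lies further beyond what random matches are expected to reach.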

Why is Probability based scoring important?

• Human (even expert) judgment is subjective and can be unreliable

• Standard, statistical tests of significance can be applied to the results

• Arbitrary scoring schemes are susceptible to false positives.

Can we calculate a probability that a match is correct?

• Yes, if it is a test sample and you know what the answer should be
  – Matches to the expected protein sequences are defined to be correct
  – Matches to other sequences are defined to be wrong

• If the sample is an unknown, then you have to define “correct” very carefully:
  – The best match in the database?
  – The best match out of all possible peptides?
  – The peptide sequence that is uniquely and completely defined by the MS data?

Probability model: Andromeda (MaxQuant) – Theoretical model

P = \sum_{k=n}^{N} \binom{N}{k} \, p^{k} \, (1-p)^{N-k}

N is the # of possible fragment ion matches (peplen * 2),
n is the # of observed fragment ion matches,
k is the # of matches (summed from n to N)

p (probability of a match) = peak depth / numbins

where peak depth = # of peaks per 100 Da window (max 10)
and numbins = 100 / (2 * frag_ion_tol)

Score = -10 * log10(P)

Binomial, hypergeometric, Poisson, or EVD?
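The binomial score above translates directly into a few lines of code. This is a sketch of the formula as written on the slide, not MaxQuant's implementation; the function names and the example values (peak depth 4, fragment tolerance 0.5 Da) are assumptions for illustration.

```python
from math import comb, log10


def match_probability(peak_depth, frag_ion_tol):
    """p = peak depth / numbins, with numbins = 100 / (2 * frag_ion_tol)."""
    numbins = 100.0 / (2.0 * frag_ion_tol)
    return peak_depth / numbins


def binomial_score(peplen, n_matched, peak_depth, frag_ion_tol):
    """Score = -10 * log10(P), where P is the chance of n or more matches."""
    N = 2 * peplen                       # possible fragment ion matches
    p = match_probability(peak_depth, frag_ion_tol)
    P = sum(comb(N, k) * p ** k * (1 - p) ** (N - k)
            for k in range(n_matched, N + 1))
    return -10.0 * log10(P)


# 10-residue peptide, 4 peaks kept per 100 Da window, 0.5 Da tolerance.
for n in (2, 4, 8):
    print(n, round(binomial_score(10, n, 4, 0.5), 1))
```

With everything else fixed, the score rises as more fragment ions are matched, and it is 0 when no matches are required (P = 1).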

Probability model: Digger – empirical NULL model

Table for ONE candidate (decoy) peptide: ion series (columns) by level (rows)

Level      a      b   b-98   b-18   b-17    b++ b++-98 b++-18 b++-17      y   y-98   y-18   y-17    y++ y++-98 y++-18 y++-17
L1         0      2      0      1      0      0      0      0      0      3      0      1      0      0      0      0      0
L2         0      3      0      1      0      0      0      0      0      3      0      1      0      0      0      0      0
L3         1      3      0      2      0      0      0      0      0      3      0      1      0      0      0      0      0
L4         1      3      0      2      1      0      0      0      0      4      0      1      1      1      0      0      0
L5         1      3      0      2      1      0      0      0      0      4      0      2      1      1      0      0      1
L6         1      3      0      2      1      0      0      0      0      5      0      2      1      1      0      0      1
L7         1      3      0      2      1      0      0      0      0      5      0      2      1      1      0      1      1
L8         1      3      0      2      1      0      0      0      0      5      0      2      2      1      0      1      2
L9         1      3      0      2      2      0      0      0      0      5      0      2      2      1      0      1      2
L10        1      3      0      2      2      0      0      0      0      5      0      2      2      1      0      1      3

1) Repeat for ALL candidate (decoy) peptides. For each cell, calculate the slope & intercept based on all decoy peptides.

[Figure: Score (-10*LgP) vs # of fragment ion matches (0 to 7).]

2) Extrapolate for more matches…

Limitations of E-values or P-values

1) P-values or E-values are not well suited for the analysis of large-scale datasets: they do not allow estimation of global error rates (FDR) as a function of filtering threshold (need for multiple testing correction).

2) They do not directly incorporate additional useful information (e.g., # of missed cleavages, mass accuracy, retention time etc.)

What makes a good decoy?

1) For all target “real” peptides generate decoys “on-the-fly”: easy, built-in.

2) Reverse internal peptide residues; if palindrome then randomise.

LGEDTLISYR -> LYSILTDEGR

3) I/L residues taken into account.

4) PTMs are kept constant but shifted internally within peptide.

5) Similar implementation in Crux (Univ. of Washington: Noble, MacCoss).
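Point 2 can be sketched in a few lines. The function below keeps the first and last residues in place, reverses the internal residues, and shuffles them when the internal stretch is a palindrome. It does not handle I/L equivalence or PTM shifting (points 3 and 4), and the name `make_decoy` and the fixed random seed are assumptions for illustration.

```python
import random


def make_decoy(peptide, rng=None):
    """Reverse the internal residues of a peptide, keeping both termini.

    If the internal stretch reads the same reversed (a palindrome), shuffle
    it instead until it differs from the target.
    """
    if rng is None:
        rng = random.Random(0)          # fixed seed: reproducible decoys
    if len(peptide) < 3:
        return peptide                   # nothing internal to rearrange
    first, internal, last = peptide[0], peptide[1:-1], peptide[-1]
    decoy_internal = internal[::-1]
    if decoy_internal == internal and len(set(internal)) > 1:
        residues = list(internal)
        while "".join(residues) == internal:
            rng.shuffle(residues)
        decoy_internal = "".join(residues)
    return first + decoy_internal + last


print(make_decoy("LGEDTLISYR"))   # the example from the slide
print(make_decoy("KAGAR"))        # internal "AGA" is a palindrome
```

Keeping the termini fixed preserves the tryptic C-terminal residue, and the decoy has exactly the same composition and precursor mass as its target.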

Decoy strategies?

1) Separate or concatenated target-decoy sequence database?
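Whichever layout is chosen, the decoy hits are what allow an error rate to be estimated at a given score threshold. The sketch below assumes a concatenated target-decoy search and the convention FDR ≈ #decoys / #targets at or above the threshold. Other conventions exist (for example 2 * decoys / (targets + decoys)), and the function name and example scores are made up.

```python
def fdr_at_threshold(psms, threshold):
    """Estimate the global FDR among matches scoring >= threshold.

    psms: list of (score, is_decoy) tuples from a concatenated search.
    Assumption: FDR is estimated as decoy count / target count.
    """
    targets = sum(1 for score, is_decoy in psms
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms
                 if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


psms = [(90, False), (85, False), (80, False), (75, True),
        (70, False), (60, True), (55, False)]
for t in (80, 70, 55):
    print(t, fdr_at_threshold(psms, t))
```

Scanning the threshold from high to low gives the FDR as a function of the filtering threshold, which a single-spectrum E-value cannot provide on its own.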

Decoy strategies: Imperfection of scoring functions

[Figure (SEQUEST), two panels: high scores for short (random) peptides; high scores for larger search space.]

A) Covariance dependency: some spectra score well regardless.

B) Reduce co-varying features by using post-processing tools (e.g. PeptideProphet, Percolator, q-ranker etc.) which combine multiple different features.

Summary: Statistical approaches

Single spectrum analyses (individual probabilities)

• Expectation values, similar to sequence similarity searching (BLAST)

Global analysis (individual and global error rates)

• Target-decoy strategy for global FDR
• Distribution modeling (e.g. PeptideProphet, Percolator) for local and global FDR estimation.

Summary: Challenges…

Multiplexed spectra

Middle-down proteomics

X-linked spectra

PTMs: Phosphorylation - multiple modifications and sites

PTM cross-talk elucidation
