Scoring functions and their use in the identification of peptides in mass spectrometry
Eugene A Kapp
Bioinformatics, Walter & Eliza Hall Institute of Medical Research
Bioinformatics Summer Course, Dec 2012
Shotgun proteomics: peptide identification methods
BLAST / FASTA
• Sequence against sequence
• Can be used to find weak / distant similarity
• Can make gapped alignments
MS/MS-based ID
• Mass & intensity values against sequence
• Looking for identity or near identity
• Generally, short peptides
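[Figure: backbone of a protonated four-residue peptide, H2N-CH(R1)-CO-NH-CH(R2)-CO-NH-CH(R3)-CO-NH-CH(R4)-COOH, cleaved at the third peptide bond. b ion formation: the N-terminal fragment (R1 to R3) keeps the charge and is detected as b3. y ion formation: the C-terminal fragment (R4) keeps the charge, with an added H, and is detected as y1. In either case (and/or) the neutral counterpart is pumped away by the vacuum system.]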
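The b and y ions in the figure can be predicted from sequence alone. The following is a minimal sketch, assuming singly charged fragments, unmodified residues and standard monoisotopic masses; the function name `fragment_ions` is hypothetical and not taken from any search engine.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728   # mass of a proton (Da)
WATER = 18.01056   # monoisotopic mass of H2O (Da)


def fragment_ions(peptide):
    """Return (b, y) lists of singly charged fragment m/z values.

    b[i-1] is the b_i ion (first i residues plus a proton);
    y[i-1] is the y_i ion (last i residues plus water plus a proton).
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(masses)
    b = [sum(masses[:i]) + PROTON for i in range(1, n)]
    y = [sum(masses[n - i:]) + WATER + PROTON for i in range(1, n)]
    return b, y


if __name__ == "__main__":
    # Four-residue example mirroring the figure: cleavage at the third
    # peptide bond gives the complementary pair b3 and y1.
    b, y = fragment_ions("PEAK")
    print("b3 = %.4f, y1 = %.4f" % (b[2], y[0]))
```

Complementary fragments are a useful sanity check: b_i plus y_(n-i) always equals the neutral peptide mass plus two protons.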
Proton Mobility
Mobile: zpre > #Arg + #Lys + #His
Partially mobile: #Arg < zpre ≤ #Arg + #Lys + #His
Non-mobile: zpre ≤ #Arg
(zpre is the precursor charge state.)
For peptides with non-mobile protons, fragmentation tends to proceed via charge-remote mechanisms. MS/MS spectra will be dominated by a few ions, typically from cleavage at:
• C-terminal side of D, E
• N-terminal side of P
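The classification rule can be written as a small function. This is a sketch only: the name `proton_mobility` is hypothetical, and the equality cases are assigned to the less mobile class, an assumption chosen so that the doubly charged, two-arginine example peptide used later in these slides comes out as non-mobile.

```python
def proton_mobility(peptide, charge):
    """Classify a precursor as mobile, partially mobile or non-mobile.

    peptide: amino acid sequence in one-letter code.
    charge:  precursor charge state (zpre on the slide).
    """
    n_arg = peptide.count("R")
    n_basic = n_arg + peptide.count("K") + peptide.count("H")
    if charge > n_basic:
        return "mobile"
    if charge > n_arg:
        return "partially mobile"
    return "non-mobile"


if __name__ == "__main__":
    for seq in ("VFIMDNCEELIPEYLNFIR", "RVFIMDNCEELIPEYLNFIR"):
        print(seq, "2+", proton_mobility(seq, 2))
```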
Peptide Information Content
Spectral information content: “Mobile” Proton
[Figure: MS/MS spectrum (relative abundance, 0 to 100, vs m/z, 400 to 2000) of the doubly charged (++) peptide VFIMDNCEELIPEYLNFIR, carrying the labels ox and Pe. Annotated fragment ions: y4, y5, y6, y8, y9, b10, b11.]
• nP cleavage
• metox loss
• cP cleavage
Spectral information content: “Non-mobile” proton
[Figure: MS/MS spectrum (relative abundance, 0 to 100, vs m/z, 400 to 2000) of the doubly charged (++) peptide RVFIMDNCEELIPEYLNFIR, carrying the labels ox and Pe. Annotated ions: b6, y6, y8, y11, y14 and the neutral losses -CH3SOH, -Pe and -(CH3SOH + Pe); the MDNCE stretch is marked.]
• metox loss
• Pe loss
• cD cleavage
• cE cleavage
• nP cleavage
Aims of MS/MS Scoring Functions (1)
[Figures: three histograms of number of spectra vs NXCorr, each comparing "Correct sequence" with "Randomised" matches for 2+ precursors.]
• Mobile Proton 2+ (1268 unique peptides): d' = 0.55
• Partially Mobile Proton 2+ (2223 unique peptides): d' = 0.50
• Non-Mobile Proton 2+ (264 unique peptides): d' = 0.32
Aims of MS/MS Scoring Functions (2)
[Figure A: ROC curves (sensitivity vs 1-specificity, both 0.0 to 1.0) for a tryptic search. Kapp et al. Proteomics 2005]
Area under the curve (AUC) per scoring function:
• Mascot Ion score (AUC=0.98)
• PeptideProphet (AUC=0.96)
• Sonar (AUC=0.94)
• Tandem (AUC=0.93)
• Spectrummill (tag) (AUC=0.91)
• Sequest XCorr (AUC=0.91)
• Spectrummill (AUC=0.86)
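To make the AUC figures concrete, here is a minimal sketch of how an area under the ROC curve can be computed from match scores and known correct/incorrect labels. It uses the rank (Mann-Whitney) formulation, which is an assumption about method and not necessarily how the cited study computed its values; `roc_auc` is a hypothetical name.

```python
def roc_auc(scores, is_correct):
    """Area under the ROC curve for a scoring function.

    Equals the probability that a randomly chosen correct match scores
    higher than a randomly chosen incorrect one (ties count as half).
    """
    correct = [s for s, c in zip(scores, is_correct) if c]
    incorrect = [s for s, c in zip(scores, is_correct) if not c]
    if not correct or not incorrect:
        raise ValueError("need at least one correct and one incorrect match")
    wins = 0.0
    for p in correct:
        for q in incorrect:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(correct) * len(incorrect))


if __name__ == "__main__":
    # Toy data: three correct matches and one incorrect match.
    print(roc_auc([0.9, 0.8, 0.7, 0.3], [True, True, False, True]))
```

An AUC of 1.0 means the score separates correct from incorrect matches perfectly; 0.5 means it does no better than chance.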
Types of MS/MS Scoring Functions
• Statistical: peptide or fragment ion frequency statistics (Mascot/Andromeda) OR Bayesian model (xxx)
• Non-statistical: correlation, dot-product -> raw score (SEQUEST/Comet)
• Blend: raw score -> E-value (X!Tandem)
SpectrumMill, GutenTag, MyriMatch, Digger, ProteinPilot, Sorcerer, pFind, Peaks, ProteinLynxGS, MSGF, Inspect, OMSSA etc.
X!Tandem MS/MS Scoring
[Figure: acquired spectrum compared with the model spectrum on a 0% to 100% intensity scale; the similar peaks (y/b ions) are highlighted. Image courtesy of Proteome Software]
$\mathrm{Score}_{y,b} = \sum_{i=0}^{n} I_i \cdot P_i$
where $I_i$ are the spectrum intensities and $P_i$ indicates whether peak $i$ is predicted (1 or 0).
X!Tandem’s preliminary score is a dot product of the acquired and model spectra. Because only similar peaks are considered, this is the sum of the intensities of the matched y and b ions.
X!Tandem Hyperscore
[Figure: acquired spectrum compared with the model spectrum on a 0% to 100% intensity scale; the similar peaks (y/b ions) are highlighted. Image courtesy of Proteome Software]
$\mathrm{HyperScore} = \left( \sum_{i=0}^{n} I_i \cdot P_i \right) \cdot N_b! \cdot N_y!$
where $I_i$ are the spectrum intensities, $P_i$ indicates whether peak $i$ is predicted (1 or 0), and $N_b$ and $N_y$ are the numbers of b and y ions assigned.
X!Tandem modifies the preliminary score by multiplying by N factorial for the number of b and y ions assigned. The use of factorials is based on the hypergeometric distribution.
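The two scores described on these slides translate directly into a few lines of code. This is a sketch of the formulas as stated here, not of X!Tandem's source code; the function names and the toy peak list are hypothetical.

```python
from math import factorial


def preliminary_score(intensities, predicted):
    """Dot product of acquired intensities (I_i) with the model spectrum (P_i).

    predicted[i] is 1 if peak i matches a predicted b or y ion, else 0,
    so the result is the summed intensity of the matched peaks.
    """
    return sum(i * p for i, p in zip(intensities, predicted))


def hyperscore(intensities, predicted, n_b, n_y):
    """Preliminary score multiplied by N_b! and N_y!."""
    return preliminary_score(intensities, predicted) * factorial(n_b) * factorial(n_y)


if __name__ == "__main__":
    intensities = [10, 40, 25, 5, 60]   # toy acquired peak intensities
    predicted = [1, 0, 1, 0, 1]         # which peaks the model spectrum predicts
    print(preliminary_score(intensities, predicted))
    print(hyperscore(intensities, predicted, n_b=1, n_y=2))
```

Because factorials grow quickly, a candidate that explains a few more ions is rewarded far more strongly than one that merely matches more intense peaks.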
Histogram of hyperscores
[Figure: histogram of # results (0 to 60) vs hyperscore (0 to 100); every match except the top-scoring one is labelled as an incorrect ID. Image courtesy of Proteome Software]
Next, X!Tandem makes a histogram of all the hyperscores for all the peptides in the database that might match this spectrum.
For example, in this figure, 52 peptides were found with a hyperscore of 19, and one peptide with a hyperscore of 83.
X!Tandem assumes that the peptide with the highest hyperscore is correct, and all others are incorrect.
Log histogram
[Figure: upper panel, histogram of # results (0 to 60) vs hyperscore (0 to 100) with the right side colored; lower panel, log(# results) (0 to 4) vs hyperscore (20 to 50). Image courtesy of Proteome Software]
If the data on the right side of the histogram (colored in the upper figure) are log-transformed, they fall on a straight line.
A straight line is the expected result from a statistical argument that assumes the incorrect results are random.
Note: this histogram is calculated independently for each spectrum.
Significant scores
[Figure: upper panel, histogram of # results (0 to 60) vs hyperscore (0 to 100); lower panel, log(# results) (0 to 4) vs hyperscore (20 to 50) with the fitted straight line and the region marked "significant". Image courtesy of Proteome Software]
X!Tandem has already assumed that the top hyperscore is the only possible correct match.
This match is significant if it is greater than the point at which the straight line through the log data intersects the log(#results)=0 line.
Any hyperscores greater than this are unlikely to have arisen by chance.
X!Tandem E-value
[Figure: upper panel, histogram of # results (0 to 60) vs hyperscore (0 to 100); lower panel, log(# results) (-10 to 6) vs hyperscore (0 to 100) with the fitted red line extrapolated to the top hyperscore, annotated E-value = $e^{-8.2}$.]
The E-value expresses just how unlikely a greater hyperscore is.
X!Tandem calculates the E-value by extrapolating the red line of the log histogram.
For the example shown, the number of results expected by chance at a hyperscore of 83 is read off where the red line crosses 83. The log of this number is -8.2, so the E-value is $e^{-8.2}$.
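The histogram, log-transform, line fit and extrapolation steps can be sketched as follows. This is a minimal illustration under stated assumptions, not X!Tandem's implementation: hyperscores are binned to integers, the fit starts at the histogram mode, natural logarithms are used (the slide writes the result as a power of e), and the function names are hypothetical.

```python
import math
from collections import Counter


def fit_log_histogram(hyperscores):
    """Fit ln(# results) = a + b * hyperscore to the right side of the histogram.

    The single top score is set aside as the presumed correct match; all
    other scores are treated as incorrect. Returns (a, b).
    """
    rest = sorted(hyperscores)[:-1]                 # drop the top score
    counts = Counter(int(round(s)) for s in rest)   # histogram with bin width 1
    mode = max(counts, key=counts.get)              # start of the right-hand side
    points = [(s, math.log(c)) for s, c in counts.items() if s >= mode]
    if len(points) < 2:
        raise ValueError("need at least two bins to fit a line")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def significance_threshold(a, b):
    """Hyperscore at which the fitted line crosses ln(# results) = 0."""
    return -a / b


def e_value(score, a, b):
    """Expected number of chance results at this hyperscore (extrapolated)."""
    return math.exp(a + b * score)


if __name__ == "__main__":
    # Toy spectrum: counts halve with every hyperscore unit above the mode.
    scores = [8] * 2 + [9] * 5 + [10] * 16 + [11] * 8 + [12] * 4 + [13] * 2 + [14] + [30]
    a, b = fit_log_histogram(scores)
    print("threshold:", significance_threshold(a, b))
    print("E-value of top hit:", e_value(max(scores), a, b))
```

The fit is repeated for every spectrum, which matches the note on the log histogram slide that the histogram is calculated independently for each spectrum.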
Why is Probability based scoring important?
• Human (even expert) judgment is subjective and can be unreliable
• Standard, statistical tests of significance can be applied to the results
• Arbitrary scoring schemes are susceptible to false positives.
Can we calculate a probability that a match is correct?
• Yes, if it is a test sample and you know what the answer should be
– Matches to the expected protein sequences are defined to be correct
– Matches to other sequences are defined to be wrong
• If the sample is an unknown, then you have to define “correct” very carefully:
– The best match in the database?
– The best match out of all possible peptides?
– The peptide sequence that is uniquely and completely defined by the MS data?
Probability model: Andromeda (MaxQuant) – Theoretical model
$P = \binom{N}{k} \, p^{k} \, (1-p)^{N-k}$
N is the # of possible fragment ion matches (peplen * 2), n is the # of observed fragment ion matches, k is the # of matches.
p (probability of a match) = peak depth / numbins
where peak depth = # of peaks per 100 Da window (max 10) and numbins = 100 / (2 * frag_ion_tol)
Score = -10 * log10(P)
Binomial, Hypergeometric, Poisson, or EVD?
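A worked example helps show how the pieces of this score fit together. The sketch below follows the point-probability formula and definitions given on the slide (N = 2 × peptide length, p = peak depth / numbins); it is not MaxQuant code, published descriptions of Andromeda may differ in detail, and the function names are hypothetical.

```python
from math import comb, log10


def match_probability(peak_depth, frag_ion_tol):
    """Chance that a random fragment mass hits a peak.

    peak_depth:   number of peaks kept per 100 Da window (max 10).
    frag_ion_tol: fragment ion tolerance in Da.
    """
    numbins = 100 / (2 * frag_ion_tol)
    return peak_depth / numbins


def binomial_score(peplen, k, peak_depth, frag_ion_tol):
    """Score = -10 * log10(P) with P = C(N, k) * p^k * (1 - p)^(N - k)."""
    n_possible = 2 * peplen                      # N on the slide
    p = match_probability(peak_depth, frag_ion_tol)
    prob = comb(n_possible, k) * p ** k * (1 - p) ** (n_possible - k)
    return -10 * log10(prob)


if __name__ == "__main__":
    # 10-residue peptide, 0.5 Da tolerance, 10 peaks per 100 Da: p = 0.1.
    for k in (4, 8, 12):
        print(k, "matches ->", round(binomial_score(10, k, 10, 0.5), 2))
```

With p = 0.1 and N = 20, only two matches are expected by chance, so each additional matched fragment beyond that raises the score steeply.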
Probability model: Digger – empirical NULL model
Ion-series (columns) by level (rows), for ONE candidate (decoy) peptide:
Level  a  b  b-98  b-18  b-17  b++  b++-98  b++-18  b++-17  y  y-98  y-18  y-17  y++  y++-98  y++-18  y++-17
L1     0  2     0     1     0    0       0       0       0  3     0     1     0    0       0       0       0
L2     0  3     0     1     0    0       0       0       0  3     0     1     0    0       0       0       0
L3     1  3     0     2     0    0       0       0       0  3     0     1     0    0       0       0       0
L4     1  3     0     2     1    0       0       0       0  4     0     1     1    1       0       0       0
L5     1  3     0     2     1    0       0       0       0  4     0     2     1    1       0       0       1
L6     1  3     0     2     1    0       0       0       0  5     0     2     1    1       0       0       1
L7     1  3     0     2     1    0       0       0       0  5     0     2     1    1       0       1       1
L8     1  3     0     2     1    0       0       0       0  5     0     2     2    1       0       1       2
L9     1  3     0     2     2    0       0       0       0  5     0     2     2    1       0       1       2
L10    1  3     0     2     2    0       0       0       0  5     0     2     2    1       0       1       3
1) For ALL candidate (decoy) peptides: for each cell, calculate slope & intercept based on all decoy peptides.
[Figure: Score (-10*LgP), -10 to 40, vs # of fragment ion matches, 0 to 7.]
2) Extrapolate for more matches…
Limitations of E-values or P-values
1) P-values or E-values are not well suited for the analysis of large-scale datasets: they do not allow estimation of global error rates (FDR) as a function of filtering threshold (need for multiple testing correction).
2) They do not directly incorporate additional useful information (e.g., # of missed cleavages, mass accuracy, retention time etc.)
What makes a good decoy?
1) For all target “real” peptides generate decoys “on-the-fly”: easy, built-in.
2) Reverse internal peptide residues; if palindrome then randomise.
LGEDTLISYR -> LYSILTDEGR
3) I/L residues taken into account.
4) PTMs are kept constant but shifted internally within the peptide.
5) Similar implementation in Crux (Univ. of Washington; Noble, MacCoss).
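The reversal rule can be illustrated with a short function. This is a sketch under assumptions, not the implementation of any particular tool: both terminal residues stay in place (as in the LGEDTLISYR example), I and L are treated as equivalent when checking whether the decoy differs from the target, the inner residues are shuffled when reversal changes nothing, and PTM handling is left out. The name `make_decoy` and the retry limit are hypothetical.

```python
import random


def make_decoy(peptide, rng=None, max_attempts=20):
    """Generate a decoy by reversing the internal residues of a peptide.

    If the reversed peptide is indistinguishable from the target (a
    palindromic interior, with I and L counted as the same residue), the
    interior is shuffled instead, retrying up to max_attempts times.
    """
    rng = rng or random.Random(0)   # fixed seed keeps decoys reproducible
    if len(peptide) < 3:
        return peptide              # nothing internal to reverse
    first, inner, last = peptide[0], peptide[1:-1], peptide[-1]
    decoy = first + inner[::-1] + last

    def same_mass_sequence(a, b):
        # I and L are isobaric, so they cannot be told apart by mass.
        return a.replace("I", "L") == b.replace("I", "L")

    attempts = 0
    while same_mass_sequence(decoy, peptide) and attempts < max_attempts:
        residues = list(inner)
        rng.shuffle(residues)
        decoy = first + "".join(residues) + last
        attempts += 1
    return decoy


if __name__ == "__main__":
    print(make_decoy("LGEDTLISYR"))   # example from the slide
    print(make_decoy("AGTTGK"))       # palindromic interior, gets shuffled
```

Keeping the termini fixed preserves the tryptic C-terminal residue, and keeping the composition fixed preserves the precursor mass, so each decoy competes for the same spectra as its target.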
Decoy strategies?
1) Separate or concatenated target-decoy sequence database?
Decoy strategies: Imperfection of scoring functions
[Figure: sequest scores. Panel labels: High scores for short (random) peptides; High scores for larger search space.]
A) Covariance dependency – some spectra score well regardless
B) Reduce co-varying features by using post-processing tools (e.g. PeptideProphet, Percolator, q-ranker etc.) which combine multiple different features.
Summary: Statistical approaches
Single spectrum analyses (individual probabilities)
• Expectation values, similar to sequence similarity searching (BLAST)
Global analysis (individual and global error rates)
• Target-decoy strategy for global FDR
• Distribution modeling (e.g. PeptideProphet, Percolator) for local and global FDR estimation
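The target-decoy strategy for global FDR can be sketched in a few lines. The estimator used here, decoy hits divided by target hits above a score threshold, is one common simple choice and an assumption on my part; the slides do not say which estimator is intended, and the function names are hypothetical.

```python
def fdr_at_threshold(target_scores, decoy_scores, threshold):
    """Estimate the global FDR among target matches scoring >= threshold.

    Decoy matches above the threshold stand in for the number of
    incorrect target matches expected above the same threshold.
    """
    n_target = sum(1 for s in target_scores if s >= threshold)
    n_decoy = sum(1 for s in decoy_scores if s >= threshold)
    if n_target == 0:
        return 0.0
    return n_decoy / n_target


def threshold_for_fdr(target_scores, decoy_scores, max_fdr):
    """Lowest target score whose estimated FDR does not exceed max_fdr."""
    for threshold in sorted(set(target_scores)):
        if fdr_at_threshold(target_scores, decoy_scores, threshold) <= max_fdr:
            return threshold
    return None   # no threshold satisfies the requested FDR


if __name__ == "__main__":
    targets = [50, 45, 40, 30, 20, 10]   # toy target match scores
    decoys = [35, 15, 5]                 # toy decoy match scores
    print(fdr_at_threshold(targets, decoys, 25))
    print(threshold_for_fdr(targets, decoys, 0.10))
```

Scanning thresholds this way gives the error rate as a function of the filtering threshold, which is exactly what single-spectrum P-values and E-values do not provide.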
Summary: Challenges…
Multiplexed spectra
Middle-down proteomics
X-linked spectra
PTMs: Phosphorylation - multiple modifications and sites
PTM cross-talk elucidation