1
MS/MS Database Searching
Jimmy Eng
Day 1
October 16, 2006
2
Day 1 Lecture Topics
• Basic background & motivation
• Peptide fragmentation, nomenclature
• Peptide vs. tandem mass spectra
• Sequence database searching
– Databases
– Enzymes
– Modifications
• Interpretation of search results; manual validation
• Introduction to software tools
3
Protein Identification Strategy
Denatured protein complex → Peptides → 1D or 2D chromatographic separation of peptides (HPLC) → Mass Spec → Db search → Identify proteins in complex
4
TPP
[Diagram components: LC-MS/MS Data; mzXML file format; SEQUEST/COMET/Mascot/ProbID/SpectraST; pepXML file format; Pep3D; Qualscore; xINTERACT; PeptideProphet; XPRESS/ASAPRatio/Libra; ProteinProphet; protXML file format; SBEAMS; PeptideAtlas; Cytoscape; Gaggle…; XLink]
5
Single Stage MS
[Figure: protein → Digestion → peptides → Ionization → Mass Analysis → MS spectrum (m/z)]
6
Tandem MS
[Figure: protein → Digestion → peptides → Ionization → Isolation → Fragmentation (peptide fragments) → Mass Analysis; MS and MS/MS spectra (m/z)]
7
Mass vs. Intensity vs. Time
2D view: m/z, intensity
3D view: m/z, intensity, time
[Figure axes: m/z, intensity, time (scan #)]
8
Mass vs. Intensity vs. Time
[Figure: successive MS scans; axes: m/z, intensity, time (scan #)]
9
Mass vs. Intensity vs. Time
[Figure: successive MS scans; axes: m/z, intensity, time (scan #); label: 1000.2]
10
MS/MS Data Acquisition
1. Acquire full (MS) scan
2. Select an ion
3. Isolate ion
4. Fragment ion
[Figure: full MS scan followed by an MS/MS scan. Scan header: trypmyo01 #294 RT: 9.89 AV: 1 NL: 1.12E7 T: + c Full ms [300.00-1600.00]. Axes: Relative Abundance (0-100) vs. m/z (400-1600). Labeled peaks: 342.7, 464.4, 496.4, 528.3, 580.4, 618.7, 661.6, 704.2, 705.1, 799.9, 927.1, 952.3, 991.7, 992.6, 1128.2, 1289.8, 1387.1, 1485.0, 1541.3]
11
MS vs. MS/MS
[Figure: MS scans over time (scan #) with an MS/MS scan; axes: m/z, intensity]
12
2D view of an LC-MS experiment
You’ll learn all about Pep3D soon!
13
Amino Acids
Residue masses in Da.

Amino acid      3-letter  1-letter  Average     Monoisotopic
Glycine         Gly       G          57.0519     57.02146
Alanine         Ala       A          71.0788     71.03711
Serine          Ser       S          87.0782     87.03203
Proline         Pro       P          97.1167     97.05276
Valine          Val       V          99.1326     99.06841
Threonine       Thr       T         101.1051    101.04768
Cysteine        Cys       C         103.1388    103.00919
Leucine         Leu       L         113.1594    113.08406
Isoleucine      Ile       I         113.1594    113.08406
Asparagine      Asn       N         114.1038    114.04293
Aspartic acid   Asp       D         115.0886    115.02694
Glutamine       Gln       Q         128.1307    128.05858
Lysine          Lys       K         128.1741    128.09496
Glutamic acid   Glu       E         129.1155    129.04259
Methionine      Met       M         131.1926    131.04049
Histidine       His       H         137.1411    137.05891
Phenylalanine   Phe       F         147.1766    147.06841
Arginine        Arg       R         156.1875    156.10111
Tyrosine        Tyr       Y         163.1760    163.06333
Tryptophan      Trp       W         186.2132    186.07931
14
Average vs. Monoisotopic Mass
Monoisotopic mass
For example: DIGSESTEDQAMEDIK
Mono MH+: 1767.7594 Da
Avg MH+: 1768.8438 Da
Average mass – centroid of isotopic envelope
Charge state = 1 / Δm (Δm = m/z spacing between adjacent isotope peaks)
Difference in mass can be significant!
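The two MH+ values for DIGSESTEDQAMEDIK can be checked with a short calculation. This is a minimal sketch: the residue masses are the standard values for the residues in this peptide, while the masses used for water and for the added hydrogen are assumptions chosen to be consistent with the slide's numbers, and the function names are illustrative only.

```python
# Residue masses in Da as (average, monoisotopic); only residues needed here.
RESIDUES = {
    "G": (57.0519, 57.02146), "A": (71.0788, 71.03711),
    "S": (87.0782, 87.03203), "T": (101.1051, 101.04768),
    "I": (113.1594, 113.08406), "D": (115.0886, 115.02694),
    "Q": (128.1307, 128.05858), "K": (128.1741, 128.09496),
    "E": (129.1155, 129.04259), "M": (131.1926, 131.04049),
}
WATER = (18.01528, 18.01056)     # assumed average / monoisotopic mass of H2O
HYDROGEN = (1.00794, 1.00783)    # assumed mass of the hydrogen added for MH+


def mh_plus(peptide):
    """Return (average, monoisotopic) MH+ for a peptide sequence."""
    avg = sum(RESIDUES[aa][0] for aa in peptide) + WATER[0] + HYDROGEN[0]
    mono = sum(RESIDUES[aa][1] for aa in peptide) + WATER[1] + HYDROGEN[1]
    return avg, mono


def charge_from_isotope_spacing(delta_mz):
    """Charge state = 1 / (m/z spacing between adjacent isotope peaks)."""
    return round(1.0 / delta_mz)


avg, mono = mh_plus("DIGSESTEDQAMEDIK")
print(f"Mono MH+: {mono:.4f} Da, Avg MH+: {avg:.4f} Da, diff: {avg - mono:.4f} Da")
print("Isotope spacing 0.5 -> charge", charge_from_isotope_spacing(0.5))
```

The gap of about 1 Da between the two values is why a search must be told which mass type the precursor represents.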
15
Fragment Ions
[Figure: protonated (H+) peptide backbone H2N-CHR1-CO-NH-CHR2-CO-NH-CHR3-CO-NH-CHR4-COOH with backbone cleavage sites marked. N-terminal fragments: a1-a3, b1-b3, c1-c3. C-terminal fragments: x3-x1, y3-y1, z3-z1.]
http://www.matrixscience.com/help/fragmentation_help.html
16
d-, v-, and w-ions are created by side chain cleavage. These ions are typically generated under high-energy collision-induced dissociation conditions. Of note, d- and w-ions allow the isobaric residues leucine and isoleucine to be differentiated.
[Figure: structures of the d2, w2, and v2 fragment ions, each carrying H+]
http://www.matrixscience.com/help/fragmentation_help.html
Fragment Ion Types
17
Immonium Ions
An internal fragment with just a single side chain, formed by a combination of a-type and y-type cleavage, is called an immonium ion. The presence of these ions can be diagnostic of the presence of the corresponding amino acid in the peptide sequence.
http://www.abrf.org/ResearchGroups/MassSpectrometry/EPosters/ms97quiz/residueMasses.html
Amino Acid               Residue Mass   Immonium ion mass
Glycine                   57.02147       30.03438    -
Alanine                   71.03712       44.05003    -
Serine                    87.03203       60.04494    +
Proline                   97.05277       70.06568    ++
Valine                    99.06842       72.08133    ++
Threonine                101.04768       74.06059    +
Cysteine                 103.00919       76.0221     -
- carbamidomethylated    160.03065      133.0436     +
- carboxymethylated      161.01466      134.0276     +
- acrylamide adduct      174.0643       147.0772     +
Isoleucine               113.08407       86.09698    ++
Leucine                  113.08407       86.09698    ++
Asparagine               114.04293       87.05584    +
Aspartic acid            115.02695       88.03986    +
Glutamine                128.05858      101.0715     +
Lysine                   128.09497      101.1079 (84.08136)
Glutamic acid            129.0426       102.0555     +
Methionine               131.04049      104.0534     +
- oxidized methionine    147.0354       120.0483     +
Histidine                137.05891      110.0718     ++
Phenylalanine            147.06842      120.0813     ++
Arginine                 156.10112      129.114      -
Tyrosine                 163.06333      136.0762     ++
Tryptophan               186.07931      159.0922     +
http://www.matrixscience.com/help/fragmentation_help.html
18
MALDI-TOF-TOF tandem mass spectrum of APNDFNLK (rabbit glycogen phosphorylase)
Immonium ions in the spectrum:
70 → P
86 → I/L
120 → F
Immonium Ions
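Reading immonium peaks off a spectrum amounts to a small lookup. The sketch below assumes the constant offset implied by the immonium table (residue mass minus 26.98709, e.g. 57.02147 - 30.03438 for glycine) and covers only a handful of residues; the function names and the 0.5 Da tolerance are illustrative choices, not part of any tool.

```python
# Offset between residue mass and immonium ion mass implied by the table.
IMMONIUM_OFFSET = 26.98709

# Monoisotopic residue masses (Da) for a few residues with strong immonium ions.
RESIDUE_MONO = {
    "P": 97.05277, "I": 113.08407, "L": 113.08407,
    "H": 137.05891, "F": 147.06842, "Y": 163.06333,
}


def immonium_mass(residue_mass):
    """Immonium ion mass for a given residue mass."""
    return residue_mass - IMMONIUM_OFFSET


def diagnose(peak_mz, tol=0.5):
    """Return residues whose immonium ion lies within tol of peak_mz."""
    return sorted(
        aa for aa, mass in RESIDUE_MONO.items()
        if abs(immonium_mass(mass) - peak_mz) <= tol
    )


for peak in (70.0, 86.0, 120.0):
    print(peak, "->", "/".join(diagnose(peak)))
```

With a wider residue list, a peak near 120 would also be consistent with oxidized methionine (120.0483), so a low-mass peak suggests a residue rather than proving it.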
19
D L Y S K

N-terminal fragments    C-terminal fragments
D                       L Y S K
D L                     Y S K
D L Y                   S K
D L Y S                 K
Peptide Fragmentation
20
A-P-N-D-F-N-L-K (MH+ 918.5), monoisotopic masses

B-ions                                       Y-ions
 72.0   A                  P-N-D-F-N-L-K     847.4
169.1   A-P                N-D-F-N-L-K       750.4
283.1   A-P-N              D-F-N-L-K         636.3
398.2   A-P-N-D            F-N-L-K           521.3
545.2   A-P-N-D-F          N-L-K             374.2
659.3   A-P-N-D-F-N        L-K               260.2
772.4   A-P-N-D-F-N-L      K                 147.1
Fragmenting a Peptide
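The b- and y-ion masses for APNDFNLK follow from running sums of residue masses. A minimal sketch, assuming singly charged fragments, a proton mass of 1.00728 Da and a water mass of 18.01056 Da (neither constant is stated on the slide); function names are illustrative.

```python
# Monoisotopic residue masses (Da) for the residues in APNDFNLK.
MONO = {
    "A": 71.03711, "P": 97.05276, "N": 114.04293, "D": 115.02694,
    "F": 147.06841, "L": 113.08406, "K": 128.09496,
}
PROTON = 1.00728   # assumed
WATER = 18.01056   # assumed


def b_ions(peptide):
    """Singly charged b-ions: N-terminal residues plus a proton."""
    masses, total = [], PROTON
    for aa in peptide[:-1]:
        total += MONO[aa]
        masses.append(round(total, 1))
    return masses


def y_ions(peptide):
    """Singly charged y-ions: C-terminal residues plus water plus a proton."""
    masses, total = [], WATER + PROTON
    for aa in reversed(peptide[1:]):
        total += MONO[aa]
        masses.append(round(total, 1))
    return masses


print("b:", b_ions("APNDFNLK"))
print("y:", y_ions("APNDFNLK"))
```

Each b-ion and its complementary y-ion sum to MH+ plus one proton, which is a quick consistency check when validating a match by hand.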
21
A-P-N-D-F-N-L-K (MH+ 918.5)
Sequence vs. Tandem Mass Spectrum
22
A P N D F N L K
B-ions
Sequence vs. Tandem Mass Spectrum
23
APNDFNLK
Y-ions
Sequence vs. Tandem Mass Spectrum
24
Sequence vs. Tandem Mass Spectrum
A P N D F N L K
APNDFNLK
25
Raw, uninterpreted MS/MS spectra
Sequence Database
>SEQ1
CVVEELCPTPEGKDIGESVDLLKLQWCWENGTLRSLDCDVVS
>SEQ2
DLRSWTVRIDALNHGVKPHPPNVSVVDLTNR
>
Uninterpreted MS/MS Database Search
26
Input:
• Fragmentation spectrum
• Precursor mass, charge state

1. From database, select peptides that equal the input mass
2. Theoretically fragment peptides
3. Compare theoretical fragments to acquired spectrum
4. Generate score
5. Rank by score and display best matches

Sequence Database
Uninterpreted MS/MS Database Search
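The five search steps can be sketched end to end on the two-sequence example database. Everything here is a toy under stated assumptions: a tryptic digest with no missed cleavages and no proline rule, singly charged b/y ions only, a simple shared-peak count standing in for a real scoring function (SEQUEST and Mascot score differently), and a synthetic "observed" spectrum built from the theoretical fragments of DIGESVDLLK.

```python
MONO = {  # monoisotopic residue masses in Da
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON, WATER = 1.00728, 18.01056  # assumed constants


def tryptic_peptides(protein):
    """Cleave after every K or R (no missed cleavages, no proline rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def mh_plus(peptide):
    return sum(MONO[aa] for aa in peptide) + WATER + PROTON


def fragments(peptide):
    """Step 2: theoretical singly charged b- and y-ions."""
    b, y, ions = PROTON, WATER + PROTON, []
    for i in range(1, len(peptide)):
        b += MONO[peptide[i - 1]]
        y += MONO[peptide[-i]]
        ions += [b, y]
    return ions


def shared_peak_count(theoretical, observed, tol=0.5):
    """Steps 3-4: count theoretical fragments matched by an observed peak."""
    return sum(any(abs(t - o) <= tol for o in observed) for t in theoretical)


def search(observed, precursor_mh, database, mass_tol=1.0):
    """Steps 1 and 5: select candidates by mass, score them, rank by score."""
    hits = []
    for name, protein in database.items():
        for pep in tryptic_peptides(protein):
            if abs(mh_plus(pep) - precursor_mh) <= mass_tol:
                hits.append((shared_peak_count(fragments(pep), observed), pep, name))
    return sorted(hits, reverse=True)


DATABASE = {
    "SEQ1": "CVVEELCPTPEGKDIGESVDLLKLQWCWENGTLRSLDCDVVS",
    "SEQ2": "DLRSWTVRIDALNHGVKPHPPNVSVVDLTNR",
}
observed = fragments("DIGESVDLLK")  # synthetic spectrum for illustration
for score, pep, name in search(observed, mh_plus("DIGESVDLLK"), DATABASE):
    print(score, pep, name)
```

Real engines differ mainly in steps 3 and 4; candidate selection and ranking look much like this.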
27
Raw MS/MS spectra

Sequence Database
>SEQ1
CVVRELCPTPEGKDIGESVDLLKLQWCWENGTLRSLDCDVVSRDIGSESTEDRAMEDIK
>SEQ2
DLRSWTVRIDALNHGVKPHPPNVSVVDLTNRGDVEKGKKIFVQKCAQCHTVEKGGKHKT

Peptides of same nominal mass
Similarity score: 1.00, 0.34, 0.29
Uninterpreted MS/MS Database Search
28
MASCOT
32
Mascot Score?
From presentation on MatrixScience web site:
• Each ion series is matched and scored independently
• If an ion series contains only a random number of matches, or less, it is discarded
• All combinations of the ion series with non-random levels of matching are tested to see which combination will give the highest score
• Having “too many” ion series doesn’t affect the score, it just reduces specificity
33
Interpreting Mascot results
• Ions Score = -10 x Log(P)
– Calculation of P is ‘black box’
– Extension of the MOWSE score
34
Interpreting Mascot results
• Identity threshold = -10 x Log(E/N)
– E is the significance threshold
– N is the number of peptides in the database matching the precursor mass
• Example
– If you can accept a 1 in 20 chance of a false positive, select an E of 0.05
– If there are 4000 peptides that match the precursor ion mass:
S = -10 x Log(0.05/4000) = 49
Matrix Science http://www.matrixscience.com/pdf/2005WKSHP4.pdf
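The identity threshold arithmetic is easy to reproduce. A minimal sketch of the formula as written on the slide, taking Log as base 10; the function name is illustrative and this is not Mascot code.

```python
import math


def identity_threshold(e, n):
    """Identity threshold = -10 * log10(E / N).

    e: significance threshold (accepted chance of a false positive)
    n: number of database peptides matching the precursor mass
    """
    return -10 * math.log10(e / n)


print(round(identity_threshold(0.05, 4000)))    # slide example: E = 0.05, N = 4000
print(round(identity_threshold(0.05, 40000)))   # ten times more candidates
```

Each tenfold increase in the number of candidate peptides raises the threshold by 10, which is why larger search spaces demand higher scores.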
35
Interpreting Mascot results
• Homology threshold
– “The homology threshold is an empirical measure of whether the match is an outlier”
Matrix Science http://www.matrixscience.com/pdf/2005WKSHP4.pdf
36
Interpreting Mascot results
• Expectation value
– The number of times you could expect to get this score or better by chance
• E = Pthresh x (10 ^ ((Sthresh - score) / 10))
• If Pthresh = 0.05 and Sthresh = 50
– score = 40 corresponds to E = 0.5
– score = 50 corresponds to E = 0.05
– score = 60 corresponds to E = 0.005
Matrix Science http://www.matrixscience.com/pdf/2005WKSHP4.pdf
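The expectation value formula can be evaluated directly for the three example scores. A minimal sketch of the slide's formula only; the function and parameter names are illustrative.

```python
def expectation_value(score, p_thresh=0.05, s_thresh=50.0):
    """E = Pthresh * 10 ** ((Sthresh - score) / 10)."""
    return p_thresh * 10 ** ((s_thresh - score) / 10)


for score in (40, 50, 60):
    print(f"score = {score}: E = {expectation_value(score):g}")
```

A score 10 above the threshold is ten times less likely to arise by chance, and a score 10 below is ten times more likely.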
37
• Protein, nucleic acid, and EST sequence databases
• Optionally include enzyme specificity in the search
• Post-translational modifications can be identified
• Search software
MS/MS Database Search Parameters
38
Raw genomic
Transcript or EST
Protein sequence
Sequence Databases
39
• Protein, nucleic acid, and short EST sequence databases can all be searched
• Optionally include enzyme specificity in the search
• Post-translational modifications can be identified
• Search software
MS/MS Database Search Parameters
40
DB: enzyme constraint
41
GDVEKGTKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGSK
TGQAPGFSYTDANKNKGITWGEETLMEYLENPKSYIPGT
GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRK
TGQAPGFSYTDANKNKGITWGEETLMEYLENPKKYIPGT
tryptic peptides:
enzyme-unconstrained peptides:
DB: enzyme constraint
42
human IPI database, 47,754

mass       # tryptic peptides   # unconstr. peptides   factor
1000 Da    1,430                321,999                225x
2000 Da    466                  325,096                697x
3000 Da    249                  317,750                1276x
DB: tryptic peptides vs. unconstrained search
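The growth in candidates without an enzyme constraint shows up even on one short sequence. This sketch counts fully tryptic peptides (no missed cleavages, no proline rule, both simplifying assumptions) against every contiguous subsequence of a fragment taken from the enzyme-constraint slide. Unlike the table, it does not filter by mass or search a whole database.

```python
def tryptic_peptides(protein):
    """Cleave after every K or R (no missed cleavages, no proline rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def unconstrained_peptides(protein):
    """Every contiguous subsequence, counted by position (duplicates kept)."""
    n = len(protein)
    return [protein[i:j] for i in range(n) for j in range(i + 1, n + 1)]


seq = "GDVEKGTKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGSK"
tryptic = tryptic_peptides(seq)
unconstrained = unconstrained_peptides(seq)
print(len(tryptic), "tryptic peptides:", tryptic)
print(len(unconstrained), "enzyme-unconstrained peptides")
```

The unconstrained count grows roughly with the square of the sequence length, so search time and the chance of random matches both rise.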
43
• Protein, nucleic acid, and short EST sequence databases can all be searched
• Optionally include enzyme specificity in the search
• Post-translational modifications can be identified
• Search software
MS/MS Database Search Parameters
44
• Static Modification
– All occurrences of an amino acid are modified
• Variable/Differential Modification
– One or more occurrences of an amino acid may be modified
• Modifications can typically be specified on any residue(s) or termini.
Post-Translational Modifications
45
Serine phosphorylation (P = phosphorylated serine):
1. DIGSESTEDQAMEDYK   (no P)
2. DIGSESTEDQAMEDYK   (one P)
3. DIGSESTEDQAMEDYK   (one P, on the other serine)
4. DIGSESTEDQAMEDYK   (two P)
How many peptide forms are possible if you consider serine and threonine phosphorylation for the above peptide? Serine + threonine + tyrosine?
Variable Modifications
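The question can be answered by enumerating every subset of modifiable sites. A minimal sketch, assuming each candidate residue is independently either modified or not; the "(P)" annotation and the function name are illustrative.

```python
from itertools import combinations


def modified_forms(peptide, residues):
    """List every form of peptide with any subset of the given residues modified."""
    sites = [i for i, aa in enumerate(peptide) if aa in residues]
    forms = []
    for k in range(len(sites) + 1):
        for combo in combinations(sites, k):
            forms.append("".join(
                aa + ("(P)" if i in combo else "")
                for i, aa in enumerate(peptide)
            ))
    return forms


peptide = "DIGSESTEDQAMEDYK"
for residues in ("S", "ST", "STY"):
    print(residues, "->", len(modified_forms(peptide, residues)), "forms")
```

With n modifiable sites there are 2^n forms, so each additional variable modification can double the number of candidates per peptide.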
46
human IPI database, 47,754

mass       # tryptic peptides   phos STY tryptic   factor   # unconstr peptides   unconstr phos STY
1000 Da    1,430                5,093              3.5x     321,999               1,167,740
2000 Da    466                  7,283              15.6x    325,096               4,538,383
3000 Da    249                  16,761             67.3x    317,750               15,641,722
Variable Modification Search
47
• Protein, nucleic acid, and short EST sequence databases can all be searched
• Optionally include enzyme specificity in the search
• Post-translational modifications can be identified
• Search software
Uninterpreted MS/MS Database Search
48
• Phenyx
• SpectrumMill
• ProteinPilot
• SEQUEST
• X! Tandem
• OMSSA
• ProbID
What about other programs?
52
ProbID
Immonium ions: H, M, W, Y, F
$pr(II(S) \mid k,B) = 1 - i/5$
where i = # of immonium peaks in spectrum w/o corresponding amino acid

Unmatched ions:
$pr(N(S) \mid k,B) = \left( \dfrac{1}{mass_{max} - mass_{min}} \right)^{r}$
where r = # of unmatched ions and mass_max & mass_min are the highest and lowest peaks in spectrum
53
ProbID
Match pattern:
$pr(pat(S) \mid k,B) = \dfrac{\#\ \text{of matched ion pairs}}{3(n-1)}$
n = # of AA in peptide

Matched ions:
$pr(M(S) \mid k,B) = \prod_i a_i \, e^{-(m_i - m)^2 / (2\sigma^2)}$
a_i = amplitude of each peak
m_i = mass of each peak
σ = mass accuracy std dev
ProbID output
55
X! Tandem
• Open source search engine
• Very fast
• Lots of user-definable search options
• Built-in “refinement” mode
56
X! Tandem refinement mode:
[Figure: full database → 1st pass search (Tryptic, Ox M) → identified proteins → subset DB → 2nd pass search (multiple parameters) → not identified in 1st pass]
57
Interpretation Rules
K.LLGNQATFSPIVTVEPR.R
K.SPSDVKPLPSPDTDVPLSSVE.I
D.PEDVFTENPDEKSIITY.V
An enzyme-unrestricted search can greatly assist in the interpretation process.
Look for peptides that exhibit the expected cleavage at both the N- and C-terminus.
Don’t bother with peptides that exhibit no correct cleavage.
58
Match all fragment ions!
Correct identifications don’t exhibit random fragment ion matches. Look for a series of y-ions or b-ions.
Trypsin leaves a basic residue (K or R) at the C-terminus, which translates to strong y-ions, so hopefully the big peaks match y-ions.
Interpretation Rules
59
If a big peak matches a y-ion from an N-terminal cleavage of proline, that is a good indication of a correct identification.
The reverse is not true: a proline in a peptide that does not correspond to a big peak is not an indication of an incorrect identification.
Interpretation Rules
60
Random or reverse databases?
Original sequence:
MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFKALVLIAFAQYRQQCPFEDHVKLVNEVTEFAKTCVADESAENCDKSLHTLFGDKLCTVA

Reverse peptide sequence:
MKSYASSFLFLLSIFTVWRGVFRRHADKHAVESRFKFNEEGLDKYQAFAILVLARVHDEFPCQQKAFETVENVLKDCNEASEDAVCTKDGFLTHLSKAVTCL

When searching a forward + reverse sequence database, the estimated fraction of incorrect matches is:
2 * (# reverse matches passing cutoff) / (# total matches passing cutoff)
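Both the peptide-level reversal and the error estimate are short to write down. This is a sketch under assumptions: the reversal rule (reverse the residues between tryptic cleavage sites while keeping each K or R in place) is inferred from the example sequences and matches most of their segments, and the function names are illustrative.

```python
def reverse_tryptic(protein):
    """Reverse each tryptic segment, keeping K/R at its C-terminus."""
    out, segment = [], []
    for aa in protein:
        if aa in "KR":
            out.append("".join(reversed(segment)) + aa)
            segment = []
        else:
            segment.append(aa)
    out.append("".join(reversed(segment)))  # C-terminal segment without K/R
    return "".join(out)


def estimated_error_rate(n_reverse, n_total):
    """2 * (# reverse matches passing cutoff) / (# total matches passing cutoff)."""
    return 2 * n_reverse / n_total


print(reverse_tryptic("MKWVTFISLLFLFSSAYSR"))
print(reverse_tryptic("DAHKSEVAHR"))
print(estimated_error_rate(5, 100))
```

Keeping K and R in place preserves the tryptic peptide masses, so decoy peptides compete with real ones in the same precursor mass windows. The factor of 2 assumes incorrect matches land on forward and reverse sequences equally often.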