Jimmy Eng - Toolstools.proteomecenter.org/course/lectures/0610-Day1.Eng.pdf · 2006-10-16 · • Protein, nucleic acid, and short EST sequence databases can all be searched • Optionally

1

MS/MS Database Searching

Jimmy Eng

Day 1

October 16, 2006

2

Day 1 Lecture Topics

• Basic background & motivation
• Peptide fragmentation, nomenclature
• Peptide vs. tandem mass spectra
• Sequence database searching
– Databases
– Enzymes
– Modifications

• Interpretation of search results; manual validation

• Introduction to software tools

3

Protein Identification Strategy

[Figure: workflow. Denatured protein complex → peptides → HPLC (1D or 2D chromatographic separation of peptides) → Mass Spec → Db search → identify proteins in complex]

4

TPP

[Figure: TPP overview diagram. Labeled components: LC-MS/MS Data; mzXML file format; SEQUEST/COMET/Mascot/ProbID/SpectraST; pepXML file format; Pep3D; Qualscore; xINTERACT; PeptideProphet; XPRESS/ASAPRatio/Libra; ProteinProphet; protXML file format; XLink; SBEAMS; PeptideAtlas; Cytoscape; Gaggle…]

5

Single Stage MS

[Figure: protein → Digestion → peptides → Ionization → Mass Analysis → MS spectrum (x-axis: m/z)]

6

Tandem MS

[Figure: protein → Digestion → peptides → Ionization → Isolation → Fragmentation → fragments → Mass Analysis; MS and MS/MS spectra (x-axis: m/z)]

7

Mass vs. Intensity vs. Time

[Figure: 2D view: m/z, intensity. 3D view: m/z, intensity, time (scan #)]

8

Mass vs. Intensity vs. Time

[Figure: successive MS scans, each m/z vs. intensity, stacked along the time (scan #) axis]

9

Mass vs. Intensity vs. Time

[Figure: successive MS scans, each m/z vs. intensity, stacked along the time (scan #) axis; m/z 1000.2 is annotated]

10

MS/MS Data Acquisition

[Figure: full MS scan, trypmyo01 #294, RT: 9.89, AV: 1, NL: 1.12E7, T: + c Full ms [300.00-1600.00]. x-axis: m/z (400 to 1600); y-axis: Relative Abundance (0 to 100). Labeled peaks include 342.7, 496.4, 528.3, 618.7, 661.6, 704.2, and 991.7.]

1. Acquire full (MS) scan
2. Select an ion
3. Isolate ion
4. Fragment ion (MS/MS scan)

11

MS vs. MS/MS

[Figure: MS scans (m/z vs. intensity) along the time (scan #) axis, with MS/MS scans interleaved]

12

2D view of an LC-MS experiment

You’ll learn all about Pep3D soon!

13

Amino Acids

Amino acid      3LC  SLC  Average    Monoisotopic
Glycine         Gly  G     57.0519    57.02146
Alanine         Ala  A     71.0788    71.03711
Serine          Ser  S     87.0782    87.03203
Proline         Pro  P     97.1167    97.05276
Valine          Val  V     99.1326    99.06841
Threonine       Thr  T    101.1051   101.04768
Cysteine        Cys  C    103.1388   103.00919
Leucine         Leu  L    113.1594   113.08406
Isoleucine      Ile  I    113.1594   113.08406
Asparagine      Asn  N    114.1038   114.04293
Aspartic acid   Asp  D    115.0886   115.02694
Glutamine       Gln  Q    128.1307   128.05858
Lysine          Lys  K    128.1741   128.09496
Glutamic acid   Glu  E    129.1155   129.04259
Methionine      Met  M    131.1926   131.04049
Histidine       His  H    137.1411   137.05891
Phenylalanine   Phe  F    147.1766   147.06841
Arginine        Arg  R    156.1875   156.10111
Tyrosine        Tyr  Y    163.1760   163.06333
Tryptophan      Trp  W    186.2132   186.07931

(3LC = three-letter code, SLC = single-letter code; masses are residue masses in Da)

14

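Average vs. Monoisotopic Mass

Monoisotopic mass – mass of the peak made up of the lightest isotope of each element (the first peak of the isotopic envelope)

Average mass – centroid of isotopic envelope

For example: DIGSESTEDQAMEDIK

Mono MH+: 1767.7594 Da
Avg MH+: 1768.8438 Da

Charge state = 1 / Δm, where Δm is the m/z spacing between adjacent isotope peaks

Difference in mass can be significant!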
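The two MH+ values above can be reproduced from the residue masses in the Amino Acids table. The sketch below is illustrative only: the water and hydrogen masses are assumed constants that do not appear on the slide, and the function names are hypothetical. Adding a hydrogen atom (1.00783) rather than a bare proton (1.00728) is the convention that matches the slide's monoisotopic value.

```python
# Residue masses (average, monoisotopic) for the residues in the example
# peptide, copied from the Amino Acids table.
RESIDUE_MASS = {
    "G": (57.0519, 57.02146), "A": (71.0788, 71.03711),
    "S": (87.0782, 87.03203), "T": (101.1051, 101.04768),
    "I": (113.1594, 113.08406), "D": (115.0886, 115.02694),
    "Q": (128.1307, 128.05858), "K": (128.1741, 128.09496),
    "E": (129.1155, 129.04259), "M": (131.1926, 131.04049),
}
# Assumed constants (not on the slide): (average, monoisotopic) masses.
WATER = (18.0153, 18.01056)
HYDROGEN = (1.00794, 1.00783)


def mh_plus(peptide, monoisotopic=True):
    """Singly protonated peptide mass: residues + water + one hydrogen."""
    idx = 1 if monoisotopic else 0
    residues = sum(RESIDUE_MASS[aa][idx] for aa in peptide)
    return residues + WATER[idx] + HYDROGEN[idx]


def charge_from_spacing(delta_mz):
    """Charge state = 1 / (m/z spacing between adjacent isotope peaks)."""
    return round(1.0 / delta_mz)


if __name__ == "__main__":
    pep = "DIGSESTEDQAMEDIK"
    print(f"Mono MH+: {mh_plus(pep, True):.4f} Da")
    print(f"Avg  MH+: {mh_plus(pep, False):.4f} Da")
    print("Isotope spacing 0.5 m/z -> charge", charge_from_spacing(0.5))
```

For this peptide the two masses differ by about 1.08 Da, which is why the mass type used in a search has to match the instrument data.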

15

Fragment Ions

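[Figure: protonated four-residue peptide backbone, H2N-CHR1-CO-NH-CHR2-CO-NH-CHR3-CO-NH-CHR4-COOH, with backbone cleavage sites marked. N-terminal fragment ions: a1-a3, b1-b3, c1-c3. C-terminal fragment ions: x3-x1, y3-y1, z3-z1.]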

http://www.matrixscience.com/help/fragmentation_help.html

16

Fragment Ion Types

d-, v-, and w-ions are created by side chain cleavage. These ions are typically generated under high energy collision induced dissociation conditions. Of note, d- and w-ions allow the isobaric residues leucine and isoleucine to be differentiated.

[Figure: structures of the d2, w2, and v2 ions]

17

Immonium Ions

An internal fragment with just a single side chain, formed by a combination of a-type and y-type cleavage, is called an immonium ion. The presence of these ions can be diagnostic for the presence of the corresponding amino acid in the peptide sequence.

http://www.abrf.org/ResearchGroups/MassSpectrometry/EPosters/ms97quiz/residueMasses.html

Amino Acid               Residue Mass   Immonium ion mass      (+/-)
Glycine                   57.02147       30.03438              -
Alanine                   71.03712       44.05003              -
Serine                    87.03203       60.04494              +
Proline                   97.05277       70.06568              ++
Valine                    99.06842       72.08133              ++
Threonine                101.04768       74.06059              +
Cysteine                 103.00919       76.0221               -
 - carbamidomethylated   160.03065      133.0436               +
 - carboxymethylated     161.01466      134.0276               +
 - acrylamide adduct     174.0643       147.0772               +
Isoleucine               113.08407       86.09698              ++
Leucine                  113.08407       86.09698              ++
Asparagine               114.04293       87.05584              +
Aspartic acid            115.02695       88.03986              +
Glutamine                128.05858      101.0715               +
Lysine                   128.09497      101.1079 (84.08136)
Glutamic acid            129.0426       102.0555               +
Methionine               131.04049      104.0534               +
 - oxidized methionine   147.0354       120.0483               +
Histidine                137.05891      110.0718               ++
Phenylalanine            147.06842      120.0813               ++
Arginine                 156.10112      129.114                -
Tyrosine                 163.06333      136.0762               ++
Tryptophan               186.07931      159.0922               +


18

Immonium Ions

[Figure: MALDI-TOF-TOF tandem mass spectrum of APNDFNLK (rabbit glycogen phosphorylase), with immonium ion peaks annotated: 70 → P, 86 → I/L, 120 → F]
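The immonium masses in the table follow from the residue masses: every row is the residue mass minus CO plus one hydrogen. A minimal sketch of that relationship and of a low-mass peak lookup is given below. The CO and hydrogen masses, the tolerance, the short residue list, and the function names are assumptions made for illustration, not part of the lecture.

```python
# Assumed constants (not on the slide): monoisotopic CO and hydrogen masses.
CO = 27.99491
HYDROGEN = 1.00783

# Monoisotopic residue masses for a few residues with strong immonium ions.
RESIDUE_MONO = {
    "P": 97.05276, "I": 113.08406, "L": 113.08406,
    "H": 137.05891, "F": 147.06841, "Y": 163.06333,
}


def immonium_mass(residue_mass):
    """Immonium ion m/z = residue mass - CO + H."""
    return residue_mass - CO + HYDROGEN


def residues_suggested(peaks, tol=0.02):
    """Return residues whose immonium m/z lies within tol of any observed peak."""
    found = set()
    for aa, mass in RESIDUE_MONO.items():
        target = immonium_mass(mass)
        if any(abs(p - target) <= tol for p in peaks):
            found.add(aa)
    return sorted(found)


if __name__ == "__main__":
    # Low-mass peaks like the ones annotated in the APNDFNLK spectrum.
    print(residues_suggested([70.07, 86.10, 120.08]))
```

Leucine and isoleucine are reported together because their immonium ions have the same mass.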

19

Peptide Fragmentation

D L Y S K

N-terminal fragments    C-terminal fragments
D                       L Y S K
D L                     Y S K
D L Y                   S K
D L Y S                 K

20

Fragmenting a Peptide

A-P-N-D-F-N-L-K (MH+ 918.5), monoisotopic masses

B-ions   N-terminal fragment   C-terminal fragment   Y-ions
 72.0    A                     P-N-D-F-N-L-K         847.4
169.1    A-P                   N-D-F-N-L-K           750.4
283.1    A-P-N                 D-F-N-L-K             636.3
398.2    A-P-N-D               F-N-L-K               521.3
545.2    A-P-N-D-F             N-L-K                 374.2
659.3    A-P-N-D-F-N           L-K                   260.2
772.4    A-P-N-D-F-N-L         K                     147.1
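The b- and y-ion values for APNDFNLK can be recomputed from the monoisotopic residue masses. The following is a minimal sketch, assuming singly charged fragments, with a b-ion taken as the residue sum plus a proton and a y-ion as the residue sum plus water plus a proton. The constants and the function name are illustrative assumptions.

```python
# Monoisotopic residue masses for the residues in APNDFNLK.
MONO = {
    "A": 71.03711, "P": 97.05276, "N": 114.04293, "D": 115.02694,
    "F": 147.06841, "L": 113.08406, "K": 128.09496,
}
# Assumed constants (not on the slide).
WATER = 18.01056
PROTON = 1.00728


def b_y_rows(peptide):
    """Return one row per backbone cleavage: (b m/z, N-term part, C-term part, y m/z)."""
    rows = []
    for i in range(1, len(peptide)):
        n_part, c_part = peptide[:i], peptide[i:]
        b_ion = sum(MONO[aa] for aa in n_part) + PROTON
        y_ion = sum(MONO[aa] for aa in c_part) + WATER + PROTON
        rows.append((b_ion, n_part, c_part, y_ion))
    return rows


if __name__ == "__main__":
    for b_ion, n_part, c_part, y_ion in b_y_rows("APNDFNLK"):
        print(f"{b_ion:7.1f}  {n_part:<8} {c_part:<8} {y_ion:7.1f}")
```

Each row pairs complementary fragments, so for a singly charged precursor the b and y values in a row sum to MH+ plus one proton.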

21

A-P-N-D-F-N-L-K(MH+ 918.5)

Sequence vs. Tandem Mass Spectrum

22

A P N D F N L K

B-ions


23

A P N D F N L K

Y-ions


24


A P N D F N L K

APNDFNLK

25

Uninterpreted MS/MS Database Search

Raw, uninterpreted MS/MS spectra

Sequence Database

>SEQ1
CVVEELCPTPEGKDIGESVDLLKLQWCWENGTLRSLDCDVVS
>SEQ2
DLRSWTVRIDALNHGVKPHPPNVSVVDLTNR
>

26

Input:
• Fragmentation spectrum
• Precursor mass, charge state

1. From the sequence database, select peptides whose mass matches the input mass
2. Theoretically fragment peptides
3. Compare theoretical fragments to acquired spectrum
4. Generate score
5. Rank by score and display best matches
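The five search steps can be sketched in a few lines. This is a toy illustration, not how SEQUEST, Mascot, or any other engine scores spectra: the score here is simply the fraction of theoretical b/y ions that find a peak within tolerance, and the tolerances, constants, and function names are all assumptions.

```python
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.01056, 1.00728  # assumed constants


def mh_plus(peptide):
    return sum(MONO[aa] for aa in peptide) + WATER + PROTON


def fragments(peptide):
    """Step 2: singly charged theoretical b- and y-ion m/z values."""
    ions = []
    for i in range(1, len(peptide)):
        ions.append(sum(MONO[aa] for aa in peptide[:i]) + PROTON)          # b-ion
        ions.append(sum(MONO[aa] for aa in peptide[i:]) + WATER + PROTON)  # y-ion
    return ions


def search(peaks, precursor_mh, candidates, prec_tol=1.0, frag_tol=0.5):
    hits = []
    for pep in candidates:
        # Step 1: keep only peptides matching the precursor mass.
        if abs(mh_plus(pep) - precursor_mh) > prec_tol:
            continue
        # Steps 3-4: compare theoretical fragments to the spectrum and score.
        theo = fragments(pep)
        matched = sum(any(abs(t - p) <= frag_tol for p in peaks) for t in theo)
        hits.append((matched / len(theo), pep))
    # Step 5: rank by score, best first.
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    spectrum = fragments("APNDFNLK")  # idealized spectrum for the demo
    for score, pep in search(spectrum, 918.47, ["APNDFNLK", "APDNFNLK", "DLYSK"]):
        print(f"{score:.2f}  {pep}")
```

In the demo, the candidate with two residues swapped has the same precursor mass but loses the fragment ions that span the swap, so it ranks below the correct sequence.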


27

Raw MS/MS spectra

Sequence Database

>SEQ1
CVVRELCPTPEGKDIGESVDLLKLQWCWENGTLRSLDCDVVSRDIGSESTEDRAMEDIK
>SEQ2
DLRSWTVRIDALNHGVKPHPPNVSVVDLTNRGDVEKGKKIFVQKCAQCHTVEKGGKHKT

Peptides of same nominal mass

Similarity score
1.00
0.34
0.29


28

MASCOT

29

MASCOT

30

MASCOT

31

MASCOT

32

Mascot Score?

From presentation on MatrixScience web site:
• Each ion series is matched and scored independently
• If an ion series contains only a random number of matches, or less, it is discarded
• All combinations of the ion series with non-random levels of matching are tested to see which combination will give the highest score
• Having “too many” ion series doesn’t affect the score, it just reduces specificity

33

Interpreting Mascot results

• Ions Score = -10 x Log(P)

– Calculation of P is ‘black box’

– Extension of the MOWSE score

34


• Identity threshold = -10 x Log(E/N)
– E is the significance threshold
– N is the number of peptides in the database matching the precursor mass

• Example
– If you can accept a 1 in 20 chance of a false positive, select an E of 0.05
– If there are 4000 peptides that match the precursor ion mass:
S = -10 x Log(0.05/4000) = 49

Matrix Science http://www.matrixscience.com/pdf/2005WKSHP4.pdf
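The identity threshold arithmetic is easy to check. The snippet below is a small sketch of the formula as written on the slide, assuming Log means base 10; the function name is hypothetical.

```python
import math


def identity_threshold(e, n):
    """Identity threshold S = -10 * log10(E / N)."""
    return -10.0 * math.log10(e / n)


if __name__ == "__main__":
    # E = 0.05 (1 in 20), N = 4000 candidate peptides.
    print(f"S = {identity_threshold(0.05, 4000):.1f}")
```

Because N is in the denominator, a larger candidate pool raises the score needed to reach the same significance level.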

35


• Homology threshold
– “The homology threshold is an empirical measure of whether the match is an outlier”


36


• Expectation value
– The number of times you could expect to get this score or better by chance
• E = Pthresh x (10 ^ ((Sthresh - score) / 10))
• If Pthresh = 0.05 and Sthresh = 50
– score = 40 corresponds to E = 0.5
– score = 50 corresponds to E = 0.05
– score = 60 corresponds to E = 0.005
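The three example values follow directly from the expectation value formula. A minimal sketch, with hypothetical function and parameter names:

```python
def expectation_value(score, p_thresh=0.05, s_thresh=50.0):
    """E = Pthresh * 10 ** ((Sthresh - score) / 10)."""
    return p_thresh * 10 ** ((s_thresh - score) / 10.0)


if __name__ == "__main__":
    for score in (40, 50, 60):
        print(score, expectation_value(score))
```

Every 10 score units above the threshold lowers E by a factor of 10.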


37

• Protein, nucleic acid, and EST sequence databases

• Optionally include enzyme specificity in the search

• Post-translational modifications can be identified

• Search software

MS/MS Database Search Parameters

38

Raw genomic

Transcript or EST

Protein sequence

Sequence Databases

39

• Protein, nucleic acid, and short EST sequence databases can all be searched



• Search software


40

DB: enzyme constraint

41

GDVEKGTKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGSK

TGQAPGFSYTDANKNKGITWGEETLMEYLENPKSYIPGT

GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRK

TGQAPGFSYTDANKNKGITWGEETLMEYLENPKKYIPGT

tryptic peptides:

enzyme-unconstrained peptides:

DB: enzyme constraint

42

DB: tryptic peptides vs. unconstrained search

human IPI database, 47,754

mass      # tryptic peptides   # unconstr. peptides   factor
1000 Da   1,430                321,999                225x
2000 Da   466                  325,096                697x
3000 Da   249                  317,750                1276x
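The size of this gap can be seen on a single short sequence. The sketch below compares a simplified tryptic digest with enumeration of every possible subsequence for the example sequence from the enzyme constraint slide. It assumes cleavage after every K or R, with no missed cleavages and no proline rule; function names are hypothetical.

```python
def tryptic_peptides(protein):
    """Cleave C-terminal to every K or R (simplified: no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def unconstrained_peptides(protein, min_len=1):
    """Every contiguous subsequence, i.e. no enzyme constraint at either end."""
    n = len(protein)
    return [protein[i:j] for i in range(n) for j in range(i + min_len, n + 1)]


if __name__ == "__main__":
    seq = "GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRK"
    print(tryptic_peptides(seq))
    print(len(tryptic_peptides(seq)), "tryptic vs",
          len(unconstrained_peptides(seq)), "unconstrained")
```

A real search also filters by precursor mass, which is why the table reports counts per mass rather than totals.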

43




• Search software


44

Post-Translational Modifications

• Static Modification
– All occurrences of an amino acid are modified

• Variable/Differential Modification
– One or more occurrences of an amino acid may be modified

• Modifications can typically be specified on any residue(s) or termini.

45

Variable Modifications

Serine phosphorylation:

1. DIGSESTEDQAMEDYK    3. DIGSESTEDQAMEDYK
2. DIGSESTEDQAMEDYK    4. DIGSESTEDQAMEDYK

[Figure: the four forms differ in which serines carry a phosphate (P): none, either one, or both]

How many peptide forms are possible if you consider serine and threonine phosphorylation for the above peptide? Serine + threonine + tyrosine?
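One way to explore the counting question is to enumerate the forms directly: with n modifiable sites, each site is either modified or not, giving 2^n forms. The sketch below marks a modified residue with a trailing asterisk, which is a notation chosen here for illustration; the function name is hypothetical.

```python
from itertools import combinations


def modified_forms(peptide, residues):
    """All forms of a peptide where each listed residue may or may not be modified."""
    sites = [i for i, aa in enumerate(peptide) if aa in residues]
    forms = []
    for k in range(len(sites) + 1):
        for chosen in combinations(sites, k):
            forms.append("".join(
                aa + "*" if i in chosen else aa
                for i, aa in enumerate(peptide)
            ))
    return forms


if __name__ == "__main__":
    pep = "DIGSESTEDQAMEDYK"
    for residues in ("S", "ST", "STY"):
        print(residues, len(modified_forms(pep, residues)))
```

The count doubles with every additional candidate site, which is the source of the growth shown in the variable modification search table.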

46

Variable Modification Search

human IPI database, 47,754

mass      # tryptic peptides   phos STY tryptic   factor   # unconstr. peptides   unconstr. phos STY
1000 Da   1,430                5,093              3.5x     321,999                1,167,740
2000 Da   466                  7,283              15.6x    325,096                4,538,383
3000 Da   249                  16,761             67.3x    317,750                15,641,722

47




• Search software


48

• Phenyx

• SpectrumMill

• ProteinPilot

• SEQUEST

• X! Tandem

• OMSSA

• ProbID

What about other programs?

49

ProbID

50

ProbID

51

ProbID

52

ProbID

Immonium ions: H, M, W, Y, F

$$\mathrm{pr}(II(S) \mid k,B) = 1 - \frac{i}{5}$$

where i = # of immonium peaks in spectrum w/o corresponding amino acid

Unmatched ions:

$$\mathrm{pr}(N(S) \mid k,B) = \left(\frac{1}{mass_{max} - mass_{min}}\right)^{r}$$

where r = # of unmatched ions and mass_max & mass_min are the highest and lowest peaks in spectrum

53

ProbID

Match pattern:

$$\mathrm{pr}(pat(S) \mid k,B) = \frac{\#\text{ of matched ion pairs}}{3(n-1)}$$

n = # of AA in peptide

Matched ions:

$$\mathrm{pr}(M(S) \mid k,B) = \prod_{i} a_i \, e^{-\frac{(m_i - m)^2}{2\sigma^2}}$$

a_i = amplitude of each peak
m_i = mass of each peak
σ = mass accuracy std dev

54

ProbID output

55

X! Tandem

• Open source search engine
• Very fast
• Lots of user-definable search options
• Built-in “refinement” mode

56

X! Tandem refinement mode:

[Figure: 1st pass search (Tryptic, Ox M) against the full database → identified proteins → subset DB → 2nd pass search with multiple parameters, run on spectra not identified in 1st pass]

57

Interpretation Rules

K.LLGNQATFSPIVTVEPR.R

K.SPSDVKPLPSPDTDVPLSSVE.I

D.PEDVFTENPDEKSIITY.V

An enzyme un-restricted search can greatly assist in the interpretation process.

Look for peptides that exhibit the expected cleavage at both the N- and C-terminus.

Don’t bother with peptides that exhibit no correct cleavage.

58

Match all fragment ions!

Correct identifications don’t exhibit random fragment ion matches. Look for a series of y-ions or b-ions.

Trypsin leaves a basic residue (K or R) at the C-terminus, which translates to strong y-ions, so hopefully the big peaks match y-ions.


59

If a big peak matches a y-ion from an N-terminal cleavage of proline, that is a good indication of a correct identification.

The reverse is not true: a proline in a peptide that does not correspond to a big peak is not an indication of an incorrect identification.


60

Random or reverse databases?

Original sequence:

MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFKALVLIAFAQYRQQCPFEDHVKLVNEVTEFAKTCVADESAENCDKSLHTLFGDKLCTVA

Reverse peptide sequence:

MKSYASSFLFLLSIFTVWRGVFRRHADKHAVESRFKFNEEGLDKYQAFAILVLARVHDEFPCQQKAFETVENVLKDCNEASEDAVCTKDGFLTHLSKAVTCL

When searching a forward + reverse sequence database, the estimated fraction of incorrect matches is:

2 * (# reverse matches passing cutoff) / (# total matches passing cutoff)
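The estimate is a one-line calculation once matches passing the cutoff have been counted. A minimal sketch with hypothetical names; the factor of 2 reflects the assumption that an incorrect match is equally likely to land on a forward or a reverse sequence.

```python
def estimated_incorrect_fraction(reverse_matches, total_matches):
    """2 * (# reverse matches passing cutoff) / (# total matches passing cutoff)."""
    if total_matches <= 0:
        raise ValueError("total_matches must be positive")
    return 2.0 * reverse_matches / total_matches


if __name__ == "__main__":
    # Example counts are made up for illustration.
    print(estimated_incorrect_fraction(reverse_matches=5, total_matches=200))
```

Raising the score cutoff until this fraction falls below a target value is a common way to choose a threshold.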
