

Faster, more sensitive peptide identification from tandem mass spectra by sequence database compression

Nathan J. Edwards

Center for Bioinformatics & Computational Biology, University of Maryland, College Park

Introduction

Current sequence databases contain considerable peptide sequence redundancy. Amino-acid sequence databases often contain considerably fewer distinct 30-mers than their size suggests. The distinct 30-mers can be searched faster than the original sequence database, and the statistical significance of the peptide identifications improves.

Peptide Sequences are Short

Peptides identified in MS/MS workflows are rarely longer than 30 amino acids:
• Trypsin cuts at K or R, unless followed by P
• Precursor ion is usually < 3000 Da
• Charge state is usually +1, +2, or +3
• Peptides of more than 20 amino acids typically don’t fragment well

30 is therefore a conservative upper bound on the length of peptides identified by MS/MS workflows.
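The trypsin rule listed above can be illustrated with a minimal in-silico digest. This is only a sketch: the function names and the toy sequence are hypothetical, and the handling of missed cleavages is a common convention rather than something specified on the poster.

```python
def tryptic_fragments(protein):
    """Split a protein after every K or R that is not followed by P."""
    fragments, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            fragments.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Trailing residues after the last cleavage site.
        fragments.append(protein[start:])
    return fragments


def tryptic_peptides(protein, max_missed=0):
    """Join up to max_missed + 1 consecutive fragments into peptides."""
    fragments = tryptic_fragments(protein)
    peptides = []
    for i in range(len(fragments)):
        last = min(i + max_missed, len(fragments) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(fragments[i:j + 1]))
    return peptides


if __name__ == "__main__":
    # Toy sequence (hypothetical): the R before P is not a cleavage site.
    for peptide in tryptic_peptides("AKRPGKTR", max_missed=1):
        print(len(peptide), peptide)
```

Filtering the resulting peptides by length (for example, at most 30 residues) would give the candidates that a 30-mer based database still has to represent.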

Protein Sequence Databases

• IPI-HUMAN, from EBI
• IPI, concatenation of IPI-HUMAN, IPI-MOUSE, and IPI-RAT from EBI
• Swiss-Prot, from ExPASy
• Swiss-Prot-VS, Swiss-Prot plus varsplic.pl variant enumeration
• UniProt, concatenation of Swiss-Prot and TrEMBL
• UniProt-VS, UniProt plus varsplic.pl variant enumeration
• MSDB, from Imperial College
• NRP, from NCI Frederick
• NCBI-nr, from NCBI
• UnionNR, non-redundant union of all

Figure 1 shows the size and the number of distinct 30-mers of each database.

Figure 1: Sequence databases: size, C3 size, distinct 30-mers.

C3 Database Compression

A C3 sequence database [1] is:
• Complete: all 30-mers are represented
• Correct: no new 30-mers are represented
• Compact: each 30-mer occurs exactly once

Figure 1 shows the size of the C3 compressed sequence database for each of the databases.
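The three C3 properties can be stated as simple set and count conditions on k-mers. The sketch below checks them for a candidate compressed database; it does not construct one, and it is not the compression algorithm of [1]. The function names and the toy sequences are hypothetical, and k is shortened to 3 in the example for readability.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every length-k substring across a list of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def check_c3(original, compressed, k=30):
    """Report whether `compressed` is complete, correct, and compact
    with respect to the k-mers of `original`."""
    orig = kmer_counts(original, k)
    comp = kmer_counts(compressed, k)
    return {
        "complete": set(orig) <= set(comp),   # every original k-mer is present
        "correct": set(comp) <= set(orig),    # no k-mer was invented
        "compact": all(n == 1 for n in comp.values()),  # no k-mer repeats
    }


if __name__ == "__main__":
    original = ["ABCDE", "BCDEF", "ABCDE"]   # toy, redundant "database"
    print(check_c3(original, ["ABCDEF"], k=3))
    print(check_c3(original, ["ABCDE", "BCDEF"], k=3))
```

In the toy example the single sequence ABCDEF satisfies all three conditions, while simply dropping the duplicate entry leaves the shared 3-mers repeated and so fails compactness.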

Figure 1 data:

Database        Size         C3 Size      30-mers
IPI-HUMAN       20358846     13854679     12115520
IPI             54145883     37961385     29769766
Swiss-Prot      56454588     52662145     44374286
Swiss-Prot-VS   89541275     54534356     45307827
UniProt         472581860    337119564    274510105
UniProt-VS      506796094    338890778    275391669
MSDB            481919777    342924164    276523755
NRP             495502241    351600578    283160529
NCBI-nr         619132252    463517034    378721915
UnionNR         674700840    473665310    385369671
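To make the Figure 1 numbers easier to compare, the following sketch computes two ratios per database: the fraction of the original size that the C3 database retains, and the C3 size relative to the number of distinct 30-mers. The values are copied from Figure 1; the names `FIGURE_1` and `compression_summary` are hypothetical.

```python
# (size, C3 size, distinct 30-mers), copied from Figure 1.
FIGURE_1 = {
    "IPI-HUMAN": (20358846, 13854679, 12115520),
    "IPI": (54145883, 37961385, 29769766),
    "Swiss-Prot": (56454588, 52662145, 44374286),
    "Swiss-Prot-VS": (89541275, 54534356, 45307827),
    "UniProt": (472581860, 337119564, 274510105),
    "UniProt-VS": (506796094, 338890778, 275391669),
    "MSDB": (481919777, 342924164, 276523755),
    "NRP": (495502241, 351600578, 283160529),
    "NCBI-nr": (619132252, 463517034, 378721915),
    "UnionNR": (674700840, 473665310, 385369671),
}


def compression_summary(table):
    """Return {name: (C3 size / size, C3 size / distinct 30-mers)}."""
    summary = {}
    for name, (size, c3_size, kmers) in table.items():
        summary[name] = (c3_size / size, c3_size / kmers)
    return summary


if __name__ == "__main__":
    for name, (kept, overhead) in compression_summary(FIGURE_1).items():
        print(f"{name:14s} C3/size = {kept:.3f}   C3/30-mers = {overhead:.3f}")
```

The second ratio is always above 1, since a sequence holding N distinct 30-mers needs at least N + 29 characters; how far above 1 it sits indicates how much a C3 database costs beyond the bare 30-mer count.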

Experiment Parameters

• ISB 17 Protein Mix (2043 MS/MS Spectra)
• Mascot 2.0
• Precursor tolerance: 2 Da
• Fragment tolerance: 0.15 Da
• Up to 2 missed trypsin cleavages
• IPI-HUMAN, Swiss-Prot(-VS), UniProt(-VS)
• Dell PC with 512 Mb of RAM

Faster Search

[Figure 2 panels: absolute search time (axis 0 to 7000) and relative search time (axis 0% to 120%), with bars for IPI-HUMAN, Swiss-Prot, Swiss-Prot-VS, UniProt, and UniProt-VS, each as C3 and original.]

Figure 2: Absolute (seconds) and relative Mascot search time for original and C3 sequence databases. Blue represents Mascot search time, other colors represent peptide mapping time.

More Sensitive Search

Figure 3: Significant E-values for Mascot search against Swiss-Prot-VS (x) vs C3 Swiss-Prot-VS (y).

Fitted line: y = 0.6089x, R^2 = 0.9995. Both axes are logarithmic and run from 1.0E-08 to 1.0E+00.
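One way to read the fitted slope in Figure 3 is to compare it with the size reduction of the database itself. The sketch below assumes, as a simplification not stated on the poster, that an E-value scales roughly linearly with the size of the searched database; under that assumption the slope should be close to the ratio of C3 size to original size for Swiss-Prot-VS. The sizes are taken from Figure 1, and the constant names are hypothetical.

```python
# Values from Figure 1 (Swiss-Prot-VS) and Figure 3 (fitted slope).
SIZE_SWISS_PROT_VS = 89541275
C3_SIZE_SWISS_PROT_VS = 54534356
FITTED_SLOPE = 0.6089

# Assumption: E-value is proportional to database size, so the ratio of
# E-values (C3 vs original) would track the ratio of database sizes.
size_ratio = C3_SIZE_SWISS_PROT_VS / SIZE_SWISS_PROT_VS

if __name__ == "__main__":
    print(f"C3 size / original size = {size_ratio:.4f}")
    print(f"fitted slope (Figure 3) = {FITTED_SLOPE:.4f}")
    print(f"absolute difference     = {abs(size_ratio - FITTED_SLOPE):.4f}")
```

The two numbers agree to about three decimal places, which is consistent with the improvement in significance coming from the smaller search space rather than from any change in the matches themselves.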

Searching Human ESTs

Figure 4: Size (blue, bottom axis, in Mb) and Mascot search time (red, top axis, in hours) for C3 and brute-force Human EST databases.

Human ESTs represent a largely unexplored source of peptide sequence data. EST sequence databases are highly redundant, with a sequencing error rate of about 1%.

We translate the Human ESTs in all 6 frames and C3 compress the result. In the process, we eliminate open reading frames of less than 50 amino acids, and amino-acid 30-mers that are observed only once.
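The EST preprocessing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used for the poster: an open reading frame is taken to be a stop-to-stop stretch of the translation, 30-mer occurrences are counted across all ESTs and frames, the standard genetic code is used, and unknown codons become X. All function names are hypothetical, and the final C3 compression step is not shown.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frame_translations(dna):
    """Yield the 3 forward and 3 reverse-complement frame translations."""
    dna = dna.upper()
    for strand in (dna, dna.translate(COMPLEMENT)[::-1]):
        for offset in range(3):
            codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
            yield "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def open_reading_frames(protein, min_length=50):
    """Split a translation at stop codons and keep the long stretches."""
    return [orf for orf in protein.split("*") if len(orf) >= min_length]


def repeated_kmers(ests, k=30, min_length=50, min_count=2):
    """Return the k-mers seen at least min_count times in the kept ORFs."""
    counts = Counter()
    for est in ests:
        for frame in six_frame_translations(est):
            for orf in open_reading_frames(frame, min_length):
                for i in range(len(orf) - k + 1):
                    counts[orf[i:i + k]] += 1
    return {kmer for kmer, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    # Toy ESTs with shortened k and ORF length so the effect is visible.
    ests = ["ATGGCCAAAGGG", "ATGGCCAAATTT"]
    print(sorted(repeated_kmers(ests, k=3, min_length=3)))
```

Dropping k-mers seen only once is what filters out most sequencing errors here: an erroneous base changes the 30-mers that cover it, and those altered 30-mers are unlikely to recur in another EST.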

We achieve 40-fold compression in sequence database size, and therefore in running time.

[Figure 4 chart: dbEST and C3 dbEST, on axes scaled 0 to 7000 and 0 to 18; the chart carries a "(Projected)" label.]

We searched the Open Proteomics Database MS/MS dataset “SiHa human cell line used to model cervical cancer” against C3 Human dbEST with Mascot on a PC (512 Mb RAM):
• Total spectra: 47788
• Search time: 4.5 hours
• Novel peptides: 19

References

[1] N. Edwards and R. Lippert. Sequence database compression for peptide identification from tandem mass spectra. WABI 2004.

P-1014