Faster, more sensitive peptide identification from tandem mass spectra by sequence database compression
Nathan J. Edwards
Center for Bioinformatics & Computational Biology, University of Maryland, College Park
Introduction
Current sequence databases contain considerable peptide sequence redundancy. An amino-acid sequence database often contains considerably fewer distinct 30-mers than its size suggests. The distinct 30-mers can be searched faster than the original sequence database, and the statistical significance of the peptide identifications improves.
Peptide Sequences are Short
Peptides identified in MS/MS workflows are rarely longer than 30 amino acids.
• Trypsin cuts after K or R, unless followed by P
• Precursor ion is usually < 3000 Da
• Charge state is usually +1, +2, or +3
• Peptides of more than 20 amino acids typically don’t fragment well
30 is a conservative upper bound on the length of peptides identified by MS/MS workflows.
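The trypsin rule listed above can be illustrated with a minimal in-silico digest. This is a sketch under stated assumptions, not the tool used for the poster: the function name `tryptic_peptides` and its parameters are hypothetical, and the 30-residue cap simply reuses the upper bound argued for here.

```python
def tryptic_peptides(protein, missed_cleavages=0, max_length=30):
    """Hypothetical helper: peptides from a simple trypsin digest.

    Cleaves after K or R unless the next residue is P, allows up to
    `missed_cleavages` skipped sites, and drops peptides longer than
    `max_length` residues.
    """
    # Collect cleavage positions (index just after each cleaved residue).
    cuts = [0]
    for i, aa in enumerate(protein):
        if aa in "KR" and i + 1 < len(protein) and protein[i + 1] != "P":
            cuts.append(i + 1)
    cuts.append(len(protein))

    peptides = []
    for start in range(len(cuts) - 1):
        # Extend across at most `missed_cleavages` additional sites.
        last = min(start + 1 + missed_cleavages, len(cuts) - 1)
        for end in range(start + 1, last + 1):
            peptide = protein[cuts[start]:cuts[end]]
            if peptide and len(peptide) <= max_length:
                peptides.append(peptide)
    return peptides
```

For example, `tryptic_peptides("AAKRPCCRDD")` cuts after the first K and after the second R, but not between R and P.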
Protein Sequence Databases
• IPI-HUMAN, from EBI
• IPI, concatenation of IPI-HUMAN, IPI-MOUSE, and IPI-RAT, from EBI
• Swiss-Prot, from ExPASy
• Swiss-Prot-VS, Swiss-Prot plus varsplic.pl variant enumeration
• UniProt, concatenation of Swiss-Prot and TrEMBL
• UniProt-VS, UniProt plus varsplic.pl variant enumeration
• MSDB, from Imperial College
• NRP, from NCI Frederick
• NCBI-nr, from NCBI
• UnionNR, non-redundant union of all of the above
Figure 1 shows the size and the number of distinct 30-mers of each database.
Figure 1: Sequence databases: size, C3 size, distinct 30-mers.
C3 Database Compression
A C3 sequence database [1] is:
• Complete: all 30-mers are represented
• Correct: no new 30-mers are represented
• Compact: each 30-mer occurs exactly once
Figure 1 shows the size of the C3-compressed sequence database for each database.
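The three C3 properties can be stated as checks on the k-mer content of two sequence sets. The sketch below only verifies the properties; it is not the compression algorithm of [1], and the names `kmer_counts` and `check_c3` are hypothetical.

```python
def kmer_counts(sequences, k=30):
    """Count occurrences of every k-mer across a list of sequences."""
    counts = {}
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            counts[kmer] = counts.get(kmer, 0) + 1
    return counts


def check_c3(original, compressed, k=30):
    """Report whether `compressed` is complete, correct, and compact
    with respect to the k-mers of `original`."""
    orig = kmer_counts(original, k)
    comp = kmer_counts(compressed, k)
    return {
        "complete": set(orig) <= set(comp),   # every original k-mer is present
        "correct": set(comp) <= set(orig),    # no k-mer was invented
        "compact": all(n == 1 for n in comp.values()),  # no k-mer repeats
    }
```

With a small k for readability, `check_c3(["ABCDE", "BCDEF"], ["ABCDEF"], k=3)` reports all three properties as satisfied, since the single sequence `ABCDEF` contains each distinct 3-mer of the two inputs exactly once.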
Figure 1 data:

Database        Size         C3 Size      30-mers
IPI-HUMAN       20358846     13854679     12115520
IPI             54145883     37961385     29769766
Swiss-Prot      56454588     52662145     44374286
Swiss-Prot-VS   89541275     54534356     45307827
UniProt         472581860    337119564    274510105
UniProt-VS      506796094    338890778    275391669
MSDB            481919777    342924164    276523755
NRP             495502241    351600578    283160529
NCBI-nr         619132252    463517034    378721915
UnionNR         674700840    473665310    385369671
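The Figure 1 values can be turned into compression ratios with a few lines of arithmetic. The sketch below copies the numbers from the figure; the dictionary name `FIGURE1` and the function `compression_ratios` are hypothetical names introduced for illustration.

```python
# (size, C3 size, distinct 30-mers) per database, copied from Figure 1.
FIGURE1 = {
    "IPI-HUMAN": (20358846, 13854679, 12115520),
    "IPI": (54145883, 37961385, 29769766),
    "Swiss-Prot": (56454588, 52662145, 44374286),
    "Swiss-Prot-VS": (89541275, 54534356, 45307827),
    "UniProt": (472581860, 337119564, 274510105),
    "UniProt-VS": (506796094, 338890778, 275391669),
    "MSDB": (481919777, 342924164, 276523755),
    "NRP": (495502241, 351600578, 283160529),
    "NCBI-nr": (619132252, 463517034, 378721915),
    "UnionNR": (674700840, 473665310, 385369671),
}


def compression_ratios(table):
    """Return {database: (C3 size / size, distinct 30-mers / size)}."""
    return {
        name: (c3_size / size, distinct / size)
        for name, (size, c3_size, distinct) in table.items()
    }


if __name__ == "__main__":
    for name, (c3_ratio, kmer_ratio) in compression_ratios(FIGURE1).items():
        print(f"{name:14s} C3/size = {c3_ratio:.3f}  30-mers/size = {kmer_ratio:.3f}")
```

In every row the number of distinct 30-mers is below the C3 size, which in turn is below the original size.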
Experiment Parameters
• ISB 17 Protein Mix (2043 MS/MS spectra)
• Mascot 2.0
• Precursor tolerance: 2 Da
• Fragment tolerance: 0.15 Da
• Up to 2 missed trypsin cleavages
• IPI-HUMAN, Swiss-Prot(-VS), UniProt(-VS)
• Dell PC with 512 MB of RAM
Faster Search
Figure 2: Absolute (seconds) and relative Mascot search time for original and C3 sequence databases (IPI-HUMAN, Swiss-Prot, Swiss-Prot-VS, UniProt, UniProt-VS). Blue represents Mascot search time; other colors represent peptide mapping time.
More Sensitive Search
Figure 3: Significant E-values for Mascot search against Swiss-Prot-VS (x) vs C3 Swiss-Prot-VS (y). Linear fit: y = 0.6089x, R² = 0.9995. Both axes are logarithmic, from 1.0E-08 to 1.0E+00.
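One way to read the fitted slope in Figure 3 is to compare it with the size ratio of the two databases. The sketch below assumes that an E-value scales linearly with the size of the searched database; that assumption is ours and is not stated on the poster. The sizes are taken from Figure 1, and the function name `scaled_e_value` is hypothetical.

```python
SWISSPROT_VS_SIZE = 89541275      # Swiss-Prot-VS size, from Figure 1
C3_SWISSPROT_VS_SIZE = 54534356   # C3 Swiss-Prot-VS size, from Figure 1
FITTED_SLOPE = 0.6089             # slope of the linear fit in Figure 3


def scaled_e_value(e_value, original_size, compressed_size):
    """E-value expected against the compressed database, ASSUMING
    E-values are proportional to database size."""
    return e_value * compressed_size / original_size


size_ratio = C3_SWISSPROT_VS_SIZE / SWISSPROT_VS_SIZE

if __name__ == "__main__":
    print(f"size ratio   = {size_ratio:.4f}")
    print(f"fitted slope = {FITTED_SLOPE:.4f}")
```

Under this assumption the size ratio and the fitted slope agree to within 0.001, which is consistent with the improvement in E-values coming from the smaller search space.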
Searching Human ESTs
Figure 4: Size (Blue, bottom axis, in Mb) and Mascot search time (Red, top axis, in hours) for C3 and brute force Human EST database.
Human ESTs represent a largely unexplored source of peptide sequence data. EST sequence databases are highly redundant, and EST sequences have a sequencing error rate of about 1%.
We translate the Human ESTs in all 6 frames and C3-compress the result. In the process, we eliminate open reading frames shorter than 50 amino acids and amino-acid 30-mers that are observed only once.
We achieve a 40-fold reduction in sequence database size and, therefore, in running time.
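The translation and filtering steps described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used for the poster: an open reading frame is taken here to be any stretch between stop codons, the standard genetic code is used, unknown codons become `X`, and the names `six_frame_orfs` and `repeated_kmers` are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate one reading frame; codons with unknown bases become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_orfs(dna, min_length=50):
    """Stop-to-stop ORFs of at least `min_length` residues from all
    three frames of both strands."""
    dna = dna.upper()
    reverse_complement = dna.translate(COMPLEMENT)[::-1]
    orfs = []
    for strand in (dna, reverse_complement):
        for frame in range(3):
            for orf in translate(strand[frame:]).split("*"):
                if len(orf) >= min_length:
                    orfs.append(orf)
    return orfs


def repeated_kmers(orfs, k=30):
    """k-mers observed more than once across all ORFs."""
    counts = Counter(orf[i:i + k] for orf in orfs
                     for i in range(len(orf) - k + 1))
    return {kmer for kmer, n in counts.items() if n > 1}
```

Requiring each 30-mer to be observed at least twice is one plausible way to suppress 30-mers that arise from isolated sequencing errors, since an error is unlikely to recur at the same position in independent ESTs.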
(Figure 4 plot: bars for dbEST and C3 dbEST, with a "(Projected)" annotation.)
We searched the Open Proteomics Database MS/MS dataset “SiHa human cell line used to model cervical cancer” against C3 Human dbEST with Mascot on a PC (512 MB RAM).
• Total spectra: 47788
• Search time: 4.5 hours
• Novel peptides: 19
References
[1] N. Edwards and R. Lippert. Sequence database compression for peptide identification from tandem mass spectra. WABI 2004.
P-1014