Upload
independent
View
0
Download
0
Embed Size (px)
Citation preview
Research Article
De novo peptide sequencing by tandem MSusing complementary CID and electrontransfer dissociation
De novo sequencing of peptides using tandem MS is difficult due to missing fragment
ions in the spectra commonly obtained after CID of peptide precursor ions. Comple-
menting CID spectra with spectra obtained in an ion-trap mass spectrometer upon
electron transfer dissociation (ETD) significantly increases the sequence coverage with
diagnostic ions. In the de novo sequencing algorithm CompNovo presented here, a
divide-and-conquer approach was combined with an efficient mass decomposition
algorithm to exploit the complementary information contained in CID and ETD spectra.
After optimizing the parameters for the algorithm on a well-defined training data set
obtained for peptides from nine known proteins, the CompNovo algorithm was applied
to the de novo sequencing of peptides derived from a whole protein extract of Sorangiumcellulosum bacteria. To 2406 pairs of CID and ETD spectra contained in this data set, 675
fully correct sequences were assigned, which represent a success rate of 28.1%. It is
shown that the CompNovo algorithm yields significantly improved sequencing accuracy
as compared with published approaches using only CID spectra or combined CID and
ETD spectra.
Keywords:
CID / De novo / ETD / OpenMS / Sorangium cellulosumDOI 10.1002/elps.200900332
1 Introduction
The fully automated annotation of mass spectra generated by
MS/MS of proteolytic peptides followed by computer-aided
database searching forms the basis of high-throughput
identification of proteins in proteome analysis [1, 2]. Never-
theless, the majority of mass spectra generated in a proteomic
experiment remains un-annotated, mainly because of
incomplete sequence information in the mass spectra or
the lack of sequence information about the protein under
investigation in sequence databases. Moreover, the similarity
of many protein subsequences together with incomplete
sequence information in the mass spectra results in false-
positive identifications [3]. In order to increase the reliability
of protein identifications based on MS data in proteome
analysis, it is therefore mandatory both to improve coverage
of the complete peptide sequence with diagnostic fragment
ions and to implement efficient computer algorithms to
automatically interpret the fragment ion spectra [4].
Gas-phase fragmentation of peptide ions by low-energy
CID allows inference of their sequence from typical ion
series [5, 6], which means primarily b-ions and y-ions
resulting from cleavage of the peptide bond. More recently,
CID has been complemented by electron-induced frag-
mentation methods such as electron capture dissociation
(ECD, [7]) or electron transfer dissociation (ETD, [8]). In
ETD, multiply protonated peptide ions are fragmented upon
transfer of electrons from an electron carrier molecule
derived from anthracene or fluoranthene by chemical ioni-
zation, resulting in dissociation of the peptide radical ions
due to highly exothermic electron recombination reactions.
In contrast to CID, ETD has no preferred cleavage sites
except proline, which leads to a more uniform distribution
of the fragment ions, mainly of c- and z-type, along the
Andreas Bertsch1
Andreas Leinenbach2,3
Anton Pervukhin4
Markus Lubeck5
Ralf Hartmer5
Carsten Baessmann5
Yasser Abbas Elnakady6
Rolf Muller6
Sebastian Bocker4
Christian G. Huber3
Oliver Kohlbacher1
1Center for BioinformaticsTubingen, Eberhard-Karls-Universitat Tubingen, Tubingen,Germany
2Instrumental Analysis andBioanalysis, SaarlandUniversity, Saarbrucken,Germany
3Department of MolecularBiology, Division of Chemistryand Bioanalytics, University ofSalzburg, Salzburg, Austria
4Lehrstuhl Bioinformatik,Friedrich-Schiller-UniversitatJena, Jena, Germany
5Bruker Daltonik GmbH, Bremen,Germany
6Department of PharmaceuticalBiotechnology, SaarlandUniversity, Saarbrucken,Germany
Received December 18, 2008Revised July 22, 2009Accepted July 29, 2009
Abbreviations: ECD, electron capture dissociation; ETD,
electron transfer dissociation
Correspondence: Andreas Bertsch, Universitat Tubingen,Wilhelm Schickard Institute, Sand 14, 72076 Tubingen, GermanyE-mail: [email protected]:149-7071-29-5152
& 2009 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.electrophoresis-journal.com
Electrophoresis 2009, 30, 3736–37473736
peptide sequence. Hence, CID and ETD are complementary
fragmentation techniques, leading to a more comprehensive
coverage of a peptide sequence with fragment ions [9].
It was shown that the sequence coverage of doubly
charged peptide ions could be increased from 84% with CID
alone to 91% upon combining fragment information from
CID and ECD [4]. If a sequence database of the species
under investigation is available, annotation of fragment ion
mass spectra can be performed by matching theoretical ions
of candidate peptide sequences to the fragment ions
observed in the experimental mass spectrum. A variety of
database searching algorithms are currently in use for fully
automated spectrum interpretation, such as MASCOT [2],
SEQUEST [1], OMSSA [10], and X!Tandem [11]. Although
these approaches are standard procedures in proteomic
research, there are cases where appropriate sequence
databases are unavailable or where sequencing errors or
sequence polymorphisms render database searching
unsuitable. For those cases, de novo identification, i.e. the
inference of the peptide sequence directly from the experi-
mental spectrum, represents a viable alternative.
The de novo identification problem has been studied
intensively [12–17]. Common methods can be subdivided
into two steps: candidate sequence generation and scoring.
Most of the approaches (e.g. [14–22]) use a spectrum graph,
in which the nodes correspond to ions of the spectrum and
edges correspond to valid mass differences. A valid mass
difference means that the difference between the m/z ratios
of two ions in the spectrum corresponds to the mass of one
of the common amino acids. The candidate sequences are
generated by traversal through the graph. Another approach
makes use of a hidden Markov model, which directly
scores the generated sequence [23]. A divide-and-conquer
approach, where the candidates are generated from
subspectra (between abundant ions) via exhaustive
enumeration of amino acid combinations has also been
used [24] for de novo sequencing. Very recently, an integer
linear program formulation was published [25], showing
superior performance compared with other well-known denovo searching tools both for data measured on high-reso-
lution quadrupole time-of-flight and for low-resolution ion-
trap mass spectrometers. Although some approaches show
good performances on high-quality spectra, in which most
of the theoretical ions are present, the overall performance
compared with database-driven approaches is still poor.
Scoring of candidate sequences against the experimental
spectrum was also studied intensively (e.g. [15, 26–28]),
because such a scoring procedure is needed by most of the
database-driven and de novo approaches.
Horn et al. [29] presented a method of whole protein denovo sequencing employing CID and ECD as complementary
fragmentation techniques on a high-resolution Fourier-trans-
form ion cyclotron resonance mass spectrometer. Exploiting
the high mass accuracy of Fourier-transform ion cyclotron
resonance-MS/MS and the complementarity of both frag-
mentation techniques, the same group devised a linear de novosequencing algorithm, which starts with the most reliable
fragment ion data and propagates linearly by including addi-
tional amino acid masses [4]. From a total of 7852 MS/MS
data sets included in the de novo sequencing evaluation, the
completely correct sequences of 1704 peptides could be
annotated, representing 22% of all peptides identified of all
peptides identified by MASCOT with a Mowse score greater
than or equal to 34. Very recently, Datta et al. [30] imple-
mented a sophisticated machine learning approach to make
use of ETD and CID spectra for de novo sequencing. They use
a tree-augmented naive Bayes classifier to learn specific
features of fragmentation types in the spectra and combine
them into one spectrum. Then a graph-based approach was
employed to obtain the final list of peptides. The focus of this
work, however, was on the generation of peptide candidates.
Recent instrumental developments have led to the
commercial availability of ion-trap mass spectrometers,
which are able to generate CID and ETD spectra from the
same precursor ion in a short period of time [8]. Given the
fact that high-resolution mass spectrometers are quite
expensive and that the majority of protein identification data
today is still measured on low-resolution ion-trap mass
spectrometers, we aim in this study at a method for the denovo sequencing of peptides employing CID and ETD in an
ion-trap mass spectrometer. We demonstrate that these
spectra contain complementary information that can be
used to make de novo identification much more reliable. In
order to optimize the parameters employed for a combina-
torial algorithm, which we call CompNovo, we utilize CID
and ETD fragment ion spectra obtained from tryptic
peptides of nine standard proteins. The algorithm then
employs a divide-and-conquer approach together with rapid
mass decomposition to annotate the fragment ion masses
derived from the combined CID and ETD spectrum. Finally,
the optimized parameters are utilized to obtain de novosequence information from CID and ETD fragmentation
data derived from the measurement of the whole proteome
of the bacterium Sorangium cellulosum [31].
2 Materials and methods
2.1 Experimental data generation
For the generation of the training data set, nine proteins,
obtained from Sigma (St. Louis, MO) or Fluka (Buchs,
Switzerland), were partitioned into two mixtures and
tryptically digested (trypsin from Promega, Madison, WI)
using published protocols [32]. The mixtures contained the
proteins in concentrations between 0.3 and 1.7 pmol/mL.
Mixture 1 comprised thyroglobulin (bovine thyroid gland)
and albumin (bovine serum), whereas mixture 2 contained
b-casein (bovine milk), conalbumin (chicken egg white),
myelin basic protein (bovine), hemoglobin (human), leptin
(human), creatine phosphokinase (rabbit muscle), a1-acid-
glycoprotein (human plasma), and albumin (bovine serum).
The resulting peptide mixtures were then separated
using capillary ion-pair RP HPLC (IP-RP-HPLC) and
Electrophoresis 2009, 30, 3736–3747 Proteomics and 2-DE 3737
& 2009 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.electrophoresis-journal.com
subsequently identified by ESI-MS/MS as described in
detail in [32, 33]. Briefly, one-microliter samples was
desalted and preconcentrated at a flow rate of 10 mL/min of
0.05% aqueous TFA in a capillary/nano HPLC system
(Model Ultimate, Dionex Benelux, Amsterdam, The Neth-
erlands) using a 10.0� 0.2 mm monolithic poly-(styrene/
divinylbenzene) preconcentration column, which was
prepared according to the published protocols [32, 34]. The
separation of the preconcentrated peptides was carried out
upon transfer to a 50� 0.1 mm monolithic poly-(styrene/
divinylbenzene) separation column (Dionex Benelux),
employing a gradient of 0–40% ACN in 0.05% v/v aqueous
TFA in 60 min at 551C. Digests were analyzed in quad-
ruplicate at a flow rate of 0.7 mL/min.
An ion-trap mass spectrometer with implemented ETD
module (Model HCTultra PTM discovery system, Bruker
Daltonik GmbH, Bremen, Germany) and equipped with a
modified ESI-ion source (spray capillary: fused silica capil-
lary, 0.090 mm od and 0.020 mm id) was used. A detailed
description of the ETD setup of the ion-trap instrument
including the negative chemical ionization source needed
for the generation of the reagent anion of fluoranthene and
the transfer into the ion trap can be found in [35]. Tandem
mass spectra were generated using sequential CID and ETD
fragmentation of the same precursor ion. Singly charged
analytes were automatically excluded. The ETD reaction
time, i.e. the reaction time for electron transfer from fluor-
anthene anions to peptide cations, was set to 100 ms. A
short low-energy collisional activation step (‘‘smart decom-
position’’, Bruker Daltonics [36]) was performed after elec-
tron transfer to further improve the fragmentation especially
of doubly and triply charged peptides, because the frag-
ments often do not dissociate due to non-covalent binding
forces. The scan range was 210–2000 Da to exclude fluor-
anthene ions and to achieve a reasonable scan time for one
spectrum pair.
To generate a data set of a real life sample, approxi-
mately 690 mg of proteins extracted from cultivated S. cellu-losum [31], a soil living bacterium, were tryptically digested
(trypsin from Promega) using the published protocols [32].
The resulting peptide mixture was separated using an off-
line two-dimensional HPLC setup exactly as described in
[37]. IP-RP-HPLC-ESI-MS/MS conditions and parameters
for alternating CID and ETD fragmentation and detection
were as described in the previous paragraph.
2.2 Data preprocessing
A training data set was extracted from the analysis of the two
digests of the nine proteins. After optimization of the
algorithmic parameters with the training data set, the
algorithm was applied to the de novo sequencing of peptides
in a digest of a protein extract from S. cellulosum. Since none
of the commercially available database-driven search
engines supported the use of CID and ETD spectrum pairs
during the identification process and identification using
ETD only showed considerably poorer performance with
very few identifications unique to ETD, only CID spectra
were used for the annotation of the peptides in the data sets.
CID spectra from the eight HPLC runs of the standard
protein mixture were identified through database searches
using MASCOT version 2.1.03 [2], allowing one missed
cleavage and carboxymethylation of cysteine as fixed
modification and oxidation of methionine as variable
modification. Oxidized methionine has almost the same
mass as phenylalanine, which will be taken into considera-
tion in our evaluation below.
Spectra of doubly charged precursors and hits to tryptic
peptides with a peptide score greater than 10 were used.
This low-score cut-off was comparable to that of used by
others [38] and ensured that spectra of lower quality were
also included in the training data set. Manual inspection of
the spectra with scores less than 10 showed that the ion
series were highly incomplete and the spectra therefore not
suitable for parameter estimation. In order to avoid bias to
specific sequences, we allowed spectra corresponding to the
same peptide sequence and charge combination only once
in the data set. The final data set was constructed by
selecting all spectra of peptides with a molecular mass less
than or equal to 2000 Da where the peptide sequence was a
sub-sequence of one of the proteins present in the sample.
This selection resulted in a data set of 156 pairs of ETD/CID
spectra of different doubly charged peptides. Only seven
triply charged peptides were present and these were treated
separately.
In order to distinguish correctly from incorrectly deter-
mined sequences, the peptides in the benchmark data set of
S. cellulosum were identified using MASCOT, version 2.2.04,
OMSSA, version 2.1.4 and X!Tandem, version 08-02-01. The
protein sequence database used included all 9320 protein
sequences from S. cellulosum of Uni-ProtKB/Swiss-Prot
release 15.4/57.4 and 86 protein sequences of trypsin and
keratin. Cysteine carboxymethylation and methionine
oxidation were included as variable modifications and up to
one missed cleavage was allowed during the search runs. To
ensure that the peptides included in the data set were
correctly identified, a more complex verification procedure
was performed for the identifications due to the unknown
protein content in the sample. The identification was carried
out using a composite database in which reversed sequences
of all proteins were appended to the original S. cellulosumsequence database [39]. False discovery rates were estimated
from the obtained identification lists. A peptide sequence
was accepted if at least two identification engines identified
the peptide as a top hit with a false discovery rate of 5% or
better. If more than one spectrum pair represented the same
sequence, the one with the best average false discovery rate
was chosen. The 5% false discovery rate score cut-offs for
the three search engines were 0.5 for the MASCOT E-value
(MASCOT Mowse score 420), 0.09 for the X!Tandem
E-value, and 0.1 for the OMSSA E-value. This means that
only scores better than the thresholds were accepted for the
consensus. Using the 38 261 MS/MS spectrum pairs in our
Electrophoresis 2009, 30, 3736–37473738 A. Bertsch et al.
& 2009 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.electrophoresis-journal.com
data set, this filtering procedure resulted in 2406 spectrum
pairs of charge two and 123 of charge three, in which each
spectrum pair corresponded to a unique peptide sequence.
The data sets used in this study have been deposited in the
Proteomics Identification Database (PRIDE [40, 41]) and are
publicly available under the experiment accessions 8689 and
8690.
2.3 Algorithm overview
The CompNovo algorithm comprises two steps: candidate
generation and candidate scoring. After preprocessing of the
spectrum, a divide-and-conquer algorithm decomposes the
spectrum into smaller parts until the segment is small
enough such that the mass decomposition can generate all
amino acid compositions for a given mass difference. From
this list of compositions, all permutations of sequences are
generated, which are then scored against the spectra. The
best scoring sequences are kept and used as candidates for
the segment. After all segments have been processed, the
final candidate list is generated and the candidate sequences
are scored against the spectra. This is performed by a two-
step approach. First, very simple spectra are generated from
the candidate sequences and scored against the experimen-
tal spectra. The best candidates are then used in a second
scoring step to generate more detailed theoretical spectra,
e.g., by including additional ion series, losses, and isotope
clusters, which are in turn scored against the experimental
spectra. The various modules of the CompNovo sequencing
algorithm are schematically shown in Fig. 1. We will
now describe the individual parts of the algorithm in more
detail.
The CompNovo algorithm is part of the OpenMS plat-
form and downloadable from www.openms.de. The source
code of CompNovo is available and will be released under an
open-source license as part of the upcoming version of our
software packages OpenMS [42] and TOPP [43].
2.4 Spectrum preprocessing
The CompNovo algorithm starts with y-ions of the CID
spectrum. To select as many true y-ions as possible from the
spectrum and to reduce the probability of selecting other ion
types in the divide-and-conquer step, the following scoring
steps are performed. The initial score for each of the ions
present in the spectrum is its intensity. As ETD spectra
contain abundant peaks of the precursor ion in almost all
cases [44], the mass of the peptide can be determined within
the mass tolerance of the tandem MS scan. This allows an
accurate scoring of the complementary features of the
spectra. First, the isotope patterns of each of the peaks are
scored against the theoretical isotope distribution pattern
based on the molecular mass and the average elemental
composition of peptides with respect to carbon, hydrogen,
oxygen, and nitrogen. A simple correlation is used as a
multiplicative scoring in subsequent steps. Doubly charged
ions that have isotope ions correlating well with a theoretical
isotope distribution are transformed and inserted as singly
charged ions into the spectrum, or the intensity is added to
the corresponding singly charged ion if this is already
present in the spectrum. The scores of a witness set of ions,
i.e. the set of all ions that support the y-ion, are used to
enhance its score. This includes neutral loss of water and
ammonia from y-ions as well as complementary b-ions,
which can be witnessed by putative a-ions. In addition to the
witness set present in the CID spectrum, the c- and z-type
ions of the ETD spectrum are used to provide further
evidence that a supposed y-ion is correctly assigned. ETD
spectra often contain radical variants of the c- and z-type
ions also [44]; therefore, the single radical ions are also used
for scoring. The intensities of the y-ion and the weighted
intensities of the witness ions are added. Here, the
weighting factor is derived from the relative difference of
the deviation of the expected versus calculated mass
difference. For instance, if a putative y-ion is witnessed by
a water loss ion, the mass difference between both ions
should be 18.0 Da. However, if it is observed in a distance of
18.2 Da the factor is 0.5, given a mass tolerance of 0.4 Da in
the MS/MS scan. This is the same factor as used in the
Figure 1. Schematic diagram of the CompNovo de novosequencing algorithm. Each spectrum pair was preprocessedto assign a score for being a y-ion to each peak. The spectrumwas subdivided into smaller sections using putative y-ions aspivot ions. Sufficiently small subsections were then decom-posed into amino acid compositions, which in turn were used togenerate subsequences. The scored subsequences of twosubsections of a section were joined and scored against thespectrum pair. Candidates of the whole spectrum were re-scoredusing a more detailed scoring scheme.
Electrophoresis 2009, 30, 3736–3747 Proteomics and 2-DE 3739
& 2009 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.electrophoresis-journal.com
spectra comparison, described in Section 2.8. Finally, the
low-mass-range and high-mass-range candidates are
checked to determine whether a mass decomposition is
possible for the ions. If no valid mass decomposition was
found for a specific ion, the ion cannot be a y-ion and the
score is set to 0 to avoid selection as pivot ion. All the ions
which have positive scores can be used as pivot ions.
2.5 Rapid mass decomposition
In the second step of candidate generation, mass differences
between ions are used to generate candidates for the
subsequent divide-and-conquer analysis. Given a pair of
pivot masses, we search for all amino acid compositions that
are compatible with this mass difference, that is, have a
theoretical mass close to the measured mass difference.
Clearly, we cannot recover the order of amino acids solely
from the mass difference, but we can infer the multiplicity
of each amino acid in every hypothetical amino acid
sequence explaining the mass difference. In this context,
we say that we decompose the mass difference over the
amino acid alphabet. To find all of these decompositions,
there exist efficient algorithms [45, 46] where running time
only depends on the number of decompositions of the input
mass, and not on the mass itself. In fact, these algorithms
can only be applied to integer masses, and hence we use the
correction described in [47] to account for rounding error
accumulation. This approach is efficient and at the same
time guarantees that no decomposition is missed. Mass
accuracy of the ion-trap mass spectra used in our analysis is
comparatively low. To guarantee that the complete analysis
pipeline is sufficiently fast, decompositions must be
computed in a matter of milliseconds. Doing so over the
amino acid alphabet, which can also include additional
modified amino acid masses, can be carried out only for
relatively small mass differences.
2.6 Divide-and-conquer approach
In order to be able to correlate measured mass differences
with a reasonable number of peptide sequence candidates,
subsequences are generated using a divide-and-conquer
approach, similar to the one described by Zhang [24]. The
key difference is that, instead of using b-ions as pivot ions,
our approach uses y-ions. A high-scoring putative y-ion is
used in the beginning to divide the spectrum into two parts.
The upper m/z part represents the prefix of the peptide
sequence and the lower m/z part the suffix of the peptide
sequence. If such a newly created segment is wider than a
given threshold, it is further divided into two new
subsegments using a pivot ion. To ensure that the correct
sequence is generated, and to guard against occasional false
pivot ions (non-y-ions), several pivot ions are selected perstep. If the subsegment is small enough, the rapid mass
decomposition algorithm is used to generate all possible
amino acid compositions explaining the observed mass
difference. From each of the compositions all permutations
of possible sequences are generated. To reduce the number
of permutations, they are scored against the experimental
spectra and only the best candidates are kept. This is done
by creating theoretical spectra of the peaks generated by the
permutations for a subsegment. When a subsegment is
located somewhere in the center of the whole mass range of
the spectrum, a prefix and a suffix are added during
generation of the theoretical spectra. These spectra are then
in turn used for comparison with the experimental
spectrum pair. When the generation of the candidates of
both segments of a divide step is completed, the results are
combined. Each of the candidates of the left segment is
combined with every candidate of the right segment. The
created set of candidates is again scored and only the best
candidates are kept. The whole procedure is repeated for
several ions and the candidates are recorded for the final
scoring of the full-sized sequence candidates.
2.7 Theoretical spectra
Theoretical spectra are used to score against the experi-
mental spectra. CID and ETD spectra are treated separately,
and then both scores are simply added to get a similarity of a
sequence and the experimental spectrum pair. To speed up
computations, CID spectra are generated in two ways. The
first type is used in the divide-and-conquer part and includes
only b- and y-ions of same intensities. The second type of
CID spectra also accounts for intensities of the different ions
[1] including a-, b-, and y-ions as well as neutral losses from
b- and y-ions. Water loss ions are added if the b- or y-ion
contains at least one of S, T, E, or D. Ammonia loss ions are
added if the b- or y-ion contains at least one of Q, N, R, or K.
The relative intensity of y-ions is 100%, b-ions are included
with a relative intensity of 80%. Losses from y-ions are
included with 10% and losses from b-ions with 2% relative
intensity. The intensity of the b-and y-ions (but not of the
loss ions) is distributed over the monoisotopic and the
isotope peaks according to the isotope distribution. ETD
spectra are generated using c-and z-type ions and using the
isotope distribution to add also isotope peaks. No differences
are made for the intensities of different ion types in ETD
spectra.
2.8 Candidate scoring
Finally, the candidates produced by the divide-and-conquer
approach are scored against the experimental spectra. For
this purpose theoretical CID and ETD spectra of the
peptides are generated. These spectra are compared with
the experimental spectrum and a ranked list according to
the similarity is created. As the number of candidates can be
very large, a two-step scoring is applied. The similarities of
the spectra are scored using the following similarity
Electrophoresis 2009, 30, 3736–37473740 A. Bertsch et al.
& 2009 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.electrophoresis-journal.com
function and a weighting factor is used to account for the
differences in expected and observed ion positions:
fi;j ¼e� jPi � PjjjPi � Pjj
;
where f is the weighting factor of ion i of the first spectrum
and ion j of the second spectrum. The positions are noted as
Pi and Pj, respectively. Parameter e is the mass tolerance
allowed for the scoring and it was set to 0.4. The similarity of
two spectra is calculated as follows:
s ¼P
i;j
ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiIi � Ij � fi;j
p
ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiPiIi �
PjIj
p ;
where Ii represents the intensity of the i-th peak of the first
spectrum and Ij stands for the intensity of the j-th peak of
the second spectrum. For the first scoring, the numerator is
calculated by using all possible pairs of ions where the
positions do not differ more than e. For the more expensive
final scoring, this is only calculated for the best pairs, which
means each ion can have at most one partner in the other
spectrum. The pairs are determined using a spectrum
alignment algorithm, where the number of paired ions is
maximized and the sum of position distances is minimized.
Intensities are ignored by the spectrum alignment algo-
rithm. First, very simple spectra for CID consisting only of
b- and y-ions are created, and for theoretical ETD spectra
only c- and z-type ions are used. The candidates are scored
using the first version of the scoring scheme. The best
peptide sequences of the pre-scoring are used for an in-
depth scoring. Therefore, b- and y-ions as well as a-ions and
neutral losses are used with different intensities in a theo-
retical spectrum. Additionally, doubly charged b- and y-ions
are used. The ETD spectra are simpler because only c-type
ions and z-type ions are used. The ions and the isotopes of
the ions are used for scoring against experimental CID and
ETD spectrum. Finally, a ranked list is generated from the
scores, which is the output of the algorithm.
2.9 Quality assessment
To count the number of correct amino acids of the predicted
versus the correct peptide sequence, the following rules are
used. Amino acid residues I/L, Q/K as well as F and
oxidized methionine cannot be distinguished by their
masses on low-resolution mass spectrometers and are
therefore treated as one amino acid. The predicted and
correct sequences are compared from the left to the
right and an amino acid of the predicted peptide sequence
is counted as correct if the corresponding amino acid
in the correct sequence is identical. ‘‘Corresponding’’
in this case means that the prefix masses of the predicted
and correct peptide sequences of the amino acid pair
under consideration do not differ by more than 2 Da.
This relatively high threshold takes into account the
sum of the precursor mass tolerance and the fragment
mass tolerance. This threshold also handles the case, e.g.,
of the peptides DFPIANGER (correct) and NFPIANGER
(sequenced) where the prefix masses of the correct amino
acids, FPIANGER are shifted by 1 Da. The rules given above
have another effect. If, for example, a correct amino acid K is
predicted as a sub-sequence GA, this is counted as one error,
if a correct subsequence GA is predicted as K, this is
counted as two errors. The longest consecutive amino acid
sequence of a predicted peptide and the number of
incorrectly predicted amino acids are then computed using
these rules.
2.10 Parameter determination for the CompNovo
algorithm
The parameters of the CompNovo algorithm used for the
benchmark data set were determined using the training data
set. To do so, a grid search for each of the individual
parameters was performed. Final parameters were 450 Da
for the maximal decomposition mass difference, the 40 best
solutions were kept for each subsegment, and then divide
steps were performed nine times using nine different pivot
ions, if available. The 250 best-scoring peptide candidates
were selected for rescoring using detailed CID spectra. For
the specification of arbitrary post-translational modifica-
tions, CompNovo used the controlled vocabulary terms as
specified by the PSI-MOD working group [48]. Additionally,
the modifications can be set to fixed or variable, as the case
requires.
2.11 Benchmarking against other de novo programs
PepNovo [21], version v2.00, was used employing
LTQ_LOW_TRYP as model parameter, which matches a
low-resolution ion-trap mass spectrometer. The precursor
mass tolerance was set to 1.5 Da, which was the same as for
CompNovo. The fragment mass tolerance was automatically
set to 0.4 Da by the chosen model. A post-translational
modification was created for cysteine with a shift of
158.0055 Da to account for the carboxymethylation of
cysteine, which was set as variable. LutefiskXP [17], version
1.0.5a, was used. Default parameters were used except for
the ion tolerance, which was set to 0.4 Da, the same as for
CompNovo. Additionally, the cysteine mass was set to the
mass of carboxymethylated cysteine.
3 Results and discussion
3.1 Complementary fragmentation observed in ETD
and CID spectra
One of the major advantages of using both ETD and CID
fragmentation techniques in one experiment rests within
the complementary properties of the spectra they produce
[49, 50]. Peptides dissociated by CID tend to show fragments
Electrophoresis 2009, 30, 3736–3747 Proteomics and 2-DE 3741
& 2009 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.electrophoresis-journal.com
from cleavages more in the middle of the sequence [51]. In
contrast, peptides dissociated through ETD fragment
preferentially toward the ends of the sequence. These
effects are shown in Figs. 2A and B, which was created
using doubly charged peptides of the benchmark data set of
S. cellulosum peptides. It shows clearly that, although b-ions
or y-ions might be present, the intensities are comparatively
low at the terminal sequence positions. On the contrary,
c-and z-ions frequently occur in high abundance in the
higher mass range, indicating preferential cleavage of
terminal amino acid residues. As reported previously,
ETD-specific ions generally have significantly lower inten-
sities compared with those generated by CID [49, 50].
Figures 2C and D show the same analysis using triply
charged peptides. Although ETD spectra tend to show many
abundant ions, the overall sequence coverage in the CID
spectra drops dramatically, especially for high-mass ions.
To quantify this observation and to demonstrate the
complementarity of the fragmentation methods, we
performed another statistical analysis of the frequency of
observation of the ions in our benchmark data set for CID
and ETD fragmentation. For each backbone position the
number of observed ions was counted and plotted in Fig. 3.
Figures 3A and B show the complementarity of the frag-
mentation properties of ETD and CID for doubly charged
precursor ions. It can be seen that the amino- and carboxy-
terminal amino acids are primarily covered by c- and z-type
ions in the ETD spectra, whereas the amino acids in the
middle of the peptide sequence are much more completely
covered with b- and y-ions. The combination of both types is
thus the logical consequence in an attempt to cover the
complete sequence in a de novo sequencing approach.
Figures 3C and D show that although ETD spectra of triply
charged peptides cover most of the sequence ions with
reasonable probabilities, the probabilities of seeing ions
specific to CID fragmentation drop dramatically in this case.
This makes the combination of both types of fragmentation
less useful in the case of triply charged ions.
3.2 Application of the CompNovo algorithm to the
analysis of an S. cellulosum protein lysate
In order to determine the most suitable parameters for the
CompNovo sequencing algorithm, we used the CID and
ETD mass spectra generated in an ion-trap instrument
operated in alternating CID/ETD data acquisition mode of
the same precursor ion. So-called smart decomposition was
enabled in the instrument, which imposes a short collisional
activation step before the expulsion of the fragment ions
from the trap to support the dissociation of radical precursor
ions activated by electron transfer. A well-defined training
set of tryptic peptides obtained upon digestion of nine
standard proteins and acceptance only of peptide sequences
as correct that were part of the included protein sequences
(resulting in 156 pairs of CID/ETD spectra) allowed
adjusting the parameters to the values given in Section 2.
Using the optimized parameters, the algorithm was
applied to the de novo sequencing of tryptic peptides from
proteins contained in a lysate of S. cellulosum bacteria. Since
we cannot define a priori which sequence delivered by the
algorithm is a correct sequence in the large data set, we used
consensus identifications by means of at least two of the
three different database searching routines, including
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Rel
ativ
e A
bund
ance
Mass of Ion/Mass of Precursor Ion
b-ionsy-ions
CID 2+A B
C D
0
0.05
0.1
0.15
0.2
0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Rel
ativ
e A
bund
ance
Mass of Ion/Mass of Precursor Ion
c-ionsz-ions
ETD 2+
0
0.2
0.4
0.6
0.8
1
0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Rel
ativ
e A
bund
ance
Mass of Ion/Mass of Precursor Ion
b-ionsy-ions
CID 3+
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Rel
ativ
e A
bund
ance
Mass of Ion/Mass of Precursor Ion
c-ionsz-ions
ETD 3+
Figure 2. Relative abun-dances of the different ionsproduced upon CID (A, C)and ETD fragmentation(B, D) of doubly (A, C) andtriply (B, D) charged precur-sor ions as a function of themolecular mass of the ions.Adapted from [51], the barsshow the median peakintensities and the linesshow the 25th and 75thpercentile of the peak inten-sities. The upper plots werecreated using doublycharged peptides (n 5 2406)and the bottom ones usingtriply charged peptides.
Electrophoresis 2009, 30, 3736–37473742 A. Bertsch et al.
& 2009 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.electrophoresis-journal.com
MASCOT, OMSSA, and X!Tandem and a maximum of false
discovery rate of 5% to obtain the sequences corresponding
to the evaluated mass spectra. These consensus identifica-
tions yielded 2406 different pairs of CID/ETD spectra which
were submitted to the CompNovo algorithm. To these
spectrum pairs CompNovo assigned a total of 675 fully
correct sequences, representing a success rate of 28.1%.
Although, at first sight, this number might look moderate, it
is among the highest success rates published so far for the
de novo sequencing of peptides. Moreover, this result is
superior to a de novo sequencing algorithm applied to CID
and ECD data, in which 1704 full sequences out of 7852
candidates (22%) with high-quality mass spectra (MASCOT
Mowse score 34) were successfully annotated (see Fig. 3 in
[4]). In its present implementation, the improved success
rate of CompNovo comes, however, at the cost of consider-
able computing time. Despite the rapid mass decomposition
algorithm, CompNovo on average needs about 13 s perspectrum pair from the benchmark data set to suggest a set
of peptide sequences (CompNovoCID about 8 s). PepNovo is
able to annotate one spectrum in about 0.2 s, whereas
LutefiskXP needs more than 30 s per spectrum. Speed
improvements are planned in terms of parallelization of the
implementation. Further, it should be mentioned that the
implementation has been optimized for identification
performance and has not yet been optimized for computa-
tional speed. Overall search times for de novo identification
are not the bottleneck in the whole proteome analysis (Total
search time for the whole proteome of S. cellulosum was
about 20 h computing time.).
The reason for the inability to correctly assign all
sequences is not due to the inability of the algorithm itself to
find the correct sequence for a given mass spectrum, but
predominantly lies in the absence of unequivocal sequence
information in the MS data. We included mass spectra
identified with a MASCOT Mowse score of about 20 (5%
false discovery rate cut-off) in our evaluation data set, in
order to include spectra of medium quality for the assign-
ments. Manual inspection of the spectra corroborated the
missing or low abundance of diagnostic ions in the spectra,
as shown in Figs. 4 and 5. Moreover, CompNovo was able to
correctly identify 74% of all amino acids in the 2406 spec-
trum pairs of the benchmark data set (i.e. correct amino acid
at the correct position), which proves the high performance
of the algorithm for retrieving sequence information from
combined CID and ETD-MS/MS data.
Allowing a maximum of two incorrect amino acids in the
sequences, about 52% of the sequences of the benchmark data
set were assigned to the corresponding mass spectra. Another
interesting aspect of the data evaluation is that CompNovo
yielded the correct sequence within the top ten scoring
suggested sequences from about 50% of the spectra. Addi-
tionally, more than 75% of the reported sequences contain at
least a correct subsequence of length six and almost 85%
contain a tag of a length of five, which are already unique for a
peptide in many cases. This constitutes highly useful infor-
mation in data evaluations focused at supporting database
searching routines and represents a basis for utilizing the
reported hits in a homology-based sequence search [52].
To show that our approach can reveal novel peptides not
identified by standard sequence database searches, we
performed de novo sequencing on all spectra of the S. cellulo-sum data set with CompNovo. As a simple test, we searched for
sequences among the CompNovo top-scoring sequences that
0
0.2
0.4
0.6
0.8
1
1 2 3 4 5 k-5 k-4 k-3 k-2 k-1
Rel
ativ
e F
requ
ency
Backbone Position
b-ionsy-ions
CID 2+
0
0.2
0.4
0.6
0.8
1
1 2 3 4 5 k-5 k-4 k-3 k-2 k-1
Rel
ativ
e F
requ
ency
Backbone Position
c-ionsz-ions
ETD 2+
0
0.2
0.4
0.6
0.8
1
1 2 3 4 5 k-5 k-4 k-3 k-2 k-1
Rel
ativ
e F
requ
ency
Backbone Position
b-ionsy-ions
CID 3+
0
0.2
0.4
0.6
0.8
1
1 2 3 4 5 k-5 k-4 k-3 k-2 k-1
Rel
ativ
e F
requ
ency
Backbone Position
c-ionsz-ions
ETD 3+
A B
C D
Figure 3. Coverage of aminoacids in a peptide sequencewith fragment ions upon CID(A, C) and ETD (B, D) frag-mentation of doubly (A, B)and triply (C, D) chargedprecursor ions. The numbersindicate the positions of therespective amino acids inthe sequence, in which 1stands for the amino-term-inal amino acid and k for thecarboxy-terminal aminoacid. The upper plots werecreated from 2406 spectrumpairs of doubly chargedpeptides, the bottom plotsfrom 123 spectrum pairs oftriply charged peptides.
Electrophoresis 2009, 30, 3736–3747 Proteomics and 2-DE 3743
& 2009 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.electrophoresis-journal.com
were identical to sequences of the reference proteome within
one amino acid (single-point mutations). Of these, we manu-
ally selected three examples for further inspection. The three
spectra could be identified by manual annotation as the
peptides FENMGAQMVR, LLDQGQAGDNVFCLLR, and
TTDVTGTVNLPEGVR. The spectra and the sequences are
available in the Supporting Information (Figs. 1–3). All three
spectrum pairs show good coverage of all relevant ions. The
peptides can thus be identified as single-point mutations of the
reference proteome with high confidence based on their
CompNovo identification.
The results discussed above clearly prove that de novosequencing algorithms can be successfully applied to
interpret low-resolution tandem mass spectra acquired with
a mass accuracy of 70.3 Da. Although most of the infor-
mation required for sequencing lie in the series of fragment
ions, in which the mass differences between fragment ions
encode the sequence, the mass accuracy is clearly not
sufficient to distinguish some amino acids, such as gluta-
mine/asparagine, as well as phenylalanine and oxidized
methionine. Although isobaric amino acids, such as leucine
and isoleucine, cannot even be resolved on high-resolution
mass spectrometers, higher mass accuracy as offered by
time-of-flight or Fourier-transform instruments will help to
further improve the accuracy of the CompNovo algorithm.
3.3 Influence of precursor ion charge state on de
novo sequencing
Figures 6A and B show the annotated CID and ETD mass
spectra of the doubly charged precursor ion of the peptide
LFIDPVLGLAPYQGR from the protein Succinyl-CoA ligase
(ADP forming) subunit b (sce9142) of S. cellulosum. Although
de novo sequencing using the CID spectrum and all CID-based
de novo sequencing tools failed to provide the correct sequence
because of missing or low-abundant signals, the combination
of both spectra and application of CompNovo yielded the fully
correct sequence as top-ranked hit. The signals annotated in the
spectra clearly show that all positions in the amino acid
sequence are well covered by diagnostic ions in the CID or ETD
spectrum. As discussed above, the sequence coverage decreased
significantly in the CID spectra of triply charged peptides (Fig.
3C), whereas ETD spectra still contained diagnostic ions for
most of the amino acids in the sequences, although at
significantly less signal intensities (Fig. 3D). In due conse-
0
500
1000
1500
2000
2500
0 200 400 600 800 1000
Inte
nsity
m/z
R A E F L F A A L S
S L A A G L F E A R
CID 2+
0
2000
4000
6000
8000
10000
12000
14000
16000
0 200 400 600 800 1000
Inte
nsity
m/z
R A E F L G A A L S
S L A A G L F E A R
ETD 2+
A B
Figure 4. (A) CID spectrum ofthe doubly charged precursorion of the peptide SLAAGL-FEAR. The doubly chargedpeptide shows a few match-ing peaks, most of the peaksremained unmatched. (B) ETDspectrum of the peptideSLAAGLFEAR. Although afew peaks could be matchedin the spectrum, most of thesignals remain unannotated.
0
50000
100000
150000
200000
250000
300000
350000
400000
0 200 400 600 800 1000 1200 1400 1600 1800
Inte
nsity
m/z
K E V L D V W E P A P A V G L A S
S A L G V A P A P E W V D L V E K
CID 2+
0
20000
40000
60000
80000
100000
120000
140000
0 200 400 600 800 1000 1200 1400 1600 1800
Inte
nsity
m/z
K E V L D V W E P A P A V G L A S
S A L G V A P A P E W V D L V E K
ETD 2+
A B
Figure 5. (A) CID spectrum of the doubly charge peptide SALGVAPAPEWVDLVEK. Although this peptide shows a good coverage of ions,some backbone fragment ions in the middle are absent. Additionally, most of the peaks in the lower m/z range have very low intensities.(B) The ETD spectrum of the peptide SALGVAPAPEWVDLVEK shows an almost complete ladder of ions, representing fragmentation atthe N-terminal part of the peptide. However, some of the ions have low intensities.
Electrophoresis 2009, 30, 3736–37473744 A. Bertsch et al.
& 2009 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.electrophoresis-journal.com
quence, the identification performance of de novo sequencing
was much lower for triply charged peptides. The coverage of
cleavage sites dropped as evidenced by the ion statistics and, as
expected, the identification performance of the CID-based
algorithms was comparatively low. In 123 CID mass spectra of
triply charged precursor ions found in our benchmark data set,
PepNovo identified only 19.5% of the amino acids correctly,
whereas CompNovo identified 33.5% of the amino acids in the
corresponding pairs of CID/ETD mass spectra. However, the
implemented experimental setup and parameters predomi-
nantly yielded spectrum pairs of doubly charged precursor ions
and only about 4.9% of the spectra were from triply charged
peptides in the benchmark data set, which makes the de novosequencing of triply and even higher charged peptide ions not a
very important issue. The CID and ETD spectra of the triply
charged peptide precursor ion LFIDPVLGLAPYQGR shown in
Figs. 6C and D demonstrate the poor coverage of the sequence
with abundant fragment ions. From this fragmentation data,
none of the algorithms was able to identify the correct sequence
of the peptide. Nevertheless, CompNovo at least correctly
identified a sequence tag of six consecutive correct amino acids
in the peptide sequence.
3.4 Comparison of de novo sequencing algorithms
and improvement of sequence annotation by
including ETD spectral information
In order to assess the performance of CompNovo, we
compared the results of the implementation of the
algorithm on our benchmark data set with those employing
other publicly available de novo software packages (Lute-
fiskXP and PepNovo). Since these packages use CID spectra
only, CompNovo has a clear advantage through using both
CID and ETD spectra. In order to characterize the
improvement that combined ETD and CID offer over CID,
we also tested a slightly modified version of CompNovo that
uses the same algorithm but makes use only of CID spectra
(CompNovoCID). The results of this comparison are
summarized in Table 1. This table reports the correctly
identified peptides as well as identifications with one to
three incorrect amino acids with respect to the true
sequence. It can be seen that CompNovo not only facilitates
a significant improvement in the success rate of suggested
sequences with respect to other algorithms (from 2.2 to
28.1% for fully correct peptides) but also significantly
improves the correctness of the sequences upon inclusion
of ETD data (from 9.8 to 28.1% for fully correct peptide
sequences). The identification rates for the different
algorithms depending on the number of tolerated false
amino acid annotations are shown in Fig. 7. It can be seen
that CompNovo starts with a success rate of 28.1% for
entirely correct peptide sequences (number of allowed
incorrect residues 5 0), which is significantly larger than
for all other investigated algorithms. CompNovo reaches a
success rate of more than 50% already upon the allowance
of two wrong amino acids, whereas the other algorithms
require a tolerance of at least five falsely assigned amino
acids for the same success rate. Although CompNovoCID
was able to identify 55% of all amino acids correctly, about
0
5000
10000
15000
20000
25000
30000
35000
0 200 400 600 800 1000 1200 1400 1600
Inte
nsity
m/z
R G Q Y P A L G L V P D I F L
L F I D P V L G L A P Y Q G R
CID 2+
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
0 200 400 600 800 1000 1200 1400 1600
Inte
nsity
m/z
R G Q Y P A L G L V P D I F L
L F I D P V L G L A P Y Q G R
ETD 2+
0
2000
4000
6000
8000
10000
0 200 400 600 800 1000 1200 1400 1600
Inte
nsity
m/z
R G Q Y P A L G L V P D I F L
L F I D P V L G L A P Y Q G R
CID 3+
0
1000
2000
3000
4000
5000
6000
0 200 400 600 800 1000 1200 1400 1600
Inte
nsity
m/z
R G Q Y P A L G L V P D I F L
L F I D P V L G L A P Y Q G R
ETD 3+
A B
C D
Figure 6. CID (A, C) and ETD(B, D) mass spectra ofdoubly (A, B) and triply(C, D) charged precursorions of the peptide LFIDPVL-GLAPYQGR. (A, B) The spec-tra in (A) and (B) show agood coverage of highlyabundant ions. However,only CompNovo was ableto annotate the correctsequence, although theETD spectrum of the doublycharged peptide containedonly a small number ofevidences. (B, D) The triplycharged CID spectrumshows poor fragmentationbehavior. Although the ETDspectrum shows most of theions as low abundantsignals, CompNovo wasonly able to identify asequence tag of length six.
Electrophoresis 2009, 30, 3736–3747 Proteomics and 2-DE 3745
& 2009 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.electrophoresis-journal.com
74% of the amino acids were correctly sequenced by
CompNovo, representing a significant increase over meth-
ods using only CID spectra for this data set. For more than
50% of the spectra, CompNovo has a correct sequence
within the Top 10, CompNovoCID for about 24% and
PepNovo for about 11%. This shows that the sequence
generation works very well for the divide-and-conquer
approach. Figure 8 shows how the different identification
engines perform in terms of correctly predicted sub-
sequences. This measure is useful for assessing the
possibility of using the reported hits in a homology-based
sequence search. Although about 84% of the sequences
reported by CompNovo contain a correct tag of length five,
this is the case for only 53 and 9% of the sequences retrieved
by PepNovo and LutefiskXP, respectively.
4 Concluding remarks
De novo sequencing using fragmentation data from a
combination of CID and ETD spectra gives significantly
improved results when compared with de novo sequencing
based on CID data only. The better de novo sequencing
performance is due to the complementary fragmentation
information contained in the two types of spectra, which
results in more complete ion series covering the peptide
sequence. The CompNovo de novo sequencing algorithm
makes use of this complementary information and thus more
frequently succeeds in annotating a correct sequence where
de novo sequencing based on CID or ETD spectra alone fails
due to the lack of sequence information. We further show
that using a rapid mass decomposition algorithm improves
the de novo sequencing algorithm. Improvements in the ion
scoring model and theoretical spectra generation as well as
the utilization of accurate fragment ion masses as obtained
from high-resolution mass spectrometers are expected to
facilitate achieving performances comparable to sequence
database searching algorithms.
The authors thank Andreas Kamper for proofreading themanuscript. The project was supported by Bundesministeriumfur Bildung und Forschung (BMBF) grant 0313842A.
The authors have declared no conflict of interest.
5 References
[1] Eng, J. K., McCormack, A. L., Yates, J. R., III, J. Am. Soc.Mass Spectrom. 1994, 5, 976–989.
[2] Perkins, D. N., Pappin, D. J. C., Creasy, D. M., Cottrell,J. S. Electrophoresis 1999, 20, 3551–3567.
[3] Nesvizhskii, A. I., Vitek, O., Aebersold, R., Nat. Methods2007, 4, 787–797.
Table 1. Identification rates for the different de novo search programs for the benchmark data set consisting of 2406 spectrum pairsa)
LutefiskXP (%) PepNovo (%) CompNovoCID (%) CompNovo (%)
Correct peptides 0.0 2.7 9.8 28.1
Within one residue 0.0 2.9 9.9 29.0
Within two residues 1.4 12.1 24.9 51.7
Within three residues 2.4 18.3 31.8 60.1
Total correct residues 8.5 46.3 54.8 73.7
a) Only the top-ranked peptide sequences were considered for each spectrum pair delivered by each search engine.
0
0.2
0.4
0.6
0.8
1
0 2 4 6 8 10 12 14 16 18 20 22
Iden
tific
atio
n R
ate
Number of Allowed Incorrect Residues
LutefiskXPPepNovo
CompNovoCIDCompNovo
Figure 7. Identification rates of the different algorithms depend-ing on the number of tolerated false amino acid annotations inthe top-ranked peptide sequence (benchmark data set, n 5 2406).
0
0.2
0.4
0.6
0.8
1
0 2 4 6 8 10 12 14 16 18 20
Rel
ativ
e F
requ
ency
Number of Correct Consecutive Residues
LutefiskXPPepNovo
CompNovoCIDCompNovo
Figure 8. Correctly predicted subsequences in the top-rankedpeptide sequences as a function of subsequence length uponapplication of different de novo algorithms on the benchmarkdata set (n 5 2406).
Electrophoresis 2009, 30, 3736–37473746 A. Bertsch et al.
& 2009 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.electrophoresis-journal.com
[4] Savitski, M. M., Nielsen, M. L., Kjeldsen, F., Zubarev,R. A., J. Proteome Res. 2005, 4, 2348–2354.
[5] Roepstorff, P., Fohlmann, J., Biomed. Mass Spectrom.1984, 11, 601.
[6] Biemann, K., Biomed. Environ. Mass Spectrom. 1988,16, 99–111.
[7] Zubarev, R. A., Kelleher, N. L., McLafferty, F. W., J. Am.Chem. Soc. 1998, 120, 3265–3266.
[8] Syka, J. E., Coon, J. J., Schroeder, M. J., Shabanowitz,J., Hunt, D. F., Proc. Natl. Acad. Sci. USA 2004, 101,9528–9533.
[9] Zubarev, R. A., Mass Spectrom. Rev. 2003, 22, 57–77.
[10] Geer, L. Y., Markey, S. P., Kowalak, J. A., Wagner, L.,Xu, M., Maynard, D.M., Yang, X., et al., J. Proteome Res.2004, 3, 958–964.
[11] Craig, R., Beavis, R., Bioinformatics 2004, 20, 1466–1467.
[12] Bartels, C., Biomed. Environ. Mass Spectrom. 1990, 19,363–368.
[13] Fernandez-de-Cossio, J., Gonzalez, J., Besada, V.,Comput. Appl. Biosci. 1995, 11, 427–434.
[14] Taylor, J. A., Johnson, R. S., Rapid Commun. MassSpectrom. 1997, 11, 1067–1075.
[15] Dancik, V., Addona, T. A., Clauser, K. R., Vath, J. E.,Pevzner, P. A., J. Comput. Biol. 1999, 6, 327–342.
[16] Chen, T., Kao, M. Y., Tepel, M., Rush, J., Church, G. M.,J. Comput. Biol. 2001, 8, 325–337.
[17] Taylor, J. A., Johnson, R. S., Anal. Chem. 2001, 73,2594–2604.
[18] Fernandez-de-Cossio, J., Gonzalez, J., Satomi, Y.,Shima, T., Okumura, N., Besada, V., Betancourt, L.,et al., Electrophoresis 2000, 21, 1694–1699.
[19] Chen, T., Bingwen, L., J. Comp. Biol. 2003, 10, 1–12.
[20] Bern, M., Goldberg, D., J. Comp. Biol. 2006, 13, 364–378.
[21] Frank, A., Pevzner, P., Anal. Chem. 2005, 77, 964–973.
[22] Lubeck, O., Sewell, C., Gu, S., Chen, X., Cai, D. M., Proc.IEEE 2002, 90, 1868–1874.
[23] Fischer, B., Roth, V., Roos, F., Grossmann, J., Baginsky, S.,Widmayer, P., Gruissem, W., Buhmann, J. M., Anal.Chem. 2005, 77, 7265–7273.
[24] Zhang, Z., Anal. Chem. 2004, 76, 6374–6383.
[25] DiMaggio, P. A., Jr. , Floudas, C. A., Anal. Chem. 2007,79, 1433–1446.
[26] Bafna, V., Edwards, N., Bioinformatics 2001, 17, S13–S21.
[27] Havilio, M., Haddad, Y., Smilansky, Z., Anal. Chem.2003, 75, 435–444.
[28] Elias, J. E., Gibbons, F. D., King, O. D., Roth, F. P., Gygi,S. P., Nat. Biotechnol. 2004, 22, 214–219.
[29] Horn, D. M., Zubarev, R. A., McLafferty, R. W., Proc.Natl. Acad. Sci. USA 2000, 97, 10313–10317.
[30] Datta, R., Bern, M., Proc. Res. Comput. Mol. Biol. 2008,4955, 140–153.
[31] Schneiker, S., Perlova, O., Kaiser, O., Gerth, K., Alici, A.,Altmeyer, M. O., Bartels, D., et al., Nat. Biotechnol. 2007,25, 1281–1289.
[32] Schley, C., Swart, R., Huber, C. G., J. Chromatogr. A2006, 1136, 210–220.
[33] Toll, H., Wintringer, R., Schweiger-Hufnagel, U., Huber,C. G., J. Sep. Sci. 2005, 28, 1666–1674.
[34] Premstaller, A., Oberacher, H., Huber, C. G., Anal. Chem.2000, 72, 4386.
[35] Hartmer, R., Kaplan, D. A., Gebhardt, C. A., Ledertheil,T., Brekenfeld, A., Int. J. Mass Spectrom. 2008, 276,82–90.
[36] Hartmer, R., Lubeck, M., Kiehne, A., Baessmann, C.,Brekenfeld, A., 54th ASMS Conference on Mass Spec-trometry and Allied Topics, San Antonio, 2006.
[37] Delmotte, N., Lasaosa, M., Tholey, A., Heinzle, E., Huber,C. G., J. Proteome Res. 2007, 6, 4363–4373.
[38] Vasilescu, J., Smith, J. C., Zweitzig, D. R., Denis, N. J.,Haines, D. S., Figeys, D., J. Mass Spectrom. 2008, 43,296–304.
[39] Elias, J. E., Haas, W., Faherty, B. K., Gygi, S., Nat.Methods 2005, 2, 667–675.
[40] Martens, L., Hermjakob, H., Jones, P., Adamski, M.,Taylor, C., States, D., Gevaert, K., et al., Proteomics2005, 5, 3537–3545.
[41] Jones, P., Cote, R. G., Martens, L., Quinn, A. F., Taylor,C. F., Derache, W., Hermjakob, H., Apweiler, R., NucleicAcids Res. 2006, 1, D659–D666.
[42] Sturm, S., Bertsch, A., Gropl, C., Hildebrandt, A.,Hussong, R., Lange, E., Pfeifer, N., et al., BMC Bio-informatics 2008, 9, 163.
[43] Kohlbacher, O., Reinert, K., Gropl, C., Lange, E., Pfeifer,N., Schulz-Trieglaff, O., Sturm, M., Bioinformatics 2007,23, e191–e197.
[44] Pitteri, S. J., Chrisman, P. A., McLuckey, S. A., Anal.Chem. 2005, 77, 5662–5669.
[45] Bocker, S., Liptak, Z., Martin, M., Pervukhin, A., Sudek,H., Bioinformatics 2008, 24, 591–593.
[46] Bocker, S., Liptak, Z., Algorithmica 2007, 48, 413–432.
[47] Bocker, S., Letzel, M., Liptak, Z., Pervukhin, A., Bioin-formatics 2009, 25, 218–224.
[48] Montecchi-Palazzi, L., Beavis, R., Binz, P. A., Chalkley,R. J., Cottrell, J., Creasy, D., Shofstahl, J., et al., Nat.Biotechnol. 2008, 26, 864–866.
[49] Good, D. M., Wirtala, M., McAlister, G. C., Coon, J. J.,Mol. Cell. Proteomics 2007, 6, 1942–1951.
[50] Molina, H., Matthiesen, R., Kandasamy, K., Padey, A.,Anal. Chem. 2008, 80, 4825–4835.
[51] Tabb, D. L., Smith, L. L., Breci, L. A., Wysocki, V. H., Lin,D., Yates, J. R., Anal. Chem. 2003, 75, 1155–1163.
[52] Kim, S., Gupta, N., Bandeira, N., Pevzner, P. A., Mol.Cell. Proteomics 2009, 8, 53–69.
Electrophoresis 2009, 30, 3736–3747 Proteomics and 2-DE 3747
& 2009 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.electrophoresis-journal.com