12
Research Article De novo peptide sequencing by tandem MS using complementary CID and electron transfer dissociation De novo sequencing of peptides using tandem MS is difficult due to missing fragment ions in the spectra commonly obtained after CID of peptide precursor ions. Comple- menting CID spectra with spectra obtained in an ion-trap mass spectrometer upon electron transfer dissociation (ETD) significantly increases the sequence coverage with diagnostic ions. In the de novo sequencing algorithm CompNovo presented here, a divide-and-conquer approach was combined with an efficient mass decomposition algorithm to exploit the complementary information contained in CID and ETD spectra. After optimizing the parameters for the algorithm on a well-defined training data set obtained for peptides from nine known proteins, the CompNovo algorithm was applied to the de novo sequencing of peptides derived from a whole protein extract of Sorangium cellulosum bacteria. To 2406 pairs of CID and ETD spectra contained in this data set, 675 fully correct sequences were assigned, which represent a success rate of 28.1%. It is shown that the CompNovo algorithm yields significantly improved sequencing accuracy as compared with published approaches using only CID spectra or combined CID and ETD spectra. Keywords: CID / De novo / ETD / OpenMS / Sorangium cellulosum DOI 10.1002/elps.200900332 1 Introduction The fully automated annotation of mass spectra generated by MS/MS of proteolytic peptides followed by computer-aided database searching forms the basis of high-throughput identification of proteins in proteome analysis [1, 2]. Never- theless, the majority of mass spectra generated in a proteomic experiment remains un-annotated, mainly because of incomplete sequence information in the mass spectra or the lack of sequence information about the protein under investigation in sequence databases. Moreover, the similarity of many protein subsequences together with incomplete sequence information in the mass spectra results in false- positive identifications [3]. In order to increase the reliability of protein identifications based on MS data in proteome analysis, it is therefore mandatory both to improve coverage of the complete peptide sequence with diagnostic fragment ions and to implement efficient computer algorithms to automatically interpret the fragment ion spectra [4]. Gas-phase fragmentation of peptide ions by low-energy CID allows inference of their sequence from typical ion series [5, 6], which means primarily b-ions and y-ions resulting from cleavage of the peptide bond. More recently, CID has been complemented by electron-induced frag- mentation methods such as electron capture dissociation (ECD, [7]) or electron transfer dissociation (ETD, [8]). In ETD, multiply protonated peptide ions are fragmented upon transfer of electrons from an electron carrier molecule derived from anthracene or fluoranthene by chemical ioni- zation, resulting in dissociation of the peptide radical ions due to highly exothermic electron recombination reactions. In contrast to CID, ETD has no preferred cleavage sites except proline, which leads to a more uniform distribution of the fragment ions, mainly of c- and z-type, along the Andreas Bertsch 1 Andreas Leinenbach 2,3 Anton Pervukhin 4 Markus Lubeck 5 Ralf Hartmer 5 Carsten Baessmann 5 Yasser Abbas Elnakady 6 Rolf Mu ¨ ller 6 Sebastian Bo ¨ cker 4 Christian G. Huber 3 Oliver Kohlbacher 1 1 Center for Bioinformatics Tu ¨ bingen, Eberhard-Karls- Universita ¨ t Tu ¨ bingen, Tu ¨ bingen, Germany 2 Instrumental Analysis and Bioanalysis, Saarland University, Saarbru ¨ cken, Germany 3 Department of Molecular Biology, Division of Chemistry and Bioanalytics, University of Salzburg, Salzburg, Austria 4 Lehrstuhl Bioinformatik, Friedrich-Schiller-Universita ¨t Jena, Jena, Germany 5 Bruker Daltonik GmbH, Bremen, Germany 6 Department of Pharmaceutical Biotechnology, Saarland University, Saarbru ¨ cken, Germany Received December 18, 2008 Revised July 22, 2009 Accepted July 29, 2009 Abbreviations: ECD, electron capture dissociation; ETD, electron transfer dissociation Correspondence: Andreas Bertsch, Universita ¨t Tu ¨ bingen, Wilhelm Schickard Institute, Sand 14, 72076 Tu ¨ bingen, Germany E-mail: [email protected] Fax:149-7071-29-5152 & 2009 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.electrophoresis-journal.com Electrophoresis 2009, 30, 3736–3747 3736

De novo peptide sequencing by tandem MS using complementary CID and electron transfer dissociation

Embed Size (px)

Citation preview

Research Article

De novo peptide sequencing by tandem MSusing complementary CID and electrontransfer dissociation

De novo sequencing of peptides using tandem MS is difficult due to missing fragment

ions in the spectra commonly obtained after CID of peptide precursor ions. Comple-

menting CID spectra with spectra obtained in an ion-trap mass spectrometer upon

electron transfer dissociation (ETD) significantly increases the sequence coverage with

diagnostic ions. In the de novo sequencing algorithm CompNovo presented here, a

divide-and-conquer approach was combined with an efficient mass decomposition

algorithm to exploit the complementary information contained in CID and ETD spectra.

After optimizing the parameters for the algorithm on a well-defined training data set

obtained for peptides from nine known proteins, the CompNovo algorithm was applied

to the de novo sequencing of peptides derived from a whole protein extract of Sorangiumcellulosum bacteria. To 2406 pairs of CID and ETD spectra contained in this data set, 675

fully correct sequences were assigned, which represent a success rate of 28.1%. It is

shown that the CompNovo algorithm yields significantly improved sequencing accuracy

as compared with published approaches using only CID spectra or combined CID and

ETD spectra.

Keywords:

CID / De novo / ETD / OpenMS / Sorangium cellulosumDOI 10.1002/elps.200900332

1 Introduction

The fully automated annotation of mass spectra generated by

MS/MS of proteolytic peptides followed by computer-aided

database searching forms the basis of high-throughput

identification of proteins in proteome analysis [1, 2]. Never-

theless, the majority of mass spectra generated in a proteomic

experiment remains un-annotated, mainly because of

incomplete sequence information in the mass spectra or

the lack of sequence information about the protein under

investigation in sequence databases. Moreover, the similarity

of many protein subsequences together with incomplete

sequence information in the mass spectra results in false-

positive identifications [3]. In order to increase the reliability

of protein identifications based on MS data in proteome

analysis, it is therefore mandatory both to improve coverage

of the complete peptide sequence with diagnostic fragment

ions and to implement efficient computer algorithms to

automatically interpret the fragment ion spectra [4].

Gas-phase fragmentation of peptide ions by low-energy

CID allows inference of their sequence from typical ion

series [5, 6], which means primarily b-ions and y-ions

resulting from cleavage of the peptide bond. More recently,

CID has been complemented by electron-induced frag-

mentation methods such as electron capture dissociation

(ECD, [7]) or electron transfer dissociation (ETD, [8]). In

ETD, multiply protonated peptide ions are fragmented upon

transfer of electrons from an electron carrier molecule

derived from anthracene or fluoranthene by chemical ioni-

zation, resulting in dissociation of the peptide radical ions

due to highly exothermic electron recombination reactions.

In contrast to CID, ETD has no preferred cleavage sites

except proline, which leads to a more uniform distribution

of the fragment ions, mainly of c- and z-type, along the

Andreas Bertsch1

Andreas Leinenbach2,3

Anton Pervukhin4

Markus Lubeck5

Ralf Hartmer5

Carsten Baessmann5

Yasser Abbas Elnakady6

Rolf Muller6

Sebastian Bocker4

Christian G. Huber3

Oliver Kohlbacher1

1Center for BioinformaticsTubingen, Eberhard-Karls-Universitat Tubingen, Tubingen,Germany

2Instrumental Analysis andBioanalysis, SaarlandUniversity, Saarbrucken,Germany

3Department of MolecularBiology, Division of Chemistryand Bioanalytics, University ofSalzburg, Salzburg, Austria

4Lehrstuhl Bioinformatik,Friedrich-Schiller-UniversitatJena, Jena, Germany

5Bruker Daltonik GmbH, Bremen,Germany

6Department of PharmaceuticalBiotechnology, SaarlandUniversity, Saarbrucken,Germany

Received December 18, 2008Revised July 22, 2009Accepted July 29, 2009

Abbreviations: ECD, electron capture dissociation; ETD,

electron transfer dissociation

Correspondence: Andreas Bertsch, Universitat Tubingen,Wilhelm Schickard Institute, Sand 14, 72076 Tubingen, GermanyE-mail: [email protected]:149-7071-29-5152

& 2009 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.electrophoresis-journal.com

Electrophoresis 2009, 30, 3736–37473736

peptide sequence. Hence, CID and ETD are complementary

fragmentation techniques, leading to a more comprehensive

coverage of a peptide sequence with fragment ions [9].

It was shown that the sequence coverage of doubly

charged peptide ions could be increased from 84% with CID

alone to 91% upon combining fragment information from

CID and ECD [4]. If a sequence database of the species

under investigation is available, annotation of fragment ion

mass spectra can be performed by matching theoretical ions

of candidate peptide sequences to the fragment ions

observed in the experimental mass spectrum. A variety of

database searching algorithms are currently in use for fully

automated spectrum interpretation, such as MASCOT [2],

SEQUEST [1], OMSSA [10], and X!Tandem [11]. Although

these approaches are standard procedures in proteomic

research, there are cases where appropriate sequence

databases are unavailable or where sequencing errors or

sequence polymorphisms render database searching

unsuitable. For those cases, de novo identification, i.e. the

inference of the peptide sequence directly from the experi-

mental spectrum, represents a viable alternative.

The de novo identification problem has been studied

intensively [12–17]. Common methods can be subdivided

into two steps: candidate sequence generation and scoring.

Most of the approaches (e.g. [14–22]) use a spectrum graph,

in which the nodes correspond to ions of the spectrum and

edges correspond to valid mass differences. A valid mass

difference means that the difference between the m/z ratios

of two ions in the spectrum corresponds to the mass of one

of the common amino acids. The candidate sequences are

generated by traversal through the graph. Another approach

makes use of a hidden Markov model, which directly

scores the generated sequence [23]. A divide-and-conquer

approach, where the candidates are generated from

subspectra (between abundant ions) via exhaustive

enumeration of amino acid combinations has also been

used [24] for de novo sequencing. Very recently, an integer

linear program formulation was published [25], showing

superior performance compared with other well-known denovo searching tools both for data measured on high-reso-

lution quadrupole time-of-flight and for low-resolution ion-

trap mass spectrometers. Although some approaches show

good performances on high-quality spectra, in which most

of the theoretical ions are present, the overall performance

compared with database-driven approaches is still poor.

Scoring of candidate sequences against the experimental

spectrum was also studied intensively (e.g. [15, 26–28]),

because such a scoring procedure is needed by most of the

database-driven and de novo approaches.

Horn et al. [29] presented a method of whole protein denovo sequencing employing CID and ECD as complementary

fragmentation techniques on a high-resolution Fourier-trans-

form ion cyclotron resonance mass spectrometer. Exploiting

the high mass accuracy of Fourier-transform ion cyclotron

resonance-MS/MS and the complementarity of both frag-

mentation techniques, the same group devised a linear de novosequencing algorithm, which starts with the most reliable

fragment ion data and propagates linearly by including addi-

tional amino acid masses [4]. From a total of 7852 MS/MS

data sets included in the de novo sequencing evaluation, the

completely correct sequences of 1704 peptides could be

annotated, representing 22% of all peptides identified of all

peptides identified by MASCOT with a Mowse score greater

than or equal to 34. Very recently, Datta et al. [30] imple-

mented a sophisticated machine learning approach to make

use of ETD and CID spectra for de novo sequencing. They use

a tree-augmented naive Bayes classifier to learn specific

features of fragmentation types in the spectra and combine

them into one spectrum. Then a graph-based approach was

employed to obtain the final list of peptides. The focus of this

work, however, was on the generation of peptide candidates.

Recent instrumental developments have led to the

commercial availability of ion-trap mass spectrometers,

which are able to generate CID and ETD spectra from the

same precursor ion in a short period of time [8]. Given the

fact that high-resolution mass spectrometers are quite

expensive and that the majority of protein identification data

today is still measured on low-resolution ion-trap mass

spectrometers, we aim in this study at a method for the denovo sequencing of peptides employing CID and ETD in an

ion-trap mass spectrometer. We demonstrate that these

spectra contain complementary information that can be

used to make de novo identification much more reliable. In

order to optimize the parameters employed for a combina-

torial algorithm, which we call CompNovo, we utilize CID

and ETD fragment ion spectra obtained from tryptic

peptides of nine standard proteins. The algorithm then

employs a divide-and-conquer approach together with rapid

mass decomposition to annotate the fragment ion masses

derived from the combined CID and ETD spectrum. Finally,

the optimized parameters are utilized to obtain de novosequence information from CID and ETD fragmentation

data derived from the measurement of the whole proteome

of the bacterium Sorangium cellulosum [31].

2 Materials and methods

2.1 Experimental data generation

For the generation of the training data set, nine proteins,

obtained from Sigma (St. Louis, MO) or Fluka (Buchs,

Switzerland), were partitioned into two mixtures and

tryptically digested (trypsin from Promega, Madison, WI)

using published protocols [32]. The mixtures contained the

proteins in concentrations between 0.3 and 1.7 pmol/mL.

Mixture 1 comprised thyroglobulin (bovine thyroid gland)

and albumin (bovine serum), whereas mixture 2 contained

b-casein (bovine milk), conalbumin (chicken egg white),

myelin basic protein (bovine), hemoglobin (human), leptin

(human), creatine phosphokinase (rabbit muscle), a1-acid-

glycoprotein (human plasma), and albumin (bovine serum).

The resulting peptide mixtures were then separated

using capillary ion-pair RP HPLC (IP-RP-HPLC) and

Electrophoresis 2009, 30, 3736–3747 Proteomics and 2-DE 3737

& 2009 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.electrophoresis-journal.com

subsequently identified by ESI-MS/MS as described in

detail in [32, 33]. Briefly, one-microliter samples was

desalted and preconcentrated at a flow rate of 10 mL/min of

0.05% aqueous TFA in a capillary/nano HPLC system

(Model Ultimate, Dionex Benelux, Amsterdam, The Neth-

erlands) using a 10.0� 0.2 mm monolithic poly-(styrene/

divinylbenzene) preconcentration column, which was

prepared according to the published protocols [32, 34]. The

separation of the preconcentrated peptides was carried out

upon transfer to a 50� 0.1 mm monolithic poly-(styrene/

divinylbenzene) separation column (Dionex Benelux),

employing a gradient of 0–40% ACN in 0.05% v/v aqueous

TFA in 60 min at 551C. Digests were analyzed in quad-

ruplicate at a flow rate of 0.7 mL/min.

An ion-trap mass spectrometer with implemented ETD

module (Model HCTultra PTM discovery system, Bruker

Daltonik GmbH, Bremen, Germany) and equipped with a

modified ESI-ion source (spray capillary: fused silica capil-

lary, 0.090 mm od and 0.020 mm id) was used. A detailed

description of the ETD setup of the ion-trap instrument

including the negative chemical ionization source needed

for the generation of the reagent anion of fluoranthene and

the transfer into the ion trap can be found in [35]. Tandem

mass spectra were generated using sequential CID and ETD

fragmentation of the same precursor ion. Singly charged

analytes were automatically excluded. The ETD reaction

time, i.e. the reaction time for electron transfer from fluor-

anthene anions to peptide cations, was set to 100 ms. A

short low-energy collisional activation step (‘‘smart decom-

position’’, Bruker Daltonics [36]) was performed after elec-

tron transfer to further improve the fragmentation especially

of doubly and triply charged peptides, because the frag-

ments often do not dissociate due to non-covalent binding

forces. The scan range was 210–2000 Da to exclude fluor-

anthene ions and to achieve a reasonable scan time for one

spectrum pair.

To generate a data set of a real life sample, approxi-

mately 690 mg of proteins extracted from cultivated S. cellu-losum [31], a soil living bacterium, were tryptically digested

(trypsin from Promega) using the published protocols [32].

The resulting peptide mixture was separated using an off-

line two-dimensional HPLC setup exactly as described in

[37]. IP-RP-HPLC-ESI-MS/MS conditions and parameters

for alternating CID and ETD fragmentation and detection

were as described in the previous paragraph.

2.2 Data preprocessing

A training data set was extracted from the analysis of the two

digests of the nine proteins. After optimization of the

algorithmic parameters with the training data set, the

algorithm was applied to the de novo sequencing of peptides

in a digest of a protein extract from S. cellulosum. Since none

of the commercially available database-driven search

engines supported the use of CID and ETD spectrum pairs

during the identification process and identification using

ETD only showed considerably poorer performance with

very few identifications unique to ETD, only CID spectra

were used for the annotation of the peptides in the data sets.

CID spectra from the eight HPLC runs of the standard

protein mixture were identified through database searches

using MASCOT version 2.1.03 [2], allowing one missed

cleavage and carboxymethylation of cysteine as fixed

modification and oxidation of methionine as variable

modification. Oxidized methionine has almost the same

mass as phenylalanine, which will be taken into considera-

tion in our evaluation below.

Spectra of doubly charged precursors and hits to tryptic

peptides with a peptide score greater than 10 were used.

This low-score cut-off was comparable to that of used by

others [38] and ensured that spectra of lower quality were

also included in the training data set. Manual inspection of

the spectra with scores less than 10 showed that the ion

series were highly incomplete and the spectra therefore not

suitable for parameter estimation. In order to avoid bias to

specific sequences, we allowed spectra corresponding to the

same peptide sequence and charge combination only once

in the data set. The final data set was constructed by

selecting all spectra of peptides with a molecular mass less

than or equal to 2000 Da where the peptide sequence was a

sub-sequence of one of the proteins present in the sample.

This selection resulted in a data set of 156 pairs of ETD/CID

spectra of different doubly charged peptides. Only seven

triply charged peptides were present and these were treated

separately.

In order to distinguish correctly from incorrectly deter-

mined sequences, the peptides in the benchmark data set of

S. cellulosum were identified using MASCOT, version 2.2.04,

OMSSA, version 2.1.4 and X!Tandem, version 08-02-01. The

protein sequence database used included all 9320 protein

sequences from S. cellulosum of Uni-ProtKB/Swiss-Prot

release 15.4/57.4 and 86 protein sequences of trypsin and

keratin. Cysteine carboxymethylation and methionine

oxidation were included as variable modifications and up to

one missed cleavage was allowed during the search runs. To

ensure that the peptides included in the data set were

correctly identified, a more complex verification procedure

was performed for the identifications due to the unknown

protein content in the sample. The identification was carried

out using a composite database in which reversed sequences

of all proteins were appended to the original S. cellulosumsequence database [39]. False discovery rates were estimated

from the obtained identification lists. A peptide sequence

was accepted if at least two identification engines identified

the peptide as a top hit with a false discovery rate of 5% or

better. If more than one spectrum pair represented the same

sequence, the one with the best average false discovery rate

was chosen. The 5% false discovery rate score cut-offs for

the three search engines were 0.5 for the MASCOT E-value

(MASCOT Mowse score 420), 0.09 for the X!Tandem

E-value, and 0.1 for the OMSSA E-value. This means that

only scores better than the thresholds were accepted for the

consensus. Using the 38 261 MS/MS spectrum pairs in our

Electrophoresis 2009, 30, 3736–37473738 A. Bertsch et al.

& 2009 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.electrophoresis-journal.com

data set, this filtering procedure resulted in 2406 spectrum

pairs of charge two and 123 of charge three, in which each

spectrum pair corresponded to a unique peptide sequence.

The data sets used in this study have been deposited in the

Proteomics Identification Database (PRIDE [40, 41]) and are

publicly available under the experiment accessions 8689 and

8690.

2.3 Algorithm overview

The CompNovo algorithm comprises two steps: candidate

generation and candidate scoring. After preprocessing of the

spectrum, a divide-and-conquer algorithm decomposes the

spectrum into smaller parts until the segment is small

enough such that the mass decomposition can generate all

amino acid compositions for a given mass difference. From

this list of compositions, all permutations of sequences are

generated, which are then scored against the spectra. The

best scoring sequences are kept and used as candidates for

the segment. After all segments have been processed, the

final candidate list is generated and the candidate sequences

are scored against the spectra. This is performed by a two-

step approach. First, very simple spectra are generated from

the candidate sequences and scored against the experimen-

tal spectra. The best candidates are then used in a second

scoring step to generate more detailed theoretical spectra,

e.g., by including additional ion series, losses, and isotope

clusters, which are in turn scored against the experimental

spectra. The various modules of the CompNovo sequencing

algorithm are schematically shown in Fig. 1. We will

now describe the individual parts of the algorithm in more

detail.

The CompNovo algorithm is part of the OpenMS plat-

form and downloadable from www.openms.de. The source

code of CompNovo is available and will be released under an

open-source license as part of the upcoming version of our

software packages OpenMS [42] and TOPP [43].

2.4 Spectrum preprocessing

The CompNovo algorithm starts with y-ions of the CID

spectrum. To select as many true y-ions as possible from the

spectrum and to reduce the probability of selecting other ion

types in the divide-and-conquer step, the following scoring

steps are performed. The initial score for each of the ions

present in the spectrum is its intensity. As ETD spectra

contain abundant peaks of the precursor ion in almost all

cases [44], the mass of the peptide can be determined within

the mass tolerance of the tandem MS scan. This allows an

accurate scoring of the complementary features of the

spectra. First, the isotope patterns of each of the peaks are

scored against the theoretical isotope distribution pattern

based on the molecular mass and the average elemental

composition of peptides with respect to carbon, hydrogen,

oxygen, and nitrogen. A simple correlation is used as a

multiplicative scoring in subsequent steps. Doubly charged

ions that have isotope ions correlating well with a theoretical

isotope distribution are transformed and inserted as singly

charged ions into the spectrum, or the intensity is added to

the corresponding singly charged ion if this is already

present in the spectrum. The scores of a witness set of ions,

i.e. the set of all ions that support the y-ion, are used to

enhance its score. This includes neutral loss of water and

ammonia from y-ions as well as complementary b-ions,

which can be witnessed by putative a-ions. In addition to the

witness set present in the CID spectrum, the c- and z-type

ions of the ETD spectrum are used to provide further

evidence that a supposed y-ion is correctly assigned. ETD

spectra often contain radical variants of the c- and z-type

ions also [44]; therefore, the single radical ions are also used

for scoring. The intensities of the y-ion and the weighted

intensities of the witness ions are added. Here, the

weighting factor is derived from the relative difference of

the deviation of the expected versus calculated mass

difference. For instance, if a putative y-ion is witnessed by

a water loss ion, the mass difference between both ions

should be 18.0 Da. However, if it is observed in a distance of

18.2 Da the factor is 0.5, given a mass tolerance of 0.4 Da in

the MS/MS scan. This is the same factor as used in the

Figure 1. Schematic diagram of the CompNovo de novosequencing algorithm. Each spectrum pair was preprocessedto assign a score for being a y-ion to each peak. The spectrumwas subdivided into smaller sections using putative y-ions aspivot ions. Sufficiently small subsections were then decom-posed into amino acid compositions, which in turn were used togenerate subsequences. The scored subsequences of twosubsections of a section were joined and scored against thespectrum pair. Candidates of the whole spectrum were re-scoredusing a more detailed scoring scheme.

Electrophoresis 2009, 30, 3736–3747 Proteomics and 2-DE 3739

& 2009 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.electrophoresis-journal.com

spectra comparison, described in Section 2.8. Finally, the

low-mass-range and high-mass-range candidates are

checked to determine whether a mass decomposition is

possible for the ions. If no valid mass decomposition was

found for a specific ion, the ion cannot be a y-ion and the

score is set to 0 to avoid selection as pivot ion. All the ions

which have positive scores can be used as pivot ions.

2.5 Rapid mass decomposition

In the second step of candidate generation, mass differences

between ions are used to generate candidates for the

subsequent divide-and-conquer analysis. Given a pair of

pivot masses, we search for all amino acid compositions that

are compatible with this mass difference, that is, have a

theoretical mass close to the measured mass difference.

Clearly, we cannot recover the order of amino acids solely

from the mass difference, but we can infer the multiplicity

of each amino acid in every hypothetical amino acid

sequence explaining the mass difference. In this context,

we say that we decompose the mass difference over the

amino acid alphabet. To find all of these decompositions,

there exist efficient algorithms [45, 46] where running time

only depends on the number of decompositions of the input

mass, and not on the mass itself. In fact, these algorithms

can only be applied to integer masses, and hence we use the

correction described in [47] to account for rounding error

accumulation. This approach is efficient and at the same

time guarantees that no decomposition is missed. Mass

accuracy of the ion-trap mass spectra used in our analysis is

comparatively low. To guarantee that the complete analysis

pipeline is sufficiently fast, decompositions must be

computed in a matter of milliseconds. Doing so over the

amino acid alphabet, which can also include additional

modified amino acid masses, can be carried out only for

relatively small mass differences.

2.6 Divide-and-conquer approach

In order to be able to correlate measured mass differences

with a reasonable number of peptide sequence candidates,

subsequences are generated using a divide-and-conquer

approach, similar to the one described by Zhang [24]. The

key difference is that, instead of using b-ions as pivot ions,

our approach uses y-ions. A high-scoring putative y-ion is

used in the beginning to divide the spectrum into two parts.

The upper m/z part represents the prefix of the peptide

sequence and the lower m/z part the suffix of the peptide

sequence. If such a newly created segment is wider than a

given threshold, it is further divided into two new

subsegments using a pivot ion. To ensure that the correct

sequence is generated, and to guard against occasional false

pivot ions (non-y-ions), several pivot ions are selected perstep. If the subsegment is small enough, the rapid mass

decomposition algorithm is used to generate all possible

amino acid compositions explaining the observed mass

difference. From each of the compositions all permutations

of possible sequences are generated. To reduce the number

of permutations, they are scored against the experimental

spectra and only the best candidates are kept. This is done

by creating theoretical spectra of the peaks generated by the

permutations for a subsegment. When a subsegment is

located somewhere in the center of the whole mass range of

the spectrum, a prefix and a suffix are added during

generation of the theoretical spectra. These spectra are then

in turn used for comparison with the experimental

spectrum pair. When the generation of the candidates of

both segments of a divide step is completed, the results are

combined. Each of the candidates of the left segment is

combined with every candidate of the right segment. The

created set of candidates is again scored and only the best

candidates are kept. The whole procedure is repeated for

several ions and the candidates are recorded for the final

scoring of the full-sized sequence candidates.

2.7 Theoretical spectra

Theoretical spectra are used to score against the experi-

mental spectra. CID and ETD spectra are treated separately,

and then both scores are simply added to get a similarity of a

sequence and the experimental spectrum pair. To speed up

computations, CID spectra are generated in two ways. The

first type is used in the divide-and-conquer part and includes

only b- and y-ions of same intensities. The second type of

CID spectra also accounts for intensities of the different ions

[1] including a-, b-, and y-ions as well as neutral losses from

b- and y-ions. Water loss ions are added if the b- or y-ion

contains at least one of S, T, E, or D. Ammonia loss ions are

added if the b- or y-ion contains at least one of Q, N, R, or K.

The relative intensity of y-ions is 100%, b-ions are included

with a relative intensity of 80%. Losses from y-ions are

included with 10% and losses from b-ions with 2% relative

intensity. The intensity of the b-and y-ions (but not of the

loss ions) is distributed over the monoisotopic and the

isotope peaks according to the isotope distribution. ETD

spectra are generated using c-and z-type ions and using the

isotope distribution to add also isotope peaks. No differences

are made for the intensities of different ion types in ETD

spectra.

2.8 Candidate scoring

Finally, the candidates produced by the divide-and-conquer

approach are scored against the experimental spectra. For

this purpose theoretical CID and ETD spectra of the

peptides are generated. These spectra are compared with

the experimental spectrum and a ranked list according to

the similarity is created. As the number of candidates can be

very large, a two-step scoring is applied. The similarities of

the spectra are scored using the following similarity

Electrophoresis 2009, 30, 3736–37473740 A. Bertsch et al.

& 2009 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.electrophoresis-journal.com

function and a weighting factor is used to account for the

differences in expected and observed ion positions:

fi;j ¼e� jPi � PjjjPi � Pjj

;

where f is the weighting factor of ion i of the first spectrum

and ion j of the second spectrum. The positions are noted as

Pi and Pj, respectively. Parameter e is the mass tolerance

allowed for the scoring and it was set to 0.4. The similarity of

two spectra is calculated as follows:

s ¼P

i;j

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiIi � Ij � fi;j

p

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiPiIi �

PjIj

p ;

where Ii represents the intensity of the i-th peak of the first

spectrum and Ij stands for the intensity of the j-th peak of

the second spectrum. For the first scoring, the numerator is

calculated by using all possible pairs of ions where the

positions do not differ more than e. For the more expensive

final scoring, this is only calculated for the best pairs, which

means each ion can have at most one partner in the other

spectrum. The pairs are determined using a spectrum

alignment algorithm, where the number of paired ions is

maximized and the sum of position distances is minimized.

Intensities are ignored by the spectrum alignment algo-

rithm. First, very simple spectra for CID consisting only of

b- and y-ions are created, and for theoretical ETD spectra

only c- and z-type ions are used. The candidates are scored

using the first version of the scoring scheme. The best

peptide sequences of the pre-scoring are used for an in-

depth scoring. Therefore, b- and y-ions as well as a-ions and

neutral losses are used with different intensities in a theo-

retical spectrum. Additionally, doubly charged b- and y-ions

are used. The ETD spectra are simpler because only c-type

ions and z-type ions are used. The ions and the isotopes of

the ions are used for scoring against experimental CID and

ETD spectrum. Finally, a ranked list is generated from the

scores, which is the output of the algorithm.

2.9 Quality assessment

To count the number of correct amino acids of the predicted

versus the correct peptide sequence, the following rules are

used. Amino acid residues I/L, Q/K as well as F and

oxidized methionine cannot be distinguished by their

masses on low-resolution mass spectrometers and are

therefore treated as one amino acid. The predicted and

correct sequences are compared from the left to the

right and an amino acid of the predicted peptide sequence

is counted as correct if the corresponding amino acid

in the correct sequence is identical. ‘‘Corresponding’’

in this case means that the prefix masses of the predicted

and correct peptide sequences of the amino acid pair

under consideration do not differ by more than 2 Da.

This relatively high threshold takes into account the

sum of the precursor mass tolerance and the fragment

mass tolerance. This threshold also handles the case, e.g.,

of the peptides DFPIANGER (correct) and NFPIANGER

(sequenced) where the prefix masses of the correct amino

acids, FPIANGER are shifted by 1 Da. The rules given above

have another effect. If, for example, a correct amino acid K is

predicted as a sub-sequence GA, this is counted as one error,

if a correct subsequence GA is predicted as K, this is

counted as two errors. The longest consecutive amino acid

sequence of a predicted peptide and the number of

incorrectly predicted amino acids are then computed using

these rules.

2.10 Parameter determination for the CompNovo

algorithm

The parameters of the CompNovo algorithm used for the

benchmark data set were determined using the training data

set. To do so, a grid search for each of the individual

parameters was performed. Final parameters were 450 Da

for the maximal decomposition mass difference, the 40 best

solutions were kept for each subsegment, and then divide

steps were performed nine times using nine different pivot

ions, if available. The 250 best-scoring peptide candidates

were selected for rescoring using detailed CID spectra. For

the specification of arbitrary post-translational modifica-

tions, CompNovo used the controlled vocabulary terms as

specified by the PSI-MOD working group [48]. Additionally,

the modifications can be set to fixed or variable, as the case

requires.

2.11 Benchmarking against other de novo programs

PepNovo [21], version v2.00, was used employing

LTQ_LOW_TRYP as model parameter, which matches a

low-resolution ion-trap mass spectrometer. The precursor

mass tolerance was set to 1.5 Da, which was the same as for

CompNovo. The fragment mass tolerance was automatically

set to 0.4 Da by the chosen model. A post-translational

modification was created for cysteine with a shift of

158.0055 Da to account for the carboxymethylation of

cysteine, which was set as variable. LutefiskXP [17], version

1.0.5a, was used. Default parameters were used except for

the ion tolerance, which was set to 0.4 Da, the same as for

CompNovo. Additionally, the cysteine mass was set to the

mass of carboxymethylated cysteine.

3 Results and discussion

3.1 Complementary fragmentation observed in ETD

and CID spectra

One of the major advantages of using both ETD and CID

fragmentation techniques in one experiment rests within

the complementary properties of the spectra they produce

[49, 50]. Peptides dissociated by CID tend to show fragments

Electrophoresis 2009, 30, 3736–3747 Proteomics and 2-DE 3741

& 2009 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.electrophoresis-journal.com

from cleavages more in the middle of the sequence [51]. In

contrast, peptides dissociated through ETD fragment

preferentially toward the ends of the sequence. These

effects are shown in Figs. 2A and B, which was created

using doubly charged peptides of the benchmark data set of

S. cellulosum peptides. It shows clearly that, although b-ions

or y-ions might be present, the intensities are comparatively

low at the terminal sequence positions. On the contrary,

c-and z-ions frequently occur in high abundance in the

higher mass range, indicating preferential cleavage of

terminal amino acid residues. As reported previously,

ETD-specific ions generally have significantly lower inten-

sities compared with those generated by CID [49, 50].

Figures 2C and D show the same analysis using triply

charged peptides. Although ETD spectra tend to show many

abundant ions, the overall sequence coverage in the CID

spectra drops dramatically, especially for high-mass ions.

To quantify this observation and to demonstrate the

complementarity of the fragmentation methods, we

performed another statistical analysis of the frequency of

observation of the ions in our benchmark data set for CID

and ETD fragmentation. For each backbone position the

number of observed ions was counted and plotted in Fig. 3.

Figures 3A and B show the complementarity of the frag-

mentation properties of ETD and CID for doubly charged

precursor ions. It can be seen that the amino- and carboxy-

terminal amino acids are primarily covered by c- and z-type

ions in the ETD spectra, whereas the amino acids in the

middle of the peptide sequence are much more completely

covered with b- and y-ions. The combination of both types is

thus the logical consequence in an attempt to cover the

complete sequence in a de novo sequencing approach.

Figures 3C and D show that although ETD spectra of triply

charged peptides cover most of the sequence ions with

reasonable probabilities, the probabilities of seeing ions

specific to CID fragmentation drop dramatically in this case.

This makes the combination of both types of fragmentation

less useful in the case of triply charged ions.

3.2 Application of the CompNovo algorithm to the

analysis of an S. cellulosum protein lysate

In order to determine the most suitable parameters for the

CompNovo sequencing algorithm, we used the CID and

ETD mass spectra generated in an ion-trap instrument

operated in alternating CID/ETD data acquisition mode of

the same precursor ion. So-called smart decomposition was

enabled in the instrument, which imposes a short collisional

activation step before the expulsion of the fragment ions

from the trap to support the dissociation of radical precursor

ions activated by electron transfer. A well-defined training

set of tryptic peptides obtained upon digestion of nine

standard proteins and acceptance only of peptide sequences

as correct that were part of the included protein sequences

(resulting in 156 pairs of CID/ETD spectra) allowed

adjusting the parameters to the values given in Section 2.

Using the optimized parameters, the algorithm was

applied to the de novo sequencing of tryptic peptides from

proteins contained in a lysate of S. cellulosum bacteria. Since

we cannot define a priori which sequence delivered by the

algorithm is a correct sequence in the large data set, we used

consensus identifications by means of at least two of the

three different database searching routines, including

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Rel

ativ

e A

bund

ance

Mass of Ion/Mass of Precursor Ion

b-ionsy-ions

CID 2+A B

C D

0

0.05

0.1

0.15

0.2

0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Rel

ativ

e A

bund

ance

Mass of Ion/Mass of Precursor Ion

c-ionsz-ions

ETD 2+

0

0.2

0.4

0.6

0.8

1

0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Rel

ativ

e A

bund

ance

Mass of Ion/Mass of Precursor Ion

b-ionsy-ions

CID 3+

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Rel

ativ

e A

bund

ance

Mass of Ion/Mass of Precursor Ion

c-ionsz-ions

ETD 3+

Figure 2. Relative abun-dances of the different ionsproduced upon CID (A, C)and ETD fragmentation(B, D) of doubly (A, C) andtriply (B, D) charged precur-sor ions as a function of themolecular mass of the ions.Adapted from [51], the barsshow the median peakintensities and the linesshow the 25th and 75thpercentile of the peak inten-sities. The upper plots werecreated using doublycharged peptides (n 5 2406)and the bottom ones usingtriply charged peptides.

Electrophoresis 2009, 30, 3736–37473742 A. Bertsch et al.

& 2009 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.electrophoresis-journal.com

MASCOT, OMSSA, and X!Tandem and a maximum of false

discovery rate of 5% to obtain the sequences corresponding

to the evaluated mass spectra. These consensus identifica-

tions yielded 2406 different pairs of CID/ETD spectra which

were submitted to the CompNovo algorithm. To these

spectrum pairs CompNovo assigned a total of 675 fully

correct sequences, representing a success rate of 28.1%.

Although, at first sight, this number might look moderate, it

is among the highest success rates published so far for the

de novo sequencing of peptides. Moreover, this result is

superior to a de novo sequencing algorithm applied to CID

and ECD data, in which 1704 full sequences out of 7852

candidates (22%) with high-quality mass spectra (MASCOT

Mowse score 34) were successfully annotated (see Fig. 3 in

[4]). In its present implementation, the improved success

rate of CompNovo comes, however, at the cost of consider-

able computing time. Despite the rapid mass decomposition

algorithm, CompNovo on average needs about 13 s perspectrum pair from the benchmark data set to suggest a set

of peptide sequences (CompNovoCID about 8 s). PepNovo is

able to annotate one spectrum in about 0.2 s, whereas

LutefiskXP needs more than 30 s per spectrum. Speed

improvements are planned in terms of parallelization of the

implementation. Further, it should be mentioned that the

implementation has been optimized for identification

performance and has not yet been optimized for computa-

tional speed. Overall search times for de novo identification

are not the bottleneck in the whole proteome analysis (Total

search time for the whole proteome of S. cellulosum was

about 20 h computing time.).

The reason for the inability to correctly assign all

sequences is not due to the inability of the algorithm itself to

find the correct sequence for a given mass spectrum, but

predominantly lies in the absence of unequivocal sequence

information in the MS data. We included mass spectra

identified with a MASCOT Mowse score of about 20 (5%

false discovery rate cut-off) in our evaluation data set, in

order to include spectra of medium quality for the assign-

ments. Manual inspection of the spectra corroborated the

missing or low abundance of diagnostic ions in the spectra,

as shown in Figs. 4 and 5. Moreover, CompNovo was able to

correctly identify 74% of all amino acids in the 2406 spec-

trum pairs of the benchmark data set (i.e. correct amino acid

at the correct position), which proves the high performance

of the algorithm for retrieving sequence information from

combined CID and ETD-MS/MS data.

Allowing a maximum of two incorrect amino acids in the

sequences, about 52% of the sequences of the benchmark data

set were assigned to the corresponding mass spectra. Another

interesting aspect of the data evaluation is that CompNovo

yielded the correct sequence within the top ten scoring

suggested sequences from about 50% of the spectra. Addi-

tionally, more than 75% of the reported sequences contain at

least a correct subsequence of length six and almost 85%

contain a tag of a length of five, which are already unique for a

peptide in many cases. This constitutes highly useful infor-

mation in data evaluations focused at supporting database

searching routines and represents a basis for utilizing the

reported hits in a homology-based sequence search [52].

To show that our approach can reveal novel peptides not

identified by standard sequence database searches, we

performed de novo sequencing on all spectra of the S. cellulo-sum data set with CompNovo. As a simple test, we searched for

sequences among the CompNovo top-scoring sequences that

0

0.2

0.4

0.6

0.8

1

1 2 3 4 5 k-5 k-4 k-3 k-2 k-1

Rel

ativ

e F

requ

ency

Backbone Position

b-ionsy-ions

CID 2+

0

0.2

0.4

0.6

0.8

1

1 2 3 4 5 k-5 k-4 k-3 k-2 k-1

Rel

ativ

e F

requ

ency

Backbone Position

c-ionsz-ions

ETD 2+

0

0.2

0.4

0.6

0.8

1

1 2 3 4 5 k-5 k-4 k-3 k-2 k-1

Rel

ativ

e F

requ

ency

Backbone Position

b-ionsy-ions

CID 3+

0

0.2

0.4

0.6

0.8

1

1 2 3 4 5 k-5 k-4 k-3 k-2 k-1

Rel

ativ

e F

requ

ency

Backbone Position

c-ionsz-ions

ETD 3+

A B

C D

Figure 3. Coverage of aminoacids in a peptide sequencewith fragment ions upon CID(A, C) and ETD (B, D) frag-mentation of doubly (A, B)and triply (C, D) chargedprecursor ions. The numbersindicate the positions of therespective amino acids inthe sequence, in which 1stands for the amino-term-inal amino acid and k for thecarboxy-terminal aminoacid. The upper plots werecreated from 2406 spectrumpairs of doubly chargedpeptides, the bottom plotsfrom 123 spectrum pairs oftriply charged peptides.

Electrophoresis 2009, 30, 3736–3747 Proteomics and 2-DE 3743

& 2009 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.electrophoresis-journal.com

were identical to sequences of the reference proteome within

one amino acid (single-point mutations). Of these, we manu-

ally selected three examples for further inspection. The three

spectra could be identified by manual annotation as the

peptides FENMGAQMVR, LLDQGQAGDNVFCLLR, and

TTDVTGTVNLPEGVR. The spectra and the sequences are

available in the Supporting Information (Figs. 1–3). All three

spectrum pairs show good coverage of all relevant ions. The

peptides can thus be identified as single-point mutations of the

reference proteome with high confidence based on their

CompNovo identification.

The results discussed above clearly prove that de novosequencing algorithms can be successfully applied to

interpret low-resolution tandem mass spectra acquired with

a mass accuracy of 70.3 Da. Although most of the infor-

mation required for sequencing lie in the series of fragment

ions, in which the mass differences between fragment ions

encode the sequence, the mass accuracy is clearly not

sufficient to distinguish some amino acids, such as gluta-

mine/asparagine, as well as phenylalanine and oxidized

methionine. Although isobaric amino acids, such as leucine

and isoleucine, cannot even be resolved on high-resolution

mass spectrometers, higher mass accuracy as offered by

time-of-flight or Fourier-transform instruments will help to

further improve the accuracy of the CompNovo algorithm.

3.3 Influence of precursor ion charge state on de

novo sequencing

Figures 6A and B show the annotated CID and ETD mass

spectra of the doubly charged precursor ion of the peptide

LFIDPVLGLAPYQGR from the protein Succinyl-CoA ligase

(ADP forming) subunit b (sce9142) of S. cellulosum. Although

de novo sequencing using the CID spectrum and all CID-based

de novo sequencing tools failed to provide the correct sequence

because of missing or low-abundant signals, the combination

of both spectra and application of CompNovo yielded the fully

correct sequence as top-ranked hit. The signals annotated in the

spectra clearly show that all positions in the amino acid

sequence are well covered by diagnostic ions in the CID or ETD

spectrum. As discussed above, the sequence coverage decreased

significantly in the CID spectra of triply charged peptides (Fig.

3C), whereas ETD spectra still contained diagnostic ions for

most of the amino acids in the sequences, although at

significantly less signal intensities (Fig. 3D). In due conse-

0

500

1000

1500

2000

2500

0 200 400 600 800 1000

Inte

nsity

m/z

R A E F L F A A L S

S L A A G L F E A R

CID 2+

0

2000

4000

6000

8000

10000

12000

14000

16000

0 200 400 600 800 1000

Inte

nsity

m/z

R A E F L G A A L S

S L A A G L F E A R

ETD 2+

A B

Figure 4. (A) CID spectrum ofthe doubly charged precursorion of the peptide SLAAGL-FEAR. The doubly chargedpeptide shows a few match-ing peaks, most of the peaksremained unmatched. (B) ETDspectrum of the peptideSLAAGLFEAR. Although afew peaks could be matchedin the spectrum, most of thesignals remain unannotated.

0

50000

100000

150000

200000

250000

300000

350000

400000

0 200 400 600 800 1000 1200 1400 1600 1800

Inte

nsity

m/z

K E V L D V W E P A P A V G L A S

S A L G V A P A P E W V D L V E K

CID 2+

0

20000

40000

60000

80000

100000

120000

140000

0 200 400 600 800 1000 1200 1400 1600 1800

Inte

nsity

m/z

K E V L D V W E P A P A V G L A S

S A L G V A P A P E W V D L V E K

ETD 2+

A B

Figure 5. (A) CID spectrum of the doubly charge peptide SALGVAPAPEWVDLVEK. Although this peptide shows a good coverage of ions,some backbone fragment ions in the middle are absent. Additionally, most of the peaks in the lower m/z range have very low intensities.(B) The ETD spectrum of the peptide SALGVAPAPEWVDLVEK shows an almost complete ladder of ions, representing fragmentation atthe N-terminal part of the peptide. However, some of the ions have low intensities.

Electrophoresis 2009, 30, 3736–37473744 A. Bertsch et al.

& 2009 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.electrophoresis-journal.com

quence, the identification performance of de novo sequencing

was much lower for triply charged peptides. The coverage of

cleavage sites dropped as evidenced by the ion statistics and, as

expected, the identification performance of the CID-based

algorithms was comparatively low. In 123 CID mass spectra of

triply charged precursor ions found in our benchmark data set,

PepNovo identified only 19.5% of the amino acids correctly,

whereas CompNovo identified 33.5% of the amino acids in the

corresponding pairs of CID/ETD mass spectra. However, the

implemented experimental setup and parameters predomi-

nantly yielded spectrum pairs of doubly charged precursor ions

and only about 4.9% of the spectra were from triply charged

peptides in the benchmark data set, which makes the de novosequencing of triply and even higher charged peptide ions not a

very important issue. The CID and ETD spectra of the triply

charged peptide precursor ion LFIDPVLGLAPYQGR shown in

Figs. 6C and D demonstrate the poor coverage of the sequence

with abundant fragment ions. From this fragmentation data,

none of the algorithms was able to identify the correct sequence

of the peptide. Nevertheless, CompNovo at least correctly

identified a sequence tag of six consecutive correct amino acids

in the peptide sequence.

3.4 Comparison of de novo sequencing algorithms

and improvement of sequence annotation by

including ETD spectral information

In order to assess the performance of CompNovo, we

compared the results of the implementation of the

algorithm on our benchmark data set with those employing

other publicly available de novo software packages (Lute-

fiskXP and PepNovo). Since these packages use CID spectra

only, CompNovo has a clear advantage through using both

CID and ETD spectra. In order to characterize the

improvement that combined ETD and CID offer over CID,

we also tested a slightly modified version of CompNovo that

uses the same algorithm but makes use only of CID spectra

(CompNovoCID). The results of this comparison are

summarized in Table 1. This table reports the correctly

identified peptides as well as identifications with one to

three incorrect amino acids with respect to the true

sequence. It can be seen that CompNovo not only facilitates

a significant improvement in the success rate of suggested

sequences with respect to other algorithms (from 2.2 to

28.1% for fully correct peptides) but also significantly

improves the correctness of the sequences upon inclusion

of ETD data (from 9.8 to 28.1% for fully correct peptide

sequences). The identification rates for the different

algorithms depending on the number of tolerated false

amino acid annotations are shown in Fig. 7. It can be seen

that CompNovo starts with a success rate of 28.1% for

entirely correct peptide sequences (number of allowed

incorrect residues 5 0), which is significantly larger than

for all other investigated algorithms. CompNovo reaches a

success rate of more than 50% already upon the allowance

of two wrong amino acids, whereas the other algorithms

require a tolerance of at least five falsely assigned amino

acids for the same success rate. Although CompNovoCID

was able to identify 55% of all amino acids correctly, about

0

5000

10000

15000

20000

25000

30000

35000

0 200 400 600 800 1000 1200 1400 1600

Inte

nsity

m/z

R G Q Y P A L G L V P D I F L

L F I D P V L G L A P Y Q G R

CID 2+

0

1000

2000

3000

4000

5000

6000

7000

8000

9000

0 200 400 600 800 1000 1200 1400 1600

Inte

nsity

m/z

R G Q Y P A L G L V P D I F L

L F I D P V L G L A P Y Q G R

ETD 2+

0

2000

4000

6000

8000

10000

0 200 400 600 800 1000 1200 1400 1600

Inte

nsity

m/z

R G Q Y P A L G L V P D I F L

L F I D P V L G L A P Y Q G R

CID 3+

0

1000

2000

3000

4000

5000

6000

0 200 400 600 800 1000 1200 1400 1600

Inte

nsity

m/z

R G Q Y P A L G L V P D I F L

L F I D P V L G L A P Y Q G R

ETD 3+

A B

C D

Figure 6. CID (A, C) and ETD(B, D) mass spectra ofdoubly (A, B) and triply(C, D) charged precursorions of the peptide LFIDPVL-GLAPYQGR. (A, B) The spec-tra in (A) and (B) show agood coverage of highlyabundant ions. However,only CompNovo was ableto annotate the correctsequence, although theETD spectrum of the doublycharged peptide containedonly a small number ofevidences. (B, D) The triplycharged CID spectrumshows poor fragmentationbehavior. Although the ETDspectrum shows most of theions as low abundantsignals, CompNovo wasonly able to identify asequence tag of length six.

Electrophoresis 2009, 30, 3736–3747 Proteomics and 2-DE 3745

& 2009 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.electrophoresis-journal.com

74% of the amino acids were correctly sequenced by

CompNovo, representing a significant increase over meth-

ods using only CID spectra for this data set. For more than

50% of the spectra, CompNovo has a correct sequence

within the Top 10, CompNovoCID for about 24% and

PepNovo for about 11%. This shows that the sequence

generation works very well for the divide-and-conquer

approach. Figure 8 shows how the different identification

engines perform in terms of correctly predicted sub-

sequences. This measure is useful for assessing the

possibility of using the reported hits in a homology-based

sequence search. Although about 84% of the sequences

reported by CompNovo contain a correct tag of length five,

this is the case for only 53 and 9% of the sequences retrieved

by PepNovo and LutefiskXP, respectively.

4 Concluding remarks

De novo sequencing using fragmentation data from a

combination of CID and ETD spectra gives significantly

improved results when compared with de novo sequencing

based on CID data only. The better de novo sequencing

performance is due to the complementary fragmentation

information contained in the two types of spectra, which

results in more complete ion series covering the peptide

sequence. The CompNovo de novo sequencing algorithm

makes use of this complementary information and thus more

frequently succeeds in annotating a correct sequence where

de novo sequencing based on CID or ETD spectra alone fails

due to the lack of sequence information. We further show

that using a rapid mass decomposition algorithm improves

the de novo sequencing algorithm. Improvements in the ion

scoring model and theoretical spectra generation as well as

the utilization of accurate fragment ion masses as obtained

from high-resolution mass spectrometers are expected to

facilitate achieving performances comparable to sequence

database searching algorithms.

The authors thank Andreas Kamper for proofreading themanuscript. The project was supported by Bundesministeriumfur Bildung und Forschung (BMBF) grant 0313842A.

The authors have declared no conflict of interest.

5 References

[1] Eng, J. K., McCormack, A. L., Yates, J. R., III, J. Am. Soc.Mass Spectrom. 1994, 5, 976–989.

[2] Perkins, D. N., Pappin, D. J. C., Creasy, D. M., Cottrell,J. S. Electrophoresis 1999, 20, 3551–3567.

[3] Nesvizhskii, A. I., Vitek, O., Aebersold, R., Nat. Methods2007, 4, 787–797.

Table 1. Identification rates for the different de novo search programs for the benchmark data set consisting of 2406 spectrum pairsa)

LutefiskXP (%) PepNovo (%) CompNovoCID (%) CompNovo (%)

Correct peptides 0.0 2.7 9.8 28.1

Within one residue 0.0 2.9 9.9 29.0

Within two residues 1.4 12.1 24.9 51.7

Within three residues 2.4 18.3 31.8 60.1

Total correct residues 8.5 46.3 54.8 73.7

a) Only the top-ranked peptide sequences were considered for each spectrum pair delivered by each search engine.

0

0.2

0.4

0.6

0.8

1

0 2 4 6 8 10 12 14 16 18 20 22

Iden

tific

atio

n R

ate

Number of Allowed Incorrect Residues

LutefiskXPPepNovo

CompNovoCIDCompNovo

Figure 7. Identification rates of the different algorithms depend-ing on the number of tolerated false amino acid annotations inthe top-ranked peptide sequence (benchmark data set, n 5 2406).

0

0.2

0.4

0.6

0.8

1

0 2 4 6 8 10 12 14 16 18 20

Rel

ativ

e F

requ

ency

Number of Correct Consecutive Residues

LutefiskXPPepNovo

CompNovoCIDCompNovo

Figure 8. Correctly predicted subsequences in the top-rankedpeptide sequences as a function of subsequence length uponapplication of different de novo algorithms on the benchmarkdata set (n 5 2406).

Electrophoresis 2009, 30, 3736–37473746 A. Bertsch et al.

& 2009 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.electrophoresis-journal.com

[4] Savitski, M. M., Nielsen, M. L., Kjeldsen, F., Zubarev,R. A., J. Proteome Res. 2005, 4, 2348–2354.

[5] Roepstorff, P., Fohlmann, J., Biomed. Mass Spectrom.1984, 11, 601.

[6] Biemann, K., Biomed. Environ. Mass Spectrom. 1988,16, 99–111.

[7] Zubarev, R. A., Kelleher, N. L., McLafferty, F. W., J. Am.Chem. Soc. 1998, 120, 3265–3266.

[8] Syka, J. E., Coon, J. J., Schroeder, M. J., Shabanowitz,J., Hunt, D. F., Proc. Natl. Acad. Sci. USA 2004, 101,9528–9533.

[9] Zubarev, R. A., Mass Spectrom. Rev. 2003, 22, 57–77.

[10] Geer, L. Y., Markey, S. P., Kowalak, J. A., Wagner, L.,Xu, M., Maynard, D.M., Yang, X., et al., J. Proteome Res.2004, 3, 958–964.

[11] Craig, R., Beavis, R., Bioinformatics 2004, 20, 1466–1467.

[12] Bartels, C., Biomed. Environ. Mass Spectrom. 1990, 19,363–368.

[13] Fernandez-de-Cossio, J., Gonzalez, J., Besada, V.,Comput. Appl. Biosci. 1995, 11, 427–434.

[14] Taylor, J. A., Johnson, R. S., Rapid Commun. MassSpectrom. 1997, 11, 1067–1075.

[15] Dancik, V., Addona, T. A., Clauser, K. R., Vath, J. E.,Pevzner, P. A., J. Comput. Biol. 1999, 6, 327–342.

[16] Chen, T., Kao, M. Y., Tepel, M., Rush, J., Church, G. M.,J. Comput. Biol. 2001, 8, 325–337.

[17] Taylor, J. A., Johnson, R. S., Anal. Chem. 2001, 73,2594–2604.

[18] Fernandez-de-Cossio, J., Gonzalez, J., Satomi, Y.,Shima, T., Okumura, N., Besada, V., Betancourt, L.,et al., Electrophoresis 2000, 21, 1694–1699.

[19] Chen, T., Bingwen, L., J. Comp. Biol. 2003, 10, 1–12.

[20] Bern, M., Goldberg, D., J. Comp. Biol. 2006, 13, 364–378.

[21] Frank, A., Pevzner, P., Anal. Chem. 2005, 77, 964–973.

[22] Lubeck, O., Sewell, C., Gu, S., Chen, X., Cai, D. M., Proc.IEEE 2002, 90, 1868–1874.

[23] Fischer, B., Roth, V., Roos, F., Grossmann, J., Baginsky, S.,Widmayer, P., Gruissem, W., Buhmann, J. M., Anal.Chem. 2005, 77, 7265–7273.

[24] Zhang, Z., Anal. Chem. 2004, 76, 6374–6383.

[25] DiMaggio, P. A., Jr. , Floudas, C. A., Anal. Chem. 2007,79, 1433–1446.

[26] Bafna, V., Edwards, N., Bioinformatics 2001, 17, S13–S21.

[27] Havilio, M., Haddad, Y., Smilansky, Z., Anal. Chem.2003, 75, 435–444.

[28] Elias, J. E., Gibbons, F. D., King, O. D., Roth, F. P., Gygi,S. P., Nat. Biotechnol. 2004, 22, 214–219.

[29] Horn, D. M., Zubarev, R. A., McLafferty, R. W., Proc.Natl. Acad. Sci. USA 2000, 97, 10313–10317.

[30] Datta, R., Bern, M., Proc. Res. Comput. Mol. Biol. 2008,4955, 140–153.

[31] Schneiker, S., Perlova, O., Kaiser, O., Gerth, K., Alici, A.,Altmeyer, M. O., Bartels, D., et al., Nat. Biotechnol. 2007,25, 1281–1289.

[32] Schley, C., Swart, R., Huber, C. G., J. Chromatogr. A2006, 1136, 210–220.

[33] Toll, H., Wintringer, R., Schweiger-Hufnagel, U., Huber,C. G., J. Sep. Sci. 2005, 28, 1666–1674.

[34] Premstaller, A., Oberacher, H., Huber, C. G., Anal. Chem.2000, 72, 4386.

[35] Hartmer, R., Kaplan, D. A., Gebhardt, C. A., Ledertheil,T., Brekenfeld, A., Int. J. Mass Spectrom. 2008, 276,82–90.

[36] Hartmer, R., Lubeck, M., Kiehne, A., Baessmann, C.,Brekenfeld, A., 54th ASMS Conference on Mass Spec-trometry and Allied Topics, San Antonio, 2006.

[37] Delmotte, N., Lasaosa, M., Tholey, A., Heinzle, E., Huber,C. G., J. Proteome Res. 2007, 6, 4363–4373.

[38] Vasilescu, J., Smith, J. C., Zweitzig, D. R., Denis, N. J.,Haines, D. S., Figeys, D., J. Mass Spectrom. 2008, 43,296–304.

[39] Elias, J. E., Haas, W., Faherty, B. K., Gygi, S., Nat.Methods 2005, 2, 667–675.

[40] Martens, L., Hermjakob, H., Jones, P., Adamski, M.,Taylor, C., States, D., Gevaert, K., et al., Proteomics2005, 5, 3537–3545.

[41] Jones, P., Cote, R. G., Martens, L., Quinn, A. F., Taylor,C. F., Derache, W., Hermjakob, H., Apweiler, R., NucleicAcids Res. 2006, 1, D659–D666.

[42] Sturm, S., Bertsch, A., Gropl, C., Hildebrandt, A.,Hussong, R., Lange, E., Pfeifer, N., et al., BMC Bio-informatics 2008, 9, 163.

[43] Kohlbacher, O., Reinert, K., Gropl, C., Lange, E., Pfeifer,N., Schulz-Trieglaff, O., Sturm, M., Bioinformatics 2007,23, e191–e197.

[44] Pitteri, S. J., Chrisman, P. A., McLuckey, S. A., Anal.Chem. 2005, 77, 5662–5669.

[45] Bocker, S., Liptak, Z., Martin, M., Pervukhin, A., Sudek,H., Bioinformatics 2008, 24, 591–593.

[46] Bocker, S., Liptak, Z., Algorithmica 2007, 48, 413–432.

[47] Bocker, S., Letzel, M., Liptak, Z., Pervukhin, A., Bioin-formatics 2009, 25, 218–224.

[48] Montecchi-Palazzi, L., Beavis, R., Binz, P. A., Chalkley,R. J., Cottrell, J., Creasy, D., Shofstahl, J., et al., Nat.Biotechnol. 2008, 26, 864–866.

[49] Good, D. M., Wirtala, M., McAlister, G. C., Coon, J. J.,Mol. Cell. Proteomics 2007, 6, 1942–1951.

[50] Molina, H., Matthiesen, R., Kandasamy, K., Padey, A.,Anal. Chem. 2008, 80, 4825–4835.

[51] Tabb, D. L., Smith, L. L., Breci, L. A., Wysocki, V. H., Lin,D., Yates, J. R., Anal. Chem. 2003, 75, 1155–1163.

[52] Kim, S., Gupta, N., Bandeira, N., Pevzner, P. A., Mol.Cell. Proteomics 2009, 8, 53–69.

Electrophoresis 2009, 30, 3736–3747 Proteomics and 2-DE 3747

& 2009 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.electrophoresis-journal.com