PeptideClassifier for protein inference and targeted quantitative proteomics

nature biotechnology volume 28 number 7 july 2010 647

To the Editor: Direct protein profiling offers unique insights beyond those afforded by transcriptomics or genomics technologies. These include information about the abundance level, post-translational modifications and interaction partners of proteins. Shotgun proteomics1 is the method of choice for the comprehensive analysis of complex protein mixtures, and extensive proteome coverage has recently been reported even for eukaryotic model organisms2,3. However, shotgun proteomics also faces significant challenges, such as the protein inference problem4. Protein inference refers to the process of deducing the proteins that were originally present in a sample on the basis of the experimentally identified peptides. Because significant amounts of shared peptides (peptides that could be derived from several proteins) introduce ambiguity, protein inference can be tricky

[Figure 1 graphic. Experimental workflow: protein mixture, proteolytic digestion, MS spectrum, MS/MS spectrum. Computational workflow: genome, genes, proteins, in silico digestion, peptides, predicted spectrum. Measurement and prediction meet in the peptide spectrum match, which yields a scored peptide list; peptide classification yields a classified peptide list used for protein inference (minimal protein list) and for peptide selection for quantitative proteomics.]

Example classified peptide list shown in the figure:

Peptide        Score   Class   Proteins/genes
MVQYNFK        0.99    1a      Protein_1.1
ITVVPNGK       0.25    2       Gene_1
MVVRPYNDELR    0.95    3b      Protein_1.1, 2.1

Figure 1 Schematic overview of where PeptideClassifier maps onto a shotgun proteomics workflow and selected applications. Proteins are extracted and digested with a protease (typically trypsin) before further separation of the peptide mixture, ionization, selection of precursor ions, fragmentation and recording of fragment ion spectra in a mass spectrometer. PeptideClassifier takes peptides assigned by the PSM process as input, ideally further processed with a probabilistic or other scoring scheme. Because gene-model information is included in the classification process (red arrow), several different peptide classes with distinct information content can be reported (see main text). For deterministic protein inference, a minimal list of protein identifications can be generated for peptides above a user-defined threshold (for more details, see Supplementary Fig. 1). Other major applications include the information content–based selection of peptides for targeted quantitative proteomics workflows (based on experimental data or an in silico precomputed index for all peptides; see dashed arrow) and integration with transcriptomics data (not shown).



per evidence class; and finally, it can report a minimal set of protein identifications that would explain the remaining ambiguous peptides, following the Occam’s Razor approach5. In its current implementation, PeptideClassifier can work with the output of common database search engines or, alternatively, with a list of confident peptide identifications provided by a user.

The classification and protein inference approach is generic: on the basis of their different gene structures, for prokaryotes we report three peptide evidence classes (classes 1a, 3a and 3b), and for eukaryotes, to capture potential alternative splice isoforms, we consider three additional evidence classes (classes 1b, 2a and 2b) (Fig. 2). Class 1a peptides unambiguously identify a single unique protein sequence. Class 1b peptides also unambiguously identify one unique protein sequence, but this sequence could be derived from distinct splice isoform transcripts of a gene model that, although identical in the coding sequence, differ in the 5′ or 3′ untranslated region, or in both regions. We extend the original classification8 to further distinguish class 2 peptides into

the Nature Biotechnology website, or from http://www.mop.uzh.ch/software.html) and illustrate its general applicability for both eukaryotes and prokaryotes, and its value for applications beyond protein inference (Supplementary Table 1). These include integration with transcriptomics data and information content–based selection of peptides for targeted quantitative proteomics studies (Fig. 1).

PeptideClassifier can classify shotgun proteomics data from any organism, provided that a clear relationship exists between the gene model, its encoded protein sequences and their identifiers. Several reference databases (such as FlyBase, Wormbase, TAIR, ENSEMBL and RefSeq) fulfill this requirement. PeptideClassifier carries out several steps (Supplementary Fig. 1): first, it analyzes protein sequence redundancies and generates an identifiable proteome index; second, it parses the database search result files; third, it classifies the experimentally identified peptides into six evidence classes with different information contents (see below); fourth, it infers a minimal list of protein identifications

and error prone. Furthermore, the error rate at the protein level is typically substantially higher than that at the peptide level4,5. Informatics solutions that provide accurate and reproducible results are thus needed to minimize the propagation of errors in the literature and in data repositories, and to allow readers to critically evaluate the conclusions of papers6,7.

To address this issue, we have recently devised a novel, deterministic peptide classification and protein inference scheme8. This approach is the first to take into account the gene model–protein sequence–protein identifier relationships. Each peptide sequence is classified according to its information content with respect to protein sequences and gene models (Fig. 1). This allows shared peptides to be further distinguished depending on whether the implied proteins could be encoded either by the same or by distinct gene models.

Here we announce the release of the modular software tool PeptideClassifier (folders containing the code for PeptideClassifier as well as some AuxiliaryScripts can be downloaded on

Class   Protein sequence(s)   Protein isoform(s)   Gene(s)
1a      Unambiguous           Unambiguous          Unambiguous
1b      Unambiguous           Ambiguous            Unambiguous
2a      Ambiguous             Ambiguous            Unambiguous
2b      Ambiguous             Ambiguous            Unambiguous
3a      Unambiguous           Ambiguous            Ambiguous
3b      Ambiguous             Ambiguous            Ambiguous

[Figure 2 graphic. Eukaryotes (upper panel): gene models Gene 1 to Gene 4 drawn 5′ to 3′ with their encoded proteins (Protein 1.1, 1.2, 2.1, 2.2, 2.3, 2.4, 3.1 and 4.1) and UTRs marked. Prokaryotes (lower panel): Gene 1 to Gene 3 with Protein 1 to Protein 3.]

Figure 2 Overview of the distinct peptide evidence classes of our classification scheme for eukaryotes and prokaryotes. We distinguish six peptide evidence classes for eukaryotes (upper panel). Introns are shown as gray boxes, exons as green boxes and splicing events as dashed green lines. The 3′ untranslated region (UTR) is indicated for one gene model, to highlight the relevance of 5′ or 3′ UTRs for assigning class-1b identifications. Also shown are sets of two gene models that encode an identical protein sequence (dashed gray lines). Owing to the lack of splice variants, class 1b, 2a and 2b do not apply to prokaryotes (lower panel). A table summarizing the ability of the respective peptide evidence classes to distinguish protein sequences, annotated protein isoforms and genes is shown in the figure; the implications for major applications are indicated in Supplementary Table 1.


for all human gene models. In the first phase, peptides specific for a gene model but not a specific splice isoform or modified protein are required. Thus, a classification that displays the in silico precomputed information content of each peptide could help researchers select the best candidates for a specific use case both at the splice-variant level and at the gene-model level. We detail the steps for generating a proteome-wide precomputed peptide information content index and its advantages for this use case in Supplementary Table 3.
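As a rough illustration of what a precomputed peptide index could look like, the sketch below digests protein sequences in silico and records which proteins contain each peptide. It is not the PeptideClassifier code: the cleavage rule (cut after K or R unless followed by P, no missed cleavages), the minimum peptide length and the names `tryptic_peptides` and `peptide_index` are assumptions made for this example, and the second toy protein is hypothetical.

```python
from collections import defaultdict


def tryptic_peptides(sequence, min_length=6):
    """Cut after K or R unless the next residue is P (assumed rule, no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        is_last = i == len(sequence) - 1
        if is_last or (residue in "KR" and sequence[i + 1] != "P"):
            peptide = sequence[start:i + 1]
            if len(peptide) >= min_length:
                peptides.append(peptide)
            start = i + 1
    return peptides


def peptide_index(proteins, min_length=6):
    """Map every in silico peptide to the set of protein identifiers containing it."""
    index = defaultdict(set)
    for protein_id, sequence in proteins.items():
        for peptide in tryptic_peptides(sequence, min_length):
            index[peptide].add(protein_id)
    return dict(index)


# Toy input: the first sequence is the example protein drawn in Figure 1,
# the second one is a hypothetical protein sharing one peptide with it.
toy_proteins = {
    "Protein_A": "MVQYNFKRITVVPNGKMVVRPYNDELR",
    "Protein_B": "MSSEEKITVVPNGK",
}
index = peptide_index(toy_proteins)
for peptide, protein_ids in sorted(index.items()):
    print(peptide, sorted(protein_ids))
```

Peptides that map to a single protein in such an index are natural candidates for targeted assays, whereas shared peptides would first need to be classified against the gene model.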

Accurate protein identification and quantification are of key interest for the proteomics field. Our classification scheme, which is, to our knowledge, the first to consider gene model–protein sequence–protein identifier relationships, can help to minimize potential protein inference errors. PeptideClassifier displays all ambiguities, enabling a researcher to further examine candidates of specific interest and to distinguish or even remove protein-level ambiguities by integrating transcriptomics or other data sets. Its applications for data integration and information content–based selection of peptides for targeted quantitative proteomics are expected to find widespread use.

Note: Supplementary information is available on the Nature Biotechnology website.

ACKNOWLEDGMENTS
We thank C. Panse for contributing a first XML parser to extract information from database search engine output files, E. Brunner and G. Hausmann for feedback on the manuscript and K. Basler, U. Grossniklaus, R. Aebersold, M. Hengartner and J. Jiricny for continued support of the Quantitative Model Organism Proteomics bioinformatics core group. E.Q. and C.H.A. are members of the Quantitative Model Organism Proteomics Initiative, which is supported by the University Research Priority Program Systems Biology/Functional Genomics of the University of Zurich.

AUTHOR CONTRIBUTIONS
E.Q. wrote the software code and documentation and generated the figures; C.H.A. originally devised the peptide classification scheme and wrote the manuscript.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.

Ermir Qeli & Christian H Ahrens

Quantitative Model Organism Proteomics, Institute of Molecular Life Sciences, University of Zürich, Winterthurerstrasse 190, 8057 Zürich, Switzerland. Correspondence should be addressed to C.H.A. ([email protected]).

1. Washburn, M.P., Wolters, D. & Yates, J.R. III. Nat. Biotechnol. 19, 242–247 (2001).

2. Brunner, E. et al. Nat. Biotechnol. 25, 576–583 (2007).

identifications reported in the literature. Alternatively, by opting to manually validate all single-hit identifications with an information-rich peptide, a user could reduce the overall protein FDR while keeping valid single hits. We have shown that single hits passing manual evaluation (only 35% of all single hits) are enriched in short and low-abundance proteins8, which, by definition, will contribute fewer observable peptides. The rejected single hits accounted for around 90% of the incorrect PSM matches estimated to be present in the data set on the basis of target-decoy database search results. Removing them greatly reduces the actual FDR in the data set at the peptide level and even more so at the protein level. Ideally, one would be able to rely on a scoring scheme; the solution by Gupta and Pevzner13 may represent one valuable resource. We suggest that the guidelines for protein identification6 be extended to consider the peptide information content.
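The target-decoy estimate mentioned above can be illustrated with a few lines of code. This is a minimal sketch that assumes one commonly used estimator (number of decoy hits divided by number of target hits at or above the score threshold); other variants exist, and the function name, the input layout and the scores are hypothetical.

```python
def estimate_fdr(psms, threshold):
    """Estimate the spectrum-level FDR as decoy hits / target hits above a threshold.

    psms: iterable of (score, is_decoy) pairs, one per peptide-spectrum match.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


# Hypothetical scored PSMs: (score, matched a decoy sequence?)
example_psms = [
    (0.99, False), (0.97, False), (0.95, False), (0.93, False),
    (0.92, True), (0.60, False), (0.55, True), (0.40, True),
]
print(estimate_fdr(example_psms, threshold=0.9))
```

Raising the threshold trades identifications for a lower estimated error rate, which is the same trade-off discussed above for single-hit protein identifications.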

The classification facilitates seamless integration with transcriptomics data. We have demonstrated this for current transcriptomics platforms, which predominantly report results at the gene-model level. However, to allow more fine-grained integration with data from exon-based array platforms or RNA-Seq and take advantage of their potential to distinguish splice variants, we have further subdivided class 2 peptides: class 2a peptides imply a proper subset of distinct protein sequences encoded by one gene model, whereas class 2b peptides imply all encoded protein sequences. In combination, class 1a, 1b and 2a peptides can thus be informative in identifying and distinguishing different splice isoforms. Because a substantial part of the continuous updates to eukaryotic reference protein databases represent splice variants (Supplementary Table 2), we expect that such a classification will become increasingly valuable. In addition, shared peptides may lead to inaccurate protein quantification results in semiquantitative spectral counting applications: the distinction of several peptide evidence classes provided by PeptideClassifier can help prevent this.

Finally, PeptideClassifier can assist in selecting the most relevant peptides for targeted quantitative proteomics approaches using multiple reaction monitoring. Applications can range from selecting proteotypic peptides from existing experimental proteome catalogs14 to supporting larger projects like the Human Protein Detection and Quantitation project15, which aims to identify expression evidence

those peptides that identify a proper subset (class 2a) versus those that imply all protein sequences encoded by a gene model (class 2b). Finally, class 3a peptides unambiguously identify a protein sequence that can be encoded by several gene models from distinct loci. Such cases, which include histones or the products of duplicated genes in prokaryotes, are typically very rare. In contrast, class 3b peptides are derived from different protein sequences encoded by gene models from distinct loci. They have the least information content but can account for a large percentage of the experimental data8.
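The six evidence classes can be expressed as a small decision procedure. The sketch below is one possible reading of the definitions in the text, not the released PeptideClassifier code; the function name, the two lookup dictionaries and the toy identifiers are assumptions made for illustration. It expects the complete list of protein identifiers whose sequence contains the peptide.

```python
def classify_peptide(protein_ids, seq_of, gene_of):
    """Return the evidence class for a peptide found in the given protein identifiers.

    seq_of:  protein identifier -> protein sequence (or any label of that sequence)
    gene_of: protein identifier -> gene model identifier
    """
    genes = {gene_of[p] for p in protein_ids}
    sequences = {seq_of[p] for p in protein_ids}
    if len(genes) > 1:
        # Several gene models: one shared sequence (3a) or distinct sequences (3b).
        return "3a" if len(sequences) == 1 else "3b"
    if len(sequences) == 1:
        # One gene model, one sequence: single isoform (1a) or UTR-only variants (1b).
        return "1a" if len(protein_ids) == 1 else "1b"
    gene = next(iter(genes))
    all_sequences = {seq_of[p] for p, g in gene_of.items() if g == gene}
    # One gene model, several sequences: all of them (2b) or a proper subset (2a).
    return "2b" if sequences == all_sequences else "2a"


# Hypothetical annotation: G1 has four isoforms, two of which share a sequence;
# G2 and G3 are distinct loci encoding the same protein sequence.
gene_of = {"G1-PA": "G1", "G1-PB": "G1", "G1-PC": "G1", "G1-PD": "G1",
           "G2-PA": "G2", "G3-PA": "G3"}
seq_of = {"G1-PA": "seqA", "G1-PB": "seqA", "G1-PC": "seqB", "G1-PD": "seqD",
          "G2-PA": "seqC", "G3-PA": "seqC"}

print(classify_peptide(["G1-PC"], seq_of, gene_of))
print(classify_peptide(["G1-PC", "G2-PA"], seq_of, gene_of))
```

For prokaryotes, where each gene model encodes a single protein, only the 1a, 3a and 3b branches can be reached.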

The conceptually simple extension of integrating the gene model distinguishes our solution from other common protein inference tools, such as ProteinProphet5, IsoformResolver9, Scaffold10 or IDPicker11. Similar to these tools, PeptideClassifier addresses protein inference using the assigned peptides, but it does not try to improve the peptide-spectrum matching (PSM) process (Fig. 1). Notably, for protein inference, our deterministic method considers only peptides above a user-defined threshold, and not lower-scoring peptides that could provide additional evidence for certain protein identifications, the default approach adopted by ProteinProphet5 and Scaffold10. The deterministic approach is therefore very stringent. Similar to IDPicker11, Scaffold10 and other solutions, the output of two different database search algorithms could in principle be classified and integrated to achieve additional stringency.

One example of the use of PeptideClassifier concerns the reporting of reference data sets in proteomics, where the error rate should be as minimal as possible. For peptides of class 1a, 1b and 3a, a minimal list of nonoverlapping, unambiguous protein sequence identifications can be generated (Fig. 2). For cases in which the peptide evidence cannot distinguish between several possible protein sequences that are encoded either by the same gene model (class 2a and 2b) or by different gene models (class 3b) (Fig. 2), a minimal list of ambiguous protein identifications can be generated that reports the inherent ambiguities (for more details, see Supplementary Fig. 1). Using a target-decoy database search strategy12, one can estimate the spectrum-level false discovery rate (FDR) for a selected peptide confidence threshold. Because the FDR is much higher for proteins identified by a single hit, one option would be to exclude them. Existing guidelines for protein identification requiring two distinct peptides6 have limited the number of false-positive protein


With the broadening availability of tools and methods, researchers have to define the most efficient binder applicable to the method and approach they have selected. These applications are carried out under different experimental conditions, which affect the choice of affinity reagent used. For example, binders can be either in solution or immobilized to a solid phase, and target proteins may be present either in a native, conformationally folded form or in a denatured state. To compare affinity reagents and decide upon the most appropriate one, users need comprehensive information regarding each reagent. Currently, multiple sources of information exist, including commercial catalogs of antibodies, portals centralizing affinity-reagent properties from various sources and experimental results published in the literature describing the successful use of a binder in a specific application. Large-scale production initiatives also add other sources such as validation and quality-control results from production centers and independent quality assessment laboratories (such as the Antibodypedia portal; http://www.antibodypedia.org/). Even so, the available information may be incomplete; for example, the identification of a protein belonging to a particular family using a given antibody may be reported with no information concerning the assessment of possible cross-reactivity of the antibody with other family members. Existing information may also be biased by unsubstantiated reports from a commercial producer. Furthermore, data may appear contradictory at first glance, owing to a lack of precision in target or sample descriptions.

The purpose of MIAPAR is to permit the reliable identification of affinity reagent–target–application triples. A binder is designed and produced for the detection of a particular target protein or peptide, often within a complex mixture. For maximum benefit of potential users, reporting of data about such a protein binder must describe (or reference) both its intended target and its qualities as a molecular tool. Ideally, such a description should include: (i) affinity reagent (and target) production processes, which may influence the characteristics of the binder and permit the unambiguous identification of the molecules; (ii) properties of the reagent as a binding tool, including its specificity, affinity, binding kinetics and cross-reactivity; (iii) the use of the reagent in applications (that is, compatibility with experimental

3. de Godoy, L.M. et al. Nature 455, 1251–1254 (2008).

4. Nesvizhskii, A.I. & Aebersold, R. Mol. Cell. Proteomics 4, 1419–1440 (2005).

5. Nesvizhskii, A.I., Keller, A., Kolker, E. & Aebersold, R. Anal. Chem. 75, 4646–4658 (2003).

6. Carr, S. et al. Mol. Cell. Proteomics 3, 531–533 (2004).

7. Nesvizhskii, A.I., Vitek, O. & Aebersold, R. Nat. Methods 4, 787–797 (2007).

8. Grobei, M.A. et al. Genome Res. 19, 1786–1800 (2009).

9. Resing, K.A. et al. Anal. Chem. 76, 3556–3568 (2004).

10. Searle, B.C., Turner, M. & Nesvizhskii, A.I. J. Proteome Res. 7, 245–253 (2008).

11. Ma, Z.Q. et al. J. Proteome Res. 8, 3872–3881 (2009).

12. Elias, J.E. & Gygi, S.P. Nat. Methods 4, 207–214 (2007).

13. Gupta, N. & Pevzner, P.A. J. Proteome Res. 8, 4173–4181 (2009).

14. Ahrens, C.H., Brunner, E., Hafen, E., Aebersold, R. & Basler, K. Fly 1, 182–186 (2007).

15. Anderson, N.L. et al. Mol. Cell. Proteomics 8, 883–886 (2009).

Minimum information about a protein affinity reagent (MIAPAR)

To the Editor: We wish to alert your readers to MIAPAR, the minimum information about a protein affinity reagent. This is a proposal developed within the community as an important first step in formalizing standards in reporting the production and properties of protein binding reagents, such as antibodies, developed and sold for the identification and detection of specific proteins present in biological samples. It defines a checklist of required information, intended for use by producers of affinity reagents, quality-control laboratories, users and databases (Supplementary Table 1). We envision that both commercial and freely available affinity reagents, as well as published studies using these reagents, could include a MIAPAR-compliant document describing the product’s properties with every available binding partner. This would enable the user or reader to make a fully informed evaluation of the validity of conclusions drawn using this reagent (Fig. 1).

Supplementary Table 2 shows an example of a MIAPAR-compliant document, which could be derived from the information supplied in a single publication using the workflow summarized in Supplementary Figure 1.

Affinity reagents serve various roles in experimental studies. These include protein sample identification and detection; protein capture for isolation, purification and quantification; and functional studies. The choice of an applicable molecular tool is conditioned by the experimental objectives and the chosen approaches and methods. This has led to a widening of the range of molecules being used as affinity reagents (Table 1 and ref. 1). The best established are ‘natural’ polyclonal and monoclonal antibodies; however, an expanding range of recombinant constructs are now available, including single-chain variable fragments (scFvs), single-domain antibody fragments and diabodies. More recently, alternative affinity reagents have been developed, the biophysical properties of which present advantages in specific applications. They include protein scaffolds, such as fibronectin, lipocalins and ankyrin and armadillo repeat domains, and nucleic acid aptamers. These reagents are used in a growing range of experimental methods, including enzyme-linked immunosorbent assay (ELISA), western blotting, immunohistochemistry, affinity chromatography and immunoprecipitation (Table 2).

At the same time, the systematic characterization of complete proteomes has led to an increase in the scale on which affinity reagents are produced. Several ambitious projects aim to develop systematic affinity-reagent collections. In Europe, they include the EU ProteomeBinders consortium1, the Human Protein Resource and Human Protein Atlas2 and the Antibody Factory3. In the United States, the National Cancer Institute (Bethesda, MD) has initiated the Clinical Proteomic Reagents Resource within the Clinical Proteomic Technologies Initiative for Cancer4. Globally, the Human Proteome Organization (HUPO) Human Antibody Initiative aims to promote and facilitate the use of antibodies for proteomics research, which embraces many of these activities (http://www.hupo.org/research/hai), and the HUPO Proteomics Standards Initiative has developed PSI-PAR, a global community standard format for the representation and exchange of protein affinity-reagent data5.


[Supplementary Fig. 1 graphic (Qeli & Ahrens): flowchart of the PeptideClassifier modules. Step 1: analysis of protein sequence redundancies in the reference protein database (FastaAnalyzer), producing the identifiable proteome index. Step 2: XML parsing of pepXML results from the peptide spectrum match, optionally after external post-processing, e.g. TPP (PepXMLParser). Step 2′: peptide-protein mapping for a list of user-selected peptides (Peptide2Proteins). Step 3: peptide classification (PeptideClassifier), which combines the peptide-protein assignments with the organism-specific gene-protein mapping. Step 4: extraction of a minimal protein list for unique peptides (UniquePeptideExplainer). Step 5: extraction of a minimal protein list for shared peptides (SharedPeptideExplainer).]

Example classified peptides shown in the figure:

Spectrum#   Peptide_Seq   Protein_Alternatives              Score   Class
1           MVQYNFK       Prot_1.1                          0.95    1a
2           ITVVPNGK      Prot_2.1, Prot_2.2                0.91    2
3           YISHLNAR      Prot_1.1, Prot_3.1, Prot_900.1    0.92    3b

Nature Biotechnology, vol. 28, no. 7 Ahrens, C. et al.

Supplementary Figure 1: Schematic overview of the modular architecture of PeptideClassifier and its functionalities.

The names of the individual software modules are shown in blue. The score shown in the peptide-protein assignment box and in the classified peptides box schematically represents the score that a probabilistic or other scoring algorithm would provide as a measure of the quality of the PSM assignment.

Description of individual PeptideClassifier modules and their functionalities:

PeptideClassifier is command-line driven in order to support large-scale proteomics approaches. For the entire software package and for each of the five individual software modules, a readme file is provided that explains the usage.

• Analysis of protein sequence redundancies (Step 1, FastaAnalyzer)

In a first step, a reference protein database of choice is analyzed to identify any protein sequence redundancies and to define the set of protein sequences that differ by at least one amino acid and thus are (at least in theory) distinguishable by tandem mass spectrometry (the identifiable proteome index). Such redundancies can include proteins encoded by the same gene model (whose transcripts differ only in the 5′ or 3′ untranslated regions, UTRs) or proteins encoded by different gene models (e.g. histones). For each unique sequence, all protein identifiers pointing to that sequence are listed. This information is later used in the peptide classification process.
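The redundancy analysis described above amounts to grouping protein identifiers by identical sequence. The following is a minimal sketch of that idea and not the FastaAnalyzer module itself; the function names are assumptions, the identifier is taken to be the first whitespace-delimited token of each FASTA header, and the toy sequences are the truncated example sequences from Supplementary Fig. 1.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Assumption: the identifier is the first token after ">".
            header, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def identifiable_proteome_index(lines):
    """Map each distinct protein sequence to all identifiers that point to it."""
    index = defaultdict(list)
    for protein_id, sequence in read_fasta(lines):
        index[sequence].append(protein_id)
    return dict(index)


toy_fasta = """>Prot_1.1
MVQYNFKYISHLNAR
>Prot_2.1
MALLVFQLKITVVPNGK
>Prot_2.2
MALLVFQLK
ITVVPNGK
""".splitlines()

proteome_index = identifiable_proteome_index(toy_fasta)
for sequence, protein_ids in proteome_index.items():
    print(sequence, protein_ids)
```

Two identifiers that land in the same entry cannot be told apart by any peptide, which is why the classification later needs the gene model to separate class 1b from class 3a.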

• Organism-specific gene/protein naming convention & supported databases

PeptideClassifier requires a clear mapping between gene models and their encoded proteins. The annotation of genes and their encoded proteins follows different rules in different organisms. For some model organisms such as Drosophila and Arabidopsis, the annotation is straightforward in the sense that different splice variants encoded by the same gene model have the same prefix in the identifier and differ from each other only in their suffix. For example, the Drosophila gene “supercoiling factor” with the gene symbol ‘scf’ and the gene identifier CG9148 encodes three protein products annotated as CG9148-PA, CG9148-PB and CG9148-PC. The annotation for Arabidopsis thaliana and Caenorhabditis elegans follows similar conventions. Hence, for these model organisms the relationship between protein names and gene model names is trivial and can easily be extracted.

Extracting these relationships for other organisms, such as human and mouse, depends on the reference database. For the IPI databases, which are built from different resources in order to provide a very comprehensive database, not all entries have a link to a gene model identifier. While ENSEMBL reference databases do not possess such a clear and intuitive relationship between the gene model and the protein identifier as Drosophila and Arabidopsis, the mapping between gene model and protein identifiers can be extracted from the headers of the respective Fasta sequence entries and has the following form:

ENSP00000386659|ENSG00000222041

ENSP00000386718|ENSG00000222041

ENSP00000386407|ENSG00000222041

ENSP00000387320|ENSG00000222041

In this example, the human gene ENSG00000222041 has four splice variants named ENSP00000386659, ENSP00000386718, ENSP00000386407 and ENSP00000387320. As such, ENSEMBL-annotated data can be easily processed following the same rules.
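Both naming conventions can be turned into a protein-to-gene lookup with very little code. The sketch below is only an illustration and is unrelated to the shell scripts shipped with the software; the function names are hypothetical, and it assumes mapping lines of the `protein|gene` form shown above and Drosophila-style identifiers whose suffix follows the last hyphen.

```python
def parse_ensembl_mapping(lines):
    """Build a protein identifier -> gene identifier lookup from 'ENSP...|ENSG...' lines."""
    gene_of = {}
    for line in lines:
        line = line.strip().lstrip(">")
        if not line:
            continue
        protein_id, gene_id = line.split("|")[:2]
        gene_of[protein_id] = gene_id
    return gene_of


def gene_from_suffix_id(protein_id):
    """Drop the isoform suffix of a Drosophila-style identifier, e.g. CG9148-PA -> CG9148."""
    return protein_id.rsplit("-", 1)[0]


mapping_lines = [
    "ENSP00000386659|ENSG00000222041",
    "ENSP00000386718|ENSG00000222041",
    "ENSP00000386407|ENSG00000222041",
    "ENSP00000387320|ENSG00000222041",
]
gene_of = parse_ensembl_mapping(mapping_lines)
print(gene_of["ENSP00000386718"])
print(gene_from_suffix_id("CG9148-PB"))
```

Either function yields the protein identifier to gene model relationship that the classification step takes as input.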


In the folder Auxiliary_Scripts we have provided several shell scripts to enable users to generate these gene model–protein identifier mapping files for their reference database version of choice. By adapting the code of the shell scripts for the human or mouse ENSEMBL database, these can easily be extended to other ENSEMBL model organism databases.

Importantly, researchers could also opt to build their own enhanced database, map the proteins without a gene model identifier onto a reference genome, assign their own identifiers and then use PeptideClassifier. Any clear mapping, e.g. from the unique locus position of the gene to the protein name, would work. This would require a continued bioinformatics effort, but represents a solution to map the peptides back to the gene level and to minimize misclassifications based on non-comprehensive databases.

• XML parsing (Step 2, PepXMLParser)

The purpose of this step is to parse pepXML data files, which contain the output of common database search algorithms and, if applicable, subsequent post-processing steps (e.g. PeptideProphet as part of the Trans-Proteomic Pipeline), into a comma-separated file format (CSV format) for easier downstream processing.
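A conversion of this kind can be sketched with the Python standard library. The example below is not the PepXMLParser module; it assumes the usual pepXML element and attribute names (`spectrum_query`, `search_hit`, `alternative_protein`, `peptideprophet_result`), keeps only the top-ranked hit per spectrum, and uses a hand-written sample document and hypothetical column names.

```python
import csv
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace prefix of the form '{uri}name'."""
    return tag.rsplit("}", 1)[-1]


def pepxml_to_rows(xml_text):
    """Extract (spectrum, peptide, proteins, probability) for each top-ranked hit."""
    rows = []
    root = ET.fromstring(xml_text)
    for query in (e for e in root.iter() if local(e.tag) == "spectrum_query"):
        for hit in (e for e in query.iter() if local(e.tag) == "search_hit"):
            if hit.get("hit_rank") != "1":
                continue
            proteins, probability = [hit.get("protein")], ""
            for child in hit.iter():
                if local(child.tag) == "alternative_protein":
                    proteins.append(child.get("protein"))
                elif local(child.tag) == "peptideprophet_result":
                    probability = child.get("probability")
            rows.append((query.get("spectrum"), hit.get("peptide"),
                         ";".join(proteins), probability))
    return rows


def rows_to_csv(rows):
    """Serialize the extracted rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["spectrum", "peptide", "proteins", "probability"])
    writer.writerows(rows)
    return buffer.getvalue()


SAMPLE = """<msms_pipeline_analysis xmlns="http://regis-web.systemsbiology.net/pepXML">
 <msms_run_summary>
  <spectrum_query spectrum="run1.00002.00002.2" assumed_charge="2">
   <search_result>
    <search_hit hit_rank="1" peptide="ITVVPNGK" protein="Prot_2.1">
     <alternative_protein protein="Prot_2.2"/>
     <analysis_result analysis="peptideprophet">
      <peptideprophet_result probability="0.91"/>
     </analysis_result>
    </search_hit>
   </search_result>
  </spectrum_query>
 </msms_run_summary>
</msms_pipeline_analysis>"""

rows = pepxml_to_rows(SAMPLE)
print(rows_to_csv(rows))
```

Joining the alternative proteins with a semicolon keeps each PSM on one CSV line; real pepXML files can be large, so a streaming parser may be preferable in practice.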

• Peptide-Protein mapping (Step 2′, Peptide2Proteins)

As an alternative to the above step, a user can also provide a list of peptides to be processed further through the pipeline. For each peptide in the list, all proteins that contain the peptide as part of their sequence (a substring) are extracted. This list of alternative proteins for each peptide is thus similar to the output obtained after the processing of pepXML files.
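The substring lookup described for this step can be sketched as follows. This is a naive illustration rather than the Peptide2Proteins module: the function name is hypothetical, the input is assumed to be a sequence-to-identifiers index like the identifiable proteome index, and the toy sequences are the truncated examples from Supplementary Fig. 1.

```python
def peptides_to_proteins(peptides, proteome_index):
    """For each peptide, list all protein identifiers whose sequence contains it.

    proteome_index: protein sequence -> list of identifiers sharing that sequence.
    """
    result = {}
    for peptide in peptides:
        result[peptide] = sorted(
            protein_id
            for sequence, protein_ids in proteome_index.items()
            if peptide in sequence          # plain substring test
            for protein_id in protein_ids
        )
    return result


toy_index = {
    "MVQYNFKYISHLNAR": ["Prot_1.1"],
    "MALLVFQLKITVVPNGK": ["Prot_2.1", "Prot_2.2"],
    "MEDLVVPKYISHLNAR": ["Prot_3.1", "Prot_900.1"],
}
alternatives = peptides_to_proteins(["MVQYNFK", "ITVVPNGK", "YISHLNAR"], toy_index)
for peptide, protein_ids in alternatives.items():
    print(peptide, protein_ids)
```

A linear scan over all sequences is fine for a short peptide list; for proteome-wide lists an indexed search structure would be the more practical choice.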


• Peptide classification (Step 3, PeptideClassifier)

The central classification step takes as input the peptides with the respective alternative protein information, as well as the protein sequence–protein identifier and protein identifier–gene model relationships, and classifies the peptides in a deterministic way into one of six predefined peptide evidence classes for eukaryotes or one of three classes for prokaryotes.

• Extraction of a minimal protein list for unique peptides (Step 4, UniquePeptideExplainer)

After the classification step, all peptides are categorized into one of three groups for prokaryotes (1a, 3a, 3b) or six groups for eukaryotes (1a, 1b, 2a, 2b, 3a, 3b). We thus extended the original classification reported in ref. 1 by further distinguishing class 2 peptides and by also covering prokaryotes. Proteins implied by peptides of classes 1a, 1b and 3a do not overlap with each other and thus form a minimal set of proteins that explains these peptides.

• Extraction of a minimal protein list for shared peptides (Step 5, SharedPeptideExplainer)

The ambiguity of shared peptides, which could be derived from several proteins (for example, splice variants, or similar proteins encoded by distinct gene models that arose from duplication events), poses a problem for protein inference2. While shared peptides are less prominent in prokaryotic proteomes, they are quite common in higher plants and in the mouse and human proteomes. Shared peptides of classes 2a, 2b and 3b can imply several distinct protein sequences. For eukaryotes, class 2a and 2b peptides can imply proteins that could also contain peptides of class 1a, 1b and 3a, and as such these peptides can be explained by these proteins. Any overlaps in this context are considered and excluded.

Peptides of class 3b, present in both eukaryotes and prokaryotes, constitute shared peptides of lowest information content, pointing to several proteins encoded by different gene models. Some of these peptides point to proteins that are also identified with unique peptides. Applying the “Occam’s Razor” principle3 in this case, the simplest hypothesis that would explain these peptides is that the proteins with unique peptides are “present” in the sample and that these shared peptides also originated from these proteins.
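The Occam's Razor step just described can be sketched as a simple set intersection. The code is an illustration under stated assumptions, not the SharedPeptideExplainer module: the function name and the input layout (peptide, evidence class, set of implied proteins) are hypothetical, and the last example peptide and its proteins are invented.

```python
UNIQUE_CLASSES = {"1a", "1b", "3a"}


def explain_shared_peptides(classified):
    """Attribute shared peptides to proteins already identified by unique peptides.

    classified: list of (peptide, evidence_class, set_of_implied_proteins).
    Returns (identified proteins, explained shared peptides, unexplained shared peptides).
    """
    identified = set()
    for _, evidence_class, proteins in classified:
        if evidence_class in UNIQUE_CLASSES:
            identified |= proteins

    explained, unexplained = {}, {}
    for peptide, evidence_class, proteins in classified:
        if evidence_class in UNIQUE_CLASSES:
            continue
        overlap = proteins & identified
        if overlap:
            explained[peptide] = overlap       # simplest hypothesis: came from these
        else:
            unexplained[peptide] = proteins    # left for the minimal-set step
    return identified, explained, unexplained


classified_peptides = [
    ("MVQYNFK", "1a", {"Prot_1.1"}),
    ("YISHLNAR", "3b", {"Prot_1.1", "Prot_3.1", "Prot_900.1"}),
    ("AAGDLLEK", "3b", {"Prot_7.1", "Prot_8.1"}),   # hypothetical peptide and proteins
]
identified, explained, unexplained = explain_shared_peptides(classified_peptides)
print(identified, explained, unexplained)
```

Only the peptides left in `unexplained` need further treatment to find a minimal set of additional proteins.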

However, some of the class 3b peptides remain unexplained by unambiguously identified proteins. This case is treated separately from the rest, and the explanation of these peptides is solved using an integer programming approach (again following the Occam’s Razor principle to find the minimal set of proteins that explains these remaining peptides).

These peptides can imply up to $N$ additional proteins, where $N$ is the maximal number of proteins implied by all otherwise unexplained 3b peptides. The problem of finding the minimal set of proteins can be formulated in the following way. We denote by $y_i$ a binary variable representing whether the protein $i$ is in the sample ($y_i = 1$) or not ($y_i = 0$). Each peptide $j \in J_{3b}$ implies a set of proteins $Z_j \subseteq \{y_1, y_2, \ldots, y_N \mid y_i \in \{0,1\}\}$. Then the purpose would be to find the $N$-dimensional configuration vector $\hat{Y}$ that solves the following integer programming problem:

\[
\text{Minimize: } \sum_{i=1}^{N} y_i
\]

\[
\text{Subject to: } \sum_{y_i \in Z_j} y_i \geq 1 \quad \text{for each } j \in J_{3b}
\]


This optimization problem can be solved with linear programming libraries, e.g. the GNU Linear Programming Kit (http://www.gnu.org/software/glpk/).

A shell script (SharedPeptideExplainer.sh) with instructions on how to run this optimization on the protein lists implied by class 3b peptides is provided with the software code.
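For small instances, the same minimal-set problem can be illustrated without a solver. The sketch below replaces the integer programming formulation with an exhaustive search over protein subsets of increasing size, which returns a minimum-size solution but scales exponentially; it is not the method used by SharedPeptideExplainer.sh, and the function name and toy identifiers are assumptions for illustration only.

```python
from itertools import combinations


def minimal_protein_set(peptide_to_proteins):
    """Return a smallest set of proteins covering every peptide.

    peptide_to_proteins: dict mapping peptide j -> set Z_j of implied proteins.
    Each peptide must be explained by at least one chosen protein, i.e. the
    constraint sum(y_i for y_i in Z_j) >= 1 holds for every j.
    """
    constraints = [frozenset(z) for z in peptide_to_proteins.values()]
    if any(not z for z in constraints):
        raise ValueError("a peptide without implied proteins cannot be explained")
    candidates = sorted(set().union(*constraints))
    # Try subsets in order of increasing size; the first feasible one is minimal.
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            chosen = set(subset)
            if all(z & chosen for z in constraints):
                return chosen
    return set(candidates)  # not reached: the full candidate set is always feasible


# Hypothetical instance with three unexplained peptides and three proteins.
example = {
    "pepA": {"P1", "P2"},
    "pepB": {"P2", "P3"},
    "pepC": {"P1", "P2", "P3"},
}
solution = minimal_protein_set(example)
```

Here P2 alone satisfies all three constraints, so the minimal list contains one protein. When several minimal sets exist, this sketch returns the first one in sorted order; a real solver may report a different but equally small set.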

Finally, some peptides of class 2a/2b might also remain unexplained by unambiguous protein identifications; they can be processed in the same way as peptides of class 3b to derive a minimal list of protein identifications that explains them.

References

1. Grobei, M.A. et al. Genome Res. 19, 1786-1800 (2009).
2. Nesvizhskii, A.I. & Aebersold, R. Mol. Cell. Proteomics 4, 1419-1440 (2005).
3. Nesvizhskii, A.I., Keller, A., Kolker, E. & Aebersold, R. Anal. Chem. 75, 4646-4658 (2003).


Supplementary Table 1. Implications of the six peptide evidence classes pre-computed by PeptideClassifier for selected major applications.

| Peptide evidence class | Protein sequence(s) | Protein isoform(s) | Gene(s) | Implication: protein inference | Implication: targeted proteomics | Implication: data integration with transcriptomics |
|---|---|---|---|---|---|---|
| 1a | unambiguous | unambiguous | unambiguous | minimal list | proteotypic peptide | integrate gene model based array data; potential to integrate exon array data |
| 1b | unambiguous | ambiguous | unambiguous | minimal list | proteotypic peptide | integrate gene model based array data; potential to integrate exon array data |
| 2a | ambiguous | ambiguous | unambiguous | specific for one gene model; identifies proper subset | identifies a proper subset of isoforms of a gene model | integrate gene model based array data; potential to integrate exon array data |
| 2b | ambiguous | ambiguous | unambiguous | specific for one gene model; implies all encoded proteins | identifies all annotated isoforms of a gene model | integrate gene model based array data |
| 3a | unambiguous | ambiguous | ambiguous | minimal list of proteins; can be encoded by distinct gene models | (proteotypic peptide) | potential to distinguish an ambiguous identification at the gene model level with transcriptomics, RT-PCR, or other data |
| 3b | ambiguous | ambiguous | ambiguous | minimal list of proteins to explain remaining peptide evidence* (Occam's razor) | n.a. | (potential to minimize the list of ambiguous proteins by integrating transcriptomics, RT-PCR, or other data) |

* see Supplementary Fig. 1 for more details.


Supplementary Table 2: Overview of the percentage of gene models with multiple

annotated protein sequences in reference databases.

PeptideClassifier is most useful when working with completely sequenced and well-

annotated genomes and the corresponding reference protein databases. An analysis of

different model organism and other reference database releases over time indicates that the

percentage of gene models for which more than one protein sequence is annotated

(including alternatively spliced protein isoforms) is increasing. This increase is particularly

striking for the human proteome. This is in part due to the formation of a Genome Reference

Consortium which aims to provide a comprehensive reference genome annotation

(http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/index.shtml).

| Organism / reference database | # Proteins | # Gene models (GM) | # GM with one protein identifier | # GM with more than one protein identifier |
|---|---|---|---|---|
| D. melanogaster FlyBase 5.13 | 21,317 | 14,101 | 10,461 | 3,640 (25.8%) |
| C. elegans Wormpep 140 | 22,420 | 19,735 | 17,844 | 1,891 (9.6%) |
| C. elegans Wormpep 208 | 24,202 | 20,226 | 17,516 | 2,710 (13.4%) |
| A. thaliana TAIR7 | 31,921 | 27,029 | 23,230 | 3,799 (14.0%) |
| A. thaliana TAIR9 | 33,410 | 27,379 | 22,826 | 4,553 (16.6%) |
| M. musculus ENSEMBL NCBIM 37.53 | 40,732 | 23,497 | 14,594 | 8,903 (37.9%) |
| H. sapiens ENSEMBL NCBI36.53 | 47,509 | 21,673 | 10,427 | 11,246 (51.9%) |
| H. sapiens GRCh 36.56 | 77,748 | 23,678 | 8,945 | 14,733 (62.2%) |

Global gene expression analysis technologies, including whole genome tiling microarrays and RNA-Seq, have exceptional potential to provide experimental evidence for genes missed by current gene prediction algorithms and to identify alternative splice isoforms. Therefore, a peptide classification scheme such as the one employed by PeptideClassifier should become increasingly important.
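The counts in Supplementary Table 2 can in principle be derived from a list of gene model and protein identifier pairs. The following Python sketch shows one way to do this; the function name and the returned keys are assumptions, and GENE_B / PROT_B1 are hypothetical identifiers added only to make the toy input contain a gene model with a single protein.

```python
from collections import defaultdict


def gene_model_stats(gene_protein_pairs):
    """Count gene models with one vs. more than one protein identifier."""
    proteins_per_gene = defaultdict(set)
    for gene, protein in gene_protein_pairs:
        proteins_per_gene[gene].add(protein)
    n_genes = len(proteins_per_gene)
    n_multi = sum(1 for ids in proteins_per_gene.values() if len(ids) > 1)
    n_proteins = len(set().union(*proteins_per_gene.values()))
    return {
        "proteins": n_proteins,
        "gene_models": n_genes,
        "single": n_genes - n_multi,
        "multi": n_multi,
        "percent_multi": 100.0 * n_multi / n_genes if n_genes else 0.0,
    }


# Mouse gene model from Supplementary Table 3 plus one hypothetical gene model.
pairs = [
    ("ENSMUSG00000029713", "ENSMUSP00000106656"),
    ("ENSMUSG00000029713", "ENSMUSP00000031726"),
    ("ENSMUSG00000029713", "ENSMUSP00000106649"),
    ("ENSMUSG00000029713", "ENSMUSP00000106652"),
    ("GENE_B", "PROT_B1"),  # hypothetical
]
stats = gene_model_stats(pairs)
```

The percentage column of the table follows the same arithmetic: for FlyBase 5.13, 3,640 of 14,101 gene models corresponds to 25.8%.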


Supplementary Table 3. Advantages of an in silico classification for information content-

based selection of peptides for targeted quantitative proteomics.

The example peptide ADQELLMYSHDNIICGITSVAFSR shown below implies 4 mouse proteins with 3 distinct protein sequences; it has been observed multiple times in public data downloaded from PRIDE. Our classification readily provides the information that i) all these proteins are encoded by the same gene model, which is not apparent from the cryptic protein identifiers*, and ii) the peptide cannot distinguish between the 4 proteins and can thus only imply the encoding gene model (i.e. it is a class 2b peptide; see also Figure 1).

*For data searched in house, the protein identifier - gene model relationship information

would have to be parsed separately from the PSM search results and inspected manually.

This is very cumbersome and not practical for large-scale datasets. Importantly, for data

downloaded from public databases, the information that the proteins implied by a given

peptide are encoded by the same gene model is often not included. Thus, a pre-computed

index detailing the evidence class of each peptide offers multiple benefits.

Information that needs to be extracted / computed in silico:

| Peptide | Identifiers of implied protein(s) | Protein length | Gene model | Peptide class |
|---|---|---|---|---|
| ADQELLMYSHDNIICGITSVAFSR | ENSMUSP00000106656 | 382 | ENSMUSG00000029713 | 2b |
| | ENSMUSP00000031726 | 340 | ENSMUSG00000029713 | |
| | ENSMUSP00000106649 | 296 | ENSMUSG00000029713 | |
| | ENSMUSP00000106652 | 296 | ENSMUSG00000029713 | |

The PRIDE dataset also contained other experimentally observed tryptic peptides, to some

of which PeptideClassifier would assign an evidence class with higher information content

(see colored boxes in the ClustalW multiple sequence alignment shown below). The class

1a peptides (blue boxes) unambiguously identify one specific protein sequence, while the

class 2a peptide (green box) implies two distinct protein sequences encoded by this gene model. This represents a proper subset of all four annotated splice variants.


CLUSTAL 2.0.12 multiple sequence alignment

class 1a, class 1a
ENSMUSP00000106656 MGCSRPGMGIIPGLGPAARSCLPQPPSRGPQPPPTPPHGPGAMSELEQLRQEAEQLRNQI 60
ENSMUSP00000031726 ------------------------------------------MSELEQLRQEAEQLRNQI 18
ENSMUSP00000106649 ------------------------------------------------------------
ENSMUSP00000106652 ------------------------------------------------------------

class 2a
ENSMUSP00000106656 RDARKACGDSTLTQITAGLDPVGRIQMRTRRTLRGHLAKIYAMHWGTDSRLLVSASQDGK 120
ENSMUSP00000031726 RDARKACGDSTLTQITAGLDPVGRIQMRTRRTLRGHLAKIYAMHWGTDSRLLVSASQDGK 78
ENSMUSP00000106649 --------------------------MRTRRTLRGHLAKIYAMHWGTDSRLLVSASQDGK 34
ENSMUSP00000106652 --------------------------MRTRRTLRGHLAKIYAMHWGTDSRLLVSASQDGK 34
                                             **********************************

ENSMUSP00000106656 LIIWDSYTTNKVHAIPLRSSWVMTCAYAPSGNFVACGGLDNICSIYSLKTREGNVRVSRE 180
ENSMUSP00000031726 LIIWDSYTTNKVHAIPLRSSWVMTCAYAPSGNFVACGGLDNICSIYSLKTREGNVRVSRE 138
ENSMUSP00000106649 LIIWDSYTTNKVHAIPLRSSWVMTCAYAPSGNFVACGGLDNICSIYSLKTREGNVRVSRE 94
ENSMUSP00000106652 LIIWDSYTTNKVHAIPLRSSWVMTCAYAPSGNFVACGGLDNICSIYSLKTREGNVRVSRE 94
                   ************************************************************

ENSMUSP00000106656 LPGHTGYLSCCRFLDDNQIITSSGDTTCALWDIETGQQTVGFAGHSGDVMSLSLAPDGRT 240
ENSMUSP00000031726 LPGHTGYLSCCRFLDDNQIITSSGDTTCALWDIETGQQTVGFAGHSGDVMSLSLAPDGRT 198
ENSMUSP00000106649 LPGHTGYLSCCRFLDDNQIITSSGDTTCALWDIETGQQTVGFAGHSGDVMSLSLAPDGRT 154
ENSMUSP00000106652 LPGHTGYLSCCRFLDDNQIITSSGDTTCALWDIETGQQTVGFAGHSGDVMSLSLAPDGRT 154
                   ************************************************************

ENSMUSP00000106656 FVSGACDASIKLWDVRDSMCRQTFIGHESDINAVAFFPNGYAFTTGSDDATCRLFDLRAD 300
ENSMUSP00000031726 FVSGACDASIKLWDVRDSMCRQTFIGHESDINAVAFFPNGYAFTTGSDDATCRLFDLRAD 258
ENSMUSP00000106649 FVSGACDASIKLWDVRDSMCRQTFIGHESDINAVAFFPNGYAFTTGSDDATCRLFDLRAD 214
ENSMUSP00000106652 FVSGACDASIKLWDVRDSMCRQTFIGHESDINAVAFFPNGYAFTTGSDDATCRLFDLRAD 214
                   ************************************************************

class 2b
ENSMUSP00000106656 QELLMYSHDNIICGITSVAFSRSGRLLLAGYDDFNCNIWDAMKGDRAGVLAGHDNRVSCL 360
ENSMUSP00000031726 QELLMYSHDNIICGITSVAFSRSGRLLLAGYDDFNCNIWDAMKGDRAGVLAGHDNRVSCL 318
ENSMUSP00000106649 QELLMYSHDNIICGITSVAFSRSGRLLLAGYDDFNCNIWDAMKGDRAGVLAGHDNRVSCL 274
ENSMUSP00000106652 QELLMYSHDNIICGITSVAFSRSGRLLLAGYDDFNCNIWDAMKGDRAGVLAGHDNRVSCL 274
                   ************************************************************

ENSMUSP00000106656 GVTDDGMAVATGSWDSFLKIWN 382
ENSMUSP00000031726 GVTDDGMAVATGSWDSFLKIWN 340
ENSMUSP00000106649 GVTDDGMAVATGSWDSFLKIWN 296
ENSMUSP00000106652 GVTDDGMAVATGSWDSFLKIWN 296
                   **********************

Readily available information concerning different peptide evidence classes will be important

for the integration of shotgun proteomics data with transcriptomics data, both from gene

model-based and exon-based gene expression analysis platforms or RNA-Seq, and for the

selection of peptides for targeted quantitative proteomics projects.

For the specific use case of selecting peptides for targeted quantitative proteomics projects, a researcher can use PeptideClassifier to compute a proteome-wide peptide index and then browse all peptides for a specific gene model of interest. Integration of publicly available experimental data (e.g. from PeptideAtlas, PRIDE or Tranche) and, ideally, peptide detectability prediction could further help to decide which peptide is most likely to work well with selected reaction monitoring (SRM) or multiple reaction monitoring (MRM). Below, we illustrate this use case for the ENSEMBL human protein database (see Supplementary Fig. 1 for more detail on the usage of the respective PeptideClassifier software modules).

Step 1: Extract the relationships between gene models and protein identifiers.

(The example database used below can be found in the folder Auxiliary_Scripts.)

# Keep the protein identifier (field 1) and the gene field of each FASTA header,
# then write one "gene<TAB>protein" pair per line.
grep '>' Hsapiens_NCBI36.53.fasta | tr "|" " " \
  | awk '{for(i=1;i<=NF;i++){if(i==1 || $i~/gene/){printf "%s\t", $i}} printf "\n"}' \
  | sed -e "s/>//" -e "s/gene://" -e "s/\t$//" \
  | awk '{print $2"\t"$1}' | sort -u > Gene_Protein_Map_Human_36.53
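For readers who prefer to inspect this step outside the shell, the same extraction can be sketched in Python. The sketch assumes headers in which the first field is the protein identifier and the gene model appears as a `gene:` field, as the pipeline above implies; the function name and the two example header lines are illustrative assumptions, and unlike the awk pattern, which matches any field containing "gene", only fields that start with `gene:` are used here.

```python
def gene_protein_pairs(fasta_lines):
    """Return sorted, de-duplicated (gene, protein) pairs from FASTA headers."""
    pairs = set()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].replace("|", " ").split()
        if not fields:
            continue
        protein = fields[0]
        for field in fields[1:]:
            if field.startswith("gene:"):
                pairs.add((field[len("gene:"):], protein))
    return sorted(pairs)


# Assumed header layout, using identifiers from Supplementary Table 3.
lines = [
    ">ENSMUSP00000106656 pep:known gene:ENSMUSG00000029713",
    "MGCSRPGMGIIPGLGPAAR",
    ">ENSMUSP00000031726 pep:known gene:ENSMUSG00000029713",
    "MSELEQLR",
]
pairs = gene_protein_pairs(lines)
```

Each pair lists the gene model first and the protein identifier second, matching the column order of the map file written in Step 1.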

Step 2: Analyze redundancies in the database.

./FastaAnalyzer/FastaAnalyzer Hsapiens_NCBI36.53.fasta > FastaAnalyzer_Results_Human_36.53

Step 3: Peptide digestion and filtering.

This step performs a theoretical protease digest of the proteins, filters the resulting peptides according to a predefined mass range and reports the identifiers of all proteins each peptide points to. Digestdb, the script that performs the in silico digestion of proteins, is available from the Trans-Proteomics Pipeline (TPP, http://tools.proteomecenter.org/).

# Keep peptides within the mass range 450-4500 and with at least 6 residues, then
# collapse the sorted pairs into one "peptide;protein1,protein2,..." line per peptide.
digestdb Hsapiens_NCBI36.53.fasta \
  | awk '$2>=450 && $2<=4500 && length($4)>=6{print $4"\t"$1}' | sort -u \
  | awk '{if($1!=previous){printf "\n%s;%s,", $1, $2; previous=$1}else{printf "%s,", $2}}' \
  | sed 's/,$//' | grep "[A-Z]" > Peptides_Proteins_Human_36.53
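The logic of this step can be illustrated with a small, self-contained Python sketch. It is not a re-implementation of digestdb: it assumes trypsin-like cleavage after K or R unless followed by P, no missed cleavages, unmodified monoisotopic masses, and that the thresholds 450 and 4500 refer to the neutral peptide mass in Da. The function names are hypothetical, and the two input sequences are the N-terminal fragments shown in the alignment of Supplementary Table 3, not full-length proteins.

```python
import re
from collections import defaultdict

# Monoisotopic residue masses in Da (unmodified amino acids).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_peptides(sequence):
    """Cleave after K or R, except when the next residue is P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass, or None if a residue is not in the table."""
    try:
        return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
    except KeyError:
        return None


def peptide_index(proteins, min_mass=450.0, max_mass=4500.0, min_length=6):
    """Map each retained peptide to the set of proteins that contain it."""
    index = defaultdict(set)
    for protein_id, sequence in proteins.items():
        for peptide in tryptic_peptides(sequence):
            mass = peptide_mass(peptide)
            if len(peptide) >= min_length and mass is not None \
                    and min_mass <= mass <= max_mass:
                index[peptide].add(protein_id)
    return dict(index)


# N-terminal fragments taken from the alignment (first alignment block only).
fragments = {
    "ENSMUSP00000106656":
        "MGCSRPGMGIIPGLGPAARSCLPQPPSRGPQPPPTPPHGPGAMSELEQLRQEAEQLRNQI",
    "ENSMUSP00000031726": "MSELEQLRQEAEQLRNQI",
}
index = peptide_index(fragments)
```

For these two fragments, the peptide QEAEQLR maps to both proteins, whereas MSELEQLR maps only to ENSMUSP00000031726, because in ENSMUSP00000106656 the same residues are part of the longer tryptic peptide GPQPPPTPPHGPGAMSELEQLR.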

Step 4: Peptide classification.

java -jar ./PeptideClassifier/PeptideClassifier.jar Peptides_Proteins_Human_36.53 \
  Gene_Protein_Map_Human_36.53 FastaAnalyzer_Results_Human_36.53 \
  > outfile_classified_peptides_36.53


Step 5: Select a gene model of interest and all its encoded proteins; assess the classified peptides.

grep ENSG000000xxxxx Gene_Protein_Map_Human_36.53 | cut -f2 > proteins_of_interest

grep -f proteins_of_interest outfile_classified_peptides_36.53

From the output list, peptides of a specific evidence class can be selected for the proteins or gene model of interest. To select peptides that imply all protein isoforms encoded by a gene model, a user could consider class 1a peptides for gene models that encode only one protein sequence, class 1b peptides for gene models that encode several indistinguishable protein sequences, and class 2b peptides for gene models that encode two or more distinct protein sequences.

Class 2a peptides, on the other hand, imply a proper subset of the protein sequences encoded by the gene model. Class 3a peptides are special in that they unambiguously identify one protein sequence that can be encoded by distinct gene models. Proteomics methods cannot distinguish the encoding gene model in this case, and other technologies would have to be applied.
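The distinctions between the six evidence classes can be summarized as a small decision procedure. The Python sketch below encodes the class definitions as they are described in this supplement; it is not the PeptideClassifier implementation, and the function name, the data layout and the sequence labels (seq382, seq340, seq296) are assumptions made for illustration. PROT_X, PROT_Y, GENE_X and GENE_Y are hypothetical entries added to exercise classes 3a and 3b.

```python
def classify_peptide(proteins, protein_to_gene, protein_to_sequence):
    """Assign an evidence class (1a, 1b, 2a, 2b, 3a, 3b) to one peptide.

    proteins: set of protein identifiers implied by the peptide
    protein_to_gene: dict protein identifier -> gene model (whole database)
    protein_to_sequence: dict protein identifier -> sequence (or a label that
        is equal for proteins with identical sequence)
    """
    genes = {protein_to_gene[p] for p in proteins}
    sequences = {protein_to_sequence[p] for p in proteins}
    if len(genes) > 1:
        # Several gene models: class 3, split by number of distinct sequences.
        return "3a" if len(sequences) == 1 else "3b"
    if len(sequences) == 1:
        # One gene model, one sequence: one or several identifiers.
        return "1a" if len(proteins) == 1 else "1b"
    gene = next(iter(genes))
    gene_sequences = {
        protein_to_sequence[p] for p, g in protein_to_gene.items() if g == gene
    }
    # One gene model, several sequences: all of them (2b) or a proper subset (2a).
    return "2b" if sequences == gene_sequences else "2a"


GENE = "ENSMUSG00000029713"
protein_to_gene = {
    "ENSMUSP00000106656": GENE, "ENSMUSP00000031726": GENE,
    "ENSMUSP00000106649": GENE, "ENSMUSP00000106652": GENE,
    "PROT_X": "GENE_X", "PROT_Y": "GENE_Y",  # hypothetical
}
protein_to_sequence = {
    "ENSMUSP00000106656": "seq382", "ENSMUSP00000031726": "seq340",
    "ENSMUSP00000106649": "seq296", "ENSMUSP00000106652": "seq296",
    "PROT_X": "seq340",   # hypothetical: same sequence, other gene model
    "PROT_Y": "seq_other",  # hypothetical
}
all_four = {"ENSMUSP00000106656", "ENSMUSP00000031726",
            "ENSMUSP00000106649", "ENSMUSP00000106652"}
class_of_example = classify_peptide(all_four, protein_to_gene, protein_to_sequence)
```

With these inputs, the peptide that implies all four proteins of the mouse gene model is assigned class 2b, in agreement with Supplementary Table 3, and a peptide that implies only ENSMUSP00000031726 and ENSMUSP00000106656 is assigned class 2a.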

The list of classified peptides that can be extracted with this use case for the mouse gene model ENSMUSG00000029713 (described above) is shown below.


| Peptide sequence (from tryptic digest) | Implied proteins | Evidence class (calculated for entire database) |
|---|---|---|
| ACGDSTLTQITAGLDPVGR | ENSMUSP00000031726, ENSMUSP00000106656 | 2a |
| ADQELLMYSHDNIICGITSVAFSR | ENSMUSP00000031726, ENSMUSP00000106649, ENSMUSP00000106652, ENSMUSP00000106656 | 2b |
| AGVLAGHDNR | ENSMUSP00000030940, ENSMUSP00000031726, ENSMUSP00000101241, ENSMUSP00000106649, ENSMUSP00000106652, ENSMUSP00000106656 | 3b |
| ELPGHTGYLSCCR | ENSMUSP00000029212, ENSMUSP00000031726, ENSMUSP00000103869, ENSMUSP00000106649, ENSMUSP00000106652, ENSMUSP00000106656 | 3b |
| GPQPPPTPPHGPGAMSELEQLR | ENSMUSP00000106656 | 1a |
| IYAMHWGTDSR | ENSMUSP00000030940, ENSMUSP00000031726, ENSMUSP00000101241, ENSMUSP00000106649, ENSMUSP00000106652, ENSMUSP00000106656 | 3b |
| LIIWDSYTTNK | ENSMUSP00000029212, ENSMUSP00000030940, ENSMUSP00000031726, ENSMUSP00000101241, ENSMUSP00000103869, ENSMUSP00000106649, ENSMUSP00000106652, ENSMUSP00000106656 | 3b |
| LLLAGYDDFNCNIWDAMK | ENSMUSP00000031726, ENSMUSP00000106649, ENSMUSP00000106652, ENSMUSP00000106656 | 2b |
| LLVSASQDGK | ENSMUSP00000024206, ENSMUSP00000029212, ENSMUSP00000030940, ENSMUSP00000031726, ENSMUSP00000101241, ENSMUSP00000103869, ENSMUSP00000106649, ENSMUSP00000106652, ENSMUSP00000106656 | 3b |
| MGCSRPGMGIIPGLGPAAR | ENSMUSP00000106656 | 1a |
| MSELEQLR | ENSMUSP00000029212, ENSMUSP00000031726, ENSMUSP00000057938, ENSMUSP00000103869 | 3b |
| QEAEQLR | ENSMUSP00000029212, ENSMUSP00000031726, ENSMUSP00000103869, ENSMUSP00000106656 | 3b |
| QTFIGHESDINAVAFFPNGYAFTTGSDDATCR | ENSMUSP00000031726, ENSMUSP00000106649, ENSMUSP00000106652, ENSMUSP00000106656 | 2b |
| SCLPQPPSR | ENSMUSP00000106656 | 1a |
| SSWVMTCAYAPSGNFVACGGLDNICSIYSLK | ENSMUSP00000031726, ENSMUSP00000106649, ENSMUSP00000106652, ENSMUSP00000106656 | 2b |
| TFVSGACDASIK | ENSMUSP00000031726, ENSMUSP00000106649, ENSMUSP00000106652, ENSMUSP00000106656 | 2b |
| VHAIPLR | ENSMUSP00000024206, ENSMUSP00000030940, ENSMUSP00000031726, ENSMUSP00000101241, ENSMUSP00000106649, ENSMUSP00000106652, ENSMUSP00000106656 | 3b |
| VSCLGVTDDGMAVATGSWDSFLK | ENSMUSP00000030940, ENSMUSP00000031726, ENSMUSP00000101241, ENSMUSP00000106649, ENSMUSP00000106652, ENSMUSP00000106656 | 3b |

