
EpiT Tutorial


Epitopes Toolkit (EpiT)

Yasser EL-Manzalawy

http://www.cs.iastate.edu/~yasser 

March 9, 2009

What is EpiT?

Epitopes Toolkit (EpiT) is a platform for developing epitope prediction tools.

An EpiT developer can distribute his predictor as a serialized Java object

(model file). This allows other EpiT users to use his predictor on their own

machines, rebuild the predictor on other datasets, or combine the predictor

with other predictors to obtain a customized hybrid or consensus predictor.

Overview of EpiT

EpiT has two main components:

i.   Model builder, an application for building and evaluating epitope
predictors and serializing these models in a binary format (model files)

ii.  Predictor, an application for applying a model to test data (e.g., a
set of epitopes or protein sequences).

Model builder

The model builder  application is an extension of Weka [1], a well-known

machine learning workbench supporting many standard machine learning

algorithms. Weka provides tools for data pre-processing, classification,

regression, clustering, validation, and visualization. Furthermore, Weka

provides a framework for implementing new machine learning methods and

data pre-processors.

The model builder in EpiT offers the following extensions to Weka: i) a
suite of data pre-processors (called filters in Weka) for converting epitope
sequences into vectors of numerical features so that Weka-supported
methods can be applied to the data. The current implementation supports
filters for converting epitope sequences into amino acid compositions,
dipeptide compositions, amino acid pair propensities [2],
composition-transition-distribution (CTD) [3,4], and nominal attributes.
Once epitope sequences have been converted into numeric or nominal
features, any suitable Weka learner can be trained and evaluated on that
dataset; ii) a


number of methods that can be directly (without applying any filters) trained

and evaluated for qualitative and quantitative epitope predictions.

The current implementation of EpiT provides classifiers for propensity scale
methods (e.g., Parker’s hydrophilicity scale [5]), position specific scoring
matrix (PSSM) [6], and a method for predicting MHC class II binding
affinity using multiple-instance regression [7]. In addition, EpiT provides a
meta-classifier for building a consensus predictor that combines a group of
predictors, and a meta-classifier for building epitope predictors from highly
unbalanced training datasets by randomly under-sampling instances from the
majority class. More information about these extensions is provided in the
EpiT API documentation.
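
The idea behind random under-sampling of the majority class can be sketched in a few lines of Python. This is only an illustration of the general technique, not the EpiT meta-classifier itself; the function name `undersample`, the fixed seed, and the toy peptides are assumptions made for this example.

```python
import random

def undersample(instances, labels, seed=0):
    """Randomly drop instances until every class is as small as the rarest one."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(instances, labels):
        by_class.setdefault(y, []).append(x)
    # Size of the minority class decides how many instances each class keeps.
    n_min = min(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        for x in rng.sample(xs, n_min):
            balanced.append((x, y))
    rng.shuffle(balanced)
    return balanced

# Toy data (made up): two positives, four negatives.
peptides = ["AAK", "GGS", "PLV", "KKE", "TTS", "QNE"]
labels = [1, 0, 0, 0, 0, 1]
balanced = undersample(peptides, labels)
print(balanced)
```

The balanced list can then be used to train any learner; repeating the procedure with different seeds gives different subsets of the majority class.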

Predictor

The Predictor is a graphical user interface (GUI) for applying a model to a
test dataset. Specifically, the user inputs the model file, the test data file,
the output file name, the format of the test data (set of epitopes or FASTA
sequences), the type of the problem (peptide-based or residue-based) [8],
and the length of the peptide/window sequence. The output of the predictor
is a summary of the input model (model name, model parameters, and the
name of the dataset used to build the model) followed by the predictions.
The predictions are four tab-separated columns. The first column is the
epitope/antigen identifier. The second and third columns are the position
and the sequence of the predicted peptide/residue. The last column is the
predicted score.
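
The four-column output described above is easy to post-process with a short script. The sketch below assumes that prediction rows are exactly four tab-separated fields with an integer position and a numeric score, and that anything else (the model summary, blank lines, a possible header row) can be skipped; the exact layout of the summary is not specified here, and the sample lines are made up.

```python
def parse_predictions(lines):
    """Return (identifier, position, sequence, score) tuples from Predictor output."""
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # model summary or blank line (assumed not to have 4 fields)
        identifier, position, sequence, score = fields
        try:
            rows.append((identifier, int(position), sequence, float(score)))
        except ValueError:
            continue  # four fields but not numeric, e.g. a header row
    return rows

# Made-up sample in the described layout.
sample = [
    "Model: example summary line",
    "antigen1\t1\tACDEFGHIKLMNPQ\t0.82",
    "antigen1\t2\tCDEFGHIKLMNPQR\t-0.10",
]
rows = parse_predictions(sample)
print(rows)
```

To read a real output file, pass an open file object, for example `parse_predictions(open(path))`.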

Installing and using EpiT

EpiT is platform-independent since it is implemented in Java. To install
EpiT, download it from the project web site and unzip the compressed file.
To run EpiT, add all the jar files included in the lib folder to the
CLASSPATH and run the epit.jar file (see RunEpiT.bat for an example).

The following command sets the CLASSPATH and runs EpiT:

java -Xmx512m -classpath "./epit.jar;./lib/weka.jar;./lib/readseq.jar;./lib/swing-layout-1.0.3.jar;./lib/swing-worker-1.2.jar;." epit.gui.MainGUI

The semicolon is the Windows classpath separator; on Linux and macOS, use a
colon instead.
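
The launch command can also be assembled in a platform-neutral way. The sketch below builds the argument list with the classpath separator of the current operating system; the helper name `build_command` is hypothetical, and the jar names are the ones listed in the command above.

```python
import os

# Jar files named in the launch command.
JARS = [
    "./epit.jar",
    "./lib/weka.jar",
    "./lib/readseq.jar",
    "./lib/swing-layout-1.0.3.jar",
    "./lib/swing-worker-1.2.jar",
]

def build_command(jars, main_class="epit.gui.MainGUI", max_heap="512m", sep=os.pathsep):
    """Build the java argument list; sep is ';' on Windows and ':' elsewhere."""
    classpath = sep.join(list(jars) + ["."])
    return ["java", "-Xmx" + max_heap, "-classpath", classpath, main_class]

command = build_command(JARS)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command)` from the EpiT installation folder.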

Example 1: Predicting linear B-cell epitopes using FBCPred model


FBCPred [9] is a recent method for predicting flexible-length linear B-cell
epitopes using a subsequence kernel. An implementation of this method is
available on the BCPREDS web server. However, users are restricted to
submitting one protein sequence at a time. In this example, we demonstrate
how to use the Predictor application in EpiT and the FBCPred model file
provided in the Examples folder to predict potential linear B-cell epitopes.

1.  Run EpiT.

2.  Go to the Application menu and select the Predictor application.

3.  Press the Model button to view an open file dialog and use it to
select “./Examples/models/FBCPred.model”.

4.  Press the Test button to view an open file dialog and use it to select
the file containing the test sequences in FASTA format,
“./Examples/data/test.fasta.txt”.

5.  Press the Output button to view a save file dialog and use it to
specify the path and the name of the file that the predictions will be
written to (e.g., “./Examples/fbcpred.test.out.txt”).

6.  Set the peptide length to 14 (default value for the FBCPred method).

7.  Press the Predict button to get the predictions (See Figure 1).

8.  Change the test file to “./Examples/data/abcpred.blind.txt”. This is
the blind test set published by Saha et al. [10].

9.  Set the output file to “./Examples/data/fbcpred.abcpred.out.txt”.

10. Change the Input format to “epitopes list”. Note that the peptide
length will be changed to -1. This implies that full-length test
epitopes will be fed to the model for prediction without applying a
sliding window to fix the length of the test peptides submitted to
the classifier.

11. Press the Predict button to get the predictions (See Figure 2).
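
The difference between the two input formats used in the steps above can be illustrated with a small sketch: FASTA sequences are cut into fixed-length peptides with a sliding window, while a length of -1 passes each sequence through whole. This is an illustration of the idea only, not EpiT code; the function names and the 1-based positions are assumptions.

```python
def read_fasta(lines):
    """Parse FASTA text into a list of (identifier, sequence) pairs."""
    records, name, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

def peptides(sequence, length):
    """Sliding-window peptides as (position, peptide); length -1 keeps the full sequence."""
    if length == -1:
        return [(1, sequence)]
    return [(i + 1, sequence[i:i + length])
            for i in range(len(sequence) - length + 1)]

fasta = [">antigen1", "ACDEFGHI", "KLMNPQRS"]
for name, seq in read_fasta(fasta):
    print(name, peptides(seq, 14))
```

A 16-residue sequence yields three overlapping 14-mers, whereas `peptides(seq, -1)` returns the sequence unchanged as a single test instance.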


Figure 1: Output predictions of applying FBCPred model to antigen

sequences in test.fasta.txt.

Figure 2: Output predictions of applying FBCPred model to

ABCPred blind test set in abcpred.blind.txt.


Example 2: Developing a Position Specific Scoring Matrix (PSSM) for
predicting 20-mer linear B-cell peptides

1-  Run EpiT

2-  Go to the Application menu and select the Model builder application. A
modified version of Weka explorer will be displayed.

3-  Press the open file button and use the open file dialog to open
“./Examples/data/BCPred20.nr80.arff”. This is the dataset that was
used to develop the 20-mer peptide classifier for the BCPred
method [11]. Each instance is 20 residues in length and is associated
with a binary label that indicates whether the corresponding peptide is a
linear B-cell epitope or not. Figure 3 provides some useful
information about this dataset.

4-  Click the Classify tab.

5-  Click the Choose button to select the classification method and select
epit.classifiers.matrix.PSSMClassifier (See Figure 4).

6-  Click the Start button to begin a 10-fold cross-validation test to
evaluate the PSSM classifier on the BCPred 20-mer dataset. At the end, the
program will output the PSSM matrix constructed using the entire
training dataset and will also output several performance metrics
obtained using the cross-validation test. For more details, please see
the Weka explorer tutorial.

7-  In the result panel, right click on the classifier name and select “Save
model” from the popup menu and save the model as
“./Examples/models/pssm.model” (See Figure 5).
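
For readers unfamiliar with PSSMs, the sketch below shows one common way to build and apply such a matrix: position-specific log-odds scores estimated from positive peptides only, against a uniform background over the 20 amino acids. It is not the PSSMClassifier implementation; the pseudocount, the natural logarithm, and the toy 3-mer peptides are assumptions made for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(peptides, pseudocount=1.0):
    """One dict per position mapping amino acid -> log-odds vs. a uniform background."""
    length = len(peptides[0])
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for pos in range(length):
        # Pseudocounts avoid log(0) for residues never seen at this position.
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for peptide in peptides:
            counts[peptide[pos]] += 1
        total = sum(counts.values())
        pssm.append({aa: math.log((counts[aa] / total) / background)
                     for aa in AMINO_ACIDS})
    return pssm

def score(pssm, peptide):
    """Sum of the position-specific scores of a peptide of matching length."""
    return sum(column[aa] for column, aa in zip(pssm, peptide))

positives = ["ACD", "ACE", "AGD"]  # toy equal-length peptides
pssm = build_pssm(positives)
print(score(pssm, "ACD"), score(pssm, "WWW"))
```

Peptides that resemble the training epitopes receive higher scores; a threshold on the score turns the matrix into a binary classifier.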


Figure 3: EpiT model builder, an extended version of Weka GUI explorer.

Figure 4: Selecting the PSSM classifier.


Figure 6: A reported poor performance of the PSSM model built using

positive information only and assuming uniform background probabilities.

Example 3: Developing a propensity scale based method for predicting
linear B-cell epitopes

1-  Run EpiT

2-  Go to Application menu and select Model builder application. A

modified version of Weka explorer will be displayed.

3-  Press the open file button and use the open file dialog to open

“./Examples/data/BCPred20.nr80.arff”.

4-  Click the Classify tab.

5-  Click the Choose button to select the classification method and select
epit.classifiers.propensity.PropensityScale. The default parameter
settings for this method are: standard 20 amino acids alphabet,
Parker’s hydrophilicity scale, and window size = -1.

6-  Click the Start button to begin a 10-fold cross-validation test to
evaluate the PropensityScale classifier on the BCPred 20-mer dataset.


7-  In the result panel, right click on the classifier name and select “Save

model” from the popup menu and save the model as

“./Examples/models/parker.model”.

Note that the EpiT distribution includes 544 amino acid propensity scales
extracted from AAIndex. Any of these scales can be used with the
PropensityScale classifier instead of the default Parker’s hydrophilicity
scale.
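
A propensity scale assigns one number to each amino acid, and a peptide can be scored by aggregating the values of its residues. The sketch below averages the values, which is a common choice but an assumption here; how the PropensityScale classifier aggregates is described in the EpiT API documentation. The scale values are placeholders, not Parker’s scale or any AAIndex entry.

```python
def peptide_score(peptide, scale):
    """Average propensity value over all residues of the peptide."""
    values = [scale[aa] for aa in peptide]
    return sum(values) / len(values)

# Placeholder values for four residues only (NOT a published scale).
toy_scale = {"A": 0.5, "D": 2.0, "K": 1.5, "L": -1.0}

print(peptide_score("ADKL", toy_scale))
```

Swapping in a different scale only requires passing a different dictionary, which mirrors how any of the bundled scales can replace the default one.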

Example 4: Peptide-based and residue-based linear B-cell epitopes

prediction using Parker’s propensity scale

1.  Run EpiT.

2.  Go to the Application menu and select the Predictor application.

3.  Press the Model button to view an open file dialog and use it to
select “./Examples/models/parker.model”.

4.  Press the Test button to view an open file dialog and use it to select
the file containing the test sequences in FASTA format,
“./Examples/data/test.fasta.txt”.

5.  Press the Output button to view a save file dialog and use it to
specify the path and the name of the file that the predictions will be
written to (e.g., “./Examples/parker.test.peptide.out.txt”).

6.  Set the peptide length to 14. Note that setting the window size to -1
when building parker.model allows us to evaluate it using any
peptide/window length. Otherwise, we have to use the exact size
that was specified during the training of the model.

7.  Press the Predict button to get predictions for each 14-mer peptide
in the test sequences.

8.  Change the instance type to residue-based.

9.  Set the window length to 7 (it has to be an odd number).

10. Set the output file to “parker.test.residue.out.txt”.

11. Press the Predict button to get prediction scores for each residue in
the test sequences (See Figure 7).
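
Residue-based prediction assigns a score to every residue from a window centered on it, which is why the window length must be odd. The sketch below illustrates this with an averaged propensity scale. Shrinking the window at the sequence ends is an assumption of this example (EpiT may treat the ends differently), and the scale values are placeholders.

```python
def residue_scores(sequence, scale, window=7):
    """Score each residue by the mean scale value in a window centered on it."""
    if window % 2 == 0:
        raise ValueError("window length must be odd")
    half = window // 2
    scores = []
    for i in range(len(sequence)):
        # Assumption: the window is truncated near the ends of the sequence.
        start = max(0, i - half)
        end = min(len(sequence), i + half + 1)
        segment = sequence[start:end]
        scores.append(sum(scale[aa] for aa in segment) / len(segment))
    return scores

# Placeholder values (NOT a published scale).
toy_scale = {"A": 0.5, "D": 2.0, "K": 1.5, "L": -1.0}
print(residue_scores("ADKLA", toy_scale, window=3))
```

With an odd window, each residue has the same number of neighbours on both sides, so the score can be attributed unambiguously to the central position.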


Figure 7: Residue-based classification using parker.model.

Example 5: Developing a Naïve Bayes classifier for predicting linear B-

cell epitopes using amino acid composition information

Because the majority of the algorithms implemented in Weka, including the
Naïve Bayes classifier, cannot be applied to datasets with string attributes,
EpiT provides a set of filters for converting epitope sequences into feature
vectors.

1-  Run EpiT

2-  Go to Application menu and select Model builder application. A
modified version of Weka explorer will be displayed.

3-  Press the open file button and use the open file dialog to open
“./Examples/data/BCPred20.nr80.arff”.

4-  Click the Classify tab.

5-  Click Choose button to select the classification method and select

weka.classifiers.meta.FilteredClassifier.


6-  Left-click on the classifier name to edit the FilteredClassifier
properties. Set the classifier to weka.classifiers.bayes.NaiveBayes. Set
the filter to epit.filters.unsupervised.attribute.SequenceComposition.
Click OK to close the properties window.

7-  Click Start button to begin a 10-fold cross-validation test to evaluate

the model on the BCPred 20-mer dataset.

8-  In the result panel, right click on the classifier name and select “Save

model” from the popup menu and save the model as

“./Examples/models/nbac.model”.
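
The conversion that makes a learner such as Naïve Bayes applicable can be illustrated as follows: each peptide string becomes a 20-dimensional vector holding the fraction of each amino acid. This is a minimal sketch of amino acid composition in general, not the SequenceComposition filter itself; the alphabet ordering is an assumption.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_composition(peptide):
    """Fraction of each of the 20 standard amino acids in the peptide."""
    return [peptide.count(aa) / len(peptide) for aa in AMINO_ACIDS]

vector = amino_acid_composition("AACD")
print(dict(zip(AMINO_ACIDS, vector)))
```

Because the vector has a fixed length regardless of the peptide length, the same representation also works for flexible-length epitopes.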

Example 6: Developing a consensus predictor for predicting flexible-

length linear B-cell epitopes

Let’s assume that we have several models for predicting flexible-length
linear B-cell epitopes. Our goal is to combine the predictions of these
models into a consensus prediction. In general, we expect the consensus
method combining several methods to outperform any individual method.
There are two ways of obtaining consensus predictions. First, one can use
the Predictor application to apply every individual model to the test data.
Then, the output predictions can be combined into a consensus prediction
(e.g., by importing the predictions into an Excel sheet and combining them
or by writing a simple script to combine these predictions). Second, one can
use the weka.classifiers.meta.Vote classifier and
epit.classifiers.meta.ModelBased to build a consensus predictor and use the
Predictor application to apply this consensus predictor to the test data.
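
For the first approach, a simple combining script might look like the sketch below. It averages the scores that several models assign to the same (identifier, position) pair and keeps only the pairs that all models predicted. Averaging raw scores assumes the models score on comparable scales, which may not hold in practice; the function name and the sample values are made up.

```python
def consensus(predictions_by_model):
    """Average scores over models.

    predictions_by_model: list of dicts mapping (identifier, position) -> score.
    Only keys present in every model's predictions are kept.
    """
    shared = set(predictions_by_model[0])
    for predictions in predictions_by_model[1:]:
        shared &= set(predictions)
    n_models = len(predictions_by_model)
    return {key: sum(p[key] for p in predictions_by_model) / n_models
            for key in sorted(shared)}

# Made-up scores from two models.
model_a = {("ag1", 1): 0.8, ("ag1", 2): 0.2}
model_b = {("ag1", 1): 0.4, ("ag1", 2): 0.6, ("ag1", 3): 0.9}
combined = consensus([model_a, model_b])
print(combined)
```

If the models use different score ranges, rescaling each model's scores (for example to ranks) before averaging is a sensible extra step.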

1-  Run EpiT

2-  Go to Application menu and select Model builder application. A

modified version of Weka explorer will be displayed.

3-  Press the open file button and use the open file dialog to open

“./Examples/data/BCPred20.nr80.arff”.

4-  Click the Classify tab.

5-  Click the Choose button to select the classification method and select
weka.classifiers.meta.Vote.

6-  Left-click on the classifier name to edit the Vote classifier properties.
For the classifiers property, add two epit.classifiers.meta.ModelBased
classifiers and set their ModelFile properties to
“./Examples/models/FBCPred.model” and
“./Examples/models/parker.model”, respectively.


7-  Select “use training set” as the test option and click the Start button
to begin evaluating the consensus model on the BCPred 20-mer dataset.
Note that FBCPred.model was built using the FBCPred dataset, and in
this example the consensus model is evaluated on the BCPred 20-mer
dataset. Because both datasets were extracted from the BciPep
database, the reported performance is expected to be overoptimistic.
If your goal is to evaluate a consensus model combining FBCPred and
Parker’s hydrophilicity scale, then you should use Vote to combine an
SMO classifier with a subsequence kernel (the FBCPred method) and a
PropensityScale classifier.

8-  In the result panel, right click on the classifier name and select “Save

model” from the popup menu and save the model as

“./Examples/models/consensus.model”.

Figure 8: Setting the properties of the Vote classifier.


Example 7: Using EpiT to build a hybrid predictor

Briefly, you can follow the approach described in Example 6 to use any

Weka meta-classifier to build a hybrid model combining several existing

models (each model will be encapsulated in a ModelBased classifier) or to

build and evaluate a hybrid model combining several prediction methods.

Updating an existing model

An interesting feature of EpiT is that it allows anyone to rebuild an existing
model. Assume that you have augmented the FBCPred dataset with newly
reported epitope data and your goal is to rebuild your own FBCPred model
with the modified dataset. Note that in Figure 1, the Predictor application
reports the classification method and the parameters that were used to
build the original FBCPred model. Therefore, to build your own updated
FBCPred model, you can use this information and the Model builder
application to evaluate and build your own model.

Extending EpiT

EpiT is an open source project under the GNU General Public License
(GPL). This ensures that anyone can freely extend or change this software as
long as the modified software is licensed under the GNU GPL. We
encourage bioinformatics developers to participate in EpiT by contributing
new components (e.g., filters or machine learning methods), new epitope
datasets in Weka-accepted formats, or new epitope prediction tools in the
form of model files.

References

[1] Witten, I., Frank, E., 2005. Data mining: Practical machine learning tools

and techniques, 2nd Edition. Morgan Kaufmann.

[2] Chen, J., Liu, H., Yang, J., Chou, K., 2007. Prediction of linear B-cell
epitopes using amino acid pair antigenicity scale. Amino Acids 33, 423–428.

[3] Cui, J., Han, L., Lin, H., Tan, Z., Jiang, L., Cao, Z., Chen, Y., 2006.

MHC-BPS: MHC binder prediction server for identifying peptides of 

flexible lengths from sequence derived physicochemical properties.

Immunogenetics 58, 607–613.


[4] EL-Manzalawy, Y., Dobbs, D., Honavar, V., 2008a. On Evaluating

MHC-II Binding Peptide Prediction Methods. PLoS ONE 3.

[5] Parker, J., Guo, D., Hodges, R., 1986. New hydrophilicity scale derived
from high-performance liquid chromatography peptide retention data:
correlation of predicted surface residues with antigenicity and x-ray-derived
accessible sites. Biochemistry 25, 5425–5432.

[6] Henikoff, J., Henikoff, S., 1996. Using substitution probabilities to
improve position-specific scoring matrices. Bioinformatics 12, 135–143.

[7] EL-Manzalawy, Y., Dobbs, D., Honavar, V., 2009. Predicting MHC-II

binding affinity using multiple instance regression. Submitted to IEEE/ACM

Trans Comput Biol Bioinform.

[8] EL-Manzalawy, Y., Dobbs, D., Honavar, V., 2008c. Predicting linear B-

cell epitopes using evolutionary information. IEEE International Conference

on Bioinformatics and Biomedicine.

[9] EL-Manzalawy, Y., Dobbs, D., Honavar, V., 2008b. Predicting flexible

length linear B-cell epitopes. 7th International Conference on Computational

Systems Bioinformatics, 121–131.

[10] Saha, S., Raghava, G., 2006b. Prediction of continuous B-cell
epitopes in an antigen using recurrent neural network. Proteins 65, 40–48.

[11] EL-Manzalawy, Y., Dobbs, D., Honavar, V., 2008d. Predicting linear

B-cell epitopes using string kernels. J. Mol. Recognit. 21, 243–255.