
User Guide

Written by Yasser EL-Manzalawy

Copyright © Gennotate development team


Introduction

As large amounts of genome sequence data become available, reliable and efficient genome annotation tools for assigning biological interpretation to DNA sequences are increasingly needed. Although several computational genome annotation tools have been proposed, accurate and scalable genome annotation remains a major challenge.

A variety of knowledge-based, statistical, and machine learning methods have been

developed for many genome annotation tasks. They differ in terms of the training data sets

used to train the predictive models, the data representations (e.g., sequence features) used

for encoding the inputs and outputs (class labels) of the predictive models, the algorithms

used for building the predictors, and the validation data sets and the performance metrics

used to assess the effectiveness of the predictors. Often, the data sets, the implementations of the algorithms, and the data representations used are simply not available to the research community in a form that allows rigorous comparison of alternative

approaches. Yet, such comparisons are essential for determining the strengths and

limitations of existing approaches so that further research can be focused on improving

these methods. For example, some of the methods are accessible via the Internet as online

Web servers. Comparison of the underlying computational methods implemented by such

servers is not straightforward in the absence of access to implementations of the

algorithms and the precise data sets and data representations used. This is further

complicated by the fact that some of the servers often update the predictors periodically

using newly available data, newer computational methods, or data representations, making

it difficult to determine whether the reported or measured changes in predictive accuracy

stem from improvements in the methods, data representations, or better data sets.

What is Gennotate?

Gennotate is a platform for sharing data representations, predictors, and machine learning

algorithms for a broad range of gene structure prediction tasks.

Gennotate has two main components (see Figure 1):

1) Model builder, an application for building and evaluating predictors and serializing

these models in a binary format (model files).

2) Predictor, an application for applying a model to test data (e.g., sequences to be annotated).

The model builder application is an extension of WEKA [1], a widely used machine learning workbench supporting many standard machine learning algorithms. WEKA provides tools for data pre-processing, classification, regression, clustering, validation, and visualization. Furthermore, WEKA provides a framework for implementing new machine learning methods and data pre-processors. The model builder extends WEKA by adding a suite of data pre-processors (called filters


in WEKA) for converting molecular sequences into vectors of numerical features

such that WEKA supported methods can be applied to the data. The current

implementation supports filters for generating several of the widely used data

representations of molecular sequences. Once the sequences are converted into

numeric or nominal features, any suitable WEKA learner can be trained and

evaluated on that data set.

Model builder

The model builder extends WEKA with a variety of DNA sequence preprocessors (WEKA

filters) and a number of classification algorithms (e.g., classifiers based on Markov models).

With very few exceptions, the machine learning algorithms supported in WEKA cannot be applied directly to DNA sequence data. A preprocessing step that extracts features from the sequence data is therefore often required. The Gennotate model builder provides more than 30 implemented sequence- and structure-based DNA feature extraction methods.

Additionally, a filter called ConcatenateFilter generates new features based on the

combination of any set of Gennotate features. Table 1 summarizes the list of currently

implemented Gennotate filters. For detailed information about these filters, please check

the Gennotate API documentation available at

http://ailab.cs.iastate.edu/gennotate/javadoc/index.html.

Once the features have been extracted from DNA sequences, many WEKA supported

machine learning algorithms can be applied (including state-of-the-art algorithms for

classification, regression, clustering, and feature selection). In addition to the implementations bundled with WEKA, Gennotate can run any third-party extension of WEKA; the procedure is as simple as adding the extra jar files to the CLASSPATH when running Gennotate.

Figure 1: Gennotate model builder (left) and predictor (right).


The current implementation of Gennotate enriches WEKA with a number of classification

algorithms summarized in Table 2.

Table 1: List of Gennotate-supported filters

DDNAFilter: A filter for extracting dinucleotide structure features from DNA sequences.
DNA2Filter: A filter for converting a DNA sequence into a new sequence over an alphabet of dinucleotide symbols.
DNASeqToNominalFilter: A filter for converting a string attribute holding a DNA sequence into nominal attributes.
DNCFilter: A filter for converting a string attribute into 400 features representing dinucleotide compositions.
KMerFilter: A filter for converting a string attribute into numeric features representing the frequencies of its k-mer substrings.
MonoHBondFilter: A filter for extracting hydrogen-bond-based DNA structure features.
NCFilter: A filter for converting a string attribute into numeric features representing nucleotide compositions.
SubSequenceFilter: A filter for extracting a substring from a DNA sequence.
TRIDNAFilter: A filter for extracting trinucleotide structure features from DNA sequences.
ConcatenateFilter: A filter for concatenating the features produced by multiple filters.

Table 2: List of Gennotate classification algorithms

HMMClassifier: A classifier implementing a Hidden Markov Model built from sequence data.
IMMClassifier: A classifier implementing an Interpolated Markov Model built from sequence data.
MMClassifier: A classifier implementing a Markov Model built from sequence data.
BalancedClassifier: A meta-classifier for training a base classifier on an unbalanced data set.
ModelBased: A meta-classifier for performing classification/regression using a specified model file.

Predictor

The Predictor is a graphical user interface (GUI) for applying a saved prediction model to a test data set. Specifically, the user inputs the model file, the test data file, the output file name, the format of the test data (DNA fragments, one fragment per line, or FASTA sequences), the type of the problem (peptide-based or nucleotide-based), and the length of the peptide/window sequence. The output of the Predictor is a summary of the input model (model name, model parameters, and the name of the data set used to build the model) followed by the predictions. The predictions are four tab-separated columns (see Figure 2): the first column is the sequence identifier, the second and third columns are the position and the sequence of the predicted peptide/nucleotide, and the last column is the predicted score.

Figure 2: Example Predictor output.


Installing and running Gennotate

Gennotate is platform-independent since it is implemented in Java. To install Gennotate, download it from the project web site and unzip the compressed file. To run Gennotate, add all the jar files included in the lib folder to the CLASSPATH and run the gennotate.jar file.

For example, the following command sets the CLASSPATH and runs Gennotate on Windows

machines:

java -Xmx1024m -classpath "gennotate.jar;weka.jar" gennotate.gui.MainGUI

For Linux machines, replace ";" with ":":

java -Xmx1024m -classpath "gennotate.jar:weka.jar" gennotate.gui.MainGUI

Using Gennotate

In this section, we show several examples of how to use Gennotate to develop predictors

from DNA sequence data. For this purpose, we use two in-house data sets for predicting

sigma 70 promoters in E. coli:

1) Sigma70.arff is a non-redundant data set extracted from RegulonDB on June 24, 2013. The data set contains 579 promoter sequences published before April 2009. None of the 579 promoter sequences shares more than 45% similarity with any other sequence in the promoter data. There are also 579 non-promoter sequences, none of which shares more than 45% similarity with any promoter or non-promoter sequence.

2) Sigma70_test is a non-redundant data set extracted from RegulonDB on June 24, 2013. All promoter sequences were published after April 2009. The data set has 792 promoter and 792 non-promoter sequences, and none of the sequences shares more than 45% similarity with any other sequence. The test data is provided in two formats: 1) standard WEKA format (file Sigma70_test.arff); 2) one-fragment-per-line format (file Sigma70_test.txt).

Building your first predictor

Here, we show how to build your first predictor using Sigma70.arff data and HMMClassifier

and store it for future use on test data.

1. Run Gennotate

2. Go to Application menu and select model builder application.

3. In the model builder window (WEKA explorer augmented with Gennotate filters and

prediction methods) click open and select the file /Example/Data/Sigma70.arff.

4. Click the classify tab.

5. In the classifier panel, click choose and browse for HMMClassifier.

6. The HMMClassifier has two parameters: the input data alphabet (default ACGTN) and whether the input sequences have gaps (default false). Keep the default parameters and click OK.

7. Having both the data set and the classification algorithm specified, we are ready to

build the model and evaluate it using 10-fold cross-validation. Just click the Start button

and wait for the 10-fold cross-validation procedure to finish. The classifier output

shows several statistical estimates of HMMClassifier using 10-fold cross-validation.

For example, the accuracy and AUC of the model are 72.8% and 0.81, respectively.

8. To save the model, right click on the model in the Result list panel and select Save

model. Save your model as /Examples/Models/Sigma70HMM.model.
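The same workflow can also be scripted through the WEKA Java API instead of the Explorer GUI. The sketch below is not part of the Gennotate distribution: it assumes the HMM classifier is available on the CLASSPATH under a package name such as gennotate.classifiers.HMMClassifier (check the Gennotate javadoc for the actual class name), that paths are relative to the Gennotate installation folder, and that the class attribute is the last attribute in the ARFF file.

import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class BuildSigma70HMM {
    public static void main(String[] args) throws Exception {
        // Load the training data; the class label is assumed to be the last attribute.
        Instances data = DataSource.read("Examples/Data/Sigma70.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Hypothetical package path for the Gennotate HMM classifier; check the
        // Gennotate javadoc for the actual class name.
        Classifier hmm = (Classifier) Class
                .forName("gennotate.classifiers.HMMClassifier").newInstance();

        // 10-fold cross-validation, as in the GUI walkthrough above.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(hmm, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println("AUC: " + eval.areaUnderROC(0));

        // Train on the full data set and serialize the model for later use.
        hmm.buildClassifier(data);
        SerializationHelper.write("Examples/Models/Sigma70HMM.model", hmm);
    }
}

Compile and run this with gennotate.jar and weka.jar on the CLASSPATH, exactly as in the command shown in the previous section.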


Applying your model to test data

There are several ways to apply your model to test data. First, if your test data are stored in WEKA format, you can use the model builder directly to apply the model to the test data and obtain predictions and some performance measures. To do so, follow these steps:

1. In the Test options panel, click Supplied test set and then click Set to specify the test data file /Examples/Models/Sigma70_test.arff.

2. Right-click on the Result list panel and select Load model to load /Examples/Models/Sigma70HMM.model. After the model is successfully loaded, the classifier output shows information about the training data, the algorithm, and its parameters.


3. By default, the WEKA Explorer does not output predictions. To output predictions, click More options and check the Output predictions option.

4. Click Start and wait for the model to be evaluated on the test data. The classifier output panel will then display the predictions and some performance evaluation measures.
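This first method can also be scripted through the WEKA API. The following minimal sketch uses only standard WEKA classes; it assumes the paths used above (relative to the Gennotate installation folder) and that the class attribute is the last attribute. Note that the Evaluation object takes its class priors from the test set here, which is a simplification.

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class ApplySavedModel {
    public static void main(String[] args) throws Exception {
        // Load the serialized model saved from the model builder.
        Classifier model = (Classifier) SerializationHelper
                .read("Examples/Models/Sigma70HMM.model");

        // Load the test data in WEKA (ARFF) format.
        Instances test = DataSource.read("Examples/Models/Sigma70_test.arff");
        test.setClassIndex(test.numAttributes() - 1);

        // Evaluate the already trained model on the supplied test set.
        Evaluation eval = new Evaluation(test);
        eval.evaluateModel(model, test);
        System.out.println(eval.toSummaryString());
        System.out.println("AUC: " + eval.areaUnderROC(0));
    }
}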

Second, if your test data are in a Gennotate-supported format (e.g., FASTA or one DNA fragment per line), you can use the Predictor application to apply a saved model and get predictions. For example, to apply Sigma70HMM.model to the test data in /Examples/Data/Sigma70_test.txt, follow these steps:

1. Run Predictor from Application menu.


2. Specify your input and output files as in the figure below.

3. Click Predict and wait to see the output in the Predictions panel and also in the

output file /Examples/Output/Sigma70_test_out.txt.


Case Study 1: Predicting promoter regions in E. coli using sequence and structure features

In the previous section, we showed how to build an HMM model for predicting sigma 70 promoters in E. coli. A major difference between HMMClassifier and traditional classifiers such as Naïve Bayes (NB) and Random Forest is that HMMClassifier can be applied directly to sequence data, while traditional classifiers expect the data in the form of feature vectors extracted from the original sequences. Here, we show how to simultaneously extract features from sequence data and build/test a model, thanks to the WEKA FilteredClassifier, which allows us to specify a machine learning algorithm and a filter to be applied on the fly before the data are fed to the predictor.

To build an NB classifier using 3-mer features, follow these steps:

1. Run Gennotate

2. Go to Application menu and select model builder application.

3. In the model builder window (WEKA explorer augmented with Gennotate filters

and prediction methods) click open and select the file

/Example/Data/Sigma70.arff.

4. Click classify tab.

5. In classifier panel click choose and browse for

weka.classifiers.meta.FilteredClassifier.

6. Click on the classifier schema in classifier panel to get the following window.


7. Change the classifier to weka.classifiers.bayes.NaiveBayes (with its default

parameters) and the filter to gennotate.filters.unsupervised.KMerFilter (set k

parameter to 3). Click OK.

8. Click Start to run the 10-fold cross-validation experiment. The following figure

shows the result of our experiment.
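The same FilteredClassifier setup can be expressed in code. The class name gennotate.filters.unsupervised.KMerFilter is taken from the steps above, but how its k parameter is set programmatically is not documented here, so that detail is left as a comment; treat this as an illustrative sketch rather than the definitive API.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;

public class KMerNaiveBayes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("Examples/Data/Sigma70.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Instantiate the Gennotate k-mer filter by name (class name from the guide).
        Filter kmer = (Filter) Class
                .forName("gennotate.filters.unsupervised.KMerFilter").newInstance();
        // The guide sets k to 3 through the GUI; the corresponding programmatic
        // setter/option is not documented here and is therefore omitted.

        // FilteredClassifier applies the filter on the fly before training NB.
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(kmer);
        fc.setClassifier(new NaiveBayes());

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(fc, data, 10, new Random(1));
        System.out.println("AUC: " + eval.areaUnderROC(0));
    }
}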

You can repeat the preceding procedure for different choices of classifiers and Gennotate filters. Table 3 compares Naïve Bayes (NB) and Random Forest with 50 trees (RF50) for k = 1, 2, 3, and 4. Interestingly, none of these classifiers is competitive with the HMM classifier, which achieved an AUC of 0.81 on the same data set.


Table 3: Performance (in terms of AUC score) comparison of NB and RF50 on Sigma70 data

using different sequence-based features.

Features NB RF50

1-mer 0.64 0.58

2-mer 0.65 0.66

3-mer 0.65 0.67

4-mer 0.65 0.66

To build models using structure features, follow the preceding procedure and replace KMerFilter with DDNAFilter, which allows us to experiment with 12 different dinucleotide structure-based features [2] (see the Gennotate API documentation for detailed information about these methods). Table 4 compares NB and RF50 using 10-fold cross-validation and different structure-based features extracted from the Sigma70.arff data. In several cases, structure-based features helped us reach a performance that is competitive with the HMM classifier. RF50 seems to do better than NB; however, it should be noted that the number of trees was arbitrarily set to 50, and there could be room for improvement using larger numbers of trees (we leave this as an exercise for the user). For future experiments, let's save the best model in Table 4 as /Examples/Models/Sigma70_Stability_RF50.model.

Table 4: Performance (in terms of AUC score) comparison of NB and RF50 on Sigma70 data

using twelve different dinucleotide structure-based features.

Features NB RF50

DI_APHYLICITY 0.68 0.68

DI_BDNATWISTOHLER 0.62 0.68

DI_BDNATWISTOLSON 0.64 0.77

DI_DNABENDSTIFF 0.78 0.76

DI_DNADENATURE 0.76 0.77

DI_ZDNASTABENERGY 0.78 0.79

DI_DUPLEXSTAB_DISRUPTENERGY 0.74 0.78

DI_DUPLEXSTAB_FREEENERGY 0.77 0.77

DI_PINDUCEDDEFORM 0.74 0.77

DI_PROPELLERTWIST 0.75 0.75

DI_PROTEINDNATWIST 0.64 0.65

DI_STACKINGENERGY 0.77 0.77
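If you prefer to build and save this model from code rather than the GUI, a minimal sketch follows. The DDNAFilter package name and the way its ConversionTable parameter is selected are assumptions (the guide only shows the GUI); consult the Gennotate javadoc for the actual names.

import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;

public class SaveStabilityModel {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("Examples/Data/Sigma70.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Package name assumed; configure the ConversionTable that gave the best
        // AUC in Table 4 through its setter (name not documented here).
        Filter ddna = (Filter) Class
                .forName("gennotate.filters.DDNAFilter").newInstance();

        RandomForest rf = new RandomForest();
        rf.setNumTrees(50); // RF50; in recent WEKA versions this setter is setNumIterations

        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(ddna);
        fc.setClassifier(rf);

        // Train on the full training set and serialize the model for reuse.
        fc.buildClassifier(data);
        SerializationHelper.write("Examples/Models/Sigma70_Stability_RF50.model", fc);
    }
}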


Case Study 2: Improved prediction of promoter regions in E. coli

In Case Study 1, we evaluated the prediction of sigma 70 promoters in E. coli using twelve different methods for extracting dinucleotide features. In general, better performance can be achieved by: 1) combining a set of these features; 2) building an ensemble of classifiers where each base classifier is trained on a different set of structure-based features; 3) combining all 12 sets of structure features and using a feature selection method to find an optimal subset of features. Here, we show how to use Gennotate to build improved predictors using these three approaches.

Concatenating features

To build a single classifier that takes as input the features extracted by the twelve different dinucleotide feature methods, use gennotate.filters.ConcatenateFilter.

1. Run Gennotate

2. Go to Application menu and select model builder application.

3. In the model builder window (WEKA explorer augmented with Gennotate filters

and prediction methods) click open and select the file

/Example/Data/Sigma70.arff.

4. Click classify tab.

5. In classifier panel click choose and browse for

weka.classifiers.meta.FilteredClassifier.

6. Click on the classifier schema in classifier panel to get the following window.


7. Change the classifier to weka.classifiers.trees.RandomForest (set the number of trees to 50) and the filter to gennotate.filters.ConcatenateFilter.

8. Click on the ConcatenateFilter and input the twelve filters (i.e., DDNAFilter with twelve different selections of the ConversionTable parameter).


9. Click Start to run the 10-fold cross-validation experiment.

The following figure shows the cross-validation performance of the predictor using the twelve combined sets of structure features. The result is better than that obtained with any single set of structure features.
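For readers who want to script this configuration, the sketch below mirrors the steps above. It assumes that gennotate.filters.ConcatenateFilter exposes a setFilters(Filter[]) method analogous to WEKA's MultiFilter, and it leaves the ConversionTable setter of DDNAFilter (whose name is not documented here) as a placeholder comment; check both assumptions against the Gennotate javadoc.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;

public class ConcatenatedStructureFeatures {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("Examples/Data/Sigma70.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Twelve DDNAFilter instances, one per conversion table (package name assumed).
        Filter[] ddnaFilters = new Filter[12];
        for (int i = 0; i < ddnaFilters.length; i++) {
            ddnaFilters[i] = (Filter) Class
                    .forName("gennotate.filters.DDNAFilter").newInstance();
            // ddnaFilters[i].setConversionTable(...); // hypothetical setter
        }

        // Assumed: ConcatenateFilter exposes setFilters(Filter[]) like MultiFilter.
        Filter concat = (Filter) Class
                .forName("gennotate.filters.ConcatenateFilter").newInstance();
        concat.getClass().getMethod("setFilters", Filter[].class)
                .invoke(concat, (Object) ddnaFilters);

        RandomForest rf = new RandomForest();
        rf.setNumTrees(50); // RF50

        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(concat);
        fc.setClassifier(rf);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(fc, data, 10, new Random(1));
        System.out.println("AUC: " + eval.areaUnderROC(0));
    }
}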


Concatenating features and selecting an optimal subset of features

In the preceding experiment, we showed that working with a concatenation of the twelve sets of features can improve the performance of the resulting model. In general, this high-dimensional feature space might contain several irrelevant and/or redundant features. Here, we show how to use WEKA feature selection with our ConcatenateFilter to further improve the performance of the resulting model.

1. Follow steps 1-6 in the previous experiment

2. In the FilteredClassifier window, choose RF50 as your classifier and choose weka.filters.MultiFilter as your filter.

3. Click on the MultiFilter and input two filters: i) a ConcatenateFilter with twelve DDNAFilters, each with a different choice of ConversionTable; ii) an AttributeSelection filter.

4. For the AttributeSelection filter, you can choose from the large pool of WEKA-provided feature selection methods and search algorithms. For our experiment, we ranked all features based on information gain and used the 20 top-ranked features.

5. Click Start and wait for the cross-validation results. The output will show the top 100 features used to build the model (see Table 5 for the top 20 features) as well as some performance measures. Interestingly, the model has a better AUC (0.83) than the model that uses all of the features (0.80).


Table 5: List of top 20 structure features

DI_DUPLEXSTAB_FREEENERGY48

DI_DNABENDSTIFF48

DI_DUPLEXSTAB_DISRUPTENERGY48

DI_DUPLEXSTAB_FREEENERGY49

DI_PINDUCEDDEFORM49

DI_DNABENDSTIFF49

DI_DUPLEXSTAB_FREEENERGY50

DI_DUPLEXSTAB_FREEENERGY47

DI_DNABENDSTIFF47

DI_STACKINGENERGY48

DI_DUPLEXSTAB_DISRUPTENERGY49

DI_DNADENATURE48

DI_DNABENDSTIFF50

DI_ZDNASTABENERGY49

DI_ZDNASTABENERGY50

DI_DUPLEXSTAB_DISRUPTENERGY47

DI_STACKINGENERGY47

DI_DUPLEXSTAB_DISRUPTENERGY50

DI_DNADENATURE50

DI_STACKINGENERGY49
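The feature selection setup from the preceding steps can also be expressed in code. The sketch below uses standard WEKA classes (MultiFilter, AttributeSelection, InfoGainAttributeEval, and Ranker); the ConcatenateFilter is instantiated by name and is assumed to have been configured with the twelve DDNAFilters as in the previous sketch.

import java.util.Random;

import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.MultiFilter;
import weka.filters.supervised.attribute.AttributeSelection;

public class ConcatenateAndSelect {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("Examples/Data/Sigma70.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // ConcatenateFilter configured with the twelve DDNAFilters (details omitted;
        // see the previous sketch).
        Filter concat = (Filter) Class
                .forName("gennotate.filters.ConcatenateFilter").newInstance();

        // Rank the generated features by information gain and keep the top 20.
        AttributeSelection select = new AttributeSelection();
        select.setEvaluator(new InfoGainAttributeEval());
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(20);
        select.setSearch(ranker);

        // Chain the two filters: concatenation first, then attribute selection.
        MultiFilter multi = new MultiFilter();
        multi.setFilters(new Filter[] {concat, select});

        RandomForest rf = new RandomForest();
        rf.setNumTrees(50); // RF50

        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(multi);
        fc.setClassifier(rf);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(fc, data, 10, new Random(1));
        System.out.println("AUC: " + eval.areaUnderROC(0));
    }
}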

Improved prediction of sigma 70 promoters using an ensemble of classifiers

In this experiment, we build a number of classifiers using RF50 and different choices of the dinucleotide structure-based features. The base classifiers will then be combined using a second-stage classifier, the WEKA Logistic classifier.

1. Run Gennotate

2. Go to Application menu and select model builder application.

3. In the model builder window (WEKA explorer augmented with Gennotate filters

and prediction methods) click open and select the file

/Example/Data/Sigma70.arff.

4. Click classify tab.

5. In the classifier panel, click choose and browse for weka.classifiers.meta.Stacking. Set numFolds to 3, set the metaClassifier to weka.classifiers.functions.Logistic, and input 12 classifiers, each a FilteredClassifier with RF50 and a different choice of ConversionTable for the DDNAFilter.


6. Click Start to run the 10-fold cross-validation experiment. The following figure

shows the result of our experiment.
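A programmatic version of this stacking experiment is sketched below. As before, the DDNAFilter package name and its ConversionTable setter are assumptions; Stacking, Logistic, and RandomForest and their setters are standard WEKA classes.

import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.Logistic;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.meta.Stacking;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;

public class StructureFeatureStacking {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("Examples/Data/Sigma70.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // One base classifier per structure feature set: RF50 wrapped in a
        // FilteredClassifier with a DDNAFilter (package name assumed).
        Classifier[] base = new Classifier[12];
        for (int i = 0; i < base.length; i++) {
            Filter ddna = (Filter) Class
                    .forName("gennotate.filters.DDNAFilter").newInstance();
            // ddna.setConversionTable(...); // hypothetical setter, one table per base classifier

            RandomForest rf = new RandomForest();
            rf.setNumTrees(50);

            FilteredClassifier fc = new FilteredClassifier();
            fc.setFilter(ddna);
            fc.setClassifier(rf);
            base[i] = fc;
        }

        // Combine the base classifiers with a Logistic second-stage classifier.
        Stacking stack = new Stacking();
        stack.setClassifiers(base);
        stack.setMetaClassifier(new Logistic());
        stack.setNumFolds(3);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(stack, data, 10, new Random(1));
        System.out.println("AUC: " + eval.areaUnderROC(0));
    }
}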


Case Study 3: Improved prediction of promoter regions in E. coli using meta-predictors

An interesting property of Gennotate is that it allows sharing not only data sets but also learned models. Once you have a number of different predictors for the same classification task, you can: 1) use the Predictor application to apply any of these predictors to some test data; 2) rebuild the prediction model using updated/different training data; 3) build a consensus or hybrid predictor that combines these predictors. The first usage was shown earlier. The second usage can be done simply by loading the new training data, loading the current model, and performing a cross-validation experiment; the results will show the performance of the new model, which can also be saved as a model file for further use. The third usage is the focus of this case study.

To facilitate the development of a consensus/hybrid predictor that relies on existing predictors, not necessarily developed by the same user, Gennotate provides a meta-classifier called ModelBased. In the following experiment, we show how to use the ModelBased classifier to build a consensus predictor combining the Sigma70HMM and Sigma70_Stability_RF50 models developed earlier.

1. Run Gennotate

2. Go to Application menu and select model builder application.

3. In the model builder window (WEKA explorer augmented with Gennotate filters

and prediction methods) click open and select the file

/Example/Data/Sigma70_test.arff. Please note that our goal is to combine existing models, so there is no need to retrain them; instead, we will use the test data to evaluate the combination of these predictors.

4. Click classify tab.

5. In the classifier panel, click choose and browse for weka.classifiers.meta.Vote.


6. Input two classifiers, each using gennotate.classifiers.meta.ModelBased, and set the modelFile parameter as shown in the following figure.

7. In the Test options panel, choose Use training set. Note that the ModelBased classifier does not perform any training; it simply loads the model and keeps it for predictions. Hence, what is reported is the performance of applying the models encapsulated within the ModelBased classifiers to what is supplied as training data.

The performance obtained by the consensus predictor combining the HMM model and the RF50 model is almost the same as that of the HMM alone (an AUC of 0.81). In practice, we expect improvements in performance when several (not just two) predictors are combined.
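The consensus predictor can also be assembled in code. The sketch below uses the gennotate.classifiers.meta.ModelBased class name given above, but the name and signature of its modelFile setter are not documented here, so a hypothetical setModelFile(String) is invoked via reflection; adjust this to the actual API.

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.Vote;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ConsensusPredictor {
    public static void main(String[] args) throws Exception {
        // The held-out test set is used for evaluation; no retraining is involved.
        Instances test = DataSource.read("Example/Data/Sigma70_test.arff");
        test.setClassIndex(test.numAttributes() - 1);

        // Two ModelBased wrappers, one per saved model.
        String[] models = {
                "Examples/Models/Sigma70HMM.model",
                "Examples/Models/Sigma70_Stability_RF50.model"};
        Classifier[] members = new Classifier[models.length];
        for (int i = 0; i < models.length; i++) {
            Classifier mb = (Classifier) Class
                    .forName("gennotate.classifiers.meta.ModelBased").newInstance();
            mb.getClass().getMethod("setModelFile", String.class)
                    .invoke(mb, models[i]); // hypothetical setter
            members[i] = mb;
        }

        // Vote averages the members' probability estimates by default.
        Vote vote = new Vote();
        vote.setClassifiers(members);
        vote.buildClassifier(test); // ModelBased members only load their saved models

        Evaluation eval = new Evaluation(test);
        eval.evaluateModel(vote, test);
        System.out.println("AUC: " + eval.areaUnderROC(0));
    }
}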

Please note that we can build a hybrid model using the HMM and RF50 models simply by following the preceding procedure and replacing the Vote classifier with the Stacking classifier. However, in that case the user might perform a cross-validation test, and the results should be handled with caution because the test data have been used to train the meta-predictor in the Stacking classifier.


Extending Gennotate

Gennotate is extensible, in the sense that anyone can add extra filters or classification methods. To add your own classification methods or filters, follow the procedure described in the WEKA documentation on writing your own classifier or filter. Once you have a jar file containing your added components, just add it to the CLASSPATH when running Gennotate and enjoy your customized version of Gennotate.
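As a starting point, the skeleton below shows the minimal extension points WEKA expects from a new classifier; it is a toy majority-class predictor, not a Gennotate component, and omits the capabilities and option-handling code described in the WEKA documentation. Depending on your WEKA version, the base class may be weka.classifiers.Classifier rather than AbstractClassifier. Package the compiled class into a jar and add it to the CLASSPATH as described above.

import weka.classifiers.AbstractClassifier;
import weka.core.Instance;
import weka.core.Instances;

// A toy classifier that always predicts the majority class of the training data.
public class MajorityClassifier extends AbstractClassifier {

    private static final long serialVersionUID = 1L;
    private double majorityClass;

    @Override
    public void buildClassifier(Instances data) throws Exception {
        int[] counts = new int[data.numClasses()];
        for (int i = 0; i < data.numInstances(); i++) {
            counts[(int) data.instance(i).classValue()]++;
        }
        int best = 0;
        for (int c = 1; c < counts.length; c++) {
            if (counts[c] > counts[best]) {
                best = c;
            }
        }
        majorityClass = best;
    }

    @Override
    public double classifyInstance(Instance instance) {
        return majorityClass;
    }
}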

References

[1] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining

software: an update. ACM SIGKDD Explorations Newsletter, 11(1), 10-18.

[2] Gan, Y., Guan, J., & Zhou, S. (2012). A comparison study on feature selection of DNA structural properties

for promoter prediction. BMC Bioinformatics, 13(1), 4.