
[IEEE 2014 International Conference on Informatics, Electronics & Vision (ICIEV) - Dhaka, Bangladesh (2014.5.23-2014.5.24)] 2014 International Conference on Informatics, Electronics


Molecular Classification of Newcastle Disease Virus Based on Degree of Virulence

Saleh Esmate Aly∗, Hanaa Ismail Elshazly†, Ahmed Fouad Ali‡, Hussein Ali Hussein∗, Aboul Ella Hassanien†, Gerald Schaefer§ and Md. Atiqur Rahman Ahad¶

∗ Virology Department, Faculty of Veterinary Medicine, Cairo University, Egypt

† Faculty of Computers and Information, Cairo University, Egypt

‡ Department of Computer Science, Faculty of Computers and Information, Suez Canal University, Egypt

§ Department of Computer Science, Loughborough University, U.K.

¶ University of Dhaka, Dhaka, Bangladesh

Abstract—Newcastle disease (ND) is one of the most serious infectious diseases of poultry and has an important economic impact on poultry production. The causative agent of the disease is Newcastle disease virus (NDV). NDV strains can be classified into two types according to virulence, namely highly virulent (velogenic) and low virulent (lentogenic), based on their pathogenicity in chickens. In this paper, we address the problem of classifying newly isolated NDV sequences according to the fusion protein. To classify the degree of virulence of the virus, we propose a new approach based on the rotation forest algorithm. The performance of our approach is evaluated and compared with three benchmark algorithms. The results show that our proposed algorithm is able to achieve perfect recognition on the benchmark dataset.

I. INTRODUCTION

The fusion protein cleavage site (FPCS) has been postulated to be a primary determinant of the degree of NDV virulence, so that any isolated sequence can be classified according to the amino acids at the cleavage site. Fig. 1 shows a subset of fusion protein sequences collected from GenBank. To understand viral virulence, attention must be given to its molecular basis: studying the nucleotide sequence of the virus makes it possible to assess whether the virus has the genetic makeup to be highly pathogenic for poultry [1]. NDV strains can be categorised into two types according to virulence, highly virulent (velogenic) and low virulent (lentogenic), based on their pathogenicity in chickens [2], [3]. The degree of pathogenicity is closely related to the amino acid sequence motif present at the cleavage site of the precursor fusion protein (F0) and to the ability of cellular proteases to cleave the F0 protein of different pathotypes [1], [4], [5]. The fusion glycoprotein is synthesised as a fusion-inactive precursor (F0) and must be cleaved into F1 and F2 by host proteases to become fusion-active [6].

Sequence analysis of the F protein cleavage site can be used to predict the potential pathogenicity of NDV instead of conventional methods such as the mean death time (MDT) and intracerebral pathogenicity index (ICPI) tests [5]. The virulence of NDV is known to be associated with differences in the amino acid sequence surrounding the post-translational cleavage site of the F0 protein, with differences in the cleavage sites being directly related to the virulence of the strain. Most viruses that are highly virulent for chickens have the amino acid sequence 112R/K-R-Q-K/R-R116 at the C-terminus of the F2 protein and F (phenylalanine) at residue 117, the N-terminus of the F1 protein. In contrast, viruses of low virulence have the sequence 112G/E-K/R-Q-G/E-R116 in the same region and L (leucine) at residue 117 [1], [4], [5].

Fig. 1. Amino acid sequences at cleavage site from site 112 to 117.
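The two motifs quoted above can be written as simple patterns over residues 112 to 117. The following is a minimal sketch, assuming the six residues have already been extracted as a one-letter amino acid string; the function name `classify_cleavage_site` and the "undetermined" fallback are illustrative assumptions, and this fixed rule is not the trained classifier proposed in the paper.

```python
import re

# Patterns transcribed from the motifs quoted in the text (residues 112-117).
# Highly virulent: R/K-R-Q-K/R-R followed by F at residue 117.
VIRULENT = re.compile(r"^[RK]RQ[KR]RF$")
# Low virulent: G/E-K/R-Q-G/E-R followed by L at residue 117.
LOW_VIRULENT = re.compile(r"^[GE][KR]Q[GE]RL$")


def classify_cleavage_site(site: str) -> str:
    """Label a six-residue cleavage site string (positions 112-117)."""
    site = site.strip().upper()
    if VIRULENT.match(site):
        return "velogenic"
    if LOW_VIRULENT.match(site):
        return "lentogenic"
    # Sites matching neither motif are left unlabelled (illustrative choice).
    return "undetermined"


labels = [classify_cleavage_site(s) for s in ("RRQKRF", "GRQGRL", "AAAAAA")]
```

A rule of this kind only covers the "most viruses" case described in the text; sites that deviate from both motifs are exactly where a learned classifier may be more useful.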

In this paper, to classify the degree of virulence of NDV, we apply classification based on the rotation forest algorithm to a collection of sequences from GenBank at the National Center for Biotechnology Information (NCBI). Rotation forest is a relatively recent ensemble classifier. The ensemble concept is based on the idea that each predictor makes different errors, so that combining their outputs can cancel out individual errors and provide more accurate results. Empirical and theoretical studies [7], [8] have shown that approaches following the ensemble classifier concept improve performance. Multiple classifier systems can outperform a single classifier in terms of both accuracy and robustness [9], and it has been shown that this improvement is also related to the diversity in an ensemble scheme [10].

3rd INTERNATIONAL CONFERENCE ON INFORMATICS, ELECTRONICS & VISION 2014

978-1-4799-5180-2/14/$31.00 ©2014 IEEE


II. RELATED WORK

Machine learning techniques have made important contributions to a variety of fields. Medicine is one field that can benefit from the application of data mining techniques, in particular pattern classification techniques, for diagnosis, prognosis, screening, etc.

Polat and Gunes [11] proposed a hybrid classification system based on C4.5 decision tree classifiers and a one-against-all approach to classify multi-class problems. Here, C4.5 was initially executed for all the classes of three datasets from the UCI repository (dermatology, image segmentation, and lymphography), and 10-fold cross-validation (10CV) classification accuracies of 84.48%, 88.79%, and 80.11% were observed. In contrast, the proposed method gave classification results of 96.71%, 95.18%, and 87.95%, respectively, on the same datasets.

Abellán and Masegosa [12] presented a study using decision trees built using imprecise probabilities and uncertainty measures, called bagging credal decision trees (B-CDT). The classification accuracy on the UCI lymphography dataset reached 76.96% without pruning and 77.51% with pruning.

Hassanien et al. [13] presented a rough set approach to feature reduction and generation of classification rules from a set of medical datasets. A set of data samples of patients with suspected breast cancer was used for evaluation. Ilczuk and Wakulicz-Deja [14] proposed a system to visualise decision rules for medical diagnosis in the form of decision trees.

Recently, Elshazly et al. [15] presented a comparative study of three classifiers, namely decision rules, k-nearest neighbour (kNN) and naive Bayes, for classification of the UCI diagnostic breast cancer dataset. They tested the effect of discretisation techniques and of genetic algorithms (GAs) for extracting reducts on the generalisation performance. Results showed the efficiency of the hybrid system, which combines GA with rough set (RS) decision rules associated with Boolean reasoning or equal width binning for discretisation, achieving 95% diagnostic accuracy.

III. MATERIALS AND METHODS

In this paper, we propose a new algorithm to classify the Newcastle dataset, which consists of 213 NDV samples of different types of virulence obtained from the GenBank database of the National Center for Biotechnology Information (NCBI). A representative sample is shown in Fig. 1. The data samples are assigned to two classes, namely low virulent (52%) and high virulent (48%). The main components of our proposed algorithm are introduced below, before we present the proposed algorithm at the end of the section.

A. Rotation forest

Rotation forest (ROT) is a multiple classifier system (MCS) built upon the random forest classifier concept [16]. Its objective is to improve both the accuracy and the diversity of the classifier. The algorithm is initialised by extracting a specific number of bootstrap samples from the original training dataset. The extracted samples are then used to form new training subsets, each of which is projected into a new feature space using a linear transformation method. This process is repeated for each subset. The rotated features of all subsets are then pooled, following the order of the original features, to establish the new training set. Each classifier is trained on its own rotated feature space, so the transformed features yield new, diverse trees. Because the decision tree learning algorithm constructs classification regions using hyperplanes parallel to the feature axes, even a small rotation of the feature axes can generate a completely different tree [17].

The feature split, the bootstrapping process and the transformation process significantly increase the diversity of the ensemble classifier, since different splits lead to significantly different trees. Various transformation methods have been applied in the literature, such as principal component analysis (PCA), non-parametric discriminant analysis (NDA), random projections (RP) and independent component analysis (ICA). Kuncheva and Rodríguez [17] found that PCA gave the best results and preserved discriminatory features.

Let L = ((M1, N1) · · · (Mn, Nn)) be a dataset described by n features, X be an N × n matrix of training samples, Y be the corresponding labels and F the feature set. With D the number of classifiers and M the number of feature subsets, the rotation forest algorithm is detailed in Algorithm 1.

Algorithm 1 Rotation forest algorithm
1: for each base classifier D do
2:   Split F into M subsets.
3:   for i = 1 to M do
4:     Extract a bootstrap sample of 75% from M[i].
5:     Rotate the bootstrap sample using a transformation method.
6:     Extract PCA coefficients COF.
7:     Arrange COF in a rotation matrix R(D).
8:   end for
9:   Construct each column in R(D) according to original feature sequence in F to give NTS.
10:  Use NTS to build the base classifier.
11: end for
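Steps 2 to 9 of Algorithm 1, for a single base classifier, can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the paper: the helper names (`pca_axes`, `rotation_matrix`) are hypothetical, the PCA axes are obtained with a small Jacobi eigenvalue routine in place of a library call, the features are split by a random shuffle, and the final step of training a decision tree on the rotated data is omitted.

```python
import math
import random


def matmul(x, y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def transpose(x):
    return [list(col) for col in zip(*x)]


def covariance(rows):
    """Sample covariance matrix of equal-length rows."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in rows) / (n - 1)
             for b in range(d)] for a in range(d)]


def pca_axes(sym, sweeps=30):
    """Eigenvectors (as columns) of a symmetric matrix via Jacobi rotations."""
    d = len(sym)
    a = [row[:] for row in sym]
    v = [[float(i == j) for j in range(d)] for i in range(d)]
    for _ in range(sweeps):
        for p in range(d - 1):
            for q in range(p + 1, d):
                if abs(a[p][q]) < 1e-12:
                    continue
                # Angle that zeroes the off-diagonal entry a[p][q].
                theta = 0.5 * math.atan2(2 * a[p][q], a[q][q] - a[p][p])
                j = [[float(r == c) for c in range(d)] for r in range(d)]
                j[p][p] = j[q][q] = math.cos(theta)
                j[p][q], j[q][p] = math.sin(theta), -math.sin(theta)
                a = matmul(matmul(transpose(j), a), j)
                v = matmul(v, j)
    return v


def rotation_matrix(X, n_subsets, rng, sample_fraction=0.75):
    """Build one rotation matrix R: split features, bootstrap, PCA, reassemble."""
    n_rows, n_features = len(X), len(X[0])
    features = list(range(n_features))
    rng.shuffle(features)                                   # random feature split
    subsets = [features[i::n_subsets] for i in range(n_subsets)]
    R = [[0.0] * n_features for _ in range(n_features)]
    for idx in subsets:
        size = max(2, int(sample_fraction * n_rows))        # 75% bootstrap sample
        sample = [X[rng.randrange(n_rows)] for _ in range(size)]
        block = pca_axes(covariance([[row[f] for f in idx] for row in sample]))
        # Place the PCA coefficients at the original feature positions.
        for a, fa in enumerate(idx):
            for b, fb in enumerate(idx):
                R[fa][fb] = block[a][b]
    return R


rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(12)]  # toy data
R = rotation_matrix(X, n_subsets=2, rng=rng)
X_rotated = matmul(X, R)   # rotated training set for one base classifier
```

Because each block is a product of plane rotations, R is orthogonal, so the rotated set keeps all the information in X while presenting a different set of axes to each tree. Repeating the call with fresh random state gives the per-classifier diversity the text describes.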

Rotation forest is typically able to yield more accurate results compared to other ensemble classifiers such as bagging [8], boosting [18] and random forest [19], as it provides the required trade-off between diversity and accuracy in the construction of ensemble classifiers [9]. This is useful for a variety of applications, including molecular datasets, which are characterised by a high number of features and a low number of instances, often leading to an increased risk of overfitting [20].

B. The proposed algorithm

The framework of the proposed algorithm is illustrated in Fig. 2, while the algorithm is specified in Algorithm 2. The data is collected from the NCBI database site. Then, the BioEdit tool (v. 7.2.5) is applied to align the collected nucleotide sequences of the fusion protein of Newcastle disease virus. The amino acid information is obtained by translating the sequence in order to identify the cleavage site sequence. The number of basic amino acids at the cleavage site is determined and the data is normalised. Finally, the amino acid data is classified as low or high virulent by applying Algorithm 1.

Fig. 2. Framework of the proposed algorithm to classify the Newcastle disease

Algorithm 2 The proposed algorithm.
1: Collect data from NCBI database site.
2: Align the data using BioEdit.
3: Translate data to amino acids to identify the cleavage site amino acid sequence.
4: Selection of relevant data by a virologist.
5: Normalise the data.
6: Apply Algorithm 1 to classify the data as low or high virulent.
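Step 3 of Algorithm 2 and the basic-residue count can be illustrated with a short sketch. It assumes an already aligned, in-frame coding sequence and the standard genetic code; the function names, the "X" placeholder for unknown codons, and the toy sequence are hypothetical and are not taken from the paper, which uses BioEdit for alignment and translation.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def translate(nucleotides: str) -> str:
    """Translate an in-frame nucleotide sequence; unknown codons become 'X'."""
    seq = nucleotides.upper().replace("U", "T")
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def cleavage_site(protein: str, start: int = 112, end: int = 117) -> str:
    """Residues start..end (1-based, inclusive) of the F protein."""
    return protein[start - 1:end]


def basic_residue_count(site: str) -> int:
    """Number of basic residues (arginine R or lysine K) in the site."""
    return sum(aa in "RK" for aa in site)


# Toy example: a short fragment, so positions 2..7 stand in for 112..117.
protein = translate("ATGAGACGTCAGAAACGCTTT")
site = cleavage_site(protein, start=2, end=7)
n_basic = basic_residue_count(site[:5])  # residues up to the cleavage point
```

With a full-length F protein the default positions 112 to 117 apply directly; the count of basic residues is one plausible numeric feature to pass on to the classifier.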

IV. EXPERIMENTAL RESULTS

Experiments were performed based on k-fold cross validation, where the data is randomly partitioned into k subsamples and each subsample is tested after training on the other k − 1 subsamples.
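The partitioning scheme can be sketched as a small index generator. The value k = 10 and the fixed seed below are assumptions for illustration only; the text does not state which k was used.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # random partition
    folds = [indices[i::k] for i in range(k)]     # k near-equal subsamples
    for i, test in enumerate(folds):
        # Train on the other k - 1 subsamples.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 213 samples as in the dataset described above; k = 10 is an assumed value.
splits = list(k_fold_indices(213, k=10))
```

Each sample appears in exactly one test fold, so every instance is used once for evaluation and k − 1 times for training.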

A. Algorithm settings

There are a number of parameters that must be adjusted to yield good performance. Fig. 3 shows the results of the experiments performed to find the optimum value of the maximum and minimum size of the number of features in each subset of the tree. The experiments were conducted 20 times for specifying the values of the maximum and the minimum group; the results for both parameters were identical for the dataset. The accuracy was 99.5% for the two experiments with 1 group and 2 groups, and reached 100% for 3 groups, while for higher numbers performance dropped. The optimum value for the maximum and minimum size of the group of the tree was hence 3 groups, with an accuracy of 100%.

Fig. 3. Accuracy for varying the maximum and minimum size of the group for rotation forest classifier.

The decision tree classifier, J48 in WEKA, uses an error-based pruning algorithm. The user can choose a confidence value to be used when pruning the tree. Fig. 4 presents the results of the experiments performed for specifying the percentage of instances to be removed. 100 experiments were performed to find the best ratio. As we can see, the highest accuracy is reached when removing 78%.

Fig. 4. Accuracy for varying the percentage of instances to be removed for rotation forest classifier.

B. Performance analysis

The performance of each classifier was evaluated by its accuracy, sensitivity, and specificity, defined as

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}, \tag{1}$$

$$\text{Sensitivity} = \frac{TP}{TP + FN}, \tag{2}$$

and

$$\text{Specificity} = \frac{TN}{FP + TN} \tag{3}$$

respectively, where TP is the number of true positives, i.e. instances correctly classified as belonging to the class, FN is the number of false negatives, i.e. instances incorrectly classified as not belonging to the class, TN is the number of true negatives, i.e. instances correctly classified as not belonging to the class, and FP is the number of false positives, i.e. instances incorrectly classified as belonging to the class.
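As a worked illustration of Eqs. (1) to (3), the sketch below counts TP, FP, FN and TN from predicted and true labels and then evaluates the three measures. The label values and the toy predictions are made up for the example and are not results from the paper.

```python
def confusion_counts(y_true, y_pred, positive):
    """Return (TP, FP, FN, TN) with respect to the given positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)   # Eq. (1)


def sensitivity(tp, fn):
    return tp / (tp + fn)                    # Eq. (2)


def specificity(tn, fp):
    return tn / (fp + tn)                    # Eq. (3)


# Toy labels: "high" virulent is treated as the positive class.
y_true = ["high", "high", "high", "low", "low", "low", "low", "high"]
y_pred = ["high", "high", "low", "low", "low", "high", "low", "high"]
tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive="high")
```

For these toy labels the counts are TP = 3, FP = 1, FN = 1 and TN = 3, giving 0.75 for all three measures.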

C. Comparison with other algorithms

Our proposed algorithm was compared with three benchmark classifier algorithms as follows:

• Bagging classifier: built on the idea of bagging by Breiman [21] and Ho's random feature selection [7]. The behaviour of the bagging algorithm is different from that of ROT. For each base model, a bootstrap sample is drawn and used to train the tree. The remaining instances are used as an out-of-bag set for evaluation. Predictions of all models are combined to produce the output.

• Bayes classifier: a generative classifier based on probabilistic classification. In the training phase, each attribute is considered separately within each class. The classifier first calculates the probability of each attribute conditioned on the class value. A product rule is used to obtain a joint conditional probability for the attributes, and Bayes' rule is then used to derive conditional probabilities for the class variable. The test phase involves the calculation of conditional probabilities with normal distributions.

• Decision table classifier: composed of a hierarchical table. Each entry in a higher-level table is broken down by two attributes at each level of the hierarchy to form another table. The hierarchical structure is similar to dimensional stacking [22]. It typically provides fast classification with low error rates and the ability to use a small number of attributes while producing understandable classifiers.
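The Bayes classifier steps listed above (per-class attribute statistics, the product rule, Bayes' rule, and normal distributions at test time) can be sketched as a small Gaussian naive Bayes. This is a generic illustration with hypothetical function names and toy data, not the WEKA implementation used in the experiments; the small constant added to the variance is an assumption to avoid division by zero.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Training phase: per class, store the prior and (mean, variance) per attribute."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        stats = []
        for col in zip(*rows):                      # each attribute separately
            mean = sum(col) / len(col)
            var = sum((v - mean) ** 2 for v in col) / len(col) + 1e-9
            stats.append((mean, var))
        model[label] = (len(rows) / len(X), stats)
    return model


def predict_gaussian_nb(model, row):
    """Test phase: pick the class with the highest (log) posterior."""
    def log_posterior(label):
        prior, stats = model[label]
        # Product rule over attributes becomes a sum of log normal densities.
        return math.log(prior) + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
            for v, (mean, var) in zip(row, stats))
    return max(model, key=log_posterior)


# Toy one-attribute data, e.g. a normalised basic-residue count (made up).
X = [[0.0], [0.2], [1.0], [1.2]]
y = ["low", "low", "high", "high"]
model = fit_gaussian_nb(X, y)
prediction = predict_gaussian_nb(model, [1.1])
```

Working in log space avoids numerical underflow when many attribute likelihoods are multiplied together.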

Fig. 5 shows the evaluated performance measures for all four classification techniques. It can be seen that the proposed algorithm based on rotation forest classification achieved perfect classification results and hence the highest accuracy, sensitivity, and specificity among the tested algorithms. In contrast, accuracy/sensitivity/specificity were 99.5%/99.5%/99.5% for the bagging and decision table classifiers, and 99%/99.1%/99.1% for the Bayes classifier.

Fig. 5. Performance measure for all classifiers.

Fig. 6. ROC curves for all classifiers.

Fig. 6 shows receiver operating characteristic (ROC) curves [23] for all algorithms. The ROC curve plots the true positive rate as a function of the false positive rate, and hence shows the trade-off between sensitivity and specificity. As is apparent, by providing perfect classification, our approach also yields the best ROC curve.
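How such a curve is traced can be shown with a short sketch that sweeps a decision threshold over classifier scores and records (false positive rate, true positive rate) pairs. The scores and labels are toy values, and the function names and the trapezoidal area computation are illustrative assumptions rather than the procedure used to produce Fig. 6.

```python
def roc_points(scores, labels):
    """ROC points for binary labels (1 = positive, 0 = negative).

    The threshold is lowered step by step; tied scores are handled together.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            tp += ranked[i][1]
            fp += 1 - ranked[i][1]
            i += 1
        points.append((fp / neg, tp / pos))   # (FPR, TPR)
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (FPR, TPR) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# A perfectly separating scorer and an imperfect one (toy values).
perfect = roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
imperfect = roc_points([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
```

A perfect classifier passes through the top-left corner (FPR 0, TPR 1), which is why perfect classification also gives the best possible ROC curve.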

V. CONCLUSIONS

In this paper we have presented an algorithm based on rotation forest classification for classifying the degree of virulence of the Newcastle disease virus. A collection of sequences from GenBank at the National Center for Biotechnology Information was used for validating the proposed algorithm. Relevant features were determined with a virologist, and a comparison with three other classification algorithms demonstrated that the proposed algorithm can be used to obtain an effective automatic classification system for NDV. The numerical experiments show that the proposed algorithm gives better results than the other algorithms (100% accuracy versus 99.5% and 99%), providing perfect classification on the employed dataset.

REFERENCES

[1] M.S. Collins, J.B. Bashiruddin, D.J. Alexander, Deduced amino acid sequences at the fusion protein cleavage site of Newcastle disease viruses showing variation in antigenicity and pathogenicity, Arch. Virol. 128 (3-4), 363-370, 1993.

[2] D.J. Alexander, Newcastle disease and other avian paramyxoviruses, Rev. Sci. Tech. O.I.E. 19, 443-462, 2000.

[3] T. Asahara, Study on the virulence of Newcastle disease virus, Kitasato Arch. Exp. Med. 51 (1-2), 15-29, 1978.

[4] O.S. de Leeuw, G. Koch, L. Hartog, N. Ravenshorst, B.P. Peeters, Virulence of Newcastle disease virus is determined by the cleavage site of the fusion protein and by both the stem region and globular head of the haemagglutinin-neuraminidase protein, J. Gen. Virol. 86 (Pt 6), 1759-1769, 2005.

[5] A. Panda, Z. Huang, S. Elankumaran, D. Rockemann, S.K. Samal, Role of fusion protein cleavage site in the virulence of Newcastle disease virus, Microb. Pathogen. 36 (1), 1-10, 2004.

[6] A. Scheid and P.W. Choppin, Identification of biological activities of paramyxovirus glycoproteins. Activation of cell fusion, hemolysis, and infectivity by proteolytic cleavage of an inactive precursor protein of Sendai virus, Virology 57, 475-490, 1974.


[7] T. K. Ho, Multiple classifier combination: lessons and next steps, In Bunke, H., Kandel, A. (Eds), Hybrid Methods in Pattern Recognition. World Scientific, pp. 171-198, 2002.

[8] J. Kittler, M. Hatef, R. P. W. Duin, J. Matas, On Combining Classifiers, IEEE Trans. Pattern Anal. Machine Intell. vol. 20(3), pp. 226-239, 1998.

[9] L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley, New York, 2004.

[10] K.-H. Liu and D.-S. Huang, Cancer classification using Rotation Forest, Computers in Biology and Medicine, vol. 38, pp. 601-610, 2008.

[11] K. Polat, S. Gunes, A novel hybrid intelligent method based on C4.5 decision tree classifier and one-against-all approach for multi-class classification problems, Expert Systems with Applications, vol. 36(2), pp. 1587-1592, 2009.

[12] J. Abellán, A. R. Masegosa, Bagging schemes on the presence of class noise in classification, Expert Systems with Applications, 39, 6827-6837, 2012.

[13] A. E. Hassanien, J. M. Ali, and N. Hajime, Detection of spiculated masses in mammograms based on fuzzy image processing, In 7th Int. Conference on Artificial Intelligence and Soft Computing, pp. 1002-1007, 2004.

[14] G. Ilczuk, A. Wakulicz-Deja, Visualization of Rough Set Decision Rules for Medical Diagnosis Systems, Rough Sets, Fuzzy Sets, Data Mining and Granular Computing, Lecture Notes in Computer Science, vol. 4482, pp. 371-378, 2007.

[15] H.I. Elshazly, N.I. Ghali, A.M.E. Korany, A.E. Hassanien, Rough Sets and Genetic Algorithms: A hybrid approach to breast cancer classification, Information and Communication Technologies, pp. 260-265, 2012.

[16] J. J. Rodriguez, L. I. Kuncheva, C. J. Alonso, Rotation forest: a new classifier ensemble method, IEEE Trans. Pattern Anal. Machine Intell., 28 (10), 1619-1630, 2006.

[17] L. I. Kuncheva and J. J. Rodríguez, An Experimental Study on Rotation Forest Ensembles, Springer-Verlag Berlin Heidelberg, 459-468, 2007.

[18] Y. Freund and R. E. Schapire, A Decision-Theoretic Generalization of on-line Learning and an Application to Boosting, Journal of Computer and System Sciences, 55, 119-139, 1997.

[19] L. Breiman, Random Forests, Machine Learning, vol. 45(1), 5-32, 2001.

[20] F. Markowetz and R. Spang, Molecular diagnosis classification, model selection and performance evaluation, Meth. Inf. Med. 44, 438-443, 2005.

[21] L. Breiman, Bagging predictors, Technical Report 421, University of California, Berkeley, Department of Statistics, 1994.

[22] J. LeBlanc, M. Ward, and N. Wittels, Exploring N-Dimensional Databases, Proceedings of First IEEE Conference on Visualization, pages 230-237, 1990.

[23] J. Kerekes, Receiver Operating Characteristic Curve Confidence Intervals and Regions, IEEE Geoscience and Remote Sensing Letters, vol. 5(2), pp. 251-255, 2008.
