[IEEE 2011 9th IEEE/ACS International Conference on Computer Systems and Applications (AICCSA) - Sharm El-Sheikh, Egypt (2011.12.27-2011.12.30)] 2011 9th IEEE/ACS International Conference


FIS-PNN: A Hybrid Computational Method for Protein-Protein Interaction Prediction

Sakhinah Abu Bakar1, Javid Taheri2, and Albert Y Zomaya3

Centre for Distributed and High Performance Computing School of Information Technologies

The University of Sydney Sydney, Australia

[email protected] [email protected] [email protected]

Abstract— The study of protein-protein interactions (PPIs) is an active area of research in biology, as they mediate most of the biological functions in any organism. Although there are no concrete properties for predicting PPIs, extensive wet-lab experiments suggest (with high probability) that interacting proteins share similar functions at the fine level, as well as similar cellular roles and sub-cellular locations. In this study, we developed a technique to predict PPIs based on protein secondary structures, co-localization, and function annotation. We propose an approach, named FIS-PNN, to predict interacting proteins in yeast using hybrid machine learning algorithms. FIS-PNN has been trained and tested using 1029 proteins with 2965 known positive interactions; it successfully predicts PPIs with 96% accuracy, a level significantly greater than that of all other existing sequence-based prediction methods.

Keywords — protein-protein interaction; secondary structure; machine learning

I. INTRODUCTION

Protein-protein interactions (PPIs) are crucial for every organism, as most biological functions are mediated by them. A protein may interact with another protein over a short or long period of time to form a protein complex, to carry it, or even to modify it. In fact, detecting which proteins interact, how they interact, and what function is performed by their complex is at least as important as predicting the three-dimensional structure of individual proteins [1]. Information about such interactions improves our understanding of disease and can provide the basis for many novel therapeutic approaches.

During the 1990s, because most PPI prediction methods were based on amino acid sequence comparisons, they were only applicable to completely sequenced genomes, such as Helicobacter pylori, Bacillus subtilis, and Mycoplasma genitalium. For example, genes of two different completely sequenced bacteria, H. influenzae and E. coli, were clustered based on their functional classes to investigate the relationship order of individual genes, as well as the overall genome organization, in both bacteria [2]. Conservation-of-gene-order methods assume that proteins encoded by conserved gene pairs must also interact physically; these methods can also be used to predict the functions of prokaryotic gene products [3]. Another approach to predicting PPIs is the gene fusion method, which identifies gene-fusion events in complete genomes based on sequence comparisons [4]. The similarity-of-phylogenetic-trees approach named Mirrortree achieved 66% accuracy by considering the effects of the reference organisms and the identification of homologous proteins in the target organism [5, 6]. Furthermore, a few more methods were proposed based on the similarity of phylogenetic trees, including partial correlation coefficient [7], intra-matrix correlations [8], and SVM-based methods [9, 10], with overall accuracies of 66-80%.

In other approaches, such as docking methods, threading-based methods, and homology methods, protein three-dimensional structure information is used to predict PPI networks. One docking method assumes that putative interactors associate using the same interface patches as the seed interactors [11]. MULTIPROSPECTOR is a multimeric structure-based threading approach that aims to capture more distantly related or even analogous proteins [12]. Homology-based methods assume that a protein-protein interaction can be modeled by known structures of protein complexes whose components are homologous or similar to the proteins whose interactions are to be modeled [13].

Following the introduction of many machine learning approaches, Bock and Gough were among the pioneers in applying Support Vector Machines (SVMs) to PPI prediction. They proposed SVM-light to recognize and predict PPIs based on protein sequences and physico-chemical properties (charge and surface tension) [14]. A kernel based on signature products has also been introduced, improving accuracy into the range of 70-80% under 10-fold cross validation [15]. Besides SVMs, Hidden Markov models (HMMs) have also been applied to PPI prediction. HMMs were built from artificial multiple sequence alignment patches to search for sequences with remote homology [16]. HMM-based methods do not achieve predictions as good as SVMs, mainly because of the limited protein sequence information used in HMMs.

978-1-4577-0476-5/11/$26.00 © 2011 IEEE AICCSA 2011


Figure 1. General performance of the current methods

This study focuses on a static physical PPI network in which interacting proteins can substitute for each other. Although there are no concrete properties for predicting PPI networks, it has been experimentally verified that proteins with strong interactions most probably share similar functions, cellular roles, and/or sub-cellular locations [1]. Therefore, if two proteins have similar functions at the fine level, it is theoretically believed that they also share similar three-dimensional structures [1]. However, if two proteins have similar sequences, it is not strongly verified that they also share similar functions and interact with each other [1]. From these hypotheses one can conclude that if two proteins have similar secondary structures, they most probably have similar three-dimensional structures and therefore share similar functions and interact with each other. As shown in Fig. 1, sequence-based prediction methods achieve 65-80% accuracy. Although the accuracies of structure-based prediction methods are in the same range, these methods employ only protein structure information to predict PPIs, unlike sequence-based methods, which require additional protein features to reach high prediction accuracy.

In this paper, a novel approach (FIS-PNN) is presented to investigate the aforementioned hypothesis. Section 2 presents the problem statement, and Section 3 describes the overall machine learning framework we deployed to predict PPI networks. Section 4 details our novel algorithm, FIS-PNN. Section 5 shows the results of FIS-PNN on the yeast benchmark. Discussion and analysis are presented in Section 6, followed by the conclusion in Section 7.

II. PROBLEM STATEMENT

The prediction of a PPI network in this work is formulated as follows: given a set of amino acid sequences of any organism, S = {s_1, s_2, …, s_N}, and their associated secondary structures, SS = {ss_1, ss_2, …, ss_N}, where N is the number of proteins. Every secondary structure can be presented as a sequence of secondary structure elements: helices (H), sheets (E), and coils (C), i.e., ss_i = {e_{i,1}, e_{i,2}, …, e_{i,n}}, where n is the structure length. Find the connected graph G(V, E), where V = {p_1, p_2, …, p_N} represents the set of proteins and E = {w_ij | i, j = 1, 2, …, N} is the set of similarity scores for connected proteins i and j. The similarity score for a protein pair (i, j) can be written as

w_ij = Σ_{k=1}^{n} 1 if e_{i,k} = e_{j,k},   (1)

where e_{i,k} = e_{j,k} if the elements match (H→H, E→E, C→C) or the coil-match condition (H,E)→C is satisfied.
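The scoring rule of Eq. (1) can be sketched in a few lines. This is an illustrative reading, not the authors' code; in particular it interprets the coil-match condition (H,E)→C as counting any position where an H or E is aligned with a C:

```python
def similarity_score(ss_i: str, ss_j: str) -> int:
    """Similarity score w_ij of Eq. (1): count aligned positions whose
    secondary-structure elements (H, E, C) match.  The coil-match rule
    (H,E) -> C is interpreted here as also counting positions where an
    H or E is aligned with a C (an assumption; the paper's wording is terse)."""
    score = 0
    for a, b in zip(ss_i, ss_j):
        if a == b or 'C' in (a, b):
            score += 1
    return score

def normalized_score(ss_i: str, ss_j: str) -> float:
    """Scores are normalized to [0, 100]; higher means more similar."""
    n = min(len(ss_i), len(ss_j))
    return 100.0 * similarity_score(ss_i, ss_j) / n if n else 0.0
```

Note that this scoring is symmetric, which is consistent with w_ij = w_ji in Stage 1 of the algorithm.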

III. MACHINE LEARNING ALGORITHMS

A. Principal Component Analysis

Principal Component Analysis (PCA) is an exploratory data analysis tool that uses mathematical procedures to transform a large number of correlated variables into a smaller number of uncorrelated variables, called principal components. Before the principal components are calculated, all data must be standardized using the mean and standard deviation of every variable. The PCA transformation can be formulated as below.

Y^T = X^T W,

where W Σ W^T is the eigenvalue decomposition of the covariance matrix of the data matrix X^T (the columns of W are the eigenvectors, and Σ holds the eigenvalues).

The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible [17]. In other words, PCA reduces the size of the input data and consequently the complexity of the system.
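The transformation above can be sketched in numpy. This is a minimal illustration under our own assumptions: variable names are ours, and the 1% variance cut-off follows the setting reported in Section V:

```python
import numpy as np

def pca_transform(X, var_threshold=0.01):
    """PCA sketch: standardize each variable, eigendecompose the
    covariance matrix, and keep only the principal components that
    contribute more than var_threshold (1%) of the total variance."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)       # standardization
    cov = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)          # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]               # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals / eigvals.sum() > var_threshold
    W = eigvecs[:, keep]                            # transformation matrix
    return Z @ W, W                                 # scores Y and reusable W
```

Returning W alongside the scores mirrors the paper's evaluation protocol, where the same transformation matrix learned on the training data is reused on the test data.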

B. Fuzzy Inference Systems

Fuzzy inference systems (FIS), also known as fuzzy systems, are linguistic-based decision-making systems that use possibility theory to handle uncertainty in their reasoning process. A fuzzy system has three main components [18].

1) Fuzzy rules: A set of fuzzy IF-THEN rules maps the system's inputs to its outputs.

2) Fuzzification: This process imports inputs from a real crisp system into the fuzzy system.

3) Defuzzification: By using mathematical models, this process converts fuzzy numbers (fuzzy system’s output) to crisp values.
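To make the three components concrete, here is a minimal, hypothetical zeroth-order Sugeno-style system with Gaussian fuzzification, product inference, and weighted-average defuzzification. The rule format and parameters are illustrative only and are not taken from the paper:

```python
import math

def gaussmf(x, c, sigma):
    """Gaussian membership function used at the fuzzification level."""
    return math.exp(-((x - c) ** 2) / (2 * sigma ** 2))

def sugeno_fis(inputs, rules):
    """Minimal zeroth-order Sugeno FIS.  Each rule is a pair
    ([(center, sigma) per input], consequent_value).  Product inference
    combines the antecedent membership degrees; the weighted average of
    rule consequents performs the defuzzification to a crisp value."""
    num = den = 0.0
    for antecedents, consequent in rules:
        w = 1.0
        for x, (c, s) in zip(inputs, antecedents):
            w *= gaussmf(x, c, s)       # fuzzification + product inference
        num += w * consequent            # firing strength times rule output
        den += w
    return num / den if den else 0.0     # defuzzified crisp output
```

With two rules centered at 0 (output 0) and 1 (output 1), an input near 0 yields a crisp value below 0.5 and an input near 1 yields one above 0.5, illustrating how fuzzification, inference, and defuzzification fit together.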



Figure 2. Architecture of FIS-PNN

In FIS theory, combining different fuzzification and defuzzification functions with different rule base structures can lead to various solutions to a given task [19]. However, because a single FIS may not be suitable for high-dimension datasets, as it easily increases the complexity and reduces the speed of the system [20, 21], we used multiple FISs instead. As a result, the whole system runs faster, becomes more reliable, and is also much simpler.

C. Probabilistic Neural Networks

An artificial neural network (ANN) is a computational model that simulates the structure of a biological neural network: it consists of interconnected neurons that process information. ANNs are usually deployed to model complex or unknown relationships between inputs and outputs. The radial basis network called the probabilistic neural network (PNN) was introduced by Specht [22-25]. The three-layer PNN was developed to solve classification problems, where the radial basis function e^{-x^2} is chosen as the transfer function for neurons in the hidden layer.

When an input is presented, the first layer computes the distances from the input vector to the training input vectors and produces a vector whose elements indicate how close the input is to each training input. The second layer sums these contributions for each class of inputs to produce a vector of probabilities as the output. Finally, the output layer selects the maximum of these probabilities and produces a 1 for that class and 0 for the other classes.
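The two-class decision just described can be sketched as follows. This is a simplified PNN with a Gaussian smoothing parameter sigma, which the paper does not specify (the transfer function e^{-x^2} corresponds to sigma ≈ 0.707); function and variable names are ours:

```python
import numpy as np

def pnn_predict(X_train, y_train, x, sigma=0.5):
    """Three-layer PNN sketch: the pattern layer applies a Gaussian
    radial basis function to the distances from x to every training
    vector; the summation layer averages the activations per class;
    the output layer picks the class with the highest probability."""
    probs = {}
    for cls in np.unique(y_train):
        d2 = np.sum((X_train[y_train == cls] - x) ** 2, axis=1)
        probs[cls] = np.mean(np.exp(-d2 / (2 * sigma ** 2)))
    return max(probs, key=probs.get)
```

Because the class scores are averaged per class before the maximum is taken, the classifier tolerates some imbalance between class sizes, which matters for the imbalanced datasets examined in Section VI.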

IV. FIS-PNN ALGORITHM

In this paper, we introduce a new hybrid method that employs multiple independent fuzzy inference systems and probabilistic neural networks to predict (either stable or transient) PPIs using the similarity scores of protein secondary structures. The first classifier was proposed in our previous study with promising prediction performance [26]. However, implementing the hybrid method by adding the second classifier successfully improved the prediction performance, as discussed in the following sections.

The proposed method, called FIS-PNN, consists of three stages as shown in Fig. 2.

A. Stage 1: Dataset preparation

1) Secondary structure similarity score computation: The computation involves four steps, as below.

STEP 1: Clustering - All proteins in S were grouped according to their families using CLUSS, a tool for clustering protein sequences [27].

STEP 2: Multiple Sequence Alignment (MSA) - All sequences in every group were aligned using our multiple-sequence alignment method, the Rubber Band Technique (RBT). RBT is inspired by the natural behavior of a rubber band on a plate with several poles, resembling locations in the input sequences that are most probably biologically related [28].

STEP 3: Secondary Structure Prediction (SSP) - Databases of experimentally determined protein secondary structures are very limited. Because not all proteins have secondary structure information in the databases, we instead used SSpro [29], one of the most popular tools for secondary structure prediction, to determine the secondary structure of proteins. From the aligned sequences (the results of the MSA), SSpro predicts the secondary structure of every protein, representing every element by one of three conformational states: H, E, and C.

STEP 4: Similarity Measurement (Sim) - Based on these elements and the length of the aligned region, the method measures the similarity of all pairs of proteins. For N proteins, there are N(N-1) possible interacting pairs. The similarity score for every pair is calculated from formula (1), with w_ij = w_ji for the protein pairs (i, j) and (j, i). The scores are normalized to the range [0, 100], where higher scores indicate higher similarity between two proteins.

2) Frequency of co-localization: A co-localization matrix of known interacting proteins from the Database of Interacting Proteins (DIP) was developed and compared with a randomized interaction matrix [30]. The resulting matrix, called the co-localization weight matrix L, divides sub-cellular localization into 21 categories (1-extracellular, 2-bud, 3-cell wall, 4-cell periphery, 5-plasma membrane, 6-integral membrane, 7-cytoplasm, 8-cytoskeleton, 9-ER, 10-Golgi, 11-transport vesicles, 12-nucleus, 13-mitochondria, 14-peroxisome, 15-endosome, 16-vacuole, 17-microsomes, 18-lipid particles, 19-punctate composite, 20-ambiguous, and 21-unknown), as shown in Fig. 3. Most proteins move through several sub-cellular localizations in L; therefore, we compute the frequency of co-localization for every protein pair using the following formula [31].

f1(p1, p2) = MAX{ Localization(L_i, L_j) | ∀ L_i ∈ p1, ∀ L_j ∈ p2 },   (2)

where Localization(·) refers to the combination of localization weights in L.
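A literal reading of Eq. (2) can be sketched as below. The encoding of L as a nested list of weights and of each protein's localizations as lists of category indices is our assumption, not the paper's data format:

```python
def colocalization_frequency(L, locs_p1, locs_p2):
    """Eq. (2) sketch: f1(p1, p2) is the maximum co-localization weight
    L[i][j] over all localization categories i of protein p1 and j of
    protein p2.  L stands in for the 21x21 co-localization weight
    matrix; locs_p1 / locs_p2 are 0-based category indices (assumed
    input encoding)."""
    return max(L[i][j] for i in locs_p1 for j in locs_p2)
```

Taking the maximum over all category pairs reflects the observation that a protein may pass through several sub-cellular localizations, so the most favorable shared compartment determines the score.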



Figure 3. The co-localization weight matrix

Figure 4. An example of principle component analysis

Figure 5. An architecture of multiple fuzzy inference systems model

3) Similarity scores of function annotation: The similarity of functional annotations is computed based on a hierarchical tree structure called FunCat [32]. FunCat has 28 main functional categories with up to six levels of increasing specificity. The levels of every category are separated by dots, with a unique two-digit number at each level (e.g., 01.01.03.02.01) identifying the category's place in the hierarchy. The similarity score of functional annotation for every protein pair is computed as below [31]:

f2(p1, p2) = MAX{ LCA(f_i, f_j) | ∀ f_i ∈ p1, ∀ f_j ∈ p2 },   (3)

where LCA is the lowest common ancestor of the two proteins' functional annotations.
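Since FunCat categories are dotted strings, the lowest common ancestor of two annotations is their longest common prefix of levels. One plausible reading of Eq. (3) scores a pair by the deepest such ancestor; this interpretation is ours, not the authors' exact definition:

```python
def lca_depth(cat_a: str, cat_b: str) -> int:
    """Depth of the lowest common ancestor of two FunCat categories,
    computed as the length of their common dotted prefix,
    e.g. lca_depth('01.01.03', '01.01.05') == 2."""
    depth = 0
    for a, b in zip(cat_a.split('.'), cat_b.split('.')):
        if a != b:
            break
        depth += 1
    return depth

def function_similarity(cats_p1, cats_p2):
    """Eq. (3) sketch: f2(p1, p2) as the maximum LCA depth over all
    pairs of FunCat annotations of the two proteins (one possible
    reading of the LCA-based score)."""
    return max(lca_depth(a, b) for a in cats_p1 for b in cats_p2)
```

A deeper common ancestor means the two proteins share a more specific functional category, so higher f2 values indicate more closely related functions.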

B. Stage 2: Fuzzy inference system modeling (classifier 1)

A single FIS may not be suitable for a high-dimension dataset, as it can increase the complexity and consequently reduce the speed of the system [33]. Alternatively, multiple fuzzy systems can be developed, not only to speed up the whole system but also to increase its reliability and simplicity.

In this work, we construct multiple independent FISs as the first classifier, with an NxN input matrix whose membership functions for every input are obtained by fuzzy clustering. Every row of the input matrix is an input vector for one fuzzy system. After applying PCA to all N input vectors, we obtain a smaller number of input vector elements, M, where M << N, as shown in Fig. 4.

Every i-th fuzzy system classifies all possible links between protein i and all other proteins into either interacting or non-interacting pairs. To do this, Gaussian membership functions with the product inference rule are used at the fuzzification level. The associated membership function parameters are adjusted during the learning process using a combination of the backpropagation algorithm and least squares estimation. Each system has a single output in the range [0, 1], where higher scores indicate a higher probability of interaction. The model architecture consists of an input matrix of similarity scores, subsystem outputs y_i for i = 1, 2, …, N, and a "com" operation that combines the y_i by rows to generate the final output Y1, as shown in Fig. 5.

C. Stage 3: Probabilistic neural network (classifier 2)

The PNN works as the second classifier in FIS-PNN. The output matrix of the first classifier (the FIS) is fed into the PNN as its input to generate the final symmetrical output matrix. Besides the FIS output, two protein features, f1 and f2 (computed from equations 2 and 3), are also considered as additional inputs to the PNN; these features have been shown to be significant factors in predicting accurate PPI networks, and therefore they are added to FIS-PNN to produce superior PPI networks.

Here, upon receiving an input vector, the first layer computes the distances from this input vector to all training input vectors; the second layer sums these contributions for each class of inputs to produce a vector of values in [0, 1]. The output layer then classifies the vector by selecting the maximum element and producing a '1' for the chosen class and '0' for the other classes. In this study, the final output of the proposed method has two classes: interacting and non-interacting protein pairs.

V. RESULTS

FIS-PNN has been tested using 1029 Saccharomyces cerevisiae (baker's yeast) proteins with 2965 already-known positive interactions among them.



The baker's yeast protein network is the most widely used benchmark in protein-protein interaction prediction [14-16]. The positive interaction information was downloaded from DIP [30], which combines experimentally determined protein interaction information from various sources and is updated on a regular basis.

During the first stage of FIS-PNN, the BLOSUM62 scoring matrix with a gap opening penalty of 5 and a gap extension penalty of 1 was empirically selected for RBT in the MSA step. We used the random walk initialization mode for sequence lengths less than 200 and the homogeneous initialization mode otherwise. RBT is executed ten times for every protein group, and its best result is taken as the final answer for the MSA.

In the second stage, we executed our first classifier, the multiple independent FISs, with the similarity score matrix obtained from the first stage as input. PCA eliminates those principal components that contribute less than 1% to the total variation in the input vector. We used a 10-fold cross-validation test to evaluate the performance of our classifiers. After a subsystem is trained, the same transformation matrix is used to transform the test dataset applied to that subsystem. The PCA process successfully transformed a 1029x1029 matrix into a 1029x6 matrix, showing that not all of the 1029 proteins are highly connected: only 10% of these proteins have high connectivity, with a maximum of 77 interactions. After the validation test, our first classifier consisted of N = 1029 subsystems and N different sets of inference rules (7 rules on average), with an average root mean square error (RMSE) of 0.0476419.

We then used the results from the FIS as input for the probabilistic neural network (PNN). Since the result matrix is symmetrical, we used only its lower triangle as input for the second classifier. In this study, we executed the PNN with four different input sets: C1 consisted of the FIS output; C2 of the FIS output plus f1; C3 of the FIS output plus f2; and C4 of the FIS output plus both protein features, f1 and f2. Each version of the PNN was trained by a gradient descent algorithm with adaptive-learning-rate back-propagation and tested using 10-fold cross validation.
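The 10-fold cross-validation protocol used for every classifier can be sketched generically. Here predict_fn is a hypothetical train-and-predict callback standing in for any of the classifiers; it is our interface, not the paper's:

```python
import numpy as np

def kfold_accuracy(X, y, predict_fn, k=10, seed=0):
    """k-fold cross validation sketch: shuffle, split into k folds,
    train on k-1 folds and test on the held-out fold, then average the
    per-fold accuracies.  predict_fn(X_tr, y_tr, X_te) must return
    predicted labels for X_te (hypothetical interface)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    accs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        y_pred = predict_fn(X[train], y[train], X[test])
        accs.append(np.mean(y_pred == y[test]))
    return float(np.mean(accs))
```

The same harness accepts any of the input sets C1-C4 by changing the columns of X, which is how the feature comparison in Section VI can be reproduced in principle.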

VI. DISCUSSION AND ANALYSIS

In this study we picked two of the best-known algorithms, SVM-light from [14] and the k-nearest neighbour method, to gauge the performance of our novel approach, FIS-PNN.

A. Single classifier versus multiple classifiers

Our proposed hybrid classification method, FIS-PNN, which consists of two different classifiers, successfully classified interacting and non-interacting proteins based only on their secondary structure similarity. The first classifier achieved an 85% true positive rate [26], while the second classifier improved the classification performance by correctly predicting 2791 interacting proteins, i.e., 91% accuracy, without adding any extra protein features.

In our experiment, two machine learning methods were compared with FIS-PNN: SVM-light and k-nearest neighbour (KNN), as well as their combinations with FIS.

Figure 6. Performance comparison of FIS-PNN, SVM-light, and KNN

Figure 7. The ROC curves for all methods (panels a and b)



SVM-light was implemented as in Bock [14]; the k-nearest neighbour method had never been applied to PPI prediction before. The same kernel function as in [14] was used in SVM-light to recognize interacting and non-interacting pairs during 10-fold cross validation. The k-nearest neighbour classifier employed Euclidean distance as its distance metric, with k = 1.

As shown in [34], our study demonstrates that a combination of multiple classifiers can perform better than the individual classifiers. Fig. 6 shows that the PNN achieves the best performance among the single machine learning classifiers with 74.57% accuracy, while SVM-light achieves lower accuracy and KNN fails to classify the given input data. Although SVM-light successfully predicts a high number of true positive interactions, it predicts a high number of false positives as well, showing that SVM-light is limited to small datasets with fairly balanced numbers of positive and negative links. However, both methods (SVM-light and KNN) improve significantly when combined with FIS, achieving 83.9% and 84.05% accuracy, respectively (Fig. 7). This shows that a multiple-classifier method can improve the performance of a single-classifier method. In this study, our proposed FIS makes significant classification improvements, as can be observed in the wide area between the multiple-classifier and single-classifier ROC curves in Fig. 7.

B. Protein features

In this study, besides using the similarity score of protein secondary structures, we also considered two other protein features for every protein pair: the frequency of co-localization, f1, and the similarity of functional annotation, f2. We developed four different datasets to be trained and tested using FIS-PNN. Fig. 8 shows the results of these experiments with different input sets for the second classifier (PNN, SVM-light, and KNN).

Although a protein's secondary structure reveals invaluable information regarding the protein's function, it might not be adequate for an accurate prediction. Therefore, the other protein features are considered in this work to significantly improve the accuracy of PPI prediction, yielding 96% accuracy when all protein features are considered. Fig. 8a shows the reduction of the false positive rate as more protein features are considered. This figure also shows that information from protein secondary structures and functional annotations is much more coherent in predicting PPIs than information on protein co-localization. Similar results are observed when FIS-KNN is applied to the same datasets, as shown in Fig. 8b. FIS-SVM-light (Fig. 8c), on the other hand, shows the opposite behavior, as its performance decreases for larger datasets. This could be a result of this technique's fast training process for mapping training data to kernel space.

C. Balanced versus imbalanced dataset

In most experiments, the numbers of positive and negative interactions are set to a 1:1 ratio. The sizes of datasets and the ratio of their positive to negative interactions are in fact the main factors that affect the performance of a classification method [35].

Figure 8. Performance for different datasets: (a) FIS-PNN, (b) FIS-KNN, (c) FIS-SVM-light



TABLE I. THE GENERAL PERFORMANCE OF PPI PREDICTION METHODS

                      Balanced                        Imbalanced
               C1      C2      C3      C4      C1      C2      C3      C4
FIS-PNN
  Accuracy    0.911   0.93    0.969   0.97    0.8951  0.9102  0.9545  0.9645
  Sensitivity 0.939   0.942   0.975   0.982   0.8482  0.8654  0.9312  0.941
  Specificity 0.883   0.919   0.963   0.958   0.9185  0.9325  0.9661  0.9762
FIS-KNN
  Accuracy    0.8405  0.8428  0.8516  0.8491  0.8337  0.8381  0.8535  0.8527
  Sensitivity 0.8546  0.8509  0.8543  0.8496  0.7774  0.7781  0.7841  0.7777
  Specificity 0.8263  0.8347  0.8489  0.8486  0.8619  0.8681  0.8882  0.8902
FIS-SVM-light
  Accuracy    0.839   0.7775  0.8899  0.8723  0.5892  0.7432  0.8742  0.8693
  Sensitivity 0.9602  0.9815  0.9297  0.9396  0.9907  0.9868  0.9218  0.9366
  Specificity 0.7177  0.5735  0.8501  0.8049  0.3884  0.6214  0.8504  0.8356

In this work, we have also tested the performance of FIS-PNN with both balanced and imbalanced datasets. Here, the imbalanced dataset had a ratio of 1:2, i.e., the number of negative interactions was twice the number of positive ones.

Table 1 shows the results of our tests and reveals that the accuracy of our proposed algorithm, FIS-PNN, is not heavily affected by the level of asymmetry in the datasets. FIS-PNN also achieved high sensitivity and specificity for various dataset sizes without great dependence on their asymmetry ratio. Larger imbalanced datasets with 1:10 and 1:50 ratios of positive to negative links, as well as the extreme ratio of 1:100 (the original ratio of the yeast dataset), were also used to check the efficiency of FIS-PNN; in all cases FIS-PNN managed to achieve high-quality results. FIS-KNN also successfully differentiates interacting from non-interacting proteins in the imbalanced dataset, with accuracy between 83-85%. Unlike FIS-KNN, FIS-SVM-light could not perform accurate classification for the imbalanced datasets: it achieves low specificity for imbalanced datasets C1 and C2, and high specificity only for C3 and C4.
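Constructing the imbalanced sets amounts to sampling random non-interacting pairs at the desired negative-to-positive ratio. A sketch under our own assumptions (function and argument names are ours; the paper does not describe its sampling procedure):

```python
import random

def sample_negatives(proteins, positive_pairs, ratio=2, seed=0):
    """Build the negative side of an imbalanced training set at a given
    negative:positive ratio (1:2 in the paper's imbalanced setting) by
    sampling random protein pairs absent from the known positives."""
    rng = random.Random(seed)
    positives = {frozenset(p) for p in positive_pairs}
    negatives = set()
    while len(negatives) < ratio * len(positive_pairs):
        a, b = rng.sample(proteins, 2)   # random unordered pair
        pair = frozenset((a, b))
        if pair not in positives:        # exclude known interactions
            negatives.add(pair)
    return [tuple(p) for p in negatives]
```

Using unordered pairs (frozensets) keeps the sampling consistent with the symmetric scores w_ij = w_ji, so (i, j) and (j, i) are never drawn as two distinct negatives.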

VII. CONCLUSION

In this study, we deployed similarity scores of proteins' secondary structures as a new domain of protein information for accurate prediction of protein-protein interactions (PPIs). We predict PPIs based on the formation of helices (H), coils (C), and sheets (E) in proteins' secondary structures using our proposed classification method, FIS-PNN. FIS-PNN successfully predicts PPIs with 91% accuracy, significantly outperforming many existing sequence-based prediction methods. The proposed method is able to efficiently classify large datasets using either secondary-structure information alone or in combination with other protein features (frequency of co-localization and/or similarity of function annotation). The results demonstrate the feasibility of our approach even in the formidable case of imbalanced datasets with a large number of negative interactions.
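The PNN component named in the conclusion is Specht's probabilistic neural network [22, 23]: each class score is a sum of Gaussian kernels centred on that class's training samples, and the highest-scoring class wins. The sketch below illustrates only this generic PNN decision rule; the feature vectors, smoothing parameter, and function name are our illustrative assumptions, not the FIS-PNN configuration.

```python
import math

def pnn_classify(x, train, sigma=0.5):
    """Generic probabilistic neural network (Specht-style) decision rule.
    `train` maps a class label to a list of feature vectors; the class
    whose kernel-density estimate at `x` is largest is returned."""
    def kernel(a, b):
        # Gaussian (Parzen-window) kernel with smoothing parameter sigma.
        d2 = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
        return math.exp(-d2 / (2 * sigma ** 2))
    scores = {c: sum(kernel(x, s) for s in samples) / len(samples)
              for c, samples in train.items()}
    return max(scores, key=scores.get)
```

Because the "training" step is just storing the samples, a PNN trains in one pass, which suits repeated evaluation over datasets of different sizes and asymmetry ratios.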

REFERENCES

[1] A. Tramontano, The Ten Most Wanted Solutions in Protein Bioinformatics: Chapman & Hall / CRC, 2005.

[2] J. Tamames, G. Casari, C. Ouzounis, and A. Valencia, "Conserved Clusters of Functionally Related Genes in Two Bacterial Genomes," Journal of Molecular Evolution, vol. 44, pp. 66-73, 1997.

[3] T. Dandekar, B. Snel, M. Huynen, and P. Bork, "Conservation of Gene Order: A Fingerprint of Proteins That Physically Interact," Trends Biochem Sci, vol. 23, pp. 324-328, 1998.

[4] A. J. Enright, I. Iliopoulos, N. C. Kyrpides, and C. A. Ouzounis, "Protein Interaction Maps for Complete Genomes based on Gene Fusion Events," Letters of Nature, vol. 402, pp. 86-90, 1999.

[5] F. Pazos and A. Valencia, "Similarity of Phylogenetic Trees as Indicator of Protein-protein Interaction," Protein Engineering, vol. 14, pp. 609-614, 2001.

[6] J. Sun, J. Xu, Z. Liu, Q. Liu, A. Zhao, and Y. Li, "Refined Phylogenetic Profiles Method for Predicting Protein-protein Interactions," Bioinformatics, vol. 21, pp. 3409-3415, 2005.

[7] T. Sato, Y. Yamanishi, and K. Horimoto, "Prediction of Protein-protein Interactions from Phylogenetic Trees using Partial Correlation Coefficient," Genome Informatics, vol. 14, pp. 496-497, 2003.

[8] R. A. Craig and L. Liao, "Phylogenetic Tree Information Aids Supervised Learning for Predicting Protein-protein Interaction based on Distance Matrices," BMC Bioinformatics, vol. 8, pp. 1-12, 2007.

[9] X. W. Chen and M. Liu, "Prediction of Protein-protein Interactions using Random Decision Forest Framework," Bioinformatics, vol. 21, pp. 4394-4400, 2005.

[10] F. Marangoni, M. Barberis, and M. Botta, "Large Scale Prediction of Protein Interactions by a SVM-based Method," WIRN VIETRI 2003, LNCS, pp. 296-301, 2003.

[11] S. J. Cockell, B. Oliva, and R. M. Jackson, "Structure-based Evaluation of in silico Predictions of Protein-protein Interactions using Comparative Docking," Bioinformatics, vol. 23, pp. 573-581, 2007.

[12] L. Lu, A. K. Arakaki, H. Lu, and J. Skolnick, "Multimeric Threading-based Prediction of Protein-protein Interactions on a Genomic Scale: Application to the Saccharomyces Cerevisiae Proteome," Genome Research, vol. 13, pp. 1146-1154, 2003.

[13] A. Szilagyi, V. Grimm, A. K. Arakaki, and J. Skolnick, "Prediction of Physical Protein-protein Interactions," Physical Biology, vol. 2, pp. S1-S16, 2005.

[14] J. R. Bock and D. A. Gough, "Predicting Protein-protein Interactions from Primary Structure," Bioinformatics, vol. 12, pp. 455-460, 2000.



[15] S. Martin, D. Roe, and J.-l. Faulon, "Predicting Protein-protein Interactions using Signature Products," Bioinformatics, vol. 21, pp. 218-226, 2005.

[16] J. Espadaler, O. Romero-Isart, R. M. Jackson, and B. Oliva, "Prediction of Protein-protein Interactions using Distant Conservation of Sequence Patterns and Structure Relationships," Bioinformatics, vol. 21, pp. 3360-3368, 2005.

[17] I. T. Jolliffe, Principal Component Analysis, 2 ed. New York: Springer, 2002.

[18] J. Taheri, "Fuzzy Logic," in Handbook of Nature-Inspired and Innovative Computing, A. Y. Zomaya, Ed. New York: Springer Science + Business Media Inc, 2006.

[19] J. Taheri and A. Y. Zomaya, "Fuzzy Logic," in Handbook of Nature-Inspired and Innovative Computing, A. Y. Zomaya, Ed. New York: Springer Science + Business Media Inc, 2006.

[20] C. B. Cheng, C. J. Cheng, and E. S. Lee, "Neuro-Fuzzy and Genetic Algorithm in Multiple Response Optimization," Computers and Mathematics with Applications, vol. 44, pp. 1503-1514, 2002.

[21] J. Yen, L. Wang, and R. Langari, "Multiple Fuzzy Systems for Function Approximation," in Annual Conference of the North American Fuzzy Information Processing Society - NAFIPS, 1997, pp. 154-158.

[22] D. F. Specht, "Probabilistic Neural Networks for Classification, Mapping, or Associative Memory," in IEEE International Conference on Neural Networks, 1988, pp. 525-532.

[23] D. F. Specht, "Probabilistic Neural Networks and the Polynomial Adaline as Complementary Techniques for Classification," IEEE Transactions on Neural Networks, vol. 1, pp. 111-121, 1990.

[24] J. Taheri and A. Y. Zomaya, "An Overview of Neural Network Models," in Handbook of Bioinspired Algorithms and Applications S. Olariu and A. Y. Zomaya, Eds. Boca Raton, Florida: Chapman&Hall/CRC Press and Taylor & Francis Group, 2006.

[25] J. Taheri and A. Y. Zomaya, "Artificial Neural Networks," in Handbook of Nature-Inspired and Innovative Computing, A. Y. Zomaya, Ed. New York: Springer Science + Business Media Inc, 2006.

[26] S. A. Bakar, J. Taheri, and A. Y. Zomaya, "Fuzzy Systems Modeling for Protein-protein Interaction Prediction in Saccharomyces Cerevisiae," in 18th World IMACS / MODSIM Congress, Cairns, Australia, 2009, pp. 782-788.

[27] A. Kelil, S. Wang, R. Brzezinski, and A. Fleury, "CLUSS: Clustering of Protein Sequences Based on a New Similarity Measure," BMC Bioinformatics, vol. 8, 2007.

[28] J. Taheri, A. Y. Zomaya, and B. B. Zhou, "RBT-L: A Location Based Approach for Solving the Multiple Sequence Alignment Problem," The University of Sydney, NSW, Australia 626, 2008.

[29] J. Cheng, A. Z. Randall, M. J. Sweredoski, and P. Baldi, "SCRATCH: A Protein Structure and Structural Feature Prediction Server," Nucleic Acids Research, vol. 33, pp. W72-W76, 2005.

[30] I. Xenarios, D. W. Rice, L. Salwinski, M. K. Baron, E. M. Marcotte, and D. Eisenberg, "DIP: The Database of Interacting Proteins," Nucleic Acids Research, vol. 28, pp. 289-291, 2000.

[31] M. S. Lee, S. S. Park, and M. K. Kim, "A Protein Interaction Verification System Based on a Neural Network Algorithm," in IEEE Computational Systems Bioinformatics Conference Workshops, 2005, pp. 151-154.

[32] A. Ruepp, A. Zollner, D. Maier, K. Albermann, J. Hani, M. Mokrejs, I. Tetko, U. Guldener, G. Mannhaupt, M. Munsterkotter, and H. W. Mewes, "The FunCat: A Functional Annotation Scheme for Systematic Classification of Proteins from Whole Genomes," Nucleic Acids Research, vol. 32, pp. 5539-5545, 2004.

[33] R. Sharan, S. Suthram, R. M. Kelley, T. Kuhn, S. McCuine, P. Uetz, T. Sittler, R. M. Karp, and T. Ideker, "Conserved Patterns of Protein Interaction in Multiple Species," PNAS, vol. 102, pp. 1974-1979, 2005.

[34] T. G. Dietterich, "Ensemble Methods in Machine Learning," in Proceedings of the First International Workshop on Multiple Classifier Systems. Springer-Verlag, 2000.

[35] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques. United States of America: Morgan Kaufmann, 2005.
