
[IEEE 2011 International Conference on Data and Knowledge Engineering (ICDKE) - Milano, Italy (2011.09.6-2011.09.6)] 2011 International Conference on Data and Knowledge Engineering


SVM based approaches for classifying protein tertiary structures

Georgina Mirceva, Danco Davcev
Computer Science and Engineering Department
Ss. Cyril and Methodius University in Skopje
Skopje, Macedonia
[email protected], [email protected]

Abstract—The tertiary structure of a protein molecule is the main factor that can be used to determine its chemical properties as well as its function. Knowledge of protein function is crucial in the development of new drugs, better crops and synthetic biochemicals. With the rapid development of technology, the number of determined protein structures increases every day, so retrieving structurally similar proteins with current algorithms takes too long. Therefore, improving the efficiency of methods for protein structure retrieval and classification is an important research issue in the bioinformatics community. In this paper, we present two SVM based protein classifiers. Our classifiers use information about the conformation of protein structures in 3D space: the structures are represented by our protein voxel based and protein ray based descriptors. A part of the SCOP 1.73 database is used to evaluate the classifiers. The results show that our approach achieves 98.7% classification accuracy with the protein ray based descriptor, while being much faster than similar algorithms with comparable accuracy.

Keywords- PDB; protein classification; protein voxel based descriptor; protein ray based descriptor; Support Vector Machines (SVM)

I. INTRODUCTION

Proteins are among the most important molecules in living organisms, since they play a vital structural and functional role in cells. They are constructed of several polypeptide chains of amino acids, which fold into complex tertiary structures. The way these chains fold in 3D space is very important for understanding the function of the protein molecule. Knowledge of protein function is crucial in the development of new drugs, better crops and synthetic biochemicals.

Experimentally determined protein structures are stored in the Protein Data Bank (PDB) [1], the primary world-wide repository for protein structures. Since the determination of the first 3D structure, that of the protein myoglobin, the complexity and variety of known protein structures have increased along with the number of newly determined macromolecules. Therefore, there is a clear need for efficient methods for feature extraction and classification of proteins, which may lead to a better understanding of these complicated structures, their functions and the deeper evolutionary processes that led to their creation. In molecular biology, many classification schemes and databases, such as SCOP [2], CATH [3] and FSSP [4], have been developed in order to describe the similarity between protein structures.

The Structural Classification of Proteins (SCOP) database [2] describes the evolutionary relationships between protein structures. SCOP is accepted as the most relevant and most reliable classification dataset because its classification decisions are based on visual observation of the structural elements of the proteins by human experts. This manual approach makes SCOP strongly oriented towards reliable and precise protein classification. The main levels of the SCOP hierarchy are Domain, Family, Superfamily, Fold and Class. In SCOP terminology, two proteins that belong to the same fold share a common three-dimensional pattern with the same major secondary structure elements (SSEs) in the same arrangement and with the same topological connections. In the SCOP hierarchy, the folds are grouped into classes, where a class is defined by the topographical arrangement of the secondary structures of its member proteins.

Although SCOP is highly reliable and precise, it has one drawback. Because its classification is manual, the number of proteins released in the PDB database that have not yet been classified by the SCOP method is increasing drastically.

The CATH (Class, Architecture, Topology, and Homologous superfamily) database [3] uses automatic methods for classification into domains, together with experts’ experience where automatic methods fail to give reliable results. In the CATH database, proteins are also classified in a hierarchical manner.

The FSSP (Families of Structurally Similar Proteins) database [4] is based on the protein secondary structure. The evaluation of a pair of proteins is highly time consuming, so comparing one macromolecule against all macromolecules in the database requires days. Therefore, one representative protein is defined for each class, so that no two representatives have more than 25% amino-acid sequence similarity. An unclassified protein is compared only to the representative protein of each class, which still requires an entire day.

978-1-4577-0866-4/11/$26.00 ©2011 IEEE


Although the classification in the FSSP database is fully automated, it is not fast enough to significantly reduce the number of known structures that are not yet classified.

Therefore, there is an obvious need for a system that classifies proteins as precisely and reliably as SCOP does, but in an automated fashion. Various methods for protein classification aim to offer efficient and completely automated protein classification.

One way to determine protein similarity is to use sequence alignment algorithms such as Needleman–Wunsch [5], BLAST [6], PSI-BLAST [7] etc. These algorithms offer fast and efficient recognition of overlapping subsequences in two protein structures, which leads to detection of closely related protein structures. However, these methods cannot recognize proteins with remote homology.

Instead of sequence alignment methods, structure alignment methods like CE [8], MAMMOTH [9], DALI [10] etc. are used to detect and highlight distant homology relations between the protein structures.

The CE [8] algorithm performs alignment of the protein structures by using a combinatorial extension of an alignment path. The path is composed of aligned fragment pairs (AFPs) of m amino acids.

MAMMOTH [9] first computes the URMS distance between all pairs of heptapeptides and determines rotation matrices of each pair, which are used to find the alignment that maximizes the local similarity. The maximum subset of similar local structures is determined by evaluation of the P-value.

The DALI [10] algorithm first calculates the distance matrix of the protein as a representation of the protein tertiary structure. The distance matrix is a symmetric matrix whose elements are the 3D Euclidean distances between each pair of Cα atoms. In order to simplify the alignment in later stages, the distance matrix is decomposed into hexapeptide-hexapeptide fragments, named contact patterns. Then, a Monte Carlo algorithm is applied in order to find similar contact patterns and assemble pairs of contact patterns. Since Monte Carlo is a randomized algorithm, it cannot guarantee convergence to the globally optimal solution. The DALI algorithm is used for classifying protein structures in the FSSP database [4].

In general, these methods are very precise and effective, and they map existing structures onto new proteins with a high degree of success. Structure alignment methods compare the query protein against all known proteins in order to find the existing protein most similar to a novel protein structure. Since structure alignment methods are computationally expensive, the speed of classification with these methods is always in question. For example, CE takes 209 days [8] to classify 11,000 novel protein structures.
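The distance matrix described above is simple to compute. The following is a minimal sketch, not DALI's implementation: it assumes the Cα coordinates are already available as (x, y, z) tuples, and the function name `distance_matrix` and the toy coordinates are hypothetical.

```python
import math

def distance_matrix(ca_coords):
    """Symmetric matrix of Euclidean distances between all pairs of C-alpha atoms."""
    n = len(ca_coords)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(ca_coords[i], ca_coords[j])
            dist[i][j] = d  # fill both halves so the matrix is symmetric
            dist[j][i] = d
    return dist

# Hypothetical coordinates of three C-alpha atoms (illustration only).
toy_chain = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
matrix = distance_matrix(toy_chain)
```

The matrix has n² entries for a chain of n residues, which is one reason why methods built on it become costly when a query must be compared against a whole database.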

Also, there are numerous methods which combine sequence and structure alignment of the proteins. SCOPmap [11] uses a pipelined architecture for the classification. SCOPmap is based on four sequence alignment methods: BLAST [6], PSI-BLAST [7], RPS-BLAST and COMPASS [12] and two structure alignment methods: VAST [13] and DaliLite [14].

FastSCOP [15] is another system, more efficient than SCOPmap, which is based on 3D-BLAST [16] and MAMMOTH [9]. 3D-BLAST is a structure alignment method that is used as a preprocessing filter to produce the top 10 scores. 3D-BLAST [16] partitions the (κ, α) map into 23 structural letters, so three-dimensional protein structures are represented by one-dimensional sequences. Homologous proteins are determined by the BLAST algorithm [6]. The one-dimensional representation of protein substructures loses the structural topology, so dissimilar tertiary structures can lead to identical sequences. Therefore, the accuracy is lower than that of detailed structure alignment algorithms such as DALI and CE. The top 10 results obtained by 3D-BLAST are used by MAMMOTH in order to find the protein structures most similar to the query structure.

In contrast to classifying protein molecules with alignment techniques, classification based on mapping the protein structure into a high-dimensional feature space has been found very promising.

In [17], the distance matrix of the protein molecule is calculated, and then 24 local and 9 global features are extracted from the distance matrix histograms. Classification is based on the E-predict algorithm [17].

In [18], the protein descriptor is generated from the properties of protein sequence in order to avoid complex structure comparison. The protein descriptor contains information for the number of different amino acids, the hydrophobicity, the polarity, the Van der Waals volume, the polarizability and the secondary structures of the protein molecule. With this protein descriptor, proteins are classified using Naive Bayes and boosted C4.5 decision trees [18].

In this paper, we present two protein classifiers, which are based on the Support Vector Machines method (SVM). Our protein voxel and ray based descriptors [19] are used to represent the protein molecules in the feature space. After proper mapping of the protein structures in the feature space, the Support Vector Machines method is used in order to classify the protein structures. A part of the SCOP 1.73 database was used in evaluation of the proposed classifiers.

The rest of the paper is organized as follows: in Section 2 we present our classification approaches; Section 3 presents the experimental results, while Section 4 concludes the paper.

II. OUR APPROACH

In this paper, we propose an accurate and fast system that allows users to classify protein structures. The information about a protein structure is stored in a PDB file. PDB files are kept in the Protein Data Bank (PDB) [1], the primary repository of experimentally determined protein structures. We use our protein voxel and ray based descriptors [19] to map the protein structures into the feature space. Then, a Support Vector Machines (SVM) classifier [20] is applied in order to assign each new protein to the corresponding protein domain in the SCOP hierarchy. The phases of our classification system are illustrated in Fig. 1.

Figure 1. The training and test phases of our classification system.

In the training phase, the information about the protein structures contained in PDB files is processed and the protein voxel/ray based descriptors are extracted. After generation of the protein descriptors of all training proteins, SVM models for each SCOP domain are generated using the SVM method. In the test phase, the user uploads the PDB file of the query protein. The information from the PDB file is processed and the protein voxel/ray based descriptor is extracted. Then, the protein is classified with the SVM method by using the generated SVM models. Finally, the classification results are presented to the user.

A. Protein Voxel Based Descriptor

We have used the voxel-based descriptor presented in [21] to extract the geometrical features of the protein structures. The voxel based descriptor is extracted in five phases. Since the exact 3D position of each atom and its radius are known, each atom can be represented by a sphere. First, we perform triangulation in order to build a mesh model of the protein structure: the surface of each sphere is triangulated, and together these surfaces form the mesh model of the protein. Then, the center of mass is calculated and the protein is translated so that the new center of mass is at the origin. The distance dmax between the new origin and the most distant vertex is computed and the protein is scaled so that dmax = 1. In this way, we provide translation and scale invariance.
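The translation and scaling step can be sketched in a few lines. This sketch assumes the mesh vertices are given as (x, y, z) tuples and takes the center of mass to be the unweighted mean of the vertices, which the text does not specify; `normalize_vertices` and the toy coordinates are hypothetical.

```python
import math

def normalize_vertices(vertices):
    """Translate the centroid to the origin and scale so the farthest vertex is at distance 1."""
    n = len(vertices)
    # Assumption: center of mass = unweighted mean of the mesh vertices.
    centroid = tuple(sum(v[k] for v in vertices) / n for k in range(3))
    centered = [tuple(v[k] - centroid[k] for k in range(3)) for v in vertices]
    d_max = max(math.hypot(*v) for v in centered)
    if d_max == 0.0:
        raise ValueError("all vertices coincide; the model cannot be scaled")
    return [tuple(c / d_max for c in v) for v in centered]

# Hypothetical vertices (illustration only).
toy_vertices = [(1.0, 2.0, 3.0), (3.0, 2.0, 3.0), (1.0, 6.0, 3.0), (1.0, 2.0, 9.0)]
normalized = normalize_vertices(toy_vertices)
```

After this step every vertex lies inside the unit sphere, so two copies of the same structure that differ only in position or size yield the same normalized mesh.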

After triangulation, we perform voxelization. Voxelization transforms the continuous 3D-space into a discrete 3D voxel space. It proceeds in two steps: discretization and sampling. Discretization divides the continuous 3D-space into voxels. During sampling, each voxel is assigned a value vabc that depends on the positions of the polygons of the 3D-mesh model: vabc equals the fraction of the total surface area S of the mesh that lies inside the cuboid region µabc (1).

$$ v_{abc} = \frac{\operatorname{area}\{\mu_{abc} \cap I\}}{S}, \qquad 0 \le a, b, c \le N-1 \tag{1} $$

Each triangle Tj of the model is subdivided into pj² coincident triangles, each of which has surface area δ = Sj / pj², where Sj is the area of Tj. If all vertices of the triangle Tj lie in the same cuboid region µabc, then we set pj = 1; otherwise we use (2) to determine the value of pj.

$$ p_j = \left\lceil \sqrt{p_{min}\,\frac{S_j}{S}} \right\rceil \tag{2} $$

For each newly obtained triangle, the center of gravity G is computed and the voxel µabc that contains G is determined. Finally, the attribute vabc is incremented by δ. The quality of the approximation is set by pmin. According to [21], we have set pmin = 32000.
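The sampling step can be sketched as follows. This is an illustration under stated assumptions, not the implementation of [21]: it reads (2) as pj = ⌈√(pmin·Sj/S)⌉, assumes the voxel grid covers the cube [-1, 1]³ (the model was scaled so that dmax = 1), divides δ by S so that voxel values are area fractions as in (1), and uses a small pmin for the toy mesh. All function names are hypothetical.

```python
import math

def triangle_area(a, b, c):
    """Area of a 3D triangle from the cross product of two edge vectors."""
    ab = [b[k] - a[k] for k in range(3)]
    ac = [c[k] - a[k] for k in range(3)]
    cross = (ab[1] * ac[2] - ab[2] * ac[1],
             ab[2] * ac[0] - ab[0] * ac[2],
             ab[0] * ac[1] - ab[1] * ac[0])
    return 0.5 * math.sqrt(sum(x * x for x in cross))

def voxel_index(point, n):
    """Index (a, b, c) of the voxel containing a point; grid assumed to span [-1, 1]^3."""
    return tuple(min(n - 1, max(0, int((x + 1.0) / 2.0 * n))) for x in point)

def sub_triangle_centroids(a, b, c, p):
    """Centers of gravity of the p*p equal sub-triangles of triangle (a, b, c)."""
    def grid_point(i, j):
        return tuple(a[k] + (b[k] - a[k]) * i / p + (c[k] - a[k]) * j / p for k in range(3))
    def centroid(u, v, w):
        return tuple((u[k] + v[k] + w[k]) / 3.0 for k in range(3))
    centers = []
    for i in range(p):
        for j in range(p - i):
            centers.append(centroid(grid_point(i, j), grid_point(i + 1, j), grid_point(i, j + 1)))
            if j < p - i - 1:  # the inverted triangle sharing this cell
                centers.append(centroid(grid_point(i + 1, j), grid_point(i + 1, j + 1),
                                        grid_point(i, j + 1)))
    return centers

def voxelize(triangles, n, p_min):
    """Sparse voxel grid {(a, b, c): v_abc} holding fractions of the total surface area."""
    areas = [triangle_area(*t) for t in triangles]
    total = sum(areas)
    grid = {}
    for (a, b, c), s_j in zip(triangles, areas):
        same_voxel = voxel_index(a, n) == voxel_index(b, n) == voxel_index(c, n)
        p_j = 1 if same_voxel else max(1, math.ceil(math.sqrt(p_min * s_j / total)))
        delta = s_j / (p_j * p_j) / total  # divided by S so the values sum to 1
        for g in sub_triangle_centroids(a, b, c, p_j):
            idx = voxel_index(g, n)
            grid[idx] = grid.get(idx, 0.0) + delta
    return grid

# Hypothetical two-triangle mesh inside the unit sphere (illustration only).
toy_mesh = [((-0.9, -0.2, 0.0), (0.9, -0.2, 0.0), (0.0, 0.4, 0.0)),
            ((0.0, 0.0, -0.8), (0.5, 0.0, 0.8), (0.0, 0.5, 0.8))]
toy_grid = voxelize(toy_mesh, n=8, p_min=200)
```

Because each triangle contributes exactly pj² increments of δ, the voxel values add up to the whole surface, whatever value of pmin is chosen; pmin only controls how finely large triangles are spread over the voxels they cross.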

The information contained in the voxel grid can be processed further to obtain both correlated information and a more compact representation of the voxel attributes as a feature. We applied the 3D Discrete Fourier Transform (3D-DFT) to obtain a spectral domain feature vector, which provides rotation invariance of the descriptor. A 3D-array of complex numbers F = [fabc] is transformed into another 3D-array by (3).

$$ f'_{pqs} = \frac{1}{\sqrt{MNP}} \sum_{a=0}^{M-1} \sum_{b=0}^{N-1} \sum_{c=0}^{P-1} f_{abc}\, e^{-2\pi j \left( ap/M + bq/N + cs/P \right)} \tag{3} $$

Since we apply the 3D-DFT to a voxel grid with real-valued attributes, we shift the indices so that (a; b; c) is translated into (a–M/2; b–N/2; c–P/2). We set M = N = P and introduce the abbreviation (4).

$$ v'_{a-M/2,\; b-N/2,\; c-P/2} \equiv v_{abc} \tag{4} $$

We take the magnitudes of the low-frequency coefficients as components of the vector. Since the 3D-DFT input is a real-valued array, symmetry is present among the obtained coefficients, so the feature vector is formed from all non-symmetrical coefficients that satisfy 1 ≤ |p|+|q|+|s| ≤ k ≤ N/2. The components are the magnitudes |f’pqs|, scaled by dividing by |f’000|. This vector represents the geometrical features of the protein structure.

Additionally, several features of the primary and secondary structure of the protein molecule are considered, as in [22]. More specifically, for the primary structure, the ratios of the amino acids’ occurrences and the ratio of hydrophobic amino acids are calculated. For the secondary structure, the ratios of the helix types’ occurrences and the numbers of Helices, Sheets and Turns in the protein are calculated. These features are combined with the previously extracted geometry descriptor, thus forming a better integrated descriptor. In this way, we extract our protein voxel based descriptor.
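The selection of spectral coefficients can be illustrated with a naive transform on a tiny grid. The sketch below is only an illustration: it evaluates the sum in (3) directly rather than with a fast transform, stores the grid sparsely as a dictionary, and keeps one coefficient from each pair (p, q, s), (-p, -q, -s), whose magnitudes coincide for real input. The normalizing constant in front of the sum is an assumption and cancels after division by |f’000|. All names are hypothetical.

```python
import cmath
import math

def dft_coefficient(grid, n, p, q, s):
    """Naive 3D-DFT coefficient of an n x n x n real grid given as {(a, b, c): value}."""
    total = 0j
    for (a, b, c), value in grid.items():
        # Spatial indices are shifted by n/2 as in (4); this changes only the phase.
        shifted = (a - n // 2) * p + (b - n // 2) * q + (c - n // 2) * s
        total += value * cmath.exp(-2j * math.pi * shifted / n)
    return total / math.sqrt(n ** 3)

def spectral_features(grid, n, k):
    """Scaled magnitudes of the coefficients with 1 <= |p| + |q| + |s| <= k."""
    f000 = abs(dft_coefficient(grid, n, 0, 0, 0))
    features = []
    for p in range(-k, k + 1):
        for q in range(-k, k + 1):
            for s in range(-k, k + 1):
                if not 1 <= abs(p) + abs(q) + abs(s) <= k:
                    continue
                if (p, q, s) < (-p, -q, -s):
                    continue  # mirrored coefficient has the same magnitude
                features.append(abs(dft_coefficient(grid, n, p, q, s)) / f000)
    return features

# Hypothetical 4 x 4 x 4 grid with three occupied voxels (illustration only).
toy_voxels = {(0, 0, 0): 0.5, (1, 2, 3): 0.3, (3, 3, 1): 0.2}
toy_features = spectral_features(toy_voxels, n=4, k=2)
```

With k = 2 the sketch yields 12 components. For grids of realistic size a fast Fourier transform would replace the direct sum.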

B. Protein Ray Based Descriptor

The Cα atoms form the backbone of the protein molecule. There are residues attached to the Cα atoms, but these are not important for the classification. Our previous analyses [19] showed that by considering only the Cα atoms of the protein and extracting a feature vector, we can attain higher accuracy.

Proteins have different numbers of Cα atoms, so we have to find a uniform way to represent all proteins with descriptors of the same length. In this research we interpolate the backbone of the protein with a fixed number of points (Nint points), which are equidistant along the protein backbone. We use Nint = 64 interpolation points.

The interpolation of the protein backbone is performed in two steps. First, the overall length of the protein backbone L is computed by (5), where nCα is the number of Cα atoms, and d stands for the Euclidean distance between two consecutive Cα atoms. Then, the protein backbone is interpolated to Nint – 1 segments with length l = L / (Nint – 1).

$$ L = \sum_{i=1}^{n_{C\alpha}-1} d\left(C\alpha_i, C\alpha_{i+1}\right) \tag{5} $$
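The two interpolation steps can be sketched as follows. The sketch assumes that the backbone is treated as a polyline through consecutive Cα atoms and that new points are placed by linear interpolation along it, which the text does not state explicitly; the function names and toy coordinates are hypothetical.

```python
import math

def backbone_length(ca_coords):
    """Total backbone length L as in (5): sum of distances between consecutive C-alpha atoms."""
    return sum(math.dist(ca_coords[i], ca_coords[i + 1]) for i in range(len(ca_coords) - 1))

def interpolate_backbone(ca_coords, n_int=64):
    """Return n_int points spaced L / (n_int - 1) apart along the backbone polyline."""
    step = backbone_length(ca_coords) / (n_int - 1)
    points = [tuple(ca_coords[0])]
    seg = 0                # index of the current backbone segment
    seg_start = 0.0        # arc length at the start of that segment
    seg_len = math.dist(ca_coords[0], ca_coords[1])
    for m in range(1, n_int - 1):
        target = m * step  # arc length of the m-th interpolation point
        while seg_start + seg_len < target and seg < len(ca_coords) - 2:
            seg_start += seg_len
            seg += 1
            seg_len = math.dist(ca_coords[seg], ca_coords[seg + 1])
        t = min(1.0, (target - seg_start) / seg_len) if seg_len > 0 else 0.0
        p0, p1 = ca_coords[seg], ca_coords[seg + 1]
        points.append(tuple(p0[k] + t * (p1[k] - p0[k]) for k in range(3)))
    points.append(tuple(ca_coords[-1]))
    return points

# Hypothetical three-atom backbone of length 7 (illustration only).
toy_backbone = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
toy_points = interpolate_backbone(toy_backbone, n_int=8)
```

Every protein, whatever its number of Cα atoms, is thereby reduced to the same number of points, which is what makes descriptors of different proteins directly comparable.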

Finally, we use a modification of the ray descriptor [21], [23] in order to extract the protein feature vector. In the ray based descriptor the surface of the 3D object is triangulated and the mesh model is generated, and then the elements of the descriptor are calculated as the distances between the points of the mesh model to the center of mass. In this research we approximate the backbone by interpolating it to fixed number of interpolation points, and then the elements of the descriptor are calculated as Euclidean distances from the interpolation points to the center of mass.
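Once the interpolation points are available, the descriptor itself is a short computation. The sketch below assumes the center of mass is the unweighted mean of the interpolation points; the paper does not say which point set defines it, so this is an assumption, and `ray_descriptor` is a hypothetical name.

```python
import math

def ray_descriptor(points):
    """Euclidean distance from each interpolation point to the centroid of all points."""
    n = len(points)
    # Assumption: center of mass = unweighted mean of the interpolation points.
    center = tuple(sum(p[k] for p in points) / n for k in range(3))
    return [math.dist(p, center) for p in points]

# Hypothetical interpolation points on a straight line (illustration only).
toy_descriptor = ray_descriptor([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (4.0, 0.0, 0.0)])
```

The descriptor has one element per interpolation point, so Nint = 64 points give a 64-element feature vector.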

C. Support Vector Machines (SVM)

The support vector machines (SVM) method is a binary classification method proposed by Vapnik and his colleagues at Bell Laboratories [20], [24]. As a binary classifier, it has to find the optimal hyperplane that separates the positive from the negative examples, see Fig. 2. Examples are presented as data points {xi, yi}, i = 1,...,N, yi ∈ {-1, 1}, xi ∈ Rd. In our approach, xi corresponds to the protein voxel/ray based descriptor of the i-th training protein. The points x which lie on the hyperplane satisfy w·x + b = 0, where w is normal to the hyperplane, |b|/||w|| is the perpendicular distance from the hyperplane to the origin, and ||w|| is the Euclidean norm of w. The “margin” of a separating hyperplane is defined as the sum of the distances from the separating hyperplane to the closest positive and negative examples. Suppose that all the training examples satisfy the constraints (6); these can be combined into the single inequality (7).

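$$
\begin{aligned}
\mathbf{x}_i \cdot \mathbf{w} + b &\ge +1, && \text{for } y_i = +1 \\
\mathbf{x}_i \cdot \mathbf{w} + b &\le -1, && \text{for } y_i = -1
\end{aligned} \tag{6}
$$

$$ y_i \left( \mathbf{x}_i \cdot \mathbf{w} + b \right) - 1 \ge 0, \quad \forall i \tag{7} $$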

Figure 2. Linearly separating hyperplane for a two-dimensional feature space.

The points for which (7) holds with equality lie on the two hyperplanes H1 and H2, see Fig. 2. These hyperplanes are parallel and separate the positive from the negative examples. So, the goal is to find the pair of hyperplanes that gives the maximum margin, by minimizing ||w||² subject to the constraints (7). The model will contain only the examples that lie on the hyperplanes H1 and H2, named support vectors.
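The link between the margin and ||w|| can be made explicit with a short derivation. It uses only the symbols defined above and the standard formula for the distance between two parallel hyperplanes; it is offered as a reading aid rather than as part of the original derivation.

```latex
\begin{align*}
H_1 &: \mathbf{x}\cdot\mathbf{w} + b = +1, &
H_2 &: \mathbf{x}\cdot\mathbf{w} + b = -1, \\
\operatorname{dist}(H_1, H_2) &= \frac{|(+1) - (-1)|}{\|\mathbf{w}\|}
  = \frac{2}{\|\mathbf{w}\|}, \\
\max_{\mathbf{w},\,b}\ \frac{2}{\|\mathbf{w}\|}
  \quad &\Longleftrightarrow \quad
  \min_{\mathbf{w},\,b}\ \frac{\|\mathbf{w}\|^{2}}{2}
  \quad \text{subject to (7)}.
\end{align*}
```

Since no training example lies between H1 and H2, the margin equals 2/||w||, so a smaller norm of w means a wider margin.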

Nonnegative Lagrange multipliers αi are introduced for each example. In this way, the primal Lagrangian gets the following form:

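$$ L_P = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{N} \alpha_i y_i \left( \mathbf{x}_i \cdot \mathbf{w} + b \right) + \sum_{i=1}^{N} \alpha_i \tag{8} $$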

Then, we have to minimize LP with respect to w and b, and at the same time maximize it with respect to all αi. This is a convex quadratic programming problem, since the objective function is itself convex and the points which satisfy the constraints form a convex set. This means that we can equivalently solve the following “dual” problem: maximize LP subject to the constraints that the gradient of LP with respect to w and b vanishes, and subject also to the constraints αi ≥ 0. This gives the conditions (9). Substituting (9) into (8) leads to (10). LP and LD are Lagrangians that arise from the same objective function, but under different constraints. In this way, the problem can be solved by minimizing LP or by maximizing LD.

$$ \mathbf{w} = \sum_{i} \alpha_i y_i \mathbf{x}_i, \qquad \sum_{i} \alpha_i y_i = 0 \tag{9} $$

$$ L_D = \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j \tag{10} $$

However, this algorithm is very sensitive to noisy examples. Therefore, positive slack variables ei, i = 1,…, N are introduced in (6), thus forming the constraints (11).

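$$
\begin{aligned}
\mathbf{x}_i \cdot \mathbf{w} + b &\ge +1 - e_i, && \text{for } y_i = +1 \\
\mathbf{x}_i \cdot \mathbf{w} + b &\le -1 + e_i, && \text{for } y_i = -1 \\
e_i &\ge 0, && \forall i
\end{aligned} \tag{11}
$$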

An extra cost for errors is assigned, so the objective function to be minimized is ||w||²/2 + C(Σi ei) instead of ||w||²/2. The parameter C is defined by the user and corresponds to the penalty given to the errors.

In order to generalize the above method to problems that are not linearly separable, the data should be mapped into another feature space H using a mapping Ф. If there is a kernel function K such that K(xi, xj) = Ф(xi) · Ф(xj), we only need to use K in the training, and never need to know Ф explicitly. One example of such a function is the Gaussian function defined by (12), where σ is the standard deviation.
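The role of C can be seen by evaluating the objective for a fixed hyperplane. The sketch below is not a training routine: it takes w and b as given and uses, for each example, the smallest slack that satisfies (11), namely ei = max(0, 1 - yi(xi·w + b)). The function name and the toy data are hypothetical.

```python
def soft_margin_objective(w, b, samples, labels, c):
    """Return (||w||^2 / 2 + C * sum(e_i), slacks) for a given hyperplane (w, b)."""
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))
    # Smallest slack allowed by (11) for each example; zero when the margin is respected.
    slacks = [max(0.0, 1.0 - y * (dot(x, w) + b)) for x, y in zip(samples, labels)]
    return 0.5 * dot(w, w) + c * sum(slacks), slacks

# Hypothetical 2D examples and hyperplane (illustration only).
toy_samples = [(2.0, 0.0), (0.5, 0.0), (-1.0, 0.0), (0.5, 0.0)]
toy_labels = [+1, +1, -1, -1]
toy_value, toy_slacks = soft_margin_objective((1.0, 0.0), 0.0, toy_samples, toy_labels, c=10.0)
```

In this toy case two examples violate the margin. A larger C makes such violations more expensive, so the optimizer prefers a narrower margin with fewer violations; a smaller C tolerates them in exchange for a wider margin.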

$$ K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left( -\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2} \right) \tag{12} $$
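A direct transcription of the Gaussian function shows how σ controls the similarity between two descriptors. This is a minimal sketch; the function name and the two toy vectors are hypothetical.

```python
import math

def gaussian_kernel(x_i, x_j, sigma):
    """Gaussian kernel: exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

# The same pair of hypothetical descriptors under three values of sigma.
u, v = (0.0, 0.0), (1.0, 1.0)
kernel_values = {sigma: gaussian_kernel(u, v, sigma) for sigma in (0.5, 1.0, 10.0)}
```

For a small σ the kernel value of two distinct descriptors is close to 0, so every training example is similar only to itself; for a large σ the value approaches 1 and distinct descriptors become hard to tell apart. The useful range of σ therefore depends on the scale of the descriptor values.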

In the test phase the sign of (13) is computed, where si corresponds to the support vectors. So, we can avoid computing Ф(x) explicitly and use K(si, x) = Ф(si ) · Ф(x).

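$$ f(\mathbf{x}) = \sum_{i=1}^{N} \alpha_i y_i \, \Phi(\mathbf{s}_i) \cdot \Phi(\mathbf{x}) + b = \sum_{i=1}^{N} \alpha_i y_i \, K(\mathbf{s}_i, \mathbf{x}) + b \tag{13} $$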
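The test phase can be sketched as a function that evaluates (13) and returns its sign. The kernel is passed in as a callable, which makes it visible that Ф itself is never needed. The support vectors, multipliers and bias below are made-up values chosen for illustration, not the result of training, and all names are hypothetical.

```python
import math

def decision_value(x, support_vectors, alphas, labels, b, kernel):
    """Evaluate f(x) = sum_i alpha_i * y_i * K(s_i, x) + b."""
    return sum(a * y * kernel(s, x)
               for s, a, y in zip(support_vectors, alphas, labels)) + b

def classify(x, support_vectors, alphas, labels, b, kernel):
    """Return +1 or -1 according to the sign of the decision value."""
    return 1 if decision_value(x, support_vectors, alphas, labels, b, kernel) >= 0 else -1

def toy_kernel(s, x, sigma=1.0):
    """Gaussian kernel with a fixed, hypothetical sigma."""
    return math.exp(-sum((p - q) ** 2 for p, q in zip(s, x)) / (2.0 * sigma ** 2))

# Hypothetical model with one support vector per class (illustration only).
toy_svs = [(0.0, 0.0), (2.0, 0.0)]
toy_alphas = [1.0, 1.0]
toy_ys = [-1, +1]
near_positive = classify((1.8, 0.0), toy_svs, toy_alphas, toy_ys, 0.0, toy_kernel)
near_negative = classify((0.1, 0.0), toy_svs, toy_alphas, toy_ys, 0.0, toy_kernel)
```

The cost of classifying one query grows with the number of support vectors and with the descriptor length, since each kernel evaluation touches every component of the descriptor.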

Although the SVM method was originally proposed as a binary classifier, there are many approaches that perform multi-class classification directly [25], but they are computationally much more expensive than solving several binary problems. On the other hand, many approaches decompose the multi-class problem into several binary problems, thus leading to a faster classifier.

One possible approach is one-against-all (OvA), where N separate classifiers are constructed, one for each of the N classes. In the i-th SVM classifier, the samples from the i-th class are taken as positive, while the samples from all other classes are taken as negative examples. In the test phase, the query sample is presented to all N SVMs and is classified according to the outputs of the N classifiers. The main disadvantage of this approach is that all examples are used in building all N classifiers, thus leading to complex classifiers.

On the other hand, the one-against-one (OvO) approach builds a separate classifier for each pair of classes, thus leading to N*(N-1)/2 classifiers. Each classifier is trained using the samples of the first class as positive and the samples of the second class as negative examples. The number of examples used for training each of the OvO classifiers is smaller, since only examples from two classes are considered. The main disadvantage of this method is that every test example has to be presented to N*(N-1)/2 classifiers, thus leading to slower testing, especially as the number of classes increases.

In this research, our dataset contains proteins from 150 classes, so we used the OvA approach leading to 150 SVMs, instead of 11175 classifiers in the OvO case.
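The classifier counts quoted above, and one common way of combining the OvA outputs, can be sketched as follows. Choosing the class whose binary SVM returns the largest decision value is an assumption made here for illustration; the text only states that the query is classified according to the outputs of the N classifiers. The function names and domain labels are hypothetical.

```python
def classifier_counts(n_classes):
    """Number of binary SVMs needed for (one-against-all, one-against-one)."""
    return n_classes, n_classes * (n_classes - 1) // 2

def predict_one_vs_all(decision_values):
    """Pick the class whose binary SVM gives the largest decision value (assumed rule)."""
    return max(decision_values, key=decision_values.get)

ova_count, ovo_count = classifier_counts(150)

# Hypothetical outputs of three of the binary SVMs for one query protein.
toy_outputs = {"domain_1": -0.8, "domain_2": 0.6, "domain_3": -0.1}
predicted = predict_one_vs_all(toy_outputs)
```

With 150 classes a query passes through 150 binary classifiers under OvA, compared with 11175 under OvO.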

III. EXPERIMENTAL RESULTS

We have implemented a system for protein classification based on the SVM method. A part of the SCOP 1.73 database was used. Our standard of truth data contains 6979 randomly selected protein chains from the 150 most populated SCOP protein domains. 90% of the data set serves as training data, while the other 10% serves as test data.

First, we examined the influence of the standard deviation σ on the classification accuracy of the SVM classifier based on the protein voxel based descriptor, on both the training set and the test set, see Table 1. The analysis showed that for small values of σ the training phase lasts longer and leads to over-fitting: the classifier achieves high accuracy when the training data are used in the test phase, but the accuracy on the test set gets worse as σ decreases (70.2% classification accuracy for σ = 1650). As σ increases, the accuracy on the training data decreases, because the classifier can no longer fit the data so closely, while the accuracy on the test set increases, reaching its highest value of 78.50% at σ = 8500 in Table 1 and falling back to 75.73% at σ = 20000.

Further, we examined the influence of the penalty given to the errors - c. The analysis is performed for the best values of σ according to Table 1. The experimental results presented in Table 2 show that the error penalty c has a minor influence on the classification accuracy. As it can be seen from Table 2, our approach using the protein voxel based descriptor achieves 78.83% classification accuracy for σ = 8350 and c = 50. The training phase lasts several minutes, while the test phase takes several seconds.

Next, we examined the influence of standard deviation σ on classification accuracy of our classifier based on the protein ray based descriptor. In Table3, the classification accuracy is presented by using different values for σ. As it can be seen from Table 3, our classifier achieves the highest precision for σ = 0.7 and σ = 1.

According to the training and test time when the protein ray based descriptor is used, we performed analysis on a smaller data set. 1875 training and 208 test proteins were selected from 50 SCOP domains. For σ = 1, the training phase took 44 seconds, while classification time for all test proteins was 3 seconds. For σ = 10, the training phase took 25 seconds, while classification time for all test proteins was 1 seconds. The training and test phases for σ = 1 are longer, but, as it can be seen from Table 3, using σ = 1 our classifier achieves higher precision of 98.7%.

TABLE I. THE INFLUENCE OF THE STANDARD DEVIATION ON THE CLASSIFICATION ACCURACY (%) USING THE VOXEL BASED DESCRIPTOR

| Standard deviation (σ) | Training set | Test set |
|---|---|---|
| 1650 | / | 70.20 |
| 3000 | 99.98 | 74.76 |
| 4000 | 99.98 | 75.09 |
| 5000 | 99.98 | 76.06 |
| 6000 | 99.98 | 76.06 |
| 7000 | 99.44 | 77.04 |
| 7500 | 99.04 | 77.04 |
| 8000 | 98.48 | 78.01 |
| 8500 | 98.14 | 78.50 |
| 9000 | 97.69 | 77.69 |
| 10000 | 96.76 | 78.34 |
| 12000 | 94.67 | 78.18 |
| 15000 | 91.72 | 76.88 |
| 20000 | 76.68 | 75.73 |

TABLE II. THE INFLUENCE OF THE ERROR PENALTY ON THE CLASSIFICATION ACCURACY (%) USING THE VOXEL BASED DESCRIPTOR

| Error penalty (c) | σ = 8350 | σ = 8400 | σ = 8450 | σ = 8500 |
|---|---|---|---|---|
| 10 | 76.71 | 76.71 | 76.71 | 76.55 |
| 25 | 77.69 | 77.36 | 77.36 | 77.36 |
| 35 | 77.69 | 77.85 | 77.85 | 77.85 |
| 40 | 78.18 | 78.34 | 78.18 | 78.18 |
| 45 | 78.50 | 78.66 | 78.50 | 78.34 |
| 50 | 78.83 | 78.66 | 78.66 | 78.66 |
| 55 | 78.34 | 78.50 | 78.66 | 78.66 |
| 60 | 78.01 | 78.01 | 78.18 | 78.34 |
| 75 | 78.18 | 78.18 | 78.01 | 78.34 |

TABLE III. THE INFLUENCE OF THE STANDARD DEVIATION ON THE CLASSIFICATION ACCURACY (%) USING THE RAY BASED DESCRIPTOR

| Standard deviation (σ) | 0.5 | 0.7 | 0.8 | 0.9 | 1 | 2 | 3 | 5 | 10 |
|---|---|---|---|---|---|---|---|---|---|
| Accuracy (%) | 98.37 | 98.70 | 96.91 | 96.74 | 98.70 | 98.37 | 98.37 | 97.56 | 96.74 |

Our previous analysis [19] showed that the average retrieval precision of the protein voxel based descriptor (77.8%) is lower than the average retrieval precision of the protein ray based descriptor (92.9%).

The analyses made in this research showed that the protein ray based descriptor is also much better for protein classification. In addition to being more accurate, the SVM classifier using the protein ray based descriptor is faster, owing to the lower dimensionality of the protein ray based descriptor (64 instead of 450 features).
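The link between dimensionality and speed can be made concrete with a rough operation count. The sketch below assumes a Gaussian kernel and counts only the arithmetic needed for the squared Euclidean distance between two descriptors; the function name is illustrative, and the actual training time also depends on the number of support vectors and on the solver.

```python
def rbf_kernel_operations(dim):
    """Rough arithmetic-operation count for one Gaussian kernel evaluation
    between two dim-dimensional vectors: dim subtractions, dim squarings
    and dim - 1 additions. The final scaling and exponential are a
    constant cost and are ignored here."""
    return dim + dim + (dim - 1)


ray_dim, voxel_dim = 64, 450  # descriptor sizes stated in the text

ratio = rbf_kernel_operations(voxel_dim) / rbf_kernel_operations(ray_dim)
print(f"voxel / ray cost per kernel evaluation: about {ratio:.2f}x")
```

Under these assumptions each kernel evaluation on the voxel based descriptor costs roughly seven times as much as on the ray based descriptor.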

IV. CONCLUSION

In this paper we presented two approaches for classifying protein tertiary structures based on the Support Vector Machines (SVM) method. Our protein voxel and ray based descriptors were used to map the protein structures into the feature space. After this transformation of the protein structures into the feature space, a separate SVM classifier is built for each SCOP domain.

A part of the SCOP 1.73 database was used to evaluate our classifiers. We investigated the influence of the standard deviation on the classification accuracy. Further analysis could be carried out to find the optimal value of the standard deviation, and the standard deviation could also be adjusted automatically.

We have already investigated the protein voxel based descriptor whose average precision is 77.8%, and the protein ray based descriptor whose average precision is 92.9%. Our SVM classifier based on the protein ray based descriptor achieved higher classification accuracy (98.7%) than our SVM classifier based on the protein voxel based descriptor (78.83%). Also, due to the lower dimensionality of the ray descriptor (64 instead of 450 features), the ray based classifier is faster than the voxel based classifier.

Compared to other classification algorithms with comparable accuracy, our approach is much faster. Therefore, the proposed approach could speed up the process of protein classification, so that the number of known protein structures that have not yet been classified could be reduced.

In the future, we plan to investigate other feature extraction methods and other classification methods.

REFERENCES

[1] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne, “The Protein Data Bank,” Nucleic Acids Research, vol. 28, pp. 235-242, 2000.

[2] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, “Scop: a structural classification of proteins database for the investigation of sequences and structures,” Journal of Molecular Biology, vol. 247, pp. 536-540, 1995.

[3] C. A. Orengo, A. D. Michie, D. T. Jones, M. B. Swindells, and J. M. Thornton, “CATH - A hierarchic classification of protein domain structures,” Structure, vol. 5, no. 8, pp. 1093-1108, 1997.

[4] L. Holm, and C. Sander, “The FSSP Database: Fold Classification Based on Structure-Structure Alignment of Proteins,” Nucleic Acids Research, vol. 24, pp. 206-210, 1996.

[5] S. B. Needleman, and C. D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins,” Journal of Molecular Biology, vol. 48, no. 3, pp. 443-453, 1970.

[6] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic local alignment search tool,” Journal of Molecular Biology, vol. 215, no. 3, pp. 403-410, 1990.

[7] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs,” Nucleic Acids Research, vol. 25, no. 17, pp. 3389-3402, 1997.

[8] I. N. Shindyalov, and P. E. Bourne, “Protein structure alignment by incremental combinatorial extension (CE) of the optimal path,” Protein Engineering, vol. 9, pp. 739-747, 1998.

[9] A. R. Ortiz, C. E. Strauss, and O. Olmea, “Mammoth: An automated method for model comparison,” Protein Science, vol. 11, pp. 2606-2621, 2002.

[10] L. Holm, and C. Sander, “Protein structure comparison by alignment of distance matrices,” Journal of Molecular Biology, vol. 233, pp. 123-138, 1993.

[11] S. Cheek, Y. Qi, S. S. Krishna, L. N. Kinch, and N. V. Grishin, “SCOPmap: Automated assignment of protein structures to evolutionary superfamilies,” BMC Bioinformatics, vol. 5, pp. 197-221, 2004.

[12] R. Sadreyev, and N. Grishin, “COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance,” Journal of Molecular Biology, vol. 326, pp. 317-336, 2003.

[13] T. Madej, J. F. Gibrat, and S. H. Bryant, “Threading a database of protein cores,” Proteins, vol. 23, pp. 356-369, 1995.

[14] L. Holm, and C. Sander, “Dali: a network tool for protein structure comparison,” Trends in Biochemical Science, vol. 20, pp. 478-480, 1995.

[15] C. H. Tung, and J. M. Yang, “fastSCOP: a fast web server for recognizing protein structural domains and SCOP superfamilies,” Nucleic Acids Research, vol. 35, pp. W438-W443, 2007.

[16] J. M. Yang, and C. H. Tung, “Protein structure database search and evolutionary classification,” Nucleic Acids Research, vol. 34, no. 13, pp. 3646-3659, 2006.

[17] P. H. Chi, Efficient protein tertiary structure retrievals and classifications using content based comparison algorithms, PhD thesis, University of Missouri-Columbia, 2007.

[18] K. Marsolo, S. Parthasarathy, and C. Ding, “A Multi-Level Approach to SCOP Fold Recognition,” IEEE Symposium on Bioinformatics and Bioeng., pp. 57-64, 2005.

[19] G. Mirceva, S. Kalajdziski, K. Trivodaliev, and D. Davcev, “Comparative analysis of three efficient approaches for retrieving protein 3D structures,” 4-th Cairo International Biomedical Engineering Conference (CIBEC 2008), Cairo, Egypt, 2008.

[20] V. Vapnik, The Nature of Statistical Learning Theory, 2nd ed., Springer, New York, 1999.

[21] D. V. Vranic, 3D Model Retrieval, Ph.D. Thesis, University of Leipzig, 2004.

[22] P. Daras, D. Zarpalas, A. Axenopoulos, D. Tzovaras, and M. G. Strintzis, “Three-Dimensional Shape-Structure Comparison Method for Protein Classification,” IEEE/ACM Transactions on computational biology and bioinformatics, vol. 3, no. 3, pp. 193-207, 2006.

[23] D.V. Vranic, “An improvement of Ray-Based Shape Descriptor,” 8-th Leipziger Informatik-Tage (LIT'2M), W. Wittig and S. Eds., Leipzig, Germany, HTWK Leipzig, pp. 55-58, 2000.

[24] C. J. C. Burges, “A tutorial on support vector machine for pattern recognition,” Data Min. Knowl. Disc., vol. 2, no. 2, pp. 121-167, 1998.

[25] J. Weston, and C. Watkins, “Multi-class support vector machines,” Proceedings of ESANN99, M. Verleysen, Ed., Brussels, Belgium, 1999.
