[IEEE 2007 Frontiers in the Convergence of Bioscience and Information Technologies - Jeju City, South Korea (2007.10.11-2007.10.13)] 2007 Frontiers in the Convergence of Bioscience

Protein Classification by Matching 3D Structures

Slobodan Kalajdziski1, Georgina Mirceva2, Kire Trivodaliev3, Danco Davcev4
University Ss. Cyril & Methodius, Faculty of Electrical Engineering and Information Technologies, Skopje, Macedonia
{1skalaj, 2Georgina.Mirceva, 3kiret, 4etfdav}@etf.ukim.edu.mk

Abstract

In this paper, a 3D structure-based approach is presented for the efficient classification of protein molecules. The method relies on the geometric 3D structure of the proteins. After proper positioning of the 3D structures, the Spherical Trace Transform is applied to them to produce geometry-based descriptors, which are completely rotation invariant. Additionally, some biological properties of the protein are added to the geometry-based descriptor, thus forming a better integrated descriptor. We have used nearest neighbour classification on the previously extracted descriptors. A part of the FSSP/DALI database was used to evaluate the classification accuracy of this system. The results show that this method achieves more than 92 percent classification accuracy while being simpler and faster than the DALI method. We provide some experimental results of the implemented system.

1. Introduction

The structure of a protein molecule is the main factor that determines its chemical properties as well as its function. All information required for a protein to fold into its natural structure is coded in its amino acid sequence. Therefore, the 3D representation of a residue sequence and the way this sequence folds in 3D space are very important. The 3D protein structures are stored in the world-wide repository Protein Data Bank (PDB) [2], [3], which is the primary repository for experimentally determined protein structures. With advances in technology, the number of known 3D protein structures increases every day. As the number and variety of proteins continue to grow, there has been increasing interest in applications that help navigate these large databases. There are various methods for protein retrieval according to structure.

In [4], the ray-based method as a 3D-model retrieval technique is introduced. Silhouette-based feature vector, depth buffer-based feature vector, volume-based feature vector and voxel-based feature vector are presented in [1].

A geometric hashing method that performs protein surface matching to identify similar binding sites is presented in [5]. Two techniques, α-hull and 3D reference frames, are adopted to reduce the computational complexity.

In [6], another protein structure retrieval system is proposed. The authors constructed an indexing structure to avoid exhaustive chain structure alignments. Signatures are extracted from 2D distance matrices.

The algorithms in [1] and [4] are applicable to any kind of object (airplanes, human bodies, animals, etc.), but they have not yet been applied to protein 3D structure retrieval. We have therefore used one of them to build a protein structure retrieval and classification system.

In [7], a new scheme for automatic classification of 3D protein structures is presented. It is a dedicated and unified multiclass classification scheme. Neither detailed structural alignment nor multiple binary classifications are required in this scheme. A nearest neighbour-based classification strategy has been adopted, together with a filter-and-refine scheme. The proposed method is compared against two other dedicated protein structure classification schemes, SGM and CPMine. It is also compared against a DALI structural alignment-based classification scheme, which is more accurate but extremely slow.

Protein classification plays a central role in understanding the function of a protein molecule with respect to all known proteins in a database. With the rapid increase in the number of new proteins, the need for automated and accurate methods for protein classification is increasingly important.

In [8], nine different protein classification methods are included in a performance analysis. The nine methods are the profile-HMM, support vector machines (SVMs) with four different kernel functions (linear, polynomial, sigmoid, and radial basis functions), SVM-pairwise, SVM-Fisher, decision trees and boosted decision trees. Various statistics are used to analyze the classification performance: accuracy of each method, cross-validation test, minimum error point calculation, maximum and median rates of false positives, and receiver operating characteristic graphs. Using these statistics, the performance of each classifier is examined in detail, not only in terms of accuracy rates, but also sensitivities, specificities, and the relationships among these statistics.

0-7695-2999-2/07 $25.00 © 2007 IEEE. DOI 10.1109/FBIT.2007.55

In this paper we present a system for classification of protein molecules from an existing protein database. The PDB file contains information about the 3D structure of the protein. After proper positioning of the structures, the Spherical Trace Transform is applied to them to produce descriptor vectors, which are completely rotation invariant. We have applied the method given in [1] to extract the geometry descriptor. Additionally, biological properties of the protein are taken as in [9], forming a better integrated descriptor.

Many algorithms are used for protein classification, such as the Naive Bayesian classifier, the nearest neighbour classifier, decision trees and so on. In our approach, we have used nearest neighbour classification [10] on the previously extracted descriptors. Our classification algorithm is evaluated against the DALI method [11].

The proposed research method is given in Section 2, while Section 3 presents the experimental results and the evaluation of the system. Section 4 concludes the paper.

2. Research Method

Our goal is to provide a system that allows users to retrieve and classify protein structures. Protein retrieval has two phases, an offline phase and an online phase, which are illustrated in Figure 1 and Figure 2.

The offline phase refers to the process of filling the database with protein data together with their respective descriptors. The user uploads the PDB file of the new protein. The information from the PDB file is processed (triangulation, voxelization), and then the 3D descriptor is extracted. A thumbnail of the protein is also generated.

The online phase refers to the process of protein retrieval. The user uploads the PDB file of the query protein. Its descriptor is then generated in the same way as in the offline phase and compared with the stored descriptors of the proteins in the database. Finally, a list of results ordered by Euclidean distance, which is used as the retrieval metric, is shown. The number of proteins shown is limited by a threshold defined by the user. Additionally, the protein is classified according to the k-nearest neighbor method.

Figure 1. Offline phase of retrieval.

Figure 2. Online phase of retrieval.

2.1. Triangulation and voxelization

The information about protein structure is stored in PDB files. The structure of a PDB file is shown in Figure 3.

Figure 3. A PDB file.


PDB files contain information about the 3D structure of proteins and their biological properties. For each atom of the protein, the coordinates of its center and the type of the atom are given. Information about the amino acid sequence, helices, sheets, turns and some other features is also contained in the PDB file. The PDB format is explained at [3].
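As a rough illustration of the kind of information read from a PDB file, the following minimal sketch extracts atom coordinates and counts secondary-structure records. It relies on the fixed-column layout of ATOM/HETATM records in the PDB format (x, y, z in columns 31-54, element symbol in columns 77-78). The function names `parse_atoms` and `count_secondary` are hypothetical and are not taken from the implemented system; counting one record per helix, strand or turn is a simplifying assumption.

```python
from typing import Dict, List, Tuple


def parse_atoms(pdb_text: str) -> List[Tuple[str, float, float, float]]:
    """Return (element, x, y, z) for every ATOM/HETATM record."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM  ", "HETATM")):
            # Fixed-width columns of the PDB format (0-based slices).
            x = float(line[30:38])
            y = float(line[38:46])
            z = float(line[46:54])
            # Crude fallback (assumption): first letter of the atom name
            # when the element columns are missing.
            element = line[76:78].strip() or line[12:16].strip()[:1]
            atoms.append((element, x, y, z))
    return atoms


def count_secondary(pdb_text: str) -> Dict[str, int]:
    """Count HELIX, SHEET and TURN records (one record per element)."""
    counts = {"HELIX": 0, "SHEET": 0, "TURN": 0}
    for line in pdb_text.splitlines():
        key = line[:6].strip()
        if key in counts:
            counts[key] += 1
    return counts
```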

Since the exact 3D position of each atom and its radius are known (from the PDB file), each atom may be represented by a sphere. The surface of each sphere is triangulated, so that a sphere consists of a small set of vertices and a set of connections between them. A protein is thus comprised of a set of spheres, along with the corresponding vertices and the connections among them. Then, the center of mass is calculated and the protein is translated so that the new center of mass is at the origin. The distance $d_{max}$ between the new origin and the most distant vertex is computed and the protein is scaled so that $d_{max} = 1$. In this way, we provide translation and scale invariance.
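The translation and scaling step can be sketched as follows. This is a minimal sketch, assuming NumPy and assuming that the center of mass is approximated by the unweighted mean of the mesh vertices (the system may weight by surface area or atomic mass instead); `normalize_pose` is a hypothetical name.

```python
import numpy as np


def normalize_pose(vertices: np.ndarray) -> np.ndarray:
    """Translate vertices so their mean is at the origin, then scale so
    that the most distant vertex lies at distance 1 (d_max = 1)."""
    # Assumption: unweighted vertex mean stands in for the center of mass.
    centered = vertices - vertices.mean(axis=0)
    d_max = np.linalg.norm(centered, axis=1).max()
    if d_max == 0.0:
        raise ValueError("all vertices coincide; cannot scale")
    return centered / d_max
```

After this step every vertex lies inside the unit ball, so a voxel grid over the cube [-1, 1]^3 is guaranteed to cover the whole model.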

After triangulation, we perform voxelization. Voxelization transforms the continuous 3D space into a discrete 3D voxel space. It proceeds in three steps: discretization, sampling, and storing. Discretization divides the continuous 3D space into voxels. During sampling, a value $v_{abc}$ is assigned to each voxel depending on the positions of the polygons of the 3D mesh model. Usually $v_{abc}$ is a scalar, and the grid is either binary or real-valued. We used a real-valued voxel grid, where $v_{abc}$ is the fraction of the total surface area $S$ of the mesh that lies inside the cuboid region $\mu_{abc}$ (1); here $I$ denotes the set of points on the mesh surface.

$$v_{abc} = \frac{\operatorname{area}\{\mu_{abc} \cap I\}}{S}, \qquad 0 \le a, b, c \le N - 1. \tag{1}$$

Each triangle $T_j$ of a model is subdivided into $p_j^2$ congruent triangles, each of which has surface area $\delta = S_j / p_j^2$, where $S_j$ is the area of $T_j$. If all vertices of the triangle $T_j$ lie in the same cuboid region $\mu_{abc}$, we set $p_j = 1$; otherwise we use (2) to determine the value of $p_j$, as in [1].

$$p_j = \left\lceil \sqrt{p_{\min} \frac{S_j}{S}} \right\rceil \tag{2}$$

For each newly obtained triangle, the center of gravity $G$ is computed and the voxel $\mu_{abc}$ containing it is determined. Finally, the attribute $v_{abc}$ is incremented by $\delta$. The quality of the approximation is controlled by the parameter $p_{\min}$. In our implementation we have set $p_{\min} = 32000$.
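The sampling procedure can be sketched as follows. This is a minimal sketch rather than the implemented system: it assumes NumPy, a mesh already normalized into the cube [-1, 1]^3, and it folds the division by the total area S into the increment so that the grid holds area fractions as in (1). The names `voxel_index` and `voxelize` are hypothetical.

```python
import math
import numpy as np


def voxel_index(points: np.ndarray, n: int) -> np.ndarray:
    """Map coordinates in [-1, 1] to integer voxel indices in [0, n-1]."""
    idx = np.floor((points + 1.0) * 0.5 * n).astype(int)
    return np.clip(idx, 0, n - 1)


def voxelize(vertices: np.ndarray, triangles: np.ndarray,
             n: int = 16, p_min: int = 32000) -> np.ndarray:
    """Real-valued voxel grid: each cell holds the fraction of the total
    mesh surface area whose sub-triangle centroids fall into that cell."""
    grid = np.zeros((n, n, n))
    tri = vertices[triangles]                      # shape (T, 3, 3)
    u = tri[:, 1] - tri[:, 0]
    v = tri[:, 2] - tri[:, 0]
    areas = 0.5 * np.linalg.norm(np.cross(u, v), axis=1)
    total = areas.sum()
    if total == 0.0:
        raise ValueError("mesh has zero surface area")
    for t in range(len(tri)):
        s_j = areas[t]
        if s_j == 0.0:
            continue
        corners = voxel_index(tri[t], n)
        same_cell = bool((corners == corners[0]).all())
        p = 1 if same_cell else math.ceil(math.sqrt(p_min * s_j / total))
        delta = (s_j / total) / p ** 2             # area fraction per sub-triangle
        for i in range(p):
            for j in range(p - i):
                # One "upright" sub-triangle, plus an "inverted" one
                # wherever it fits; p*p sub-triangles in total.
                offsets = [(i + 1 / 3, j + 1 / 3)]
                if i + j <= p - 2:
                    offsets.append((i + 2 / 3, j + 2 / 3))
                for s, r in offsets:
                    g = tri[t, 0] + (s * u[t] + r * v[t]) / p   # centroid G
                    a, b, c = voxel_index(g, n)
                    grid[a, b, c] += delta
    return grid
```

Because every sub-triangle contributes exactly its share of the surface, the grid values sum to 1 regardless of the chosen `p_min`; a larger `p_min` only distributes that area more finely.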

2.2. 3D descriptor extraction

We have used the voxel-based algorithm presented in [1] to extract the geometry descriptor, with the Euclidean distance as the comparison metric. In [1], this method is proposed for any kind of object, but it had not yet been applied to protein 3D structure retrieval. We have used it to build a protein 3D structure retrieval system.

The information contained in a voxel grid can be processed further to obtain both correlated information and more compact representation of voxel attributes as a feature. We applied the 3D Discrete Fourier Transform (3D-DFT) to obtain a spectral domain feature vector which also provides rotation invariance of the descriptor.

A 3D array of complex numbers $F = [f_{abc}]$ is transformed into another 3D array by (3).

$$f'_{pqs} = \frac{1}{\sqrt{MNP}} \sum_{a=0}^{M-1} \sum_{b=0}^{N-1} \sum_{c=0}^{P-1} f_{abc}\, e^{-2\pi j\,(ap/M + bq/N + cs/P)} \tag{3}$$

Since we apply the 3D-DFT to a voxel grid with real-valued attributes, we shift the indices so that $(a, b, c)$ is translated into $(a - M/2,\ b - N/2,\ c - P/2)$. Let $M = N = P$; we introduce the abbreviation (4).

$$v'_{a-M/2,\; b-N/2,\; c-P/2} \equiv v_{abc} \tag{4}$$

Thus, the origin $(0, 0, 0)$ is shifted to $(N/2, N/2, N/2)$. Accordingly, (3) takes the form (5).

$$f'_{pqs} = \frac{1}{\sqrt{N^3}} \sum_{a=-N/2}^{N/2-1} \sum_{b=-N/2}^{N/2-1} \sum_{c=-N/2}^{N/2-1} v'_{abc}\, e^{-2\pi j\,(ap + bq + cs)/N} \tag{5}$$

We take magnitudes of low-frequency coefficients as components of the vector. Since the 3D-DFT input is a real-valued array, the symmetry is present among obtained coefficients, so the feature vector is formed from all non-symmetrical coefficients (6).

$$1 \le |p| + |q| + |s| \le k \le N/2 \tag{6}$$

We normalize $f'_{pqs}$ by dividing by $|f'_{000}|$. Then, we form the feature vector from the scaled values of $f'_{pqs}$. This vector represents the geometrical properties of the protein.
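The descriptor extraction can be sketched with NumPy's FFT routines. This is a minimal sketch under stated assumptions: a cubic grid with even N, and an interpretation of "non-symmetrical coefficients" as keeping one coefficient from each conjugate pair (p, q, s) and (-p, -q, -s), which have equal magnitude for real input. The name `dft_descriptor` is hypothetical.

```python
import numpy as np


def dft_descriptor(grid: np.ndarray, k: int = 3) -> np.ndarray:
    """Magnitudes of low-frequency 3D-DFT coefficients with
    1 <= |p|+|q|+|s| <= k, normalized by the magnitude of f'_000."""
    n = grid.shape[0]
    if grid.shape != (n, n, n) or k > n // 2:
        raise ValueError("expected a cubic grid and k <= N/2")
    # ifftshift moves the grid centre (N/2, N/2, N/2) to index 0, which
    # corresponds to the index shift in (4); it only changes phases.
    spectrum = np.fft.fftn(np.fft.ifftshift(grid)) / np.sqrt(n ** 3)
    dc = abs(spectrum[0, 0, 0])
    if dc == 0.0:
        raise ValueError("empty voxel grid")
    feats = []
    for p in range(-k, k + 1):
        for q in range(-k, k + 1):
            for s in range(-k, k + 1):
                if not 1 <= abs(p) + abs(q) + abs(s) <= k:
                    continue
                # Keep one representative of each conjugate pair
                # (assumption about which coefficients are retained).
                if (p, q, s) < (0, 0, 0):
                    continue
                feats.append(abs(spectrum[p, q, s]) / dc)
    return np.array(feats)
```

With this selection rule, k = 1 yields 3 components and k = 2 yields 12. Because only magnitudes are kept and the vector is divided by the zero-frequency magnitude, the descriptor does not change when the grid is multiplied by a constant or cyclically shifted.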

Additionally, characteristic attributes of the primary and secondary structure of the protein molecules are extracted, forming attribute-based descriptor vectors as in [9]. More specifically, concerning the primary structure, the ratio of the amino acids' occurrences, the hydrophobic amino acids ratio and the ratio of the helix types' occurrences in a protein are calculated. Concerning the secondary structure, the numbers of helices, sheets and turns in a protein are also calculated. These features and the weights assigned to them are listed in Table 1.

Table 1. Structural features and their weights.

Feature                        Weight
Secondary structure features
  Number of HELICES            1 %
  Number of SHEETS             1 %
  Number of TURNS              1 %
Primary structure features
  Hydrophobic residue ratio    6 %
  Helix type                   1 %
  Residue ratio                90 %
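As an illustration of the primary-structure attributes, the residue ratios and the hydrophobic residue ratio can be computed from a one-letter amino acid sequence as sketched below. The set of residues treated as hydrophobic differs between hydrophobicity scales; the set used here is an assumption for illustration only, and `primary_features` is a hypothetical name.

```python
from collections import Counter
from typing import List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: one common choice of hydrophobic residues; [9] may differ.
HYDROPHOBIC = set("AVLIMFWC")


def primary_features(sequence: str) -> Tuple[List[float], float]:
    """Return (20 residue ratios, hydrophobic residue ratio)."""
    seq = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not seq:
        raise ValueError("sequence contains no standard residues")
    counts = Counter(seq)
    ratios = [counts[a] / len(seq) for a in AMINO_ACIDS]
    hydrophobic = sum(counts[a] for a in HYDROPHOBIC) / len(seq)
    return ratios, hydrophobic
```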

2.3. Retrieval according to Euclidean distance

In the process of retrieval, the 3D descriptor of the query protein structure is generated. It is then compared with the descriptors of the proteins stored so far according to their Euclidean distance.

The geometrical descriptors are compared in pairs by using (7), as in [1].

$$D_G = \min_{\alpha \in \mathbb{R}} d_p(f'_g, \alpha f''_g) = \min_{\alpha \in \mathbb{R}} \lVert f'_g - \alpha f''_g \rVert_p \tag{7}$$

For $p = 2$, the parameter $\alpha$ is computed by using (8).

$$\alpha = \frac{f'_g \cdot f''_g}{\lVert f''_g \rVert^2} \tag{8}$$

This parameter is calculated for each pair of vectors and is then used in (7).
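For p = 2, the minimization in (7) is a one-dimensional least-squares problem whose solution is the value of α in (8). A minimal sketch, assuming NumPy and using the hypothetical name `scaled_l2_distance`:

```python
import numpy as np


def scaled_l2_distance(f1: np.ndarray, f2: np.ndarray) -> tuple:
    """Return (D_G, alpha) with alpha minimizing ||f1 - alpha * f2||_2."""
    denom = float(np.dot(f2, f2))
    # Assumption: alpha is taken as 0 when f2 is the zero vector.
    alpha = float(np.dot(f1, f2)) / denom if denom > 0.0 else 0.0
    return float(np.linalg.norm(f1 - alpha * f2)), alpha
```

For example, two descriptors that differ only by a scale factor, such as (2, 4) and (1, 2), give α = 2 and a distance of 0.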

The structural similarity is evaluated by (9), in which different weights $W_i$ (see Table 1) are assigned to the attributes.

$$D_S = \sum_{i=1}^{34} W_i \left[ f'_s(i) - f''_s(i) \right]^2 \tag{9}$$

The overall similarity is determined by (10). As can be seen from (10), our algorithm is mainly based on geometrical features (90%) rather than structural features (10%).

$$D = k_1 D_G + k_2 D_S, \qquad k_1 = 90\%, \quad k_2 = 10\%. \tag{10}$$
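The weighted structural distance (9) and the overall measure (10) can be sketched as follows, assuming NumPy and per-attribute weight vectors prepared by the caller. The function names are hypothetical, and the way the Table 1 weights are spread over the 34 attributes is not specified here.

```python
import numpy as np


def structural_distance(s1: np.ndarray, s2: np.ndarray,
                        weights: np.ndarray) -> float:
    """Weighted sum of squared attribute differences, as in (9)."""
    return float(np.sum(weights * (s1 - s2) ** 2))


def overall_distance(d_g: float, d_s: float,
                     k1: float = 0.9, k2: float = 0.1) -> float:
    """Combined measure D = k1 * D_G + k2 * D_S, as in (10)."""
    return k1 * d_g + k2 * d_s


def rank_by_distance(distances: dict, limit: int) -> list:
    """Return up to `limit` (protein id, D) pairs, nearest first."""
    return sorted(distances.items(), key=lambda item: item[1])[:limit]
```

Here `limit` plays the role of the user-defined threshold on the number of proteins shown in the result list.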

Using the overall similarity measure, the distances between the descriptor of the query protein and the descriptors of the proteins in the database are calculated, and a list of results ordered by increasing distance is returned.

2.4. Distance-Weighted k-Nearest Neighbor Classification

There are many classification methods, such as the Naive Bayesian classifier, the nearest neighbour classifier, decision trees and so on. We have used the nearest neighbour classifier [10] for the classification of proteins.
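A distance-weighted k-nearest-neighbour vote, in which each of the k nearest neighbours votes for its class with a weight equal to the inverse square of its distance, can be sketched as follows. The small constant `eps`, which avoids division by zero for an exact match, and the name `weighted_knn` are assumptions for illustration.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Tuple


def weighted_knn(neighbours: Iterable[Tuple[float, Hashable]],
                 k: int, eps: float = 1e-12) -> Hashable:
    """Classify a query from (distance, class label) pairs using the
    k nearest neighbours with inverse-square-distance weights."""
    nearest = sorted(neighbours, key=lambda t: t[0])[:k]
    if not nearest:
        raise ValueError("no neighbours given")
    votes = defaultdict(float)
    for dist, label in nearest:
        votes[label] += 1.0 / (dist * dist + eps)
    return max(votes, key=votes.get)
```

With neighbours at distances 0.1 (class A), 0.5 (class B) and 0.6 (class B) and k = 3, a plain majority vote would choose B, whereas the weighted vote chooses A because the single very close neighbour outweighs the two more distant ones.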

As the name indicates, the k nearest neighbours of the query object q are used to determine the class of q. Different weights are assigned to the neighbours based on their distance from the query point (the inverse square of the distance is used as the weight). Thus, the effectiveness depends on the number k as well as on the weighting of the k neighbours.

3. Implementation, Experiments and Evaluation of the system

We have implemented a web-based system for protein classification. The system is built with Microsoft Visual Studio.NET 2005, while the data is stored in a SQL Server 2005 database. The database stores 1065 proteins divided into 26 protein classes. Some results are shown in Figure 4 and Figure 5. The first protein shown in Figure 4 (No. 1) is the most similar protein of the first class; the second one is the most similar protein of the second class, and so on. Only the classes to which the first k proteins belong are shown.

Figure 4. Experimental result.


Figure 5. Experimental result.

The FSSP/DALI database has been constructed based on the premise that proteins with at least 25 percent similarity in their amino acid sequence should belong to the same class even with dissimilar geometry. We have compared our results with the results of the DALI Web Server [11] and obtained more than 92% classification accuracy. Only 4 of the 50 randomly selected proteins were misclassified. Further analysis showed that our method, which is mainly based on geometrical features (90%) rather than structural features (10%), is faster (a few seconds) than the DALI method, which takes much longer (minutes to hours), while still giving satisfactory results.

Figure 6 and Figure 7 present the time taken for classification of the 50 randomly selected proteins by our system and by the DALI system. Note that the two figures use different time scales.

[Plot: classification time (sec), axis range 0 to 14, against experiment No.]

Figure 6. Classification time with our system.

[Plot: classification time (sec), axis range 0 to 45000, against experiment No.]

Figure 7. Classification time with the DALI system.

The ratio between the times taken by our system and by the DALI system is presented in Figure 8.

[Plot: share of classification time, 0% to 100%, against experiment No.; series: DALI system, our system.]

Figure 8. The ratio of times for classification with our system and the DALI system.

For example, for the first selected protein our system takes 6.265 sec, while the DALI system takes 91 sec. The corresponding ratio is 6.265 sec / (6.265 + 91) sec = 6.44%. For the other proteins, the ratio was calculated in the same way, as shown in Figure 8.
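The share plotted in Figure 8 is the time of our system divided by the sum of both times. The calculation for the first selected protein can be reproduced with a few lines; `time_share` is a hypothetical helper name.

```python
def time_share(ours_sec: float, dali_sec: float) -> float:
    """Fraction of the combined classification time spent by our system."""
    return ours_sec / (ours_sec + dali_sec)


# First selected protein: 6.265 sec (our system) versus 91 sec (DALI).
print(f"{time_share(6.265, 91):.2%}")  # prints 6.44%
```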

As can be seen, the DALI algorithm takes much more time than our system. The classification time of the DALI algorithm varies with the length of the amino acid sequence, whereas the classification times of our system are closer to one another.

4. Conclusion

We have presented a system for the classification of protein molecules that uses information about both their 3D structure and their biological properties. We have applied the voxel-based method to generate the geometry descriptor. Additionally, characteristic attributes of the primary and secondary structure of the protein molecules were extracted, forming attribute-based descriptor vectors. In this way, we produced better integrated protein descriptors.

A part of the FSSP/DALI database, which provides a structural classification of the proteins, was used to evaluate the classification. The results show that this method achieves more than 92 percent classification accuracy while being much simpler and faster (a few seconds) than the DALI method (minutes to hours).

Our future work will concentrate on increasing the efficiency of the algorithm by dynamically changing the parameter k, so that a minimal number of neighbors is taken. This will speed up the classification process. We will also investigate new 3D descriptors and incorporate additional characteristics into the descriptors.

5. References

[1] D. V. Vranic, "3D Model Retrieval", Ph.D. Thesis, University of Leipzig, 2004.
[2] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne, "The Protein Data Bank", Nucleic Acids Research, vol. 28, pp. 235-242, 2000.
[3] http://www.rcsb.org
[4] D. V. Vranic, "An improvement of Ray-Based Shape Descriptor", in Proceedings of the 8. Leipziger Informatik-Tage (LIT'2M), W. Wittig and S. Eds., Leipzig, Germany, September 2000, pp. 55-58, HTWK Leipzig.
[5] Shann-Ching Chen and Tsuhan Chen, "Protein Retrieval by Matching 3D Surfaces", GENSIPS 2002, Raleigh, North Carolina, USA, October 2002.
[6] Pin-Hao Chi, Grant Scott, Chi-Ren Shyu, "A Fast Protein Structure Retrieval System Using Image-Based Distance Matrices and Multidimensional Index", p. 522, Fourth IEEE Symposium on Bioinformatics and Bioengineering (BIBE'04), 2004.
[7] Zeyar Aung, Kian-Lee Tan, "Automatic 3D Protein Structure Classification without Structural Alignment", Journal of Computational Biology, vol. 12, no. 9, pp. 1221-1241, 2005.
[8] Pooja Khati, "Comparative analysis of protein classification methods", Masters Thesis, University of Nebraska, Lincoln, December 2004.
[9] Petros Daras, Dimitrios Zarpalas, Apostolos Axenopoulos, Dimitrios Tzovaras, and Michael Gerassimos Strintzis, "Three-Dimensional Shape-Structure Comparison Method for Protein Classification", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 3, no. 3, pp. 193-207, July-September 2006.
[10] M. Ankerst, G. Kastenmuller, H. P. Kriegel, and T. Seidl, "Nearest Neighbor Classification in 3D Protein Databases", Proc. Seventh Int'l Conf. Intelligent Systems for Molecular Biology (ISMB '99), 1999.
[11] The European Bioinformatics Institute, http://www.ebi.ac.uk/, 2006.

