9
proteins STRUCTURE FUNCTION BIOINFORMATICS Classification of conformational stability of protein mutants from 3D pseudo-folding graph representation of protein sequences using support vector machines Michael Ferna ´ndez, 1 * Julio Caballero, 1,2 Leyden Ferna ´ndez, 1 Jose Ignacio Abreu, 1,3 and Gianco Acosta 4 1 Molecular Modeling Group, Center for Biotechnological Studies, Faculty of Agronomy, University of Matanzas, 44740 Matanzas, Cuba 2 Centro de Bioinforma ´tica y Simulacio ´n Molecular, Universidad de Talca, 2 Norte 685, Casilla 721, Talca, Chile 3 Artificial Intelligence Lab, Faculty of Informatics, University of Matanzas, 44740 Matanzas, Cuba 4 National Bioinformatics Center, 10200, Havana, Cuba INTRODUCTION Evidence is accumulated that many disease-causing mutations exert their effects by altering protein folding. Predicting proteins structures and stabilities are fundamental goals in molecular biology. Therefore predicting changes in structure and stability, induced by point mutations, has immediate application in computational protein design. 1–4 Despite free energy simulations have accu- rately predicted relative stabilities of point mutants, 5 the computational costs that most of the methods actually demand, are extremely high to test the large number of mutations studied in protein design applications. Fast algorithms for protein energy calculations are being developed today for intending the translation of structural data into energetic parameters. However, the development of fast and reliable protein force-fields is a complex task due to the del- icate balance between the different energy terms that contribute to protein stability. Force-fields for predicting protein stability can be divided into three main groups: physical effective energy function (PEEF), statistical potential-based effective energy function (SEEF), 6 and empirical data-based effective energy function (EEEF). A simplified energy function with only Van der Waals and side chain torsion potentials, 7 a PEEF approach, has been used to predict the stabilities of the k repressor protein for mutations involving only hydrophobic residues. In addition, an improved optimization method including continuously flexible side chain angles also demonstrated better prediction accuracy as compared with discrete side chain angles from a rotamer library. 8 In turn, SEEF method includes statisti- cal potentials derived from geometric and environmental propensities and corre- lations of residues in X-ray crystal structures. 6,9,10 On the other hand, EEEF approach combines a physical description of the interactions with some data obtained from experiments previously ran on proteins. Examples of such algo- rithms are the helix/coil transition algorithm AGADIR 11 or FOLDEF, 12 a fast and accurate EEEF approach based on AGADIR algorithm that uses a full atomic The Supplementary Material referred to in this article can be found online at http://www.interscience.wiley.com/jpages/ 0887-3585/suppmat/ *Correspondence to: M. Fernandez, Molecular Modeling Group, Center for Biotechnological Studies, Faculty of Agronomy, University of Matanzas, 44740 Matanzas, Cuba. E-mail: [email protected] Received 6 December 2006; Revised 9 March 2007; Accepted 20 March 2007 Published online 24 July 2007 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/prot.21524 ABSTRACT This work reports a novel 3D pseudo- folding graph representation of pro- tein sequences for modeling purposes. Amino acids euclidean distances mat- rices (EDMs) encode primary struc- tural information. Amino Acid Pseudo-Folding 3D Distances Count (AAp3DC) descriptors, calculated from the EDMs of a large data set of 1363 single protein mutants of 64 proteins, were tested for building a classifier for the signs of the change of thermal unfolding Gibbs free energy change (DDG) upon single mutations. An optimum support vec- tor machine (SVM) with a radial ba- sis function (RBF) kernel well recog- nized stable and unstable mutants with accuracies over 70% in crossva- lidation test. To the best of our knowledge, this result for stable mu- tant recognition is the highest ever reported for a sequence-based predic- tor with more than 1000 mutants. Furthermore, the model adequately classified mutations associated to dis- eases of human prion protein and human transthyretin. Proteins 2008; 70:167–175. V V C 2007 Wiley-Liss, Inc. Key words: protein stability predic- tion; point mutations; kernel-based methods; graph similarity. V V C 2007 WILEY-LISS, INC. PROTEINS 167

Classification of conformational stability of protein mutants from 3D pseudo-folding graph representation of protein sequences using support vector machines

Embed Size (px)

Citation preview

proteinsSTRUCTURE O FUNCTION O BIOINFORMATICS

Classification of conformational stability ofprotein mutants from 3D pseudo-foldinggraph representation of protein sequencesusing support vector machinesMichael Fernandez,1* Julio Caballero,1,2 Leyden Fernandez,1 Jose Ignacio Abreu,1,3

and Gianco Acosta4

1Molecular Modeling Group, Center for Biotechnological Studies, Faculty of Agronomy, University of Matanzas,

44740 Matanzas, Cuba

2 Centro de Bioinformatica y Simulacion Molecular, Universidad de Talca, 2 Norte 685, Casilla 721, Talca, Chile

3 Artificial Intelligence Lab, Faculty of Informatics, University of Matanzas, 44740 Matanzas, Cuba

4National Bioinformatics Center, 10200, Havana, Cuba

INTRODUCTION

Evidence is accumulated that many disease-causing mutations exert their

effects by altering protein folding. Predicting proteins structures and stabilities

are fundamental goals in molecular biology. Therefore predicting changes in

structure and stability, induced by point mutations, has immediate application

in computational protein design.1–4 Despite free energy simulations have accu-

rately predicted relative stabilities of point mutants,5 the computational costs

that most of the methods actually demand, are extremely high to test the large

number of mutations studied in protein design applications.

Fast algorithms for protein energy calculations are being developed today for

intending the translation of structural data into energetic parameters. However, the

development of fast and reliable protein force-fields is a complex task due to the del-

icate balance between the different energy terms that contribute to protein stability.

Force-fields for predicting protein stability can be divided into three main groups:

physical effective energy function (PEEF), statistical potential-based effective energy

function (SEEF),6 and empirical data-based effective energy function (EEEF).

A simplified energy function with only Van der Waals and side chain torsion

potentials,7 a PEEF approach, has been used to predict the stabilities of the krepressor protein for mutations involving only hydrophobic residues. In addition,

an improved optimization method including continuously flexible side chain

angles also demonstrated better prediction accuracy as compared with discrete

side chain angles from a rotamer library.8 In turn, SEEF method includes statisti-

cal potentials derived from geometric and environmental propensities and corre-

lations of residues in X-ray crystal structures.6,9,10 On the other hand, EEEF

approach combines a physical description of the interactions with some data

obtained from experiments previously ran on proteins. Examples of such algo-

rithms are the helix/coil transition algorithm AGADIR11 or FOLDEF,12 a fast

and accurate EEEF approach based on AGADIR algorithm that uses a full atomic

The Supplementary Material referred to in this article can be found online at http://www.interscience.wiley.com/jpages/

0887-3585/suppmat/

*Correspondence to: M. Fernandez, Molecular Modeling Group, Center for Biotechnological Studies, Faculty of

Agronomy, University of Matanzas, 44740 Matanzas, Cuba. E-mail: [email protected]

Received 6 December 2006; Revised 9 March 2007; Accepted 20 March 2007

Published online 24 July 2007 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/prot.21524

ABSTRACT

This work reports a novel 3D pseudo-

folding graph representation of pro-

tein sequences for modeling purposes.

Amino acids euclidean distances mat-

rices (EDMs) encode primary struc-

tural information. Amino Acid

Pseudo-Folding 3D Distances Count

(AAp3DC) descriptors, calculated

from the EDMs of a large data set of

1363 single protein mutants of 64

proteins, were tested for building a

classifier for the signs of the change

of thermal unfolding Gibbs free

energy change (DDG) upon single

mutations. An optimum support vec-

tor machine (SVM) with a radial ba-

sis function (RBF) kernel well recog-

nized stable and unstable mutants

with accuracies over 70% in crossva-

lidation test. To the best of our

knowledge, this result for stable mu-

tant recognition is the highest ever

reported for a sequence-based predic-

tor with more than 1000 mutants.

Furthermore, the model adequately

classified mutations associated to dis-

eases of human prion protein and

human transthyretin.

Proteins 2008; 70:167–175.VVC 2007 Wiley-Liss, Inc.

Key words: protein stability predic-

tion; point mutations; kernel-based

methods; graph similarity.

VVC 2007 WILEY-LISS, INC. PROTEINS 167

description of the structure of the proteins was reported

by Guerois et al.13 for predicting conformational stability

of more than 1000 mutants.

Gromiha et al.14–16 reported stability prediction stud-

ies not based on protein force-field calculations but

focused on correlations of free energy change with 3D

structure, sequence information, and amino acid proper-

ties such as hydrophobicity, accessible surface area, and

so forth. On the other hand, empirical equations involv-

ing physical properties calculated from mutant structures

have been reported. Zhou and Zhou17 developed a broad

study regarding 35 proteins and 1023 mutants from

which they derived a new stability scale.

Machine learning algorithms have been also applied to

the protein stability prediction problem. In this connec-

tion, outstanding reports of Capriotti et al.18–20 describe

the implementation of neural network and support vec-

tor machine models for predicting the change of thermal

unfolding Gibbs free energy change (DDG) by using

sequence and 3D structure information. This approach

used a dataset of more than 2000 mutations. Network

and vector machine were trained with a combination of

experimental condition data (pH and temperature), spe-

cific mutated residue information, and environmental

residues information.

Furthermore, recent reports refer the novel extensions

of different structure–property relationships approaches

to the prediction of protein stability using both

sequence-21–24 and structure-based information.25 In

such reports, descriptors are calculated over the protein

sequences or 3D structures in such a way that several

variables are computed considering the protein structure

as a simplified molecular pseudo-graph of Ca atoms.

In this work, an optimum classification model for the

conformational stability of protein mutants was success-

fully built from protein sequences. A novel 3D pseudo-

folding graph representation of protein sequence was used

in order to enrich the information from the protein pri-

mary structure. A total of 35 amino acid pseudo-folding

3D distances count (AAp3DC) descriptors were calculated

from the amino acids euclidean distance matrices from the

protein 3D graphs. We tested a large set of 1363 mutants of

64 proteins for deriving a general protein predictor for sta-

ble and unstable mutant recognition. Prediction of the

signs of change of thermal unfolding Gibbs DDG upon sin-

gle mutations was accomplished by support vector

machine (SVM) classification. In addition to AAp3DC

descriptors, we also trained the SVMs with temperature

and pH values of the DDG experimental measurements.

MATERIALS AND METHODS

Single point mutant dataset

The mutant dataset was obtained by filtering the data

previously collected by Capriotti et al. in Ref. 19 from

the Protherm database26 according to the following con-

strains:

1. DDG values have been experimentally determined and

reported in the data base.

2. The data are related to single point mutations (non

multiple mutations were taken into account).

After the filtering they gathered 2048 single point

mutants obtained from 64 proteins. However, since

some entries in the data collected by Capriotti et al.19

refer different measurements on same mutants, we

added an extra constrain.

3. Only one reported DDG value per annotated mutant

in the database was selected.

In this way, a dataset of 1383 mutants (see Supplemen-

tary Material) was gathered and used for testing the abil-

ity of the 3D pseudo-folding graph representation of pro-

tein sequence for discriminating between stable and

unstable mutants.

3D pseudo-folding representations ofprotein sequences and amino acid pseudo-folding 3D distances count descriptors

In addition to the intensive computation required for

predicting protein stability by the free energy function

methods, another limitation arises when considering nec-

essary X-ray crystal structures of mutants.1–12 Despite

protein crystallographic database continuously grow,

crystal structures are not always available for proteins of

interest. In this regard, some X-ray structural-independ-

ent protein stability prediction methods have gained

attention. The main advantages of such methods are the

use of amino acids sequence information for predicting

protein stability and their extremely less computational

cost in comparison with free energy function-based

methods.

Exploitation of protein sequences for prediction and

similarity studies have been extended by developing

2D27,28 and 3D29,30 graph representations of sequences

as well as weighted linear graph representations 21–24,31.

The first of these approaches intended to enrich the pri-

mary structure information by mapping sequence resi-

dues in a higher dimensional space. Among these reports,

Bai and Wang30 referred a new approach to represent

protein sequences with a 3D Cartesian coordinate system.

Protein representations are constructed in the interior of

a regular dodecahedron centered at origin of the Carte-

sian coordinate system and one vertex at the point

(0,0,1) that is circumscribed inside a unit ‘‘magic sphere’’

(Fig. 1). At the dodecahedron vertexes are positioned 20

amino acids. For calculating 3D position (xi, yi, zi) of

amino acid i in a given sequence Bai and Wang multi-

plied the 3 3 20 matrix of Cartesian coordinates of

dodecahedron vertexes by a 1 3 20 vector representing

M. Fernandez et al.

168 PROTEINS DOI 10.1002/prot

the cumulative occurrence numbers of amino acids from

the 1st amino acid to the ith amino acid of the

sequence.30

Differently to Bai and Wang, we calculated 3D posi-

tions of amino acid residues by using a method similar

to the ‘‘moving across the sequence’’ scheme reported by

Jeffrey32 for graphical representation of DNA that was

recently applied to the 2D graph representation of pro-

tein sequences by Randic et al.27 This representation is

more suitable for discriminating among highly similar

sequences such as it is the case of single point mutants.

In this method 3D graphical representation of proteins is

obtained by starting in the ‘‘magic sphere’’ center follow-

ing amino acid sequence by moving half way (half-step)

towards the corresponding amino acid. By moving

towards one amino acid at dodecahedron vertexes to

another, following the sequence order, protein is pseudo-

folded inside the ‘‘magic sphere’’ of unit radius. Three-

dimensional graphical form of depicted protein depends

on ordering on amino acids in the sequence, however,

the relative variations between sequences remain to a

large extent independent of the ordering of amino acids

along the dodecahedron vertexes. Amino acids in vertexes

from 1 to 20 were assigned in alphabetical order accord-

ing to three letter code. As an example, Figure 2(A,B)

depicts graphical representations of two shorter segments

of protein of yeast Saccharomyces cerevisiae already used

by Randic et al.27 for 2D graph representation of protein

sequences:

Protein I WTFESRNDPAKDPVILWLNGGPGCSSLTGL

Protein II WFFESRNDPANDPIILWLNGGPGCSSFTGL

From the protein 3D representations in Figure 2, the

amino acids Euclidean Distances Matrices (EDMs) of the

graphs can be computed. The EDM represents the rela-

tive distance among nodes (amino acid residues) in the

3D pseudo-folded representation of the protein sequence.

Afterwards, similarity between such graphs can be

assessed by performing a Euclidean Distances Count in

such a way that pairs of nodes (amino acid residues) at

certain discrete distances are counted. AAp3DC descrip-

tors are then computed using Eq. (1).

AAp3DCl ¼ 1

L

Xi

dij ð1Þ

where L is the length of the protein sequence used for

normalizing according to the size of the sequence and

d(l, dij) is a Dirac-delta function defined as:

dðl; s; dijÞ ¼ 1 if l � s2< dij � l þ s

2;

0 otherwise

� �ð2Þ

where the dij is the Euclidean distance between amino

acid residues i and j in the 3D protein graph, l and s are

the Euclidean distance and the step used for the distance

count, respectively.

Figure 1Graphical representation of the regular dodecahedron proposed by Bai and Wang30 for the 3D representation of protein sequences.

Classification of Conformational Stability

DOI 10.1002/prot PROTEINS 169

Node pair summations were carried out from an initial

distance l of 0.05–1.8 U at distance steps s of 0.05 U,

resulting in a total of 35 AAp3DC descriptors computed

for discriminating among mutant sequences. Computa-

tional codes for protein sequence 3D graphs generation

and AAp3DC descriptors calculation were written in Mat-

lab environment33 and M-file is available from the

authors upon request. Before classification study, calcu-

lated variables for the 1383 mutants were scaled to zero

mean and unit variance.

Support vector machines

The SVM, a new machine learning method, has been

used for many kinds of pattern recognition problems.

Since excellent introductions to SVM appear in Refs. 34–

36 only the main idea of SVM applied to pattern classifi-

cation problems is stated here. First, the input vectors

are mapped into one feature space (possible with a

higher dimension). Second, a hyperplane which can sepa-

rate two classes is constructed within this feature space.

Only relatively low-dimensional vectors in the input

space and dot products in the feature space will involve

by a mapping function. SVM was designed to minimize

structural risk whereas previous techniques were usually

based on minimization of empirical risk. So SVM is usu-

ally less vulnerable to the overfitting problem, so it can

deal with a large number of features.

The mapping into the feature space is performed by a

kernel function. There are several parameters in the

SVM, including the kernel function and regularization

parameter. The kernel function and its specific para-

meters, together with regularization parameter, can not

be set from the optimization problem but have to be

tuned by the user. These can be optimized by the use of

Vapnik-Chervonenkis bounds, crossvalidation, an inde-

pendent optimization set, or Bayesian learning. In this

article, we first tried a linear kernel that yields very poor

results. Afterwards, a radial basic function (RBF) kernel

was used yielding an improved nonlinear predictor. A

crossvalidation was implemented for setting the opti-

mized values of the two parameters: the regularization

parameter and the width of the RBF kernel. SVMs were

implemented in Matlab environment using the LIBSVM

toolbox by Chih-Chung and Chih-Jen37 that can be

downloaded from: http://www.csie.ntu.edu.tw/cjlin/libsvm/.

Model’s validation

The efficient of the SVM predictor for protein mutant

classification was evaluated by the same set of statistics

used by Capriotti et al. in Ref. 19 and listed below.

The overall accuracy is

Q2 ¼ P

Nð3Þ

where p is the total number of correct predicted muta-

tions and N is the total number of mutations.

The correlation coefficient C is defined as follow:

CðsÞ ¼ ½pðsÞnðsÞ � uðsÞoðsÞ�D

ð4Þ

where D is the normalization factor

D ¼ ðpðsÞ þ uðsÞÞðpðsÞ þ oðsÞÞðnðsÞ þ uðsÞÞðnðsÞ½þoðsÞÞ�1=2 ð5Þ

for each class s (þ and � for positive and negative DDGvalues); p(s) and n(s) are the number of correct predic-

Figure 2Three-dimensional pseudo-folding graph representations of example proteins

according to ‘‘half-step’’ method.

Protein I WTFESRNDPAKDPVILWLNGGPGCSSLTGL (A)

Protein II WFFESRNDPANDPIILWLNGGPGCSSFTGL (B).

M. Fernandez et al.

170 PROTEINS DOI 10.1002/prot

tions and correctly rejected assignments, respectively and

u(s) and o(s) are the number or under- and over-predic-

tions.

The coverage for each discriminant structure s is eval-

uated as

QS ¼ pðsÞpðsÞ þ uðsÞ ð6Þ

The accuracy for s is computed as

PS ¼ pðsÞpðsÞ þ oðsÞ ð7Þ

where p(s) and u(s) are the same as in Eq. (7)

RESULTS AND DISCUSSION

Three-dimensional graph representation of proteins

attempts to discriminate among protein sequences

according to a differentiated 3D spatial distribution of

residues in a 3D map. Thus, the unidimensional space of

protein primary structure is then translated to a 3D one.

Consequently, sequence information is converted into a

more easily readable format in order to compute some

descriptors for statistical pattern recognition studies.

After 3D pseudo-folding graph representation of protein

sequences, we applied graph similarity measurements for

obtaining a feature data matrix for SVM training. SVM

predictor was trained with 35 AAp3DC descriptors calcu-

lated over the 3D graph representations of the 1383

mutants studied (see Support Vector Machines, 3D

pseudo-folding representations of protein sequences and

AAp3DC descriptors). Temperature and pH values of the

experimental DDG measurements were also used for

training the predictor. Since the dataset is about threefold

unbalanced towards unstable mutants, a threefold higher

penalty for stable mutant misclassification was established

inside the SVM framework for obtaining a classifier not

biased to the most represented class. In a first attempt,

we implemented the simplest linear kernel but the higher

crossvalidation accuracy was only about 0.6. Then, a

nonlinear SVM classifier was optimized by adjusting the

value of a RBF kernel width and regularization parameter

throughout a 20-fold out crossvalidation test. Optimum

values of RBF kernel width and regularization parameter

were 0.033 and 10, respectively, yielding training and

crossvalidation results that appear in Table I. As can be

observed, an overall 20-fold out crossvalidation accuracy

of 73% for the classification of mutant’s DDG signs was

achieved with a correlation coefficient C ¼ 0.41. It is

noteworthy, that crossvalidation accuracies for recogniz-

ing stable mutants Q(þ) ¼ 0.71 and unstable mutants

Q(�) ¼ 0.74 are both equivalent and in the same range

of the overall accuracy achieved.

Since it has been reported that the effects caused by a

specific mutation depend on the type of substituted and

new added residues, it is interesting to analyze the per-

formance of the predictor regarding the nature of the

mutations.38,39 Classification results according to the

physico-chemical properties of the mutations appear in

Table II. By analyzing the SVM predictor accuracy as a

function of the mutation type, we found that mutations

involving substitutions of charged residues, mainly placed

on the protein surface, by other charged or apolar resi-

dues exhibit the lower classification accuracy. Similarly,

substitutions of apolar by charges residues show low ac-

curacy value. This facts suggest that sequence derived

information does not properly describe the effect of salt-

bridge interactions, the main stabilizing force among

charged residues placed on the protein surface, that

should be better attempted using 3D structure details. At

the protein core, interactions among apolar residues of-

ten occur at shorter range in the sequence whilst at the

protein surface interaction can occur at larger ranges.

Some other classification models for protein mutant’s

stability have been reported. Classifiers using just

sequence information were described by Gonzalez-Diaz

and coworkers.21 dealing with the linear discriminant

analysis (LDA) classification of Arc repressor mutants,

but according to its melting point value. Similarly, this

set of Arc repressor mutants was employed by Marrero-

Ponce et al.22 for deriving a LDA classification describing

both reports more than a 80% of validation data variance

using protein linear indices of the ‘‘macromolecular pseu-

dograph Ca-atom adjacency matrix,’’ a sequence-exploit-

ing approach. However, above mentioned models have

poor utility because they are protein-specific and use a

thermodynamic parameter (melting point) not directly

related to the protein conformational stability. In addi-

Table ITraining and Crossvalidation Statistics of AAp3DC-SVM Model for the

Classification of DDG Signs Upon Mutations for 1383 Mutants of 64 Proteins

Q2 P (þ) P (�) Q (þ) Q (�) C

Training 0.82 0.63 0.95 0.88 0.80 0.63Crossvalidation 0.73 0.50 0.87 0.71 0.74 0.41

þ and � : the indexes were evaluated for positive and negative DDG signs.

Table IICrossvalidation Accuracy (Q2) of the AAp3DC-SVM Model for the

Classification of DDG Signs Upon Mutations According to the Mutation Types

Native Charged Polar Apolar

NewCharged 0.65 (6%) 0.78 (9%) 0.68 (10%)Polar 0.70 (5%) 0.78 (9%) 0.72 (16%)Apolar 0.71 (4%) 0.75 (13%) 0.74 (28%)

Classification of Conformational Stability

DOI 10.1002/prot PROTEINS 171

tion, other single-protein models, previously developed

by us, built with amino acids sequence autocorrelation

vectors and Bayesian Regularized Neural Networks, suc-

cessfully mapped lysozymes23 and gene V protein24

mutants in self-organizing maps according to mutant’s

DDG levels.

On the other hand, taking into account classification

models of protein stability change upon mutations using

large and diverse mutant data, our classification model

shows to overcome the optimum reported by Capriotti

et al.19 using only sequence information. Despite they

reported an overall accuracy about 77%, the correct pre-

dictions were drastically shifted towards unstable mutants

with a value about 91% whilst for recognizing stable

mutants it was reported a very low value about 46%.

Such statistics reflect that their model almost recognized

all mutants as unstable yielding and overall adequate ac-

curacy but with very low discriminating ability. In turn,

our predictor recognized both stable and unstable mutants

with identical good accuracy about 70% due to the higher

penalty imposed to stable mutant misclassification during

SVM training. To the best of our knowledge, this is the

highest accuracy reported for positive DDG signs recogni-

tion by a model with more than 1000 mutations and only

exploiting sequence information. Interestingly, when Cap-

riotti et al.20 used 3D structure information the highest

overall classification accuracy achieved was about 80%

but stable mutants were poorly recognized with and accu-

racy of 56% supporting the fact that our AAp3DC-SVM

predictor, although its primary sequence nature, is more

adequate for the mutant stability recognition task. Simi-

larly, Gonzalez-Diaz et al.25 reported a LDA model for

recognizing stable mutants using 3D stochastic average

electrostatic potentials derived from protein 3D structure

with validation accuracy nearly 90% for all the dataset

and for each class separately. However, instead of discri-

minated between stable or unstable mutants according to

wild-type protein they classified the mutants in higher

stable and near-wild-type stable.

Since we used a whole sequence representation for

studying mutation effect on stability rather than a muta-

tion location-dependent representation such as other

reports,15–20 it could be obtained as undesirable pro-

tein-biased results. Therefore, it is interesting to analyze

the classification results appearing in Table III for each

set of mutants from a particular protein. Bad results, Q2

Table IIIPercent Crossvalidation Accuracy (Q2) of the AAp3DC-SVM Model for the Classification of DDG Signs Upon Mutations According to Mutants for Protein

Protein Q2 (%)Number ofmutants Protein Q2 (%)

Number ofmutants

DsbA 100 3 Ribonuclease T1 55 44Bovine ubiquitin 60 10 Rop 67 21Murine cellular prion 100 9 Ribonuclease A 73 11Chicken spectrin 100 1 Dihydrofolate reductase 58 12Cytochrome c 100 1 Ribonuclease SA 100 5Adenylate kinase 100 4 Alpha spectrin (SH3 domain) 83 6Arc repressor 100 1 Staphylococcal nuclease 94 34Adrenodoxin 73 11 Subtilisin BPN0 100 6Barnase 91 151 Plasminogen activator kringle-2 domain 83 6Trypsin inhibitor 94 47 Tumor suppressor P53 100 5Barstar 0 1 Gene V 86 92Myoglobin 59 56 Iso-1 cytochrome c 75 24Cytochrome c2 60 5 Acyl-coenzyme a binding protein 85 27Cold shock protein Bacillus caldolyticus 64 14 Acidic fibroblast growth factor 0 3Cold shock protein Bacillus subtilis 100 3 Adenylate kinase 100 4Chitosanase 100 3 Chymotrypsin inhibitor 2 74 77Carbonic anhydrase II 86 7 HPr protein 67 3Cytochrome b5 75 12 Lysozyme Bacteriophage T4 65 233Canine Lysozyme 100 4 Ribonuclease HI 63 65Streptococcal protein G helix variant 100 6 Thioredoxin 100 2Catabolite activator protein 0 2 Beta lactamase 100 1Alpha lactalbumin 100 4 Maltose binding protein 83 6HU DNA-binding protein 40 5 Subtilisin inhibitor 63 19Calbindin D9k 100 3 Cytochrome c551 100 6Interleukin 1 beta 57 7 Chicken Lysozyme 48 50Human lysozyme 58 105 Lambda cro repressor 90 10C-Myb 67 3 Phage lambda endolysin 100 2Glycogen phosphorylase 100 1 Plasminogen activator inhibitor 1 precursor 0 2Onconase 100 1 Fibroblast growth factor 12 100 1Ubiquitin 87 30 Phage lambda repressor protein cl 83 12Streptococcus protein G 83 23 Phage P22 tail protein 100 7Histidine-containing phosphocarrier protein 83 12 Tryptophan synthase 54 41

M. Fernandez et al.

172 PROTEINS DOI 10.1002/prot

values significantly lower than 0.5, are only obtained in

five cases but for five proteins with no more than five

mutants, thus having low statistical weight on the whole

classification model. This analysis reflects the robustness

of our predictor.

An interesting application of our classifier is to recog-

nize unstable mutations when mutations have been

reported to correlate to diseases. Indeed, many disease-

causing mutations exert their effects by altering protein

folding. In Table IV appears the classification for a series of

22 mutations for human prion protein40–42 and human

transthyretin.19,43 Our classifier correctly recognized as

unstable 13 of the 15 DDG-known mutants reported in Ta-

ble IV, this result represents an accuracy about 0.87 that is

higher than the validation result for all mutant dataset. It

should be noticed that the only two positive signs corre-

spond to very low values of DDG (< 1.0 kJ/mol). The rest

seven DDG-unknown mutants were also predicted as

unstable when some have been reported as disease provok-

ing mutations. However, all the mutants with DDG � �2.1 (kJ/mol) (in bold face letter in Table IV) that corre-

spond to diseases were well recognized as unstable

mutants. This analysis and the fact that defective protein

folding is main cause of mutation-related disease suggest

that our model could be useful for correlating nucleotide

polymorphisms and stability-related diseases.

AAp3DC descriptors were able of resembling an amino

acid interaction pattern that was learned by the SVM

without having to be explicitly fed with residue proxim-

ities or other structural information. In this regard, con-

formational stability, a 3D dependent feature, was suc-

cessfully modeled employing scarce information from

protein primary sequence. The reported model improves

previous one exploiting only sequence information. Fur-

thermore, our SVM predictor based on the similarity

measurements among 3D graph representation of whole

protein sequence seems to be superior to mutation

point-centered methods.15–20 In this regard, unlike such

reports, our method can handle any mutant as well as

any protein polymorphism since it is not restricted to

single point mutations. Despite we used only single point

mutants for training the classifier in order to compare

with previous reports, it should be noticed in Table IV

two mutants of human prion protein (D178N/M129 and

D178N/V129) involving double point changes correctly

recognized as unstable mutants.

CONCLUSIONS

Protein primary structure-based methods are less com-

putational intense and do not require X-ray crystal struc-

ture of proteins for implementation. Because of the avail-

Table IVStability Classification of Human Prion and Human Transthyretin Mutants According to the AAp3DC-SVM Model

Protein Mutation Disease phenotype DDG (kJ/mol) Classification

Human prion P102La GSD 0.8 � 2.5 unstableM129Vb Polymorphism �1.4 � 2.0 unstableD178N/M129b FFI �7.2 � 1.7 unstableD178N/V129b CJD �8.0 � 1.8 unstableV180Ib GSD �2.1 � 1.7 unstableT183Ab CJD �19.3 � 3.1 unstableT190Vb Polymorphism 0.7 � 2.4 unstableF198Sb GSD �10.3 � 1.7 unstableE200Kb CJD �0.6 � 2.4 unstableR208Hb CJD �6.0 � 2.5 unstableV210Ib CJD �1.1 � 2.6 unstableQ217Rb GSD �8.9 � 1.7 unstableM166Vc Polymorphism SC(1E1J) unstableS170Nc Polymorphism SC(1E1P) unstableR220Kc Polymorphism SC(1FKC) unstable

Human transthyretin V50Md Amyloidosis �9.2 � 10.0 unstableL75Pd Amyloidosis �6.3 � 9.6 unstableT139Md Unclassified �0.42 � 11.7 unstableT80Ae Amyloidosis SC(1TSH) unstableS97Ye Amyloidosis SC(2TRY) unstableY134Ce Amyloidosis SC(1IIK) unstableV142Ie Unclassified SC(1TTR) unstable

GSD, Gerstmann-Straussler disease; FFI, fatal familial insomnia; CJD, Creutzfeldt-Jakob disease; SC, structural conformational change determined by comparing native

(1QLX, human prion protein; 1BM7 human transthyretin) with mutated 3D structures (PDB codes are reported within parenthesis).

Values of DDG � � 2.1 (kJ/mol) appear in bold face letter.afrom Ref. 40.bfrom Ref. 41.cfrom Ref. 42.dfrom Ref. 43.efrom Ref 19.

Classification of Conformational Stability

DOI 10.1002/prot PROTEINS 173

ability of an enormous amount of thermodynamic data

on protein stability it is possible to use structure–pro-

perty relationship approach for protein stability modeling.

We reported a novel 3D pseudo-folding graph representa-

tion of protein sequence and calculated protein descrip-

tors for training a SVM classifier of mutant stability. The

model well recognized over 70% of stable and unstable

mutants in crossvalidation test. To the best of our knowl-

edge, the accuracy for stable mutant recognition over

70% is the highest ever reported for a model with a large

mutant dataset (>1000) exploiting only sequence infor-

mation. Furthermore, the classifier adequately recognized

some disease-related unstable mutants of human prion

protein and human transthyretin.

Three-dimensional pseudo-folding graph representa-

tion of protein sequence in combination with SVMs

could be very useful in protein prediction studies. De-

spite the disadvantage of requiring some previous ther-

modynamic experimental data for generating a training

set, our stability prediction technique is an alternative

approach for proteins whose sequences are known but X-

ray structures are unsolved.

REFERENCES

1. Saven J. Combinatorial protein design. Curr Opin Struct Biol 2002;

12:453–458.

2. Mendes J, Guerois R, Serrano L. Energy estimation in protein

design. Curr Opin Struct Biol 2002;12:441–446.

3. Bolon DN, Marcus JS, Ross SA, Mayo SL. Prudent modeling of

core polar residues in computational protein design. J Mol Biol

2003;329:611–622.

4. Looger LL, Dwyer MA, Smith JJ, Helling HW. Computational

design of receptor and sensor proteins with novel functions. Nature

2003;423:185–190.

5. Dang LX, Merz KM, Kollman PA. Free-energy calculations on pro-

tein stability: thr-1573val-157 mutation of t4 lysozyme. J Am Chem

Soc 1989;111:8505–8508.

6. Lazaridis T, Karplus M. Effective energy functions for protein struc-

ture prediction. Curr Opin Struct Biol 2000;10:139–145.

7. Lee C, Levitt M. Accurate prediction of the stability and activity

effects of site-directed mutagenesis on a protein core. Nature

1991;352:448–451.

8. Lee C. Testing homology modeling on mutant proteins: predicting

structural and thermodynamic effects in the Ala98-Val mutants of

T4 lysozyme. Fold Des 1995;1:1–12.

9. Topham CM, Srinivasan N, Blundell TL. Prediction of the stability

of protein mutants based on structural environment-dependent

amino acid substitution and propensity tables. Protein Eng

1997;10:7–21.

10. Gilis D, Rooman M. Prediction of stability changes upon single site

mutations using database-derived potentials. Theor Chem Acc

1999;101:46–50.

11. Lacroix E, Viguera AR, Serrano L. Elucidating the folding problem

of alpha-helices: local motifs, long-range electrostatics, ionic-

strength dependence and prediction of NMR parameters. J Mol

Biol 1998;284:173–191.

12. Munoz V, Serrano L. Development of the multiple sequence

approximation within the AGADIR model of alpha-helix formation:

comparison with Zimm-Bragg and Lifson-Roig formalisms. Biopoly-

mers 1997;41:495–509.

13. Guerois R, Nielsen JE, Serrano L. Predicting changes in the stability

of proteins and protein complexes: a study of more than 1000

mutations. J Mol Biol 2002;320:369–387.

14. Gromiha MM, Oobatake M, Kono H, Uedaira H, Sarai A. Relation-

ship between amino acid properties and protein stability: buried

mutations. J Protein Chem 1999;18:565–578.

15. Gromiha MM, Oobatake M, Kono H, Uedaira H, Sarai A. Role of

structural and sequence information in the prediction of protein

stability changes: comparison between buried and partially buried

mutations. Protein Eng 1999;12:549–555.

16. Gromiha MM, Oobatake M, Kono H, Uedaira H, Sarai A. Impor-

tance of surrounding residues for protein stability of partially bur-

ied mutations. J Biomol Struct Dyn 2000;18:1–16.

17. Zhou H, Zhou Y. Stability scale and atomic solvation parameters ex-

tracted from 1023 mutation experiment. Proteins 2002;49:483–492.

18. Capriotti E, Fariselli P, Casadio R. A neural-network-based method

for predicting protein stability changes upon single mutations. Bio-

informatics 2004;20:63–68.

19. Capriotti E, Fariselli P, Calabrese R, Casadio R. Prediction of pro-

tein stability changes from sequences using support vector

machines. Bioinformatics 2005;21:54–58.

20. Capriotti E, Fariselli P, Casadio R. I-Mutant2.0: predicting stability

changes upon mutation from the protein sequence or structure.

Nucleic Acids Res 2005;33:306–310.

21. Ramos de Armas R, Gonzalez-Dıaz H, Molina R, Uriarte E.

Markovian backbone negentropies: molecular descriptors for pro-

tein research. i. predicting protein stability in arc repressor mutants.

Proteins 2004;56:715–723.

22. Marrero-Ponce Y, Medina-Marrero R, Castillo-Garit JA, Romero-

Zaldivar V, Torrens F, Castro EA. Protein linear indices of the

‘macromolecular pseudograph a-carbon atom adjacency matrix’ in

bioinformatics. Part 1: prediction of protein stability effects of a

complete set of alanine substitutions in arc represor. Bioorg Med

Chem 2005;13:3003–3015.

23. Caballero J, Fernandez L, Abreu JI, Fernandez M. Amino acid

sequence autocorrelation vectors and ensembles of bayesian-regular-

ized genetic neural networks for prediction of conformational stability

of human lysozyme mutants. J Chem Inf Model 2006;46:1255–1268.

24. Fernandez L, Caballero J, Abreu JI, Fernandez, M. Amino acid

sequence autocorrelation vectors and bayesian-regularized genetic

neural networks for modeling protein conformational stability: gene

V protein mutants. Proteins 2007;67:834–852.

25. Gonzalez-Dıaz H, Molina R, Uriarte E. Recognition of stable pro-

tein mutants with 3D stochastic average electrostatic potentials.

FEBS Lett 2005;579:4297–4301.

26. Bava KA, Gromiha MM, Uedaira H, Kitajima K, Sarai A. ProTherm

version 4.0: thermodynamic database for proteins and mutants.

Nucleic Acids Res 2004;32:120–121. http://gibk26.bse.kyutech.ac.jp/

jouhou/protherm/protherm.html.

27. Randic M, Butina D, Zupan J. Novel 2D graphical representation of

proteins. Chem Phys Lett 2006;419:528–532.

28. Aguero-Chapin G, Gonzalez-Dıaz H, Molina R, Varona-Santos J,

Uriarte E, Gonzalez-Dıaz Y. Novel 2D maps and coupling numbers

for protein sequences. The first QSAR study of polygalacturonases;

isolation and prediction of a novel sequence from Psidium guajava

L. FEBS Lett 2006;580:723–730.

29. Randic M, Krilov G. Characterization of 3-D sequences of proteins.

Chem Phys Lett 1997;272:115–119.

30. Bai F, Wang T. On graphical and numerical representation of pro-

tein sequences. J Biomol Struct Dyn 2006;23:537–545.

31. Caballero J, Fernandez L, Garriga M, Abreu JI, Collina S, Fernandez

M. Proteometric study of ghrelin receptor function variations upon

mutations using amino acid sequence autocorrelation vectors and

genetic algorithm-based least square support vector machines.

J Mol Graph Model 2006 doi:10.1016/j.jmgm.2006.11.002.

32. Jeffrey HI. Chaos game representation of gene structure. Nucleic

Acid Res 1990;18:2163–2170.

M. Fernandez et al.

174 PROTEINS DOI 10.1002/prot

33. MATLAB 7.0. program, available from The Mathworks Inc., Natick,

MA. http://www.mathworks.com.

34. Cortes C, Vapnik V. Support-vector networks. Mach Learn 1995;20:

273–297.

35. Burges CJC. A tutorial on support vector machines for pattern rec-

ognition. Data Min Knowledge Discov 1998;2:1–47.

36. Vapnik V. Statistical learning theory. New York: Wiley; 1998.

37. Chih-Chung C, Chih-Jen L. LIBSVM: a library for support vec-

tor machines. 2001; Software available at http://www.csie.ntu.edu.

tw/�cjlin/libsvm.

38. Sandberg WS, Terwilliger TC. Energetics of repacking a protein in-

terior. Proc Natl Acad Sci USA 1991;88:1706–1710.

39. Sandberg WS, Terwilliger TC. Engineering multiple properties of a

protein by combinatorial mutagenesis. Proc Natl Acad Sci USA

1993;90:8367–8371.

40. Apetri AC, Surewicz K, Surewicz WK. The effect of disease-associ-

ated mutations on the folding pathway of human prion protein.

J Biol Chem 2004;279:18008–18014.

41. Liemann S, Glockshuber R. Influence of amino acid substitutions

related to inherited human prion diseases on the thermodynamic

stability of the cellular prion protein. Biochemistry 1999;38:3258–

3267.

42. Calzolai L, Lysek DA, Guntert P, Schroetter C, Riek R, Zahn R,

Wuthrich K. NMR structures of three single-residue variants of

the human prion protein. Proc Natl Acad Sci USA 2000;97:8340–

8345.

43. Shnyrova VL, Villar E, Zhadana GG, Sanchez-Ruiz JM, Quintas A,

Saraiva MJM, Brito RMM. Comparative calorimetric study of non-

amyloidogenic and amyloidogenic variants of the homotetrameric

protein transthyretin. Biophys Chem 2000;88:61–67.

Classification of Conformational Stability

DOI 10.1002/prot PROTEINS 175