

Biochimica et Biophysica Acta 1844 (2014) 1798–1807

Contents lists available at ScienceDirect

Biochimica et Biophysica Acta

journal homepage: www.elsevier.com/locate/bbapap

D2N: Distance to the native

Avinash Mishra a, Prashant Singh Rana b, Aditya Mittal a, B. Jayaram a,b,c,⁎

a Kusuma School of Biological Sciences, Indian Institute of Technology, Hauz Khas, New Delhi 110016, India
b Supercomputing Facility for Bioinformatics & Computational Biology, Indian Institute of Technology, Hauz Khas, New Delhi 110016, India
c Department of Chemistry, Indian Institute of Technology, Hauz Khas, New Delhi 110016, India

Abbreviations: D2N, Distance to native; PCC, Pearson correlation coefficient; RMSD, Root mean square deviation; GDT, Global distance test; TM, Topology modeling

⁎ Corresponding author at: Department of Chemistry, Indian Institute of Technology, Hauz Khas, New Delhi 110016, India. Tel.: +91 11 26591505; fax: +91 11 26582037.
E-mail addresses: [email protected], [email protected] (B. Jayaram).
URL: http://www.scfbio-iitd.res.in (B. Jayaram).

http://dx.doi.org/10.1016/j.bbapap.2014.07.010
1570-9639/© 2014 Elsevier B.V. All rights reserved.

Article info

Article history:
Received 8 April 2014
Received in revised form 3 July 2014
Accepted 15 July 2014
Available online 22 July 2014

Keywords:
Protein folding
Native
Decoy
Quality assessment

Abstract

Root-mean-square deviation (RMSD) of computationally derived protein structures from experimentally determined structures is a critical index for assessing protein structure prediction algorithms (PSPAs). The development of PSPAs to obtain 0 Å RMSD from native structures is considered central to computational biology. However, to date it has been quite challenging to measure how far a predicted protein structure is from its native in the absence of a known experimental/native structure. In this work, we report the development of a metric, “D2N” (distance to the native), that predicts the “RMSD” of any structure without actually knowing the native structure. By combining physico-chemical properties and known universalities in spatial organization of soluble proteins to develop D2N, we demonstrate the ability to predict the distance of a proposed structure to within ±1.5 Å error with a remarkable average accuracy of 93.6% for structures below 5 Å from the native. We believe that this work opens up a completely new avenue towards assigning reliable structures to whole proteomes even in the absence of experimentally determined native structures. The D2N tool is freely available at http://www.scfbio-iitd.res.in/software/d2n.jsp.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

Protein sequences assume a three dimensional tertiary structure to perform several biological functions [1,2]. Physicochemical properties of amino acid residues and their stoichiometry [3] give rise to different interactive forces which further guide the folding of a protein sequence into its unique tertiary structure. This knowledge notwithstanding, prediction of tertiary structures of proteins continues to be a “grand challenge problem” in modern biology [4–6]. Protein structure prediction algorithms (PSPAs) can be broadly classified into three categories, namely ab initio/de novo [7–10], homology [11–17] and hybrid [18,19]. Ab initio approaches use molecular dynamics, Monte Carlo or other sampling techniques with physics based force fields to generate a structure or an ensemble of structures for the query sequence, while homology based methods use experimentally known protein structures as templates based on sequence similarity and construct a full structure for the query sequence. However, limited understanding of the net balance of forces often results in low quality structures [20]. Many solutions have been proposed to screen these low quality structures, which may appear as visually appealing as any high resolution structure (Fig. 1).

Several quality assessment servers address this issue by giving a relative ranking to each structure in comparison to other predicted structures. These servers adopt either statistical potential based or physics based approaches [21–25]. We also designed a scoring function in our earlier work where we used all atom energies to detect the native or native-like structures from a large pool of decoy structures [26,27]. Further, we used other physico-chemical properties along with all atom energies to detect native/native-like structures from the decoy ensemble [28,29]. This combination approach performed better in capturing the native than the energy score alone, but all such methods assess a modeled protein structure in comparison to other structures in a group or with a standard pattern. The problem encountered is similar to simple variational treatments in physics and chemistry, where it is easy to pick a good solution from the available candidates but whether it is the right solution is beyond the purview of the current theoretical methods. Assigning distance from the native to a proposed structure in the absence of experimental information is essentially an unrealized dream. Any progress in this direction requires overcoming two gap areas: (i) knowledge of the invariant properties of native protein structures amidst their diversity and (ii) a method to map these properties into an appropriate metric.

Here we examine certain invariant physico-chemical properties of native proteins, and various machine learning models for estimating the distance of any query structure from its native without using its native form. The proposed methodology is an effort to make protein structure prediction more transparent and reliable. Physicochemical properties considered to formulate the distance metric include (i) total solvent accessible surface area, (ii) all-atom interaction energy, (iii) Cβ spatial distribution, (iv) a secondary structural feature, (v) solvent accessible surface area of polar residues and (vi) solvent accessible surface area of charged residues. Residue lengths, Cβ–Cβ pairs, and the numbers of polar and charged residues are also included to build a size independent prediction model. We tested several machine learning methods, among which the random forest method outperformed the others in making correct decisions during prediction. We use the above physico-chemical/structural parameters, treated as features, to build a rigorous prediction model that estimates the distance to the native (D2N) of any query structure. Before selecting these parameters for D2N calculation, we fingerprinted their behavior on the true native structures.

Fig. 1. Ambiguity of modeled structures: A predicted structure (left) folded into a clean tertiary form, but its distance (RMSD) from the true native state (right; PDB ID: 2HH6) is 16.5 Å.

1.1. Native structure selection

SCOP [30] based protein classes were considered for the selection of natives from RCSB [31]. They are: alpha proteins (α), beta proteins (β), and alpha and beta proteins (α + β and α/β). Further, some basic selection criteria were employed to eliminate bias, and these are listed below:

• Chain type: Protein chain but not any DNA or RNA
• Sequence length > 50
• 0 < resolution ≤ 2.5 Å
• Experimental method is X-ray
• Homologue removal: 30% identity cutoff
• No missing residue.

We applied the abovementioned selection criteria to the RCSB database for selecting native protein structures. Further, SCOP classification is also added to the above mentioned filters. This resulted in 1081 structures for the α class, 1173 structures for the β class, 1521 structures for the α + β class and 1687 for the α/β class. There are 742 common structures among these classes and therefore we have a total of 4720 unique structures. Further, we removed those structures that have any missing residue/s (i.e. REMARK 465 in the PDB file), as the calculation of physico-chemical features can be altered by missing residue/s. We found only 925 structures that do not have missing residue/s. Later, we also observed that there are 1568 structures out of 4720 which have missing residue/s only at their termini. Thus they could be considered as continuous protein structures. We added these structures to the previously filtered 925 native structures, calculated physico-chemical features for the resulting 2493 structures and tested their universalities (Supplementary Fig. 1). Below we describe the significance of each parameter and the calculation procedure.
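The selection filters and the structure counts quoted above can be sketched in a few lines of Python. This is only an illustration: the `Entry` record, its field names and the helper `passes_selection` are hypothetical and not part of the original workflow, and the 30% identity homologue removal is assumed to be done by an external clustering step that is not shown.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical per-chain record; field names are assumptions."""
    label: str
    polymer_type: str          # "protein", "DNA" or "RNA"
    length: int                # number of residues
    resolution: float          # in angstrom
    method: str                # experimental method
    has_missing_residues: bool # True if REMARK 465 is present


def passes_selection(e: Entry) -> bool:
    """Apply the listed criteria, except homologue removal."""
    return (
        e.polymer_type == "protein"
        and e.length > 50
        and 0.0 < e.resolution <= 2.5
        and e.method == "X-ray"
        and not e.has_missing_residues
    )


# Counts reported in the text: per-class totals minus shared structures.
class_counts = {"alpha": 1081, "beta": 1173, "alpha+beta": 1521, "alpha/beta": 1687}
unique_structures = sum(class_counts.values()) - 742   # 4720
final_natives = 925 + 1568                             # 2493
```

The two arithmetic lines simply reproduce the totals stated in the paragraph (4720 unique structures and 2493 structures used for feature calculation).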

1.2. Total solvent accessible surface area (F1)

Protein folding is governed by various driving forces which, among others, seek to minimize its total surface area. The degree of these forces depends on the surface of a protein exposed to the solvent, which conveys a strong dependency of free energy of folding on solvent accessible surface area (SASA) [32,33]. SASA has been widely used as one of the important properties to assess the quality of protein structures. Hydrophobic collapse is considered to be a dominant factor in protein folding and this can be estimated as a loss of SASA of non-polar residues. Each amino acid shows a different propensity to be found on the surface of the protein, based on the functional groups present in its side chain [34–38]. Some questions arise with regard to the usage of SASA: (i) should it be the total area or is it the area of the nonpolar residues? (ii) what is the standard fixed value of SASA for a native structure? and (iii) is the rule of minimum area applicable to non-globular proteins? Here, we have calculated the total SASA for the selected 2493 native structures and plotted it against the number of residues on a natural log–log scale (Supplementary Fig. 1(a)). SASA is calculated following the procedure of Lee and Richards [39]. Quite interestingly, the total solvent accessible surface area, rather than the total non-polar area, shows a very high correlation with the number of residues (correlation coefficient r = 0.95), irrespective of the size, shape or secondary structure composition.
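As a minimal sketch of the kind of check described above, the following Python computes a Pearson correlation coefficient between ln(number of residues) and ln(SASA). The function names are illustrative assumptions, and the sample values in the usage note are synthetic, not data from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def loglog_correlation(n_residues, sasa_values):
    """Correlation of ln(SASA) with ln(number of residues)."""
    log_n = [math.log(n) for n in n_residues]
    log_a = [math.log(a) for a in sasa_values]
    return pearson_r(log_n, log_a)
```

For example, `loglog_correlation([60, 120, 240, 480], [10 * n ** 0.8 for n in (60, 120, 240, 480)])` uses an exact power law, so the log–log points fall on a straight line and the coefficient is 1 up to rounding. Real structures scatter around such a line, which is what a value such as r = 0.95 reflects.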

1.3. Total energy (empirical non-bonded all atom potential energy) (F2)

An empirical potential energy function developed by us earlier has been used to calculate the total non-bonded energy of the 925 native proteins. This energy is comprised of three non-bonded interacting terms, namely, (i) electrostatic, (ii) van der Waals and (iii) hydrophobic. These are computed as given below.

$$E_{ij}^{elec} = \frac{332 \, q_i \, q_j}{D(r) \, r_{ij}}, \qquad E_{ij}^{vdW} = \frac{C_{ij}^{12}}{r_{ij}^{12}} - \frac{C_{ij}^{6}}{r_{ij}^{6}}, \qquad E_{ij}^{hyd} = \frac{M_{ij}^{12}}{r_{ij}^{12}} - \frac{M_{ij}^{6}}{r_{ij}^{6}}$$

r_{ij} is the distance between the pair of atoms i and j.

$$C_{ij}^{12} = \varepsilon \, \sigma^{12}, \qquad C_{ij}^{6} = 2 \, \varepsilon \, \sigma^{6}$$

σ is the sum of van der Waals radii and ε is the well depth.

$$M_{ij}^{12} = d \, R^{12}, \qquad M_{ij}^{6} = 2 \, d \, R^{6}, \qquad \text{where } d \text{ is set to } 1$$

R is the sum of van der Waals radii.

$$F_2 = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \left( E_{ij}^{elec} + E_{ij}^{vdW} + E_{ij}^{hyd} \right)$$

A scoring function which can mimic the Gibbs free energy of a system is the appropriate parameter to detect the quality of a structure, since the native conformation under the appropriate solution conditions is characterized by the lowest free energy. It is non-trivial to capture the free energy of a structure with an empirical potential function. A well-calibrated energy function can at best provide an approximation to the true free energy. Again two issues arise: (i) what is the energy of a native structure, needed to fix the distance of a query structure from the native? (ii) two or more conformations of the same protein differing in their distances to the native can possess the same energy, which gives rise to the problem of degeneracy [40–51]. We have calculated the total energy of 2493 native structures and plotted them against the number of residues. Supplementary Fig. 1(b) illustrates the dependency of the total energy on the number of residues in a given protein structure. A very high correlation, independent of the size and class of the protein, suggests universality which should help, in combination with SASA, in formulating the distance metric. A detailed description of the energy calculation is provided in Supplementary Text S1.
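The three pair terms and the double sum F2 can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the dielectric function D(r) is not specified in the text, so it is passed in as a caller-supplied function; a single well depth `eps` is used for every pair; and σ and R are both taken as the sum of the two atomic radii, as the definitions above state.

```python
import math
from itertools import combinations


def pair_energy(r, qi, qj, eps, sigma, big_r, dielectric, d=1.0):
    """Sum of the electrostatic, van der Waals and hydrophobic pair terms."""
    e_elec = 332.0 * qi * qj / (dielectric(r) * r)
    c12 = eps * sigma ** 12
    c6 = 2.0 * eps * sigma ** 6
    e_vdw = c12 / r ** 12 - c6 / r ** 6
    m12 = d * big_r ** 12
    m6 = 2.0 * d * big_r ** 6
    e_hyd = m12 / r ** 12 - m6 / r ** 6
    return e_elec + e_vdw + e_hyd


def total_energy(atoms, eps, dielectric):
    """F2 over all unique atom pairs.

    atoms: list of (x, y, z, charge, radius) tuples (assumed layout).
    eps: one well depth shared by all pairs (a simplifying assumption).
    dielectric: function r -> D(r), supplied by the caller.
    """
    total = 0.0
    for a, b in combinations(atoms, 2):
        r = math.dist(a[:3], b[:3])
        radii_sum = a[4] + b[4]      # used for both sigma and R
        total += pair_energy(r, a[3], b[3], eps, radii_sum, radii_sum, dielectric)
    return total
```

With these definitions the 6/12 term has its minimum of −ε at r = σ, since εσ¹²/σ¹² − 2εσ⁶/σ⁶ = ε − 2ε. Two uncharged atoms at exactly that separation therefore contribute −ε − d to F2.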

1.4. Geometrical Cβ constraint (F3)

Spatial positioning of Cα or Cβ atoms is a decisive factor in providing the 3D conformation of a protein. Recently, neighborhood profiles of Cα atoms for each pair of residues have been characterized and observed to be invariant in 3618 native proteins [52–54], suggesting certain universal geometrical constraints in their positioning. Here we considered four aliphatic nonpolar residues: alanine (ALA), valine (VAL), leucine (LEU) and isoleucine (ILE). Collectively they form 6 unique pairs. These residues are selected due to the presence of non-cyclic hydrophobic side chains. The Cβ feature is used to exploit the compactness behavior of native/native-like structures. This property monitors the hydrophobic effect in the process of attaining the native conformation. Such a hypothesis is driven by hydrophobic collapse in protein folding, which confirms the close neighborhood behavior of these residues in compact native structures. The cumulative inter-atomic Euclidean distance of Cβ atoms of aliphatic nonpolar residues was calculated and plotted against the total number of such pairs occurring in a given structure. This was done for all the 2493 native proteins. Supplementary Fig. 1(c) illustrates the behavior of the Cβ–Cβ cumulative Euclidean distance with the number of such aliphatic nonpolar pairs, which suggests a strong mutual dependence with a correlation coefficient (r) of 0.97.

$$F_3 = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} r_{ij}$$

r_{ij} is the Euclidean distance between i and j,
i, j are individual aliphatic nonpolar residues,
n is the total number of aliphatic nonpolar residues.
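A minimal Python sketch of F3 is given below. The input layout, a list of (residue name, Cβ coordinate) pairs, and the function name are assumptions made for illustration; parsing of the structure file is not shown.

```python
import math
from itertools import combinations

# The four aliphatic nonpolar residue types named in the text.
ALIPHATIC = {"ALA", "VAL", "LEU", "ILE"}


def cbeta_feature(residues):
    """Return (cumulative C-beta distance, number of pairs).

    residues: iterable of (residue_name, (x, y, z)) with the C-beta
    coordinate of each residue (assumed input layout).
    """
    coords = [xyz for name, xyz in residues if name in ALIPHATIC]
    pairs = list(combinations(coords, 2))
    total = sum(math.dist(a, b) for a, b in pairs)
    return total, len(pairs)
```

Returning the pair count alongside the sum mirrors the text, where the cumulative distance is examined against the number of aliphatic nonpolar pairs so that the feature can be used independently of protein size.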

1.5. Secondary structure penalty (F4)

Secondary structure prediction has crossed 80% accuracy [55] over the last few years and is continually improving [56]. Thus deviations from ideal predicted secondary structures can be used to measure the quality of a structure [28]. A detailed description of the secondary structure penalty calculation is provided in Supplementary Text S1.

1.6. Solvent accessible area of polar residues (F5)

Polar amino acids are assumed to be on the surface of proteins, exhibiting maximum solvent accessible surface area. They interact with the solvent and give overall rigidity to the protein structure; their solvent accessibility is utilized in this model as a separate feature to assess the quality of a modeled protein structure. Further, to make the prediction model size and length independent, the number of such polar amino acids is added along with the solvent accessible area. Supplementary Fig. 1(d) illustrates the behavior of the solvent accessible surface area of polar residues with respect to the number of polar residues in 2493 native protein structures, which suggests a strong mutual dependence with a correlation coefficient (r) of 0.96.

1.7. Solvent accessible area of charged residues (F6)

Similar to polar amino acids, charged residues also show high solvent accessibility and constitute the exposed surface of protein structures. The value of the solvent accessible area of charged residues, along with the number of charged residues in a given protein, is used as another physico-chemical feature in model building. Supplementary Fig. 1(e) illustrates the behavior of the solvent accessible area of charged residues with respect to the number of charged residues in 2493 native protein structures, which suggests a strong mutual dependence with a correlation coefficient (r) of 0.97.

In summary, all the six parameters described above have shown a reliable consistency over the 2493 native structures belonging to different SCOP classes and varying widely in the number of residues. This observation prompted us to use them as components of a metric to predict the distance of any structure from its native.

2. Materials and methods

2.1. Feature selection

Six different physicochemical features are selected for predicting the distance of a protein structure from its corresponding native without using its native form. These features are: total accessible surface area, total all atom non-bonded energy, secondary structure penalty, Cβ geometrical constraint, solvent accessible surface area of polar residues and solvent accessible surface area of charged residues. The significance of these parameters has already been presented by demonstrating their universalities on the native structures in Supplementary Fig. 1. Training and validation datasets consisted of 409 protein systems, which include different natives and their corresponding decoys. The total number of amino acids varies widely across these systems; therefore, to make the predictive model size independent, residue length, the total number of aliphatic hydrophobic residue pairs, and the numbers of polar and charged residues are added to the six parameters above, giving ten features in all. These features are trained against RMSD using the random forest machine learning method.

2.2. Data classification

Primarily, two types of datasets are chosen: (i) public decoys from Decoys ‘R’ Us (http://dd.compbio.washington.edu/) [57] and (ii) server predictions of CASP experiments CASP5 to CASP9 (http://predictioncenter.org/download_area/). This constitutes 409 systems with their decoys covering an RMSD range of 0–30 Å. There are 278 systems belonging to CASP decoys while 131 systems are from the public decoy dataset. Following is a description of the datasets used in this study:

(i) Public decoys
(a) 4state_reduced (b) Fisa (c) Fisa_casp3 (d) Lattice_ssfit (e) Lmds (f) ROSETTA (g) rosetta_decoys_62 (h) CASP5

(ii) CASP decoys
(a) CASP5 (b) CASP6 (c) CASP7 (d) CASP8 (e) CASP9.

In CASP decoys, only those model structures were considered whose sequence matches the target sequence (the sequence given by the CASP organizer team); any model showing missing residues was discarded. Apart from this, models which showed severe atomic clashes and whose total energy after minimization was undefined (NAN: not a number) were also excluded from the final dataset. Supplementary Table 1 describes the name and number of decoys of the protein systems selected. After a careful inspection of the dataset, we observed that there are ~4500 decoy structures in the RMSD span of 5–8 Å, which is the highest populated bin. However, in the decoy dataset there are ~88,000 decoys and only 409 native structures, so we added an additional 4477 native crystal structures (equal to the highest populated RMSD bin) with no missing residue (i.e. no REMARK 465), with resolution less than 2.5 Å and sequence identity less than 30%, from RCSB to homogenize the number of 0 Å structures in the pool of decoys for a better and uniform training. This adds up to 92,598 structures including natives. Next, this dataset is divided into training (70%) and test (30%) sets. All physico-chemical features along with residue numbers, aliphatic hydrophobic residue pairs, and the numbers of polar and charged residues were calculated for these decoys; Supplementary Fig. 2 shows the behavior of these features (total surface area, total energy, geometrical Cβ constraint, polar surface area and charged surface area) on selected decoy sets. Although the distribution of data points in this figure is not as tight as for native structures (Supplementary Fig. 1), all these features show significant correlation with their corresponding partner. This also shows the degeneracy problem in the dataset and suggests the requirement of multiple features in building predictive models.
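The 70%/30% division mentioned above can be sketched as a seeded random split. The text does not say how structures were assigned to the two sets, so shuffling individual structures with a fixed seed is an assumption made here purely for illustration, as is the function name.

```python
import random


def split_dataset(items, train_fraction=0.7, seed=0):
    """Shuffle item indices reproducibly and cut them into train/test lists."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    cut = round(train_fraction * len(items))
    train = [items[i] for i in indices[:cut]]
    test = [items[i] for i in indices[cut:]]
    return train, test
```

Using a private `random.Random(seed)` instance keeps the split reproducible without touching the global random state. Every item lands in exactly one of the two lists.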

2.3. Machine learning method employed

The dataset is trained using the random forest regression method. This method is a decision based model in which multiple trees are grown to achieve high accuracy. Each tree starts from a master node, and at each node a decision is made based on the value of a physico-chemical parameter/feature. Each leaf node in a given tree acts as an individual prediction. For any query structure, the set of physicochemical parameters is calculated first and given to the model for prediction. The model has 550 trees. The parameters are passed to each tree, where they follow a definite path based on their values and lead to a prediction. The same process is repeated for all the 550 trees, where each tree is associated with a single prediction. Eventually the mean of all predictions is computed and given as the final prediction. Supplementary Fig. 3 illustrates the prediction model in a simplified form where only one tree out of 550 trees is shown. Here, paths which lead to the final prediction (RMSD in this case) are highlighted with the parameter's cutoff value. Paths that are not followed in the particular prediction are indicated as NA1, 2, etc. and terminated at that point. A decision is taken at each node based on the actual parameter's value and its corresponding cutoff. In Supplementary Fig. 3, four different structures are shown with their corresponding predicted RMSDs.
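The prediction step described above, one path per tree followed by a mean over trees, can be illustrated with a toy forest. The nested-dictionary tree layout, the feature names and the cutoff values below are invented for the example; they are not the trained D2N model, and tree growing itself is not shown.

```python
def predict_tree(node, features):
    """Follow one path from the master node to a leaf and return its value."""
    while "leaf" not in node:
        go_left = features[node["feature"]] <= node["cutoff"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]


def predict_forest(trees, features):
    """Final prediction: mean of the individual tree predictions."""
    predictions = [predict_tree(tree, features) for tree in trees]
    return sum(predictions) / len(predictions)


# Two hypothetical single-split trees (the real model has 550 trees).
toy_trees = [
    {"feature": "ss_penalty", "cutoff": 0.5,
     "left": {"leaf": 2.0}, "right": {"leaf": 8.0}},
    {"feature": "total_sasa", "cutoff": 100.0,
     "left": {"leaf": 4.0}, "right": {"leaf": 6.0}},
]
```

For a query with `ss_penalty = 0.2` and `total_sasa = 150.0`, the first tree ends at the leaf 2.0 and the second at 6.0, so the forest reports their mean, 4.0.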

Random forest allows us to calculate the importance of each feature used in model building. This method assigns an average weight to each feature in terms of its contribution to the prediction model; a detailed description is provided in Supplementary Text S2. Secondary structure penalty showed the highest importance in predicting the distance of a protein structure, while charged residue solvent accessible surface area has the lowest importance if we consider only physico-chemical properties. Primarily, we applied different machine learning methods to predict the distance of any protein structure from its native using their physico-chemical features. Although we used only ten features in the prediction model, which can be easily handled by a simple optimization technique, simple optimization yields relatively poor results due to the high degeneracy of these features among the data points. Here degeneracy implies different protein structures having a similar value for one or more features, which makes the prediction process difficult. Random forest is a decision based method where each feature/parameter is treated individually in prediction; therefore duplicity/similarity in one feature can be overcome by another distinct feature. The linear model method, which is one of the simplest optimization techniques, was used but it showed PCC = 0.25, RMSE = 8.94 and accuracy = 14.33% on the test dataset. These results are inferior to the random forest results. Similarly, neural networks also showed low accuracy and PCC in comparison to the random forest method. A detailed comparison of these methods is provided in Supplementary Text S2.

2.4. Model building (RMSD)

We designed three training models based on RMSDs. Model-I is trained on the complete training set consisting of 64,827 structures which covered the whole range of RMSDs (0–30 Å). Model-II is trained on 29,124 structures whose RMSDs lie in the range of 0–10 Å. Model-III is trained on 13,793 structures covering an RMSD range of 0–5 Å. Once trained, the models are validated on the test set. They are combined as three different layers to predict the RMSD of any given structure. Fig. 2 shows the pipeline of implementation. Any structure first enters Model I for RMSD prediction. If the predicted RMSD of a given structure is more than 10 Å, then it is considered as the final predicted RMSD and the structure exits from the pathway. If the predicted RMSD is less than 10 Å, then the structure goes to Model II. If that prediction is more than 5 Å, then the predicted RMSD is considered final and the process terminates. Otherwise, the structure goes to Model III, which is meant for 0–5 Å. Here, any prediction by this model is considered as final. Models I, II and III are thus used serially to predict the RMSD.

The all atom energy uses a 6/12 potential to describe the van der Waals energy, which is expected to work for high quality structures but not for low or poor quality models. Thus we also designed a prediction model where the all atom energy is eliminated from the feature list. Here also the prediction is ruled by Models I, II, and III as described above. The final RMSD prediction is an average of the models with and without energy.
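The serial use of Models I, II and III and the final averaging can be sketched as below. The models are represented as plain callables that map a feature dictionary to a predicted RMSD; these callables and the function names are placeholders, not the trained models. The text does not say what happens when a prediction is exactly 10 Å or 5 Å, so passing such a structure on to the next model is an assumption.

```python
def cascade_rmsd(features, model_1, model_2, model_3):
    """Three-layer prediction: exit early on a large predicted RMSD."""
    prediction = model_1(features)
    if prediction > 10.0:
        return prediction            # final, structure leaves the pathway
    prediction = model_2(features)
    if prediction > 5.0:
        return prediction            # final for the 5-10 angstrom range
    return model_3(features)         # Model III is always final


def d2n_rmsd(features, models_with_energy, models_without_energy):
    """Average of the cascades trained with and without the energy feature.

    Each argument is a (model_1, model_2, model_3) tuple; the callables in
    the second tuple are assumed to ignore the energy feature themselves.
    """
    with_energy = cascade_rmsd(features, *models_with_energy)
    without_energy = cascade_rmsd(features, *models_without_energy)
    return (with_energy + without_energy) / 2.0
```

A structure that Model I places at 12 Å never reaches Models II or III, whereas one placed at 4 Å by Model I and 3 Å by Model II receives its final value from Model III.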

2.5. Model building (TM and GDT)

We also developed learning models to predict TM and GDT scores. Here a single model is sufficient to predict each score. The same physico-chemical parameters were utilized as features in the learning process for the training (70%) and validation (30%) datasets.

To evaluate the performance of the predictive model, a 10-fold cross validation was performed (Supplementary Text S2) for RMSE, correlation, R2 and accuracy. Results showed high consistency of the predictive model on different sets of training and testing data.
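A 10-fold cross validation partitions the data into ten folds and uses each fold once for testing while the remaining nine are used for training. The sketch below generates such index sets; assigning indices to folds in an interleaved order without shuffling is an assumption, since the fold construction is described only in the supplementary material.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    folds = [list(range(start, n_items, k)) for start in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [
            index
            for fold_number, fold in enumerate(folds)
            if fold_number != held_out
            for index in fold
        ]
        yield train, test
```

Metrics such as RMSE or correlation would then be computed on each held-out fold and compared across folds to judge the consistency of the model.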

3. Results

3.1. Root mean square deviation (RMSD) prediction

Fig. 2. RMSD prediction flowchart: Three models were utilized for an accurate prediction of distance to native (RMSD) of a query structure. These models were categorized based on RMSD range. Decision accuracy: Model I (Path I) = 99.2% and Model I (Path II) = 98.8% with error 2 Å; Model II (Path I) = 95.9% and Model II (Path II) = 97.8% with error 1 Å; Model III = 93.2%.

Final accuracy of the current prediction was calculated on high, medium, low and bad quality structures. Here the validation dataset was divided into different bins based on the RMSDs of the given structures. Classification and accuracy of validation data are shown in Table 1. This result shows that a native can be predicted as a structure of less than 1.5 Å RMSD in ~89.3% of the cases. High quality structures (0–3 Å) can be predicted to within ±1.5 Å error in ~98.5% of the cases. It means a 2 Å structure may be reported as either a 1 Å or a 3 Å structure, or anywhere in between, with ~98% accuracy. Similarly, results for medium, low and bad quality structures are also shown in Table 1 with the range of acceptable errors.
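The binned accuracies reported in Table 1 can be computed from pairs of actual and predicted RMSDs. The Python sketch below is illustrative only: the function name `accuracy_within`, the half-open bin boundaries and the toy data are assumptions, not part of the D2N code, and natives (which form their own row in Table 1) would need to be separated before binning.

```python
from typing import Sequence


def accuracy_within(actual: Sequence[float], predicted: Sequence[float],
                    tolerance: float, lo: float, hi: float) -> float:
    """Percent of structures with lo <= actual RMSD < hi whose predicted
    RMSD lies within +/- tolerance (Å) of the actual value."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if lo <= a < hi]
    if not pairs:
        raise ValueError("no structures in this RMSD bin")
    hits = sum(1 for a, p in pairs if abs(a - p) <= tolerance)
    return 100.0 * hits / len(pairs)


# Toy data: four structures fall in the 0-3 Å bin, one does not.
actual = [0.8, 1.9, 2.5, 2.9, 4.1]
predicted = [1.2, 2.1, 4.3, 2.0, 4.0]
high_quality = accuracy_within(actual, predicted, tolerance=1.5, lo=0.0, hi=3.0)
```

In the toy data three of the four high quality structures are predicted to within ±1.5 Å, so `high_quality` evaluates to 75.0.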

All the different systems and their respective decoys considered in the test set are pooled together to calculate the correlation coefficient between predicted and original RMSDs. A plot of the predicted and original RMSDs spanning a range of 0–30 Å is shown in Fig. 3. A Pearson correlation coefficient (PCC) of ~0.90 reflects a high accuracy of prediction for all structures. To examine the prediction accuracy at a finer level, we also evaluated the trend of prediction for smaller RMSD bins, 0–10 Å (PCC = 0.87) and 10–30 Å (PCC = 0.83), separately. The PCCs of the different RMSD bins suggest high accuracy of the proposed D2N metric for structures below 10 Å. In the test set, decoys belong to different systems. Nonetheless, D2N is able to discriminate poor models from good models for all systems.
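The pooled and per-range correlations quoted above follow the usual Pearson definition. A minimal Python sketch is given here for concreteness; the helper names and the toy RMSD values are assumptions and do not come from the D2N test set.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def pearson_in_range(actual: Sequence[float], predicted: Sequence[float],
                     lo: float, hi: float) -> float:
    """PCC restricted to structures with lo <= actual RMSD < hi."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if lo <= a < hi]
    xs, ys = zip(*pairs)
    return pearson(xs, ys)


# Toy pooled data (Å); values are illustrative only.
actual = [1.0, 3.0, 6.0, 9.0, 14.0, 22.0]
predicted = [1.4, 2.5, 7.1, 8.0, 17.0, 19.5]
overall = pearson(actual, predicted)
below_10 = pearson_in_range(actual, predicted, 0.0, 10.0)
```

Restricting the data to a narrower RMSD window reduces the spread of the actual values, which by itself tends to lower the correlation; this is worth keeping in mind when comparing the pooled PCC with the per-bin values.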

Table 1
RMSD prediction performance with range of acceptable errors on test set structures of validation dataset. Bold represents the prediction accuracy in acceptable error range.

RMSD (Å)        Prediction accuracy (%) within acceptable error
                ±1 Å     ±1.5 Å   ±2 Å     ±3 Å     ±4 Å     ±5 Å
Native          82.49    89.30    91.08    92.78    93.19    93.19
0–3 (high)      97.42    98.45    99.01    99.26    99.48    99.74
3–5 (medium)    87.98    93.33    95.63    98.06    98.65    98.94
5–8 (low)       53.40    71.44    83.56    93.49    96.35    97.87
8–12 (bad)      45.03    59.28    71.69    86.66    93.78    96.33
12–30 (bad)     36.40    50.74    61.68    76.55    84.41    89.40

3.2. TM (Template Modeling) score prediction

TM score is considered a more accurate quantitative assessment of protein structures than RMSD and GDT. TM score calculates the structural similarity between two structures and gives an output in the range 0–1. Here a score of 1 indicates a perfect match while 0 refers to a completely random structure. This score is sensitive to the topology of folds, unlike RMSD, which is highly sensitive to local variations. We calculated the TM scores of our candidate structures using the Zhang lab utility (http://zhanglab.ccmb.med.umich.edu/TM-score/) [58,59] and trained on them using the random forest method with the same set of physicochemical properties that we used earlier for RMSD prediction. Here we developed only one model to predict TM score. The prediction result is shown in Table 2.
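For reference, the TM-score of Zhang and Skolnick [58] down-weights large deviations through a length-dependent scale d0. The Python sketch below evaluates the score for one fixed superposition from a list of aligned Cα–Cα distances; it is an assumption-laden illustration, since the actual TM-score program also searches for the superposition that maximizes the score, and the function name and the lower limit on chain length are choices made here.

```python
from typing import Sequence


def tm_score_from_distances(distances: Sequence[float], target_length: int) -> float:
    """TM-score for one fixed superposition.

    distances: Cα-Cα distances (Å) of the aligned residue pairs.
    target_length: number of residues in the native (target) structure.
    """
    if target_length < 19:
        # Below this length the scale d0 defined next is not positive.
        raise ValueError("target_length must be at least 19 for this sketch")
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


perfect = tm_score_from_distances([0.0] * 100, 100)   # every residue superposed exactly
half = tm_score_from_distances([0.0] * 50, 100)       # only half the residues aligned
```

Because each residue contributes at most 1/L, one badly placed loop lowers the score only slightly, which is the topology sensitivity contrasted with RMSD in the text.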

3.3. GDT (global distance test) score prediction

GDT [60,61] is a global distance test which is proposed as a better parameter than RMSD for structure evaluation in relation to the native because it is independent of local variations between the structures. The position of the alpha carbon atom of each residue is compared with the experimental structure and the GDT score is calculated based on different cutoff distances (4 Å, 5 Å, 6 Å). The score generally increases with cutoff distance but always lies between 0 and 1. Here, we calculated the GDT score using the Zhang lab utility (http://zhanglab.ccmb.med.umich.edu/TM-score/) and trained on it using the random forest machine learning method with the same physicochemical parameters. The prediction result on the validation dataset is shown in Table 2.
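The cutoff-based counting behind a GDT-style score can be illustrated with a few lines of Python. This is a simplified sketch for a single superposition: real GDT implementations search over many superpositions for each cutoff, and the function names and toy distances are assumptions. The widely used GDT_TS variant averages the fractions at 1, 2, 4 and 8 Å, which is shown as a second helper.

```python
from typing import Sequence


def gdt_fraction(distances: Sequence[float], cutoff: float) -> float:
    """Fraction of Cα atoms lying within `cutoff` Å of their native position."""
    return sum(1 for d in distances if d <= cutoff) / len(distances)


def gdt_average(distances: Sequence[float],
                cutoffs: Sequence[float] = (1.0, 2.0, 4.0, 8.0)) -> float:
    """Average of the per-cutoff fractions (GDT_TS-style by default)."""
    return sum(gdt_fraction(distances, c) for c in cutoffs) / len(cutoffs)


# Toy per-residue Cα deviations (Å).
deviations = [0.5, 3.0, 4.5, 5.5, 9.0]
at_4, at_5, at_6 = (gdt_fraction(deviations, c) for c in (4.0, 5.0, 6.0))
gdt_ts_like = gdt_average(deviations)
```

The three fractions rise from 0.4 to 0.6 to 0.8 as the cutoff grows from 4 Å to 6 Å, matching the statement that the score increases with cutoff distance while staying between 0 and 1.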

Fig. 3. Actual vs predicted RMSDs: Actual value of RMSD is plotted against predicted value on validation dataset for 0–30 Å RMSD structures; correlation coefficient of prediction is 0.90.

Table 2 shows that the machine learning models employed are able to predict TM and GDT scores within an acceptable error range of ±0.2 with ~94.5% accuracy. Tables 1 and 2 jointly represent the accuracy of our models for predicting the distance of any candidate structure in terms of RMSD, TM and GDT without knowledge of the native.

3.4. Case study (PDB: 2SIM)

Data for one of the targets from CASP10 (T0644) are shown in Fig. 4, where a variety of decoys were considered, spanning the range 0–10 Å RMSD from their native structure. The RMSD predicted by the current D2N method is shown along with the actual RMSD to demonstrate the reliability of this method in unknown cases where the native is not available.

3.5. CASP10 validation

3.5.1. Actual and predicted RMSD

Training and testing of D2N are done on public decoys along with CASP (5–9) models; the CASP10 dataset was not included in this process. The current method is thus tested on CASP10 models to evaluate the quality of the models. CASP10 has a stage 2 dataset for each target which contains the best 150 models selected according to a reference consensus quality predictor by the CASP team. These 150 models for each target were released to different quality servers for their quality prediction. Here 71 targets from CASP10 whose experimental structures are released in the public domain were collected with their corresponding best 150 models. These models were evaluated by D2N and their predicted RMSDs are compared with their actual RMSDs. All models for the 71 targets were pooled together and the PCC was computed, which reached 0.85.

Table 2
Prediction result of TM and GDT score with various error bars on 39,677 structures of validation dataset. Bold represents the prediction accuracy in acceptable error range.

Score   Dataset   Correlation   Prediction accuracy (%)
                                ±0.1    ±0.2    ±0.3    ±0.4    ±0.5
TM      39,677    0.92          82.4    94.4    98      99.1    99.6
GDT     39,677    0.91          82.5    94.5    98      99      99.6

3.5.2. Ability to capture the best model

Another distinctive characteristic of any scoring function is its ability to capture the best model in the pool of decoys. D2N is also tested on its ability to capture the best structure for the released 71 targets. The CASP experiment allows any structure prediction server to submit 5 models. Therefore in this study, we collected the top 5 models for each target according to the lowest RMSD predicted by D2N. Concurrently, the best model for every target is also selected based on its actual RMSD from the native and compared with the RMSD of the best model identified by D2N in the previous step. Table 3 shows the comparison of the lowest RMSD present in the pool of decoys with the lowest RMSD captured by D2N in the top 5 models. The correlation coefficient between them reached ~0.95, as shown in Fig. 5, which is extremely encouraging. Table 3 also shows the ΔRMSD, which is the difference between the RMSD of the model identified by D2N and the lowest RMSD model available in the pool of decoys. The average ΔRMSD is 1.1 Å, which means that D2N captures a structure that is on average within 1.1 Å of the best model available.
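The per-target quantities listed in Table 3 can be derived from the predicted and actual RMSDs of a decoy pool as sketched below in Python. The function name `best_of_top_n` and the toy values are assumptions made for illustration; this is not the evaluation script used for CASP10.

```python
from typing import Sequence, Tuple


def best_of_top_n(predicted: Sequence[float], actual: Sequence[float],
                  n: int = 5) -> Tuple[float, float, float]:
    """Return (captured, delta, first) for one target.

    captured: lowest actual RMSD among the n models with the lowest predicted RMSD
    delta:    captured minus the lowest actual RMSD in the whole pool (ΔRMSD)
    first:    actual RMSD of the single model ranked first by predicted RMSD
    """
    ranked = sorted(range(len(predicted)), key=lambda i: predicted[i])
    top = ranked[:n]
    captured = min(actual[i] for i in top)
    delta = captured - min(actual)
    first = actual[ranked[0]]
    return captured, delta, first


# Toy decoy pool of seven models (Å); the truly best model is ranked sixth.
predicted = [2.0, 3.5, 1.8, 9.0, 4.2, 2.6, 7.7]
actual = [2.9, 2.1, 3.3, 8.4, 5.0, 2.4, 1.9]
captured, delta, first = best_of_top_n(predicted, actual)
```

In this toy pool the best model (1.9 Å) falls outside the top 5, so the captured model is the 2.1 Å one and ΔRMSD is about 0.2 Å; a ΔRMSD of 0 means the best model in the pool was among the five selected.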

3.5.3. Comparative analysis

Evaluation of the quality assessment servers of CASP10 has been completed [62], where different servers were evaluated on standard benchmarks. We collected results of those participating servers which give a single quality assessment score for complete models for the 71 released targets. There are 10 such servers with group numbers: QA006, QA100, QA108, QA146, QA176, QA210, QA270, QA271, QA315, and QA378. We calculated the correlation between their individual quality scores and the original RMSDs for the mentioned targets. Fig. 6(a) shows the comparison bars for QA servers according to the correlation coefficient of their score with original RMSD, along with D2N. The results clearly demonstrate that D2N has the highest correlation (r = 0.85) in comparison to other servers. Data of QA servers have been collected from the CASP10 website (http://www.predictioncenter.org/download_area/CASP10/predictions/). The correlation coefficient is calculated for the decoys of all targets pooled together.

3.5.4. Discriminating good and bad models

The ability of scoring functions to discriminate between good and bad quality models is one of the evaluation criteria for quality assessment servers in CASP [62]. Here we compared 10 quality assessment servers (QA006, QA100, QA108, QA146, QA176, QA210, QA270, QA271, QA315, and QA378) with D2N based on their ROC (receiver operating characteristic) and AUC (area under the curve) analyses. Fig. 6(b) shows the ROC curves of participating servers on the best 150 models selected according to the reference consensus quality predictor by CASP. The threshold RMSD was assumed as 10 Å for separation between good and bad models. Fig. 6(c) shows the area under the curve for these servers for the respective ROC curves. This analysis suggests a high discriminating power of D2N (AUC = 0.91), comparable to the two other top servers, QA176 (AUC = 0.92) and QA108 (AUC = 0.91). As we change the threshold RMSD value from 10 Å to 8 Å, the AUC of all quality assessment servers changes but the ranking remains the same. At 8 Å RMSD, D2N has AUC = 0.90 and stands very close to QA108 (0.92) and QA176 (0.92) among the top discriminating servers.
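The AUC used in this comparison has a simple rank interpretation: it is the probability that a randomly chosen good model receives a better score than a randomly chosen bad one. The Python sketch below computes it directly from that definition for a predicted-RMSD score, where lower is better. The function name, the treatment of a model lying exactly at the threshold as bad, and the toy values are assumptions; this is not the CASP assessment code.

```python
from typing import Sequence


def auc_good_vs_bad(actual: Sequence[float], predicted: Sequence[float],
                    threshold: float = 10.0) -> float:
    """Area under the ROC curve for separating good models (actual RMSD
    below `threshold`) from bad ones, using predicted RMSD as the score."""
    good = [p for a, p in zip(actual, predicted) if a < threshold]
    bad = [p for a, p in zip(actual, predicted) if a >= threshold]
    if not good or not bad:
        raise ValueError("need at least one good and one bad model")
    # A pair counts as a win when the good model has the lower predicted RMSD;
    # ties count as half a win.
    wins = sum(1.0 if g < b else 0.5 if g == b else 0.0 for g in good for b in bad)
    return wins / (len(good) * len(bad))


# Toy pool: three good and three bad models (Å).
actual = [2.0, 4.0, 8.0, 12.0, 15.0, 20.0]
predicted = [3.0, 5.0, 11.0, 9.0, 14.0, 18.0]
auc_10 = auc_good_vs_bad(actual, predicted, threshold=10.0)
```

Here eight of the nine good/bad pairs are ordered correctly, giving an AUC of 8/9; changing `threshold` to 8.0 re-labels the models and so changes the AUC, as described for the 10 Å and 8 Å thresholds above.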

4. Discussion

4.1. Biological relevance

Proteins are the most important biological machinery, responsible for various biological activities. Most of their activities are highly dependent on their physical shape. X-ray crystallography and NMR are the experimental techniques popularly used to determine the structures of proteins. Unfortunately, these techniques are highly cost and labor intensive and are sometimes not feasible due to the particular nature of the proteins. In such a scenario, in silico protein structure prediction emerges as a promising alternative for the scientific community. Despite concerted, focused and rigorous efforts, protein structure prediction to high accuracy remains unrealized, which gives rise to numerous arguments about the reliability of the predicted models. There is thus an urgent need for a robust method which can correctly report the accuracies of in silico modeled protein structures when experimental/native structures are not available. Filling this gap, we present here a robust D2N metric which can assess the quality of a protein structure in terms of its distance to its unknown native structure.

Fig. 4. Case study: Different model structures of CASP10 target (T0644) which correspond to different RMSDs are evaluated by D2N and its predicted RMSDs are compared with the original RMSDs (shown in parentheses).

4.2. D2N performance

It is a routine practice to compute how far a modeled protein structure is from its native structure (crystal or NMR) if available, but if the native structure is not known, it is impossible to assess whether a proposed structure is the native itself and, if it is not, how far it is from the native. We propose here that certain physicochemical properties of soluble proteins, such as total accessible area, energy and spatial distribution of Cβ atoms, which show universalities, encode this information, and that it is possible to formulate a distance to the native (D2N) metric which characterizes how far a structure is from its native state even in the absence of an experimental structure. We considered 409 unique protein sequences from CASP (5–9) and public decoys, which consist of 92,598 candidate structures including 4477 native structures, covering a large range of conformational space. To design a size independent prediction model, the number of amino acids and the number of aliphatic residue Cβ–Cβ pairs were also used during metric preparation as two additional parameters. Utilizing the physicochemical properties and machine learning methods, we formulated the D2N metric for predicting RMSD, TM and GDT scores, which were tested on a large test set of decoys.

Fig. 1 shows a typical challenge in protein structure prediction where a modeled structure may look like the native due to proper arrangement of its secondary structure elements but in reality is 16.5 Å away. We examined our current distance to native methodology on both the structures shown in Fig. 1. The RMSD of the modeled structure is predicted as 15.1 Å while for its native (2HH6.pdb) the predicted RMSD is 0.6 Å.

D2N is able to predict the distance of a proposed structure to within an average acceptable error of ±1.5 Å for natives as well as high and medium quality structures (RMSD = 0–5 Å) with 96% accuracy, as shown in Table 1. Here native structures show an accuracy of 82.5% within the range of ±1 Å error; as all predicted RMSDs have positive values, this means that natives are predicted in the 0–1 Å range in 82.5% of the cases. In the remaining ~17% of cases, native structures are predicted with >1 Å error. A careful examination of these cases suggests that the secondary structure penalty, which is one of the features incorporated in D2N, is significantly high for them. The secondary structure penalty is driven by secondary structure prediction algorithms, which have their own limitations of ~80–82% accuracy. This also affects high and medium quality structures (0–5 Å) where we are unable to predict close to the original RMSD. Another factor which affects the prediction accuracy is the globularity of the protein. In the current prediction model, the combination of physico-chemical parameter values governs the prediction path. In non-globular protein cases, the value of exposed surface area rises abruptly, which completely alters the prediction path. In such cases, although the other parameters resemble those of high/medium quality model structures, the predicted RMSDs are not close to the original RMSDs because of the large surface area. Table 1 also shows the accuracies on low and bad quality models. Here again accuracies differ depending on the quality of the models. For low quality models (5–8 Å) we are able to achieve 93.5% accuracy with ±3 Å error. Our focus in this study is to predict the high and medium quality structures close to their original RMSD. Concurrently, we also developed models for predicting TM and GDT scores, again in the absence of the native, where the accuracies realized are 94.4% and 94.5%, respectively, with an acceptable error of ±0.2.

Any scoring function is evaluated on two parameters: 1) the correlation coefficient of predicted and original scores, and 2) the ability to capture the best/near native models. Fig. 3 shows the PCC of original RMSD with predicted RMSD for 0–30 Å structures. We further subdivided the given RMSD bin into smaller bins of 0–10 Å and 10–30 Å and calculated PCC values for both cases, as shown in Supplementary Fig. 4. Values of correlation are 0.87 for 0–10 Å structures and 0.83 for 10–30 Å. The lower PCC on low and bad quality model structures is because of the high ΔRMSD value between the predicted and original RMSDs. In this case, although the predicted RMSD lies in the 10–30 Å range, its value is away from the original RMSD, which increases ΔRMSD. However, D2N predicts these structures as low and bad quality structures. Relative error is thus under control. Fig. 3 and Supplementary Fig. 4 demonstrate the high accuracy of D2N on high and medium quality models (0–10 Å) and also show its ability to distinguish between good and bad models.

Table 3
Ability of D2N to capture best model on 71 released CASP-10 targets. Average error (ΔRMSD) is 1.11 Å.

Target   Lowest RMSD       RMSD of best model   ΔRMSD   RMSD of first model
         model available   identified by D2N            identified by D2N
T0645    2.46              2.49                 0.03    3.10
T0650    1.38              1.54                 0.16    4.76
T0651    10.52             10.52                0       11.52
T0652    6.4               11.2                 4.8     11.2
T0653    8.2               16.66                8.46    19.83
T0654    2.56              3.09                 0.53    3.74
T0655    4.81              8.65                 3.84    9.01
T0657    3.12              3.33                 0.21    4.15
T0658    8.9               8.9                  0       9.02
T0659    0.86              1                    0.14    1.02
T0661    2.58              2.66                 0.09    2.66
T0662    2.25              2.57                 0.32    2.74
T0663    9.72              11.57                1.85    12.29
T0664    2.55              2.71                 0.16    2.85
T0666    7.2               7.69                 0.49    7.68
T0667    3.04              3.77                 0.73    3.94
T0669    3.76              3.91                 0.15    7.28
T0672    5.25              6.54                 1.3     7.73
T0673    5.38              8.74                 3.36    8.74
T0674    10.46             16.72                6.27    16.82
T0675    4.4               7.37                 2.97    7.37
T0676    7.09              7.09                 0       15.39
T0678    6.28              7.84                 1.56    7.84
T0679    3.66              3.91                 0.25    4.17
T0680    3                 3.97                 0.97    3.97
T0681    2.1               2.22                 0.12    2.25
T0682    1.59              1.59                 0       1.59
T0683    2.52              3.02                 0.49    4.79
T0684    16.07             18.58                2.52    18.58
T0685    4.73              4.73                 0       6.64
T0686    9.5               10.97                1.47    17.90
T0687    3.71              3.93                 0.22    4.20
T0688    2.46              2.52                 0.07    2.75
T0689    1.61              1.8                  0.19    1.84
T0690    13.36             14.48                1.12    14.48
T0691    7.21              7.73                 0.52    9.87
T0692    1.85              2.1                  0.25    2.12
T0699    2                 2.17                 0.17    2.32
T0700    6.57              7.15                 0.58    7.15
T0703    2.47              2.52                 0.05    2.72
T0704    2.54              2.54                 0       2.75
T0705    15.14             15.45                0.31    26.64
T0707    6.44              7.65                 1.22    9.88
T0708    2.41              2.8                  0.4     4.00
T0712    1.91              2.07                 0.16    2.07
T0713    17.24             17.5                 0.26    18.13
T0714    1.12              1.18                 0.06    1.19
T0715    3                 3.34                 0.34    3.34
T0716    3.96              4.3                  0.35    6.84
T0717    13.13             13.13                0       13.13
T0719    26.21             26.21                0       30.88
T0721    2.04              2.19                 0.16    2.97
T0724    15.11             17.68                2.56    18.16
T0726    12.58             16.15                3.57    16.22
T0731    10.34             10.34                0       11.83
T0733    2.99              3.42                 0.43    3.56
T0735    17.42             17.42                0       17.67
T0736    3.61              5.98                 2.37    5.98
T0737    12.85             17.62                4.78    17.62
T0738    1.83              1.97                 0.14    2.02
T0742    5.33              5.33                 0       6.13
T0743    3.53              6.57                 3.03    6.57
T0744    5.91              6.56                 0.65    6.56
T0746    7.15              7.23                 0.08    8.57
T0747    4.52              12.06                7.54    12.14
T0749    1.48              1.88                 0.4     2.01
T0752    1.49              1.49                 0       1.65
T0753    2.54              2.54                 0       2.54
T0755    3.11              3.18                 0.07    3.18
T0756    7.99              10.06                2.07    11.99
T0757    1.89              2.65                 0.76    2.65

Fig. 5. Ability of D2N to capture the best model from a pool of decoys of CASP10 dataset. Correlation coefficient between RMSD of best model identified by D2N in top 5 and lowest RMSD models available in pool is 0.95.

Fig. 5 and Table 3 collectively show the ability of D2N to capture the best model from the pool of decoys; the average ΔRMSD between the best model available and the best model in the top 5 picked by D2N is 1.1 Å. Table 3 shows 29 high quality models (RMSD = 0–3 Å) out of 71, for which D2N captured a structure with ΔRMSD less than 1 Å in 100% of the cases. Among the medium quality models, target T0747 has the highest deviation, 7.5 Å, between the best model available and the best model in the top 5 picked by D2N. The native of T0747 has a typical fold in which two domains are well separated from each other, leaving a large interface area exposed to solvent. In such cases, the total accessible surface area of the native/native-like model becomes high and pushes the prediction path in the wrong direction. Fig. 5 shows the same ability of D2N to capture the best model graphically. Here, it is clearly demonstrated that for models below 5 Å, the data are tightly packed around the trend line, which shows high linearity.

Per target evaluation of D2N shows that there are a few cases where the predicted RMSD is not highly correlated with the actual RMSD. In the test set, CASP8_T0460 is the only such case, where the value of PCC is 0.24 (shown in Supplementary Fig. 5). When we observe the data and plot closely, we find this to be because of the poor prediction of the native in this case, with the overall correlation dropping to 0.24. If we remove the native from the dataset, the PCC reaches 0.41. Another important fact to note is the ability of D2N to distinguish bad and good models. Although the correlation is not very high in this case, both original and predicted RMSDs lie in the 15–20 Å RMSD bin, which confirms the prediction of these models as bad quality models by D2N.

Fig. 6. (a) Comparison bars for QA servers based on the correlation coefficients of their predicted scores with original RMSDs. (b) ROC curves of 10 QA servers to evaluate their discriminating potential, where the threshold RMSD was fixed as 10 Å. (c) Bar chart of AUC (area under the curve) for different QA servers; QA176 has the highest area (0.92) while D2N stands at second position with QA108, showing an AUC of 0.91.

The D2N metric has been validated on the CASP10 dataset, which was not included in the training/testing process of metric building. Therefore CASP10 data are completely unknown to the model and appropriate for validation purposes. D2N showed a high PCC value of 0.90 on testing data for models lying in the RMSD range of 0–30 Å, while it dropped to 0.85 when computed on CASP10 targets. CASP10 data are composed of 71 targets and their respective 150 models which cleared the stage 2 criteria according to the CASP organizers. These models were also evaluated by CASP10 quality assessment (QA) servers, where D2N outperformed the others with the highest correlation coefficient of 0.85 while the next best is 0.77 by QA006 (shown in Fig. 6(a)). Several CASP10 targets were considered as tough targets based on their available homologs in the structural database. These targets are: T0651, T0653, T0671, T0684, T0690, T0705, T0713, T0717, T0724, T0726, T0735, and T0737. For these targets, CASP10 participants were unable to generate any high/medium quality models during the CASP competition. In such cases the correlation coefficient of predicted and original RMSDs is relatively lower because all structures lie in the small RMSD span of 10–15 Å and did not provide a wide range of RMSDs. We observe in Supplementary Fig. 4 that the PCC of D2N is better for high/medium quality models in comparison to bad quality models. Although the PCC computed for these targets did not reach a high value, D2N correctly identified them as bad models and distinguished them from their respective native structures. We used RMSD, which is one of the oldest intermolecular methods for structure comparison, as the major metric for formulating and evaluating D2N. RMSD computes a global score for two structures, which is also considered its limitation because a small variation in one part of the structure can result in a very high RMSD. This resulted in the development of local region comparison methods such as TM, regarded as more meaningful than RMSD. But in the absence of functional information on proteins, we cannot give weight to different local regions such as the active site, allosteric site, and functional domain. Therefore in protein structure prediction we need to generate models which give high quality global as well as local scores. Thus we focused on RMSD as the main criterion to evaluate distance to the native in this study, but we also predict TM and GDT as different metrics for structural similarity.

Protein structure prediction demands a distance metric technique which can not only detect the best structure out of a large pool of decoys but also report a measure that denotes the distance of any modeled structure to its native, for assessing the quality of the modeled structure and for suggesting the degree of refinement required. The distance to the native (D2N) metric is clearly a big step in this direction. It is conceivable that D2N, in conjunction with homology/de novo strategies for structure generation, should help in assigning structures to the soluble fraction of an entire proteome of an organism or to the Swissprot/Uniprot sequences in the not too distant future.

5. Usage

1. Preparation of protein structure:
   a) Remove multiple chain entries from the protein and consider only one chain.
   b) Remove ligand, water or any HETATM record.
   c) Check for missing and modified amino acids.
2. Upload the protein structure in PDB format to the D2N tool (http://www.scfbio-iitd.res.in/software/d2n.jsp).
3. Enter an email ID for sending the result and submit.
4. Note down the job-ID to know the status of the job.
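Steps 1a and 1b of the preparation can be carried out with a few lines of Python before uploading. This is a minimal sketch under stated assumptions, not a tool provided with D2N: the function names are hypothetical, only ATOM records are kept, the chain identifier is read from column 22 of the fixed-width PDB format, and multi-model (NMR) files are not handled.

```python
from typing import List, Optional


def clean_pdb_lines(lines: List[str], chain_id: Optional[str] = None) -> List[str]:
    """Keep ATOM records of a single chain; drop HETATM records (ligands,
    water) and all other chains. Defaults to the first chain encountered."""
    kept = []
    for line in lines:
        if not line.startswith("ATOM") or len(line) < 22:
            continue
        chain = line[21]                 # chain identifier, column 22
        if chain_id is None:
            chain_id = chain
        if chain == chain_id:
            kept.append(line)
    return kept


def make_record(record: str, serial: int, name: str, res: str, chain: str, resseq: int) -> str:
    """Build the first 26 columns of a PDB coordinate line (for the demo only)."""
    return f"{record:<6}{serial:>5} {name:<4} {res:>3} {chain}{resseq:>4}"


example = [
    make_record("ATOM", 1, " N", "MET", "A", 1),
    make_record("ATOM", 2, " N", "GLY", "B", 1),
    make_record("HETATM", 100, " O", "HOH", "A", 201),
]
cleaned = clean_pdb_lines(example)
```

Modified residues such as selenomethionine (MSE) are stored as HETATM records, so this filter would silently delete them; that is one reason step 1c, checking for missing and modified amino acids, should be done by inspection.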

6. Funding

Programme support to the Supercomputing Facility for Bioinformatics & Computational Biology (SCFBio), IIT Delhi from the Department of Biotechnology, Govt. of India (BT/01/COE/07/05) is gratefully acknowledged.

Supplementary data to this article can be found online at http://dx.doi.org/10.1016/j.bbapap.2014.07.010.

References

[1] G.M. Cooper, A molecular Approach, The Cell, Boston University, 2000.
[2] A. Bruce, J. Alexander, L. Julian, R. Martin, R. Keith, W. Peter, The shape and structure of proteins, Molecular Biology of the Cell, 4th edition, Garland Science, New York, 2002.
[3] C.B. Anfinsen, Principles that govern the folding of protein chains, Science 181 (1973) 223–230.
[4] S. Salzberg, D. Searls, S. Kasif, Grand challenges in computational biology, Computational Methods in Molecular Biology, Elsevier, 1998.
[5] R. Unger, J. Moult, Finding the lowest free energy conformation of a protein is an NP-hard problem, proof and implication, Bull. Math. Biol. 55 (1993) 1183–1198.
[6] A.S. Fraenkel, Complexity of protein folding, Bull. Math. Biol. 55 (1993) 1199–1210.
[7] J. Cheng, A.Z. Randell, M.J. Sweredoski, P.S. Baldi, A protein structure and structural feature, Nucleic Acids Res. 33 (2005) 72–76.
[8] B. Jayaram, K. Bhushan, S.R. Shenoy, P. Narang, S. Bose, P. Agrawal, D. Sahu, V. Pandey, Bhageerath: an energy based web enabled computer software suite for limiting the search space of tertiary structures of small globular proteins, Nucleic Acids Res. 34 (2006) 6195–6204.
[9] L.H. Hung, S.C. Ngan, T. Liu, R. Samudrala, PROTINFO: new algorithms for enhanced protein structure prediction, Nucleic Acids Res. 33 (2005) 77–80.
[10] C.A. Rohl, C.E. Strauss, K.M. Misura, D. Baker, Protein structure prediction using Rosetta, Methods Enzymol. 383 (2004) 66–93.
[11] D.E. Kim, D. Chivian, D. Baker, Protein structure prediction and analysis using the Robetta server, Nucleic Acids Res. 32 (2004) 526–531.
[12] O. Lund, M. Nielsen, C. Lundegard, P. Worning, CPH models 20: X3M a computer program to extract 3d models, Abstract at the CASP5 Conference, 1022002.
[13] N. Guex, M.C. Peitsch, SWISS-MODEL and the Swiss-PdbViewer, an environment for comparative protein modeling, Electrophoresis 18 (1997) 2714–2723.
[14] C. Lambert, N. Leonard, X. DeBolle, E. Depiereux, ESyPred3D: prediction of proteins 3D structures, Bioinformatics 18 (2002) 1250–1256.
[15] A. Sali, T. Blundell, Comparative protein modeling by satisfaction of spatial restraints, J. Mol. Biol. 234 (1993) 779–815.

[16] C. Combat, M. Jambon, G. Deleage, C. Geourjon, Geno3D: automatic comparative molecular modelling of protein, Bioinformatics 18 (2000) 213–214.
[17] P.A. Bates, L.A. Kelley, R.M. MacCallum, M. Sternberg, Enhancement of protein modeling by human intervention in applying the automatic programs 3D-JIGSAW and 3D-PSSM, Proteins 45 (2001) 39–46.
[18] B. Priyanka Dhingra, A. Jayaram, Homology/ab initio hybrid algorithm for sampling near-native protein conformations, J. Comput. Chem. 34 (2013) 1925–1936.
[19] B. Jayaram, P. Dhingra, B. Lakhani, S. Shekhar, Bhageerath: targeting the near impossible, pushing the frontiers of atomic models for protein tertiary structure prediction, J. Chem. Sci. 124 (2012) 83–91.
[20] A.D. Ken, L.M. Justin, The protein-folding problem, 50 years on, Science 23 (2012) 1042–1046.
[21] B.J. McConkey, V. Sobolev, M. Edelman, Discrimination of native protein structures using atom–atom contact scoring, Proc. Natl. Acad. Sci. 100 (2003) 215.
[22] P. Benkert, M. Kunzli, T. Schwede, QMEAN server for protein model quality estimation, Nucleic Acids Res. 37 (2009) 510–514.
[23] P. Benkert, S.C. Tosatto, D. Schomburg, QMEAN: a comprehensive scoring function for model quality assessment, Proteins 71 (2008) 261–277.
[24] B.J. McConkey, V. Sobolev, M. Edelman, Discrimination of native protein structures using atom–atom contact scoring, Proc. Natl. Acad. Sci. 100 (2003) 3215.
[25] J. Zhang, R. Chen, J. Liang, Empirical potential function for simplified protein models, combining contact and local sequence-structure descriptors, Proteins 63 (2006) 949–960.
[26] P. Narang, K. Bhushan, S. Bose, B. Jayaram, Protein structure evaluation using all-atom energy based empirical scoring function, J. Biomol. Struct. Dyn 23 (2006) 385–406.
[27] N. Arora, B. Jayaram, Strength of hydrogen bonds in alpha helices, J. Comput. Chem. 18 (1997) 1245–1252.
[28] A. Mishra, S. Rao, A. Mittal, B. Jayaram, Capturing native/native like structures with a physico-chemical metric pcSM in protein folding, Biochim. Biophys. Acta 1834 (2013) 1520–1531.
[29] B. Jayaram, A. Mittal, A. Mishra, C. Acharya, G. Khandelwal, Universalities in protein tertiary structures, some new concepts, Biomolecular Forms and Functions, 8, World Scientific, 2013. 210–219.
[30] A.G. Murzin, S.E. Brenner, T. Hubbard, C. Chothia, SCOP, a structural classification of proteins database for the investigation of sequences and structures, J. Mol. Biol. 247 (1995) 536–540.
[31] P.W. Rose, The RCSB Protein Data Bank, redesigned web site and web services, Nucleic Acids Res. 39 (2011) 392–401.
[32] C. Cyrus, The nature of the accessible and buried surfaces, proteins, J. Mol. Biol. 105 (1976) 105–114.
[33] J. Janin, Surface and inside volume in globular protein, Nature 277 (1979) 491–492.
[34] C.C. Pallise, D. Parry, Quantitative comparison of the ability of hydropathy scales to recognize surface β-strands in proteins, Proteins 42 (2001) 243–255.
[35] R.P. Bahdur, P. Chakrabarti, Discriminating the native structure from decoys using scoring functions based on the residue packing in globular proteins, BMC Bioinforma. 9 (2009) 76.
[36] J. Chen, W.E. Stites, Packing is a key selection factor in the evolution of protein hydrophobic cores, Biochemistry 40 (2001) 5280–5289.
[37] R. Wolfenden, L. Anderson, P.M. Cullis, C.C. Southgate, Affinities of amino acid side chains for solvent water, Biochemistry 20 (1981) 849–855.
[38] J. Kyte, R.F. Doolittle, A simple method for displaying the hydropathic character of a protein, J. Mol. Biol. 57 (1982) 105–132.
[39] B. Lee, F.M. Richards, The interpretation of protein structures, estimation of static accessibility, J. Mol. Biol. 55 (1971) 379–400.
[40] H. Frauenfelder, S. Sligar, P.G. Wolynes, The energy landscapes and motions of proteins, Science 254 (1991) 1598–1603.
[41] J.D. Bryngelson, J.N. Onuchic, N.D. Socci, P.G. Wolynes, Funnels, pathways, and the energy landscape of protein folding, a synthesis, Proteins 21 (1995) 167–195.
[42] Y.B. Ruiz-Blanco, Y. Marrero-Ponce, W. Paz, Y. García, J. Salgado, Global stability of protein from an empirical free energy function, J. Theor. Biol. 321 (2013) 44–53.
[43] K.A. Dill, H.S. Chan, From Levinthal to pathways to funnels, Nat. Struct. Biol 4 (1997) 10–19.
[44] K.A. Dill, S.B. Ozkan, T.R. Weikl, J.D. Chodera, V.A. Voelz, The protein folding problem, when will it be solved? Curr. Opin. Struct. Biol. 17 (2007) 342–346.
[45] P. Kollman, Free energy calculations, applications to chemical and biochemical phenomena, Chem. Rev. 93 (1993) 2395–2417.
[46] W. Jorgensen, Free energy calculations, a breakthrough for modeling organic chemistry in solution, Acc. Chem. Res. 22 (1989) 184–189.
[47] W.F.V. Gunsteren, H.J.C. Berendsen, Computer simulation of molecular dynamics, methodology, applications and perspectives in chemistry, Angew. Chem. Int. Ed. Engl. 29 (1990) 992–1023.
[48] D.L. Beveridge, F.M. DiCapua, Free energy via molecular simulation, applications to chemical and biomolecular systems, Annu. Rev. Biophys. Biomol. Struct. 18 (1989) 431–492.
[49] Sandhya Shenoy, B. Jayaram, Proteins, sequence to structure and function, Current Status, 11, CPPS, 2010. 498–514.
[50] P. Narang, K. Bhushan, S. Bose, B. Jayaram, A computational pathway for bracketing native-like structures for small alpha helical globular proteins, Phys. Chem. Chem. Phys. 7 (2005) 2364–2375.
[51] J. Pillardy, C. Czaplewski, A. Liwo, W.J. Wedemeyer, J. Lee, D.R. Ripoll, P. Arlukowicz, Development of physics-based energy functions that predict medium-resolution structures for protein of the α, β and α/β structural classes, J. Phys. Chem. B 105 (2001) 7299–7311.
[52] A. Mittal, B. Jayaram, S.R. Shenoy, T.S. Bawa, A stoichiometry driven universal spatial organization of backbones of folded proteins, are there Chargaff's rules for protein folding? J. Biomol. Struct. Dyn 28 (2010) 133–142.
[53] A. Mittal, B. Jayaram, Backbones of folded proteins reveal novel invariant amino-acid neighborhoods, J. Biomol. Struct. Dyn 28 (2011) 443–454.
[54] A. Mittal, B. Jayaram, The newest view on protein folding, stoichiometric and spatial unity in structural and functional diversity, J. Biomol. Struct. Dyn 28 (2011) 669–674.
[55] D.T. Jones, Protein secondary structure prediction based on position-specific scoring matrices, J. Mol. Biol. 292 (1999) 195–202.
[56] C. Haita, Z.S. Taner, R.L. Jernigan, A. Kloczkowski, Consensus Data Mining CDM protein secondary structure prediction server, combining GOR V and Fragment Database Mining FDM, Bioinformatics 28 (2007) 2628–2630.
[57] R. Samudrala, M. Levitt, Decoys ‘R’ Us: a database of incorrect protein conformations to improve protein structure prediction, Protein Sci. 9 (2000) 1399–1401.
[58] Y. Zhang, J. Skolnick, Scoring function for automated assessment of protein structure template quality, Proteins 57 (2004) 702–710.
[59] J. Xu, Y. Zhang, How significant is a protein structure similarity with TM-score = 0.5? Bioinformatics 26 (2010) 889–895.
[60] A. Zemla, C. Venclovas, J. Moult, K. Fidelis, Processing and analysis of CASP3 protein structure predictions, Proteins 45 (1999) 22–29.
[61] A. Zemla, C. Venclovas, J. Moult, K. Fidelis, Processing and evaluation of predictions in CASP4, Proteins 45 (2001) 13–21.
[62] A. Barbato, K. Fidelis, B. Monastyrskyy, T. Schwede, A. Tramontano, Proteins 82 (2014) 112–126.