BIOINFORMATICS Vol. 17 no. 6 2001, Pages 541–550

SCORE: predicting the core of protein models

Charlotte M. Deane 1,∗, Quentin Kaas 2 and Tom L. Blundell 1

1 Department of Biochemistry, University of Cambridge, Tennis Court Road, Cambridge CB2 1GA, UK and 2 ENSCM, 8 Rue de l’École Normale, 34000 Montpellier, France

Received on November 1, 2000; revised on February 19, 2001; accepted on February 23, 2001

ABSTRACT
Motivation: The prediction of the regions of homology models that can be ‘restrained by’ or ‘copied from’ the basis structures is a vital step in correct model generation, because these regions are the model’s most accurate part. However, there is no ideal method for the identification of their limits. In most algorithms their length depends on the number of family members and definitions of secondary structure.
Results: The algorithm SCORE steps away from the conventional definitions of the core to identify, from large numbers of basis structures, those regions that can be considered structurally related to a target sequence. The use of φ, ψ constraints to accurately pinpoint the regions that are conserved across a family, and of environmentally constrained substitution tables to extend these regions, allows SCORE to rapidly (generally in under 1 s, an order of magnitude faster than methods such as MODELLER) identify and build the core of homology models from the alignments of the target sequence to the basis structures. The SCORE algorithm was used to build 114 model cores. In only two cases was the core size less than 50% of the structure, and all the cores built had an RMSD of 3.7 Å or less to the target structure.
Availability: The algorithm is available upon request.
Contact: [email protected]

INTRODUCTION
Most functional restraints on evolutionary divergence operate at the level of tertiary structure, and therefore three-dimensional (3D) structures are more conserved in evolution than sequence (Bajaj and Blundell, 1984). Furthermore, most disruptive changes occur at discrete positions, and at loop regions on the surface of proteins, rather than within the solvent inaccessible hydrophobic core. These core regions are therefore structurally conserved between homologous proteins. The property was first quantified by Chothia and Lesk (1986), who compared the structural similarity of the cores for several pairs of homologous proteins and showed a dependence upon sequence identity.

∗To whom correspondence should be addressed.

Therefore, to model proteins, the target sequence is aligned to structures of similar sequence (the basis structures), and the region that can be considered the core of the model is extracted from this alignment. From this alignment of the target sequence to the basis structures, the backbone fragments can be divided into two types: Structurally Conserved Regions (SCRs) and Structurally Variable Regions (SVRs) (Greer, 1980). SCRs are those regions that are structurally conserved in all members of a protein family; they are not necessarily limited to secondary structure, just as structural variability is not limited to loops.

The definition of SCR necessarily varies depending on the number of basis structures and sequence identity: families with many members tend to have short SCRs. This means that several of the commonly used comparative modelling algorithms which use SCRs to delineate the core (e.g. Bates and Sternberg, 1999; Peitsch, 1996; Sutcliffe et al., 1987a,b; Yang and Honig, 1999) are likely to underestimate the number of residues that can be carried through to the target structure.

Hilbert et al. (1993) examined in detail the behaviour of the SCRs in pairs of homologous proteins. The SCRs were defined as those regions which, after structural superposition, had Cα carbons within 3.8 Å. This distance was selected as it is the mean distance of adjacent Cα atoms in a trans-polypeptide chain, i.e. the shift of a Cα carbon by 3.8 Å completely reassigns its spatial position. They counted the number of residues contained within these SCRs and found that the fraction of residues in this common core dropped with decreasing sequence identity. Pairs whose identity within the core residues was greater than 50% had 90% or more of their residues in SCRs. But even if the sequence identity of the core was below 20%, the common cores still included 65% or more of the amino acids of the protein structures.

Collar extension, extending the region copied from the basis structure beyond the designated SCR, has been shown to greatly improve protein structure prediction (Srinivasan and Blundell, 1993), because the most accurate parts of protein structure models are those that are copied from or restrained by the basis structures (Jones and Kleywegt, 1999; Martin et al., 1997). However, the procedure has never been automated, since this requires quantifying the relationship between the basis structure and target for the putative collar extension, by either local structure or sequence similarity. The latter remains problematic: even though it is commonly assumed in modelling that local sequence similarity implies local conformational similarity (e.g. Bystroff and Baker, 1998; Topham et al., 1993), there is still no satisfactory way to quantify this; even peptides as long as nine residues can adopt more than one distinct conformation (Cohen et al., 1993; Kabsch and Sander, 1984; Mezei, 1998; Sander and Schneider, 1991).

© Oxford University Press 2001

Here we describe a new method for generating the core, which establishes reliably and rapidly how far beyond the conventional SCRs it is possible to extend the fragments that are copied from the basis structures (i.e. the Basis Fragments (BFs)). This allows as much of the known structural information as possible to be included in the model, even if it lies outside a conventional SCR. The method is thoroughly benchmarked and compared to other modelling programs to give rules for its general applicability, its advantages and disadvantages, limits and pitfalls, and its dependence on the degree of sequence homology or identity, number of family members and class of protein.

METHOD
Overview of method
The method is designed to retain the maximum amount of information from the structural alignment. To achieve this it utilises the structural similarity of the basis structures to each other (in both Cartesian and φ, ψ space) and the similarity of each basis structure to the target sequence through environmentally constrained substitution tables (Deane and Blundell, 2001). These criteria are discussed below. A BF rather than an SCR is the basic unit of the cores built by SCORE. A BF is any part of any single basis structure that can confidently be said to be similar to the target structure, regardless of whether or not it lies within an SCR. The latter, by contrast, spans all aligned sequences, and is limited to the residues structurally similar in all sequences. Thus BFs can extend the information extracted from the alignment compared to the SCRs.

Definition of structural conservation
For structures that have been superposed and aligned along structural features (using COMPARER; Sali and Blundell, 1990, for instance), aligned residues are considered to occupy ‘conserved’ positions if their Cα–Cα separations (after superposition) and Dφ,ψ (difference in backbone torsion angles, equation (1)) both fall below defined cut-off values. An SCR is defined as a continuous stretch of three or more residues conserved in all aligned structures. (Note that the conventional SCR definition does not include the Dφ,ψ criterion.) For modelling purposes, all residues in the SCR must additionally have sequence similarity to the aligned position in the target sequence.

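$$D_{\phi,\psi} = \left[\tfrac{1}{2}\left(d\phi^{2} + d\psi^{2}\right)\right]^{1/2} \qquad (1)$$

where

$$d\phi = \begin{cases} |\phi_{ik} - \phi_{jk}| & \text{if } |\phi_{ik} - \phi_{jk}| < 180 \\ 360 - |\phi_{ik} - \phi_{jk}| & \text{if } |\phi_{ik} - \phi_{jk}| > 180 \end{cases}$$

and

$$d\psi = \begin{cases} |\psi_{ik} - \psi_{jk}| & \text{if } |\psi_{ik} - \psi_{jk}| < 180 \\ 360 - |\psi_{ik} - \psi_{jk}| & \text{if } |\psi_{ik} - \psi_{jk}| > 180. \end{cases}$$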
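As an illustration of equation (1), the torsion-angle difference can be computed as in the following minimal Python sketch. The function names are hypothetical, angles are assumed to be in degrees, and the modulo step (which only matters for inputs outside the −180 to 180 range) is an addition not stated in the text. The equation leaves a raw difference of exactly 180 undefined; this sketch returns 180 in that case.

```python
import math


def angular_difference(a_deg, b_deg):
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    d = abs(a_deg - b_deg) % 360.0  # modulo is an assumption for unnormalised input
    return 360.0 - d if d > 180.0 else d


def d_phi_psi(phi_i, psi_i, phi_j, psi_j):
    """D(phi,psi) of equation (1) for one aligned position in structures i and j."""
    d_phi = angular_difference(phi_i, phi_j)
    d_psi = angular_difference(psi_i, psi_j)
    return math.sqrt(0.5 * (d_phi ** 2 + d_psi ** 2))


# Two residues whose phi angles sit either side of the +/-180 boundary
# differ by 20 degrees, not 340.
example = d_phi_psi(170.0, 0.0, -170.0, 0.0)
```

The wrap-around branch is what keeps two nearly identical conformations close to the ±180 boundary from being scored as maximally different.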

In SCORE the BFs are selected using both the structural similarity across the basis structures and environmentally constrained substitution tables.
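The SCR definition given in this subsection (a continuous stretch of three or more conserved positions) can be sketched as follows. This is an illustrative Python sketch, not the SCORE implementation: the function name is hypothetical, and the inputs are assumed to be per-position worst-case values already computed over all pairs of basis structures, with `None` marking a position where some structure has a gap. The additional sequence-similarity requirement on the target is not modelled.

```python
def find_scrs(max_ca_sep, max_dphipsi, ca_cutoff, dphipsi_cutoff, min_length=3):
    """Return (start, end) index pairs, inclusive, of structurally conserved runs.

    A position counts as conserved when both values fall below their cut-offs.
    """
    conserved = [
        ca is not None and d is not None and ca < ca_cutoff and d < dphipsi_cutoff
        for ca, d in zip(max_ca_sep, max_dphipsi)
    ]
    scrs, start = [], None
    # A trailing False closes any run that reaches the end of the alignment.
    for i, ok in enumerate(conserved + [False]):
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start >= min_length:
                scrs.append((start, i - 1))
            start = None
    return scrs


# Toy alignment of ten positions; cut-offs of 3.5 (angstroms) and 150 (degrees)
# are the values the paper reports in its Results section.
ca = [1.0, 0.8, 1.2, 4.0, 0.9, 1.1, None, 0.5, 0.6, 0.7]
dpp = [20, 30, 40, 10, 15, 160, None, 5, 10, 20]
scrs = find_scrs(ca, dpp, ca_cutoff=3.5, dphipsi_cutoff=150.0)
```

In the toy data, position 3 fails the Cα test, position 5 fails the Dφ,ψ test, and position 6 is a gap, so only the two runs of three residues qualify.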

Definition of sequence similarity: environmentally constrained substitution tables
The raw environmentally constrained substitution tables were constructed by accumulating substitutions observed in all the homologous pair-wise alignments from a high-resolution database. This database was extracted from HOMSTRAD (http://www-cryst.bioc.cam.ac.uk/~homstrad) (date 18/02/00) (Mizuguchi et al., 1998) and contained 320 homologous families with 859 proteins, all solved to a resolution better than 2.5 Å. In this case the environment of the replaced and substituted residues was taken into account, and six φ, ψ areas defined six environmentally constrained tables. The derivation of the environmentally constrained amino acid substitution tables is described in Deane and Blundell (2001). These tables are used to score BFs. The value of the environmentally constrained substitution table score for a BF is given by Sc(BF) (equation (2)). It is the sum of the value for each environmentally conserved substitution between the residues at position i in the alignment of the basis structure (B) and the target sequence (T), where i runs from residue m (start of the BF) to residue n (end of the BF).

$$Sc(\mathrm{BF}) = \sum_{i=m}^{n} Sc(B(i), T(i)). \qquad (2)$$
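Equation (2) is a plain sum over alignment positions, which the following Python sketch illustrates. Everything concrete here is hypothetical: the table values are toy numbers, the environment labels `"env1"` and `"env2"` stand in for two of the six φ, ψ areas, and the assumption that each alignment position carries one environment label is made only for the purpose of the example.

```python
def fragment_score(basis, target, envs, tables, m, n):
    """Sc(BF): sum of table scores for alignment positions m..n, inclusive."""
    total = 0
    for i in range(m, n + 1):
        # Pick the substitution table for this position's environment,
        # then look up the (basis residue, target residue) pair.
        total += tables[envs[i]][(basis[i], target[i])]
    return total


# Toy substitution tables (made-up values, not the published tables).
toy_tables = {
    "env1": {("A", "S"): 1, ("L", "L"): 5},
    "env2": {("V", "I"): 3, ("G", "G"): 6},
}
basis_seq = "ALVG"
target_seq = "SLIG"
position_envs = ["env1", "env1", "env2", "env2"]

whole = fragment_score(basis_seq, target_seq, position_envs, toy_tables, 0, 3)
middle = fragment_score(basis_seq, target_seq, position_envs, toy_tables, 1, 2)
```

Because the score is additive, extending a fragment by one residue changes Sc(BF) by exactly that residue's table value, which is what makes the extension test in the next subsection cheap to evaluate.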

Basis fragment generation
Several SCRs are designated as described above and each is taken in turn. The SCR from one of the basis structures is chosen as a representative of the region. This is the SCR from the structure with the highest Sc(SCR) (equation (2)). This initial SCR is designated as a BF.

The SCRs from all the basis structures (not just that chosen as the representative BF) are then extended in all possible combinations at both the N and C termini until an insertion or deletion with respect to the target sequence occurs. All BFs (extensions of the SCRs) that have an Sc(BF) greater than or equal to the Sc(SCR) of the original SCR in that basis structure are collected. No duplications are allowed: if two (or more) BFs from two (or more) basis structures are ‘identical’ (structurally conserved) under the criteria described above, only the BF with the highest Sc(BF) is retained.
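The extension step can be pictured with the short Python sketch below, written under several assumptions: the function name is hypothetical, each alignment position has a single pre-computed score contribution, and the caller supplies the furthest positions reachable on either side before an insertion or deletion relative to the target. The SCR itself is kept in the output because it trivially satisfies the acceptance rule; duplicate removal across basis structures is not shown.

```python
def extend_scr(per_position, scr_start, scr_end, left_limit, right_limit):
    """Enumerate extensions (m, n, score) of one SCR with score >= Sc(SCR).

    per_position[i] is the Sc contribution of alignment position i.
    All indices are inclusive.
    """
    def score(m, n):
        return sum(per_position[m:n + 1])

    base = score(scr_start, scr_end)
    kept = []
    for m in range(left_limit, scr_start + 1):       # N-terminal extensions
        for n in range(scr_end, right_limit + 1):    # C-terminal extensions
            s = score(m, n)
            if s >= base:
                kept.append((m, n, s))
    return kept


# Toy scores: the SCR covers positions 3..5 (score 11); no indel in 0..7.
toy_scores = [-2, 3, 1, 4, 5, 2, -1, -3]
fragments = extend_scr(toy_scores, 3, 5, 0, 7)
best_fragment = max(fragments, key=lambda f: f[2])
```

In this toy case eight fragments pass the test, and the best one extends the SCR by two residues at the N terminus only, because the C-terminal neighbours contribute negative scores.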

If there are only one or two basis structures, an additional step is added to BF generation. In this case SCORE also reduces the SCR size by removal of residues at the N and C termini of the SCR fragment, down to any fragment containing only three residues. The same rules for acceptance still apply: these shorter fragments must have an Sc(BF) higher than or equal to that of the SCR from their basis structure.

Core generation
Thus a long list of overlapping BFs is produced. To create the core, all possible combinations of non-overlapping BFs are generated and the core with the highest Sc(core), which is the sum of Sc(BF) for all the BFs in the core, is selected. If two cores have an identical score, the core containing the greater number of residues is selected. The SCORE program uses two other parameters, MinScore and MinSize, which relate to the minimum Sc(core) and minimum length of the core to be built, respectively.
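The selection rule can be illustrated with an exhaustive search over non-overlapping combinations, as in the Python sketch below. This is only a sketch: the function name and the `(start, end, score)` tuple layout are assumptions, the MinScore and MinSize parameters are not applied, and the brute-force recursion is exponential in the worst case, so it is suitable only for small fragment lists.

```python
from typing import List, Tuple

Fragment = Tuple[int, int, float]  # (start, end, Sc(BF)); positions inclusive


def best_core(fragments: List[Fragment]) -> Tuple[float, int, List[Fragment]]:
    """Return (Sc(core), residues in core, chosen BFs) for the best combination.

    Highest total score wins; ties are broken by the larger number of residues.
    """
    frags = sorted(fragments)  # sorted by start position
    best = (0.0, 0, [])

    def search(idx, last_end, score, size, chosen):
        nonlocal best
        if (score, size) > (best[0], best[1]):
            best = (score, size, list(chosen))
        for k in range(idx, len(frags)):
            start, end, sc = frags[k]
            if start > last_end:  # no overlap with the previously chosen BF
                chosen.append(frags[k])
                search(k + 1, end, score + sc, size + (end - start + 1), chosen)
                chosen.pop()

    search(0, -1, 0.0, 0, [])
    return best


toy_bfs = [(0, 5, 13), (1, 5, 15), (4, 9, 10), (6, 9, 8), (7, 12, 9)]
core = best_core(toy_bfs)
tie = best_core([(0, 2, 5), (0, 4, 5)])
```

For the toy list, the best core combines the BFs at 1..5 and 7..12. The second call shows the tie-break: two candidates score 5, and the five-residue fragment is preferred over the three-residue one.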

Relative spatial positioning of the selected basis fragments
There are two choices for BF positioning in the algorithm. The first is to fit the BFs to a single basis structure, selected as that with the highest overall Sc(basis structure), denoted Sc(t). The fitting is performed on all residues of that basis structure (Sc(t)) that are aligned to the core. The second is to fit the BFs to a weighted averaged Cα network of the SCRs (equation (3)). In the fit, the Cα positions of the BFs are superposed on these averaged Cα positions or on the Cα positions of Sc(t) (Kearsley, 1989a,b). NSc(a) is the normalised score for each basis structure a; t indicates the basis structure with the highest Sc of the q basis structures.

$$NSc(a) = \frac{e^{Sc(a)/Sc(t)}}{\sum_{j=0}^{q} e^{Sc(j)/Sc(t)}}. \qquad (3)$$

The averaged Cα framework is developed by taking each residue Cα position i within the SCRs and multiplying the Cα co-ordinates (l representing x, y and z) from each basis structure j by NSc(j) to calculate $AvC^{i}_{l}$ (equation (4)), the Cα co-ordinate in the averaged framework for position i in the SCR and co-ordinate l (x, y or z).

$$AvC^{i}_{l} = \sum_{j=0}^{q} NSc(j) \times C\alpha^{i}_{l}(j). \qquad (4)$$
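Equations (3) and (4) amount to a softmax-style weighting of the basis structures followed by a weighted mean of their Cα co-ordinates. The Python sketch below shows this under stated assumptions: the function names and the nested-list data layout are hypothetical, the basis structures are assumed to be already superposed, and Sc(t) is assumed to be positive so that the division is well defined.

```python
import math


def normalised_scores(sc):
    """NSc(a) of equation (3) for a list of Sc values, one per basis structure."""
    sc_t = max(sc)  # Sc(t): the highest-scoring basis structure
    weights = [math.exp(s / sc_t) for s in sc]
    total = sum(weights)
    return [w / total for w in weights]


def averaged_ca(ca_coords, nsc):
    """Equation (4): weighted average C-alpha position for each SCR position.

    ca_coords[j][i] is the (x, y, z) tuple for SCR position i in structure j.
    """
    n_structures = len(ca_coords)
    n_positions = len(ca_coords[0])
    return [
        tuple(
            sum(nsc[j] * ca_coords[j][i][l] for j in range(n_structures))
            for l in range(3)
        )
        for i in range(n_positions)
    ]


# Two basis structures with equal scores contribute equally.
weights = normalised_scores([100.0, 100.0])
framework = averaged_ca([[(0.0, 0.0, 0.0)], [(2.0, 4.0, 6.0)]], weights)
uneven = normalised_scores([200.0, 100.0])
```

Because the exponent is Sc(a)/Sc(t), the weights stay within a narrow range (the ratio between the best and any other non-negative-scoring structure is at most e), so no single basis structure dominates the averaged framework.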

The SCORE results discussed in this paper relate, unless otherwise specified, to fitting of the BFs to this averaged Cα framework.

Selection of alignments for testing
A subset of alignments was selected from the HOMSTRAD database (July 2000). Structures with a resolution better than 2.5 Å solved by X-ray crystallography were selected from the database. Low-resolution structures and/or structures solved by NMR were omitted following the results of Guex and Peitsch (1997), Harrison et al. (1997) and Hilbert et al. (1993). This led to a list of 185 family alignments. Ten of these were selected for parameterisation and 14 (Table 1) for initial testing. These smaller sets were selected to represent diversity in number of family members, major secondary structural components (α, β, αβ, etc.), sequence similarity to the target and number of residues in the alignment. These alignments were used to generate models using SCORE, COMPOSER (Sutcliffe et al., 1987a) and MODELLER (Sali and Blundell, 1993).

RESULTS
Setting program parameters
Several different Cα separation and Dφ,ψ cut-offs were tested using the initial 10 alignments. These properties are designed to identify segments of structure that can be considered ‘identical’ between two or more of the basis structures. Setting the cut-offs too low will result in very small SCRs, with the possibility that entire regions of SCR will be overlooked. Very small initial SCRs can also lead to the generation of a very large number of possible BFs, which would increase the task involved in generating all possible cores, slowing down the algorithm. Conversely, if the Cα separation and Dφ,ψ cut-offs are too high, positions in the basis structures are considered ‘identical’ where they are actually showing significant variation. Thus, a balance was struck so that the program could give good predictions in a reasonable time frame. SCORE operates with a Cα–Cα separation cut-off of 3.5 Å and a Dφ,ψ cut-off of 150°. The Dφ,ψ cut-off is high given that the distribution of differences between ‘random’ torsion angles would be expected to peak around 90°. Thus, it was tested to see how often this cut-off caused a difference in the SCR definition. In the 14 test cases, five are affected by this cut-off; in each case one or two of the SCRs are shortened by one or two residues. The effect of the cut-off is not large, but it prevents residues with significantly different φ, ψ values being labelled as structurally conserved.

Table 1. The 14 families used for evaluating SCORE

| Family† | Members | Class | Target (number of residues) | % identity, maximum | % identity, minimum | % identity, average | Sc, maximum | Sc, minimum | Sc, average |
|---|---|---|---|---|---|---|---|---|---|
| Adh | 1cod_A, 2ohx_A | Multi domain | 1d1t_A (374) | 70.8 | 52.5 | 61.7 | 1987 | 1423 | 1705 |
| Hip | 1hpi, 2hip_A, 1cku_A | Small | 1isu_A (62) | 25.8 | 17.7 | 22.0 | 108 | 87 | 95 |
| Cytprime | 1a7v_A, 2ccy_A, 1bbh_A | All α | 1jaf_A (128) | 32.8 | 26.6 | 30.3 | 241 | 187 | 213 |
| Ets | 1pue_E, 2stw_A, 1fli_A | All α | 1bc8_C (93) | 52.7 | 37.5 | 47.3 | 314 | 175 | 265 |
| Ltb | 1lt5_D, 3chb_D | All β | 1tii_D (98) | 14.3 | 12.2 | 13.3 | 33 | 31 | 32 |
| Ngf | 1bet, 1bnd_A, 1bnd_B | Small disulphide | 1b98_M (116) | 56.0 | 52.3 | 54.6 | 537 | 423 | 488 |
| Cpa | 1nsa, 2ctc, 1pca | αβ | 1orb (323) | 30.3 | 29.7 | 30.1 | 448 | 421 | 437 |
| Lamb | 2mpr_A, 1af6_A | Membrane bound α | 1a0t_P (413) | 22.8 | 22.3 | 22.6 | 219 | 203 | 211 |
| Rub | 4rxn, 1rdg, 1rb9 | Small | 6rxn (52) | 64.4 | 44.4 | 60.0 | 226 | 204 | 216 |
| Ferritin | 2fha, 1aew, 1rci | All α | 1bg7 (173) | 68.0 | 58.2 | 63.9 | 746 | 616 | 700 |
| Fer2 | 1fxa_A, 4fxc, 1fxi_A, 1a70, 1pfd, 1frr_A, 1awd, 1doy, 2cjo, 1frd, 1ayf_A, 1pdx_A | Small | 1b9r_A (105) | 17.7 | 12.4 | 15.46 | 248 | −4 | 65 |
| Gpr | 1f3g, 1gpr | All β | 2gpr (154) | 42.2 | 38.0 | 40.1 | 298 | 297 | 298 |
| Fkbp | 1fkb, 1pbk | α + β | 1yat (113) | 57.0 | 41.6 | 49.3 | 393 | 291 | 342 |
| Ricin | 1mrj, 1apa, 1abr_A, 1qci_A | α + β | 1mrg (246) | 63.8 | 28.0 | 38.4 | | | |

The family name and class are those designated in the HOMSTRAD database. The family members and target are listed by their PDB codes; the chain is identified after an underscore if necessary. The minimum, maximum and average percentage identity and Sc of the family members to the target are given in the last six columns.
†Full HOMSTRAD family names: Adh = alcohol dehydrogenase, Hip = high potential iron–sulfur protein, Cytprime = cytochrome c′, Ets = ETS domain, Ltb = heat-labile enterotoxin/cholera toxin, B subunit, Ngf = nerve growth factor, Cpa = zinc carboxypeptidases, Lamb = maltoporin (LamB protein), Rub = rubredoxin, Ferritin = ferritin, Fer2 = ferredoxin (2Fe-2S), Gpr = glucose permease, Fkbp = FKBP-type peptidyl-prolyl cis-trans isomerase, Ricin = ribosome inactivating protein.

[Figure 1: two scatter plots, (a) and (b). Axes recovered from the plots: MinSize against alignment length; MinScore against maximum Sc(basis).]
Fig. 1. The correlation of MinSize with alignment length (a) and MinScore with maximum Sc(basis) (b) for the 14 initial test families (Table 1) (circles) and for the other 100 HOMSTRAD families (squares).

Next, values of MinSize and MinScore were identified (on the set given in Table 1) that allowed rapid operation of the program without artificially lengthening the core. The optimal values were calculated using iterative testing. In this testing, the threshold of MinScore was first calculated (the highest value where a core was predicted) and then the threshold value of MinSize was selected such that it did not overextend the predicted core but allowed rapid prediction. The MinScore and MinSize parameters were then tested to see if a correlation could be found with any of the available information about the alignments (Sc(basis), alignment length). Figure 1 shows the only significant correlations: MinScore with the maximum Sc(basis) value of a sequence in the alignment to the target, and MinSize with alignment length. Unfortunately, neither correlation is strong enough to be sure that the algorithm can automatically select an optimal value for the parameters. SCORE therefore automatically sets the MinSize and MinScore parameters within the bands found in Figure 1. If no answer is found, or if more than 5000 BFs are created by the algorithm (rendering the algorithm very slow), it then suggests the band within which the MinSize and MinScore parameters should be set, and in which direction the user should move either one of the parameters in order to achieve a result within a reasonable time frame. SCORE is a rapid algorithm; usually it takes less than 1 s to give a prediction.

Comparison to copying from a single basis structure
In all 14 example cases there is more than one basis structure (Table 1). The performance of SCORE was therefore compared to copying the identical core from any one of the basis structures (Table 2). The core region predicted by SCORE is compared to the structural alignment, and if any single structure is fully aligned along the whole length of the SCORE core, then the RMSD of the core copied from that basis structure to the real structure is calculated. In general, SCORE performs better than selection from a single basis structure (9 of 14 cases) through the selection of locally more similar conformations. Often a core comparable to that generated by SCORE is not available from some or all of the basis structures in the alignment (Table 2), for example in the case of the cytprime† family.
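The RMSD values used in these comparisons are calculated on Cα atoms. The following Python sketch shows the RMSD calculation itself for two equally long lists of Cα co-ordinates; it assumes the two sets have already been superposed, and the optimal superposition step (the paper cites Kearsley, 1989a,b for fitting) is not implemented here. The function name is hypothetical.

```python
import math


def ca_rmsd(model, target):
    """Root mean square deviation between matched C-alpha co-ordinates.

    model and target are equally long lists of (x, y, z) tuples, assumed to be
    superposed already.
    """
    if len(model) != len(target) or not model:
        raise ValueError("co-ordinate lists must be non-empty and equally long")
    squared = sum(
        (a - b) ** 2
        for p, q in zip(model, target)
        for a, b in zip(p, q)
    )
    return math.sqrt(squared / len(model))


# Every atom displaced by 1 angstrom along z gives an RMSD of 1.
shifted = ca_rmsd([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                  [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)])
single = ca_rmsd([(0.0, 0.0, 0.0)], [(3.0, 4.0, 0.0)])
```

Since the value is normalised by the number of atoms, RMSDs of cores of different sizes can be compared directly, which is why the tables report core size alongside RMSD.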

Comparison to COMPOSER: conventional SCRs
The accuracy of SCORE was then compared to the SCRs constructed by the version of COMPOSER (Sutcliffe et al., 1987a,b) found in SYBYL 6.6 (Tripos UK Ltd). COMPOSER generates SCRs in the ‘conventional’ manner, which is still commonly used by many modelling groups, using only Cα–Cα separation and structurally aligned positions (e.g. Bates et al., 1997; Bates and Sternberg, 1999; Burke et al., 1999; Dunbrack, 1999; Guex et al., 1999; Guex and Peitsch, 1997; Harrison et al., 1995, 1997; Peitsch, 1996). COMPOSER was run on the 14 test examples using their respective structure-based alignments extracted from HOMSTRAD. The comparison between SCORE and COMPOSER centres on two issues: the size of core that is predicted by the two programs and the RMSD of these cores to the real structure. This is because a program may predict a significantly larger core, but in doing so may extend the region that is copied beyond that which is truly ‘identical’ between the basis structure and the target, or it may give a significantly lower RMSD but predict a far smaller percentage of the structure. Table 3 shows that in general the cores are of a similar size, but they are not selecting identical residues.

† The family name given here is that found in the HOMSTRAD database and in Table 1.


Table 2. Comparing SCORE to copying from the basis structures

| Family | Members: PDB code and chain (RMSD of core, Å) | SCORE RMSD (Å) | Target (% in core) |
|---|---|---|---|
| ∗Adh | 1cod_A (NR), 2ohx_A (0.62) | 0.62 | 1d1t_A (99) |
| ∗Hip | 1hpi (1.95), 2hip_A (2.09), 1cku_A (1.71) | 1.65 | 1isu_A (76) |
| ∗Cytprime | 1a7v_A (NR), 2ccy_A (NR), 1bbh_A (NR) | 1.44 | 1jaf_A (82) |
| ∗Ets | 1pue_E (NR), 2stw_A (2.20), 1fli_A (2.15) | 2.16 | 1bc8_C (74) |
| Ltb | 1lt5_D (1.68), 3chb_D (1.75) | 1.71 | 1tii_D (57) |
| ∗Ngf | 1bet (NR), 1bnd_A (1.45), 1bnd_B (1.35) | 1.29 | 1b98_M (82) |
| Cpa | 1nsa (1.02), 2ctc (1.17), 1pca (1.11) | 1.22 | 1orb (79) |
| ∗Lamb | 2mpr_A (2.00), 1af6_A (1.99) | 1.99 | 1a0t_P (66) |
| Rub | 4rxn (0.66), 1rdg (0.73), 1rb9 (0.58) | 0.70 | 6rxn (83) |
| ∗Ferritin | 2fha (0.40), 1aew (NR), 1rci (NR) | 0.40 | 1bg7 (87) |
| ∗Fer2 | 1fxa_A (NR), 4fxc (NR), 1fxi_A (NR), 1a70 (NR), 1pfd (NR), 1frr_A (NR), 1awd (NR), 1doy (NR), 2cjo (NR), 1frd (NR), 1ayf_A (NR), 1pdx_A (3.03) | 2.78 | 1b9r_A (62) |
| Gpr | 1f3g (NR), 1gpr (1.57) | 1.77 | 2gpr (94) |
| ∗Fkbp | 1fkb (0.81), 1pbk (0.79) | 0.80 | 1yat (91) |
| Ricin | 1mrj (0.68), 1apa (NR), 1abr_A (NR), 1qci_A (NR) | 0.74 | 1mrg (86) |

The asterisk indicates the cases where the SCORE core is better than or identical to the best core built from only one of the basis structures. NR [No Result] indicates that it was not possible to build a core equivalent to the SCORE core from that basis structure. The RMSD is calculated on Cα atoms only.

Table 3. Comparison of SCORE to COMPOSER and MODELLER

| Family | SCORE: % residues predicted | SCORE: RMSD (Å) | SCORE: RMSD(f) (Å) | COMPOSER: % residues predicted | COMPOSER: RMSD (Å) | MODELLER: RMSD (Å) over SCORE core |
|---|---|---|---|---|---|---|
| Adh # | 99 | $ 0.62 | 0.62 | 96 | 0.62 | 0.62 |
| Hip | 76 | $ 1.65 | 1.73 | 87 | 3.82 | 1.84 |
| Cytprime # | 82 | $ 1.44 | 1.24 | 81 | 2.68 | 1.56 |
| Ets # | 74 | 2.16 | 2.23 | 44 | 1.15 | 1.84 |
| Ltb | 57 | 1.71 | 1.75 | 96 | 3.82 | 1.52 |
| Ngf | 82 | $ 1.29 | 1.41 | 86 | 1.42 | 1.36 |
| Cpa | 79 | $ 1.22 | 1.22 | 91 | 4.85 | 1.23 |
| Lamb | 66 | $ 1.99 | 2.00 | 92 | 14.48 | 2.49 |
| Rub # | 83 | $ 0.70 | 0.69 | – | – | 0.85 |
| Ferritin | 87 | $ 0.40 | 0.40 | 95 | 0.36 | 0.48 |
| Fer2 # | 62 | $ 2.78 | 3.03 | 21 | 7.26 | 3.03 |
| Gpr # | 94 | 1.77 | 1.63 | 90 | 1.64 | 1.49 |
| Fkbp | 91 | 0.80 | 0.81 | 94 | 0.84 | 0.70 |
| Ricin # | 86 | $ 0.74 | 0.68 | 79 | 1.19 | 0.83 |

The two RMSD columns for SCORE relate to the two different fitting procedures: RMSD is when the core is fitted to the average Cα network and RMSD(f) when the core is fitted to a single basis structure. The hash (#) indicates the cores that are longer when built by SCORE than by COMPOSER. The dollar ($) indicates where the SCORE RMSD values are lower than those achieved by MODELLER. The RMSD is calculated on Cα atoms only.

This explains the lower RMSDs in general for the SCORE cores (Table 3).

Only two of the cores built by SCORE have a slightly inferior RMSD to those constructed by COMPOSER, and both the SCORE cores are significantly longer: 69 versus 41 residues in the ets family and 145 versus 135 residues in the gpr family. The worst core built by SCORE has an RMSD of 2.7 Å. Five of the COMPOSER cores have an RMSD greater than 3.8 Å. In the case of the lamb family the COMPOSER core has an RMSD of 14.5 Å, which is far greater than would be expected when copying elements from such similar basis structures. The problem is caused by COMPOSER copying long loop regions from the basis structures which have diverged significantly from the target, whereas SCORE is able to identify the limits of the ‘identical’ regions far more precisely. In the case of the ricin family the SCORE core has both a lower RMSD and a larger number of residues. Here the opposite effect is observed: COMPOSER creates shorter SCRs, finishing where all the structures are ‘identical’ to one another, but SCORE performs ‘collar extension’ along a basis structure that is similar to the target.

[Figure 2: two scatter plots, (a) and (b). Axes recovered from the plots: RMSD against maximum Sc(basis); residues in core (%) against maximum Sc(basis).]
Fig. 2. The relationship of maximum Sc(basis) with RMSD of core (a) and size of core (b).

Comparison to MODELLER
MODELLER (Sali and Blundell, 1993) is one of the most commonly used modelling packages available. A single model structure for the target was constructed using MODELLER, with the basis structures and the alignment file from HOMSTRAD. To demonstrate that the methodology being developed here can compete with current programs, the core regions from the MODELLER predictions were compared to those from SCORE in terms of their Cα fit to the real structure. Once again the SCORE core was cut from a structure, this time from the MODELLER model. This comparison is not entirely fair, as MODELLER is designed to build the entire structure and the core region for comparison is that selected by SCORE. However, if SCORE is a reliable modelling program it should in general perform at least as well as MODELLER for building the core region of the protein. Of the 14 examples, SCORE builds a lower RMSD core in 10 cases (Table 3). However, all the values here are close (within 0.3 Å), indicating a similarity of predictive ability.

The effect of basis structure choice
Sc is not directly correlated with percentage identity, so the algorithm was further tested to see whether percentage identity or Sc(basis) is a better guide to overall structural similarity in the context of the SCORE algorithm. Percentage identity is known not to be a powerful indicator of local structural similarity. However, of the six cases where the basis structure with the highest percentage identity to the target was different from that with the highest Sc(basis) to the target (Table 3), three built lower RMSD cores with the highest Sc(basis) basis structure and three with the highest percentage identity basis structure (the core sizes being in general similar). There is, therefore, no clear evidence as to which measure is a better guide to global similarity in these highly identical homologous families.

[Figure 3: three histograms, (a), (b) and (c). Axes recovered from the plots: number of BFs against length of BF; number against gap length; number against length of N or C terminal gap.]
Fig. 3. The length variation of BFs (a), gaps (b) and N or C termini missing fragments (c) of SCORE cores.

Evaluation on a large dataset
COMPOSER is not a fully automated system and MODELLER has a run time three orders of magnitude longer than SCORE, so evaluation on a larger dataset was carried out for SCORE alone. It was run on 100 other families extracted from the HOMSTRAD database, leading to a total of 114 families. In only two cases was the core size less than 50% of the structure and all the cores built had an RMSD of 3.7 Å or less to the target structure. The MinSize and MinScore parameters set using the original 14 examples were compared to the values calculated for this larger dataset and all but a tiny proportion fall within the pre-ordained bands from the smaller test set (Figure 1).

The relationship between the average, maximum and minimum Sc(basis) or percentage identity of basis structures to target sequence and the size and RMSD of the core was examined. The clear correlations that were observed by Hilbert et al. (1993) on pairs of homologous structures were not as obvious. There is a general increase in core size with greater maximum percentage identity or Sc up to a point where the graphs level out (greater than 50% identity or an Sc of 550, respectively). RMSD of the core appears to correlate better with Sc(basis) values (maximum, minimum or average) than with percentage identity. The shape of the graph is as expected in that increasing Sc(basis) decreases the RMSD of the core (Figure 2).

The larger dataset was also used to assess whether fitting to a single basis structure or to an averaged framework should be used. Overall the performance was found to be slightly better if the core was fitted to an averaged framework, but there are many cases where fitting to a single basis structure improved results. Unfortunately there is no clear indication as to what percentage identities or Sc(basis) values correlate to a more reliable result when fitting to a single basis structure rather than an averaged framework.

Counting the basis fragments and the gaps in the cores
The target structures were divided into three types of fragments:

(1) those BFs predicted by SCORE;

(2) the N- and C-terminal fragments not predicted by SCORE;

(3) all other fragments that are not predicted by SCORE.

The number of fragments of each length of the three types is shown in Figure 3. This was done so that the typical length of a BF or gap could be observed, i.e. to establish whether SCORE was predicting only short fragments, all separated by one-residue gaps, or large continuous pieces of structure separated by small gaps. As can be seen from Figure 3, the median length of a BF is nine residues and the average length is between 28 and 29 residues.
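The three-way division described above can be sketched as a run-length split of a per-residue mask. The example below is only an illustration: the function name `split_fragments`, the boolean mask representation and the toy mask are assumptions made here, not part of SCORE.

```python
from itertools import groupby
from statistics import mean, median


def split_fragments(in_core):
    """Split a per-residue mask into the three fragment types.

    in_core: list of bool, True where the residue belongs to a predicted BF.
    Returns (bf_lengths, terminal_gap_lengths, internal_gap_lengths).
    """
    # Collapse the mask into runs of (flag, run length).
    runs = [(flag, len(list(group))) for flag, group in groupby(in_core)]
    bfs, terminal, internal = [], [], []
    for i, (flag, length) in enumerate(runs):
        if flag:
            bfs.append(length)          # type (1): predicted BF
        elif i == 0 or i == len(runs) - 1:
            terminal.append(length)     # type (2): unpredicted N or C terminus
        else:
            internal.append(length)     # type (3): gap between two BFs
    return bfs, terminal, internal


# Toy mask (made up): 2 missing, BF of 9, gap of 1, BF of 30,
# gap of 3, BF of 5, 1 missing at the C terminus.
mask = ([False] * 2 + [True] * 9 + [False] * 1 + [True] * 30
        + [False] * 3 + [True] * 5 + [False] * 1)
bfs, terminal, internal = split_fragments(mask)
bf_median = median(bfs)
bf_mean = mean(bfs)
```

A median well below the mean, as reported for the BFs in the text, indicates a skewed length distribution in which a few very long fragments raise the average.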

The gaps between the BFs are of interest, for if SCORE is to form the basis of a comparative modelling program these gaps will have to be predicted using SVR modelling software such as that of Deane and Blundell (2000). These methods work best for short gaps of eight residues or less. Thus it is desirable that the majority of gaps fall into this category. As Figure 3 shows, the majority of gaps are in fact only one residue in length, and over 85% are eight residues or less. In the case of the N- and C-terminal gaps, nearly 70% are less than three residues.
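The proportion of gaps that fall within the range suited to SVR modelling is a simple cumulative fraction. The sketch below shows the calculation on a made-up list of gap lengths. The helper name `fraction_at_most` and the values are hypothetical, and the output does not reproduce the figures quoted above.

```python
def fraction_at_most(lengths, limit):
    """Return the fraction of gap lengths that are <= limit."""
    if not lengths:
        raise ValueError("no gaps to summarise")
    return sum(1 for n in lengths if n <= limit) / len(lengths)


# Hypothetical gap lengths, in residues (illustration only).
gap_lengths = [1, 1, 1, 2, 3, 5, 8, 9, 12, 1]

short_fraction = fraction_at_most(gap_lengths, 8)    # gaps of eight or less
single_fraction = fraction_at_most(gap_lengths, 1)   # one-residue gaps
```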

CONCLUSION
A new approach to prediction of the core of protein models is suggested that steps away from the concept of SCRs to the more fluid definition of BFs. The algorithm is fully automatic, rapid in operation and compares well with other comparative modelling programs such as COMPOSER and MODELLER. In the case of protein families, the program appears to fulfil both criteria: prediction of a reasonable percentage of the structure, coupled with low RMSD values.

REFERENCES

Bajaj,M. and Blundell,T. (1984) Evolution and the tertiary structure of proteins. Ann. Rev. Biophys. Bioeng., 13, 453–492.

Bates,P.A. and Sternberg,M.J. (1999) Model building by comparison at CASP3: using expert knowledge and computer automation. Proteins, 37, 47–54.

Bates,P.A., Jackson,R.M. and Sternberg,M.J. (1997) Model building by comparison: a combination of expert knowledge and computer automation. Proteins, (Suppl. 1), 59–67.

Burke,D.F., Deane,C.M., Nagarajaram,H.A., Campillo,N., Martin-Martinez,M., Mendes,J., Molina,F., Perry,J., Reddy,B.V., Soares,C.M., Steward,R.E., Williams,M., Carrondo,M.A., Blundell,T.L. and Mizuguchi,K. (1999) An iterative structure-assisted approach to sequence alignment and comparative modeling. Proteins, (Suppl. 3), 55–60.

Bystroff,C. and Baker,D. (1998) Prediction of local structure in proteins using a library of sequence-structure motifs. J. Mol. Biol., 281, 565–577.

Chothia,C. and Lesk,A.M. (1986) The relation between the divergence of sequence and structure in proteins. EMBO J., 5, 823–826.

Claessens,M., Van Cutsem,E., Lasters,I. and Wodak,S. (1989) Modelling the polypeptide backbone with ‘spare parts’ from known protein structures. Protein Eng., 2, 335–345.

Cohen,B.I., Presnell,S.R. and Cohen,F.E. (1993) Origins of structural diversity within sequentially identical hexapeptides. Protein Sci., 2, 2134–2145.

Deane,C.M. and Blundell,T.L. (2000) A novel exhaustive search algorithm for predicting the conformation of polypeptide segments in proteins. Proteins, 40, 135–144.

Deane,C.M. and Blundell,T.L. (2001) CODA: A combined algorithm for predicting the structurally variable regions of protein models. Protein Sci., 10, 599–612.

Dunbrack,Jr,R.L. (1999) Comparative modeling of CASP3 targets using PSI-BLAST and SCWRL. Proteins, (Suppl. 3), 81–87.

Greer,J. (1980) Model for haptoglobin heavy chain based upon structural homology. Proc. Natl Acad. Sci. USA, 77, 3393–3397.

Guex,N. and Peitsch,M.C. (1997) SWISS-MODEL and the Swiss-PdbViewer: an environment for comparative protein modeling. Electrophoresis, 18, 2714–2723.

Guex,N., Diemand,A. and Peitsch,M.C. (1999) Protein modelling for all. Trends Biochem. Sci., 24, 364–367.

Harrison,R.W., Chatterjee,D. and Weber,I.T. (1995) Analysis of six protein structures predicted by comparative modeling techniques. Proteins, 23, 463–471.

Harrison,R.W., Reed,C.C. and Weber,I.T. (1997) Analysis of comparative modeling predictions for CASP2 targets 1, 3, 9, and 17. Proteins, (Suppl. 1), 68–73.

Hilbert,M., Bohm,G. and Jaenicke,R. (1993) Structural relationships of homologous proteins as a fundamental principle in homology modeling. Proteins, 17, 138–151.

Jones,T.A. and Thirup,S. (1986) Using known substructures in protein model building and crystallography. EMBO J., 5, 819–822.


Jones,T.A. and Kleywegt,G.J. (1999) CASP3 comparative modeling evaluation. Protein Struct. Funct. Genet., (S3), 30–46.

Kabsch,W. and Sander,C. (1984) On the use of sequence homologies to predict protein structure: identical pentapeptides can have completely different conformations. Proc. Natl Acad. Sci. USA, 81, 1075–1078.

Kearsley,S.K. (1989a) On the orthogonal transformation used for structural comparisons. Acta Cryst., A 45, 208–210.

Kearsley,S.K. (1989b) Structural comparisons using restrained inhomogeneous transformations. Acta Cryst., A 45, 628–635.

Levitt,M. (1992) Accurate modeling of protein conformation by automatic segment matching. J. Mol. Biol., 226, 507–533.

Martin,A.C.R., MacArthur,M.W. and Thornton,J.M. (1997) Assessment of comparative modeling in CASP2. Protein Struct. Funct. Genet., (Suppl. 1), 14–28.

Mezei,M. (1998) Chameleon sequences in the PDB. Protein Eng., 11, 411–414.

Mizuguchi,K., Deane,C.M., Blundell,T.L. and Overington,J.P. (1998) HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci., 7, 2469–2471.

Peitsch,M.C. (1996) ProMod and Swiss-Model: internet-based tools for automated comparative protein modelling. Biochem. Soc. Trans., 24, 274–279.

Sali,A. and Blundell,T.L. (1990) Definition of general topological equivalence in protein structures. A procedure involving comparison of properties and relationships through simulated annealing and dynamic programming. J. Mol. Biol., 212, 403–428.

Sali,A. and Blundell,T.L. (1993) Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol., 234, 779–815.

Sander,C. and Schneider,R. (1991) Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins, 9, 56–68.

Srinivasan,N. and Blundell,T.L. (1993) An evaluation of the performance of an automated procedure for comparative modelling of protein tertiary structure. Protein Eng., 6, 501–512.

Sutcliffe,M.J., Haneef,I., Carney,D. and Blundell,T.L. (1987a) Knowledge based modelling of homologous proteins, Part I: Three-dimensional frameworks derived from the simultaneous superposition of multiple structures. Protein Eng., 1, 377–384.

Sutcliffe,M.J., Hayes,F.R.F. and Blundell,T.L. (1987b) Knowledge based modelling of homologous proteins, Part II: rules for the conformations of substituted sidechains. Protein Eng., 1, 385–392.

Topham,C.M., McLeod,A., Eisenmenger,F., Overington,J.P., Johnson,M.S. and Blundell,T.L. (1993) Fragment ranking in modelling of protein structure. Conformationally constrained environmental amino acid substitution tables. J. Mol. Biol., 229, 194–220.

Unger,R., Harel,D., Wherland,S. and Sussman,J.L. (1989) A 3D building blocks approach to analyzing and predicting structure of proteins. Proteins, 5, 355–373.

Yang,A.S. and Honig,B. (1999) Sequence to structure alignment in comparative modeling using PrISM. Proteins, 37, 66–72.
