

Algorithmica (1999) 25: 279–294. © 1999 Springer-Verlag New York Inc.

On the Intractability of Protein Folding with a Finite Alphabet of Amino Acids¹

J. Atkins² and W. E. Hart³

Abstract. We describe a proof of NP-hardness for a lattice protein folding model whose instances contain protein sequences defined with a fixed, finite alphabet that contains 12 amino acid types. This lattice model represents a protein’s conformation as a self-avoiding path that is embedded on the three-dimensional cubic lattice. A contact potential is used to determine the energy of a sequence in a given conformation; a pair of amino acids contributes to the conformational energy only if they are adjacent on the lattice. This result overcomes a significant weakness of previous intractability results, which do not examine protein folding models that have a finite alphabet of amino acids together with physically interesting conformations.

Key Words. Protein folding, Intractability, Protein structure prediction.

1. Introduction. The problem of predicting a protein’s three-dimensional structure is a notoriously difficult problem in biophysics. Although a variety of methods have been proposed to perform structure prediction, this problem has been difficult to solve in a robust manner. For example, Ngo et al. [15] note that “It is still not known whether there exists an efficient algorithm for predicting the structure of a protein from its amino acid sequence alone. Decades of research have failed to produce such an algorithm—yet Nature seems to solve the problem. . . .”

This general observation has prompted recent attempts to characterize the computational complexity of protein structure prediction in simple models that represent a protein’s conformation as a path in a lattice. These models treat protein structure prediction as an energy minimization problem whose goal is to find a conformation of a given sequence (i.e., a self-avoiding path in the lattice) whose energy is optimal. Computational analyses of simple lattice models for protein folding include both intractability results [2], [4], [7], [12]–[14], [16], [17] and performance-guaranteed approximation algorithms [1], [9]–[11]. A noteworthy difference between the approximability and intractability results is that approximation algorithms have been developed for models that are closely related to well-studied lattice models, while the intractability results have been developed for lattice models that have much weaker justifications. Specifically, the models used for the intractability results employ generalizations of the physical aspects of protein folding that make it possible for many instances of the problem to be physically uninteresting.

¹ This work was supported by the Office of Mathematical Informational and Computational Sciences, U.S. Department of Energy, Office of Energy Research. This work was performed, in part, while the first author visited Sandia National Laboratories. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under Contract DE-AC04-94AL85000.

² 39th Floor, Tower 45, 120 West Forty-Fifth Street, New York, NY 10036, USA. [email protected].

³ Sandia National Laboratories, P.O. Box 5800, Albuquerque, NM 87185-1110, USA. [email protected]; http://www.cs.sandia.gov/~wehart/.

Received June 1, 1997; revised March 13, 1998. Communicated by D. Gusfield and M.-Y. Kao.


A significant weakness of almost all of the models for which intractability results have been developed is that the alphabet of amino acid types used to construct protein sequences is unbounded in size.⁴ If we define an amino acid type by the pattern of interactions it exhibits with all other amino acids, then these models have an infinite number of amino acid types. The consequence of this observation is that the protein sequences included in these models do not have the property of correlation between amino acids, which is an important property of accurate physical models [3].

In this paper we provide the first demonstration that protein structure prediction may be computationally intractable even when a finite set of amino acid types is used to construct the problem instances. Our model represents proteins as linear chains that are embedded on the cubic lattice. Protein sequences are defined from a set of 12 amino acid types. The conformational energy of such a linear chain is computed using a contact potential, which allows a pair of amino acids to contribute to the conformational energy only if they are adjacent on the lattice. In subsequent work, several authors have provided NP-hardness results for similar models. Nayak et al. [13] consider a string folding problem with a very large alphabet of amino acids, using a technique that can be used to “convert” a hardness proof for a model with an unbounded number of amino acids to a model with a bounded number of amino acids. Crescenzi et al. [4] and Berger and Leighton [2] prove that protein folding in the HP model is NP-hard for the two- and three-dimensional cases, respectively. Since the HP model has an alphabet of two amino acid types, it is perhaps the simplest lattice model.

Since simple lattice models have been well studied [5], these types of intractability results may provide new insight into the computational aspects of protein structure prediction. An interesting property of the contact energy matrix used in our reduction is that it does not contain any fixed constants; there are an infinite number of energy matrices that can be used to provide our result. Further, there are no fixed gaps enforced between elements of the energy matrix; there exist energy matrices for which the difference between any pair of contact energies is arbitrarily small.

2. Background

2.1. Energy Minimization. A protein is a chain of amino acid residues that folds into a specific native three-dimensional structure under certain physiological conditions. Proteins unfold when folding conditions provided by the environment are disrupted, and many proteins spontaneously refold to their native structures when physiological conditions are restored. This observation is the basis for the belief that prediction of the native structure of a protein can be done computationally from the information contained in the amino acid sequence alone.

Exhaustive search of a protein’s conformational space is clearly not a feasible algorithmic strategy for protein structure prediction. The number of possible conformations is exponential in the length of the protein sequence, and even powerful computational hardware is not capable of enumerating this space for even moderately large proteins. This observation led Levinthal [15] to raise a question about the paradoxical discrepancy between the enormous number of possible conformations and the fact that a large fraction of proteins of all sizes fold within minutes. While these observations appear contradictory, they simply point to the lack of knowledge of a possible algorithmic structure that could guide an efficient search algorithm (see [15] for further discussion of this issue). Consequently, computational analyses of the protein folding process can provide insight into the inherent algorithmic difficulty of folding proteins.

⁴ Fraenkel’s model [7] uses a finite number of amino acid types, but it allows the protein chain to be embedded in a square lattice without forcing subsequent amino acids to lie in close proximity on the lattice, thereby leading to biologically implausible optimal conformations for certain amino acid sequences.

Following the thermodynamic hypothesis [6], computational models of protein folding are typically formulated to find the global minimum of a potential energy function. Many simple protein folding models use lattices to describe the space of conformations that proteins can assume. Lattices are infinite periodic graphs that are generated by translations of a “unit graph” that fill a two- or three-dimensional space.⁵ Lattices provide a natural discretization of the space of protein conformations since the dihedral angles along the protein’s backbone are indeed constrained to specific domains. The conformation of a protein is often viewed as a self-avoiding path in the lattice in which the vertices are labeled by the amino acids [5]. An energy value is associated with every conformation, taking into account neighborhood relationships of the amino acids on the lattice. For example, with a contact potential an amino acid contributes to the energy in a conformation if it is adjacent to amino acids (on the lattice) for which it has a nonzero contact energy (except that adjacencies formed by the path of amino acids are not included).

2.2. Intractability Results. In prior work, protein structure prediction has been proven NP-hard for a variety of protein lattice models [7], [12], [14], [16], [17], which means that they are at least as hard to solve as NP-complete problems. Since it is widely believed that no polynomial time algorithm exists for NP-hard problems [8], these results suggest that protein structure prediction is intractable.

The protein lattice models for which protein structure prediction has been proven NP-hard vary along a variety of dimensions, including the lattice used to represent conformations, the alphabet of amino acids, and the energy formula. Fraenkel [7] examines a physical model in which each amino acid is represented as a bead in a graph that is embedded in the cubic lattice. The graph represents the contacts in the protein that must be held at a fixed distance, presumably including the edges along the backbone of the protein. Fraenkel’s model uses an alphabet of three amino acid types that represent the charges associated with the amino acids: −1, 0, 1. The model uses a distance-dependent energy formula that computes the product of the charges divided by distance. The energy is the sum over all edges in the contact graph that is provided in the problem specification.

Ngo and Marks [14] present a hardness result for a molecular structure prediction problem that encompasses protein folding. Their model considers a chain molecule of atoms whose energy is based upon a typical form of the empirical potential-energy function for organic molecules. Conformations of this chain molecule are embedded in a diamond lattice.

⁵ See [12] for a graph-theoretic definition of lattices.


Patterson and Przytycka [16] examine a physical model in which each amino acid is represented as a bead along a chain. A contact energy formula is used, which has contact energies of one for contacts between identical residues and zero otherwise. The lattice used by this model is the cubic lattice.

Unger and Moult [17] examine a protein folding model that applies to a lattice that adds edges between planar diagonals on the cubic lattice [12]. Their model treats amino acids as beads along a chain. The energy formula is a simple energy function that has the same form as empirically derived force fields [17]. Hart and Istrail [12] generalize Unger and Moult’s analysis to a model that uses an arbitrary lattice. Hart and Istrail also describe an NP-hardness result for the cubic lattice that is robust to a wide range of Lennard–Jones-like energy potentials.

A property common to all of these models except Fraenkel’s is that the set of instances defined by the computational protein folding problem allows an infinite number of amino acid types. On the other hand, instances of Fraenkel’s model allow conformations for which the sequence of amino acids is not embedded as a path on the lattice; a pair of consecutive amino acids in the protein sequence could lie far away on the lattice. The model that we consider overcomes both of these weaknesses.

3. Preliminaries

3.1. Definitions. Let $A = \{a_1, \ldots, a_r\}$ be a set of amino acid types. A protein sequence is $s \in A^n$. A conformation of $s$ is an embedding of $s$ into the three-dimensional cubic lattice such that subsequent amino acids are adjacent on the lattice. Let $F_s = \{f_1, \ldots, f_n\}$ represent a conformation of $s$, where $f_i \in \mathbb{Z}^3$. A conformation $F_s$ is an embedding of $s$ in the cubic lattice if for all $i = 1, \ldots, n-1$ there exists an edge between the vertices $f_i$ and $f_{i+1}$ (i.e., they are neighbors on the lattice).

Let $C$ be a conformation of a sequence $s$, and let $M$ be the symmetric contact energy matrix for $s$. $E_M(s, C)$ is the energy of $s$ in the conformation $C$ with respect to $M$.

Two amino acids form a contact if they are adjacent on the lattice and are not subsequent amino acids in the protein. The contacts of an amino acid are those contacts that it participates in. The number of potential contacts of an amino acid $s_i$ is $S(s, i)$. The number of potential contacts of all amino acids of type $a_k$ in $s$ is $S(s, a_k) = \sum_{s_i = a_k} S(s, i)$. In our lattice model, this is four times the number of amino acids of type $a_k$ plus one for every end of the sequence that is of type $a_k$. For a subset $P \subseteq A$ of the amino acids, we define

$$S(s, P) = \sum_{a_k \in P} S(s, a_k).$$
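The count of potential contacts can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the function names `potential_contacts` and `potential_contacts_of_types` are hypothetical, residue types are represented as strings, indices are 0-based, and the sequence is assumed to have at least two residues. Each cubic lattice vertex has six neighbors, and every chain neighbor of a residue uses up one of them.

```python
def potential_contacts(s, i):
    """S(s, i): lattice neighbors of residue i not occupied by chain neighbors.

    Assumes the cubic lattice (6 neighbors per vertex) and len(s) >= 2;
    i is a 0-based index into the sequence s.
    """
    chain_neighbors = (i > 0) + (i < len(s) - 1)
    return 6 - chain_neighbors


def potential_contacts_of_types(s, P):
    """S(s, P): total potential contacts over residues whose type lies in P."""
    return sum(potential_contacts(s, i) for i, t in enumerate(s) if t in P)


seq = ["b", "cx", "b", "b", "a"]
# Three b's, one of them at an end of the chain: 4 * 3 + 1 = 13.
print(potential_contacts_of_types(seq, {"b"}))
```

The printed value agrees with the rule stated above: four per residue of the given type, plus one for each chain end of that type.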

DEFINITION 1. Suppose $P, Q \subseteq A$. We say that $s$ is $(P, Q)$-covered in $C$ when each amino acid $s_i \in P$ forms $S(s, i)$ contacts with amino acids in $Q$.

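DEFINITION 2. Let $P, Q \subseteq A$. Define $\mathrm{cross}(P, Q) = [\delta_{ij}]$, where

$$\delta_{ij} = \begin{cases} 2 & \text{if } a_i, a_j \in P \cap Q, \\ 1 & \text{if } a_i \in P,\ a_j \in Q - P \ \text{or}\ a_j \in P,\ a_i \in Q - P, \\ 0 & \text{otherwise.} \end{cases}$$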


The $\mathrm{cross}(P, Q)$ matrix accounts for the additional energy that can be achieved for a sequence that is $(P, Q)$-covered in $C$ when the contact energies between amino acids in $P$ and $Q$ are increased by one.
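To make Definition 2 concrete, the following Python sketch builds the cross matrix as a dictionary keyed by pairs of amino acid types. The function name `cross`, the dictionary representation, and the string labels for the types are assumptions made for illustration only.

```python
def cross(P, Q, alphabet):
    """Return cross(P, Q) as a dict mapping (a_i, a_j) to 0, 1, or 2."""
    P, Q = set(P), set(Q)
    both = P & Q        # types in P ∩ Q
    q_only = Q - P      # types in Q − P
    delta = {}
    for ai in alphabet:
        for aj in alphabet:
            if ai in both and aj in both:
                delta[ai, aj] = 2
            elif (ai in P and aj in q_only) or (aj in P and ai in q_only):
                delta[ai, aj] = 1
            else:
                delta[ai, aj] = 0
    return delta


alphabet = ["a", "b", "cx", "c0", "c1"]
delta = cross({"b"}, {"b", "cx", "c0", "c1"}, alphabet)
print(delta["b", "b"], delta["b", "cx"], delta["cx", "b"], delta["cx", "c0"])
```

The entry for a pair inside $P \cap Q$ is 2 because $S(s, P)$ counts such a contact once from each side, which is the double counting that the proof of Theorem 1 relies on.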

DEFINITION 3. Let $\mathcal{C}_M(s)$ be the set of conformations that have optimal energy for $s$ with respect to $M$. That is,

$$\mathcal{C}_M(s) = \{C \mid E_M(s, C) \le E_M(s, C'),\ \forall C'\}.$$

3.2. A Basic Construction. The following theorem allows us to construct successive strings that will fold into increasingly specific minimum energy structures. The basic idea behind this result is that if a class of amino acids is completely covered by another class of amino acids in some optimal conformation, and if we simply increase the contact energy between them, then the same conformation is an optimal conformation for the sequence with respect to the augmented energy matrix.

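THEOREM 1. Suppose $P \subset Q \subseteq A$ and $s \in A^n$. Consider $\bar{M}$ such that $\bar{M} = M + \eta\,\mathrm{cross}(P, Q)$, where $M \le 0$ and $\eta < 0$. If $\exists C \in \mathcal{C}_M(s)$ such that $s$ is $(P, Q)$-covered in $C$, then

• $\mathcal{C}_{\bar{M}}(s) \subseteq \mathcal{C}_M(s)$,
• $\forall C' \in \mathcal{C}_{\bar{M}}(s)$, $s$ is $(P, Q)$-covered in $C'$,
• $\forall C' \in \mathcal{C}_{\bar{M}}(s)$, $E_{\bar{M}}(s, C') = E_M(s, C) + \eta S(s, P)$.

Otherwise, $\forall C \in \mathcal{C}_M(s)$ and $\forall C'$, $E_{\bar{M}}(s, C') > E_M(s, C) + \eta S(s, P)$.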

PROOF. From the definition of $\mathrm{cross}(P, Q)$ we know that $\bar{M}_{s_i, s_j} < 0$ for any $s_i \in P$ and $s_j \in Q$. If $s_i, s_j \in P$, then the energy added by $\mathrm{cross}(P, Q)$ is $2\eta$, and if $s_i \in P$ and $s_j \in Q - P$, then the energy added by $\mathrm{cross}(P, Q)$ is $\eta$. This difference accounts for the fact that $S(s, P)$ double counts contacts between amino acids in $P$.

Suppose $\exists C \in \mathcal{C}_M(s)$ such that $s$ is $(P, Q)$-covered in $C$ and consider a conformation $C' \in \mathcal{C}_{\bar{M}}(s)$. The energy of $s$ in $C'$ can be lower than the energy of $s$ in $C$ only by adding energies from interactions between the $P$ amino acids and $Q$ amino acids (including $P$–$P$ and $Q$–$Q$ interactions). Each amino acid $s_i \in P$ can contribute up to an additional $\eta S(s, i)$ to the total energy of $s$ in $C'$. Thus for any conformation $C' \in \mathcal{C}_{\bar{M}}(s)$ we have $E_{\bar{M}}(s, C') \ge E_M(s, C') + \eta S(s, P)$, with equality holding if and only if $s$ is $(P, Q)$-covered in $C'$. Since $E_M(s, C') \ge E_M(s, C)$, $\forall C'$, with equality holding if and only if $C' \in \mathcal{C}_M(s)$, we have $E_{\bar{M}}(s, C') \ge E_M(s, C) + \eta S(s, P)$, with equality holding if and only if $C' \in \mathcal{C}_M(s)$ and $s$ is $(P, Q)$-covered in $C'$. Since $C$ satisfies these conditions, we know that it is possible to achieve this equality. It follows that $\mathcal{C}_{\bar{M}}(s) \subseteq \mathcal{C}_M(s)$ and, $\forall C' \in \mathcal{C}_{\bar{M}}(s)$, $s$ is $(P, Q)$-covered in $C'$.

From the above argument, the case where $\nexists C \in \mathcal{C}_M(s)$ such that $s$ is $(P, Q)$-covered in $C$ also follows.


4. A Hardness Result for a Finite Alphabet. The problem that we consider is

$(A, M)$-PF
Instance: A sequence $S = (s_1, \ldots, s_n)$, $s_i \in A$; $B \in \mathbb{Q}$.
Question: Is there a conformation $F_s$ embedded as a linear chain in the cubic lattice such that

$$\sum_{i=2}^{n} \sum_{j=1}^{i-1} M_{s_i, s_j}\, g(|f_i - f_j|) \le B?$$

The function $g(x)$ is one if $x$ is one and zero otherwise, so $(A, M)$-PF uses a contact potential.
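For very short sequences the decision problem can be checked by exhaustive enumeration, which the Python sketch below illustrates. It is a brute-force illustration, not part of the reduction, and it takes exponential time. The names `energy`, `conformations`, and `pf_decision` are hypothetical, and $M$ is assumed to be stored as a dictionary that may list each unordered pair only once. Following the definition of a contact in Section 3.1, the sketch skips chain neighbors; the terms with $j = i - 1$ in the displayed sum always have distance one, so they would add the same constant to every conformation.

```python
STEPS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def pair_energy(M, x, y):
    """Look up a symmetric contact energy stored under either key order."""
    return M.get((x, y), M.get((y, x), 0))


def energy(seq, conf, M):
    """Sum M over residue pairs that are lattice neighbors but not chain neighbors."""
    total = 0
    for i in range(len(seq)):
        for j in range(i - 1):  # j < i - 1 skips the chain neighbor
            dist = sum(abs(p - q) for p, q in zip(conf[i], conf[j]))
            if dist == 1:       # on Z^3, Euclidean distance 1 equals Manhattan distance 1
                total += pair_energy(M, seq[i], seq[j])
    return total


def conformations(n):
    """Yield every self-avoiding walk with n vertices that starts at the origin."""
    def extend(path, used):
        if len(path) == n:
            yield list(path)
            return
        x, y, z = path[-1]
        for dx, dy, dz in STEPS:
            nxt = (x + dx, y + dy, z + dz)
            if nxt not in used:
                path.append(nxt)
                used.add(nxt)
                yield from extend(path, used)
                path.pop()
                used.remove(nxt)
    start = (0, 0, 0)
    yield from extend([start], {start})


def pf_decision(seq, M, B):
    """Answer the (A, M)-PF question by exhaustive search (tiny inputs only)."""
    return any(energy(seq, c, M) <= B for c in conformations(len(seq)))


M = {("b", "b"): -1}
print(pf_decision(["b"] * 4, M, -1))  # a square fold gives one b-b contact
```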

We prove that $(A, M)$-PF is NP-complete for a specific alphabet $A$ and energy matrix $M$. This result demonstrates that protein structure prediction can be difficult even with a protein folding model that has a finite alphabet and fixed energy matrix.

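THEOREM 2. Let $A = \{a, b, c_x, c_0, c_1, d, e, f_x, f_d, g_x, g_0, g_1\}$ and let

$$
M = \begin{array}{c|cccccccccccc}
 & a & b & c_x & c_0 & c_1 & d & e & f_x & f_d & g_x & g_0 & g_1 \\ \hline
a & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
b & 0 & \alpha_1 + 2\eta_1 & \eta_1 & \eta_1 + \eta_4 & \eta_1 + \eta_5 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
c_x & 0 & \eta_1 & 0 & \eta_4 & \eta_5 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
c_0 & 0 & \eta_1 + \eta_4 & \eta_4 & 2\eta_4 & \eta_4 + \eta_5 & 0 & 0 & 0 & 0 & \eta_4 & \eta_4 & 0 \\
c_1 & 0 & \eta_1 + \eta_5 & \eta_5 & \eta_4 + \eta_5 & 2\eta_5 & 0 & 0 & 0 & 0 & \eta_5 & 0 & \eta_5 \\
d & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \eta_3 & 0 & 0 & 0 \\
e & 0 & 0 & 0 & 0 & 0 & 0 & \alpha_2 + 2\eta_2 & \eta_2 & \eta_2 + \eta_3 & 0 & 0 & 0 \\
f_x & 0 & 0 & 0 & 0 & 0 & 0 & \eta_2 & 0 & \eta_3 & 0 & 0 & 0 \\
f_d & 0 & 0 & 0 & 0 & 0 & \eta_3 & \eta_2 + \eta_3 & \eta_3 & 2\eta_3 & 0 & 0 & 0 \\
g_x & 0 & 0 & 0 & \eta_4 & \eta_5 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
g_0 & 0 & 0 & 0 & \eta_4 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
g_1 & 0 & 0 & 0 & 0 & \eta_5 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{array}
$$

for any $\alpha_i, \eta_j \in \mathbb{Q}^-$, $i = 1, 2$ and $j = 1, \ldots, 6$. Then $(A, M)$-PF is NP-complete.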

We prove Theorem 2 by showing that HAMILTONIAN PATH reduces to $(A, M)$-PF. From our reduction, it is relatively easy to demonstrate that if there exists a Hamiltonian path, then the protein sequence generated in our reduction can be configured into a low energy conformation. We describe the reduction in a series of subsequences of the final protein sequence. With each subsequence, we give an energy matrix, $M_i \in \mathbb{Q}^{12 \times 12}$, and discuss the various ways that this section of the protein sequence can fold with a minimum energy, with respect to the current matrix. In doing so, we show that if the minimal energy of the protein sequence is sufficiently low, then there exists a Hamiltonian path.

Before we begin, we define a few variables. Let $\hat{n} = 2\lceil \log_2 n \rceil$, $N = n(\hat{n} + 2) + 3$, and $m = N - 1$. We also define $\mathrm{bit}^j_i$ to be the $i$th bit of a binary representation of the integer $j$, for $j = 1, \ldots, n$ and $i = 1, \ldots, \hat{n}$.

We begin with the string

• $S_1 = b^{N^3}$

and with the matrix $M_1$, with $(M_1)_{b,b} = \alpha_1 \in \mathbb{Q}^-$ and all other values zero. The following proposition describes the minimal energy structure of this sequence.


PROPOSITION 1. The lowest energy structure for the sequence $S_1$ is an $N \times N \times N$ cube.

To simplify the presentation of our main result, the proof of Proposition 1 is provided in the Appendix. While there are many ways in which this sequence can fold into a cube, the cubic conformational structure is an invariant property of all lowest energy conformations.

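We now define $S_2$ and $S_3$ and consider the string $S_1 S_2 S_3$. Let

• $S_2 = (c_x^{N^2} a)^4 c_x^{N^2}$,
• $S_3 = s_2 s_1 a^m s^1 a^m s^2 a^m \cdots a^m s^n a^m s_1 s_1 s_2$, where

$$
\begin{aligned}
s_1 &= a c_x a a\, c_x^{N-2}\, a a c_x a, \\
s_2 &= (a c_x a)^N, \\
s_d &= a c_x a a\, c_x c_x d d\, (c_x^3 d d)^{n-1}\, c_x^{N-3n-1}\, a a c_x a, \\
d^j_i &= a c_x a a\, c_x c_x d d\, (c_x c_{\mathrm{bit}^j_i} c_x d d)^{n-1}\, c_x^{N-3n-1}\, a a c_x a, \\
s_u &= a c_x a a\, c_x^{N-3n-1}\, (d d c_x^3)^{n-1}\, d d c_x c_x\, a a c_x a, \\
u^j_i &= a c_x a a\, c_x^{N-3n-1}\, (d d c_x c_{\mathrm{bit}^j_i} c_x)^{n-1}\, d d c_x c_x\, a a c_x a, \\
s^j &= s_d\, u^j_1 d^j_2 \cdots u^j_{\hat{n}-1} d^j_{\hat{n}}\, u^j_{\hat{n}} d^j_{\hat{n}-1} \cdots u^j_2 d^j_1\, s_u,
\end{aligned}
$$

and the matrix $M_2 = M_1 + \eta_1\,\mathrm{cross}(\{b\}, \{b, c_x, c_0, c_1\})$ with $\eta_1 \in \mathbb{Q}^-$. Among the many possible conformations of $S_1 S_2 S_3$ that include an $N \times N \times N$ cube of the $b$'s, some are $(\{b\}, \{b, c_x, c_0, c_1\})$-covered. From Theorem 1, we know that $\mathcal{C}_{M_2}(S_1 S_2 S_3)$ will consist of these conformations.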
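A quick arithmetic check on the column subsequences can be sketched in Python. Under the reading of the definitions used here, each of $s_d$, $d^j_i$, $s_u$, and $u^j_i$ contains exactly $N$ residues of type $c$, since $1 + 2 + 3(n-1) + (N-3n-1) + 1 = N$, which is what lets each one cover a full column of the face. The function names `column_d`, `column_u`, and `count_c`, the token strings, and the sample values of `n` and `N` are assumptions for illustration; the sketch needs $N \ge 3n + 1$.

```python
def column_d(n, N, bit=None):
    """Tokens of s_d (bit=None) or of d_i^j (bit is 0 or 1)."""
    mid = "cx" if bit is None else f"c{bit}"
    return (["a", "cx", "a", "a", "cx", "cx", "d", "d"]
            + ["cx", mid, "cx", "d", "d"] * (n - 1)
            + ["cx"] * (N - 3 * n - 1)
            + ["a", "a", "cx", "a"])


def column_u(n, N, bit=None):
    """Tokens of s_u (bit=None) or of u_i^j (bit is 0 or 1)."""
    mid = "cx" if bit is None else f"c{bit}"
    return (["a", "cx", "a", "a"]
            + ["cx"] * (N - 3 * n - 1)
            + ["d", "d", "cx", mid, "cx"] * (n - 1)
            + ["d", "d", "cx", "cx", "a", "a", "cx", "a"])


def count_c(tokens):
    """Number of residues of type cx, c0, or c1."""
    return sum(t.startswith("c") for t in tokens)


n, N = 3, 23  # sample values only; any N >= 3n + 1 works for this check
for column in (column_d(n, N), column_d(n, N, 1), column_u(n, N), column_u(n, N, 0)):
    print(count_c(column) == N, column.count("d") == 2 * n)
```

The same loop also confirms that every column carries $2n$ residues of type $d$, and a column built with a bit value carries $n - 1$ copies of the corresponding $c_0$ or $c_1$.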

For these conformations, the $b$'s in $S_1$ will form a cube, and the $c_x$'s in $S_2$ will cover five of the cube's faces. Figure 1 illustrates the conformation of the remaining face. The two side columns of the remaining face will be covered with $c_x$'s from the two $s_2$ subsequences of $S_3$. Adjacent to one of these columns of $c_x$'s will be the $c_x$'s from the first $s_u$. Adjacent to the other will be the $c_x$'s from the last $s_d$ and $s_u$. The other $n(\hat{n} + 2)$ columns will be covered in groups of $(\hat{n} + 2)$ columns that correspond to the $c$'s from an $s^j$. These groups can be permuted in any order, but the $c$'s from an $s^j$ lie within $\hat{n} + 2$ consecutive columns. The $c_0$'s and $c_1$'s are placed along $n - 1$ identical rows. These rows will be every third row, starting with the fifth row from one side. The pairs of rows before, after, and between these rows will have rows of $d$'s above each row. The following proposition demonstrates that these are the only conformations in $\mathcal{C}_{M_2}(S_1 S_2 S_3)$.

PROPOSITION 2. The conformations described above are the only $(\{b\}, \{b, c_x, c_0, c_1\})$-covered conformations of $S_1 S_2 S_3$ that contain the $N \times N \times N$ cube of $b$'s.

PROOF. First, note that there are exactly enough $c$'s to cover the cube. Each subsequence $c_x^{N^2}$ must cover a single face of the cube of $b$'s since otherwise there would exist a $c_x$ that is not in contact with the cube of $b$'s, leaving too few $c$'s to cover the cube. Now consider the subsequence $S_3$. Clearly, all of the $c$'s in this subsequence lie on the remaining face of the cube. Consequently, the $c$'s in each $s_1$, $s_2$, $s_d$, $d^j_i$, $s_u$, and $u^j_i$ will necessarily form paths on this face. The $c_x$'s that are adjacent to two $a$'s in the sequence must cover the edge of the face they are covering; otherwise, one of the $a$'s would cover a $b$, preventing the desired coverage of the cube. It follows that the two subsequences $s_2$ and the outermost $c_x$'s of the $d^j_i$'s, $u^j_i$'s, $s_d$'s, and $s_u$'s lie along the edges.

Fig. 1. Illustration of the configuration of the last subsequence of $c$'s in $S_2$ on the face of the cube of $b$'s. Short loops illustrate connections formed by a pair of $a$'s. Long loops illustrate connections formed by long sequences of $N - 1$ $a$'s. The two blocks of $c$'s and $d$'s in the middle of the face illustrate the configuration of the subsequences that correspond to vertices. The shaded regions indicate where the $d$'s form loops on the face of the $c$'s.

Now the $c_x$'s in a subsequence $s_2$ must form a path along the edge of the face. Consider the interior point above the face that is adjacent to the middle $c_x$ (see Figure 2); if the middle $c_x$ is in a corner, consider the interior point diagonally adjacent. If the path formed by $s_2$ bends at a corner, then this point is at least $(N - 1)/2$ away from the nearest open edge point and $(N + 1)/2$ away from the next. Consequently, a path of $c_x$'s that covers this point would need to have a length of at least $(N - 1)/2 + (N + 1)/2 + 1 = N + 1$. However, the subsequences of $S_3$ have at most $N$ $c$'s that can be laid down along a path that starts and ends at an edge of the face. Consequently, this point on the face cannot be covered unless the subsequence $s_2$ does not bend at a corner.

Fig. 2. Illustration of the point next to the subsequence $s_2$ that cannot be covered by a $c$. The dark path represents a path of the $c_x$'s in $s_2$, and the gray square represents the closest uncovered point that is adjacent to the middle of this path. The lines connecting this point show a shortest path through this point that begins and ends at the edge of the face of the cube.

If neither of the subsequences $s_2$ bends at a corner, then they must lie along opposite edges of the face. Now consider a point on the face that is along the line between the middle $c_x$'s of the $s_2$ subsequences. Because of the constraints imposed by the $s_2$ subsequences, these points can only be covered by paths that run from one edge of the face to the other edge of the face, parallel to the paths formed by the $s_2$ subsequences.

Consequently, the subsequences $s^j$ form a block of consecutive lines on the face of the cube of $b$'s. The sequence of $a$'s connecting $s^j$ and $s^{j+1}$ is sufficiently long for the $s^j$'s to be placed in any permutation on the face of the cube and is short enough to ensure that all of the $s^j$'s start from the same side. The $a$'s connecting $s^j$ to $s^{j+1}$ loop up to the $(j + 1)$th plane above the face of the cube, form a loop on that plane, and then connect to the location of $s^{j+1}$ on the face of the cube (see Figure 3). These loops use independent planes and columns, so they are completely disjoint.

The next subsequence is

• $S_4 = a^N \left(a f_x^{(N-5)^2}\right)^5 e^{(N-5)^3} f_x^{N-5} (t_u t_d)^{(N-5)/2 - 1} f_x^{N-5}$, where

$$
\begin{aligned}
t_d &= f_x f_x (f_d f_d f_x)^n f_x^{N-3n-7}, \\
t_u &= f_x^{N-3n-7} (f_x f_d f_d)^n f_x f_x.
\end{aligned}
$$

The next matrix is $M_3 = M_2 + \eta_2\,\mathrm{cross}(\{e\}, \{e, f_x, f_d\}) + M'$, where $\eta_2 \in \mathbb{Q}^-$ and $M'$ is zero except for $M'_{e,e} = \alpha_2 \in \mathbb{Q}^-$. Since there are no energies between the amino acids $\{e, f_x, f_d\}$ and the remaining amino acids, these amino acids form a separate substructure. From the arguments in Propositions 1 and 2 we can conclude that this substructure is an $(N - 5) \times (N - 5) \times (N - 5)$ cube of $e$'s that is covered with $f$'s.
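The residue counts in $S_4$ can be sanity-checked with a small Python sketch. Under the reading of $S_4$ used here, the sequence carries $(N-5)^3$ residues of type $e$ and $6(N-5)^2$ residues of type $f$, exactly enough to cover the six faces of the $e$-cube; each of $t_d$ and $t_u$ has length $N - 5$, one column of the sixth face. The function names and the sample values of `n` and `N` are assumptions for illustration, and the sketch requires $N - 5$ to be even and $N \ge 3n + 7$.

```python
def t_d(n, N):
    """Tokens of t_d."""
    return ["fx", "fx"] + ["fd", "fd", "fx"] * n + ["fx"] * (N - 3 * n - 7)


def t_u(n, N):
    """Tokens of t_u."""
    return ["fx"] * (N - 3 * n - 7) + ["fx", "fd", "fd"] * n + ["fx", "fx"]


def build_S4(n, N):
    """Tokens of S_4; side is the edge length of the e-cube."""
    side = N - 5
    if side % 2 != 0 or N < 3 * n + 7:
        raise ValueError("this sketch assumes N - 5 even and N >= 3n + 7")
    return (["a"] * N
            + (["a"] + ["fx"] * side ** 2) * 5
            + ["e"] * side ** 3
            + ["fx"] * side
            + (t_u(n, N) + t_d(n, N)) * (side // 2 - 1)
            + ["fx"] * side)


n, N = 2, 15  # sample values only
seq = build_S4(n, N)
side = N - 5
num_f = sum(t in ("fx", "fd") for t in seq)
print(len(t_d(n, N)) == side, seq.count("e") == side ** 3, num_f == 6 * side ** 2)
```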


Fig. 3. Illustration of the configuration of the loops of $a$'s that connect together the $s^j$'s on the face of the cube of $b$'s.

It follows that $\mathcal{C}_{M_3}(S_1 S_2 S_3 S_4)$ only contains conformations that consist of a cube of $b$'s covered with $c$'s loosely connected to a cube of $e$'s covered with $f$'s. The following proposition demonstrates that these two cubes can be tied together with the energy matrix $M_4 = M_3 + \eta_3\,\mathrm{cross}(\{f_d\}, \{f_x, f_d, d\})$, $\eta_3 \in \mathbb{Q}^-$.

PROPOSITION 3. $\mathcal{C}_{M_4}(S_1 S_2 S_3 S_4)$ only contains conformations that consist of a cube of $b$'s covered with $c$'s connected to a cube of $e$'s covered with $f$'s, such that the $f_d$'s on the $e$-cube are in contact with the $d$'s that are connected to the $b$-cube (see Figure 4).

PROOF. It is possible to cover the $f_d$'s to form the conformations required by this proposition. Because the size of the $e$-cube is even, it does not matter whether the sequence of $a$'s connecting the $f$'s and $c$'s has odd or even length; in either case the $e$-cube can be constructed to form the $f_d$–$d$ contacts.

From Theorem 1 we know that all of the conformations in $\mathcal{C}_{M_4}(S_1 S_2 S_3 S_4)$ must cover the $f_d$'s and that $\mathcal{C}_{M_4}(S_1 S_2 S_3 S_4) \subseteq \mathcal{C}_{M_3}(S_1 S_2 S_3 S_4)$. This implies that the covered cubes of $b$'s and $e$'s must exist in all conformations in $\mathcal{C}_{M_4}(S_1 S_2 S_3 S_4)$. Since the $f_d$'s must lie on the face of the $e$-cube, the only way they can be completely covered is if the $d$'s form a single $f_d$–$d$ contact with each $f_d$. However, this implies that the conformation connects the $e$-cube with the $b$-cube. Further, since the position of the $d$'s is fixed on the face of the $b$-cube, the optimal conformations of $S_1 S_2 S_3 S_4$ connect the cubes as illustrated in Figure 4.

Fig. 4. Illustration of the optimal configuration of $S_4$ with energy matrix $M_4$ shown from a side. The large black square represents the $b$-cube and the smaller black square represents the $e$-cube. The gray bands surrounding the black squares illustrate the position of the $c$'s and $f$'s on the faces of the $b$- and $e$-cubes, respectively. The small gray rectangles between the gray bands illustrate the position of the $d$'s that form contacts with the $f_d$'s.

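The final subsequence is

• $S_5 = t_{i_1,j_1} t_{j_1,i_1} t_{i_2,j_2} t_{j_2,i_2} \cdots t_{i_K,j_K} t_{j_K,i_K}$, where

$$
\begin{aligned}
t_g &= (a\, g_x^{\hat{n}}\, a)^{n-2}, \\
t_{i,j} &= a^{8m}\, t_g\, a\, g_{\mathrm{bit}^i_1} g_{\mathrm{bit}^i_2} \cdots g_{\mathrm{bit}^i_{\hat{n}}}\, g_{\mathrm{bit}^i_{\hat{n}}} \cdots g_{\mathrm{bit}^i_1}\, a a\, g_{\mathrm{bit}^j_1} g_{\mathrm{bit}^j_2} \cdots g_{\mathrm{bit}^j_{\hat{n}}}\, g_{\mathrm{bit}^j_{\hat{n}}} \cdots g_{\mathrm{bit}^j_1}\, a\, t_g,
\end{aligned}
$$

and the pairs $(i_1, j_1), \ldots, (i_K, j_K)$ are the edges in the graph used in the reduction.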

The final matrix, $M = M_4 + \eta_4\,\mathrm{cross}(\{c_0\}, \{c_0, c_1, c_x, g_x, g_0, b\}) + \eta_5\,\mathrm{cross}(\{c_1\}, \{c_0, c_1, c_x, g_x, g_1, b\})$, $\eta_4, \eta_5 \in \mathbb{Q}^-$, allows optimal conformations that cover $c_0$'s and $c_1$'s by threading the $g$'s between the $d$'s. We say that edge $(i, j)$ is confirmed if in either $t_{i,j}$ or $t_{j,i}$ each of the $g_0$'s and $g_1$'s form a contact with $c_0$'s and $c_1$'s, respectively. Note that due to the parity of the cubic lattice only one of $t_{i,j}$ and $t_{j,i}$ can satisfy this definition for a particular conformation of $S_1 S_2 S_3 S_4$. The following proposition shows that if there does not exist a Hamiltonian path, then there are at most $n - 2$ confirmed edges.

PROPOSITION 4. If there does not exist a Hamiltonian path in the instance of HAMILTONIAN PATH, then the conformations in $C_M(S)$ can have at most $n-2$ confirmed edges.


PROOF. We know from Theorem 1 and Proposition 3 that $C_M(S)$ consists of conformations that contain a covered $b$-cube that is connected to a covered $e$-cube.

Consider a subsequence of $S$ that contains $n$ consecutive subsequences of $g$'s, and note that two of the subsequences of $g$'s consist of $g_0$'s and $g_1$'s. Consequently, a covering of the $c_0$'s and $c_1$'s along a given row is possible only if there exists an encoded edge (either $t_{i,j}$ or $t_{j,i}$) that can confirm one of the $n-1$ pairs of adjacent $s_k$'s along the row. If the permutation of the $s_k$ subsequences does not place $s_i$ next to $s_j$ on the face of the $b$-cube, then a subsequence corresponding to an edge $(i, j)$ cannot be confirmed; in particular, the symmetric encodings of $i$ and $j$ ensure an exact match between the encoded numbers in the $g$'s and $c$'s.

Thus, for any permutation of the $s_k$'s, confirmed edges correspond to edges in the graph along a path defined by the permutation. If an edge $(i, j)$ is confirmed by $t_{i,j}$, then the subsequence $t_{j,i}$ cannot confirm $(i, j)$ by simply threading it antiparallel to $t_{i,j}$ because of the parity of the cubic lattice (note that $t_{i,j}$ and $t_{j,i}$ are connected by an even length string). Consequently, if there does not exist a Hamiltonian path in the graph, then there cannot exist more than $n-2$ confirmed edges in $S$.
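The combinatorial core of this argument can be illustrated apart from the lattice geometry. The sketch below is an abstraction, not the authors' construction: it treats a permutation of the vertices as the order of the $s_k$'s on the face of the $b$-cube and counts how many consecutive pairs are graph edges. The function names are hypothetical, and the brute-force search is only meant for very small graphs.

```python
from itertools import permutations

def edges_along(perm, edges):
    """Number of consecutive pairs in `perm` that are edges of the graph."""
    edge_set = {frozenset(e) for e in edges}
    return sum(frozenset(pair) in edge_set for pair in zip(perm, perm[1:]))

def max_confirmable(n, edges):
    """Best count over all vertex orderings (exponential; small n only)."""
    return max(edges_along(p, edges) for p in permutations(range(n)))
```

For a path graph on four vertices the maximum is $n-1 = 3$, whereas for a star with three leaves, which has no Hamiltonian path, it is $n-2 = 2$.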

The proof of Theorem 2 follows directly from Proposition 4.

PROOF OF THEOREM 2. The sequence $S$ can clearly be generated in polynomial time from an instance of HAMILTONIAN PATH. The value of $B$ can similarly be computed by adding up the energy of the conformation of $S$ that forms the connected $b$-cube and $e$-cube, and assuming that $n-1$ edges were confirmed.

If a Hamiltonian path exists, then the $g$'s can be threaded between the $b$-cube and $e$-cube to confirm $n-1$ edges, so all of the $c_0$'s and $c_1$'s are covered. Let $C$ refer to such a conformation and note that $E_M(C, S) = B$. If a Hamiltonian path does not exist, then we know from Proposition 4 that at most $n-2$ edges are confirmed. Thus either the $c_0$'s or $c_1$'s are not covered in the optimal conformations of $S$. From Theorem 1 we know that, for all conformations $C$, $E_M(C, S) > B$.

5. Discussion. Several authors have provided detailed discussions of the relevance of intractability arguments to protein folding in natural systems [7], [14], [15], [17]. A major factor that affects the interpretation of NP-hardness results is the extent to which our simplified protein folding models capture features of the protein folding process that are fundamentally related to the time needed to fold a protein. If our models fail to capture these features, then NP-hardness results may simply reflect this fact.

We have argued that previous NP-hardness results for energy minimization may simply be due to the generality of the protein models that were considered [12]. In particular, these models either fail to model the set of possible proteins with a finite number of amino acid types or allow unreasonable protein structures. The use of models that assume an infinite number of amino acid types is a fundamental weakness because these models include many biologically unreasonable protein sequences. In fact, the sequences used in the proofs of NP-hardness take advantage of the fact that there are an effectively infinite number of amino acid types (e.g., see [12]).

The model analyzed in this paper defines proteins with an alphabet of 12 amino acid types, thereby answering the open question posed by Hart and Istrail [12]. It consequently offers much more insight into the difficulty of energy minimization. Although our result applies for a specific alphabet and class of energy matrices, these results should provide insight into the possible intractability of other related models.

The energy matrix $M$ used in Theorem 2 does not include repulsive interactions, but we believe that including repulsive interactions would simplify our intractability argument. Unlike energy matrices used in prior NP-hardness analysis, $M$ does not have radically different values. In fact, the difference $|M_{ij} - M_{rs}|$ can be arbitrarily small for all $i$, $j$, $r$, and $s$. An interesting consequence is that the energy of the optimal conformation may be arbitrarily close to the energy of nonoptimal conformations. Thus an energy matrix exists (for which the intractability argument holds) that matches the small gap between denatured and native conformations observed in nature.

Our result demonstrates that one of the major sources of generality in protein models can be eliminated. However, the model we have analyzed does suffer from some of the same weaknesses of related protein lattice models. For example, the space of protein sequences may still be unnecessarily broad, including a wide range of sequences that are not similar to naturally occurring proteins.

One component of the protein folding problem that is commonly overlooked in computational analyses is that the lowest energy state of a protein does not include a collection of many degenerate structures (i.e., different structures with the same energy). Generally, at the level of detail represented by lattice proteins, it is assumed that a protein's native structure is unique. This property has been used in studies of protein sequences to identify "protein-like" sequences [5]. However, it has been uniformly omitted from complexity analyses of protein folding. In fact, the sequences that have been used to provide reductions from NP-hard problems typically have lowest-energy conformations that are not unique (e.g., see [17]). Our analysis suffers from this same weakness, since the cubes employed in our design of the optimal conformations can be constructed in a very large number of different ways. This weakness will be particularly difficult to overcome because it is the conformational flexibility of the protein sequence that has been widely used to prove NP-hardness results. In fact, it is entirely conceivable that finding native conformations for sequences with unique lowest-energy conformations may not be intractable.

Acknowledgments. We thank Sorin Istrail for many helpful discussions. We also thank Sarina Bromberg for noting the importance of uniqueness in protein folding models. We thank Rajmohan Rajaraman for his help with proving Proposition 1.

Appendix. Proof of Proposition 1. Let $A$ be any set of lattice points in $\mathbb{Z}^3$, and let $V_A$ be the order of $A$. Let $P_{xy}(A)$ be the projection of $A$ onto the $x$–$y$ plane, let $P_{xz}(A)$ be its projection onto the $x$–$z$ plane, and let $P_{yz}(A)$ be its projection onto the $y$–$z$ plane. Let $V_{xy}(A)$, $V_{xz}(A)$, and $V_{yz}(A)$ be the orders of $P_{xy}(A)$, $P_{xz}(A)$, and $P_{yz}(A)$, respectively. Similarly, let $P_x(A)$ be the projection of $A$ onto the $x$ axis, let $P_y(A)$ be its projection onto the $y$ axis, and let $P_z(A)$ be its projection onto the $z$ axis, and let $V_x(A)$, $V_y(A)$, and $V_z(A)$ be their orders, respectively.


We prove Proposition 1 using the following lemma, which allows us to place a lower bound on the number of adjacencies between $b$'s and non-$b$'s. This enables us to get an upper bound on the number of possible $b$–$b$ adjacencies within $S$.

LEMMA 1. $V_{xy}(A)\,V_{xz}(A)\,V_{yz}(A) \ge V_A^2$, with equality holding only if $A = P_x(A) \times P_y(A) \times P_z(A)$.

PROOF. We prove this by induction on the number of distinct $z$ coordinates among the points in $A$. First, if there is only one distinct value, then $V_{xy}(A) = V_A$, and we need only show that $V_{xz}(A)\,V_{yz}(A) \ge V_A$. This follows from the fact that in two dimensions $A$ is a subset of the cross product of its projection onto the $x$ axis and its projection onto the $y$ axis. Thus equality can only be achieved if $A$ equals this cross product. Our base case is now done.

We now assume our theorem is true for sets of points with fewer than $r$ distinct values for the $z$ axis. Pick any value $k$ such that there exists at least one point in $A$ whose $z$ coordinate is greater than $k$ and at least one point in $A$ whose $z$ coordinate is less than or equal to $k$. Let $A'$ be the set of points in $A$ whose $z$ coordinate is greater than $k$, and let $A''$ be $A - A'$. By our induction hypothesis $V_{xy}(A')\,V_{xz}(A')\,V_{yz}(A') \ge V_{A'}^2$ and $V_{xy}(A'')\,V_{xz}(A'')\,V_{yz}(A'') \ge V_{A''}^2$, with both equalities holding if and only if both $A' = P_x(A') \times P_y(A') \times P_z(A')$ and $A'' = P_x(A'') \times P_y(A'') \times P_z(A'')$. Now,

$$
\begin{aligned}
V_{xy}(A)\,V_{xz}(A)\,V_{yz}(A)
&= V_{xy}(A)\bigl(V_{xz}(A') + V_{xz}(A'')\bigr)\bigl(V_{yz}(A') + V_{yz}(A'')\bigr) \\
&= V_{xy}(A)\bigl(V_{xz}(A')V_{yz}(A') + V_{xz}(A')V_{yz}(A'') + V_{xz}(A'')V_{yz}(A') + V_{xz}(A'')V_{yz}(A'')\bigr) \\
&\ge V_{xy}(A)\Bigl(V_{xz}(A')V_{yz}(A') + V_{xz}(A'')V_{yz}(A'') + 2\sqrt{V_{xz}(A')V_{yz}(A'')V_{xz}(A'')V_{yz}(A')}\Bigr) \\
&\ge V_{xy}(A)V_{xz}(A')V_{yz}(A') + V_{xy}(A)V_{xz}(A'')V_{yz}(A'') + 2\sqrt{V_{xy}(A')V_{xy}(A'')V_{xz}(A')V_{xz}(A'')V_{yz}(A')V_{yz}(A'')} \\
&\ge V_{A'}^2 + V_{A''}^2 + 2V_{A'}V_{A''} = (V_{A'} + V_{A''})^2 = V_A^2.
\end{aligned}
$$

That the first inequality holds follows from the arithmetic-mean–geometric-mean inequality. The second follows from $V_{xy}(A) \ge \max\{V_{xy}(A'), V_{xy}(A'')\}$, and the third from our induction hypothesis.

Now suppose that $A \subset P_x(A) \times P_y(A) \times P_z(A)$. Then either (a) $A' \subset P_x(A') \times P_y(A') \times P_z(A')$, (b) $A'' \subset P_x(A'') \times P_y(A'') \times P_z(A'')$, or (c) $V_{xy}(A) > \min\{V_{xy}(A'), V_{xy}(A'')\}$. Each of these cases implies that $V_{xy}(A)\,V_{xz}(A)\,V_{yz}(A) > V_A^2$, so equality can only be achieved if $A = P_x(A) \times P_y(A) \times P_z(A)$. It is easily verified that equality is indeed achieved in this case.
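Lemma 1 is easy to spot-check numerically. The sketch below (helper names are illustrative) computes the three planar projection sizes of a finite point set and the slack $V_{xy}V_{xz}V_{yz} - V_A^2$, which the lemma asserts is nonnegative and zero for a full rectangular box.

```python
import random
from itertools import product

def projection_sizes(points):
    """Return (V_A, V_xy, V_xz, V_yz) for a collection of integer lattice points."""
    pts = set(points)
    v_xy = len({(x, y) for x, y, z in pts})
    v_xz = len({(x, z) for x, y, z in pts})
    v_yz = len({(y, z) for x, y, z in pts})
    return len(pts), v_xy, v_xz, v_yz

def lemma1_slack(points):
    """V_xy * V_xz * V_yz - V_A^2; Lemma 1 says this is never negative."""
    v, v_xy, v_xz, v_yz = projection_sizes(points)
    return v_xy * v_xz * v_yz - v * v

def random_point_set(size, side, rng):
    """Sample `size` distinct points from a side x side x side grid."""
    grid = list(product(range(side), repeat=3))
    return rng.sample(grid, size)
```

A $2 \times 3 \times 4$ box gives slack $6 \cdot 8 \cdot 12 - 24^2 = 0$, while the three-point L-shape $\{(0,0,0), (1,0,0), (0,1,0)\}$ gives $3 \cdot 2 \cdot 2 - 3^2 = 3$.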

We can now prove Proposition 1.

PROOF. It suffices to show that the $N \times N \times N$ cube is the unique set (up to translation) of $N^3$ lattice points with the maximum number of possible adjacencies between pairs of points within the set. This gives the minimal energy for $S$ since the chain connectivities are a constant factor.

Let $A$ be any given set of $N^3$ points. The number of adjacencies between vertices within $A$ is a strictly decreasing function of the number of adjacencies between vertices in $A$ and those outside of $A$. Therefore, minimizing the number of adjacencies between the vertices in $A$ and those outside of $A$ is equivalent to maximizing the number of adjacencies between pairs of vertices in $A$. We note that the number of adjacencies between vertices in $A$ and those outside is at least $2(V_{xy}(A) + V_{xz}(A) + V_{yz}(A))$ since for each pair $(i, j)$ for which there exists a $k$ value with $(i, j, k) \in A$ there are at least two such adjacencies: one involving the maximal such $k$ and one the minimal such $k$. Note that if $A$ has "gaps" in it, there will be more than this number of such adjacencies.

By the generalized arithmetic-mean–geometric-mean inequality, we have $2(V_{xy}(A) + V_{xz}(A) + V_{yz}(A)) \ge 6\,(V_{xy}(A)\,V_{xz}(A)\,V_{yz}(A))^{1/3}$, so from Lemma 1 we have $2(V_{xy}(A) + V_{xz}(A) + V_{yz}(A)) \ge 6V_A^{2/3} = 6N^2$. To have $2(V_{xy}(A) + V_{xz}(A) + V_{yz}(A)) = 6N^2$, we must satisfy two conditions: (1) $V_{xy}(A) = V_{xz}(A) = V_{yz}(A)$ and (2) $A = P_x(A) \times P_y(A) \times P_z(A)$. Now to attain $6N^2$, $A$ cannot have any gaps, so $A$ must also meet a third condition: (3) for every pair of vertices within $A$ which agree on two coordinates, all lattice points of their convex hull are in $A$ (this condition minimizes the number of adjacencies between vertices in $A$ and those outside). Conditions (2) and (3) force $A$ to be a rectangular solid while condition (1) forces it to be a cube. It is easily verified that the $N \times N \times N$ cube has only this many exterior adjacencies, so the lower bound is indeed sharp and the cube is the unique structure attaining it.
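The counting in this proof can be checked directly on small point sets. Since every lattice point has six neighbours, a set of $V_A$ points satisfies $6V_A = 2I + E$, where $I$ is the number of adjacencies within the set and $E$ the number of adjacencies to points outside it; this is why $I$ is strictly decreasing in $E$. The sketch below (helper names are illustrative) counts both quantities and compares a cube with a flatter box of the same volume.

```python
from itertools import product

STEPS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def adjacency_counts(points):
    """Return (interior, exterior) adjacency counts for a set of lattice points."""
    pts = set(points)
    inside = outside = 0
    for x, y, z in pts:
        for dx, dy, dz in STEPS:
            if (x + dx, y + dy, z + dz) in pts:
                inside += 1   # each interior adjacency is seen from both ends
            else:
                outside += 1
    return inside // 2, outside

def box(a, b, c):
    """All lattice points of an a x b x c rectangular solid."""
    return set(product(range(a), range(b), range(c)))
```

For eight points, the $2 \times 2 \times 2$ cube has 12 interior and $6N^2 = 24$ exterior adjacencies, while the $1 \times 2 \times 4$ box has 10 interior and 28 exterior adjacencies.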

References

[1] R. Agarwala, S. Batzoglou, V. Dančík, S. E. Decatur, S. Hannenhalli, M. Farach, S. Muthukrishnan, and S. Skiena. Local rules for protein folding on a triangular lattice and generalized hydrophobicity. In Proc. RECOMB 97, pages 1–2, 1997.

[2] B. Berger and T. Leighton. Protein folding in the hydrophobic-hydrophilic (HP) model is NP-complete. J. Comput. Biol., 5(1):27–40, 1998.

[3] H. S. Chan and K. A. Dill. Comparing folding codes for proteins and polymers. PROTEINS: Structure, Function, and Genetics, 24:335–344, 1996.

[4] P. Crescenzi, D. Goldman, C. Papadimitriou, A. Piccolboni, and M. Yannakakis. On the complexity of protein folding. J. Comput. Biol., 5(3):423–466, 1998.

[5] K. A. Dill, S. Bromberg, K. Yue, K. M. Fiebig, D. P. Yee, P. D. Thomas, and H. S. Chan. Principles of protein folding: a perspective from simple exact models. Prot. Sci., 4:561–602, 1995.

[6] C. J. Epstein, R. F. Goldberger, and C. B. Anfinsen. The genetic control of tertiary protein structure: studies with model systems. In Proc. Cold Spring Harbor Symposium on Quantitative Biology, pages 439–449, 1963.

[7] A. S. Fraenkel. Complexity of protein folding. Bull. Math. Biol., 55(6):1199–1210, 1993.

[8] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-completeness. Freeman, San Francisco, CA, 1979.

[9] W. E. Hart and S. Istrail. Fast protein folding in the hydrophobic-hydrophilic model within three-eighths of optimal. J. Comput. Biol., 3(1):53–96, 1996. Extended abstract in Proc. 27th Annual ACM Symposium on Theory of Computation, May 1995.

[10] W. E. Hart and S. Istrail. Invariant patterns in crystal lattices: implications for protein folding algorithms. In D. Hirshberg and G. Myers, editors, Combinatorial Pattern Matching, pages 288–303. Springer-Verlag, New York, 1996.


[11] W. E. Hart and S. Istrail. Lattice and off-lattice side chain models of protein folding: linear time structure prediction better than 86% of optimal. J. Comput. Biol., 4(3):241–260, 1997. Extended abstract in Proc. RECOMB 97, January 1997.

[12] W. E. Hart and S. Istrail. Robust proofs of NP-hardness for protein folding: general lattices and energy potentials. J. Comput. Biol., 4(1):1–20, 1997.

[13] A. Nayak, A. Sinclair, and U. Zwick. Spatial codes and the hardness of string folding problems. In Proc. Ninth Annual ACM–SIAM Symposium on Discrete Algorithms, pages 639–648, 1998.

[14] J. T. Ngo and J. Marks. Computational complexity of a problem in molecular structure prediction. Prot. Engrg., 5(4):313–321, 1992.

[15] J. T. Ngo, J. Marks, and M. Karplus. Computational complexity, protein structure prediction, and the Levinthal paradox. In K. Merz, Jr., and S. Le Grand, editors, The Protein Folding Problem and Tertiary Structure Prediction, Chapter 14, pages 435–508. Birkhäuser, Boston, MA, 1994.

[16] M. Paterson and T. Przytycka. On the complexity of string folding. Discrete Appl. Math., 71:217–230, 1996.

[17] R. Unger and J. Moult. Finding the lowest free energy conformation of a protein is a NP-hard problem: proof and implications. Bull. Math. Biol., 55(6):1183–1198, 1993.