
Proc. Natl. Acad. Sci. USA
Vol. 83, pp. 1608-1612, March 1986
Biochemistry

On the specificity of DNA-protein interactions
(regulation of transcription/DNA binding site specification/recognition mechanisms/regulatory proteins/lac repressor)

PETER H. VON HIPPEL AND OTTO G. BERG*
Institute of Molecular Biology and Department of Chemistry, University of Oregon, Eugene, OR 97403

Contributed by Peter H. von Hippel, November 4, 1985

ABSTRACT In this paper we summarize the various factors that must be considered in establishing the operational specificity of the binding of a protein regulator of gene expression to a DNA target site. We consider informational (combinatorial) aspects of binding-site specification, actual recognition mechanisms, and the thermodynamics of target-site selection against a background of competing pseudospecific and non-(sequence)-specific DNA binding sites. The results provide insight into the design, specification, and possibly the evolution of regulatory proteins and their chromosomal binding targets, as well as into practical aspects of the design of regulatory-protein isolation schemes and physicochemical regulatory considerations in vivo.

There are as many definitions of the specificity of protein-DNA interactions as there are molecular biologists and related scientists working on the problem. Specificity can be considered at the levels of (i) the combinatorial specification of the number of base pairs required to define a unique binding site in a genome; (ii) the structure of the DNA (and protein) binding site, including structural complementarity and steric access; (iii) the energetics of the interaction, including the electrostatic potentials of the surfaces of the interacting molecules; (iv) the thermodynamics of the interaction, as determined by the net free energy of specific complex formation and the effects of competing sites; and (v) equilibrium binding selection, which determines the actual level of saturation of the specific (regulatory) site under various environmental conditions. Since these levels are all interdependent, a coherent picture of the specificity of such interactions can only be obtained by considering them all in context.

FORMULATION OF THE PROBLEM

It is probably most useful to approach the general problem of protein-DNA interaction specificity in a functional context; i.e., in terms of the degree of saturation of a regulatory target site on a particular chromosome.

The lac operon of Escherichia coli can be used to illustrate the problem. This operon occurs once per bacterial genome. Depending on the physiological state of E. coli, and thus on its level of replication in proportion to the rate of cell division, the average bacterium may contain one or several (up to 3 or 4) copies of the lac operon. Each operon contains one operator site, which serves as the specific binding target for lac repressor. In wild-type cells there is an average of 10-30 copies of the lac repressor protein. In functional terms the central protein-nucleic acid interaction that defines this system is the competitive (with RNA polymerase) binding of lac repressor to the lac operator. The repressor and polymerase binding sites (operator and promoter, respectively) overlap (1), and thus, when repressor is bound, the promoter is occluded and transcription is inhibited. The lac operon in wild-type E. coli is expressed at $10^{-3}$ of the induced (constitutive or Lac-) level. In binding terms this means that the ratio of free to repressor-complexed operator sites in vivo is $10^{-3}$. A detailed thermodynamic analysis of this system has been presented (2, 3).

The degree of specificity of the interaction can best be appreciated when one realizes that the in vivo system contains approximately $10^{7}$ DNA binding sites that can potentially compete for lac repressor (each base pair of the chromosome comprises the beginning of a potential competing binding site). Thus the total concentration of potential DNA binding sites ($D_T$) greatly exceeds that of repressor molecules ($R_T$), which in turn exceeds that of operator sites ($O_T$); i.e., $D_T \gg R_T > O_T$.

The effective specificity of the system is measured in terms of the fractional saturation of the operator site with repressor. This saturation depends on the free concentrations of the various species and thus, in large measure, on the ratios of specific to nonspecific binding constants.

LEVELS OF SPECIFICITY

1. Binding Site Specification

We first consider the system in terms of absolute specificity; i.e., we assume a protein that can absolutely (and only) discriminate between the four information elements of DNA. There are four canonical nucleotide residues (A, T, G, and C) in single-stranded DNA and four types of base pairs (A·T, T·A, G·C, and C·G) in double-stranded DNA. The latter will be our primary focus here.

A specific binding site is recognized in terms of specific sequences of base pairs. A conditional probability approach (3) can be used to determine the minimal length (n) of a sequence of recognition elements (base pairs) required to specify a site, so that the expected frequency ($f_n$; see Eq. 5 below) with which that site reappears at random within the genome is less than unity. For E. coli DNA this minimal length is approximately 12 base pairs, assuming a double-stranded sequence within a genome of overall composition A = T = G = C.

This approach assumes that the overall sequence of the genome can be treated as chemically (though obviously not genetically) random and that every base pair is fully specified (in terms of base-pair type). Unspecified loci can, of course, interrupt the overall sequence in defined positions, but these loci will not count toward n. Similarly, specification only at the level of R·Y (purine·pyrimidine) (vs. Y·R) base pairs can occur; such loci are weighted less in establishing n. (For further details of this approach, see ref. 3.)
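The minimal site length quoted in this section can be checked with a short calculation. The sketch below is illustrative only: it assumes the roughly 10^7 candidate binding positions quoted earlier for E. coli (both strands counted) and equal base composition, and the function names are hypothetical.

```python
def expected_occurrences(n_sites: float, n: int) -> float:
    """Expected number of chance occurrences of one fully specified
    n-base-pair sequence among n_sites candidate positions,
    assuming A = T = G = C (probability 1/4 per position)."""
    return n_sites * 0.25 ** n


def minimal_site_length(n_sites: float) -> int:
    """Smallest n for which the expected chance frequency drops below 1."""
    n = 1
    while expected_occurrences(n_sites, n) >= 1.0:
        n += 1
    return n


# Assumption: ~1e7 potential binding positions, the figure quoted in the text.
SITES = 1e7
n_min = minimal_site_length(SITES)
print(n_min, expected_occurrences(SITES, n_min))
```

Under these assumptions the loop stops at n = 12, in line with the estimate given in the text.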

2. Recognition

a. Primary Sequence Recognition Mechanisms. The only molecular mechanism that can unambiguously recognize and discriminate individual base pairs in double-stranded DNA is complementary hydrogen-bonding through the major or minor grooves of the double helix (28, 29). These recognition patterns of hydrogen-bond donors and acceptors on DNA have been explicitly listed by Seeman et al. (4) and have been presented in a particularly simple and useful "stick-figure" representation by Woodbury and von Hippel (ref. 5; see also ref. 6). Such recognition depends on the interaction of the relevant hydrogen-bond donor and acceptor groups of the DNA base pair with a complementary matrix of hydrogen-bond acceptor and donor groups provided by appropriately positioned amino acid and peptide functional groups in the binding site of the regulatory protein. This hydrogen-bonded recognition-matrix approach has more recently been used by crystallographers in deducing interactions in specific protein-nucleic acid interaction complexes (e.g., see refs. 7 and 8).

*Permanent address: Department of Molecular Biology, Biomedical Center, Box 590, S-751 24 Uppsala, Sweden.

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact.

b. Secondary Sequence Recognition Mechanisms. When the issue is not the absolute identification of a particular sequence of base pairs by a regulatory protein, other recognition mechanisms, at a "lower level" of specification, can also come into play. Thus DNA "regions" can be discriminated at the level of strandedness (e.g., single- vs. double-stranded), groove geometry and secondary structure (e.g., B- vs. Z-form DNA), and so forth, based on differences in protein binding affinity. The origins of these affinity differences can be steric, conformational, or electrostatic and may reflect structural consequences of regional differences in base-pair composition. Examples include the preferred binding of tetraalkylammonium ions to dA·dT sequences in the (major) groove of B-DNA (9), the preferred binding of the antibiotics netropsin and distamycin in the minor grooves of poly(dAdT) and dA·dT-rich sequences of B-form DNA (10), the nonspecific binding of E. coli lac repressor to double-stranded DNA (11, 12), the preferential binding of phage T4 gene 32-encoded protein to single-stranded DNA and RNA (13), and the differential sensitivities of gene-specific eukaryotic DNA regions to endonucleases (14). Such interactions obviously provide a measure of regional binding specificity. However, they do not carry enough structural information to provide a mechanism for the primary recognition of specific base-pair sequences by proteins. Such recognition must originate in hydrogen-bonding interactions.

3. Affinity

Accepting the notion that specific target-site recognition does indeed involve the "reading," by a regulatory protein, of a specific array of hydrogen-bonding donors and acceptors in the major and minor grooves of the DNA double helix, we next consider the question of quantitative specificity or discrimination, which can also be termed "the problem of the other sites."

This problem exists because discrimination between, for example, "right" and "wrong" base pairs cannot be absolute. Rather there is some finite level of affinity of the protein for the "correct" site and some lower (but nonzero) and progressively decreasing affinity for other sites with decreasing degrees of homology with the correct one. To the extent that the great preponderance of wrong sites can compete with the regulatory target for protein and thus reduce the free protein concentration, the effective affinity of protein for the correct sites will also be reduced.

a. Sequence-Specific Binding Free Energy. We may attempt to estimate the favorable binding free energy expected per correctly positioned hydrogen-bond donor-acceptor pair between protein and nucleic acid from first principles. Since the functional groups of both the protein and the nucleic acid binding sites will be involved in hydrogen-bonding with water when the complex is dissociated, for illustrative purposes we assume an average differential contribution of approximately -0.5 kcal/mol per correctly formed protein-to-nucleic acid hydrogen bond. Assuming an average of one to two hydrogen-bonded recognition events per base pair, this gives us a range of favorable specific binding free energies (for a protein with a recognition-site size n of 12 base pairs) of -6 to -12 kcal/mol of protein bound.

These are not large numbers, and it is important to recognize that much more favorable free energy is likely to be lost per mispaired position than is gained per proper recognition event. This follows because a mispositioned base pair can result in the total loss of at least one hydrogen-bonding interaction; i.e., a protein hydrogen bond donor will end up "facing" a nucleic acid donor, or an acceptor will be "buried" facing an acceptor. In either case at least one hydrogen bond that was broken in removing the protein and nucleic acid donor (or acceptor) groups from contact with the solvent is not replaced, and an unfavorable contribution of as much as +5 kcal/mol may be added to the binding free energy unless the protein-DNA complex can adjust its overall conformation somewhat to minimize this problem. This phenomenon illustrates the principle that generally applies to recognition interactions that are based on hydrogen-bond donor-acceptor complementarity in water; i.e., correct donor-acceptor interactions may not add much to the stability of the complex, but incorrect hydrogen-bond complementarities are markedly destabilizing. Thus, differential specificity of this type is largely attributable to the unfavorable effects of incorrect contacts.

b. Non-Sequence-Specific Binding Free Energy. If, as discussed above, the main determinants of specificity are the unfavorable contributions of "wrong" base pairs, specific binding will also require a large nonspecific contribution to the binding free energy to achieve sufficient binding affinity. Such nonspecific interactions usually involve a large electrostatic component, due mostly to the displacement of condensed counterions from DNA phosphate groups by positively charged protein side chains (see ref. 15). For example, for lac repressor binding to the operator site, we estimate a total standard free energy of binding of approximately -17 kcal/mol under physiological salt concentrations. This interaction involves the formation of 7-8 nonspecific charge-charge interactions between protein and DNA (15, 16) and numerous base-pair-specific recognition interactions (17). The binding of lac repressor to nonspecific DNA involves approximately 11 charge-charge interactions and no base-pair-specific interactions (11, 12), resulting in a standard free energy of binding of approximately -7 kcal/mol under the same conditions.
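To relate these free energies to the binding constants used later, the standard relation K = exp(-ΔG°/RT) can be applied. The following is a rough sketch only: the temperature of 298.15 K is an assumption (the text specifies only "physiological salt concentrations"), and the function name is hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def binding_constant(delta_g_kcal: float, temp_k: float = 298.15) -> float:
    """Association constant (M^-1) from a standard binding free energy,
    K = exp(-dG/RT). The default temperature is an assumption."""
    return math.exp(-delta_g_kcal / (R_KCAL * temp_k))


k_s = binding_constant(-17.0)   # operator binding, free energy from the text
k_ns = binding_constant(-7.0)   # nonspecific DNA binding, from the text

print(f"K_s  ~ {k_s:.1e} M^-1")
print(f"K_ns ~ {k_ns:.1e} M^-1")
print(f"K_s / K_ns ~ {k_s / k_ns:.1e}")
```

With these inputs the specific constant comes out on the order of 10^12 M^-1 and the discrimination ratio on the order of 10^7; the exact figures shift with the assumed temperature.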

c. Conformational Change of the Regulatory Protein to a Totally Nonspecific Binding Mode. The above discussion suggests that at the rate at which the specific binding free energy decreases with misplaced (noncomplementary) hydrogen-bonding contacts, more than 3-5 "incorrect" base pairs in the lac operator sequence may result in complete dissociation of a repressor-pseudooperator complex. Instead we find that under these conditions, lac repressor "isomerizes" to a binding mode where the interaction free energy is totally electrostatic and involves no sequence-dependent components. The same behavior may also characterize E. coli RNA polymerase (18) and phage T4-coded DNA polymerase (19). One of the advantages of this electrostatic binding mode for lac repressor (and perhaps for other genome-regulatory proteins as well) may be its ability to "slide" over the surface of the DNA molecule in a one-dimensional diffusion process, thus facilitating translocation to the regulatory target site (20, 21).

d. Distortions of the Protein and/or the DNA Target Site. In concluding these remarks on affinity, we stress that neither the protein nor the DNA sites involved in binding are totally rigid. Thus, both partners in complex-formation can (and will) distort to optimize sterically sensitive binding interactions, within the limits of energetically available conformations. We have already seen an example of this in the isomerization of lac repressor to a totally electrostatically bound, non-sequence-dependent binding mode in the presence of an excess of unfavorable hydrogen-bonding interactions. DNA sites can also change their local conformations. Thus, for example, hydrogen-bonding of DNA base pairs to a potentially complementary protein matrix could be either improved or degraded by a local conformational change of the DNA that modifies the relative positions, directions, and exposures of the hydrogen-bonding functional groups located in the grooves of the DNA double helix.

It is important to emphasize that such distortions from optimal solution conformations are associated with a thermodynamic cost. The free energy required to maintain the optimal (distorted) binding conformation of either partner must be subtracted from the favorable free energy of the binding interaction. Beyond a certain point this thermodynamic cost exceeds the free energy gained as a consequence of the interaction, and complex-formation no longer occurs.

4. Equilibrium Selection

How are binding sequences designed to provide sufficient specificity in vivo? In the example of repressor-operator binding, the effective (or functional) selection is determined by the fractional saturation of operator site(s). This in turn is controlled not only by the affinities discussed above, but also by the concentrations and availabilities of repressors, operators, and various competitive DNA binding sites.

a. Coupled Equilibria. These competitive binding reactions can best be described by a coupled equilibrium model. If the total concentration of repressor molecules is $R_T$, the concentration of free repressors is $R_F$, and the total concentration of DNA sites with binding constant $K_i$ is $D_i$, one finds

$$R_T = R_F + \sum_i \frac{K_i D_i R_F}{1 + K_i R_F} \qquad [1]$$

The summation term over all (available) DNA sites gives the concentration of bound protein; thus, the concentration of free protein can be related to the fractional saturation of any site. In particular, we are interested in the fractional saturation ($\theta_s$) of the specific site (with binding constant $K_s$). In these terms the free concentration of repressor is

$$R_F = \frac{1}{K_s}\,\frac{\theta_s}{1 - \theta_s} = \frac{x}{K_s} \qquad [2]$$

We shall define $x$ [$\equiv \theta_s/(1 - \theta_s)$] as the effective selection factor for the specific site. Thus,

$$R_T = \frac{x}{K_s} + \frac{x D_s}{1 + x} + \sum_{ns} \frac{x D_i}{x + K_s/K_i} \qquad [3]$$

where $D_s$ is the total concentration of specific sites (operators) in the system, and the summation term is over all nonspecific sites (ns). From Eq. 3, one can calculate the effective selection factor ($x$) as a function of the total repressor concentration and the relevant binding constants. In general, the fractional saturation of the specific site ($\theta_s$) should be close to 1, so that $x \gg 1$. (For lac repressor-operator binding, as discussed above, $x \approx 10^{3}$.)

The competition from each class $i$ of nonspecific sites is determined mainly by the discrimination ratio for that site relative to the specific site ($K_s/K_i$). To see roughly how the various nonspecific sites influence the specific binding, we divide them into two groups. Strong "pseudosites," involving only a few "wrong" base pairs, bind protein in its specific binding mode and exhibit values of $K_s/K_i < x$. These sites are titrated along with the specific (functional) site(s) to achieve the final level of specific site selection, $x$. Weak nonspecific sites exhibit values of $K_s/K_i > x$. These sites thus participate as an unsaturated nonspecific binding "background." If this grouping is applied, Eq. 3 can be approximated as

$$x \approx \frac{K_s\,[R_T - D_s - D_{ps}]}{1 + K_{ns} D_{ns}} \qquad [4]$$

in the limit of strong selection ($x \gg 1$). Here $D_{ps}$ is the concentration of strong pseudosites and $K_{ns}D_{ns}$ ($= \sum K_i D_i$) represents an average over the weak nonspecific sites with concentration $D_{ns}$. Thus $K_s/(1 + K_{ns}D_{ns})$ can be considered an effective specific binding constant. Eq. 4 shows that nonspecific binding can have a profound effect on the titration of specific regulatory sites (e.g., see refs. 2, 3, 22, and 23).
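The relation between the full coupled-equilibrium expression (Eq. 3) and its strong-selection limit (Eq. 4) can be explored numerically. The sketch below is a minimal illustration, not the authors' calculation: it solves Eq. 3 for x by bisection and compares the result with Eq. 4. The inputs (20 repressors and one operator per cell, no strong pseudosites, a single class of 10^7 weak nonspecific sites with K_ns·D_ns = 10, K_s = 10^12 M^-1, cell volume 10^-15 liter) are illustrative assumptions drawn from the orders of magnitude quoted in the text, and all function names are hypothetical.

```python
AVOGADRO = 6.02214076e23


def total_repressor(x, k_s, d_s, nonspecific):
    """Right-hand side of Eq. 3: R_T implied by a selection factor x.
    `nonspecific` is a list of (D_i, K_i) pairs for the competing sites."""
    bound_ns = sum(x * d_i / (x + k_s / k_i) for d_i, k_i in nonspecific)
    return x / k_s + x * d_s / (1.0 + x) + bound_ns


def solve_selection_factor(r_t, k_s, d_s, nonspecific, lo=1e-12, hi=1e12):
    """Solve Eq. 3 for x by bisection on a logarithmic scale.
    Works because the right-hand side increases monotonically with x."""
    for _ in range(200):
        mid = (lo * hi) ** 0.5
        if total_repressor(mid, k_s, d_s, nonspecific) < r_t:
            lo = mid
        else:
            hi = mid
    return (lo * hi) ** 0.5


def approx_selection_factor(r_t, k_s, d_s, d_ps, k_ns_d_ns):
    """Strong-selection approximation of Eq. 4."""
    return k_s * (r_t - d_s - d_ps) / (1.0 + k_ns_d_ns)


# Illustrative inputs (assumptions, see lead-in).
V_CELL = 1e-15                            # liters
per_cell = 1.0 / (AVOGADRO * V_CELL)      # molar concentration of 1 molecule/cell
K_S = 1e12                                # M^-1
D_S = 1 * per_cell                        # one operator
R_T = 20 * per_cell                       # twenty repressors
D_NS = 1e7 * per_cell                     # weak nonspecific sites
K_NS = 10.0 / D_NS                        # chosen so that K_ns * D_ns = 10

x_exact = solve_selection_factor(R_T, K_S, D_S, [(D_NS, K_NS)])
x_approx = approx_selection_factor(R_T, K_S, D_S, 0.0, K_NS * D_NS)
print(f"x from Eq. 3: {x_exact:.4g}   x from Eq. 4: {x_approx:.4g}")
```

For these inputs the two estimates agree closely and fall in the range of 10^3, which is the regime where the text states that Eq. 4 applies.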

b. Limits to Protein Concentrations and Binding Constants. A variety of physical, physiological, and evolutionary considerations conspire to place upper and lower limits on the in vivo values of the binding constants and total concentrations of proteins that regulate gene expression by binding to specific DNA sites.

(i) Titration. An effective degree of regulatory site titration (e.g., for operon repression) requires $x \gg 1$; thus, the specific binding constant must be large enough to make $x \approx K_s(R_T - D_s)/(1 + K_{ns}D_{ns}) \gg 1$. This establishes a lower limit for $K_s$.

(ii) Fluctuations. The above criterion also suggests that if the specific binding affinity is very large, efficient selection might be achieved with a minimal investment of protein; i.e., with a very small value of ($R_T - D_s$). In the limit, this means that the total repressor concentration could be effectively equal to the total operator concentration. However, this situation would not be stable with respect to small fluctuations in repressor number, since derepression could result from the loss of a single repressor molecule. If the average number of repressor molecules per cell is $\bar{r}_T$, the magnitude of natural fluctuations is expected to be at least $\pm(\bar{r}_T)^{1/2}$ and possibly much larger (24). If the average (total) number of specific sites in the cell is $\bar{n}_s$, we expect that $\bar{r}_T - \bar{n}_s > 2(\bar{r}_T)^{1/2}$, i.e., $\bar{r}_T \geq \bar{n}_s + 2[1 + (1 + \bar{n}_s)^{1/2}]$, to keep protein numbers as low as possible without substantial loss of binding in some cells due to random fluctuations in protein number.

Because these fluctuation considerations place a lower limit on the number of regulatory protein molecules per cell, they also set an upper limit for $K_s$ in a given titration situation. Thus the effective (buffered) specific binding constant [$K_s/(1 + K_{ns}D_{ns})$] in vivo cannot usefully be much larger than $x N_A V_{cell}/\{2[1 + (1 + \bar{n}_s)^{1/2}]\}$, where $N_A$ is Avogadro's number and $V_{cell}$ is the effective cell volume. For the lac system, if $x = 10^{3}$, $V_{cell} = 10^{-15}$ liter, and $D_{ns}K_{ns} \approx 10$ (see ref. 23), then $K_s$ is expected to be of the order of $10^{12}$ M$^{-1}$, in reasonable agreement with previous estimates (2, 15, 23).
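The fluctuation bound and the resulting ceiling on the specific binding constant can be worked through numerically. The sketch below assumes a selection factor of 10^3 (the value quoted earlier for the lac system), a cell volume of 10^-15 liter, a nonspecific buffering term of 10, and a single operator per cell; the function names are hypothetical.

```python
import math

AVOGADRO = 6.02214076e23


def min_repressors(n_s: float) -> float:
    """Smallest average repressor number r satisfying r - n_s >= 2*sqrt(r),
    i.e. r >= n_s + 2*[1 + sqrt(1 + n_s)]."""
    return n_s + 2.0 * (1.0 + math.sqrt(1.0 + n_s))


def max_effective_k(x: float, v_cell_liters: float, n_s: float) -> float:
    """Upper limit on the effective constant K_s / (1 + K_ns*D_ns), in M^-1."""
    return x * AVOGADRO * v_cell_liters / (2.0 * (1.0 + math.sqrt(1.0 + n_s)))


# Assumed inputs: x = 1e3, V_cell = 1e-15 liter, K_ns*D_ns = 10, one operator.
n_s = 1.0
r_min = min_repressors(n_s)
k_eff_max = max_effective_k(1e3, 1e-15, n_s)
k_s_max = k_eff_max * (1.0 + 10.0)

print(f"minimum repressors per cell ~ {r_min:.2f}")
print(f"effective K ceiling ~ {k_eff_max:.2e} M^-1, K_s ceiling ~ {k_s_max:.2e} M^-1")
```

Under these assumptions the ceiling on K_s is of order 10^12 M^-1, matching the order of magnitude stated in the text.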

(iii) Biological considerations. The specific binding constant cannot usefully be set so high that the half-time for dissociation exceeds the relevant cell cycle times needed to achieve DNA replication, organelle duplication, and cell division, though the cell may devise special allosteric mechanisms (such as inducer binding) to lower $K_s$ into a biologically acceptable range. Also, while effective target binding requires $x \gg 1$, we expect evolutionary selection to reduce $x$ (and thus also $K_s$ and $R_T$) to as small a value as regulatory requirements permit (to be discussed in greater detail in a future publication).

c. Parameters for Pseudo- and Nonspecific Binding.

(i) Specific binding constants as a function of the number of wrong base pairs per regulatory site. If the regulatory target site consists of n specified base pairs placed in the correct (contiguous or noncontiguous) sequence, then we may ask (i) how many nonspecific sites are there that differ from the canonical sequence in containing j (1, 2, 3 . . .) incorrect base pairs in the various positions defining the site and (ii) how much is the free energy of binding to each class of such progressively more incorrect sites decreased by each mispairing?

The number [$f_n(j)$] of competing nonspecific sites of length n that differ from the canonical sequence in j positions is easily obtained by a simple combinatorial calculation (3). Thus,

$$f_n(j) = 2N \binom{n}{j} \left(\frac{1}{4}\right)^{n-j} \left(\frac{3}{4}\right)^{j} \qquad [5]$$

where $f_n(j)$ is the expected frequency of random occurrence of a particular sequence for a genome of composition A = T = G = C, N is the size of the genome (in base pairs), and $\binom{n}{j}$ is the combinatorial factor $n!/[j!(n - j)!]$. (The factor 2 comes in because sequences in double-stranded DNA can be read by the protein along either strand.) Clearly, the number of sequences for which j = 1 (for n = 12) is expected to be 36 times the number of canonical sequences for which j = 0, whereas the number of sequences with j = 2 is 594 times the canonical number.

The effects of these increasingly incorrect sites on the level of functional titration (activation or repression) of the regulatory target site(s) depend, of course, not only on their numbers, but also on their relative (to $K_s$) binding constants for the regulatory protein, and these binding constants are harder to estimate. A useful first approximation is to assume that each mispairing results in a constant decrement in binding free energy of the complex from that characteristic of the canonical interaction; i.e., that the specific part of the binding free energy is equipartitioned between all the base-pair positions involved in recognition.
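The combinatorial counts quoted for Eq. 5 can be reproduced directly. This is a small sketch under the same random-genome assumption; the value 2N = 10^7 is taken from the number of potential sites quoted earlier, and the function name is hypothetical.

```python
from math import comb


def expected_sites(two_n: float, n: int, j: int) -> float:
    """Eq. 5: expected number of sites of length n that differ from the
    canonical sequence at exactly j positions, for A = T = G = C.
    `two_n` is 2N, the number of candidate positions on both strands."""
    return two_n * comb(n, j) * 0.25 ** (n - j) * 0.75 ** j


TWO_N = 1e7   # assumption: ~1e7 potential sites, as quoted in the text
n = 12
canonical = expected_sites(TWO_N, n, 0)
for j in range(4):
    ratio = expected_sites(TWO_N, n, j) / canonical
    print(f"j = {j}: {expected_sites(TWO_N, n, j):10.2f} sites, "
          f"{ratio:.0f} x canonical")
```

The ratios for j = 1 and j = 2 come out as 36 and 594, the factors given in the text, and summing over all j recovers the total number of candidate positions.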

If each misplaced specific base-pair interaction reduces the binding constant by a factor $1/d$ (where $d$ may range between 5 and 100), a site with j substitutions relative to the canonical sequence will have a binding constant

$$K_j = K_s d^{-j} \qquad [6]$$

For a protein with two binding modes (specific and nonspecific, as discussed above) the effective binding constant can be expressed (for $K_s \gg K_{ns}$) as

$$K_j = K_{ns} + K_s d^{-j} \qquad [7]$$

When $j > \ln(K_s/K_{ns})/\ln d$, the protein will be bound in the nonspecific mode (with a binding constant of $K_{ns}$). From Eq. 7, the discrimination ratio for a site with j substitutions is $K_s/K_j < d^{j}$. Thus, equilibrium discrimination is always less effective when the protein displays two binding modes.
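The crossover between the specific and nonspecific binding modes in Eq. 7 can be illustrated with round numbers. In this sketch K_s = 10^12 M^-1 follows the order of magnitude given in the text, whereas K_ns = 10^5 M^-1 and d = 30 are assumptions chosen only for illustration (d lies within the 5 to 100 range mentioned above); the function names are hypothetical.

```python
import math


def effective_k(j: int, k_s: float, k_ns: float, d: float) -> float:
    """Eq. 7: effective binding constant for a site with j wrong base pairs
    when the protein has both a specific and a nonspecific binding mode."""
    return k_ns + k_s * d ** (-j)


def crossover_mismatches(k_s: float, k_ns: float, d: float) -> float:
    """Number of mismatches beyond which the nonspecific mode dominates."""
    return math.log(k_s / k_ns) / math.log(d)


K_S, K_NS, D = 1e12, 1e5, 30.0   # illustrative values, see lead-in

print(f"crossover at j ~ {crossover_mismatches(K_S, K_NS, D):.1f}")
for j in (1, 3, 5, 8, 12):
    ratio = K_S / effective_k(j, K_S, K_NS, D)
    print(f"j = {j:2d}: K_s/K_j = {ratio:.3g}  (d^j = {D ** j:.3g})")
```

For these values the crossover falls between four and five mismatches, and the discrimination ratio levels off near K_s/K_ns instead of continuing to grow as d^j.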

(ii) Perturbation of specific site selection by pseudospeci-fic and nonspecific sites. With the pseudosite distribution ofEqs. 5 and 7, the specific binding relation from Eq. 3 can berewritten as

RT Ds +X

(1 + Dns Kns)1+ K1xKs

+ D ~ns [JO 1 +d /x [81

This approximation will be valid when x << KS/KDS, whichshould hold since otherwise all nonspecific sites would act asstrong pseudosites. This equation expresses the necessarybalance between site size n (specification) and the discrimi-

nation factors (d and K/lKn,) to achieve the required selec-tion (x) at a total protein concentration RT. It may also benecessary to multiply Dn,by a fractional factor to account forthe fact that some of the nonspecific sites may be covered byother proteins or be otherwise structurally inaccessible (D.Noble, D. Forbes, M. Schmid, and P.H.vH., unpublishedresults).

(iii) Overspecification. von Hippel (3) has pointed out that the value of n determined by strictly combinatorial calculations represents a minimum binding site specification and that a larger number of recognition elements may be required in real cases to decrease the probability that occupancy of strong pseudosites will "swamp" the titration of the specific regulatory site, using up all the protein and resulting in effective (physiological) derepression. Thus, a value of n greater than that needed to reduce fd (the expected frequency of reoccurrence per genome of the canonical sequence) just to unity may be required, since increasing this number will decrease the frequency of strong pseudosites as well. The level of overspecification needed to achieve this objective will depend on the magnitude of the decrease in total binding free energy per wrong base pair.

For a given selection factor x, the average number of protein molecules per cell "lost" to nonproductive binding in the specific binding mode at pseudosites is

$$\bar{m}_s = 2N\left(\tfrac{1}{4}\right)^{n}\left[\sum_{j=0}^{n}\binom{n}{j}\frac{3^{j}}{1 + d^{j}/x}\right] \qquad [9]$$

$\bar{m}_s$ has been plotted in Fig. 1 as a function of site size n for various values of x and d. Obviously, to satisfy Eq. 8, $\bar{m}_s$ must be smaller than RT; otherwise the required selection (x) cannot be established regardless of the strength of the specific binding constant. Fig. 1 shows that unless the average discrimination factor d for each misplaced base pair is very large, the site size required to specify a binding sequence will be substantially larger than the minimal estimate of n = 12; i.e., the site must be appreciably overspecified even if only a small fraction of the competitive sites are accessible for binding.
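The dependence of the pseudosite loss on site size can be checked with a few lines of Python. The sketch below assumes that the loss has the form 2N (1/4)^n multiplied by the sum over j = 0..n of C(n, j) 3^j / (1 + d^j/x), with 2N = 10^7 and equal base composition as in Fig. 1. The function names are hypothetical, and the search simply returns the first n whose loss falls within a chosen budget of protein molecules.

```python
from math import comb

def pseudosite_loss(n, x, d, two_N=1e7):
    """Average number of protein molecules per cell bound at pseudosites
    in the specific mode, for random sequence with A = T = G = C."""
    prefactor = two_N * 0.25 ** n
    total = sum(comb(n, j) * 3 ** j / (1.0 + d ** j / x) for j in range(n + 1))
    return prefactor * total

def minimal_site_size(budget, x, d, two_N=1e7, n_max=40):
    """Smallest n (up to n_max) whose pseudosite loss does not exceed budget."""
    for n in range(1, n_max + 1):
        if pseudosite_loss(n, x, d, two_N) <= budget:
            return n
    return None

# Conditions of line d in Fig. 1: x = 10^3, d = 30.
for n in range(12, 19):
    print(n, round(pseudosite_loss(n, x=1e3, d=30), 2))
print("minimal n for a budget of 5 molecules:", minimal_site_size(5, x=1e3, d=30))
```

Under these assumptions the search gives n = 16 for a budget of 5 molecules, and a site of size 13 loses more than 100 molecules, in line with the example quoted in the legend of Fig. 1.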

(iv) Assumptions. The assumptions made in the above treatment should be clearly borne in mind. We have assumed that all noncanonical sites will occur at random; i.e., that strong pseudosites are not selected against. In addition, in developing Eqs. 6-9, we have assumed that the free energy loss for each substituted base pair is additive. This may not be strictly true if recognition reactions include a strong dependence on local (sequence-dependent) DNA conformation differences (see section 2b). Finally, it is clear that in the general case neither base-pair substitutions at different positions in the binding site nor different substitutions at the same position will necessarily reduce the specific binding affinity by a constant factor, d (see refs. 7, 17, 25, and 26). Nevertheless, Eqs. 8 and 9 do provide a reasonably good first approximation to the general case if d is a geometric average; i.e., if ln d corresponds to the average free-energy loss for each possible substitution (to be described elsewhere).
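The geometric-average prescription can be made concrete with a short Python sketch. It assumes a hypothetical list of per-substitution reduction factors and uses the standard relation between a binding-constant ratio and a free-energy difference (RT ln d, where RT here is the gas constant times the absolute temperature). The factor values, the function names, and the temperature are illustrative assumptions only.

```python
import math

GAS_CONSTANT = 1.987e-3  # kcal / (mol K)

def geometric_mean_discrimination(factors):
    """Return d such that ln d is the mean of ln d_i over all substitutions."""
    return math.exp(sum(math.log(f) for f in factors) / len(factors))

def free_energy_loss(d, temperature=298.15):
    """Free-energy loss (kcal/mol) corresponding to a reduction factor d."""
    return GAS_CONSTANT * temperature * math.log(d)

# Hypothetical reduction factors for individual substitutions.
factors = [5.0, 20.0, 45.0, 100.0]

d_avg = geometric_mean_discrimination(factors)
print(f"geometric-average d = {d_avg:.1f}")
print(f"average free-energy loss = {free_energy_loss(d_avg):.2f} kcal/mol")
```

Because the average is taken over ln d, the mean free-energy loss of the individual substitutions equals the free-energy loss computed from the averaged d, which is the sense in which a single d can stand in for a heterogeneous set of substitutions.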

[Fig. 1: plot; horizontal axis, n (Specified Base Pairs/Binding Site).]

FIG. 1. Estimate of the average number of protein molecules per cell ($\bar{m}_s$) bound to pseudosites as a function of the canonical site size n (number of specified base pairs per regulatory site) for E. coli with 2N ≈ 10^7 (N = number of base pairs per genome), and total base composition A = T = G = C, for various selection (x) and discrimination (d) factors. Line a, d = ∞; line b, x = 10^3 and d = 10^3; line c, x = 10^3 and d = 100; line d, x = 10^3 and d = 30; line e, x = 10^4 and d = 30; line f, x = 10^3 and d = 10; line g, x = 10^4 and d = 10. For example, this graph shows that if one can "afford" to lose 5 protein molecules per cell onto pseudosites, then for x = 10^3 and d = 30 (line d), the situation that we believe applies to the E. coli lac system in vivo, a minimum canonical site size (n) of 16 must be specified. Alternatively, for a site size (n) of 13, more than 100 protein molecules will be lost onto pseudosites.

DISCUSSION

In this paper, we have summarized a series of approaches to the specificity of protein-nucleic acid interactions involved in the regulation (at the DNA level) of biological function. The ultimate manifestation of regulatory specificity must be the degree to which a specific biological process (e.g., the transcription of a particular gene or the translation of a particular protein) is "turned up" or "turned down" as a consequence of a specific DNA-protein interaction. Even though a given process of gene expression may represent the end-product of a lengthy series of preequilibrium, steady-state, or kinetically controlled steps, over an appreciable range of rates and concentrations most regulatory processes will reflect directly the equilibrium extent of specific DNA-target saturation by the relevant binding protein(s). It is this level of specificity with which we are concerned here.

To discuss this problem with precision, we have attempted to differentiate the various levels of specificity with an appropriate terminology. Thus, specification refers to the length (in base pairs) of the sequence actually involved in specifying the target binding site (ref. 3, see also ref. 27). Recognition is defined by the physicochemical mechanisms that actually control the specificity of the interactions. Discrimination (or selectivity) refers to the thermodynamics of the interactions involved and is determined by the differences in affinity of the protein for the various DNA targets among which the protein is distributed. Thus the discrimination ratio for pairs of binding sites is a ratio of binding constants. Finally, selection (in the equilibrium case), or the final level of biological expression (which reflects regulatory site saturation), is determined by the effective binding relation for the whole system of proteins and DNA binding sites.

This entire hierarchy of specificity criteria must be considered in examining many issues related to functional specificity, including optimization of the design of regulatory proteins and DNA target sites, the development of purification procedures and binding assays for regulatory proteins, and the evaluation of binding selectivities and affinities in vivo.

This research has been supported by Public Health Service Research Grants GM15792 and GM29158 (to P.H.vH.), and by partial salary support from the Swedish Natural Science Research Council (to O.G.B.).

1. Dickson, R. C., Abelson, J. N., Barnes, W. M. & Reznikoff, W. S. (1975) Science 182, 27-32.

2. von Hippel, P. H., Revzin, A., Gross, C. A. & Wang, A. C. (1974) Proc. Natl. Acad. Sci. USA 71, 4808-4812.

3. von Hippel, P. H. (1979) in Biological Regulation and Development, ed. Goldberger, R. F. (Plenum, New York), Vol. 1, pp. 279-347.

4. Seeman, N. C., Rosenberg, J. M. & Rich, A. (1976) Proc. Natl. Acad. Sci. USA 73, 804-808.

5. Woodbury, C. P., Jr., & von Hippel, P. H. (1981) in The Restriction Enzymes, ed. Chirikjian, J. (Elsevier, Amsterdam), Vol. 1, pp. 181-207.

6. von Hippel, P. H., Bear, D. G., Winter, R. B. & Berg, O. G. (1982) in Promoters: Structure and Function, eds. Rodriguez, R. & Chamberlin, M. (Praeger, New York), pp. 3-33.

7. Ohlendorf, D. H., Anderson, W. F., Fisher, R. G., Takeda, Y. & Matthews, B. W. (1982) Nature (London) 298, 718-723.

8. Pabo, C. O. & Sauer, R. T. (1984) Annu. Rev. Biochem. 53, 293-321.

9. Melchior, W. B., Jr., & von Hippel, P. H. (1973) Proc. Natl. Acad. Sci. USA 70, 298-302.

10. Kopka, M. L., Yoon, C., Goodsell, D., Pjura, P. & Dickerson, R. E. (1985) Proc. Natl. Acad. Sci. USA 82, 1376-1380.

11. deHaseth, P. L., Lohman, T. M. & Record, M. T., Jr. (1977) Biochemistry 16, 4783-4790.

12. Revzin, A. & von Hippel, P. H. (1977) Biochemistry 16, 4769-4776.

13. Kowalczykowski, S. C., Lonberg, N., Newport, J. W. & von Hippel, P. H. (1981) J. Mol. Biol. 145, 75-104.

14. Weintraub, H. & Groudine, M. (1976) Science 193, 848-856.

15. Record, M. T., Jr., deHaseth, P. L. & Lohman, T. M. (1977) Biochemistry 16, 4791-4796.

16. Winter, R. B. & von Hippel, P. H. (1981) Biochemistry 20, 6948-6960.

17. Goeddel, D. V., Yansura, D. G. & Caruthers, M. H. (1978) Proc. Natl. Acad. Sci. USA 75, 3578-3582.

18. deHaseth, P. L., Lohman, T. M., Record, M. T., Jr., & Burgess, R. R. (1978) Biochemistry 17, 1612-1622.

19. Fairfield, F. R., Newport, J. W., Dolejsi, M. K. & von Hippel, P. H. (1983) J. Biomol. Struct. Dyn. 1, 715-727.

20. Winter, R. B., Berg, O. G. & von Hippel, P. H. (1981) Biochemistry 20, 6961-6977.

21. Berg, O. G., Winter, R. B. & von Hippel, P. H. (1982) Trends Biochem. Sci. 7, 52-55.

22. Lin, S.-Y. & Riggs, A. D. (1975) Cell 4, 107-111.

23. Kao-Huang, Y., Revzin, A., Butler, A. P., O'Connor, P., Noble, D. & von Hippel, P. H. (1977) Proc. Natl. Acad. Sci. USA 74, 4228-4232.

24. Berg, O. G. (1978) J. Theor. Biol. 71, 587-603.

25. Jobe, A., Sadler, J. R. & Bourgeois, S. (1974) J. Mol. Biol. 85, 231-248.

26. Mossing, M. C. & Record, M. T., Jr. (1985) J. Mol. Biol. 186, 295-305.

27. Schneider, T., Stormo, G., Gold, L. & Ehrenfeucht, A. (1986) J. Mol. Biol., in press.

28. Yarus, M. (1969) Annu. Rev. Biochem. 38, 841-880.

29. von Hippel, P. H. & McGhee, J. D. (1972) Annu. Rev. Biochem. 41, 231-300.
