Nucleosome Occupancy Information Improves de novo Motif Discovery

Leelavati Narlikar*, Raluca Gordân*, and Alexander J. Hartemink

Department of Computer Science, Duke University, Durham, NC 27708-0129
{lee,raluca,amink}@cs.duke.edu

Abstract. A complete understanding of transcriptional regulatory processes in the cell requires identification of transcription factor binding sites on a genome-wide scale. Unfortunately, these binding sites are typically short and degenerate, posing a significant statistical challenge: many more matches to known transcription factor binding sites occur in the genome than are actually functional. Chromatin structure is known to play an important role in guiding transcription factors to those sites that are functional. In particular, it has been shown that active regulatory regions are usually depleted of nucleosomes, thereby enabling transcription factors to bind DNA in those regions [1]. In this paper, we describe a novel algorithm which employs an informative prior over DNA sequence positions based on a discriminative view of nucleosome occupancy; the nucleosome occupancy information comes from a recently published computational model [2]. When a Gibbs sampling algorithm with our informative prior is applied to yeast sequence-sets identified by ChIP-chip [3], the correct motif is found in 50% more cases than with an uninformative uniform prior. Moreover, if nucleosome occupancy information is not available, our informative prior reduces to a new kind of prior that can exploit discriminative information in a purely generative setting.

1 Introduction

Finding functional DNA binding sites of transcription factors (TFs) on a genome-wide scale is a crucial step in understanding transcriptional regulation. Despite an explosion of data about TF binding from high-throughput technologies like ChIP-chip [3, 4, and many more], DIP-chip [5], PBM [6], and gene-expression arrays [7, 8, and many more], de novo motif finding remains a difficult problem. The fundamental reason for this is that the binding sites of most TFs are short, degenerate sequences which occur frequently in the genome by chance. The ‘signal’ of functional sites (which are bound in vivo) is overwhelmed by the ‘noise’ due to the non-functional sites (which are not bound in vivo). Distinguishing functional sites from non-functional ones, and inferring the true motif recognized by the TF, is thus a challenge.

Many probabilistic methods have been developed to tackle the problem of motif discovery [9, 10]. The standard approach is to look for a pattern common to the bound sequences that is statistically enriched with respect to the background distribution of all intergenic sequences. If, in addition to the set of bound sequences, a set of unbound sequences is available, a stronger criterion might insist that the pattern be able to discriminate between the two sets [11, 12, 13, 14, 15]. Unfortunately, due to the low signal-to-noise ratio of binding sites mentioned earlier, these methods generally suffer from low specificity and sensitivity [16].

* These authors contributed equally to this work.

T. Speed and H. Huang (Eds.): RECOMB 2007, LNBI 4453, pp. 107–121, 2007. © Springer-Verlag Berlin Heidelberg 2007

Often, DNA sequences that match known TF motifs do not appear to be functional in vivo TF binding sites. One explanation is that not all parts of the genome are equally accessible to TFs in vivo. In particular, since DNA is wound over histone octamers called nucleosomes, the positioning of these nucleosomes provides a possible mechanism for differential access of TFs to potential binding sites [1, 2, 17, 18, 19, 20]. Our goal in this paper is to leverage knowledge about nucleosome positioning to improve de novo motif finding.

If we knew exactly what parts of the genome were occupied by nucleosomes in the exact environmental conditions for which we have in vivo TF binding data, we could bias our search for TF binding sites to the areas that are free of nucleosomes. Unfortunately, no high-resolution nucleosome occupancy data is available for any organism on a whole-genome scale. In the case of yeast, Yuan et al. [20] have reported high-resolution nucleosome occupancy data using tiling arrays, but only for chromosome III. On the other hand, Lee et al. [1] have published occupancy data for the whole genome, but it is of low resolution: they report only the average occupancy over each intergenic region.

Recently, Segal et al. [2] developed a computational model based on high-quality experimental nucleosome binding data to calculate the average nucleosome occupancy at each nucleotide position in the yeast and chicken genomes. This occupancy is purported to be intrinsic to the DNA sequence, and hence independent of in vivo conditions. The authors claim that their predictions explain about 50% of observed in vivo nucleosome positions. Here, we use predictions from their model to build informative priors over DNA sequence positions that can be used to improve the accuracy of motif finding. We formulate two different nucleosome occupancy priors: the first is based directly on the predictions of Segal et al., while the second adopts a discriminative perspective, comparing nucleosome occupancy in bound versus unbound sequences. When nucleosome occupancy information is not available, the first prior simplifies to an uninformative uniform prior, but the second simplifies to a new kind of informative prior that can exploit discriminative information in a purely generative setting. This represents a novel approach to discriminative motif discovery that retains the computational benefits of a generative formulation. As we shall see, each of our three informative priors improves upon the uninformative uniform prior.

We choose Gibbs sampling as the search method in our algorithms, but in principle, the priors can be used with any search strategy. Our choice of a position specific scoring matrix (PSSM) [21] as a model for the motif is also arbitrary since our priors can be applied while learning any type of motif model. The purpose of this paper is not to demonstrate the benefits of one search strategy over another or one motif model over another, but to demonstrate the utility of nucleosome occupancy data in constructing informative priors for motif discovery.


2 Motif Discovery

In this section, we describe the popular generative formulation of the problem of motif discovery, derive the objective function we seek to optimize, and explain the search methodology that we use to optimize this objective function.

2.1 Sequence Model and Objective Function

Assume we have $n$ DNA sequences $X_1$ to $X_n$ believed to be commonly bound by some TF. Although in reality a sequence might have multiple binding sites, for simplicity we model only one binding site in each sequence. Because the experimental data might be erroneous, we also model the possibility of some sequences not having any binding site. This is analogous to the zero or one occurrence per sequence (ZOOPS) model in MEME [22]. Let $Z$ be a vector of length $n$ denoting the starting location of the binding site in each sequence: $Z_i = j$ if there is a binding site starting at location $j$ in $X_i$, and we adopt the convention that $Z_i = 0$ if there is no binding site in $X_i$. We assume that the TF motif can be modeled as a PSSM of length $W$ while the rest of the sequence follows some background model parameterized by $\phi_0$. The PSSM can be described by a matrix $\phi$ where $\phi_{a,b}$ is the probability of finding base $b$ at location $a$ within the binding site for $1 \le b \le 4$ and $1 \le a \le W$.

Thus if the sequence $X_i$ is of length $m_i$, and $X_i$ contains a binding site at location $Z_i$, we can compute the probability of the sequence given the model parameters as:

$$P(X_i \mid \phi, Z_i > 0, \phi_0) = P(X_{i,1}, \ldots, X_{i,Z_i-1} \mid \phi_0) \times \left( \prod_{k=1}^{W} \phi_{k,\,X_{i,Z_i+k-1}} \right) \times P(X_{i,Z_i+W}, \ldots, X_{i,m_i} \mid \phi_0)$$

and if it instead does not contain a binding site as:

$$P(X_i \mid \phi, Z_i = 0, \phi_0) = P(X_{i,1}, X_{i,2}, \ldots, X_{i,m_i} \mid \phi_0)$$
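The two likelihood expressions above can be illustrated with a minimal Python sketch. It assumes a zero-order background (independent base probabilities), which is a simplification chosen for brevity and not the background model used in the experiments; the function name `log_likelihood` and its argument layout are hypothetical.

```python
import math

def log_likelihood(seq, pssm, z, bg):
    """Log P(X_i | phi, Z_i = z, phi_0) for one sequence.

    seq  : DNA string over the alphabet ACGT
    pssm : list of W dicts; pssm[a][base] plays the role of phi_{a+1, base}
    z    : 1-based start of the binding site, or 0 for "no site"
    bg   : dict base -> probability (zero-order background; an assumption)
    """
    W = len(pssm)
    if z < 0 or (z > 0 and z + W - 1 > len(seq)):
        raise ValueError("site does not fit in the sequence")
    total = 0.0
    for pos, base in enumerate(seq, start=1):
        if z > 0 and z <= pos < z + W:
            # Position lies inside the site: use the PSSM column.
            total += math.log(pssm[pos - z][base])
        else:
            # Position lies outside the site: use the background.
            total += math.log(bg[base])
    return total
```

Working in log space avoids numerical underflow on sequences that are hundreds of nucleotides long.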

We wish to find $\phi$ and $Z$ that maximize the joint posterior distribution of all the unknowns given the data. Assuming priors $P(\phi)$ and $P(Z)$ over $\phi$ and $Z$ respectively, our objective function is:

$$\arg\max_{\phi, Z} P(\phi, Z \mid X, \phi_0) = \arg\max_{\phi, Z} \Big( P(X \mid \phi, Z, \phi_0)\, P(\phi)\, P(Z) \Big) \tag{1}$$

2.2 Optimization Strategy and Scoring Scheme

As others before us have done, we use Gibbs sampling to sample repeatedly from the posterior over $\phi$ and $Z$ with the hope that we are likely to visit those values of $\phi$ and $Z$ with the highest posterior probability. Gibbs sampling is a Markov chain Monte Carlo (MCMC) method that approximates sampling from a joint posterior distribution by sampling iteratively from individual conditional distributions [23]. Applying the collapsed Gibbs sampling strategy developed by Liu [24] for faster convergence, we can integrate out $\phi$ and sample only the $Z_i$. This results in the following expression for sampling $Z_i$ from its conditional distribution assuming the prior on $Z$ to be independent of the PSSM parameters $\phi$:

$$P(Z_i \mid Z_{[-i]}, X, \phi_0) = \frac{P(Z \mid X, \phi_0)}{P(Z_{[-i]} \mid X, \phi_0)} = \frac{P(Z) \int_{\phi} P(X \mid \phi, Z, \phi_0)\, P(\phi)\, d\phi}{P(Z_{[-i]}) \int_{\phi} P(X \mid \phi, Z_{[-i]}, \phi_0)\, P(\phi)\, d\phi}$$

where $Z_{[-i]}$ is the vector $Z$ without $Z_i$. Proceeding analogously to the derivation of Liu [24], we compute the integrals using a Dirichlet prior on $\phi$. We further simplify the sampling expression by dividing it by $P(Z_i = 0, X_i \mid \phi_0)$, which is a constant at a particular sampling step. This results in the following sampling distribution for a particular location $j$ within sequence $X_i$, similar to the predictive update formula as described in [25]:

$$P(Z_i = j \mid Z_{[-i]}, X, \phi_0) = \frac{P(Z_i = j) \times \left( \prod_{a=1}^{W} \phi_{a,\,X_{i,j+a-1}} \right)}{P(Z_i = 0) \times P(X_{i,j}, \ldots, X_{i,j+W-1} \mid \phi_0)} \tag{2}$$

for $1 \le j \le m_i - W + 1$, and

$$P(Z_i = j \mid X, \phi_0) = 1 \tag{3}$$

for $j = 0$, where $\phi$ is calculated from the counts of the sites contributing to the current alignment $Z_{[-i]}$, plus the pseudocounts as determined by the Dirichlet prior. More details are provided in [26].
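As a rough illustration of one sampling step, the sketch below evaluates the weights of equations (2) and (3) for a single sequence and draws a site from them. It is a sketch under stated assumptions, not the PRIORITY implementation: the background is zero-order, the pseudocount value is arbitrary, and the helper names `estimate_phi`, `sampling_weights`, and `sample_site` are hypothetical.

```python
import random

def estimate_phi(sites, W, pseudocount=0.5):
    """PSSM from the sites currently aligned in the other sequences,
    plus pseudocounts standing in for the Dirichlet prior (value assumed)."""
    phi = []
    for a in range(W):
        counts = {b: pseudocount for b in "ACGT"}
        for site in sites:
            counts[site[a]] += 1
        total = sum(counts.values())
        phi.append({b: c / total for b, c in counts.items()})
    return phi

def sampling_weights(seq, phi, prior, bg):
    """Weights for Z_i = 0, 1, ..., m_i - W + 1 as in eqs. (2) and (3).

    prior[j] is P(Z_i = j), with index 0 meaning "no site";
    bg is a zero-order background (an assumption of this sketch).
    """
    W = len(phi)
    weights = [1.0]  # eq. (3): the "no site" outcome has weight 1
    for j in range(1, len(seq) - W + 2):
        ratio = prior[j] / prior[0]
        for a in range(W):
            base = seq[j + a - 1]  # X_{i, j+a} in 1-based notation
            ratio *= phi[a][base] / bg[base]
        weights.append(ratio)
    return weights

def sample_site(weights, rng=random):
    """Draw Z_i with probability proportional to the weights."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]
```

A full sampler would loop over the sequences, remove sequence $i$'s current site from the alignment, call these three helpers, and repeat for the chosen number of iterations.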

The joint posterior distribution after each iteration can be calculated as:

$$P(\phi, Z \mid X, \phi_0) \propto P(X \mid \phi, Z, \phi_0) \times P(\phi) \times P(Z) \tag{4}$$

To simplify computation, we divide the above expression by the constant probability $P(X \mid Z = 0, \phi_0)$ and use the logarithm of the resulting value as a score for the motif.

To maximize the objective function and hence the score, we run the Gibbs sampler for a predetermined number of iterations after apparent convergence to the joint posterior and output the highest scoring PSSM at the end. We report only a single motif $\phi$ to enable us to evaluate the algorithm and compare it with other popular methods. In principle, however, since we are using an MCMC sampling method, we could instead perform Bayesian model averaging over many samples from the posterior and report a mean motif (or multiple motifs if there are multiple modes in the distribution).

3 Informative Positional Priors for Motif Discovery

The basic Gibbs sampling approach mentioned above has been used in several motif finders, often with additional parameters and heuristics [27, 28, 29]. However, all these methods use an uninformative prior over the locations $Z$ at which the TF is supposed to bind within the DNA sequences. In a recent paper, we showed how information about the TF’s structural class could be leveraged to produce informative priors over $Z$ that significantly help motif discovery [26]. Here, we describe other informative choices for $P(Z)$ which we will henceforth refer to as ‘positional priors’. We introduce a prior N based solely on nucleosome occupancy, a prior DN incorporating nucleosome occupancy information from both bound as well as unbound sequences, and a discriminative prior D, which is a special case of DN when nucleosome occupancy information is unavailable. To assess the utility of these priors, we compare their performance to the performance of an uninformative uniform prior U, keeping all other aspects of the algorithm identical.

3.1 Building a Positional Prior

The four positional priors mentioned above can be constructed in a similar fashion from different probabilistic scores. We use the term ‘probabilistic score’ in the remainder of the paper to denote the probability of a particular $W$-mer being a binding site of transcription factor $T$: $S_{i,j} = P(X^W_{i,j} \text{ is a binding site of } T)$, where $X^W_{i,j}$ denotes the $W$-mer $X_{i,j} X_{i,j+1} \cdots X_{i,j+W-1}$.

For each sequence $X_i$, we wish to define a prior probability distribution over all possible starting locations $j$ of a binding site in that sequence, i.e. $P(Z_i = j)$. We notice that the values $S_{i,j}$ themselves do not define a probability distribution over $j$, because they may not sum to 1. As mentioned in Section 2.1, we model each sequence $X_i$ as containing at most one binding site of $T$. If $X_i$ has no binding site, none of the positions of $X_i$ can be the starting location of a binding site of $T$, so it must be that:

$$P(Z_i = 0) \propto \prod_{u=1}^{m_i-W+1} (1 - S_{i,u}) \tag{5}$$

On the other hand, if $X_i$ has one binding site at position $j$, not only must a binding site start at location $j$ but also no binding site should start at any of the other locations in $X_i$. Formally, we write:

$$P(Z_i = j) \propto S_{i,j} \prod_{\substack{u=1 \\ u \ne j}}^{m_i-W+1} (1 - S_{i,u}) \quad \text{for } 1 \le j \le m_i - W + 1 \tag{6}$$

We then normalize $P(Z_i)$ assuming the same proportionality constant in (5) and (6), so that under the assumptions of our model we have:

$$\sum_{j=0}^{m_i-W+1} P(Z_i = j) = 1 \quad \text{for } 1 \le i \le n \tag{7}$$
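Equations (5) through (7) translate directly into a short routine. The sketch below is one possible way to compute the normalized prior for a single sequence; it works in log space for numerical stability and assumes every score lies strictly between 0 and 1. The function name `positional_prior` is hypothetical.

```python
import math

def positional_prior(scores):
    """Normalized P(Z_i = j) for j = 0..len(scores), following eqs. (5)-(7).

    scores[j-1] holds S_{i,j} for j = 1..m_i-W+1; each value is assumed
    to lie strictly between 0 and 1 so that both logarithms are defined.
    Index 0 of the result is the "no binding site" outcome.
    """
    # log of prod_u (1 - S_{i,u}), the unnormalized weight of Z_i = 0
    log_none = sum(math.log1p(-s) for s in scores)
    logs = [log_none]
    for s in scores:
        # Swap the factor (1 - S_{i,j}) for S_{i,j} at the chosen position.
        logs.append(log_none + math.log(s) - math.log1p(-s))
    top = max(logs)
    unnorm = [math.exp(v - top) for v in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]
```

Dividing out the common product shows that $P(Z_i = j) / P(Z_i = 0) = S_{i,j} / (1 - S_{i,j})$, so each position's prior odds against the "no site" outcome depend only on its own score.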

3.2 Uniform Prior (U)

This is the simplest form of positional prior. It is built using a uniform probabilistic score $U_{i,j}$ which assigns equal probabilities to a $W$-mer $X^W_{i,j}$ being a binding site of $T$ or not:

$$U_{i,j} = 1 - U_{i,j} = 0.5 \quad \text{for } 1 \le j \le m_i - W + 1 \tag{8}$$

If we substitute $S_{i,j}$ with $U_{i,j}$ in equations (5) and (6) and normalize $P(Z_i = j)$, we get a uniform prior U:

$$P(Z_i = j) = \frac{1}{m_i - W + 2} \quad \text{for } 0 \le j \le m_i - W + 1 \tag{9}$$
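For readers who want the intermediate step, the substitution can be written out explicitly. This is only a worked restatement of equations (5), (6), and (9):

```latex
P(Z_i = 0) \;\propto\; \prod_{u=1}^{m_i-W+1} \tfrac{1}{2} \;=\; 2^{-(m_i-W+1)},
\qquad
P(Z_i = j) \;\propto\; \tfrac{1}{2} \prod_{\substack{u=1 \\ u \ne j}}^{m_i-W+1} \tfrac{1}{2} \;=\; 2^{-(m_i-W+1)}
\quad (1 \le j \le m_i - W + 1).
```

All $m_i - W + 2$ outcomes therefore carry the same unnormalized weight, and normalizing gives $1/(m_i - W + 2)$ for each.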


3.3 Nucleosome Occupancy Prior (N)

A uniform prior is a common choice for a positional prior and most motif finding algorithms implicitly use such a prior. In reality though, as mentioned earlier, certain DNA regions are inaccessible to TFs due to the presence of nucleosomes at those locations.

We would like to bias the search in a probabilistic manner towards nucleosome-free areas. For this purpose, we use the nucleosome occupancy predicted by the computational model developed by Segal et al. [2]. This model outputs the probability $N_{i,j}$ of each nucleotide $X_{i,j}$ in the input sequences being occupied by nucleosomes (for now the model is only designed for sequences in yeast or chicken). Assuming nucleosome occupancy indicates inaccessibility, we calculate the average probability of the $W$-mer $X^W_{i,j}$ being accessible to the TF as:

$$A_{i,j} = 1 - \frac{1}{W} \sum_{k=0}^{W-1} N_{i,j+k} \tag{10}$$

Alternatively one could use the maximum instead of the average occupancy over the $W$ nucleotides when computing $A_{i,j}$, but averaging reduces the effect of outliers. Having defined $A_{i,j}$, we can now build the positional prior N as described in Section 3.1, using $A_{i,j}$ as the probabilistic score $S_{i,j}$.
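Equation (10) is a sliding-window average, which the following minimal sketch computes in a single pass over one sequence. The function name `accessibility` is hypothetical, and the per-nucleotide occupancies are assumed to be supplied as a plain list of probabilities.

```python
def accessibility(occupancy, W):
    """A_{i,j} of eq. (10) for every W-mer start in one sequence.

    occupancy[p] is the predicted nucleosome occupancy of nucleotide p+1;
    the result has one entry per start position j = 1..m_i-W+1.
    """
    m = len(occupancy)
    if W < 1 or W > m:
        raise ValueError("W must be between 1 and the sequence length")
    window = sum(occupancy[:W])  # occupancy summed over the first W-mer
    scores = [1.0 - window / W]
    for j in range(1, m - W + 1):
        # Slide right by one: add the entering base, drop the leaving one.
        window += occupancy[j + W - 1] - occupancy[j - 1]
        scores.append(1.0 - window / W)
    return scores
```

The running sum keeps the cost linear in the sequence length regardless of $W$.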

3.4 Discriminative Nucleosome Occupancy Prior (DN)

The formulation of the above probabilistic score $A_{i,j}$ has a drawback: What if a particular $W$-mer is prone to be highly accessible throughout the genome? For instance, certain promoter elements which are required for the assembly of general TFs and are not related to the specific TF in question, might be depleted of nucleosomes. The prior N, in that case, could indicate a high prior belief in that $W$-mer being a TF binding site regardless of the fact that the $W$-mer is equally accessible in the rest of the genome as in the bound set $X$.

Most large-scale high-throughput experimental methods like ChIP-chip, DIP-chip, and PBM give rise to two sets of DNA sequences: those bound by the profiled transcription factor $T$ (positive sequences $X$) and those not bound (negative sequences which we denote as $Y$). The use of the negative set along with the positive set to enrich the motif signal has been shown previously to be beneficial in improving specificity [12, 14, 15]. In the referenced methods, if a $W$-mer is present in the negative set for transcription factor $T$, it is generally treated as an instance of a non-binding site and hence, penalized. However, in an in vivo situation, a $W$-mer matching the true motif of $T$ might occur in the negative set but be inaccessible for $T$ due to the presence of a nucleosome at that position. In that case, it should not be treated as a negative data point.

Here, we devise a new discriminative prior which takes into account both these issues. For each $W$-mer $X^W_{i,j}$, we ask the following question: “Of all the accessible occurrences of this word, how many occur in the positive set?” To answer this question, we subject each accessible $W$-mer to a Bernoulli trial. Unfortunately, we cannot tell for sure whether a particular location is accessible or not, because we only know the probability that each location is accessible. Thus, we count the number of accessible sequences in expectation, by weighting each occurrence of the $W$-mer according to how accessible it is. For this purpose, we introduce two functions $r_{k,l}$ and $r'_{k,l}$ defined on the set of all possible $W$-mers $\sigma$:

$$r_{k,l}(\sigma) = \begin{cases} 1 & : X^W_{k,l} = \sigma \\ 0 & : X^W_{k,l} \ne \sigma \end{cases} \qquad \text{and} \qquad r'_{k,l}(\sigma) = \begin{cases} 1 & : Y^W_{k,l} = \sigma \\ 0 & : Y^W_{k,l} \ne \sigma \end{cases} \tag{11}$$

We now define a new probabilistic score $C_{i,j}$ as:

$$C_{i,j} = \frac{\sum_{k,l} A_{k,l}\, r_{k,l}(X^W_{i,j})}{\sum_{k,l} A_{k,l}\, r_{k,l}(X^W_{i,j}) + \sum_{k,l} A'_{k,l}\, r'_{k,l}(X^W_{i,j})} \tag{12}$$

where $A'_{i,j}$ is the accessibility score calculated for the set $Y$ analogous to the calculation of $A_{i,j}$ for $X$ in (10). Using $C_{i,j}$ as our probabilistic score $S_{i,j}$, we can now build the positional prior DN as described in Section 3.1. In practice, we notice that $C_{i,j}$ can have some false peaks due to $W$-mers that occur very rarely in the genome. In such cases, when the $W$-mer occurs in $X_i$ at some position $j$, $C_{i,j}$ becomes large due to a small denominator. This effect can be alleviated by adding pseudocounts to the expression in (12).
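One way to evaluate equation (12) is to accumulate, for every $W$-mer, the total accessibility of its occurrences in each set and then take the ratio. The sketch below does this with dictionaries. The text does not specify how pseudocounts enter, so the scheme here (adding `pseudo` to the positive total and to the negative total) is an assumption, as are the function names.

```python
from collections import defaultdict

def weighted_wmer_counts(seqs, access, W):
    """Sum of accessibility over all occurrences of each W-mer.

    access[k][l] is the accessibility of the W-mer starting at 0-based
    position l of seqs[k].
    """
    totals = defaultdict(float)
    for seq, acc in zip(seqs, access):
        for l in range(len(seq) - W + 1):
            totals[seq[l:l + W]] += acc[l]
    return totals

def discriminative_scores(pos, pos_acc, neg, neg_acc, W, pseudo=1.0):
    """C_{i,j} of eq. (12) for every W-mer start in the positive set.

    `pseudo` is added to both the positive and the negative total
    (an assumed pseudocount scheme, not taken from the paper).
    """
    p = weighted_wmer_counts(pos, pos_acc, W)
    n = weighted_wmer_counts(neg, neg_acc, W)
    scores = []
    for seq in pos:
        row = []
        for j in range(len(seq) - W + 1):
            w = seq[j:j + W]
            row.append((p[w] + pseudo) / (p[w] + n[w] + 2 * pseudo))
        scores.append(row)
    return scores
```

With this scheme a $W$-mer seen once in the positive set and never in the negative set scores well below 1, which damps the false peaks caused by rare words.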

3.5 Simple Discriminative Prior (D)

To assess the importance of incorporating nucleosome occupancy information in discriminative motif discovery, we now consider a special case of DN. We assume we have no nucleosome occupancy information, i.e., each $A_{i,j} = c$ and $A'_{i,j} = c$, where $c$ is some arbitrary constant. Equation (12) then reduces to a new probabilistic score $D_{i,j}$:

$$D_{i,j} = \frac{\sum_{k,l} r_{k,l}(X^W_{i,j})}{\sum_{k,l} r_{k,l}(X^W_{i,j}) + \sum_{k,l} r'_{k,l}(X^W_{i,j})} \tag{13}$$

In other words, we calculate the probability $D_{i,j}$ of $X^W_{i,j}$ being a binding site of $T$ as the number of occurrences of $X^W_{i,j}$ in $X$ relative to the total number of occurrences of $X^W_{i,j}$ in both sets $X$ and $Y$ without looking at accessibility. Again, we add pseudocounts while computing $D_{i,j}$ and then calculate a positional prior $P(Z_i = j)$ as described in Section 3.1 by substituting $D_{i,j}$ for $S_{i,j}$. We refer to this positional prior as D.
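The reduction from equation (12) to equation (13) is a one-line cancellation, spelled out here for completeness under the stated assumption that $c$ is a nonzero constant:

```latex
C_{i,j}\Big|_{A = A' = c}
= \frac{c \sum_{k,l} r_{k,l}(X^W_{i,j})}
       {c \sum_{k,l} r_{k,l}(X^W_{i,j}) + c \sum_{k,l} r'_{k,l}(X^W_{i,j})}
= \frac{\sum_{k,l} r_{k,l}(X^W_{i,j})}
       {\sum_{k,l} r_{k,l}(X^W_{i,j}) + \sum_{k,l} r'_{k,l}(X^W_{i,j})}
= D_{i,j}.
```

The value of $c$ therefore has no effect on the score before pseudocounts are added.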

Note that in computing D we use only the datasets $X$ and $Y$ and not any nucleosome occupancy information. Other motif discovery algorithms that make use of both $X$ and $Y$ formulate the problem in a discriminative manner, and attempt to learn a motif that appears more often in the positive set than in the negative set. Since these models optimize a discriminative objective function over the sets $X$ and $Y$, they have to deal with a large search space and typically are prone to many local optima. Such methods often require an ‘intelligent guess’ as a seed matrix to initialize the search so as to avoid poor local optima. In addition, at every step of the search algorithm, they have to evaluate the parameters of the model on each sequence in both sets. Hence, the time complexity of these algorithms is much worse compared to generative models which iterate only over the positive set. Here, however, our generative model framework remains generative and all the discriminative information is captured in our prior.

[Figure 1: line plot of $A_{i,j}$, $D_{i,j}$, and $C_{i,j}$ along the sequence; y-axis from 0 to 0.4; the Fkh2 motif is marked.]

Fig. 1. Plot of $A_{i,j}$, $D_{i,j}$, and $C_{i,j}$ used to compute the priors N, D, and DN, respectively. The x-axis represents part of an intergenic DNA region from a sequence-set for Fkh2 profiled under the YPD condition in a ChIP-chip experiment [3]. The intergenic region spans positions 770845 to 770945 in Chromosome XVI. The Fkh2 binding site shown in the figure starts at position 770887.

3.6 Informative Priors in Action

To visualize how informative priors might be helpful in identifying TF binding sites, we show in Figure 1 the values of $A_{i,j}$, $D_{i,j}$, and $C_{i,j}$ used to compute the priors N, D, and DN over a portion of a DNA sequence obtained from an Fkh2 ChIP-chip experiment. As can be seen from the figure, in this instance all three priors give a good indication of where a Fkh2 binding site is likely to exist, even before information from the likelihood is taken into account. Of course, this may not happen all the time so we use the remainder of the paper to assess more precisely the relative utility of these priors.

4 Results

We compiled ChIP-chip data published by Harbison et al. [3], who profiled the intergenic binding locations of 203 yeast TFs under various environmental conditions: YPD, and one or more of Alpha, But14, But90, H202Hi, H202Lo, Pi-, RAPA, or SM over 6140 intergenic regions. These intergenic regions range from 48 to 1553 nucleotides and have an average length of 433 nucleotides. For each TF profiled under each condition, we define its bound sequence-set to be those intergenic sequences reported to be bound with p-value < 0.001. We restrict our attention to sequence-sets of size at least 10, which yields 242 sequence-sets, encompassing 148 TFs. Of these sequence-sets, 156 correspond to the 80 TFs with a consensus binding motif in the literature (as summarized by Harbison et al. at the time their paper was published, or as earlier reported by Dorrington and Cooper [30] or Jia et al. [31]), and these 156 are used throughout the remainder of the paper to compare the performance of various motif finding algorithms.


We incorporate the U, N, D, and DN priors into our Gibbs sampling framework (implemented in PRIORITY [26]) and refer to the resulting algorithms as PRI-U, PRI-N, PRI-D, and PRI-DN, respectively. For evaluation purposes, we fix the motif-width $W$ to 8 in all our runs, although in practice one could certainly explore more values of $W$. As a background model, we use a third order Markov model trained on all intergenic regions in yeast. We run each algorithm 10 times from different random starting points for each sequence-set for 10,000 sampling iterations and report the top-scoring motif among the 10 runs. We consider an algorithm to be successful for a sequence-set only if the top-scoring motif matches the literature consensus for the corresponding TF. We use a variation of the inter-motif distance measure described by Harbison et al. and consider a motif learned by an algorithm to be correct if it is at a distance less than 0.25 from the literature consensus.¹ Different distance cut-offs give different results, but we notice the general trend across all programs remains the same.
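To make the background model concrete, here is a minimal sketch of how a fixed-order Markov model could be estimated from a set of sequences and used to score a stretch of DNA. It is not the PRIORITY code: the pseudocount smoothing and the choice to skip the first `order` bases of each scored sequence are simplifying assumptions, and the names `train_markov` and `log_prob` are hypothetical.

```python
import math
from collections import defaultdict

def train_markov(seqs, order=3, pseudo=1.0):
    """Estimate P(base | previous `order` bases) from training sequences.

    Returns a function prob(context, base). Unseen contexts fall back to
    the uniform distribution through the pseudocounts (an assumption).
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for p in range(order, len(s)):
            counts[s[p - order:p]][s[p]] += 1

    def prob(context, base):
        row = counts.get(context, {})
        total = sum(row.values()) + 4 * pseudo
        return (row.get(base, 0.0) + pseudo) / total

    return prob

def log_prob(seq, prob, order=3):
    """Log-probability of seq under the model, conditioning on (and
    therefore skipping) its first `order` bases for simplicity."""
    return sum(math.log(prob(seq[p - order:p], seq[p]))
               for p in range(order, len(seq)))
```

In a motif finder, such a model would supply the background terms $P(\cdot \mid \phi_0)$ that appear in the likelihood and in the sampling distribution.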

Because we are primarily interested in quantifying the extent to which these new informative priors improve de novo motif discovery, the results presented in the main portion of the manuscript are limited to a comparison of PRI-N, PRI-D, and PRI-DN versus PRI-U. However, to ensure that PRI-U is not simply a ‘straw man’, but represents a reasonable point of comparison, we have also compiled results from three other popular motif discovery programs as reported by Harbison et al.: AlignACE [27], MEME [22], and MDscan [32] (see Supplementary Material). Using the same criterion for success (the top-scoring motif should match the literature consensus), AlignACE is successful in 16 of the 156 sequence-sets, MEME in 35, MDscan in 54, and PRI-U in 46. AlignACE has one disadvantage relative to the others in that it uses a first-order Markov model of the background, but each of the three existing methods has advantages over PRI-U: AlignACE considers many motif widths; MEME considers many motif widths, uses sophisticated heuristics to initialize its search, and uses a fifth-order Markov model of the background; and MDscan makes significant use of the p-values from the ChIP-chip experiments. Despite these disadvantages, PRI-U performs admirably, even without an informative prior, and therefore represents a reasonable point of comparison. Since everything about the algorithm is the same apart from the choice of prior, PRI-U permits the most accurate quantification of the utility of our new informative priors, and so we use it in the remainder of the paper as a baseline when comparing the performance of PRI-N, PRI-D, and PRI-DN.

Figure 2 summarizes the results of the four algorithms on 156 sequence-sets. Overall, while PRI-U finds the correct motif in 46 sequence-sets, PRI-DN finds the correct motif in 69 sequence-sets, resulting in an improvement of 50% over baseline. To break down these results more carefully, we divide the sequence-sets into four groups based on the success/failure of PRI-U and PRI-DN (corresponding to the four quadrants in Figure 2). This grouping reveals that the DN prior never performs worse than the U prior, a claim that is also true of the D prior, but not of the N prior. To better understand the performance of these two priors in relation to the DN prior, we now consider each group in detail:

¹ The distance is normalized to lie between 0 and 1; see Supplementary Material for details about the distance calculation.

[Figure 2: four quadrants (I to IV) of rows of filled/empty balls, one ball per prior; quadrant totals: I = 46, II = 23, III = 0, IV = 87.]

Fig. 2. Results of the four algorithms on 156 yeast sequence-sets produced by ChIP-chip experiments [3]. Each row of four balls corresponds to the four positional priors U, N, D, and DN. A filled ball indicates the situation where the respective prior succeeds in finding the true motif. There are $2^4 = 16$ possible combinations of successes/failures of the four priors shown by 16 rows of filled/empty balls. The number of cases resulting in each combination is indicated next to the respective row. The 16 combinations are divided into four quadrants, conditioned on the success/failure of U and DN. The central numbers indicate the cardinality of each quadrant. As can be seen, some combinations like those in quadrant III do not occur.

Group I: PRI-U succeeds and PRI-DN succeeds.

This group corresponds to the upper-left quadrant of Figure 2 and it contains 46 sequence-sets corresponding to 31 TFs. For most sequence-sets in this group (41 of 46) all four algorithms find motifs matching the literature consensus. For the other 5 sequence-sets (Cin5 H202Lo, Ste12 Alpha, Ste12 YPD, Hsf1 H202Lo, and Skn7 YPD) PRI-N is the only algorithm that fails.

Let us look at the case of the TF Ste12 in more detail. In theory, the way the priors are formulated, PRI-N should work on TFs for which the nucleosome occupancy over the functional binding sites is lower, in general, than the nucleosome occupancy over the rest of the sequences in the set. For PRI-DN to succeed though, the nucleosome occupancy over the functional sites must be lower than the occupancy over the non-functional sites (that is, sites in the negative set). In both the Alpha and YPD conditions, the average nucleosome occupancy in the sequence-sets is lower than the nucleosome occupancy at the functional binding sites of Ste12. This explains why PRI-N fails. But according to the analysis of Segal et al. [2, Supplemental Figure 36], the average nucleosome occupancy at the functional sites of Ste12 is lower than the average occupancy at the non-functional sites. This clarifies why in spite of using the same nucleosome occupancy data, PRI-DN succeeds in finding the true motif of Ste12 in both conditions, although PRI-N does not. This suggests the importance of using nucleosome occupancy information in a discriminative setting.

Group II: PRI-U fails and PRI-DN succeeds.

This group corresponds to the lower-left quadrant of Figure 2 and it contains 23 sequence-sets corresponding to 19 TFs. In eight cases, PRI-DN is the only algorithm that succeeds in finding the true motif. This implies that neither D nor N alone is strong enough to identify the true motif, but the combination DN succeeds. In 9 other cases in this group, in addition to DN, exactly one of D and N is successful. This suggests that in those cases, the improvement in DN comes mainly from the respective prior.

Group III: PRI-U succeeds and PRI-DN fails.

This group, corresponding to the upper-right quadrant of Figure 2, is empty. This implies that whenever the uniform prior succeeds, the DN prior also succeeds. Thus using this informative prior does not worsen the performance of the algorithm for any sequence-set.

Group IV: PRI-U fails and PRI-DN fails.

This group corresponds to the lower-right quadrant of Figure 2 and contains 87 sequence-sets corresponding to 50 TFs. For 84 of these 87 sequence-sets, none of the four algorithms finds motifs matching the literature consensus. For the remaining three cases (Msn2 H202Hi, Skn7 H202Lo, and Tec1 YPD), although PRI-D succeeds, PRI-DN seems to fail to find the true motif. However, the failure of PRI-DN seems to be the result of the program getting stuck in a local optimum in each case. When we score the three motifs found by PRI-D according to the posterior score obtained using the DN prior, we get a significantly higher score than the score reported by PRI-DN for the respective top motifs it learns (which do not score well according to the distance metric). The same reasoning applies to the failure of PRI-N for the sequence-sets of Msn2 H202Hi and Skn7 H202Lo. In the Tec1 YPD sequence-set, however, Tec1 binding sites have an average nucleosome occupancy of ∼89%, which is higher than the average occupancy over all intergenic regions (∼85%), causing the N prior to fail.

5 Discussion

Although it has been known for a while that nucleosomes control the binding activity of TFs by providing differential access to DNA binding sites [1, 2, 17, 18, 19, 20], we believe we are the first to use nucleosome occupancy information to more accurately predict de novo binding sites of TFs.

Our results show that direct use of the nucleosome occupancy predictions of Segal et al. [2] as a positional prior does not help motif discovery much: PRI-N finds 51 correct motifs compared to the 46 found by PRI-U. Motifs of some TFs are more prone to be occupied by nucleosomes than others. The example of Ste12 in Group I illustrates how the prior N can fail because of the high nucleosome occupancy at Ste12 functional sites. However, when we adopt a discriminative perspective on nucleosome occupancy, the prior DN succeeds in finding the true Ste12 motif. In fact, there is no sequence-set on which PRI-N succeeds but PRI-DN fails. Overall, our results show that discriminative use of nucleosome occupancy information is extremely useful: PRI-DN finds 69 true motifs, 50% more than PRI-U. Although in this paper we focus on the usefulness of nucleosome occupancy information, the D prior also improves motif discovery noticeably without this information: PRI-D finds 60 true motifs, 30% more than PRI-U. In addition to the three programs AlignACE, MEME, and MDscan discussed earlier, Harbison et al. use three conservation-based algorithms to discover motifs: MEME_c, CONVERGE [3], and a method by Kellis et al. [33], which find 49, 56, and 50 correct motifs respectively (see Supplementary Material). Not only does PRI-DN perform much better than these programs, even PRI-D finds more correct motifs than the best of these programs. This suggests that our prior D will be quite useful in motif discovery problems even when nucleosome occupancy information is unavailable.
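The percentages quoted above follow directly from the reported counts of correct motifs. As a quick check, the short Python snippet below recomputes each gain relative to PRI-U. The dictionary and function names are ours; the counts are the ones stated in the text.

```python
# Number of sequence-sets on which each algorithm finds the correct motif,
# as reported in the text.
correct = {"PRI-U": 46, "PRI-N": 51, "PRI-D": 60, "PRI-DN": 69}


def relative_gain(count, baseline):
    """Percentage improvement of `count` over `baseline`."""
    return 100.0 * (count - baseline) / baseline


for name, count in correct.items():
    gain = relative_gain(count, correct["PRI-U"])
    print(f"{name}: {count} correct motifs, {gain:+.1f}% relative to PRI-U")
```

The gains come out to about 10.9% for PRI-N, 30.4% for PRI-D, and exactly 50% for PRI-DN, consistent with the rounded figures in the text.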

Our discriminative priors (both D and DN) are novel in the way they incorporate discriminative information in a generative setting. Note that in a specific genome, for a particular W-mer σ starting at position j in X_i, the denominator of (13) remains the same regardless of the sequences in X, since it is nothing but the total number of occurrences of σ in the whole genome. Similarly, for a particular nucleosome occupancy dataset (experimental or computational), the weighted sum of all accessible sites in the denominator of (12) remains the same for all possible sequence sets X. Hence these numbers can be precomputed and stored in a table of size 4^W. Then, for a particular sequence-set X, computing the prior involves one pass (linear-time) over just the sequences in X. No information needs to be explicitly computed from the negative set Y, which is good because it changes as the positive set changes. In addition, since the actual algorithm only needs to sample over the positive set, the overall time and space complexities of the search are much less than the complexities of other discriminative approaches. In fact, it is practically impossible to compare the performance of PRI-D with these approaches since the size of the intergenic regions in yeast is about 3 megabases (and larger for metazoan genomes).
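The precompute-then-scan idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not a reproduction of equations (12) and (13): it uses unweighted counts on a single strand, omits nucleosome weighting and pseudocounts, and assumes the genome-wide table includes the positive set. The function names and the toy sequences are hypothetical.

```python
from collections import Counter


def count_wmers(sequences, W):
    """Count every W-mer occurrence (single strand) in a set of sequences."""
    counts = Counter()
    for seq in sequences:
        for j in range(len(seq) - W + 1):
            counts[seq[j:j + W]] += 1
    return counts


def positive_fraction(positive_set, genome_counts, W):
    """For each start position j of each positive sequence, return
    (occurrences of the W-mer in the positive set) divided by
    (occurrences of the W-mer genome-wide)."""
    positive_counts = count_wmers(positive_set, W)  # one linear pass over X
    return [
        [positive_counts[seq[j:j + W]] / genome_counts[seq[j:j + W]]
         for j in range(len(seq) - W + 1)]
        for seq in positive_set
    ]


# Toy "genome" of two intergenic regions; the first one is the positive set X.
genome = ["ACGTACGT", "TTTTACGT"]
genome_counts = count_wmers(genome, W=4)  # computed once, reused for any X
ratios = positive_fraction(genome[:1], genome_counts, W=4)
print(ratios[0])
```

The genome-wide table is built once and has at most 4^W entries, while each new sequence-set X costs only a single pass over its own sequences; the negative set Y is never enumerated.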

In this study, we have fixed W to be 8. In the case of longer motifs, we could post-process the short motif learned by the algorithm and expand it appropriately on either side. Alternatively, we could build priors for multiple values of W and, like most motif finders, run the algorithm with different motif lengths. A larger value for W has certain consequences, however. First, the space required to store priors over W-mers is exponential in W. Second, as W grows, the average probability of seeing a W-mer in the genome decreases, implying that pseudocounts used to smooth the prior become increasingly important (of course, this effect will be mitigated somewhat in larger genomes).
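Both consequences are easy to quantify with a back-of-the-envelope calculation. The Python sketch below assumes a uniform base composition, counts a single strand, and approximates the number of start positions by the roughly 3 megabases of yeast intergenic sequence mentioned earlier; these simplifications are ours, as are the function names.

```python
GENOME_SIZE = 3_000_000  # approximate size of yeast intergenic regions, in bases


def table_size(W):
    """Number of distinct W-mers, i.e. entries in a table of per-W-mer priors."""
    return 4 ** W


def expected_occurrences(W, genome_size=GENOME_SIZE):
    """Expected count of one specific W-mer under a uniform base composition,
    treating every base as a possible start position."""
    return genome_size / 4 ** W


for W in (6, 8, 10, 12):
    print(f"W={W:2d}: table size {table_size(W):>10,d}, "
          f"expected occurrences per W-mer {expected_occurrences(W):10.2f}")
```

Under these assumptions, W = 8 gives 65,536 table entries and roughly 46 expected occurrences per W-mer, whereas at W = 12 the table has about 16.8 million entries and a typical W-mer is expected less than once, which is where smoothing with pseudocounts starts to dominate.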

Throughout the paper, we have used PSSMs to model motifs. Although the PSSM is currently a popular choice for a motif model, recent biological [34] and computational [35, 36] findings indicate that more expressive (and hence, more complex) models might be more appropriate. Since our method assigns a prior on the locations within each sequence and not on any specific form of the motif model, it can be used to learn any motif model.
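To illustrate why a positional prior is independent of the motif model, the hedged Python sketch below combines a per-position prior with an arbitrary site-scoring function to obtain a distribution over start positions. It is a simplified stand-in for one sampling step, not the sampler used in this paper; `position_posterior`, `pssm_score`, the toy PSSM, and the prior values are all hypothetical.

```python
def position_posterior(seq, W, prior, site_score):
    """Distribution over start positions, proportional to
    prior[j] * site_score(W-mer at j). `site_score` may be any motif model
    that returns a non-negative score for a W-mer."""
    n_positions = len(seq) - W + 1
    if len(prior) != n_positions:
        raise ValueError("prior must have one entry per start position")
    weights = [prior[j] * site_score(seq[j:j + W]) for j in range(n_positions)]
    total = sum(weights)
    return [w / total for w in weights]


def pssm_score(pssm, background=0.25):
    """Build a scoring function from a PSSM: the likelihood ratio of the site
    under the motif versus a uniform background."""
    def score(site):
        ratio = 1.0
        for k, base in enumerate(site):
            ratio *= pssm[k][base] / background
        return ratio
    return score


# Toy example: a 2-column PSSM that prefers "AC", and a non-uniform prior.
toy_pssm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
toy_prior = [0.2, 0.3, 0.5]
posterior = position_posterior("AACG", 2, toy_prior, pssm_score(toy_pssm))
print(posterior)
```

Swapping `pssm_score` for a scoring function built from a richer model (for example, one capturing dependencies between positions) leaves `position_posterior` and the prior untouched, which is the sense in which the prior is model-agnostic.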


The nucleosome occupancy predictions from the model of Segal et al. attempt to capture the static, intrinsic nucleosome binding properties of the DNA. In reality, however, the positioning of nucleosomes changes dynamically as the environmental conditions change or even as the cell progresses through its cell-cycle. Nucleosomes covering certain functional sites might be displaced under specific conditions by other mechanisms to permit access to TFs. It is thus not surprising that Segal et al. note that according to their computational model, certain TFs have higher nucleosome occupancy at their functional sites than non-functional sites. If nucleosome occupancy data collected under the same environmental conditions in which the TFs are profiled were available, we would expect to get better results. Unfortunately, at this time high-resolution nucleosome occupancy data is limited. But as more data becomes available, we can incorporate it usefully into our approach.

In closing, we stress that incorporating informative priors over sequence positions is of great benefit to motif discovery algorithms. Low signal-to-noise ratio, especially in higher organisms, makes it difficult to successfully use algorithms based only on statistical overrepresentation. Narlikar et al. [26] have shown that using informative priors based on structural classes of TFs improves motif discovery and this paper shows that other kinds of informative priors improve motif discovery as well. Algorithms using conservation information across species [3, 33, 37, 38] are another example of successful incorporation of additional information for motif discovery. We note that although PRI-DN does better overall than the conservation-based methods described earlier, there are certain motifs that one or more of these methods find but PRI-DN does not. This suggests that combining conservation and nucleosome occupancy might further improve the performance of motif finders. We are currently working toward a unified framework of informative priors based on nucleosome occupancy, TF structural class, and conservation.

Acknowledgments. The authors thank Eran Segal for sending them the nucleosome occupancy model, Jason Lieb for sharing unpublished data, and Uwe Ohler for useful discussions and suggestions. The research presented here was supported by a National Science Foundation CAREER award and an Alfred P. Sloan Fellowship to A.J.H.

Supplementary Material can be found at http://www.cs.duke.edu/~amink/.

References

[1] Lee,C., Shibata,Y., Rao,B., Strahl,B., Lieb,J. (2004) Evidence for nucleosome depletion at active regulatory regions genome-wide, Nature Genetics, 36(8):900–905.

[2] Segal,E., Fondufe-Mittendorf,Y., Chen,L., Thastrom,A., Field,Y., Moore,I., Wang,J., and Widom,J. (2006) A genomic code for nucleosome positioning, Nature, 442(7104):772–778.

[3] Harbison,C., et al. (2004) Transcriptional regulatory code of a eukaryotic genome, Nature, 431:99–104.

[4] Lee,T., et al. (2002) Transcriptional regulatory networks in Saccharomyces cerevisiae, Science, 298:799–804.

[5] Liu,X., Noll,D., Lieb,J., and Clarke,N. (2005) DIP-chip: Rapid and accurate determination of DNA binding specificity, Genome Research, 15(3):421–427.


[6] Mukherjee,S., Berger,M., Jona,G., Wang,X., Muzzey,D., Snyder,M., Young,R., and Bulyk,M. (2004) Rapid analysis of the DNA binding specificities of transcription factors with DNA microarrays, Nature Genetics, 36(12):1331–1339.

[7] Spellman,P., Sherlock,G., Zhang,M., Iyer,V., Anders,K., Eisen,M., Brown,P., Botstein,D., and Futcher,B. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization, Molecular Biology of the Cell, 9:3273–3297.

[8] Kim,S., Lund,J., Kiraly,M., Duke,K., Jiang,M., Stuart,J., Eizinger,A., Wylie,B., and Davidson,G. (2001) A gene expression map for Caenorhabditis elegans, Science, 293:2087–2092.

[9] Wasserman,W. and Sandelin,A. (2004) Applied bioinformatics for the identification of regulatory elements, Nat Rev Genet, 5(4):276–287.

[10] Siggia,E. (2005) Computational methods for transcriptional regulation, Current Opinion in Genetics and Development, 15:214–221.

[11] Workman,C. and Stormo,G. (2000) ANN-Spec: A method for discovering transcription factor binding sites with improved specificity, Pac. Symp. Biocomput., 467–478.

[12] Segal,E., Barash,Y., Simon,I., Friedman,N., and Koller,D. (2002) From sequence to expression: A probabilistic framework, RECOMB ’02.

[13] Sinha,S. (2002) Discriminative motifs, RECOMB ’02.

[14] Hong,P., Liu,X., Zhou,Q., Lu,X., Liu,J., and Wong,W. (2005) A boosting approach for motif modeling using ChIP-chip data, Bioinformatics, 21(11):2636–2643.

[15] Sinha,S. (2006) On counting position weight matrix matches in a sequence, with application to discriminative motif finding, Bioinformatics, 22(14):e454–463.

[16] Tompa,M. et al. (2005) Assessing computational tools for the discovery of transcription factor binding sites, Nat. Biotechnol., 23(1):137–144.

[17] Almer,A., Rudolph,H., Hinnen,A., and Horz,W. (1986) Removal of positioned nucleosomes from the yeast PHO5 promoter upon PHO5 induction releases additional upstream activating DNA elements, Embo. J., 5:2689–2696.

[18] Mai,X., Chou,S., and Struhl,K. (2000) Preferential accessibility of the yeast his3 promoter is determined by a general property of the DNA sequence, not by specific elements, Mol. Cell. Biol., 20:6668–6676.

[19] Sekinger,E., Moqtaderi,Z., and Struhl,K. (2005) Intrinsic histone-DNA interactions and low nucleosome density are important for preferential accessibility of promoter regions in yeast, Mol. Cell, 18:735–748.

[20] Yuan,G., Liu,Y., Dion,M., Slack,M., Wu,L., Altschuler,S., and Rando,O. (2005) Genome-scale identification of nucleosome positions in S. cerevisiae, Science, 309:626–630.

[21] Staden,R. (1984) Computer methods to locate signals in nucleic acid sequences, Nucleic Acids Research, 12:505–519.

[22] Bailey,T. and Elkan,C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers, ISMB ’94, AAAI Press, Menlo Park, California, pp. 28–36.

[23] Gelfand,A. and Smith,A. (1990) Sampling based approaches to calculating marginal densities, Journal of the American Statistical Association, 85:398–409.

[24] Liu,J. (1994) The collapsed Gibbs sampler with applications to a gene regulation problem, Journal of the American Statistical Association, 89:958–966.

[25] Liu,J., Neuwald,A., and Lawrence,C. (1995) Bayesian models for multiple local sequence alignment and Gibbs sampling strategies, Journal of the American Statistical Association, 90:1156–1170.

[26] Narlikar,L., Gordan,R., Ohler,U., and Hartemink,A. (2006) Informative priors based on transcription factor structural class improve de novo motif discovery, Bioinformatics, 22(14):e384–e392.


[27] Roth,F., Hughes,J., Estep,P., and Church,G. (1998) Finding DNA regulatory motifs within unaligned non-coding sequences clustered by whole-genome mRNA quantitation, Nature Biotech., 16:939–945.

[28] Liu,X., Brutlag,D., and Liu,J. (2001) BioProspector: Discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes, Pac Symp Biocomput., 127–138.

[29] Thijs,G., Marchal,K., Lescot,M., Rombauts,S., De Moor,B., Rouze,P., and Moreau,Y. (2002) A Gibbs sampling method to detect over-represented motifs in the upstream regions of coexpressed genes, Journal of Computational Biology, 9:447–464.

[30] Dorrington,R.A. and Cooper,T.G. (1993) The DAL82 protein of Saccharomyces cerevisiae binds to the DAL upstream induction sequence (UIS), Nucleic Acids Research, 21(16):3777–3784.

[31] Jia,Y., Rothermel,B., Thornton,J., and Butow,R.A. (1997) A basic helix-loop-helix-leucine zipper transcription complex in yeast functions in a signaling pathway from mitochondria to the nucleus, Molecular and Cellular Biology, 17:1110–1117.

[32] Liu,X., Brutlag,D., and Liu,J. (2002) An algorithm for finding protein-DNA binding sites with applications to chromatin immunoprecipitation microarray experiments, Nature Biotech., 20:835–839.

[33] Kellis,M., Patterson,N., Endrizzi,M., Birren,B., and Lander,E. (2003) Sequencing and comparison of yeast species to identify genes and regulatory elements, Nature, 423:241–254.

[34] Bulyk,M., Johnson,P., and Church,G. (2002) Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors, Nucleic Acids Research, 30:1255–1261.

[35] Agarwal,P. and Bafna,V. (1998) Detecting non-adjacent correlations within signals in DNA, RECOMB ’98.

[36] Barash,Y., Elidan,G., Friedman,N., and Kaplan,T. (2003) Modeling dependencies in protein-DNA binding sites, RECOMB ’03.

[37] Miller,W., Makova,K., Nekrutenko,A., and Hardison,R. (2004) Comparative Genomics, Annu. Rev. Genom. Human. Genet., 5:15–56.

[38] Siddharthan,R., Siggia,E., and Nimwegen,E. (2005) PhyloGibbs: A Gibbs Sampling Motif Finder That Incorporates Phylogeny, PLoS Comput. Biol., 1(7):e67.