[Methods in Molecular Biology] Chromatin Immunoprecipitation Assays Volume 567 || Modeling and Analysis of ChIP-Chip Experiments

Chapter 9

Modeling and Analysis of ChIP-Chip Experiments

Raphael Gottardo

Abstract

Chromatin immunoprecipitation on microarrays, also known as ChIP-chip, is a popular technique for genome-wide localization of DNA-binding proteins. However, the high density (several million genomic sequences for small eukaryote genomes) and the high noise-to-signal ratio of microarrays make the analysis of ChIP-chip data very challenging. In this chapter, we review some of the issues involved in the analysis of ChIP-chip data and present a few statistical methods that can be used to overcome these issues and improve the detection of DNA–protein binding sites.

Key words: Bayesian analysis, binding sites, multiple testing, normalization, statistics.

1. Introduction

Chromatin immunoprecipitation on microarrays, ChIP-chip, is the most widely used method for identifying in vivo DNA–protein bound regions in a high-throughput manner (1). Recently, Affymetrix (Santa Clara, CA), NimbleGen Systems (Madison, WI), and Agilent Technologies (Palo Alto, CA) have developed oligonucleotide arrays that tile all of the non-repetitive genomic sequences of human and other eukaryotes. These tiling arrays, coupled with ChIP, permit the unbiased mapping of DNA–protein binding sites. Annotation of the transcription factor binding sites in a given genome is essential for building genome-wide regulatory networks, which can then be used in health research to better understand diseases and identify new targets for drugs. However, the large amount of data (several million measurements) and the small number of replicates available are very challenging for any statistical analysis.

Philippe Collas (ed.), Chromatin Immunoprecipitation Assays, Methods in Molecular Biology 567, DOI 10.1007/978-1-60327-414-2_9, © Humana Press, a part of Springer Science+Business Media, LLC 2009


Similar to gene expression arrays (2, 3), tiling arrays query each sequence of interest with a short oligonucleotide, referred to as an oligo or probe. The difference is that the probes used do not necessarily represent genes, but short sequences of DNA in a given genome. The ChIP protocol generates an IP-enriched DNA fragment population and measures the enrichment of each probe in this population. In general, a control sample is also generated to calibrate the IP sample, and there are various ways of obtaining control populations (1). Tiling resolution and coverage can vary greatly from one manufacturer to another. For example, Affymetrix tiling arrays contain oligonucleotides of 25 base pairs (bps) in length, spanning the non-repetitive regions of a genome at an average resolution of 35 bps in humans and higher in smaller genomes. Because the original genomic DNA is sheared into segments of an average length of 500–1,000 bps or less, one would expect a DNA–protein bound region to be of an approximate length of 0.5–1 kbps containing a fixed number of probes (the actual number depends on the tiling resolution) with intensities that form a peak-like structure whose center corresponds to probes closest to the actual binding site. In practice, empirical studies suggest that the length of bound regions can be extremely variable (4–6).

The fluorescent intensity values obtained from an oligonucleotide microarray hybridization are not directly comparable because of systematic probe biases due to non-specific binding. If not accounted for, such biases can severely deteriorate any subsequent analysis. It turns out that this problem is closely related to the base composition of the nucleic acid molecules. For example, sequences with a high G/C content tend to induce stronger hybridization, because each G-C pair forms three hydrogen bonds, whereas an A-T pair forms two. The statistical method of normalization aims at making the probe measurements more comparable by reducing these biases. Johnson et al. (7) introduced the first normalization model for ChIP-chip based on probe sequence composition. This model was motivated by sequence-specific probe behavior models for gene expression microarrays (8–10). Other normalization techniques borrowed from gene expression include Lowess (11, 12) and quantile–quantile (13, 14) normalization. However, these techniques do not use the probe sequence information, and as shown by Royce et al. (15), will typically be inferior.

Once the data have been properly normalized, one can proceed with the detection of bound regions. Several approaches are available for analyzing ChIP-chip data. A common approach is to test a hypothesis for each probe using a sliding window statistic and then to try to correct for multiple testing (5, 16). Keles et al. (5) used a scan statistic, which is an average of t-statistics across a certain number of probes, while Cawley et al. (4) used Wilcoxon's rank sum test within a sliding window of a certain genomic distance. In each of these situations, two types of error can occur: a false positive (type I error) or a false negative (type II error). When many hypotheses are tested at the same time, the probability of making at least one type I error increases. One approach to overcoming this problem is to try to control the total number of type I errors or false positives. This can be done using multiple testing procedures to control some measure of the overall type I error. The most common measure in the area of microarrays is the false discovery rate (FDR), which is the expected proportion of false positives among the total number of discoveries reported (17). A difficulty with sliding window approaches is that the resulting p-values (or statistics) are not independent, as each test uses information from neighboring probes, and it is challenging to devise powerful multiple adjustment procedures. In addition, the window size is fixed and has to be determined in advance. Alternatives to sliding window approaches include hidden Markov models (18, 19) and Bayesian approaches (6, 14, 20). Bayesian approaches can make the best use of available prior information while borrowing strength from the data when estimating the quantities of interest. Using such Bayesian techniques, inference is usually based on the posterior distribution of the parameters. In this chapter, we review and illustrate two methods that can be used to analyze ChIP-chip data, namely MAT (7) and BAC (6).
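As a minimal sketch of the FDR control discussed above, the Benjamini–Hochberg step-up procedure (17) can be written as follows. The function name `benjamini_hochberg` and the example p-values are illustrative only and are not part of MAT or BAC. The procedure is usually justified for independent (or positively dependent) p-values, a condition that sliding window statistics do not strictly satisfy, as noted above.

```python
import numpy as np

def benjamini_hochberg(pvalues, fdr=0.10):
    """Return a boolean mask of the hypotheses rejected at the given FDR level."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # Find the largest rank i (1-based) such that p_(i) <= i * fdr / m.
    below = ranked <= fdr * np.arange(1, m + 1) / m
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        cutoff = np.nonzero(below)[0].max()
        # Step-up rule: reject every hypothesis ranked at or below the cutoff.
        rejected[order[:cutoff + 1]] = True
    return rejected

# Hypothetical p-values for six probes (illustration only).
pvals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
print(benjamini_hochberg(pvals, fdr=0.10))
```

With these made-up values, the four smallest p-values fall below their rank-dependent thresholds and are rejected.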

2. Materials

2.1. Data

We use two publicly available datasets that have already been analyzed by several research groups.

2.1.1. ER Data

Carroll et al. (21) mapped the association of the estrogen receptor (ER) on chromosomes 21–22. These data contain two conditions (genomic DNA control and IP enriched) with three replicates each. Several binding sites have already been identified and experimentally validated, and we will use this information to compare the different methods presented. In total, we have a set of 83 verified bound regions we can use for validation.

2.1.2. Spike-In Data

The second dataset we use is a spike-in dataset that was generated as part of the ENCODE consortium project (22), covering 1% of the human genome using the Affymetrix technology (1.0R arrays). In this experiment, 96 clones approximately 500 bps in length were spiked into the sample at (2n + 1)-fold enrichment for n = 1, ..., 8 and compared to genomic DNA. Some of these clones mapped to overlapping locations on the genome and a few of the clones mapped to locations that were not on one or both of the arrays. Control samples consisted of sonicated DNA that was labeled and hybridized on the array. There were 67 unique spike-in regions and the number of probes in each region ranged from 3 to 94, with a median of 21. The size of the regions covered on the array ranged from 65 to 2044 bps, with a median of 470. The probes on the array are 25 bps long and the midpoints of consecutive probes are spaced at an average of 35 bps. The spike-in dataset includes five replicate arrays for both the treatment and control samples.

2.2. Software

All results presented in this chapter were obtained using open source implementations of MAT and BAC.

2.2.1. MAT

MAT is written as a Python package and can be downloaded at http://chip.dfci.harvard.edu/~wli/MAT/. The webpage contains all instructions for the installation and use of the package.

2.2.2. BAC

BAC is written in the R statistical language with a few functions implemented in C for efficiency. The BAC package is distributed as part of Bioconductor (23), an open source and open development software project for the analysis and comprehension of genomic data. The package can be downloaded at http://www.bioconductor.org/packages/bioc/html/BAC.html. The package contains a vignette with detailed instructions on how to use it.

3. Methods

3.1. Normalization

Normalization plays an important role in the analysis of tiling arrays and thus ChIP-chip. Its aim is to remove systematic biases and ease the separation of the true signal due to DNA–protein bound regions from the background noise. MAT (7) was the first normalization model for ChIP-chip based on probe sequence information. In MAT, the normalization is done in two stages: (i) a prediction model for the probe intensities is derived from their sequence compositions; and (ii) each probe is normalized by subtracting its predicted intensity (representing the bias) from the observed intensity. The rationale behind MAT is that any correlation between the observed and predicted intensities would provide evidence of probe sequence specific biases. In MAT, the normalization is performed by fitting, to all probes on a given array, the following linear model,

$$
y_p = \alpha + \sum_{j=1}^{25} \sum_{k \in \{A,C,G,T\}} \beta_{jk} I_{pjk} + \sum_{k \in \{A,C,G,T\}} \gamma_k n_{pk}^2 + \delta \log(c_p) + \epsilon_p \qquad [1]
$$


where $y_p$ is the log transformed intensity from probe $p$, $n_{pk}$ is the nucleotide count of type $k$ in the sequence of probe $p$, $\alpha$ is the overall baseline intensity, $I_{pjk}$ is an indicator function equal to one if the nucleotide at position $j$ is $k$ in probe sequence $p$ and zero otherwise, $\beta_{jk}$ is the effect of nucleotide $k$ at position $j$, $\gamma_k$ is the effect of the nucleotide count squared, $c_p$ is the number of times the sequence of probe $p$ appears in the genome (copy number), $\delta$ is the effect of log probe copy number, and $\epsilon_p$ is an error term. In other words, the first term on the right hand side of equation [1] is the mean log intensity of all probes, the second term accounts for nucleotide positional effect, the third term accounts for overall nucleotide composition, while the last term accounts for the fact that if a probe maps to multiple locations in the genome its intensity will typically be greater. The 81 resulting parameters can be easily estimated via least squares (see Note 1). When applied to both the ER and the spike-in data, the correlations between observed and predicted intensities ranged from 0.62 to 0.86, suggesting that a significant part of the signal measured by the probes is due to non-specific hybridization. The effect of the MAT normalization applied to both datasets is shown in Fig. 9.1. Before normalization the GC content has a strong effect on the log intensities; the greater the GC content, the greater the intensity. After normalization, the effect of the GC content on the log intensities is significantly decreased. Figure 9.2 shows the effect of each single nucleotide (A, G, C, T) as a function of its position on the probe. One can see that G/C's have the maximum effect, particularly if they are towards the middle of a probe. In the next section, we will see that if the probe measurements are not properly normalized, it can severely affect the detection of bound regions.
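The two-stage normalization built on equation [1] can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the MAT implementation: the function names `mat_design_matrix` and `mat_normalize` are hypothetical, the probes are simulated, and T is taken as the reference nucleotide in the positional term so that the model is identifiable and has 1 + 25 × 3 + 4 + 1 = 81 parameters, matching the count quoted above (with four indicators per position the columns would be collinear with the baseline).

```python
import numpy as np

NUCLEOTIDES = "ACGT"
PROBE_LENGTH = 25

def mat_design_matrix(sequences, copy_numbers):
    """Build a design matrix following equation [1] for 25-mer probes."""
    rows = []
    for seq, copies in zip(sequences, copy_numbers):
        row = [1.0]  # overall baseline intensity
        # Positional indicators; T is the assumed reference nucleotide.
        for base in seq:
            row.extend(float(base == k) for k in "ACG")
        # Squared nucleotide counts for A, C, G and T.
        row.extend(float(seq.count(k)) ** 2 for k in NUCLEOTIDES)
        row.append(np.log(copies))  # log probe copy number
        rows.append(row)
    return np.array(rows)

def mat_normalize(sequences, copy_numbers, log_intensities):
    """Fit the model by least squares; return (normalized, predicted) values."""
    X = mat_design_matrix(sequences, copy_numbers)
    y = np.asarray(log_intensities, dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    predicted = X @ coef
    # Stage (ii): subtract the predicted intensity (the bias) from the observed one.
    return y - predicted, predicted

# Simulated probes with a GC-driven bias (illustration only).
rng = np.random.default_rng(0)
n_probes = 2000
seqs = ["".join(rng.choice(list(NUCLEOTIDES), PROBE_LENGTH)) for _ in range(n_probes)]
copies = rng.integers(1, 4, size=n_probes)
gc = np.array([s.count("G") + s.count("C") for s in seqs])
y = 8.0 + 0.15 * gc + rng.normal(0.0, 0.5, n_probes)

normalized, predicted = mat_normalize(seqs, copies, y)
print(mat_design_matrix(seqs, copies).shape)
print(np.corrcoef(gc, y)[0, 1], np.corrcoef(gc, normalized)[0, 1])
```

In this simulation the GC count is a linear combination of the positional indicators, so the least squares residuals are uncorrelated with it, which mirrors the flattening of the GC trend described for Fig. 9.1.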

3.2. Detection of Bound Regions

In addition to normalization, MAT can also detect bound regions with a sliding window approach based on a trimmed mean statistic combined with an FDR estimation procedure (7). The trimmed mean removes the top and bottom 10% of the normalized intensities and averages the remaining 80%, which provides robustness against outliers. Assuming that the null distribution of the trimmed mean statistic is symmetric about the median, for each cutoff value above the median (positive cutoff), a negative cutoff is defined as the value symmetric to the positive cutoff about the median. After merging nearby probes beyond each cutoff into regions, the region FDR can be estimated as the ratio of negative regions over the total number of regions. MAT can automatically select the proper cutoff so that the region FDR is less than or equal to the user-specified FDR value.
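The sliding window detection described above can be sketched as follows. This is a simplified illustration, not MAT's code: the function names, the 300 bp half window and the 300 bp merging gap are assumptions, and the denominator of the FDR estimate is taken to be the number of regions beyond the positive cutoff, which is one reading of "the total number of regions". Positions are assumed to be sorted.

```python
import numpy as np

def trimmed_mean(values, trim=0.10):
    """Mean after dropping the top and bottom `trim` fraction of the values."""
    v = np.sort(np.asarray(values, dtype=float))
    k = int(np.floor(trim * v.size))  # k is 0 for windows with fewer than 10 probes
    return v[k:v.size - k].mean()

def window_scores(positions, intensities, half_width=300):
    """Trimmed mean of the probes within `half_width` bp of each probe."""
    positions = np.asarray(positions)
    intensities = np.asarray(intensities, dtype=float)
    scores = np.empty(positions.size)
    for i, pos in enumerate(positions):
        lo = np.searchsorted(positions, pos - half_width, side="left")
        hi = np.searchsorted(positions, pos + half_width, side="right")
        scores[i] = trimmed_mean(intensities[lo:hi])
    return scores

def count_regions(positions, selected, max_gap=300):
    """Number of regions after merging selected probes at most `max_gap` bp apart."""
    pos = np.asarray(positions)[np.asarray(selected, dtype=bool)]
    if pos.size == 0:
        return 0
    return 1 + int(np.sum(np.diff(pos) > max_gap))

def region_fdr(positions, scores, cutoff, max_gap=300):
    """Estimate the region FDR from the cutoff mirrored about the median."""
    scores = np.asarray(scores, dtype=float)
    median = np.median(scores)
    positive = count_regions(positions, scores > cutoff, max_gap)
    negative = count_regions(positions, scores < 2 * median - cutoff, max_gap)
    return negative / positive if positive else 0.0

# Simulated normalized intensities with one hypothetical enriched region.
rng = np.random.default_rng(1)
positions = np.arange(0, 35 * 3000, 35)
signal = rng.normal(0.0, 1.0, positions.size)
signal[1000:1020] += 3.0
scores = window_scores(positions, signal)
print(region_fdr(positions, scores, cutoff=1.5))
```

To mimic the automatic cutoff selection, one could evaluate `region_fdr` over a grid of cutoffs and keep the smallest cutoff whose estimate does not exceed the target FDR.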


In comparison, BAC (6), which is built on previous approaches used in gene expression analysis (24–26), uses a Bayesian hierarchical model to identify regions of interest.

In BAC the log transformed measurements are modeled as follows:

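$$
\begin{aligned}
y_{1pr} &= \mu_p + \epsilon_{1pr}, \\
y_{2pr} &= \mu_p + \gamma_p + \epsilon_{2pr}, \\
\epsilon_{cpr} &\sim N(0, \lambda_{cp}^{-1}),
\end{aligned}
\qquad [2]
$$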

where $y_{cpr}$ is the log transformed intensity of probe $p$ from replicate $r$ in condition $c$, with $c \in \{1, 2\}$ denoting the treatment label, equal to one for control and two for IP enriched. In equation [2], $\mu_p$ is the probe background intensity and $\gamma_p$ is a probe enrichment effect, which we expect to be large if probe $p$ is part of a bound region. We model the background as a random effect with Gaussian distribution, namely $\mu_p \sim N(0, \psi^{-1})$, where the variance $\psi^{-1}$ is constant across probes. Even though we would typically normalize our data to remove probe sequence effects (e.g., using MAT), it might still be necessary to include probe specific effects for two main reasons: (i) the MAT sequence normalization model is not perfect and some unexplained residual effects are likely to remain, and (ii) some of the probe-to-probe variation might be due to other (non-sequence specific) factors. To model the fact that enrichment effects can be exactly zero, we use a mixture of a point mass at zero and a Gaussian distribution truncated at zero. BAC takes into account the spatial dependence between probes by allowing the weights of the mixture to be correlated for neighboring probes; see Gottardo et al. (6) for details. BAC also includes an exchangeable prior for the variances, allowing each probe to have a different variance while still achieving some shrinkage. This allows us to regularize empirical variance estimates, which can be very noisy due to the small number of replicates. Finally, non-informative priors are used for all parameters and a simulation technique called Markov chain Monte Carlo is used to estimate the unknown parameters. Among other things, these parameter estimates can be used to compute, for each probe, the probability that the probe belongs to a bound region. The closer the probability is to one, the more evidence there is that the probe belongs to a bound region. Bound regions can then be formed by thresholding these posterior probabilities. A common threshold is 0.5; an FDR-based threshold can also be derived as explained in (27).

Fig. 9.1. Boxplots of log intensities as a function of GC counts before and after normalization for one control array of the ER data (top) and one control array of the spike-in data (bottom). The thick line within each box shows the median log intensity for all probes with the corresponding number of G's or C's. After normalization the medians are mostly centered around zero.

Fig. 9.2. Effect of nucleotide base (A, G, C, T) as a function of its position on the probe sequence for the ER data (left) and spike-in data (right). G's and C's, especially towards the middle of the probes, have the strongest effect.
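The final step of the BAC analysis, forming bound regions by thresholding the posterior probabilities, can be sketched as below. This is not the BAC package's own code: the function name `call_bound_regions`, the example probabilities and the rule that merges selected probes at most `max_gap` bp apart are assumptions made for illustration. The posterior probabilities themselves would come from the MCMC output.

```python
def call_bound_regions(positions, posterior_probs, threshold=0.5, max_gap=300):
    """Group probes whose posterior probability exceeds `threshold` into regions.

    Positions must be sorted. Returns a list of (start, end) coordinates.
    Selected probes separated by at most `max_gap` bp are merged, even if a
    probe between them falls below the threshold.
    """
    regions = []
    start = end = None
    for pos, prob in zip(positions, posterior_probs):
        if prob <= threshold:
            continue
        if start is not None and pos - end <= max_gap:
            end = pos  # extend the current region
        else:
            if start is not None:
                regions.append((start, end))
            start = end = pos  # open a new region
    if start is not None:
        regions.append((start, end))
    return regions

# Hypothetical probe midpoints (bp) and posterior probabilities.
positions = [100, 135, 170, 205, 240, 1000, 1035, 1070]
probs = [0.1, 0.8, 0.9, 0.7, 0.2, 0.6, 0.95, 0.3]
print(call_bound_regions(positions, probs))  # [(135, 205), (1000, 1035)]
```

Raising `threshold` trades sensitivity for fewer false positives; an FDR-based threshold as in (27) would replace the fixed value of 0.5.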

We now turn to the ER and spike-in data to evaluate and compare the performance of MAT and BAC. We have applied each method to both datasets, fixing the false discovery rate to 10% (Table 9.1). Overall, both MAT and BAC perform relatively well on both datasets as they detect most of the positive controls. On the ER data, BAC performs slightly better as it detects more positive controls. On the spike-in data, we actually know the status of all the regions and we can thus compute the true false discovery rate in addition to the number of positive controls detected. BAC and MAT detect the same number of positive controls, but BAC has a nominal FDR closer to the true FDR (see Note 2). Finally, for comparison, we have also included the results of MAT and BAC applied to the unnormalized data. The performance of both methods is clearly inferior. For example, MAT applied to the unnormalized ER data leads to a very large number of detected regions (14,084), most of which are likely false positives.

For the spike-in data, because we know the true status of all the regions, it is also possible to plot a receiver operating characteristic (ROC) curve, which shows the number of true positives versus the number of false positives detected when varying the cut-off of each method. For such an ROC curve, the higher the curve is, the better the performance is. Figure 9.3 shows that both MAT and BAC are virtually equivalent when the data are normalized and that both suffer from the lack of normalization. Figure 9.3 also shows that BAC is slightly better when the data are not normalized (see Note 3).

Table 9.1. Performances of MAT and BAC on the ER and spike-in data. For comparison purposes we have also included the results without normalization. TP: number of positive controls detected; Total: total number of regions detected.

                          ER              Spike-in
                          TP    Total     TP    Total    FDR (%)
BAC w/ normalization      73    99        65    72       10
MAT w/ normalization      62    72        65    71       8
BAC w/o normalization     25    83        51    66       23
MAT w/o normalization     83    14084     46    52       12
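As a quick check on the spike-in results, the true FDR can be recomputed from the counts in Table 9.1, assuming that every detected region that is not a true positive counts as a false positive. The helper name `observed_fdr` is illustrative.

```python
# (true positives, total detected regions) on the spike-in data, from Table 9.1.
spike_in = {
    "BAC w/ normalization": (65, 72),
    "MAT w/ normalization": (65, 71),
    "BAC w/o normalization": (51, 66),
    "MAT w/o normalization": (46, 52),
}

def observed_fdr(true_positives, total):
    """Percentage of detected regions that are not true positives."""
    return 100.0 * (total - true_positives) / total

for method, (tp, total) in spike_in.items():
    print(f"{method}: {observed_fdr(tp, total):.1f}%")
```

Rounded to the nearest percent, these values agree with the FDR column of Table 9.1 (10, 8, 23 and 12%), which supports this reading of the TP and Total columns.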

4. Notes

1. The normalization implemented in MAT is done for each array separately and uses all the probes on the array to estimate the sequence-specific biases. This is not optimal because probes that are part of bound regions measure not only background but also specific hybridization; this can result in oversmoothing of some of the true signal in enriched regions. To overcome this problem, one could simply replace the least squares estimation by a more robust procedure; see for example (15). In addition, MAT was derived for transcription factor data, but preliminary results on histone modification data suggest that it works relatively well on such data.

Fig. 9.3. ROC curve for MAT and BAC on the spike-in data.


2. In the analysis of high-throughput biological data, including ChIP-chip, it is common to use an FDR procedure to account for multiple testing. However, in practice, it can be difficult to get an accurate estimate of the FDR. Based on our experience, the estimation of the FDR is particularly difficult with histone modification data where one expects many enriched regions. In this case, we recommend the use of control regions in order to estimate the FDR. If such control regions are not available, one could simply select a threshold that leads to a reasonable number of enriched regions.

3. In the results shown above, BAC performed slightly better than MAT. This is not surprising because BAC is a more comprehensive modeling approach. This said, BAC is computationally more demanding and users would need to decide whether the improved results are worth the additional computing time. BAC also requires a control sample as well as replicates. This is not the case for MAT, which can be applied to a single array.
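Note 1 suggests replacing least squares with a more robust procedure. One common choice, shown here purely as an illustration and not necessarily the procedure used in (15), is Huber's M-estimator fitted by iteratively reweighted least squares. The function name `huber_irls`, the tuning constant 1.345 and the toy data are assumptions.

```python
import numpy as np

def huber_irls(X, y, c=1.345, n_iter=20):
    """Huber M-estimate of regression coefficients via reweighted least squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # ordinary least squares start
    for _ in range(n_iter):
        resid = y - X @ coef
        # Robust scale estimate from the median absolute deviation.
        scale = np.median(np.abs(resid - np.median(resid))) / 0.6745
        if scale == 0:
            break
        u = np.abs(resid) / scale
        # Weight 1 for small standardized residuals, c/u for large ones.
        weights = np.minimum(1.0, c / np.maximum(u, 1e-12))
        sw = np.sqrt(weights)
        coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef

# Toy data: a linear "bias" plus 10% of probes carrying extra (enriched) signal.
rng = np.random.default_rng(2)
n = 500
x = rng.uniform(0.0, 1.0, n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, n)
y[:50] += 5.0  # hypothetical enriched probes
X = np.column_stack([np.ones(n), x])

ols, *_ = np.linalg.lstsq(X, y, rcond=None)
robust = huber_irls(X, y)
print("least squares:", ols)
print("Huber IRLS:   ", robust)
```

In this toy example the enriched probes pull the least squares baseline upward, whereas the reweighted fit stays close to the background trend, which is the behavior one would want when estimating sequence-specific bias in the presence of bound regions.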

Acknowledgments

The author would like to thank Shirley X. Liu, Wei Li, and Evan W. Johnson with whom some of the work presented here originated. The author also thanks Evan W. Johnson for providing the spike-in data.

References

1. Buck, M. J. and Lieb, J. D. (2004) ChIP-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics 83, 349–360.

2. Schena, M., Shalon, D., Davis, R. W. and Brown, P. O. (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467–470.

3. Lockhart, D. J., Dong, H., Byrne, M. C., Follettie, M. T., Gallo, M. V., Chee, M. S., Mittmann, M., Wang, C., Kobayashi, M., Horton, H. and Brown, E. L. (1996) Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat. Biotechnol. 14, 1675–1680.

4. Cawley, S. E., Bekiranov, S., Ng, H. H., Kapranov, P., Sekinger, E. A., Kampa, D., Piccolboni, A., Sementchenko, V., Cheng, J., Williams, A. J., Wheeler, R., Wong, B., Drenkow, J., Yamanaka, M., Patel, S., Brubaker, S., Tammana, H., Helt, G., Struhl, K. and Gingeras, T. R. (2004) Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell 116, 499–509.

5. Keles, S., van der Laan, M. J., Dudoit, S. and Cawley, S. E. (2006) Multiple testing methods for ChIP-chip high density oligonucleotide array data. J. Comput. Biol. 13, 579–613.

6. Gottardo, R., Li, W., Johnson, W. E. and Liu, X. S. (2008) A flexible and powerful Bayesian hierarchical model for ChIP-chip experiments. Biometrics 64, 468–478.


7. Johnson, W. E., Li, W., Meyer, C. A., Gottardo, R., Carroll, J. S., Brown, M. and Liu, X. S. (2006) Model-based analysis of tiling-arrays for ChIP-chip. Proc. Natl. Acad. Sci. U.S.A. 103, 12457–12462.

8. Naef, F. and Magnasco, M. O. (2003) Solving the riddle of the bright mismatches: labeling and effective binding in oligonucleotide arrays. Phys. Rev. E: Stat. Phys. Plasmas Fluids 68, 011906.

9. Wu, Z. and Irizarry, R. A. (2003) Preprocessing of oligonucleotide array data. Nat. Biotechnol. 22, 656–8.

10. Wu, Z. and Irizarry, R. A. (2005) Stochastic models inspired by hybridization theory for short oligonucleotide arrays. J. Comput. Biol. 12, 882–93.

11. Cleveland, W. S. (1979) Robust locally weighted regression and smoothing scatterplots. J. Am. Stat. Assoc. 74, 829–836.

12. Peng, S., Alekseyenko, A. A., Larschan, E., Kuroda, M. I. and Park, P. J. (2007) Normalization and experimental design for ChIP-chip data. BMC Bioinformatics 8, 219.

13. Bolstad, B. M., Irizarry, R. A., Astrand, M. and Speed, T. P. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19, 185–93.

14. Keles, S. (2007) Mixture modeling for genome-wide localization of transcription factors. Biometrics 63, 10–21.

15. Royce, T. E., Rozowsky, J. S. and Gerstein, M. B. (2007) Assessing the need for sequence-based normalization in tiling microarray experiments. Bioinformatics 23, 988–97.

16. Buck, M. J., Nobel, A. B. and Lieb, J. D. (2005) Chipotle: A user-friendly tool for the analysis of ChIP-chip data. Genome Biol. 6, R97.

17. Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Stat. Methodol. 57, 289–300.

18. Li, W., Meyer, C. A. and Liu, X. S. (2005) A hidden Markov model for analyzing ChIP-chip experiments on genome tiling arrays and its application to p53 binding sequences. Bioinformatics 21, i274–i282.

19. Ji, H. and Wong, W. H. (2005) Tilemap: create chromosomal map of tiling array hybridizations. Bioinformatics 21, 3629–3636.

20. Qi, Y., Rolfe, A., MacIsaac, K., Gerber, G. and Pokholok, D. (2006) High-resolution computational models of genome binding events. Nat. Biotechnol. 24, 963–970.

21. Carroll, J. S., Liu, X. S., Brodsky, A. S., Li, W., Meyer, C. A., Szary, A. J., Eeckhoute, J., Shao, W., Hestermann, E. V., Geistlinger, T. R., Fox, E. A., Silver, P. A. and Brown, M. (2005) Chromosome-wide mapping of estrogen receptor binding reveals long-range regulation requiring the Forkhead protein FoxA1. Cell 122, 33–43.

22. Johnson, D. S., Li, W., Gordon, D. B., Bhattacharjee, A., Curry, B., Ghosh, J., Brizuela, L., Carroll, J. S., Brown, M., Flicek, P., Koch, C., Dunham, I., Bieda, M., Xu, X., Farnham, P., Kapranov, P., Nix, D., Gingeras, T. R., Zhang, X., Holster, H. L., Jiang, N., Green, R., Song, J., McCuine, S., Anton, E., Nguyen, L., Trinklein, N., Ye, Z., Ching, K., Hawkins, D., Ren, B., Scacheri, P. C., Rozowsky, J. S., Karpikov, A., Euskirchen, G. M., Weissman, S., Gerstein, M. B., Snyder, M., Yang, A., Moqtaderi, Z., Hirsch, H., Shulha, H. P., Fu, Y., Weng, Z., Struhl, K., Myers, R. M., Lieb, J. and Liu, X. S. (2008) Systematic evaluation of variability in ChIP-chip experiments using predefined DNA targets. Genome Res. 18, 393.

23. Gentleman, R. C., Carey, V. J., Bates, D. M., Bolstad, B. M., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., Hornik, K., Hothorn, T., Huber, W., Iacus, S., Irizarry, R. A., Leisch, F., Li, C., Maechler, M., Rossini, A. J., Sawitzki, G., Smith, C., Smyth, G. K., Tierney, L., Yang, J. Y. H. and Zhang, J. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5, R80.

24. Newton, M. A., Kendziorski, C. M., Richmond, C. S., Blattner, F. R. and Tsui, K. W. (2001) On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J. Comput. Biol. 8, 37–52.

25. Gottardo, R., Pannucci, J. A., Kuske, C. R. and Brettin, T. S. (2003) Statistical analysis of microarray data: a Bayesian approach. Biostatistics 4, 597–620.

26. Gottardo, R., Raftery, A. E., Yeung, K. Y. and Bumgarner, R. E. (2006) Bayesian robust inference for differential gene expression in microarrays with multiple samples. Biometrics 62, 10–18.

27. Newton, M. A., Noueiry, A., Sarkar, D. and Ahlquist, P. (2004) Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics 5, 155–76.

