ChIP-seq data: quality control, read mapping and peak calling · 2015. 8. 6. · ChIP-seq data: quality control, read mapping and peak calling Data included raw ChIP-seq reads of

ChIP-seq data: quality control, read mapping and peak calling

Data included raw ChIP-seq reads of two replicates (different individuals) each of human, rhesus

macaque, mouse (C57BL6/J strain), Brown Norway rat and dog for the transcription factors CEBPA,

FOXA1, HNF4A and ONECUT1 (previously known as HNF6). All individuals used were male, except for

one female macaque CEBPA replicate. All of this data has been previously deposited under a single EBI

ArrayExpress accession (E-MTAB-1509).

Raw reads in FASTQ format were downloaded from this accession and were trimmed to 36

base-pairs (bp) before we discarded any reads with more than 20% of bases with a quality score less

than 20 using FASTX Toolkit (v0.0.13.2; http://hannonlab.cshl.edu/fastx_toolkit/; note that by default

this program assumes Phred+64 (older Illumina) quality scores; Phred+33 (Sanger) quality scores are indicated by the option -Q 33).

We removed mapped reads with a mapping quality (MAPQ) score below 20 using SAMtools (v0.1.19-44428cd;

Li et al. 2009). The unmasked Ensembl reference genomes used for these alignments and

downstream analyses were: hg19 (human), MMUL_1 (macaque), mm10 (mouse), rno5 (rat) and cfa3

(dog).
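
The read-level filters described above can be sketched in a few lines of Python. This is an illustration only, not the FASTX Toolkit or SAMtools implementation: the function names are hypothetical, and the sketch assumes Phred+33 quality encoding.

```python
from typing import Tuple


def trim_read(seq: str, qual: str, length: int = 36) -> Tuple[str, str]:
    """Trim a read and its quality string to at most `length` bases."""
    return seq[:length], qual[:length]


def passes_quality(qual: str, min_q: int = 20, max_low_frac: float = 0.20,
                   offset: int = 33) -> bool:
    """Keep a read unless more than `max_low_frac` of its bases have quality < `min_q`.

    `offset` is the ASCII offset of the quality encoding (assumed Phred+33 here).
    """
    if not qual:
        return False
    low = sum(1 for ch in qual if ord(ch) - offset < min_q)
    return low / len(qual) <= max_low_frac


def passes_mapq(mapq: int, min_mapq: int = 20) -> bool:
    """Keep an alignment only if its mapping quality is at least `min_mapq`."""
    return mapq >= min_mapq
```

With 36 bp reads, the 20% cut-off tolerates at most seven low-quality bases per read (7/36 is about 19.4%, while 8/36 is about 22.2%).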

In order to gauge whether using SWEMBL to call peaks would result in substantially different

results, peak calling was also done using MACS (macs2 v2.0.10; Zhang et al. 2008) with default

significance options. ChIP-seq experiments in mouse and rat, which are based on inbred laboratory

strains, show less variation in output between these two peak callers and across replicates. These samples

tend to have higher values of ChIP-seq quality metrics, including the normalized strand cross-correlation

coefficient (NSC), relative strand cross-correlation coefficient (RSC) and non-redundant fraction (NRF), as described in

Landt et al. 2012 (Supplementary Table 1). The NSC and RSC for each sample were determined with

phantompeakqualtools (v2.0; http://code.google.com/p/phantompeakqualtools; Kundaje et al. 2015),

which uses a modified version of the peak caller SPP (v1.10.1; Kharchenko et al. 2008). These quality

statistics are consistent with adequate quality data based on ENCODE standards (Landt et al. 2012).

Peak overlaps were determined with BEDTools (v2.19.0; Quinlan and Hall 2010). New genomic

coordinates of these overlapping peaks were determined by a weighted average of the coordinates of

each peak, with weights proportional to their significance scores reported by SWEMBL. In other words,

the genomic coordinates of overlapping peaks used for downstream analyses were proportionately closer

to whichever peak had a higher score.
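
As a minimal sketch of the score-weighted averaging described above, the following Python function merges two overlapping peaks. The function name, the tuple layout, the half-open coordinate convention and the rounding to whole bases are assumptions made for illustration, not details taken from the original pipeline.

```python
from typing import Tuple

# Hypothetical peak representation: (start, end, significance score)
Peak = Tuple[int, int, float]


def merge_overlapping_peaks(a: Peak, b: Peak) -> Tuple[int, int]:
    """Return merged (start, end) as a score-weighted average of two peaks."""
    (s1, e1, w1), (s2, e2, w2) = a, b
    # Half-open intervals overlap only if the later start precedes the earlier end.
    if min(e1, e2) <= max(s1, s2):
        raise ValueError("peaks do not overlap")
    total = w1 + w2
    start = round((s1 * w1 + s2 * w2) / total)
    end = round((e1 * w1 + e2 * w2) / total)
    return start, end
```

For example, merging (100, 200) with score 3 and (140, 240) with score 1 gives (110, 210), which lies closer to the higher-scoring peak.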

LASTZ alignments

Multi-species alignments around the Gulo and Uox loci were produced with LASTZ (v1.02) and

the threaded blockset aligner (TBA; v12; Kent et al. 2003). These alignments were restricted to a single synteny block

in order to verify that any bound regions were truly orthologous across all columns of the alignment.

The output MAF blocks were then turned into a single global multi-species alignment. Peak coordinates

were projected onto this alignment, which allowed orthologous binding events across species to be

identified. To visualize binding events near the Uox locus, the reads per billion (RPB) for each species

and TF (both replicates combined) were plotted along this multi-species alignment alongside the input

RPB.
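
The RPB normalization is not defined explicitly in the text. The sketch below assumes that RPB is the number of reads in a window scaled to one billion mapped reads, and that replicates are combined by pooling their counts; both the formula and the function names are assumptions for illustration.

```python
from typing import Sequence


def reads_per_billion(window_count: int, total_mapped: int) -> float:
    """Scale a window read count to reads per billion mapped reads (assumed definition)."""
    if total_mapped <= 0:
        raise ValueError("total_mapped must be positive")
    return window_count * 1e9 / total_mapped


def pooled_rpb(window_counts: Sequence[int], totals: Sequence[int]) -> float:
    """RPB for several replicates combined by pooling reads (assumed way of combining)."""
    return reads_per_billion(sum(window_counts), sum(totals))
```

Under this definition, 5 reads in a window from a library of 20 million mapped reads correspond to 250 RPB.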

Mammalian transcriptome analysis

To identify liver expressed orthologs across all five mammals, previously published RNA-seq data

from the livers of all five mammals (E-MTAB-890; Kutter et al. 2011) were aligned with Bowtie2 and

TopHat2 (v2.0.10; Kim et al. 2013) and analyzed using Cufflinks (v2.2.1; Trapnell et al. 2010). Filtering

and preparation of RNA-seq reads were done as described above for the ChIP-seq data. Ensembl’s BioMart database

(Kinsella et al. 2011) was used to retrieve known one-to-one orthologs across all five mammals.

Binding level analysis

Peak intensity for peak k (g_{ijks}) was calculated using the following equation:

\[ g_{ijks} = \log\left( \frac{c_{ijks} / NC_{js}}{i_{ijks} / NI_{js}} \right) \]

Where: c_{ijks} is the number of TF ChIP enriched reads bound ±50 bp around the summit of peak k, and i_{ijks} is

the number of input reads bound ±250 bp around the summit of peak k. The indices indicate that this

is a peak of TF j within 100 kb of the TSS of gene i in species s. A pseudocount of one read was added to

both c_{ijks} and i_{ijks}, which were then multiplied by 1 × 10^6 and 2 × 10^5 respectively to make ratios easier to

work with and to correct for the different number of bases scanned (~5-fold difference). NC_{js} and NI_{js}

correspond to the total number of reads bound genome-wide in the TF ChIP and input experiments

respectively. Due to lower coverage, a wider range was scanned for input reads (501 bp). A smaller

range (101 bp) was used to scan for the TF ChIP enriched reads to help prevent counting reads from

partially overlapping peaks. Bamtools (v2.3.0; https://github.com/pezmaster31/bamtools) was used to

count the number of mapped, filtered reads for each of the criteria above from bam files of each

replicate. This approach means that intensities of peaks found in both replicates tend to be higher,

which is advantageous since these bound regions are more likely to be functional cis-regulatory

modules. In contrast, a peak called in only a single replicate by just barely passing the significance

threshold will contribute much less to a gene’s binding level and be down-weighted due to the lack of

reads mapping in the other replicate. However, it is possible that many of the peaks bound in only a

single replicate have a considerable number of reads bound in that region of the other replicate as well,

but simply not enough to be called confidently as a peak. Rather than discarding this peak as is usually

done, this approach of determining binding intensity allows peaks to be down- or up-weighted

appropriately by the ChIP read enrichment over both replicates. We obtained combined TF binding

levels per gene by summing the standard scores for each TF. Maximum likelihood ancestral

binding levels (for the combined measure of the four TFs) were then reconstructed under a Brownian motion

(BM) model in R using the “ace” function of the ape package (Paradis et al. 2004). These standard scores

were then converted into P-values. Human pseudogene P-values were combined using Fisher’s method.
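
The peak intensity calculation described at the start of this section can be sketched as follows. This is a hedged illustration rather than the original code: the function name is hypothetical, the natural logarithm is assumed because the base is not stated, and the read counts are taken as already extracted from the BAM files.

```python
import math


def peak_intensity(chip_reads: int, input_reads: int,
                   chip_total: int, input_total: int,
                   chip_scale: float = 1e6, input_scale: float = 2e5,
                   pseudocount: int = 1) -> float:
    """Log ratio of scaled ChIP to scaled input read density at one peak summit.

    chip_reads  : ChIP reads within +/-50 bp of the summit (101 bp window)
    input_reads : input reads within +/-250 bp of the summit (501 bp window)
    chip_total, input_total : genome-wide read totals of each experiment
    The 1e6 versus 2e5 scaling corrects for the ~5-fold wider input window.
    """
    chip = (chip_reads + pseudocount) * chip_scale / chip_total
    inp = (input_reads + pseudocount) * input_scale / input_total
    return math.log(chip / inp)  # natural log is an assumption
```

For instance, with equal library sizes, 49 ChIP reads and 9 input reads give a scaled ratio of (50 × 10^6) / (10 × 2 × 10^5) = 25, so the intensity is log(25).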

Ensembl multi-species alignments

After parsing out alignments discussed in main text, the sequences for these species were re-mapped to

the corresponding reference genome using BLAT (v35x1; Kent 2002), which confirmed that the alignment

sequences and coordinates matched correctly. The 2 kb flanking blocks were split into smaller 500 bp blocks

before determining whether their coordinates intersected with other peaks or exons. After acquiring

these multi-species alignments, genomic coordinates in each focal species were determined in each

alignment +/- 150bp from the summit (taken as the centre of a peak) based on coordinates of the

species in which the peak was called.
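
The two coordinate manipulations described above (splitting flanks into 500 bp blocks and taking ±150 bp around a summit) can be sketched as below. The function names and the half-open, zero-based coordinate convention are assumptions for illustration.

```python
from typing import List, Tuple


def split_into_blocks(start: int, end: int, block_size: int = 500) -> List[Tuple[int, int]]:
    """Split the half-open interval [start, end) into consecutive blocks of `block_size` bp."""
    return [(s, min(s + block_size, end)) for s in range(start, end, block_size)]


def summit_window(peak_start: int, peak_end: int, flank: int = 150) -> Tuple[int, int]:
    """Half-open window of +/-`flank` bp around the summit, taken as the peak centre."""
    summit = (peak_start + peak_end) // 2
    return max(0, summit - flank), summit + flank + 1
```

A 2 kb flank such as [1000, 3000) yields four 500 bp blocks, each of which can then be tested separately for intersection with peaks or exons.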

Inferring substitution rates, binding strengths and changes in binding strength

Primate sequences within the human-tarsier clade were parsed from the alignments described above.

Specifically, the subset of the phylogeny used for each alignment included eight primate species: human,

chimp, gorilla, orangutan, gibbon, macaque, marmoset and tarsier (outgroup) and any alignments with

no aligned sequence from any of these species were removed from the analysis. The Newick format of

this tree is (neutral substitutions per site, taken from Ensembl):

(((((((Human:0.0059,Chimp:0.0064):0.0019,Gorilla:0.0086):0.008,Orangutan:0.017):0.0026,Gibbon:0.0191):0.0095,Macaque:0.0352):0.0147,Marmoset:0.0687):0.0565,Tarsier:0.1392);
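
To make the tree easier to inspect, the sketch below parses the Newick string above and sums branch lengths from the root to each species. It is a small illustrative parser written for this one tree (names without quotes or spaces, a branch length on every non-root node), not a general Newick implementation; the function name is hypothetical.

```python
import re
from typing import Dict, List, Optional

NEWICK = ("(((((((Human:0.0059,Chimp:0.0064):0.0019,Gorilla:0.0086):0.008,"
          "Orangutan:0.017):0.0026,Gibbon:0.0191):0.0095,Macaque:0.0352):0.0147,"
          "Marmoset:0.0687):0.0565,Tarsier:0.1392);")


def root_to_tip(newick: str) -> Dict[str, float]:
    """Return the summed branch length from the root to every leaf."""
    tokens = re.findall(r"[(),;]|[^(),;]+", newick.strip())
    dist: Dict[str, float] = {}
    stack: List[List[str]] = []          # leaves seen so far in each open clade
    closed: Optional[List[str]] = None   # leaves of the clade that was just closed
    for tok in tokens:
        if tok == "(":
            stack.append([])
            closed = None
        elif tok == ")":
            closed = stack.pop()
            if stack:
                stack[-1].extend(closed)
        elif tok in ",;":
            closed = None
        else:
            name, _, length_text = tok.partition(":")
            length = float(length_text) if length_text else 0.0
            if closed is not None:
                # Branch leading to an internal node: add it to every leaf below.
                for leaf in closed:
                    dist[leaf] += length
                closed = None
            else:
                dist[name] = length
                if stack:
                    stack[-1].append(name)
    return dist
```

Calling `root_to_tip(NEWICK)` returns one entry per species; for example, the human value is the sum of the seven branches between the root and the human tip.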

As mentioned in main text, all ancestral sequences across the primate phylogeny were

reconstructed using the “prequel” program of PHAST (v1.3; Hubisz et al. 2011), which outputs the

probabilities of each base at each column in the alignment in an ancestral sequence. Any base with

probability >= 0.7 at a given column position was taken to be correct. Otherwise, an IUPAC ambiguity

code (Dixon et al. 1985) was used at that column for all bases with probabilities >= 0.1. All ancestral and

extant species sequences were scanned by a custom Perl script to score all possible TFBS through the

standard binding strength score (S):

\[ S = \sum_{i=1}^{k} \log\left( \frac{b_i}{g_i} \right) \]

Where the motif is of length k, and b_i and g_i are the PWM and background frequencies of the base at

position i in the motif respectively.
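
The rule used to call ancestral bases from the prequel posterior probabilities (a single base if its probability is at least 0.7, otherwise an IUPAC code covering all bases with probability at least 0.1) can be sketched as follows. The function name and the dictionary input format are assumptions for illustration; prequel's own output format is not reproduced here.

```python
from typing import Dict

# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def call_base(probs: Dict[str, float], confident: float = 0.7, include: float = 0.1) -> str:
    """Call one ancestral base from posterior probabilities for A, C, G and T."""
    best = max(probs, key=probs.get)
    if probs[best] >= confident:
        return best
    kept = frozenset(base for base, p in probs.items() if p >= include)
    return IUPAC[kept]
```

For example, probabilities of 0.5 for A and 0.45 for G (with the remainder split between C and T) give the purine code R.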

Position weight matrices (PWMs) for each TF were acquired from online databases of TF binding

specificities. Specifically, the PWMs used in this study are based on experimentally verified binding sites

for each of the four TFs and not PWMs based on ChIP-seq enrichment to minimize circularity. We

downloaded the PWMs of CEBPA, FOXA1 and HNF4A from JASPAR (Mathelier et al. 2014). The

mammalian ONECUT1 PWM was not available in this database, so we downloaded its PWM from

HOCOMOCO (Kulakovskiy et al. 2013), a PWM repository specifically for humans. For all four of these

PWMs we took the expected numbers of bases observed at each position given 50 TFBS (i.e. we

downsampled the PWMs to the same number of binding sites). We then added a pseudocount of 1 to

each PWM in order to “soften” the PWM (Moses and Sinha 2009). Softening the PWMs proved to be

important, since otherwise inferred changes in strength (see below) could be extremely large.
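
A minimal sketch of the PWM softening and of the binding strength score S follows. It rests on several assumptions that are not stated in the text: that the pseudocount of 1 is added to every cell of the 50-site count matrix, that S is a sum of per-position log-odds using the natural logarithm, and that the background is supplied by the caller (uniform in the example). The function names are hypothetical, and sites containing ambiguity codes are not handled.

```python
import math
from typing import Dict, List

BASES = "ACGT"
Column = Dict[str, float]


def soften_pwm(freqs: List[Column], n_sites: int = 50, pseudocount: float = 1.0) -> List[Column]:
    """Rescale each PWM column to expected counts for `n_sites` sites, add a pseudocount, renormalize."""
    softened = []
    for column in freqs:
        counts = {b: column[b] * n_sites + pseudocount for b in BASES}
        total = sum(counts.values())
        softened.append({b: counts[b] / total for b in BASES})
    return softened


def binding_score(site: str, pwm: List[Column], background: Column) -> float:
    """Sum of log(PWM frequency / background frequency) over the positions of `site`."""
    if len(site) != len(pwm):
        raise ValueError("site and PWM must have the same length")
    return sum(math.log(pwm[i][base] / background[base]) for i, base in enumerate(site))
```

Without the pseudocount, a base never observed at a position would have frequency zero and an infinitely negative score, which is why unsoftened matrices can produce extreme changes in strength.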

Substitutions were inferred at all ungapped columns overlapping motif matches with S > 0 in any

sequence, based upon these ancestral sequences. Distributions of S across all primates were also

inferred, using this same data. Additionally, by combining the substitution and S values, the change in

binding strength (∆S) was inferred (Moses 2009):

\[ \Delta S_{iab} = \log\left( \frac{f_{ib}}{g_{ib}} \right) - \log\left( \frac{f_{ia}}{g_{ia}} \right) \]

For a substitution from base a to base b (both from the set {A, C, G, T}), i is the position in the motif, g is

the background frequency, f is the PWM frequency and k is the total motif length.
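
The change in binding strength for a single substitution can be sketched in the same way. As before, this assumes a log-odds form with the natural logarithm; the function name and the example frequencies are hypothetical.

```python
import math
from typing import Dict


def delta_s(pwm_column: Dict[str, float], background: Dict[str, float],
            old_base: str, new_base: str) -> float:
    """Change in log-odds score at one motif position for a substitution old_base -> new_base."""
    new = math.log(pwm_column[new_base] / background[new_base])
    old = math.log(pwm_column[old_base] / background[old_base])
    return new - old
```

With a column of frequencies A = 0.7, C = G = T = 0.1 and a uniform background, an A-to-C substitution gives log(0.1 / 0.7), a decrease in strength, and the reverse substitution gives the same magnitude with the opposite sign.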

The distribution of ∆S values can then be calculated along any arbitrary primate lineage. Importantly,

the direction of change in binding strength is informative (i.e. increases or decreases binding strength),

so for example, a large gain in binding strength could be detected in TFBSs only along the human

lineage, which might indicate the gain (or at least increase in strength) of a new TFBS unique to humans

that is being bound. 51 bp windows around the peak summits were used to enrich for high scoring

motifs since these regions are enriched for sequences matching the motif of each TF (presumably since

they are real TFBSs). We called substitutions in TFBSs with at least intermediate S scores, based upon

the empirical distribution of S scores (CEBPA > 1, FOXA1 > 3, HNF4A > 3, ONECUT1 > 1).

References

Ballester B, Medina-Rivera A, Schmidt D, Gonzàlez-Porta M, Carlucci M, Chen X, Chessman K, Faure AJ,

Funnell AP, Goncalves A, et al. 2014. Multi-species, multi-transcription factor binding highlights

conserved control of tissue-specific biological pathways. Elife 3:1–29.

Cunningham F, Amode MR, Barrell D, Beal K, Billis K, Brent S, Carvalho-Silva D, Clapham P, Coates G,

Fitzgerald S, et al. 2014. Ensembl 2015. Nucleic Acids Res. 43:D662–D669.

Dixon HBF, Bielka H, Cantor CR. 1985. Nomenclature for incompletely specified bases in nucleic acid

sequences. J. Biol. Chem. 261:13–17.

Funnell APW, Wilson MD, Ballester B, Mak KS, Burdach J, Magan N, Pearson RCM, Lemaigre FP, Stowell

KM, Odom DT, et al. 2013. A CpG mutational hotspot in a ONECUT binding site accounts for the

prevalent variant of hemophilia B Leyden. Am. J. Hum. Genet. 92:460–467.

Hubisz MJ, Pollard KS, Siepel A. 2011. Phast and Rphast: Phylogenetic analysis with space/time models.

Brief. Bioinform. 12:41–51.

Kent WJ. 2002. BLAT: The BLAST-Like Alignment Tool. Genome Res. 12:656–664.

Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. 2003. Evolution’s cauldron: Duplication, deletion,

and rearrangement in the mouse and human genomes. Proc. Natl. Acad. Sci. USA 100:11484–11489.

Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. 2013. TopHat2: accurate alignment of

transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 14:R36.

Kinsella RJ, Kähäri A, Haider S, Zamora J, Proctor G, Spudich G, Almeida-King J, Staines D, Derwent P,

Kerhornou A, et al. 2011. Ensembl BioMarts: A hub for data retrieval across taxonomic space.

Database bar030:1–9.

Kharchenko PV, Tolstorukov MY, Park PJ. 2008. Design and analysis of ChIP-seq experiments for

DNA-binding proteins. Nat. Biotechnol. 26:1351–1359.

Kulakovskiy IV, Medvedeva YA, Schaefer U, Kasianov AS, Vorontsov IE, Bajic VB, Makeev VJ. 2013.

HOCOMOCO: A comprehensive collection of human transcription factor binding sites models.

Nucleic Acids Res. 41:195–202.

Kundaje A, Jung LY, Kharchenko P, Wold B, Sidow A, Batzoglou S, Park P. 2015. Assessment of ChIP-seq

data quality using cross-correlation analysis (submitted).

Kutter C, Brown GD, Gonçalves A, Wilson MD, Watt S, Brazma A, White RJ, Odom DT. 2011. Pol III

binding in six mammals shows conservation among amino acid isotypes despite divergence among

tRNA genes. Nat. Genet. 43:948–955.

Landt SG, Marinov GK, Kundaje A, Frazer KA, et al. 2012. ChIP-seq guidelines and practices of the

ENCODE and modENCODE consortia. Genome Res. 22:1813–1831.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R. 2009. The

Sequence Alignment/Map format and SAMtools. Bioinformatics 25:2078–2079.

Mathelier A, Zhao X, Zhang AW, Parcy F, Worsley-Hunt R, Arenillas DJ, Buchman S, Chen CY, Chou A,

Ienasescu H, et al. 2014. JASPAR 2014: An extensively expanded and updated open-access

database of transcription factor binding profiles. Nucleic Acids Res. 42:1–6.

Moses AM. 2009. Statistical tests for natural selection on regulatory regions based on the strength of

transcription factor binding sites. BMC Evol. Biol. 9:286.

Moses A, Sinha S. 2009. Chapter: Regulatory Motif Analysis, in Bioinformatics, Tools and Applications.

Edwards D et al. (eds.) Springer-Verlag, New York.

Paradis E, Claude J, Strimmer K. 2004. APE: Analyses of phylogenetics and evolution in R language.

Bioinformatics 20:289–290.

Quinlan AR, Hall IM. 2010. BEDTools: A flexible suite of utilities for comparing genomic features.

Bioinformatics 26:841–842.

Schmidt D, Wilson MD, Ballester B, Schwalie PC, Brown GD, Marshall A, Kutter C, Watt S,

Martinez-Jimenez CP, Mackay S, et al. 2010. Five-vertebrate ChIP-seq reveals the evolutionary dynamics of transcription factor binding.

Science 328:1036–1040.

Schmidt D, Schwalie PC, Wilson MD, Ballester B, Gonçalves A, Kutter C, Brown GD, Marshall A, Flicek P,

Odom DT. 2012. Waves of retrotransposon expansion remodel genome organization and CTCF

binding in multiple mammalian lineages. Cell 148:335–348.

Stefflova K, Thybert D, Wilson MD, Streeter I, Aleksic J, Karagianni P, Brazma A, Adams DJ, Talianidis I,

Marioni JC, et al. 2013. Cooperativity and rapid evolution of cobound transcription factors in

closely related mammals. Cell 154:530–540.

Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L.

2010. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and

isoform switching during cell differentiation. Nat. Biotechnol. 28:511–515.

Wong ES, Thybert D, Schmitt BM, Stefflova K, Odom DT, Flicek P. 2015. Decoupling of evolutionary

changes in transcription factor binding and gene expression in mammals. Genome Res.

25:167–178.

Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M,

Li W, et al. 2008. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 9:R137.

Supplementary Table 1: QC statistics, including the fraction of reads in peaks (FRiP; red = low, green = high), the

normalized strand cross-correlation coefficient (NSC), the relative strand cross-correlation coefficient (RSC) and the non-redundant fraction (NRF).

Supplementary Table 2: Proportion overlap between ChIP-seq replicates (two replicates of each

experiment) at reciprocal overlap cut-offs of 33% and 75%. A cut-off of 75% captures the majority of

overlapping peaks called at a cut-off of 33%. The last three columns show the total number of binding

events (called by SWEMBL) in each replicate as well as the final number of peaks carried forward

(including peaks reproducible in each replicate and those unique to each replicate).

Supplementary Table 3: Likelihood ratio tests performed on deeply shared cis-regulatory modules (CRMs) near

the 23 liver-specific genes (Ensembl gene ID shown) and UOX. Peak coordinates indicate the peak calls used in this

analysis; however, ±200 bp from peak summits was used for all of these tests. Scale factors are shown relative to

branch lengths based upon unbound, flanking DNA of all 23 genes (one neutral model based upon all genes).

Importantly, these branch lengths were scaled by the scaling factor called in macaque prior to this analysis.

Therefore, the scaling factors shown are relative to a scale factor of 1 along the macaque branch. “Null scale”

refers to the scaling factor fitted to the entire tree (i.e. same scaling factor for the human:gibbon lineage and

macaque). “Alt scale” indicates the scaling factor fitted to the human:gibbon lineage of the phylogeny (the scaling

factor along the macaque branch is always 1 under the alternative hypothesis). LRT indicates 2 × (log-likelihood ratio) of

the null and alternative models.

Supplementary Figure 1: Multi-species alignment around the Gulo locus in relative alignment

coordinates. Specifically, this alignment corresponds to +/- 100kb around the human Gulo TSS

(chr8:27,317,791-27,517,791). Thick black lines correspond to DNA for each species, while the thinner

line indicates a gap at that position. Called peaks in each species are indicated by a different coloured

circle above the alignment as described in the legend above. The Gulo locus is coloured blue at the

bottom of the alignment and nearby genes are shown in red. * indicates the Gulo promoter, where

there is marked loss in binding in primates compared to the other species.

Supplementary Figure 2: Tests for degeneration along the macaque lineage as described in the main text

and as shown in figure 2 along the human lineage. (A) The difference in expression level between the

rodent ancestor and macaque. (B) The analogous difference in binding level between the rodent

ancestor and macaque. Species trees show the inferred values of the quantitative traits. The white

circles indicate the ancestral rodent and the red lines indicate the lineages being compared. Gray

histograms correspond to the standard score of the distribution of changes for 1116 liver expressed

genes with one-to-one orthologs across the five mammals. Gulo and Uox are indicated in terms of their

standard scores by the blue and red lines respectively.

Supplementary Figure 3: TF-bound sequences (“peaks”) evolve more slowly than unbound sequences

(“flanks”). (A) Substitutions per site along the human-gibbon lineage in human and macaque peaks and

flanks (where flanks correspond to unbound, non-coding DNA flanking the bound sequences). Peak

sequences are evolving more slowly than flanks (Wilcoxon tests; ** indicates significance at P < 10^-6). (B)

Substitutions per site along the macaque branch in human and macaque peaks and flanks. Macaque

peak sequences are evolving more slowly, but not human peaks (Wilcoxon tests; * indicates significance

at Bonferroni-corrected P < 0.05). Both panels A and B are based upon peaks and flanks within 100 kb of the 23

liver-specific genes.

Supplementary Figure 4: Changes in binding strength (∆S) are not distributed differently between flanks

and peak summits for both liver-specific genes and Uox. (A) Human peaks nearby liver-specific genes; (B)

Human peaks nearby Uox; (C) Macaque peaks nearby liver-specific genes; (D) Macaque peaks nearby

Uox. Black dots correspond to the actual substitutions in each dataset. In peak sequences, substitutions

were called within ±25 bp of the peak summits.
