ChIP-seq data: quality control, read mapping and peak calling
Data included raw ChIP-seq reads of two replicates (different individuals) each of human, rhesus
macaque, mouse (C57BL6/J strain), Brown Norway rat and dog for the transcription factors CEBPA,
FOXA1, HNF4A and ONECUT1 (previously known as HNF6). All individuals used were male, except for
one female macaque CEBPA replicate. All of this data has been previously deposited under a single EBI
ArrayExpress accession (E-MTAB-1509).
Raw reads in FASTQ format were downloaded from this accession and were trimmed to 36
base-pairs (bp) before we discarded any reads with more than 20% of bases with a quality score less
than 20 using FASTX Toolkit (v0.0.13.2; http://hannonlab.cshl.edu/fastx_toolkit/; note that by default
this program assumes Phred+64 quality scores; Sanger (Phred+33) quality scores are indicated by the option -Q 33).
We removed mapped reads with a mapping quality (MAPQ) score less than 20 using SAMtools (v0.1.19-44428cd;
Li et al. 2009). The unmasked Ensembl reference genomes used for these alignments and
downstream analyses were: hg19 (human), MMUL_1 (macaque), mm10 (mouse), rno5 (rat) and cfa3
(dog).
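
The read filter described above (trim to 36 bp, then discard reads in which more than 20% of bases have quality below 20) can be illustrated with a minimal Python sketch. This is not the FASTX Toolkit implementation; the function names and the Phred+33 default offset are assumptions made for illustration.

```python
def decode_phred(qual_string, offset=33):
    """Convert a FASTQ quality string to integer Phred scores (offset is an assumption)."""
    return [ord(ch) - offset for ch in qual_string]


def trim_and_filter(seq, quals, length=36, min_q=20, max_low_frac=0.20):
    """Trim a read to `length` bases; return None if too many bases are low quality."""
    seq, quals = seq[:length], quals[:length]
    if not quals:
        return None
    low = sum(1 for q in quals if q < min_q)
    # Discard the read when MORE than 20% of its bases fall below the threshold.
    if low / len(quals) > max_low_frac:
        return None
    return seq, quals
```

With these defaults a 36 bp read survives with up to 7 low-quality bases (7/36 is about 19%) and is discarded with 8 or more.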
In order to gauge whether using SWEMBL to call peaks would result in substantially different
results, peak calling was also done using MACS (macs2 v2.0.10; Zhang et al. 2008) with default
significance options. ChIP-seq experiments in mouse and rat, which are based on inbred laboratory
strains, show less variation in output across both of these programs and across replicates. These samples
tend to have higher values of quality metrics for ChIP-seq experiments including the normalized strand
coefficient (NSC), relative strand correlation (RSC) and non-redundancy fraction (NRF) as described in
Landt et al. 2012 (Supplementary Table 1). The NSC and RSC for each sample were determined with
phantompeakqualtools (v2.0; http://code.google.com/p/phantompeakqualtools; Kundaje et al. 2015),
which uses a modified version of the peak caller SPP (v1.10.1; Kharchenko et al. 2008). These quality
statistics are consistent with adequate quality data based on ENCODE standards (Landt et al. 2012).
Peak overlaps were determined with BEDTools (v2.19.0; Quinlan and Hall 2010). New genomic
coordinates of these overlapping peaks were determined by a weighted average of the coordinates of
each peak, with weights proportional to their significance scores reported by SWEMBL. In other words,
the genomic coordinates of overlapping peaks used for downstream analyses were proportionately closer
to whichever peak had the higher score.
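
As an illustration of the score-weighted averaging described above, the following Python sketch merges two overlapping peaks. It assumes that start and end coordinates are averaged separately and rounded to whole bases; the function name and the tuple layout are hypothetical, not taken from the original pipeline.

```python
def merge_overlapping_peaks(peak_a, peak_b):
    """Merge two overlapping peaks given as (start, end, score) tuples.

    Each coordinate is a weighted average with weights proportional to the
    peak scores, so the result lies closer to the higher-scoring peak.
    """
    (start_a, end_a, score_a), (start_b, end_b, score_b) = peak_a, peak_b
    total = score_a + score_b
    start = (start_a * score_a + start_b * score_b) / total
    end = (end_a * score_a + end_b * score_b) / total
    # Rounding to integer coordinates is an assumption made for this sketch.
    return round(start), round(end)
```

For example, merging (100, 200, score 3) with (140, 260, score 1) gives (110, 215), which sits three times closer to the higher-scoring peak.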
LASTZ alignments
Multi-species alignments around the Gulo and Uox loci were produced with LASTZ (v1.02) and
the threaded block aligner (v12; Kent et al. 2003). These alignments were restricted to a single synteny
in order to verify that any bound regions were truly orthologous across all columns of the alignment.
The output MAF blocks were then turned into a single global multi-species alignment. Peak coordinates
were projected onto this alignment, which allowed orthologous binding events across species to be
identified. To visualize binding events near the Uox locus, the reads per billion (RPB) for each species
and TF (both replicates combined) were plotted along this multi-species alignment alongside the input
RPB.
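
The text does not spell out how RPB was computed. A minimal sketch, assuming RPB is the read count in a window scaled to one billion mapped reads in the library, could look like this (the function name and inputs are hypothetical):

```python
def reads_per_billion(window_counts, total_mapped_reads):
    """Scale raw per-window read counts to reads per billion mapped reads.

    Assumes RPB = count * 1e9 / total mapped reads; this definition is an
    assumption, since the source only names the unit.
    """
    scale = 1e9 / total_mapped_reads
    return [count * scale for count in window_counts]
```

Scaling by library size in this way makes ChIP and input tracks with different sequencing depths comparable on the same axis.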
Mammalian transcriptome analysis
To identify liver expressed orthologs across all five mammals, previously published RNA-seq data
from the livers of all five mammals (E-MTAB-890; Kutter et al. 2011) was aligned with Bowtie2 and
TopHat2 (v2.0.10; Kim et al. 2013) and analyzed using Cufflinks (v2.2.1; Trapnell et al. 2010). Filtering
and prepping of RNA-seq reads was done as above for ChIP-seq data. Ensembl’s Biomart database
(Kinsella et al. 2011) was used to retrieve known one-to-one orthologs across all five mammals.
Binding level analysis
Peak intensity for peak k ($g_{ijks}$) was calculated using the following equation:
$$g_{ijks} = \log\left(\frac{c_{ijks} / NC_{js}}{i_{ijks} / NI_{js}}\right)$$
Where: $c_{ijks}$ is the number of TF ChIP enriched reads bound +/- 50bp around the summit of peak k, and $i_{ijks}$ is
the number of input reads bound +/- 250bp around the summit of peak k. The indices indicate that this
is a peak of TF j within 100kb of the TSS of gene i in species s. A pseudocount of one read was added to
both $c_{ijks}$ and $i_{ijks}$, which were then multiplied by 1x10^6 and 2x10^5 respectively to make ratios easier to
work with and to correct for the different number of bases scanned (~5-fold difference). $NC_{js}$ and $NI_{js}$
correspond to the total number of reads bound genome-wide in the TF ChIP and input experiments
respectively. Due to lower coverage, a wider range was scanned for input reads (501 bp). A smaller
range (101 bp) was used to scan for the TF ChIP enriched reads to help prevent counting reads from
partially overlapping peaks. Bamtools (v2.3.0; https://github.com/pezmaster31/bamtools) was used to
count the number of mapped, filtered reads for each of the criteria above from bam files of each
replicate. This approach means that intensities of peaks found in both replicates tend to be higher,
which is advantageous since these bound regions are more likely to be functional cis-regulatory
modules. In contrast, a peak called in only a single replicate by just barely passing the significance
threshold will contribute much less to a gene’s binding level and be down-weighted due to the lack of
reads mapping in the other replicate. However, it is possible that many of the peaks bound in only a
single replicate have a considerable number of reads bound in that region of the other replicate as well,
but simply not enough to be called confidently as a peak. Rather than discarding this peak as is usually
done, this approach of determining binding intensity allows peaks to be down- or up-weighted
appropriately by the ChIP read enrichment over both replicates. We obtained combined TF binding
levels per gene by summing the standard scores for each TF. Maximum likelihood ancestral
binding levels (for the combined measure across the four TFs) were then reconstructed under a Brownian motion
(BM) model in R using the “ace” function of the ape package (Paradis et al. 2004). These standard scores
were then converted into P-values. Human pseudogene P-values were combined using Fisher’s method.
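
The peak intensity calculation defined at the start of this section can be sketched in Python as follows. The pseudocount and the two scaling constants follow the description above; the natural logarithm is an assumption (the base is not stated), and the function and parameter names are hypothetical.

```python
import math


def peak_intensity(chip_reads, input_reads, total_chip, total_input,
                   chip_scale=1e6, input_scale=2e5, pseudocount=1):
    """Log ratio of library-normalised ChIP reads to library-normalised input reads.

    chip_reads:  reads in the 101 bp window around the peak summit (TF ChIP)
    input_reads: reads in the 501 bp window around the peak summit (input)
    total_chip, total_input: genome-wide read totals for each experiment
    """
    c = (chip_reads + pseudocount) * chip_scale
    i = (input_reads + pseudocount) * input_scale
    # Natural log is assumed here; the source does not state the base.
    return math.log((c / total_chip) / (i / total_input))
```

Because the input window is about five times wider than the ChIP window, the smaller input multiplier (2x10^5 against 1x10^6) puts both counts on a per-base footing; with no reads in either window and equal library sizes the function returns log(5).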
Ensembl multi-species alignments
After parsing out the alignments discussed in the main text, the sequences for these species were re-mapped to
the corresponding reference genome using BLAT (v35x1; Kent 2002), which confirmed that the alignment
sequences and coordinates matched correctly. Flanking blocks of 2 kb were split into smaller 500 bp blocks
before determining whether their coordinates intersected with other peaks or exons. After acquiring
these multi-species alignments, genomic coordinates in each focal species were determined in each
alignment +/- 150bp from the summit (taken as the centre of a peak) based on coordinates of the
species in which the peak was called.
Inferring substitution rates, binding strengths and changes in binding strength
Primate sequences within the human-tarsier clade were parsed from the alignments described above.
Specifically, the subset of the phylogeny used for each alignment included eight primate species: human,
chimp, gorilla, orangutan, gibbon, macaque, marmoset and tarsier (outgroup). Any alignments with
no aligned sequence from any of these species were removed from the analysis. The Newick format of
this tree is (branch lengths in neutral substitutions per site, taken from Ensembl):
(((((((Human:0.0059,Chimp:0.0064):0.0019,Gorilla:0.0086):0.008,Orangutan:0.017):0.0026,Gibbon:0.0191):0.0095,Macaque:0.0352):0.0147,Marmoset:0.0687):0.0565,Tarsier:0.1392);
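
To make the branch lengths in this tree easier to inspect, the following self-contained Python sketch parses a Newick string and sums the branch lengths from the root to each tip. It is a small illustrative parser written for this example (it handles only the simple syntax used here), not part of the original analysis.

```python
PRIMATE_TREE = (
    "(((((((Human:0.0059,Chimp:0.0064):0.0019,Gorilla:0.0086):0.008,"
    "Orangutan:0.017):0.0026,Gibbon:0.0191):0.0095,Macaque:0.0352):0.0147,"
    "Marmoset:0.0687):0.0565,Tarsier:0.1392);"
)


def tip_distances(newick):
    """Return {tip name: summed branch length from the root} for a Newick string."""
    text = newick.strip().rstrip(";")
    pos = 0

    def read_token():
        # Read characters up to the next structural symbol.
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in "(),:":
            pos += 1
        return text[start:pos]

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # skip "("
            tips = {}
            while True:
                tips.update(parse_node())
                if text[pos] == ",":
                    pos += 1
                    continue
                pos += 1  # skip ")"
                break
            read_token()  # optional internal node label (unused)
        else:
            tips = {read_token(): 0.0}
        length = 0.0
        if pos < len(text) and text[pos] == ":":
            pos += 1
            length = float(read_token())
        # Add this node's branch length to every tip beneath it.
        return {name: dist + length for name, dist in tips.items()}

    return parse_node()
```

Calling `tip_distances(PRIMATE_TREE)` returns eight entries; the human root-to-tip distance is about 0.0991 substitutions per site, compared with 0.1392 for the tarsier outgroup.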
As mentioned in the main text, all ancestral sequences across the primate phylogeny were
reconstructed using the “prequel” program of PHAST (v1.3; Hubisz et al. 2011), which outputs the
probabilities of each base at each column in the alignment in an ancestral sequence. Any base with
probability >= 0.7 at a given column position was taken to be correct. Otherwise, an IUPAC ambiguity
code (Dixon et al. 1985) was used at that column for all bases with probabilities >= 0.1. All ancestral and
extant species sequences were scanned by a custom Perl script to score all possible TFBS through the
standard binding strength score (S):
$$S = \sum_{i=1}^{k} \log\left(\frac{b_i}{g_i}\right)$$
Where the motif is of length k and $b_i$ and $g_i$ are the PWM and background frequencies of the base at
position i in the motif respectively.
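
The two steps described above, calling an ancestral base from per-base probabilities and scoring a candidate site, can be sketched in Python as follows. This is not the custom Perl script used in the study: the function names and data layout are hypothetical, S is read as a sum of per-position log ratios of PWM to background frequency, the natural logarithm is assumed, and the handling of ambiguity codes during scoring is left out because the text does not describe it.

```python
import math

# IUPAC ambiguity codes keyed by the set of bases they represent.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def call_base(probs, confident=0.7, minimum=0.1):
    """Call one ancestral base from a dict of base -> posterior probability."""
    best = max(probs, key=probs.get)
    if probs[best] >= confident:
        return best
    # Otherwise report every base with probability >= 0.1 as an ambiguity code.
    candidates = frozenset(b for b, p in probs.items() if p >= minimum)
    return IUPAC.get(candidates, "N")


def binding_score(site, pwm, background):
    """Sum over motif positions of log(PWM frequency / background frequency).

    pwm is a list (one entry per motif position) of dicts base -> frequency;
    background is a dict base -> frequency. Only A, C, G and T are handled.
    """
    return sum(math.log(pwm[i][base] / background[base])
               for i, base in enumerate(site))
```

A site that matches the motif better than the background scores above zero, which is why S > 0 is a natural minimum threshold for a motif match.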
Position weight matrices (PWMs) for each TF were acquired from online databases of TF binding
specificities. Specifically, the PWMs used in this study are based on experimentally verified binding sites
for each of the four TFs and not PWMs based on ChIP-seq enrichment to minimize circularity. We
downloaded the PWMs of CEBPA, FOXA1 and HNF4A from JASPAR (Mathelier et al. 2014). The
mammalian ONECUT1 PWM was not available in this database, so we downloaded its PWM from
HOCOMOCO (Kulakovskiy et al. 2013), a PWM repository specifically for humans. For all four of these
PWMs we took the expected numbers of bases observed at each position given 50 TFBS (i.e. we
downsampled the PWMs to the same number of binding sites). We then added a pseudocount of 1 to
each PWM in order to “soften” the PWM (Moses and Sinha 2009). Softening the PWMs proved to be
important, since otherwise inferred changes in strength (see below) could be extremely large.
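
The downsampling and softening steps can be illustrated with a short Python sketch. It assumes that each PWM position is supplied as base frequencies, that the pseudocount of 1 is added to every cell of the expected-count matrix, and that rows are renormalised afterwards; the function name is hypothetical.

```python
def soften_pwm(freq_rows, n_sites=50, pseudocount=1.0):
    """Rescale a frequency matrix to expected counts at n_sites, add a pseudocount, renormalise.

    freq_rows is a list (one entry per motif position) of dicts base -> frequency.
    """
    softened = []
    for row in freq_rows:
        # Expected number of times each base is seen among n_sites binding sites.
        counts = {base: freq * n_sites + pseudocount for base, freq in row.items()}
        total = sum(counts.values())
        softened.append({base: count / total for base, count in counts.items()})
    return softened
```

Under these assumptions a base with frequency 0 ends up at 1/54 rather than 0, so its log ratio against the background stays finite and a single substitution can no longer produce an unbounded change in score.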
Substitutions were inferred at all ungapped columns overlapping motif matches with S > 0 in any
sequence, based upon these ancestral sequences. Distributions of S across all primates were also
inferred, using this same data. Additionally, by combining the substitution and S values, the change in
binding strength (∆S) was inferred (Moses 2009):
$$\Delta S_{iab} = \log\left(\frac{f_{ib}}{g_{ib}}\right) - \log\left(\frac{f_{ia}}{g_{ia}}\right)$$
for a substitution from base a to base b (both from the set {A, C, G, T}), where i is the position in the motif
(from 1 to k, the total motif length), g is the background frequency and f is the PWM frequency.
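
A minimal Python sketch of the change in binding strength for a single substitution follows. It reads each term as the log ratio of PWM to background frequency, assumes a position-independent background and the natural logarithm, and uses 0-based positions; the function name and data layout are hypothetical.

```python
import math


def delta_s(pwm, background, position, base_from, base_to):
    """Change in binding strength for a substitution base_from -> base_to.

    pwm is a list of dicts base -> frequency (one dict per motif position);
    background is a dict base -> frequency; position is 0-based.
    """
    after = math.log(pwm[position][base_to] / background[base_to])
    before = math.log(pwm[position][base_from] / background[base_from])
    return after - before


# Example: one motif position that strongly prefers A, with a uniform background.
example_pwm = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}]
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
loss = delta_s(example_pwm, uniform, 0, "A", "C")  # negative: binding weakened
gain = delta_s(example_pwm, uniform, 0, "C", "A")  # positive: binding strengthened
```

Under this reading, the ∆S of a substitution equals the score S of the site after the change minus the score before it, so the sign directly indicates whether binding strength is gained or lost.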
The distribution of ∆S values can then be calculated along any arbitrary primate lineage. Importantly,
the direction of change in binding strength is informative (i.e. increases or decreases binding strength),
so for example, a large gain in binding strength could be detected in TFBSs only along the human
lineage, which might indicate the gain (or at least increase in strength) of a new TFBS unique to humans
that is being bound. 51 bp windows around the peak summits were used to enrich for high scoring
motifs since these regions are enriched for sequences matching the motif of each TF (presumably since
they are real TFBSs). We called substitutions in TFBSs with at least intermediate S scores, based upon
the empirical distribution of S scores (CEBPA > 1, FOXA1 > 3, HNF4A > 3, ONECUT1 > 1).
References
Ballester B, Medina-Rivera A, Schmidt D, Gonzàlez-Porta M, Carlucci M, Chen X, Chessman K, Faure AJ,
Funnell AP, Goncalves A, et al. 2014. Multi-species, multi-transcription factor binding highlights
conserved control of tissue-specific biological pathways. Elife 3:1–29.
Cunningham F, Amode MR, Barrell D, Beal K, Billis K, Brent S, Carvalho-Silva D, Clapham P, Coates G,
Fitzgerald S, et al. 2014. Ensembl 2015. Nucleic Acids Res. 43:D662–D669.
Dixon HBF, Bielka H, Cantor CR. 1985. Nomenclature for incompletely specified bases in nucleic acid
sequences. J. Biol. Chem. 261:13–17.
Funnell APW, Wilson MD, Ballester B, Mak KS, Burdach J, Magan N, Pearson RCM, Lemaigre FP, Stowell
KM, Odom DT, et al. 2013. A CpG mutational hotspot in a ONECUT binding site accounts for the
prevalent variant of hemophilia B Leyden. Am. J. Hum. Genet. 92:460–467.
Hubisz MJ, Pollard KS, Siepel A. 2011. Phast and Rphast: Phylogenetic analysis with space/time models.
Brief. Bioinform. 12:41–51.
Kent WJ. 2002. BLAT: the BLAST-like alignment tool. Genome Res. 12:656–664.
Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. 2003. Evolution’s cauldron: duplication, deletion,
and rearrangement in the mouse and human genomes. Proc. Natl. Acad. Sci. USA 100:11484–11489.
Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. 2013. TopHat2: accurate alignment of
transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 14:R36.
Kinsella RJ, Kähäri A, Haider S, Zamora J, Proctor G, Spudich G, Almeida-King J, Staines D, Derwent P,
Kerhornou A, et al. 2011. Ensembl BioMarts: A hub for data retrieval across taxonomic space.
Database bar030:1–9.
Kharchenko PV, Tolstorukov MY, Park PJ. 2008. Design and analysis of ChIP-seq experiments for
DNA-binding proteins. Nat. Biotechnol. 26:1351–1359.
Kulakovskiy IV, Medvedeva YA, Schaefer U, Kasianov AS, Vorontsov IE, Bajic VB, Makeev VJ. 2013.
HOCOMOCO: A comprehensive collection of human transcription factor binding sites models.
Nucleic Acids Res. 41:195–202.
Kundaje A, Jung LY, Kharchenko P, Wold B, Sidow A, Batzoglou S, Park P. 2015. Assessment of ChIP-seq
data quality using cross-correlation analysis (submitted).
Kutter C, Brown GD, Gonçalves A, Wilson MD, Watt S, Brazma A, White RJ, Odom DT. 2011. Pol III
binding in six mammals shows conservation among amino acid isotypes despite divergence among
tRNA genes. Nat. Genet. 43:948–955.
Landt SG, Marinov GK, Kundaje A, Frazer KA et al. 2012. ChIP-seq guidelines and practices of the
ENCODE and modENCODE consortia. Genome Res. 22:1813–1831.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R. 2009. The
Sequence Alignment/Map format and SAMtools. Bioinformatics 25:2078–2079.
Mathelier A, Zhao X, Zhang AW, Parcy F, Worsley-Hunt R, Arenillas DJ, Buchman S, Chen CY, Chou A,
Ienasescu H, et al. 2014. JASPAR 2014: An extensively expanded and updated open-access
database of transcription factor binding profiles. Nucleic Acids Res. 42:1–6.
Moses AM. 2009. Statistical tests for natural selection on regulatory regions based on the strength of
transcription factor binding sites. BMC Evol. Biol. 9:286.
Moses A, Sinha S. 2009. Chapter: Regulatory Motif Analysis, in Bioinformatics, Tools and Applications.
Edwards D et al. (eds.) Springer-Verlag, New York.
Paradis E, Claude J, Strimmer K. 2004. APE: Analyses of phylogenetics and evolution in R language.
Bioinformatics 20:289–290.
Quinlan AR, Hall IM. 2010. BEDTools: A flexible suite of utilities for comparing genomic features.
Bioinformatics 26:841–842.
Schmidt D, Wilson MD, Ballester B, Schwalie PC, Brown GD, Marshall A, Kutter C, Watt S,
Martinez-Jimenez CP, Mackay S, et al. 2010. Five-vertebrate ChIP-seq reveals the evolutionary
dynamics of transcription factor binding. Science 328:1036–1040.
Schmidt D, Schwalie PC, Wilson MD, Ballester B, Gonçalves Â, Kutter C, Brown GD, Marshall A, Flicek P,
Odom DT. 2012. Waves of retrotransposon expansion remodel genome organization and CTCF
binding in multiple mammalian lineages. Cell 148:335–348.
Stefflova K, Thybert D, Wilson MD, Streeter I, Aleksic J, Karagianni P, Brazma A, Adams DJ, Talianidis I,
Marioni JC, et al. 2013. Cooperativity and rapid evolution of cobound transcription factors in
closely related mammals. Cell 154:530–540.
Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L.
2010. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and
isoform switching during cell differentiation. Nat. Biotechnol. 28:511–515.
Wong ES, Thybert D, Schmitt BM, Stefflova K, Odom DT, Flicek P. 2015. Decoupling of evolutionary
changes in transcription factor binding and gene expression in mammals. Genome Research
25:167-178.
Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M,
Li W, et al. 2008. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 9:R137.
Supplementary Table 1: QC statistics: including the fraction of reads in peaks (FRiP; red = low, green = high), the
normalized strand coefficient (NSC), the relative strand correlation (RSC) and the non-redundancy fraction (NRF).
Supplementary Table 2: Proportion overlap between ChIP-seq replicates (two replicates of each
experiment) at reciprocal overlap cut-offs of 33% and 75%. A cut-off of 75% captures the majority of
overlapping peaks called at a cut-off of 33%. The last three columns show the total number of binding
events (called by SWEMBL) in each replicate as well as the final number of peaks carried forward
(including peaks reproducible in each replicate and those unique to each replicate).
Supplementary Table 3: Likelihood ratio tests performed on deeply shared cis-regulatory modules (CRMs) near
the 23 liver-specific genes (Ensembl gene ID shown) and UOX. Peak coordinates indicate the peak calls used in this
analysis; however, +/-200 bp from peak summits was used for all of these tests. Scale factors are shown relative to
branch lengths based upon unbound, flanking DNA of all 23 genes (one neutral model based upon all genes).
Importantly, these branch lengths were scaled by the scaling factor called in macaque prior to this analysis.
Therefore, the scaling factors shown are relative to a scale factor of 1 along the macaque branch. “Null scale”
refers to the scaling factor fitted to the entire tree (i.e. same scaling factor for the human:gibbon lineage and
macaque). “Alt scale” indicates the scaling factor fitted to the human:gibbon lineage of the phylogeny (the scaling
factor along the macaque branch is always 1 under the alternative hypothesis). LRT indicates twice the
log-likelihood ratio of the null and alternative models.
Supplementary Figure 1: Multi-species alignment around the Gulo locus in relative alignment
coordinates. Specifically, this alignment corresponds to +/- 100kb around the human Gulo TSS
(chr8:27,317,791-27,517,791). Thick black lines correspond to DNA for each species, while the thinner
line indicates a gap at that position. Called peaks in each species are indicated by a different coloured
circle above the alignment as described in the legend above. The Gulo locus is coloured blue at the
bottom of the alignment and nearby genes are shown in red. * indicates the Gulo promoter, where
there is marked loss in binding in primates compared to the other species.
Supplementary Figure 2: Tests for degeneration along the macaque lineage as described in the main text
and as shown in figure 2 along the human lineage. (A) The difference in expression level between the
rodent ancestor and macaque. (B) The analogous difference in binding level between the rodent
ancestor and macaque. Species trees show the inferred values of the quantitative traits. The white
circles indicate the ancestral rodent and the red lines indicate the lineages being compared. Gray
histograms correspond to the standard score of the distribution of changes for 1116 liver expressed
genes with one-to-one orthologs across the five mammals. Gulo and Uox are indicated in terms of their
standard scores by the blue and red lines respectively.
Supplementary Figure 3: TF-bound sequences (“peaks”) evolve more slowly than unbound sequences
(“flanks”). (A) Substitutions per site along the human-gibbon lineage in human and macaque peaks and
flanks (where flanks correspond to unbound, non-coding DNA flanking the bound sequences). Peak
sequences are evolving more slowly than flanks (Wilcoxon tests; ** indicates significance at P < 10^-6). (B)
Substitutions per site along the macaque branch in human and macaque peaks and flanks. Macaque
peak sequences are evolving more slowly, but not human peaks (Wilcoxon tests; * indicates significance
at BF corr. P < 0.05). Both panels A and B are based upon peaks and flanks within 100kb of the 23 liver-
specific genes.
Supplementary Figure 4: Changes in binding strength (∆S) are not distributed differently between flanks
and peak summits for both liver-specific genes and Uox. (A) Human peaks near liver-specific genes; (B)
Human peaks near Uox; (C) Macaque peaks near liver-specific genes; (D) Macaque peaks near
Uox. Black dots correspond to the actual substitutions in each dataset. In peak sequences, substitutions
were called within +/- 25bp of the peak summits.