ChIP-Seq - method for studying

epigenetic mechanisms

Oleg Shpynov

28.07.2018

● Regulation

● TFs and histone modifications

● ChIP-Seq protocol

● Ultra-Low-Input ChIP-Seq

● MACS2 and SICER peak callers

● Semi-supervised approach to Peak Calling

● Human monocytes aging project results

Agenda

● Evolution mainly works on regulatory, not protein-coding, DNA

● If we know regulation, we can find key spots in large pathways

● Next-generation sequencing!

http://massgenomics.org/2012/01/the-current-state-of-dbsnp.html

Why study regulation?

Primate to human: it’s in the regulation

● Gene structure and expression are

well conserved

● Gene coexpression is not

● The difference lies in gene regulation

human vs chimpanzee

Proc Natl Acad Sci U S A. 2006 Nov 21;103(47):17973-8.

Chromosomes and chromatin

● Chromosomes are dense complexes of DNA and proteins

● Each human chromosome contains on average 5 cm of DNA

● This is about 2 m of DNA overall – too much!

● Chromatin = euchromatin + heterochromatin

https://www.shmoop.com/dna/dna-packaging.html

http://www.mun.ca/biology/scarr/FISH_chromosome_painting.html
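
The DNA lengths quoted on this slide can be checked with a back-of-the-envelope calculation. The sketch below assumes a rise of about 0.34 nm per base pair and a haploid genome of roughly 3.1 billion base pairs; neither number is on the slide, and the constant and function names are illustrative only.

```python
# Rough estimate of DNA length per human cell.
BP_RISE_NM = 0.34            # assumed rise per base pair of B-form DNA, in nanometres
HAPLOID_GENOME_BP = 3.1e9    # assumed approximate size of the haploid human genome
CHROMOSOMES_PER_CELL = 46    # diploid human cell


def dna_length_m(base_pairs, rise_nm=BP_RISE_NM):
    """Contour length of a DNA molecule in metres."""
    return base_pairs * rise_nm * 1e-9


total_m = dna_length_m(2 * HAPLOID_GENOME_BP)           # diploid genome
per_chromosome_cm = total_m / CHROMOSOMES_PER_CELL * 100

print(f"total: {total_m:.2f} m, per chromosome: {per_chromosome_cm:.1f} cm")
```

Under these assumptions the total comes out at roughly 2 m and the per-chromosome average at a little under 5 cm, in line with the figures above.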

ChromEMT

● In 2017, a new chromatin staining method made it possible to obtain high-contrast electron tomography images of mitotic chromosomes

● Disordered chains 5 to 24 nm in diameter were observed

http://science.sciencemag.org/content/357/6349/eaag0025.long

Regulation of transcription

● Basal transcription: general

transcription factors bind the

promoter and RNA polymerase II

● Activator proteins bind DNA spots

named enhancers

● Enhancers are often located far from the gene they regulate, so the DNA has to loop to bring them close to the promoter

https://courses.lumenlearning.com/suny-wmopen-biology1/chapter/eukaryotic-gene-regulation/

Transcription factors

● ~1,500 transcription factors in humans

● Binding motif represented by consensus sequence

● Master regulators exist but are not always known

● Functions:

○ stabilize/block RNAP II binding to DNA

○ catalyze histone acetylation or deacetylation

○ recruit coactivator or corepressor

http://www.broadinstitute.org/education/glossary/transcription-factor
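
A consensus sequence can be illustrated with a minimal sketch: take the most frequent base in each column of a set of aligned binding sites. The sites below are hypothetical and the `consensus` helper is not from any library; real motif models usually keep the full position weight matrix rather than a single string.

```python
from collections import Counter


def consensus(sites):
    """Most frequent base in each column of equal-length aligned sites."""
    if len({len(site) for site in sites}) != 1:
        raise ValueError("all sites must be non-empty in number and equal in length")
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*sites))


# Hypothetical aligned binding sites for a single transcription factor.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA"]
print(consensus(sites))
```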

Chromatin Immunoprecipitation (ChIP)

http://www.bio.brandeis.edu/haberlab/jehsite/chIP.html

DNA-binding proteins are crosslinked

to DNA with formaldehyde in vivo.

Isolate the chromatin. Shear DNA

along with bound proteins into small

fragments.

Bind antibodies specific to the DNA-

binding protein to isolate the complex

by precipitation. Reverse the cross-

linking to release the DNA and digest

the proteins.

Use PCR to amplify specific DNA

sequences to see if they were

precipitated with the antibody.

Who was first?

http://www.snarkyscientist.com/2013/06/19/the-history-of-the-biggest-technique-of-2009-who-invented-chip-seq/

ChIP-Seq

● DNA library obtained after ChIP can be amplified and sequenced

● ChIP-Seq can be used for both transcription factors and histone

modifications, like H3K4me3

Epigenetic regulation

http://en.wikipedia.org/wiki/Epigenetics

● Histone modifications

● DNA methylation

● Noncoding RNA

Histone tail modification

https://www.irbbarcelona.org/en/news/understanding-the-molecular-origin-of-epigenetic-markers

● Histone tails stick out of the nucleosome and can be recognized by other proteins

● Chemical modifications of histones influence DNA accessibility

● Histone modifications can be written, read, and erased

Promoter histone marks

The EMBO Journal, 31, pp 3130–3146 (2012)

C. Xu et al, Nature Communications, 2(227), pp 1-8 (2011)

● Narrow peaks of H3K4me3 mark promoters

● The enzyme that methylates H3K4 binds only to promoters whose CpGs are not methylated!

Enhancer histone marks

Bauer, D.E.; Kamran, S.C. et al, Science, 342(6155), pp 253-7 (2013)

erythroblasts

● Enhancers are associated with H3K4me1 and H3K27ac

● H3K27ac is thought to distinguish active enhancers from poised ones

Transcription elongation marks

proB cells

Proc Natl Acad Sci U S A, 107(50), pp 21931-6 (2010)

Inactive Active Active

● Elongation is marked by H3K36me3 and H3K79me2

Gene repression by chromatin marks

H3K4me3

H3K27me3

Mikkelsen, T.S.; Ku, M. et al, Nature 448, pp 553-560 (2007)

Vastenhouw N.L.; Zhang Y. et al, Nature 464, pp 922-6 (2010)

● H3K27me3 marks suppressed genes poised to be activated

● Stem cells can have both H3K4me3 and H3K27me3 – unique!

Heterochromatin marks

http://medcell.med.yale.edu/histology/cell_lab/euchromatin_and_heterochromatin.php

● Marks H3K9me2 and

H3K9me3 are strongly

associated with heterochromatin

● These marks are bound by the protein HP1

Chromatin marks regulation (simplified)

Lee, T.I.; Young, R.A., Cell, 152(6), pp 1237-51 (2013)

Let’s study how regulation by histone modifications

changes with aging

Multiomics dissection of healthy human aging

5 marks x 40 donors

http://artyomovlab.wustl.edu/aging

Conventional vs Ultra-Low-Input ChIP-Seq

Conventional ChIP-Seq

● Pros

○ Robust, well-adopted protocol

○ Good signal-to-noise ratio

○ Lots of high-quality data available for human

○ Guidelines and pipelines by ENCODE, Blueprint

● Cons

○ 2-5 mln cells required per single run

Ultra-Low-Input (ULI) ChIP-Seq

● Pros

○ 100k cells required per single run

● Cons

○ Difficult to process in the wet lab

○ Worse signal-to-noise ratio than conventional ChIP-Seq

○ Original protocol is for mice

○ No high-quality data available for human

Conventional vs ULI ChIP-Seq

ULI ChIP-Seq is always noisy

H3K4me3 - big variance in signal-to-noise ratio

ULI ChIP-Seq challenges

● High noise in the data due to ULI protocol

● High variance in signal-to-noise ratio

Peak calling - easy signal extraction problem

86 existing tools on the list*

Tools for easy problem?

* https://omictools.com/peak-calling-category

How to choose?

ENCODE ChIP-Seq pipeline

Problems

● MACS2 performs poorly on broad modifications

● Different signal-to-noise ratio in replicates

● Replicate concordance step fails

● IDR method works only for 2 replicates

https://www.encodeproject.org/pipelines/ENCPL272XAE/

Different tools = Different Data models

MACS2 - not good for broad marks

● Estimate fragment size to shift tags

● Estimate local λ for Poisson from control track (non-specific binding)

● Use posterior probabilities to compute p-values and q-values,

merging close enriched locations to peaks

SICER - fails for TFs and narrow marks

● Uses coverage to estimate the λ-s of 2 Poisson distributions

● Uses blacklist regions to overcome mappability issues

● Complicated procedure of scoring islands and significance detection

Different Data models = Different cases

Tool by modification type:

● TFs or NARROW histone marks: MACS2, SPP, or PeakSeg

● BROAD or MIXED histone marks: SICER, PeakSeg, or RSEG

Different cases = Different tools

https://github.com/olegs/bioinformatics/blob/master/chipseq/chipseq.pdf

Application for ULI ChIP-Seq data

MACS2, SICER peaks number

MACS2, SICER peaks length

Pros & Cons

Jaccard*

● Widely used

● Bad for shifts and enclosed regions

● Symmetric

Overlap

● Works with enclosed regions (A < B)

● Tolerant of shifts

● Non-symmetric

* Jaccard(A, B) = length(A intersect B) / length(A union B)

How to estimate consistency?

Overlap(A, B) = ⅓, Overlap(B, A) = 1
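
The two consistency measures can be sketched in a few lines. Jaccard follows the length-based formula given on the slide. The slide does not define Overlap, so the version here is an assumption: the fraction of peaks in A that intersect at least one peak in B, which reproduces the ⅓ and 1 values of the example. Peaks are hypothetical half-open (start, end) intervals, sorted and non-overlapping within each set.

```python
def total_length(peaks):
    """Total number of bases covered by a list of non-overlapping peaks."""
    return sum(end - start for start, end in peaks)


def intersection_length(a, b):
    """Bases shared by two sorted, non-overlapping peak lists."""
    i = j = shared = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            shared += end - start
        # Advance whichever peak ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared


def jaccard(a, b):
    """length(A intersect B) / length(A union B); symmetric."""
    inter = intersection_length(a, b)
    union = total_length(a) + total_length(b) - inter
    return inter / union if union else 0.0


def overlap(a, b):
    """Assumed definition: fraction of peaks in a that touch any peak in b."""
    if not a:
        return 0.0
    hits = sum(any(s < e2 and s2 < e for s2, e2 in b) for s, e in a)
    return hits / len(a)


a = [(0, 100), (200, 300), (400, 500)]   # three hypothetical peaks
b = [(220, 280)]                          # one peak enclosed in a peak of a
print(jaccard(a, b), overlap(a, b), overlap(b, a))
```

Swapping the arguments changes Overlap but not Jaccard, which is the asymmetry listed in the comparison above.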

MACS2, SICER pairwise peaks overlap

~400 points shown,

pairwise 20 vs 20

MACS2, SICER pairwise peaks overlap

These are the tracks with

low signal-to-noise ratio

~400 points shown,

pairwise 20 vs 20

Are standard tools not applicable, or did we fail to use the correct parameters?

Classical vs supervised approach

Peak callers used in our study:

● PeakSeg - didn’t work out-of-the-box

● MACS2 --broad

● RSEG - was too slow

● SICER

Peak callers optimization performance

https://academic.oup.com/bioinformatics/article/33/4/491/2608653

Semi-supervised approach

● Manually labelled dataset

● Parameter grid for each peak caller (MACS2, SICER, SPAN)

● Determine the parameter which gives the lowest error rate
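
The parameter search above can be sketched as follows. This is a toy illustration, not the procedure used in the study: a simple threshold caller stands in for MACS2, SICER, or SPAN, and the label format (start, end, "peak" or "noPeaks") and all function names are assumptions made for the example.

```python
def call_peaks(coverage, threshold):
    """Toy peak caller: maximal runs of bins with coverage >= threshold."""
    peaks, start = [], None
    for i, value in enumerate(coverage):
        if value >= threshold and start is None:
            start = i
        elif value < threshold and start is not None:
            peaks.append((start, i))
            start = None
    if start is not None:
        peaks.append((start, len(coverage)))
    return peaks


def label_errors(peaks, labels):
    """Number of manual labels contradicted by the called peaks."""
    errors = 0
    for start, end, kind in labels:
        has_peak = any(s < end and start < e for s, e in peaks)
        if (kind == "peak") != has_peak:
            errors += 1
    return errors


def best_parameter(coverage, labels, grid):
    """Grid value with the fewest label errors (first one wins on ties)."""
    return min(grid, key=lambda t: label_errors(call_peaks(coverage, t), labels))


coverage = [0, 1, 0, 2, 9, 8, 7, 1, 0, 3, 3, 0]            # hypothetical binned signal
labels = [(4, 7, "peak"), (9, 11, "noPeaks"), (0, 3, "noPeaks")]
grid = [1, 3, 5, 20]                                        # candidate thresholds
print(best_parameter(coverage, labels, grid))
```

A real grid would cover several parameters per tool (for example a significance cutoff and a gap), but the selection rule stays the same.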

SPAN Peak Analyzer

● Preprocess input data

● Create 3 state HMM

● Train model by Baum-Welch

EM algorithm

● Compute posterior

probabilities

Parameters

● Use q-values to control FDR

at level alpha

● Use gap to merge close

enriched positions
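
The "compute posterior probabilities" step can be illustrated with a scaled forward-backward pass over binned read counts. This is only a sketch under strong assumptions: the emissions are Poisson, the three states are read as "near-zero", "background", and "enriched", and all parameters are fixed by hand instead of being trained with Baum-Welch. None of these choices are taken from SPAN itself.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of observing k reads when the mean is lam (> 0)."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def state_posteriors(counts, start, trans, lambdas):
    """Scaled forward-backward: P(state | all counts) for every bin."""
    n, k = len(counts), len(lambdas)
    emit = [[poisson_pmf(c, lam) for lam in lambdas] for c in counts]
    alpha, scale = [], []
    for t in range(n):
        if t == 0:
            row = [start[s] * emit[0][s] for s in range(k)]
        else:
            row = [emit[t][s] * sum(alpha[t - 1][r] * trans[r][s] for r in range(k))
                   for s in range(k)]
        z = sum(row)                      # scaling factor avoids underflow
        scale.append(z)
        alpha.append([v / z for v in row])
    beta = [[1.0] * k for _ in range(n)]
    for t in range(n - 2, -1, -1):
        beta[t] = [sum(trans[s][r] * emit[t + 1][r] * beta[t + 1][r] for r in range(k))
                   / scale[t + 1] for s in range(k)]
    return [[alpha[t][s] * beta[t][s] for s in range(k)] for t in range(n)]


counts = [0, 0, 1, 9, 12, 10, 1, 0]       # hypothetical reads per bin
start = [1 / 3, 1 / 3, 1 / 3]
trans = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
lambdas = [0.1, 2.0, 10.0]                # hand-picked means, not trained
post = state_posteriors(counts, start, trans, lambdas)
enriched = [round(row[2], 3) for row in post]
print(enriched)
```

Bins whose enriched-state posterior is high enough would then be turned into q-values and merged across small gaps, as listed under Parameters.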

SPAN Peak Analyzer

● Train models

● Visualize tracks in

genome browser

● Create visual labels:

500+ labels x 5 ChIP-seq

targets

● Optimize parameters in a single click

● Consistent peak calling!

Semi-supervised scheme

Peaks with default parameters

http://artyomovlab.wustl.edu/aging/howto.html

Peaks with optimized parameters

Overall ChIP-Seq processing scheme

Peaks number consistency improved

* https://genome.cshlp.org/content/11/12/1975.full.html

Peaks length consistency improved

* http://bionumbers.hms.harvard.edu/bionumber.aspx?id=105336

Consistency between samples improved

Criticism

A: Use consistency as a quality function,

while learning on the same markup

Q: Only a small fraction of genome is used,

labels are created only where we see consistency

visually

Validation: consistency with ENCODE improved

Validation: expected overlap between all experiments

No difference in core 5 histone marks found

No difference? Talk about variation!

Is it about regulation?

● No differences in 5 histone marks

● Difference in DNA methylation

DMRs are overrepresented in histone marks

oleg.shpynov@jetbrains.com

https://research.jetbrains.org/groups/biolabs

Summary

● Histone modifications are a mechanism of regulation

● ChIP-Seq makes it possible to profile histone modifications

● ULI ChIP-Seq makes it possible to profile many modifications for the same donor

● MACS2 and SICER are not applicable to data with varying signal-to-noise ratio

● The semi-supervised approach produces high-quality results

● No changes in the 5 core histone marks in HEALTHY human monocyte aging

● Regulation? Potentially interesting changes in DNA methylation in enhancers

Thank you!

ENCODE project

● ENCODE = ENCyclopedia Of DNA Elements

● Pilot cost (2007): $55M, to date: ~$300M

● RNA-Seq, ChIP-seq of major TFs and histone modifications, DNA methylation

● Series of publications in the Fall of 2012 (6 Nature papers, 30 papers overall)

http://www.sciencemag.org/content/337/6099/1159/F2.expansion.html

ENCODE project discoveries

● 400,000 enhancers and 70,000

promoters

● More than 90% of genomic variation is in noncoding areas

● DNase I footprint is not that big

● mRNAs are more abundant in cytosol,

other RNAs – in the nucleus

● “More than 80% of human genome is

functionally active”

http://www.evolutionnews.org/2012/09/the_demise_of_j_1064061.html

ENCODE project criticism

● 80% of DNA cannot be truly functional, since

only about 10% (5-15%) is conserved

● This means ~70% of genome is either

○ impervious to deleterious mutations, or

○ does not mutate, or

○ does not have deleterious mutations

http://blogs.scientificamerican.com/guest-blog/2012/09/17/junk-dna-junky-pr/

Histone code

hypothesis

Strahl, Allis, Nature 403(6), 2000, 41-45

● Concept similar to

genetic code

● Implies existence of

histone mark

combinations that

have specific

function

Main tools for genome segmentation

Jason Ernst lab - ChromHMM William Noble lab - Segway

Nat Methods 2012 Feb 28;9(3):215-6. doi: 10.1038/nmeth.1906

Nat Methods 2012 Mar 18;9(5):473-6. doi: 10.1038/nmeth.1937

ChromHMM

● BED files are binarized using the selected chromatin marks

(present: 1, absent: 0)

● Marks are then grouped in a number of states – biologically meaningful

combinations of marks

● A transition is a move between states along the genome; an emission is the probability that a state produces the observed marks

Nature 2011 May 5;473(7345):43-9. doi: 10.1038/nature09906
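
The binarization step can be sketched as follows: the chromosome is cut into fixed-size bins and each mark gets 1 in a bin if one of its peaks overlaps that bin, 0 otherwise. This is an illustration of the present/absent idea only, not ChromHMM's own binarization code; the 200 bp bin size, the peak coordinates, and the `binarize` helper are assumptions for the example.

```python
def binarize(peaks_by_mark, chrom_length, bin_size=200):
    """Return (marks, matrix) where matrix[bin][mark] is 1 if a peak overlaps the bin."""
    marks = sorted(peaks_by_mark)
    n_bins = (chrom_length + bin_size - 1) // bin_size
    matrix = [[0] * len(marks) for _ in range(n_bins)]
    for col, mark in enumerate(marks):
        for start, end in peaks_by_mark[mark]:      # half-open (start, end) intervals
            first = start // bin_size
            last = min((end - 1) // bin_size, n_bins - 1)
            for b in range(first, last + 1):
                matrix[b][col] = 1
    return marks, matrix


# Hypothetical peaks on a 1 kb toy chromosome.
peaks_by_mark = {
    "H3K4me3": [(150, 420)],
    "H3K27ac": [(0, 200), (800, 1000)],
}
marks, matrix = binarize(peaks_by_mark, chrom_length=1000)
print(marks)
for row in matrix:
    print(row)
```

Each row of the matrix is one observation for the HMM, and a state is then characterized by how likely each mark is to be present in it.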

Genome annotation

● Segmentation

allows discovery

of novel elements,

alternative

promoters

● Here we find a

new non-coding

Nucleic Acids Res 2013 Jan;41(2):827-41. doi: 10.1093/nar/gks1284

Discovery of lncRNAs

Nature 2009 Mar 12;458(7235):223-7. doi: 10.1038/nature07672

● In 2008, long noncoding RNAs were thought to be rare and were considered artifacts

● ChIP-Seq of H3K4me3/H3K36me3 revealed thousands of lincRNAs

Superenhancers

● There are estimated 400,000 enhancers in human genome

● Not all are active in every cell – estimated 5,000 - 100,000 per cell type

● There are special types of enhancer elements called superenhancers

● Enriched for Med1, H3K27ac, H3K4me1, and master TFs

Cell 2013 Apr 11;153(2):307-19. doi: 10.1016/j.cell.2013.03.035

Step 1: estimating fragment length d

● Slide a window of size BANDWIDTH

● Find top regions with MFOLD enrichment of treatment vs input

● Use +/- strand cross correlation to estimate d
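
The strand cross-correlation idea can be sketched in a simplified form: slide the + strand tag positions by each candidate shift and count how many land on a - strand tag position; the best shift is taken as d. This is an unnormalized toy version on hypothetical 5' tag positions, and the `estimate_fragment_length` name and `max_shift` limit are assumptions, not the implementation of any peak caller.

```python
def estimate_fragment_length(plus_starts, minus_starts, max_shift=300):
    """Shift (bp) of + strand tags that matches the most - strand tags."""
    minus = set(minus_starts)
    best_shift, best_score = 0, -1
    for shift in range(max_shift + 1):
        # Unnormalized cross-correlation of the two binary tag tracks.
        score = sum((p + shift) in minus for p in plus_starts)
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift


# Hypothetical tag positions: - strand tags sit 150 bp downstream of + strand tags.
plus_starts = [100, 260, 430, 900]
minus_starts = [250, 410, 580, 1050, 1300]
d = estimate_fragment_length(plus_starts, minus_starts)
print(d)
```

On real data the positions are noisy, so the counts are usually smoothed or binned before the maximum is taken.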

Step 2: identification of local noise parameter

● Slide a window of size 2*d across treatment and input

● Estimate λ for Poisson distribution

Step 3: identification of enriched regions

● Find regions with P-values < PVALUE

● Determine summit position inside enriched regions as max density
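
Steps 2 and 3 can be illustrated with a Poisson tail test. The sketch assumes that λ for a window is the largest of a genome-wide rate and rates derived from control reads in larger surrounding windows; the slide only says that λ is estimated from treatment and input, so this scheme, the function names, and the numbers are assumptions.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summed directly over the upper tail."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total, i = 0.0, k
    while term > 0.0 and term > total * 1e-16:
        total += term
        i += 1
        term *= lam / i
    return min(1.0, total)


def local_lambda(control_counts, window_bp, genome_rate_per_bp, scale=1.0):
    """Assumed scheme: max of the genome-wide rate and control-derived local rates.

    control_counts maps a surrounding window size (bp) to the control read count in it;
    scale is the treatment/control library size ratio.
    """
    candidates = [genome_rate_per_bp * window_bp]
    for size_bp, count in control_counts.items():
        candidates.append(scale * count * window_bp / size_bp)
    return max(candidates)


d = 150                                   # fragment length from step 1 (hypothetical)
lam = local_lambda({1000: 20, 10000: 120}, window_bp=2 * d, genome_rate_per_bp=0.01)
p_value = poisson_sf(25, lam)             # 25 treatment reads in the 2*d window
print(lam, p_value)
```

A window would be kept as enriched when its p-value falls below the PVALUE cutoff.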

Step 4: Significance testing

● Swap treatment and control, call peaks using same PVALUE

Step 5: Broad peak calling

● Use PVALUE or BROAD-CUTOFF option to filter enriched peaks

● Compose broad regions of nearby enriched peaks

● Max length of region is 4*d

Step 1: detection of Islands

● Use coverage to estimate global λ-s for

Poisson distributions (treatment and

control)

● Classify enriched windows

● Enriched windows are separated by gaps

● Island is a cluster of enriched windows

separated by gaps of size at most GAP

windows

Example: GAP = 2
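
Island detection with a gap can be sketched as a single pass over the windows. The input here is a hypothetical list of flags saying whether each window was classified as enriched; how that classification is done is left out, and the `find_islands` name is an assumption.

```python
def find_islands(enriched, gap):
    """Cluster enriched windows separated by at most `gap` non-enriched windows.

    Returns (first_window, last_window) index pairs, both inclusive.
    """
    islands = []
    start = last = None
    for i, flag in enumerate(enriched):
        if not flag:
            continue
        if start is None:
            start = last = i
        elif i - last - 1 <= gap:         # the gap is small enough: extend the island
            last = i
        else:                             # the gap is too large: close the island
            islands.append((start, last))
            start = last = i
    if start is not None:
        islands.append((start, last))
    return islands


# 1 = enriched window, 0 = not enriched; GAP = 2 as in the example above.
windows = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1]
print(find_islands(windows, gap=2))
```

With GAP = 2, the two-window gap after window 0 is bridged, while the three-window gap after window 4 splits the signal into two islands.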

Step 2+: scoring

● The scoring function is based on the probability of observing the tag count in a

random background

● Scoring for enriched window = -ln P(m, lambda)

● Scoring for island is the aggregated score of all enriched windows in the

island, corresponds to the background probability of finding the observed

pattern

Score(I) = F* (Score(I1), Gap, Score(I2))
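
The window score above can be written down directly, reading P(m, lambda) as the Poisson probability of m tags given mean lambda. The aggregation function F* is not spelled out on the slide, so the island score below uses a plain sum of window scores as a stand-in and ignores the gap term; that choice and the function names are assumptions.

```python
import math


def window_score(m, lam):
    """-ln P(m, lam): surprise of seeing m tags under a Poisson background."""
    log_p = -lam + m * math.log(lam) - math.lgamma(m + 1)
    return -log_p


def island_score(tag_counts, lam):
    """Assumed aggregation: sum of the scores of the island's enriched windows."""
    return sum(window_score(m, lam) for m in tag_counts)


lam = 2.0                                  # hypothetical background mean per window
print(window_score(10, lam))               # one strongly enriched window
print(island_score([10, 12, 9], lam))      # island made of three enriched windows
```

Because the scores are negative log probabilities, summing them corresponds to multiplying the window probabilities, which matches the "background probability of finding the observed pattern" reading if windows are treated as independent.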

Step N: significance testing

● Use control library as background to calculate p-value for islands

● Or use random background model to calculate p-values for islands

● Compute q-values from p-values

● Filter by p-value or by q-value (FDR)
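
One common way to turn p-values into q-values is the Benjamini-Hochberg procedure. The slide does not say which method is used, so the sketch below is an assumption, and the p-values are made up for illustration.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        qvalues[i] = running_min
    return qvalues


island_pvalues = [0.01, 0.04, 0.03, 0.5]   # hypothetical island p-values
qvalues = bh_qvalues(island_pvalues)
kept = [i for i, q in enumerate(qvalues) if q <= 0.05]
print(qvalues, kept)
```

Filtering at q <= 0.05 keeps only the islands whose expected share of false discoveries is at most 5%.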
