ChIP-Seq - method for studying epigenetic mechanisms

Oleg Shpynov

28.07.2018

● Regulation

● TFs and histone modifications

● ChIP-Seq protocol

● Ultra-Low-Input ChIP-Seq

● MACS2 and SICER peak callers

● Semi-supervised approach to Peak Calling

● Human monocytes aging project results

Agenda

2

3

● Evolution mainly works on regulatory, not protein-coding, DNA

● If we know regulation, we can find key spots in large pathways

● Next-generation sequencing!

http://massgenomics.org/2012/01/the-current-state-of-dbsnp.html

Why study regulation?

Primate to human: it’s in the regulation

4

● Gene structure and expression are well conserved

● Gene coexpression is not

● The difference lies in gene regulation

[Figure: coexpression networks of genes A, B, C, D in human and chimpanzee]

Proc Natl Acad Sci U S A. 2006 Nov 21;103(47):17973-8.

Chromosomes and chromatin

5

● Chromosomes are dense complexes of DNA and proteins

● Each human chromosome contains on average 5 cm of DNA

● This is about 2 m of DNA per cell overall, far too long to fit into a nucleus without packaging!

● Chromatin = euchromatin + heterochromatin

https://www.shmoop.com/dna/dna-packaging.html

http://www.mun.ca/biology/scarr/FISH_chromosome_painting.html

ChromEMT

6

● In 2017, a new chromatin staining method made it possible to obtain high-contrast electron tomography images of mitotic chromosomes

● Disordered structures 5 to 24 nm in diameter were observed

http://science.sciencemag.org/content/357/6349/eaag0025.long

Regulation of transcription

7

● Basal transcription: general transcription factors bind the promoter and RNA polymerase II

● Activator proteins bind DNA regions called enhancers

● Enhancers are often located far from the promoter, so the DNA has to loop to bring them into contact

https://courses.lumenlearning.com/suny-wmopen-biology1/chapter/eukaryotic-gene-regulation/

Transcription factors

8

● ~1,500 transcription factors in humans

● Binding motif represented by consensus sequence

● Master regulators exist but are not always known

● Functions:

○ stabilize/block RNAP II binding to DNA

○ catalyze histone acetylation or deacetylation

○ recruit coactivator or corepressor

http://www.broadinstitute.org/education/glossary/transcription-factor

Chromatin Immunoprecipitation (ChIP)

9

http://www.bio.brandeis.edu/haberlab/jehsite/chIP.html

DNA-binding proteins are crosslinked to DNA with formaldehyde in vivo.

Isolate the chromatin. Shear DNA along with bound proteins into small fragments.

Bind antibodies specific to the DNA-binding protein to isolate the complex by precipitation. Reverse the cross-linking to release the DNA and digest the proteins.

Use PCR to amplify specific DNA sequences to see if they were precipitated with the antibody.

Who was first?

10

http://www.snarkyscientist.com/2013/06/19/the-history-of-the-biggest-technique-of-2009-who-invented-chip-seq/

ChIP-Seq

● DNA library obtained after ChIP can be amplified and sequenced

● ChIP-Seq can be used for both transcription factors and histone modifications, such as H3K4me3

11

Epigenetic regulation

12

http://en.wikipedia.org/wiki/Epigenetics

● Histone modifications

● DNA methylation

● Noncoding RNA

Histone tail modification

13

https://www.irbbarcelona.org/en/news/understanding-the-molecular-origin-of-epigenetic-markers

● Histone tails protrude from the nucleosome and can be recognized

● Chemical modifications of histones influence DNA accessibility

● Histone modifications can be written, read, and erased

Promoter histone marks

The EMBO Journal, 31, pp 3130–3146 (2012)

C. Xu et al, Nature Communications, 2(227), pp 1-8 (2011)

Mouse heart

● Narrow peaks of H3K4me3 mark promoters

● The enzyme that methylates H3K4 binds only to promoters whose CpGs are not methylated!

Enhancer histone marks

15

Bauer, D.E.; Kamran, S.C. et al, Science, 342(6155), pp 253-7 (2013)

Human erythroblasts

● Enhancers are associated with H3K4me1 and H3K27ac

● H3K27ac is thought to distinguish active enhancers from poised ones

Transcription elongation marks

16

Mouse pro-B cells

Proc Natl Acad Sci U S A, 107(50), pp 21931-6 (2010)

Inactive Active Active

● Elongation is marked by H3K36me3 and H3K79me2

Gene repression by chromatin marks

17

[Figure: H3K4me3 and H3K27me3 in ES, NPC, and MEF cells]

Mikkelsen, T.S.; Ku, M. et al, Nature 448, pp 553-560 (2007)

Vastenhouw N.L.; Zhang Y. et al, Nature 464, pp 922-6 (2010)

● H3K27me3 marks suppressed genes poised to be activated

● Stem cells can carry both H3K4me3 and H3K27me3 on the same promoter (bivalent chromatin), which is unique!

Heterochromatin marks

18

http://medcell.med.yale.edu/histology/cell_lab/euchromatin_and_heterochromatin.php

● Marks H3K9me2 and H3K9me3 are strongly associated with heterochromatin

● These marks are bound by the protein HP1

Chromatin marks regulation (simplified)

19

Lee, T.I.; Young, R.A., Cell, 152(6), pp 1237-51 (2013)

Let’s study how regulation by histone modifications changes with aging

20

21

Multiomics dissection of healthy human aging

5 marks x 40 donors

http://artyomovlab.wustl.edu/aging

Conventional vs Ultra-Low-Input ChIP-Seq

Conventional ChIP-Seq

+ Robust, well-adopted protocol

+ Good signal-to-noise ratio

+ Lots of high-quality data available for human

+ Guidelines and pipelines by ENCODE, Blueprint, etc.

- 2-5 million cells required per single run

Ultra-Low-Input (ULI) ChIP-Seq

+ 100k cells required per single run

- Difficult to process in the wet lab

- Worse signal-to-noise ratio than conventional ChIP-Seq

- Original protocol is for mice

- No high-quality data available for human

Conventional vs ULI ChIP-Seq

[Figure panels: Conventional, ULI]

24

ULI ChIP-Seq is always noisy

25

H3K4me3 - big variance in signal-to-noise ratio

26

ULI ChIP-Seq challenges

● High noise in the data due to ULI protocol

● High variance in signal-to-noise ratio

Peak calling - easy signal extraction problem

27

28

86 existing tools on the list*

So many tools for an easy problem?

* https://omictools.com/peak-calling-category

29

How to choose?

ENCODE ChIP-Seq pipeline

30

Problems

● MACS2 performs poorly on broad modifications

● Different signal-to-noise ratio in replicates

● Replicate concordance step fails

● IDR method works only for 2 replicates

https://www.encodeproject.org/pipelines/ENCPL272XAE/

31

Different tools = Different Data models

32

MACS2 - not good for broad marks

● Estimate fragment size to shift tags

● Estimate local λ for Poisson from control track (non-specific binding)

● Use Poisson tail probabilities to compute p-values and q-values, merging close enriched locations into peaks

SICER - fails for TFs and narrow marks

● Uses coverage to estimate the λ parameters of 2 Poisson distributions (treatment and control)

● Uses blacklist regions to overcome mappability issues

● Complicated procedure of scoring islands and significance detection

Different Data models = Different cases

Modification                    Tool

TFs or NARROW histone marks     MACS2 or SPP or PeakSeg

BROAD or MIXED histone marks    SICER or PeakSeg or RSEG

33

Different cases = Different tools

https://github.com/olegs/bioinformatics/blob/master/chipseq/chipseq.pdf

Application for ULI ChIP-Seq data

34

MACS2, SICER peaks number

35

MACS2, SICER peaks length

36

Pros & Cons

Jaccard*

● Widely used

● Bad for shifts and enclosed regions

● Symmetric

Overlap

● Works with enclosed regions (A < B)

● Tolerant to shifts

● Non-symmetric

* Jaccard(A, B) = length(A intersect B) / length(A union B)

How to estimate consistency?

38

How to estimate consistency?

Overlap(A, B) = ⅓

Overlap(B, A) = 1

[Figure: region B enclosed in region A]
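The two consistency metrics can be sketched in a few lines of Python. This is an illustration only: peaks are assumed to be half-open (start, end) intervals on a single chromosome, and Overlap(A, B) is assumed to mean the fraction of A's length covered by B, which is consistent with the ⅓ and 1 values on the slide but is not spelled out there. All function names are hypothetical.

```python
def merge(intervals):
    """Sort and merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(intervals):
    """Total number of bases covered by a set of intervals."""
    return sum(end - start for start, end in merge(intervals))


def intersection_length(a, b):
    """Number of bases covered by both interval sets (two-pointer sweep)."""
    a, b = merge(a), merge(b)
    i = j = covered = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            covered += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return covered


def jaccard(a, b):
    """Symmetric: length(A intersect B) / length(A union B)."""
    inter = intersection_length(a, b)
    union = total_length(a) + total_length(b) - inter
    return inter / union if union else 0.0


def overlap(a, b):
    """Non-symmetric: fraction of A covered by B (assumed definition)."""
    length_a = total_length(a)
    return intersection_length(a, b) / length_a if length_a else 0.0


# B is enclosed in A and is one third of its length.
A = [(0, 300)]
B = [(100, 200)]
print(jaccard(A, B), overlap(A, B), overlap(B, A))
```

With an enclosed region, Jaccard penalizes the pair even though every base of B is confirmed by A, while the non-symmetric overlap reports both directions separately.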

MACS2, SICER pairwise peaks overlap

39

~400 points shown,

pairwise 20 vs 20

MACS2, SICER pairwise peaks overlap

40

These are the tracks with

low signal-to-noise ratio

~400 points shown,

pairwise 20 vs 20

Are standard tools not applicable, or did we fail to use the correct parameters?

41

Classical vs supervised approach

42

43

Peak callers used in our study:

● PeakSeg - didn’t work out-of-the-box

● MACS2 --broad

● RSEG - was too slow

● SICER

Peak callers optimization performance

https://academic.oup.com/bioinformatics/article/33/4/491/2608653

Semi-supervised approach

44

● Manually labelled dataset

● Parameter grid for each peak caller (MACS2, SICER, SPAN)

● Select the parameter combination which gives the lowest error rate on the labels
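The three steps above amount to a grid search scored against manual labels. The sketch below is a toy version under loud assumptions: `toy_caller` is a hypothetical stand-in for a real peak caller (it just thresholds binned coverage), labels are reduced to two kinds ("peaks" means at least one peak must overlap the region, "noPeaks" means none may), and the coverage, labels, and grid values are made up for illustration.

```python
from itertools import product


def toy_caller(coverage, threshold, gap):
    """Hypothetical peak caller: bins with coverage >= threshold,
    merged when separated by at most `gap` bins. Peaks are (start, end), end exclusive."""
    peaks = []
    for i, value in enumerate(coverage):
        if value < threshold:
            continue
        if peaks and i - peaks[-1][1] <= gap:
            peaks[-1] = (peaks[-1][0], i + 1)
        else:
            peaks.append((i, i + 1))
    return peaks


def label_errors(peaks, labels):
    """Count labels contradicted by the peak set."""
    errors = 0
    for start, end, kind in labels:
        hit = any(s < end and start < e for s, e in peaks)
        if (kind == "peaks") != hit:
            errors += 1
    return errors


def grid_search(coverage, labels, grid):
    """Try every parameter combination; keep the first one with the lowest error rate."""
    best = None
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        error_rate = label_errors(toy_caller(coverage, **params), labels) / len(labels)
        if best is None or error_rate < best[0]:
            best = (error_rate, params)
    return best


coverage = [0, 1, 9, 8, 1, 0, 0, 2, 3, 2, 0, 7, 0, 8, 0]
labels = [(2, 4, "peaks"), (5, 10, "noPeaks"), (11, 14, "peaks")]
grid = {"threshold": [2, 5], "gap": [0, 1]}

best_error, best_params = grid_search(coverage, labels, grid)
print(best_error, best_params)
```

In a real setting the caller would be MACS2, SICER, or SPAN run once per grid point, and the labels would come from visual inspection in a genome browser.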

45

● Preprocess input data

● Create 3-state HMM

● Train model by Baum-Welch EM algorithm

● Compute posterior probabilities
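The last step, computing posterior state probabilities, can be illustrated with a scaled forward-backward pass. This is a sketch, not SPAN's implementation: Poisson emissions are an assumption, Baum-Welch training is omitted, and the start, transition, and rate parameters below are set by hand purely for the example.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), lam > 0, computed in log space."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def posteriors(obs, start, trans, lams):
    """Posterior probability of each hidden state at each position
    (scaled forward-backward). obs: tag counts per bin."""
    n, s = len(obs), len(lams)
    emit = [[poisson_pmf(o, lam) for lam in lams] for o in obs]

    # Forward pass, rescaling each step so values stay in floating-point range.
    fwd, scales = [], []
    for t in range(n):
        if t == 0:
            row = [start[j] * emit[0][j] for j in range(s)]
        else:
            row = [emit[t][j] * sum(fwd[-1][i] * trans[i][j] for i in range(s))
                   for j in range(s)]
        c = sum(row)
        scales.append(c)
        fwd.append([x / c for x in row])

    # Backward pass, reusing the forward scaling factors.
    bwd = [[1.0] * s for _ in range(n)]
    for t in range(n - 2, -1, -1):
        bwd[t] = [sum(trans[i][j] * emit[t + 1][j] * bwd[t + 1][j] for j in range(s))
                  / scales[t + 1] for i in range(s)]

    return [[f * b for f, b in zip(fwd[t], bwd[t])] for t in range(n)]


# Hand-picked illustrative parameters: states are zero, low, and high coverage.
obs = [0, 0, 1, 12, 15, 11, 0, 0]
start = [0.6, 0.3, 0.1]
trans = [[0.80, 0.15, 0.05],
         [0.10, 0.80, 0.10],
         [0.05, 0.15, 0.80]]
lams = [0.1, 2.0, 10.0]

post = posteriors(obs, start, trans, lams)
for count, row in zip(obs, post):
    print(count, [round(p, 3) for p in row])
```

Bins whose posterior for the high state is large are the candidates for enriched positions; a trained model would estimate the parameters from the data instead of fixing them.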

SPAN Peak Analyzer

46


Parameters

● Use q-values to control FDR at level alpha

● Use gap to merge close enriched positions
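The two parameters can be illustrated as follows. The slides do not say how the q-values are derived, so the Benjamini-Hochberg adjustment used here is an assumption, as are the per-bin p-values and the function names.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        qvalues[i] = running_min
    return qvalues


def enriched_bins(pvalues, alpha):
    """Indices of bins whose q-value is at most alpha."""
    return [i for i, q in enumerate(bh_qvalues(pvalues)) if q <= alpha]


def merge_bins(bins, gap):
    """Merge sorted bin indices into (first, last) peaks, bridging at most `gap` bins."""
    peaks = []
    for b in bins:
        if peaks and b - peaks[-1][1] - 1 <= gap:
            peaks[-1] = (peaks[-1][0], b)
        else:
            peaks.append((b, b))
    return peaks


# Made-up per-bin p-values.
pvalues = [0.5, 0.001, 0.002, 0.6, 0.003, 0.7, 0.8, 0.9, 0.004, 0.95]
peaks = merge_bins(enriched_bins(pvalues, alpha=0.05), gap=1)
print(peaks)
```

A stricter alpha removes weak bins, and a larger gap joins fragmented enrichment into longer peaks, which is why these two knobs are the natural targets for tuning against labels.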

SPAN Peak Analyzer

● Train models

● Visualize tracks in genome browser

● Create visual labels: 500+ labels x 5 ChIP-seq targets

● Optimize parameters in a single click

● Consistent peak calling!

Semi-supervised scheme

Peaks with default parameters

http://artyomovlab.wustl.edu/aging/howto.html

Peaks with optimized parameters

50

Overall ChIP-Seq processing scheme

Peaks number consistency improved

51

* https://genome.cshlp.org/content/11/12/1975.full.html

Peaks length consistency improved

52

* http://bionumbers.hms.harvard.edu/bionumber.aspx?id=105336

Consistency between samples improved

53

Criticism

Q: Consistency is used as a quality function, while learning on the same markup

A: Only a small fraction of the genome is used; labels are created only where we see consistency visually

54

Validation: consistency with ENCODE improved

55

56

Validation: expected overlap between all experiments

No difference in core 5 histone marks found

57

58

No difference? Talk about variation!

Is it about regulation?

59

● No differences in 5 histone marks

● Difference in DNA methylation

DMRs are overrepresented in histone marks

60

61

oleg.shpynov@jetbrains.com

https://research.jetbrains.org/groups/biolabs

Summary

● Histone modifications are a mechanism of regulation

● ChIP-Seq makes it possible to profile histone modifications

● ULI ChIP-Seq makes it possible to profile many modifications for the same donor

● MACS2 and SICER are not applicable to data with varying signal-to-noise ratio

● The semi-supervised approach produces consistent, high-quality peak calls

● No changes in the 5 core histone marks during HEALTHY aging of human monocytes

● Regulation? Potentially interesting changes in DNA methylation in enhancers

Thank you!

62

ENCODE project

63

● ENCODE = ENCyclopedia Of DNA Elements

● Pilot cost (2007): $55M, total to date: ~$300M

● RNA-Seq, ChIP-seq of major TFs and histone modifications, DNA methylation

● Series of publications in the Fall of 2012 (6 Nature papers, 30 papers overall)

http://www.sciencemag.org/content/337/6099/1159/F2.expansion.html

64

ENCODE project discoveries

65

● 400,000 enhancers and 70,000 promoters

● More than 90% of genomic variation is in noncoding regions

● DNase I footprint is not that big

● mRNAs are more abundant in the cytosol, other RNAs in the nucleus

● “More than 80% of human genome is functionally active”

http://www.evolutionnews.org/2012/09/the_demise_of_j_1064061.html

ENCODE project criticism

66

● 80% of DNA cannot be truly functional, since only about 10% (5-15%) is conserved

● This means ~70% of the genome is either

○ impervious to deleterious mutations, or

○ does not mutate, or

○ does not have deleterious mutations

http://blogs.scientificamerican.com/guest-blog/2012/09/17/junk-dna-junky-pr/

Histone code hypothesis

67

Strahl, Allis, Nature 403(6), 2000, 41-45

● Concept similar to the genetic code

● Implies the existence of histone mark combinations that have specific functions

Main tools for genome segmentation

68

Jason Ernst lab - ChromHMM: Nat Methods 2012 Feb 28;9(3):215-6. doi: 10.1038/nmeth.1906

William Noble lab - Segway: Nat Methods 2012 Mar 18;9(5):473-6. doi: 10.1038/nmeth.1937

ChromHMM

69

● BED files are binarized using the selected chromatin marks (present: 1, absent: 0)

● Marks are then grouped into a number of states: biologically meaningful combinations of marks

● Transition probabilities describe moving from one state to the next along the genome; emission probabilities describe how likely each mark is to be observed in a given state
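The binarization step can be sketched as below. This is a simplification for illustration: a fixed count threshold is an assumption (it is not ChromHMM's actual rule), and the per-bin counts and the `binarize` helper are made up.

```python
from collections import Counter


def binarize(counts_by_mark, threshold):
    """Turn per-bin counts for each mark into presence/absence vectors.

    Returns the mark order and one tuple of 0/1 values per genomic bin.
    """
    marks = sorted(counts_by_mark)
    n_bins = len(next(iter(counts_by_mark.values())))
    rows = [tuple(int(counts_by_mark[mark][b] >= threshold) for mark in marks)
            for b in range(n_bins)]
    return marks, rows


# Made-up counts for three consecutive genomic bins.
counts = {
    "H3K4me3": [9, 0, 1],
    "H3K27ac": [7, 5, 0],
}
marks, rows = binarize(counts, threshold=4)
print(marks)
print(rows)

# Each distinct 0/1 combination is one possible observation for the HMM.
print(Counter(rows))
```

The HMM then learns, for each hidden state, how likely each mark is to be present, so that bins with similar mark combinations end up in the same state.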

Nature 2011 May 5;473(7345):43-9. doi: 10.1038/nature09906

Genome annotation

70

● Segmentation allows discovery of novel elements and alternative promoters

● Here we find a new non-coding RNA

Nucleic Acids Res 2013 Jan;41(2):827-41. doi: 10.1093/nar/gks1284

Discovery of lncRNAs

Nature 2009 Mar 12;458(7235):223-7. doi: 10.1038/nature07672

● In 2008, few long noncoding RNAs were known, and they were considered artifacts

● ChIP-Seq of H3K4me3/H3K36me3 revealed thousands of lincRNAs

Superenhancers

● There are an estimated 400,000 enhancers in the human genome

● Not all are active in every cell: an estimated 5,000 - 100,000 per cell type

● There are special types of enhancer elements called superenhancers

● Enriched for Med1, H3K27ac, H3K4me1, and master TFs

Cell 2013 Apr 11;153(2):307-19. doi: 10.1016/j.cell.2013.03.035

73

MACS2

Step 1: estimating fragment length d

● Slide a window of size BANDWIDTH

● Find top regions with MFOLD enrichment of treatment vs input

● Use +/- strand cross correlation to estimate d
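The cross-correlation idea in the last bullet can be sketched as follows. This is a toy illustration, not MACS2's model-building code: it assumes per-base counts of read 5' ends on each strand are already available, uses an unnormalized correlation, and the positions and counts are invented.

```python
def estimate_shift(plus, minus, max_shift):
    """Return the shift d (0..max_shift) that maximizes the correlation
    between plus-strand counts and minus-strand counts moved d bases upstream."""
    best_d, best_score = 0, float("-inf")
    for d in range(max_shift + 1):
        # zip truncates to the shorter sequence, pairing plus[i] with minus[i + d].
        score = sum(p * m for p, m in zip(plus, minus[d:]))
        if score > best_score:
            best_d, best_score = d, score
    return best_d


# Two binding sites: minus-strand pileups sit 5 bases downstream of plus-strand ones.
plus = [0] * 60
minus = [0] * 60
plus[10], plus[40] = 5, 3
minus[15], minus[45] = 4, 6

d = estimate_shift(plus, minus, max_shift=20)
print(d)
```

On real data the pileups are hundreds of bases apart and noisy, which is why the estimate is restricted to the top enriched regions.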

74

Step 2: identification of local noise parameter

● Slide a window of size 2*d across treatment and input

● Estimate λ for Poisson distribution

75

Step 3: identification of enriched regions

● Find regions with P-values < PVALUE

● Determine summit position inside enriched regions as max density
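Steps 2 and 3 can be illustrated with a Poisson tail test against a local noise rate. The sketch is written under assumptions: taking the maximum over a genome-wide rate and several control windows follows the description in the MACS paper, while the window sizes, tag counts, and function names below are invented for the example.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), lam > 0, by summing the upper tail."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))  # P(X = k)
    total = 0.0
    i = k
    # Keep adding terms until we are past the mode and they stop mattering.
    while term > 0.0 and (i <= lam or term > total * 1e-16):
        total += term
        i += 1
        term *= lam / i
    return min(total, 1.0)


def local_lambda(control_counts, window, genome_rate):
    """Expected tags in a `window`-bp region under noise.

    control_counts maps a control window size (bp) to the control tag count in it;
    genome_rate is the genome-wide control tag density per bp.
    """
    rates = [genome_rate * window]
    rates += [count * window / size for size, count in control_counts.items()]
    return max(rates)


# Invented numbers: a 200 bp candidate window with 12 treatment tags.
lam = local_lambda({1000: 15, 10000: 80}, window=200, genome_rate=0.01)
p_value = poisson_sf(12, lam)
print(lam, p_value)
```

Using the maximum rate makes the test conservative in regions where the control itself shows local bias.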

76

Step 4: Significance testing

● Swap treatment and control and call peaks using the same PVALUE; peaks called in the swapped run give an estimate of the false discovery rate

77

Step 5: Broad peak calling

● Use PVALUE or BROAD-CUTOFF option to filter enriched peaks

● Compose broad regions of nearby enriched peaks

● Max length of region is 4*d

78

79

SICER

Step 1: detection of Islands

● Use coverage to estimate global λ-s for Poisson distributions (treatment and control)

● Classify enriched windows

● Enriched windows are separated by gaps

● Island is a cluster of enriched windows separated by gaps of size at most GAP windows

80

Example: GAP = 2
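Island detection as described above can be sketched in a few lines. The input is assumed to be one boolean per window (True for an enriched window), and `find_islands` is a hypothetical helper, not SICER's code.

```python
def find_islands(enriched, gap):
    """Cluster enriched windows into islands.

    enriched: one boolean per window; gap: maximum number of consecutive
    non-enriched windows allowed inside an island.
    Returns (first, last) window indices, both inclusive.
    """
    islands = []
    start = last = None
    for i, flag in enumerate(enriched):
        if not flag:
            continue
        if start is None:
            start = last = i
        elif i - last - 1 <= gap:
            last = i                      # bridge the gap, extend the island
        else:
            islands.append((start, last))
            start = last = i              # gap too long, open a new island
    if start is not None:
        islands.append((start, last))
    return islands


windows = [True, False, False, True, True, False, False, False, True]
print(find_islands(windows, gap=2))
print(find_islands(windows, gap=0))
```

With GAP = 2 the first five windows form one island because only two non-enriched windows separate them, while the last window stays isolated behind a gap of three.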

Step 2+: scoring

● The scoring function is based on the probability of observing the tag count in a random background

● Scoring for enriched window = -ln P(m, lambda)

● Scoring for island is the aggregated score of all enriched windows in the island; it corresponds to the background probability of finding the observed pattern

81

Score(I) = F* (Score(I1), Gap, Score(I2))
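The window score -ln P(m, lambda) can be written out directly for a Poisson background. In this sketch the island score is taken to be the plain sum of its enriched windows' scores, which is an assumption: the aggregation function F* on the slide is not specified. The threshold and counts are illustrative.

```python
import math


def window_score(m, lam):
    """-ln P(m, lam), where P is the Poisson probability of m tags at rate lam."""
    return lam - m * math.log(lam) + math.lgamma(m + 1)


def island_score(tag_counts, lam, threshold):
    """Sum of scores over the enriched windows (count >= threshold) of an island."""
    return sum(window_score(m, lam) for m in tag_counts if m >= threshold)


lam = 1.0                    # illustrative background rate per window
island = [0, 3, 5, 1]        # tag counts of the windows in one island
print(window_score(3, lam), window_score(5, lam))
print(island_score(island, lam, threshold=3))
```

Because scores are negative log probabilities, adding them corresponds to multiplying the background probabilities of the individual windows, so a high island score means the pattern is unlikely under random background.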

Step N: significance testing

● Use control library as background to calculate p-value for islands

● Or use random background model to calculate p-values for islands

● Compute q-values from p-values

● Filter by p-value or by q-value (FDR)

82
