Detecting over-represented k-mers in ChIP-seq peaks

Jacques van Helden and Denis Puthier

2015-11-10

Contents

Datasets
    6-mers
    7-mers
    8-mers
Solutions
    K-mer occurrences in the peaks
    K-mer occurrences in random genomic regions
    Build a table to compare k-mer occurrences between peaks and random genome regions
Evaluate different measures of over-representation
    M-A plot
    Log-likelihood ratio (LLR)
    Compute p-value of over-representation with the Poisson law
    Intermediate interpretation

Datasets

K-mer occurrences in CEBPA peaks from Smith et al (2010) in the mouse genome (Mus musculus).

6-mers

Data type            k  repeat       Table
CEBPA peaks          6               CEBPA_mm9_SWEMBL_R0.12_6nt-noov-2str.tab
genomic occurrences  6  full genome  mm10_genome_6nt-noov-2str.tab
Random regions       6  01           random-genome-fragments_mm10_repeat01_6nt-noov-2str.tab
Random regions       6  02           random-genome-fragments_mm10_repeat02_6nt-noov-2str.tab
Random regions       6  03           random-genome-fragments_mm10_repeat03_6nt-noov-2str.tab
Random regions       6  04           random-genome-fragments_mm10_repeat04_6nt-noov-2str.tab
Random regions       6  05           random-genome-fragments_mm10_repeat05_6nt-noov-2str.tab
Random regions       6  06           random-genome-fragments_mm10_repeat06_6nt-noov-2str.tab
Random regions       6  07           random-genome-fragments_mm10_repeat07_6nt-noov-2str.tab
Random regions       6  08           random-genome-fragments_mm10_repeat08_6nt-noov-2str.tab
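As an illustration, here is a minimal sketch of how one such k-mer table might be loaded in R. The file layout is an assumption (comment lines starting with ";", a header line starting with "#", tab-separated columns), and the toy file is written on the fly so that the example is self-contained. The variable name peaks.6nt follows the labels of the figures in the Solutions section.

```r
## ASSUMPTION: comment lines start with ";", the header line starts with "#",
## and columns are tab-separated. The real course tables may differ.
example.lines <- c(
  "; toy file written for this sketch, not the real course table",
  "#seq\tidentifier\tobs_freq\tocc\tovl_occ\tforbocc",
  "aaaaaa\taaaaaa|tttttt\t0.0009417790\t178\t232\t856",
  "aaaaac\taaaaac|gttttt\t0.0008729974\t165\t0\t798"
)
toy.file <- tempfile(fileext = ".tab")
writeLines(example.lines, toy.file)

## Read the table, using the first column (the k-mer sequence) as row names
peaks.6nt <- read.delim(toy.file, comment.char = ";", row.names = 1,
                        stringsAsFactors = FALSE)
print(peaks.6nt)
```

To read a real table, toy.file would be replaced by the path of the file, once its actual layout has been checked.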

7-mers

../../data/kmer_occurrences/CEBPA_mm9_peaks_Ballester_2010/CEBPA_mm9_SWEMBL_R0.12_6nt-noov-2str.tab

../../data/kmer_occurrences/mm10_for_ASG_course/mm10_genome_6nt-noov-2str.tab

../../data/kmer_occurrences/random_fragments_mm10/random-genome-fragments_mm10_repeat01_6nt-noov-2str.tab









8-mers


Solutions

K-mer occurrences in the peaks

[Figure: Histogram of peaks.6nt$occ. X axis: peaks.6nt$occ (0 to 400). Y axis: Frequency (0 to 250).]

K-mer occurrences in random genomic regions

[Figure: Histogram of rand.6nt$occ. X axis: rand.6nt$occ (0 to 400). Y axis: Frequency (0 to 300).]

       mean      min  max  sum
peaks  90.26060  1    426  187381
rand   86.99903  1    435  180262
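A summary table of this kind can be sketched as follows. The occurrence vectors contain only the six 6-mers listed in the tables of this document, so the resulting statistics differ from the values above, which cover all 6-mers. The helper occ.stats is a hypothetical name introduced for this example.

```r
## Toy data: occurrences of six 6-mers taken from the tables of this document
peaks.occ <- c(aaaaaa = 178, aaaaac = 165, aaaaag = 190,
               aaaaat = 193, aaaaca = 311, aaaacc = 124)
rand.occ  <- c(aaaaaa = 385, aaaaac = 275, aaaaag = 282,
               aaaaat = 425, aaaaca = 368, aaaacc = 186)

## Hypothetical helper: summary statistics of a vector of occurrences
occ.stats <- function(occ) {
  c(mean = mean(occ), min = min(occ), max = max(occ), sum = sum(occ))
}

## One row per sequence set
stats.table <- rbind(peaks = occ.stats(peaks.occ),
                     rand  = occ.stats(rand.occ))
print(stats.table)
```

Calling hist(peaks.occ) and hist(rand.occ) would then draw the distributions of occurrences for each set.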

Build a table to compare k-mer occurrences between peaks and random genome regions

        Row.names  identifier.x   obs_freq.x    occ.x  ovl_occ.x  forbocc.x
aaaaaa  aaaaaa     aaaaaa|tttttt  0.0009417790  178    232        856
aaaaac  aaaaac     aaaaac|gttttt  0.0008729974  165    0          798
aaaaag  aaaaag     aaaaag|cttttt  0.0010052697  190    0          942
aaaaat  aaaaat     aaaaat|attttt  0.0010211424  193    0          921
aaaaca  aaaaca     aaaaca|tgtttt  0.0016454678  311    6          1525
aaaacc  aaaacc     aaaacc|ggtttt  0.0006560708  124    0          613

        identifier.y   obs_freq.y   occ.y  ovl_occ.y  forbocc.y
aaaaaa  aaaaaa|tttttt  0.002096105  385    564        1893
aaaaac  aaaaac|gttttt  0.001497218  275    0          1354
aaaaag  aaaaag|cttttt  0.001535329  282    0          1378
aaaaat  aaaaat|attttt  0.002313882  425    3          2104
aaaaca  aaaaca|tgtttt  0.002003550  368    10         1808
aaaacc  aaaacc|ggtttt  0.001012664  186    0          918

[1] 2079

        peaks  rand  peak.freq     rand.freq    mean.freq
aaaaaa  178    385   0.0009499362  0.002135780  0.0015428582
aaaaac  165    275   0.0008805589  0.001525557  0.0012030581
aaaaag  190    282   0.0010139769  0.001564390  0.0012891832
aaaaat  193    425   0.0010299870  0.002357679  0.0016938332
aaaaca  311    368   0.0016597200  0.002041473  0.0018505965
aaaacc  124    186   0.0006617533  0.001031831  0.0008467924
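The construction of such a comparison table can be sketched as follows. This is one possible implementation, not necessarily the one used to produce the output above. The use of merge() on row names with all = TRUE is an assumption, suggested by the Row.names column and the .x/.y suffixes of the merged table. The totals are the sums reported in the summary table (187381 and 180262), because the toy data frames hold only six k-mers.

```r
## Toy data: occurrences of six 6-mers taken from the tables of this document
kmers <- c("aaaaaa", "aaaaac", "aaaaag", "aaaaat", "aaaaca", "aaaacc")
peaks.6nt <- data.frame(occ = c(178, 165, 190, 193, 311, 124), row.names = kmers)
rand.6nt  <- data.frame(occ = c(385, 275, 282, 425, 368, 186), row.names = kmers)

## ASSUMPTION: merge on row names, keeping k-mers found in either table
merged <- merge(peaks.6nt, rand.6nt, by = "row.names", all = TRUE)

## k-mers missing from one table get NA after the merge: count them as 0
merged$occ.x[is.na(merged$occ.x)] <- 0
merged$occ.y[is.na(merged$occ.y)] <- 0

kmer.comparison <- data.frame(peaks = merged$occ.x,
                              rand  = merged$occ.y,
                              row.names = as.character(merged$Row.names))

## Totals over all 6-mers, copied from the summary table of this document
peaks.total <- 187381
rand.total  <- 180262

## Relative frequencies in each set, and their mean
kmer.comparison$peak.freq <- kmer.comparison$peaks / peaks.total
kmer.comparison$rand.freq <- kmer.comparison$rand / rand.total
kmer.comparison$mean.freq <- (kmer.comparison$peak.freq +
                              kmer.comparison$rand.freq) / 2
print(kmer.comparison)
```

With complete tables, the totals would be computed directly with sum(kmer.comparison$peaks) and sum(kmer.comparison$rand).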

        identifier     obs_freq      occ  ovl_occ  forbocc
aaaaaa  aaaaaa|tttttt  0.0009417790  178  232      856
aaaaac  aaaaac|gttttt  0.0008729974  165  0        798
aaaaag  aaaaag|cttttt  0.0010052697  190  0        942
aaaaat  aaaaat|attttt  0.0010211424  193  0        921
aaaaca  aaaaca|tgtttt  0.0016454678  311  6        1525
aaaacc  aaaacc|ggtttt  0.0006560708  124  0        613

        identifier     obs_freq     occ  ovl_occ  forbocc
aaaaaa  aaaaaa|tttttt  0.002096105  385  564      1893
aaaaac  aaaaac|gttttt  0.001497218  275  0        1354
aaaaag  aaaaag|cttttt  0.001535329  282  0        1378
aaaaat  aaaaat|attttt  0.002313882  425  3        2104
aaaaca  aaaaca|tgtttt  0.002003550  368  10       1808
aaaacc  aaaacc|ggtttt  0.001012664  186  0        918

[Figure: 6nt occurrences, peaks vs random regions. X axis: Random regions (0 to 400). Y axis: Peak sequences (0 to 400).]

[Figure: scatter plot of the same data with default labels. X axis: kmer.comparison$rand (0 to 400). Y axis: kmer.comparison$peaks (0 to 400).]

Evaluate different measures of over-representation

[1] 0 Inf
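The output above is presumably the range of the peaks/random occurrence ratio: the ratio is 0 for a k-mer absent from the peaks, and infinite for a k-mer absent from the random regions. The sketch below reproduces this behaviour on toy data. The rows kmerA and kmerB are hypothetical and were added only to show the two extreme cases.

```r
## Toy comparison table; kmerA and kmerB are hypothetical k-mers
## with zero occurrences in one of the two sets
kmer.comparison <- data.frame(
  peaks = c(178, 311, 0, 3),
  rand  = c(385, 368, 2, 0),
  row.names = c("aaaaaa", "aaaaca", "kmerA", "kmerB"))

## Mean occurrences between the two sets (X axis of the ratio plots)
kmer.comparison$mean.occ <- rowMeans(kmer.comparison[, c("peaks", "rand")])

## Occurrence ratio and its log2 transformation
kmer.comparison$ratio      <- kmer.comparison$peaks / kmer.comparison$rand
kmer.comparison$log2.ratio <- log2(kmer.comparison$ratio)

print(kmer.comparison)
print(range(kmer.comparison$ratio))
```

The log2 transformation makes the measure symmetrical around 0 (a twofold enrichment gives +1, a twofold depletion gives -1), but the zero counts still produce infinite values.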

[Figure: Occurrence ratios. X axis: Mean occurrences (0 to 350). Y axis: Peaks/rand occ ratio (0 to 10).]

[Figure: Occurrences log2 ratio. X axis: Mean occurrences (0 to 350). Y axis: log2(peaks/rand) (-3 to 3).]

M-A plot

[Figure: MA plot. X axis: Mean log2 occurrences (0 to 8). Y axis: log2(peaks/rand) (-3 to 3).]

Finally, I prefer to keep the mean occurrences on the X axis rather than the log2(mean occ).
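The coordinates of an M-A plot can be computed as in the sketch below, which uses the usual definitions: M is the log2 ratio and A is the mean of the log2 occurrences. The column names M and A are chosen for this example, and the data are the six 6-mers listed earlier in this document.

```r
## Toy data: occurrences of six 6-mers taken from the tables of this document
kmer.comparison <- data.frame(
  peaks = c(178, 165, 190, 193, 311, 124),
  rand  = c(385, 275, 282, 425, 368, 186),
  row.names = c("aaaaaa", "aaaaac", "aaaaag", "aaaaat", "aaaaca", "aaaacc"))

## M: log2 ratio between occurrences in peaks and in random regions
kmer.comparison$M <- log2(kmer.comparison$peaks / kmer.comparison$rand)

## A: mean of the log2 occurrences in the two sets
kmer.comparison$A <- (log2(kmer.comparison$peaks) +
                      log2(kmer.comparison$rand)) / 2

print(kmer.comparison)
```

A call such as plot(kmer.comparison$A, kmer.comparison$M) would then draw the M-A plot. Replacing A by the mean occurrences gives the alternative with a linear X axis.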

Log-likelihood ratio (LLR)

$$\mathrm{LLR} = f_{exp} \cdot \log_2\left(\frac{f_{obs}}{f_{exp}}\right)$$

[Figure: Occurrences log2 ratio. X axis: Mean occurrences (0 to 350). Y axis: log2(peaks/rand) (-3 to 3).]

[Figure: Log-likelihood ratio. X axis: Mean occurrences (0 to 350). Y axis: rand.freq * log2(peaks/rand) (-0.002 to 0.001).]

The log-likelihood ratio is effective in reducing the impact of small number fluctuations: the rare k-mers (left side of the LLR plot) achieve very low scores, whereas the ratio or log2-ratio tended to put a high emphasis on them.
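This effect can be checked on a small example. In the sketch below, the expected frequency is taken from the random regions and the observed frequency from the peaks, which is an assumption based on the Y-axis label of the LLR plot. The row named rare is a hypothetical k-mer, and the totals are those of the summary table (187381 and 180262).

```r
## Toy comparison table; "rare" is a hypothetical k-mer with very few occurrences
kmer.comparison <- data.frame(
  peaks = c(178, 311, 124, 4),
  rand  = c(385, 368, 186, 1),
  row.names = c("aaaaaa", "aaaaca", "aaaacc", "rare"))

## Totals over all 6-mers, copied from the summary table of this document
peaks.total <- 187381
rand.total  <- 180262

## ASSUMPTION: f_obs = frequency in peaks, f_exp = frequency in random regions
kmer.comparison$peak.freq <- kmer.comparison$peaks / peaks.total
kmer.comparison$rand.freq <- kmer.comparison$rand / rand.total

## log2 ratio of frequencies, and LLR = f_exp * log2(f_obs / f_exp)
kmer.comparison$log2.ratio <- log2(kmer.comparison$peak.freq /
                                   kmer.comparison$rand.freq)
kmer.comparison$llr <- kmer.comparison$rand.freq * kmer.comparison$log2.ratio

print(kmer.comparison[, c("peaks", "rand", "log2.ratio", "llr")])
```

The rare k-mer has the largest log2 ratio in absolute value but the smallest LLR, because the log2 ratio is weighted by a very small frequency.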

Compute p-value of over-representation with the Poisson law

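[Figure: Poisson p-value as a function of the mean log2 occurrences. X axis: Mean log2 occurrences (0 to 8). Y axis: Poisson p-value (logarithmic scale, ticks at 1e-161, 1e-93, 1e-25).]

[Figure: Volcano plot. X axis: log2-ratio of occurrences (-3 to 3). Y axis: Poisson p-value (ticks from 0 to 150).]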
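A possible way to compute such p-values is sketched below. The estimate of the expected occurrences (frequency in the random regions multiplied by the total occurrences in the peaks) is an assumption, since the computation is not spelled out here. The row named enriched is a hypothetical k-mer, and the totals are those of the summary table.

```r
## Toy comparison table; "enriched" is a hypothetical over-represented k-mer
kmer.comparison <- data.frame(
  peaks = c(178, 311, 150),
  rand  = c(385, 368, 50),
  row.names = c("aaaaaa", "aaaaca", "enriched"))

## Totals over all 6-mers, copied from the summary table of this document
peaks.total <- 187381
rand.total  <- 180262

## ASSUMPTION: expected occurrences in the peaks are derived from the
## k-mer frequency observed in the random regions
kmer.comparison$rand.freq <- kmer.comparison$rand / rand.total
kmer.comparison$exp.occ   <- kmer.comparison$rand.freq * peaks.total

## P-value of over-representation: P(X >= observed) under a Poisson law.
## ppois(q, lower.tail = FALSE) returns P(X > q), hence the "- 1".
kmer.comparison$p.value <- ppois(kmer.comparison$peaks - 1,
                                 lambda = kmer.comparison$exp.occ,
                                 lower.tail = FALSE)

## Significance on a -log10 scale, convenient for a volcano plot
kmer.comparison$sig <- -log10(kmer.comparison$p.value)

print(kmer.comparison)
```

The k-mers with fewer occurrences in the peaks than expected get a p-value close to 1, whereas the hypothetical enriched k-mer gets a very small one.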

Intermediate interpretation

So far, we performed all our analyses using a random selection of genomic regions (“random peaks”) as background sequences in order to estimate the expected number of occurrences of each k-mer in the peaks. These random peaks had been selected with the same size as the actual peaks, so the total number of occurrences was supposed to be more or less the same as in the peaks (small differences may occur due to the presence of N characters in the genomic sequences).

However, the results are problematic, because the random expectation is estimated based on a small sequence set, so that the numbers can fluctuate, especially for rare k-mers. We even noticed that some hexamers have zero occurrences in the random peaks.
