Detecting over-represented k-mers in ChIP-seq peaks
Jacques van Helden and Denis Puthier
2015-11-10
Contents
Datasets
6-mers
7-mers
8-mers
Solutions
K-mer occurrences in the peaks
K-mer occurrences in random genomic regions
Build a table to compare k-mer occurrences between peaks and random genome regions
Evaluate different measures of over-representation
M-A plot
Log-likelihood ratio (LLR)
Compute p-value of over-representation with the Poisson law
Intermediate interpretation
Datasets
K-mer occurrences in CEBPA peaks from Smith et al (2010) in the mouse genome (Mus musculus).
6-mers
| Data type | k | repeat | Table |
|---|---|---|---|
| CEBPA peaks | 6 | | CEBPA_mm9_SWEMBL_R0.12_6nt-noov-2str.tab |
| genomic occurrences | 6 | full genome | mm10_genome_6nt-noov-2str.tab |
| Random regions | 6 | 01 | random-genome-fragments_mm10_repeat01_6nt-noov-2str.tab |
| Random regions | 6 | 02 | random-genome-fragments_mm10_repeat02_6nt-noov-2str.tab |
| Random regions | 6 | 03 | random-genome-fragments_mm10_repeat03_6nt-noov-2str.tab |
| Random regions | 6 | 04 | random-genome-fragments_mm10_repeat04_6nt-noov-2str.tab |
| Random regions | 6 | 05 | random-genome-fragments_mm10_repeat05_6nt-noov-2str.tab |
| Random regions | 6 | 06 | random-genome-fragments_mm10_repeat06_6nt-noov-2str.tab |
| Random regions | 6 | 07 | random-genome-fragments_mm10_repeat07_6nt-noov-2str.tab |
| Random regions | 6 | 08 | random-genome-fragments_mm10_repeat08_6nt-noov-2str.tab |
7-mers
| Data type | k | repeat | Table |
|---|---|---|---|
| CEBPA peaks | 7 | | CEBPA_mm9_SWEMBL_R0.12_7nt-noov-2str.tab |
| genomic occurrences | 7 | full genome | mm10_genome_7nt-noov-2str.tab |
| Random regions | 7 | 01 | random-genome-fragments_mm10_repeat01_7nt-noov-2str.tab |
| Random regions | 7 | 02 | random-genome-fragments_mm10_repeat02_7nt-noov-2str.tab |
| Random regions | 7 | 03 | random-genome-fragments_mm10_repeat03_7nt-noov-2str.tab |
| Random regions | 7 | 04 | random-genome-fragments_mm10_repeat04_7nt-noov-2str.tab |
| Random regions | 7 | 05 | random-genome-fragments_mm10_repeat05_7nt-noov-2str.tab |
| Random regions | 7 | 06 | random-genome-fragments_mm10_repeat06_7nt-noov-2str.tab |
| Random regions | 7 | 07 | random-genome-fragments_mm10_repeat07_7nt-noov-2str.tab |
| Random regions | 7 | 08 | random-genome-fragments_mm10_repeat08_7nt-noov-2str.tab |
8-mers
| Data type | k | repeat | Table |
|---|---|---|---|
| CEBPA peaks | 8 | | CEBPA_mm9_SWEMBL_R0.12_8nt-noov-2str.tab |
| genomic occurrences | 8 | full genome | mm10_genome_8nt-noov-2str.tab |
| Random regions | 8 | 01 | random-genome-fragments_mm10_repeat01_8nt-noov-2str.tab |
| Random regions | 8 | 02 | random-genome-fragments_mm10_repeat02_8nt-noov-2str.tab |
| Random regions | 8 | 03 | random-genome-fragments_mm10_repeat03_8nt-noov-2str.tab |
| Random regions | 8 | 04 | random-genome-fragments_mm10_repeat04_8nt-noov-2str.tab |
| Random regions | 8 | 05 | random-genome-fragments_mm10_repeat05_8nt-noov-2str.tab |
| Random regions | 8 | 06 | random-genome-fragments_mm10_repeat06_8nt-noov-2str.tab |
| Random regions | 8 | 07 | random-genome-fragments_mm10_repeat07_8nt-noov-2str.tab |
| Random regions | 8 | 08 | random-genome-fragments_mm10_repeat08_8nt-noov-2str.tab |
Solutions
K-mer occurrences in the peaks
[Figure: Histogram of peaks.6nt$occ. X axis: peaks.6nt$occ; Y axis: Frequency.]
K-mer occurrences in random genomic regions
[Figure: Histogram of rand.6nt$occ. X axis: rand.6nt$occ; Y axis: Frequency.]
          mean min max    sum
peaks 90.26060   1 426 187381
rand  86.99903   1 435 180262
Build a table to compare k-mer occurrences between peaks and random genome regions
       Row.names  identifier.x   obs_freq.x occ.x ovl_occ.x forbocc.x
aaaaaa    aaaaaa aaaaaa|tttttt 0.0009417790   178       232       856
aaaaac    aaaaac aaaaac|gttttt 0.0008729974   165         0       798
aaaaag    aaaaag aaaaag|cttttt 0.0010052697   190         0       942
aaaaat    aaaaat aaaaat|attttt 0.0010211424   193         0       921
aaaaca    aaaaca aaaaca|tgtttt 0.0016454678   311         6      1525
aaaacc    aaaacc aaaacc|ggtttt 0.0006560708   124         0       613
        identifier.y  obs_freq.y occ.y ovl_occ.y forbocc.y
aaaaaa aaaaaa|tttttt 0.002096105   385       564      1893
aaaaac aaaaac|gttttt 0.001497218   275         0      1354
aaaaag aaaaag|cttttt 0.001535329   282         0      1378
aaaaat aaaaat|attttt 0.002313882   425         3      2104
aaaaca aaaaca|tgtttt 0.002003550   368        10      1808
aaaacc aaaacc|ggtttt 0.001012664   186         0       918
[1] 2079
       peaks rand    peak.freq   rand.freq    mean.freq
aaaaaa   178  385 0.0009499362 0.002135780 0.0015428582
aaaaac   165  275 0.0008805589 0.001525557 0.0012030581
aaaaag   190  282 0.0010139769 0.001564390 0.0012891832
aaaaat   193  425 0.0010299870 0.002357679 0.0016938332
aaaaca   311  368 0.0016597200 0.002041473 0.0018505965
aaaacc   124  186 0.0006617533 0.001031831 0.0008467924
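The frequency columns of this comparison table can be illustrated with a minimal sketch, written in Python for illustration only (the outputs in this document come from R). It assumes that each relative frequency is the occurrence count divided by the total number of occurrences in the corresponding sequence set (the `sum` column of the summary table), which reproduces the values printed for the first k-mers. The function and variable names are hypothetical.

```python
# Occurrence counts of the first 6-mers, copied from the comparison table.
peaks_occ = {"aaaaaa": 178, "aaaaac": 165, "aaaaag": 190}
rand_occ = {"aaaaaa": 385, "aaaaac": 275, "aaaaag": 282}

# Total occurrences over all 6-mers ("sum" column of the summary table).
PEAKS_TOTAL = 187381
RAND_TOTAL = 180262


def relative_frequencies(occ, total):
    """Divide each count by the total (assumed definition of the frequency)."""
    return {kmer: n / total for kmer, n in occ.items()}


peak_freq = relative_frequencies(peaks_occ, PEAKS_TOTAL)
rand_freq = relative_frequencies(rand_occ, RAND_TOTAL)

# Mean of the two frequencies, one value per k-mer.
mean_freq = {k: (peak_freq[k] + rand_freq[k]) / 2 for k in peak_freq}

for kmer in peaks_occ:
    print(kmer, peaks_occ[kmer], rand_occ[kmer],
          f"{peak_freq[kmer]:.10f}", f"{rand_freq[kmer]:.9f}",
          f"{mean_freq[kmer]:.10f}")
```

Note that these frequencies differ slightly from the `obs_freq` columns of the input tables, which were apparently computed with a different denominator.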
          identifier     obs_freq occ ovl_occ forbocc
aaaaaa aaaaaa|tttttt 0.0009417790 178     232     856
aaaaac aaaaac|gttttt 0.0008729974 165       0     798
aaaaag aaaaag|cttttt 0.0010052697 190       0     942
aaaaat aaaaat|attttt 0.0010211424 193       0     921
aaaaca aaaaca|tgtttt 0.0016454678 311       6    1525
aaaacc aaaacc|ggtttt 0.0006560708 124       0     613
          identifier    obs_freq occ ovl_occ forbocc
aaaaaa aaaaaa|tttttt 0.002096105 385     564    1893
aaaaac aaaaac|gttttt 0.001497218 275       0    1354
aaaaag aaaaag|cttttt 0.001535329 282       0    1378
aaaaat aaaaat|attttt 0.002313882 425       3    2104
aaaaca aaaaca|tgtttt 0.002003550 368      10    1808
aaaacc aaaacc|ggtttt 0.001012664 186       0     918
[Figure: 6nt occurrences, peaks vs random regions. X axis: Random regions; Y axis: Peak sequences.]
[Figure: Scatter plot. X axis: kmer.comparison$rand; Y axis: kmer.comparison$peaks.]
Evaluate different measures of over-representation
[1] 0 Inf
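The `0 Inf` output above is consistent with a ratio that ranges from zero (k-mers absent from the peaks) to infinity (k-mers absent from the random regions). The sketch below, in Python for illustration, shows how the occurrence ratio and its log2 transform behave in those corner cases. The function names and the zero-count examples are hypothetical; only the counts 311 and 368 (k-mer aaaaca) come from the tables above.

```python
import math


def occ_ratio(peaks, rand):
    """Peaks/rand occurrence ratio, with explicit handling of zero counts."""
    if rand == 0:
        # Division by zero: infinite ratio, or undefined when both are zero.
        return math.inf if peaks > 0 else math.nan
    return peaks / rand


def log2_ratio(peaks, rand):
    """log2 of the occurrence ratio; -inf when the k-mer is absent from peaks."""
    ratio = occ_ratio(peaks, rand)
    if math.isnan(ratio):
        return math.nan
    if ratio == 0:
        return -math.inf
    return math.log2(ratio)


# aaaaca from the comparison table, then two hypothetical zero-count cases.
for peaks, rand in [(311, 368), (5, 0), (0, 7)]:
    print(peaks, rand, occ_ratio(peaks, rand), log2_ratio(peaks, rand))
```

The log2 transform makes over- and under-representation symmetrical around zero, but it does not remove the infinite values caused by zero counts.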
[Figure: Occurrence ratios. X axis: Mean occurrences; Y axis: Peaks/rand occ ratio.]
[Figure: Occurrences log2 ratio. X axis: Mean occurrences; Y axis: log2(peaks/rand).]
M-A plot
[Figure: MA plot. X axis: Mean log2 occurrences; Y axis: log2(peaks/rand).]
Finally, I prefer to keep the mean occurrences on the X axis rather than log2(mean occurrences).
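For comparison, the two choices of X axis can be computed side by side. This is a sketch in Python under the usual M-A convention, which is an assumption here: M is the log2 ratio and A is the mean of the two log2 counts. The function name is hypothetical, and both counts must be strictly positive.

```python
import math


def ma_coordinates(peaks, rand):
    """Return (M, A) for one k-mer, assuming the classical M-A definition.

    M = log2(peaks) - log2(rand)
    A = (log2(peaks) + log2(rand)) / 2
    """
    if peaks <= 0 or rand <= 0:
        raise ValueError("M-A coordinates require strictly positive counts")
    log_peaks = math.log2(peaks)
    log_rand = math.log2(rand)
    return log_peaks - log_rand, (log_peaks + log_rand) / 2


# Counts of aaaaaa taken from the comparison table.
peaks, rand = 178, 385
m, a = ma_coordinates(peaks, rand)
mean_occ = (peaks + rand) / 2  # alternative X axis: plain mean occurrences
print(f"M={m:.3f}  A={a:.3f}  mean occurrences={mean_occ}")
```

Keeping the plain mean on the X axis preserves the original count scale, at the cost of compressing the rare k-mers into the left edge of the plot.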
Log-likelihood ratio (LLR)
$$LLR = f_{exp} \cdot \log_2\left(\frac{f_{obs}}{f_{exp}}\right)$$
[Figure: Occurrences log2 ratio. X axis: Mean occurrences; Y axis: log2(peaks/rand).]
[Figure: Log-likelihood ratio. X axis: Mean occurrences; Y axis: rand.freq * log2(peaks/rand).]
The log-likelihood ratio is effective in reducing the impact of small number fluctuations: the rare k-mers (left side of the LLR plot) achieve very low scores, whereas the ratio or log2-ratio tended to put a high emphasis on them.
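This damping effect can be checked numerically with a small Python sketch of the formula given above, where the expected frequency is the one measured in the random regions. The two example k-mers are hypothetical: both are over-represented two-fold, but one is a hundred times rarer than the other.

```python
import math


def llr(f_obs, f_exp):
    """LLR = f_exp * log2(f_obs / f_exp), as defined in this section.

    Both frequencies must be strictly positive.
    """
    if f_obs <= 0 or f_exp <= 0:
        raise ValueError("frequencies must be strictly positive")
    return f_exp * math.log2(f_obs / f_exp)


# Hypothetical k-mers with the same two-fold enrichment (log2 ratio = 1).
rare = llr(f_obs=2e-5, f_exp=1e-5)
frequent = llr(f_obs=2e-3, f_exp=1e-3)
print(f"rare k-mer:     LLR = {rare:.2e}")
print(f"frequent k-mer: LLR = {frequent:.2e}")

# aaaaca, using peak.freq and rand.freq from the comparison table.
print(f"aaaaca:         LLR = {llr(0.0016597200, 0.002041473):.2e}")
```

Because the log2 ratio is weighted by the expected frequency, the rare k-mer scores a hundred times lower than the frequent one although their ratios are identical.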
Compute p-value of over-representation with the Poisson law
[Figure: Poisson p-value as a function of the mean log2 occurrences. X axis: Mean log2 occurrences; Y axis: Poisson p-value.]
[Figure: Volcano plot. X axis: log2-ratio of occurrences; Y axis: Poisson p-value.]
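A Poisson p-value of over-representation can be sketched as the upper tail P(X >= x), where x is the number of occurrences in the peaks and the expected number is derived from the random regions. The Python sketch below uses only the standard library and sums the tail directly, so that very small p-values are not lost to rounding. The function names are hypothetical, and scaling the random count by the ratio of total occurrences is an assumption, not necessarily the authors' choice.

```python
import math


def poisson_log_pmf(k, lam):
    """Natural log of P(X = k) for X ~ Poisson(lam)."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def poisson_upper_tail(x, lam, rel_tol=1e-15):
    """P(X >= x) for X ~ Poisson(lam), with x a non-negative integer."""
    if lam <= 0:
        raise ValueError("lam must be strictly positive")
    if x <= 0:
        return 1.0
    if x <= lam:
        # Below the mean the tail is large: use 1 - P(X < x).
        lower = sum(math.exp(poisson_log_pmf(k, lam)) for k in range(x))
        return max(0.0, 1.0 - lower)
    # Above the mean the terms decrease: sum them until they become negligible.
    term = math.exp(poisson_log_pmf(x, lam))
    total = 0.0
    k = x
    while term > 0.0 and term > total * rel_tol:
        total += term
        k += 1
        term *= lam / k
    return total


def expected_occ(rand_occ, peaks_total, rand_total):
    """Expected count in the peaks (assumption: rescale by total occurrences)."""
    return rand_occ * peaks_total / rand_total


# aaaaca: 311 occurrences in the peaks, 368 in the random regions.
lam = expected_occ(368, 187381, 180262)
print(f"expected = {lam:.1f}, p-value = {poisson_upper_tail(311, lam):.3g}")
```

For this k-mer the observed count is below the expectation, so the p-value of over-representation is close to 1, as it should be.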
Intermediate interpretation
So far, we performed all our analyses using a random selection of genomic regions (“random peaks”) as background sequences in order to estimate the expected number of occurrences of each k-mer in the peaks. These random peaks had been selected with the same size as the actual peaks, so the total number of occurrences was supposed to be more or less the same as in the peaks (small differences may occur due to the presence of N characters in the genomic sequences).
However, the results are problematic, because the random expectation is estimated based on a small sequence set, so that the numbers can fluctuate, especially for rare k-mers. We even noticed that some hexamers have zero occurrences in the random peaks