Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Principles and Methods in Regulatory Genomics
Bioinfo1 — MoBi Bachelor 5. FS
WS 2017/2018
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
What is it all about …
• What are the molecular actors of transcriptional regulation ?๏ Role of transcription factors ?๏ Role of epigenetic modifications ?๏ Role of 3D chromatin structure ?
• What kind of experimental data can we use to understand transcriptional regulation ?
• How can we use bioinformatics tools to make predictions ?
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Bigger genome = more evolved ?
EukaryotesArchaeBacteria
[interactive Tree of Life]
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Bigger genome = more evolved ?
Genome size
EukaryotesArchaeBacteria
[interactive Tree of Life]
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
More genes = more complex ?
Size of genome (Mb)
Num
ber o
f gen
esGlycine- 1.1 Gb- 46000 genes
BacteriumPlantAnimal
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
More genes = more complex ?
Size of genome (Mb)
Num
ber o
f gen
esGlycine- 1.1 Gb- 46000 genes
BacteriumPlantAnimal
Human- 3.3 Gb- 23000 genes
[https://gf.neocities.org/gs/genes.html]
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
More genes = more complex ?
Size of genome (Mb)
Num
ber o
f gen
esGlycine- 1.1 Gb- 46000 genes
BacteriumPlantAnimal
Human- 3.3 Gb- 23000 genes
C. elegans- 100 Mb- 20000 genes
[https://gf.neocities.org/gs/genes.html]
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Bigger non-coding genome = higher complexity
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Bigger non-coding genome = higher complexity
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Bigger non-coding genome = higher complexity
Proportion of non-coding DNA correlates withorganismal complexity
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Junk DNA ?
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Exploring the genome's activity
• Large scale consortia (ENCODE, Roadmap, …) have systematically explored the activity of the genome using experimental assays
https://www.encodeproject.org/
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Exploring the genome's activity
• Large scale consortia (ENCODE, Roadmap, …) have systematically explored the activity of the genome using experimental assays
"The vast majority (80.4%) of the human genome participates in at least one biochemical RNA- and/or chromatin-associated event in at least one cell type. 99% is within 1.7 kb of at least one of the biochemical events measured by ENCODE."
https://www.encodeproject.org/
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Exploring the genome's activity
• Large scale consortia (ENCODE, Roadmap, …) have systematically explored the activity of the genome using experimental assays
"The vast majority (80.4%) of the human genome participates in at least one biochemical RNA- and/or chromatin-associated event in at least one cell type. 99% is within 1.7 kb of at least one of the biochemical events measured by ENCODE."
https://www.encodeproject.org/
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Transcriptional regulation in metazoans
[Elodie Darbo]
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Transcriptional regulation in metazoans
1. Binding of site specifictranscription factors
[Elodie Darbo]
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Transcriptional regulation in metazoans
1. Binding of site specifictranscription factors
2. chromatin structure,epigenetic
[Elodie Darbo]
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Transcriptional regulation in metazoans
1. Binding of site specifictranscription factors
2. chromatin structure,epigenetic
3. three-dimensionalDNA looping
[Elodie Darbo]
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Transcriptional regulation in metazoans
4. Readout:gene expression
1. Binding of site specifictranscription factors
2. chromatin structure,epigenetic
3. three-dimensionalDNA looping
[Elodie Darbo]
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
1. data types
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Experimental methods
• Genetic component:
• Chromatin structure and epigenetic
• Three-dimensional DNA looping
• Readout: gene expression
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Experimental methods
• Genetic component:→ ChIP-seq: transcription factor binding sites
• Chromatin structure and epigenetic→ ChIP-seq : post-translational histone modifications→ whole genome bisulfite sequencing, arrays : DNA methylation→ ATAC-seq, DNAse-seq, FAIRE-seq : open chromatin region
• Three-dimensional DNA looping→ 3C/4C/Hi-C : interacting chromatin regions
• Readout: gene expression→ RNA-seq : expression of transcribed elements
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Chromatin Immunoprecipitations
• Chromatin immunoprecipitation (ChIP) yields DNA fragments, that are ๏ bound by the protein of interest๏ marked by a specific chemical
modification (acetylation, methylation,.)
• Identification of the fragments :๏ sequencing (ChIP-seq)
→ genome-wide๏ PCR/qPCR
→ targeted experiment
• Important aspect๏ Quality/Specificity of the antibody ?๏ DNA fragment (~200-300bp)
→ binding site (~10 bp) ?
[Park, Nat.Rev. 2009]
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
ChIP-sequencing
[Wilbanks & Faccioti PLoS One (2010)]]
Sharp signal Broad signal
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
ChIP-seq for histone modifications
• histones are subject to post-translational modifications at their N-terminal tail๏ Lysine methylation๏ Lysine/arginine acetylation๏ Serine phosphorylation๏ ubiquitylation
• they modify the physical properties of the DNA-nucleosome interactions
nomenclature: H3K27ac = acetylation of lysine 27 on histone 3
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Histone modifications
histone modifications are a good proxy of gene expression and presence of regulatory elements
active marks→ open chromatinH3K4me1; H3K4me3;H3K27ac
repressive marks→ closed chromatinH3K27me3; H3K9me3
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Histone modifications
• Histone marks have distinct signal profiles๏ sharp signal : narrow peaks of enrichment at specific loci (H3K4me3 =
promoters, H3K27ac = enhancers,…)๏ broad signal : wide regions of enrichment (H3K36me3 = transcribed
genes; H3K27me3 = repressed regions)
H3K4me1
H3K4me3
H3K9ac
H3K9me3
H3K27ac
H3K27me3
H3K36me3
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Example of ChIP-seq signal for transcription factors / DNA-binding proteins
MYCN
H3K4me3
DNMT1
H3K27ac
distal MYCN peak
(away from gene promoters)
MYCN peakand H3K4me3
enrichment at promoter
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Measuring DNA methylation
• DNA methylation occurs mainly on cytosines in CpG dinucleotides in the human genome
• DNA methylation is revealed byusing bisulfite conversion:๏ unmethylated cytosines are converted
C → U → T๏ methylated cytosines are protected
mC → mC
• unmethylated CpG are identified by the presence of a mismatch TpG
• 2 approaches: ๏ array based๏ sequencing
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Measuring DNA methylation
• Array based methods
• CpG containing probes on array๏ 27K probes๏ 450K probes ๏ 800K (EPIC)
• all probes contain a methylated (C) and unmethylated (T) version
• Cheap but sparse
• Whole genome bisulfite sequencing
• unmethylated C → T
• methylated C → C
• Shearing, conversion and sequencing (Illumina X-10)
• Information about the 28 million CpGs
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Measuring DNA methylation
• Array based methods
• CpG containing probes on array๏ 27K probes๏ 450K probes ๏ 800K (EPIC)
• all probes contain a methylated (C) and unmethylated (T) version
• Cheap but sparse
• Whole genome bisulfite sequencing
• unmethylated C → T
• methylated C → C
• Shearing, conversion and sequencing (Illumina X-10)
• Information about the 28 million CpGs
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Example DNA methylation
• Whole genome bisulfite sequencing provide information about all CpGs in the genome
• Vertical bars = CpG positions; red = high methylation (100%); blue = no methylation (~10%)
sam
ples
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Example DNA methylation
• Genome appears highly methylated at a global scale except at gene promoters or CpG islands (blue stripes)
• Some samples have differential methylation (grey box)
• Impact on gene regulation:→ DNAme is a repressivemark which inhibits bindingof transcription factors andexpression of genes
• in cancer:๏ hypomethylation : aberrant
expression of oncogenes๏ hypermethylation : aberant
repression of tumor suppressors
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Mapping chromatin interactions
• DNA looping allows interactions between distal DNA loci
• Identification of interacting regions through “chromatin conformation capture” methods (3C / 4C / Hi-C)
[Liebermann-Aiden, 2009]
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Hi-C and topological domains
The signal at this pointindicates the strength ofthe signal between these 2 loci
Topologicalassociateddomains
[Dixon (2012,2015)][http://promoter.bx.psu.edu/hi-c/view.php]
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Closer look : 4C
• 4C allows to interrogate a specific locus (“viewpoint”) for all its interaction partners
• used to identify all promoter-enhancer interactions for a given gene
Viewpoint sequenceInteracting sequence
3C = one-vs-one 4C = one-vs-all
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Closer look: 4C
View point(here : enhancerregion)
Interacting regions(here: promoters)
[Dreidax, Gartlgruber (DKFZ)]
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Chromatin interactions shape the cell identity
[Spurell et al., Cell 2016]
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
A complex picture
• Long range interactions show numerous enhancer-promoter interactions but also many promoter-promoter interactions
• Hypothesis : Promoters can act in trans and be used as enhancers for distal genes
"Intriguingly, the multigene complexes illustrated in this study are, in principle, akin to the bacterial operon as a mechanism for coordinated transcriptional regulation of related genes, suggesting the possibility of a chromatin-based operon mechanism (chro-operon or chroperon) for spatiotemporal regulation of gene transcription in eukaryotic nuclei."
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
2. Binding of transcription factors
• protein-DNA interactions
• experimental approaches to determine binding sites (BS)
• representing binding specificity of TFs
• databases
• comparing binding profiles
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
DNA binding domains
• Transcription factors contain a DNA binding domain (DBD) and a transcriptional activator (TA)
• Homologous TFs share similar DBDs (here: forkhead)
[Luscombe 2010]
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Protein DNA interactions
• majority of protein-DNA interactions for TF occur through a alpha-helix fitting into the major groove (=DNA binding domain)
• hydrogen bonds with specific bases
• stabilization of the protein-DNA complex is ensured by additional structures (helix, beta-sheet) via van der Walls interactions
[Luscombe et al., NAR (2001)][Cheng et al., JMB (2003)]
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Protein DNA interactions
• majority of protein-DNA interactions for TF occur through a alpha-helix fitting into the major groove (=DNA binding domain)
• hydrogen bonds with specific bases
• stabilization of the protein-DNA complex is ensured by additional structures (helix, beta-sheet) via van der Walls interactions
[Luscombe et al., NAR (2001)][Cheng et al., JMB (2003)]
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Structural family: Zinc coordinating
cysteine
histidine
Egr1/Zif268
Finger 1
Finger 2
Finger 3
Cys2His2 Fold (“Zinc finger”)→ one of the most common familyof transcription factors in mammalians
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Structural family: Zinc coordinating
cysteine
histidine
Egr1/Zif268
Finger 1
Finger 2
Finger 3
Cys2His2 Fold (“Zinc finger”)→ one of the most common familyof transcription factors in mammalians
Finger 1 Finger 2 Finger 3
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Structural family: Zinc coordinating
- 6 cysteines- 2 Zinc atoms= Cys6Zn2zinc-fingers
Gal4 S. cerevisae → binds as homodimer
[Campbell (2008)]
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Structural family: Zinc coordinating
- 6 cysteines- 2 Zinc atoms= Cys6Zn2zinc-fingers
Gal4 S. cerevisae → binds as homodimer
TF binding siteis a dyad (short repeatedmotif with spacer)[Campbell (2008)]
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Structural families of TFs
[Weirauch, Hughes (2010)]
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Structural families of TFs
[Weirauch, Hughes (2010)]
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Structural families of TFs
• in human, 3 classes account for > 80% of known transcription factors๏ Zinc-finger C2H2
(→ Zinc finger nucleases)๏ homeodomain
(e.g. Hox TFs)๏ helix-loop-helix factors
(e.g. c-Myc, N-Myc,...)
[Vaquerizas et al. Nat.Rev.Gen (2009)]
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
From protein to DNA motif
• Transcription factors belonging to the same structural family have similar binding profiles
• Highly conserved accross species
• conserved core motif
• variable flanking regions→ specificity
[Wei et al. EMBO Journal 2010]
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
From protein to DNA motif
• Transcription factors belonging to the same structural family have similar binding profiles
• Highly conserved accross species
• conserved core motif
• variable flanking regions→ specificity
[Wei et al. EMBO Journal 2010]
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
2. Binding of transcription factors
• protein-DNA interactions
• experimental approaches to determine binding sites (BS)
• representing binding specificity of TFs
• databases
• comparing binding profiles
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Experimental identification of binding sites
• DNAse1 footprint:protein-bound DNA is protected from digestion by the DNAse1 enzyme
• gel electrophoresis allows identification of the digested / undigested fragments
• if whole sequence is known, binding sites can be identified
• in vitro / low-throuput
[ A.Stenlund The EMBO Journal (2003) ]
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Experimental identification of binding sites
Identification of open chromatin regions → potential binding sites for proteins
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Experimental identification of binding sites
• ATAC-seq: using Tn5 transposase prepared with sequencing primers
• requires a small number of input material (~10,000 cells)
• identification of open chromatin regions (peaks)
[Greenleaf (2013)]
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Experimental identification of binding sites
• From open regions to transcription factor binding sites→ footprinting
• Zooming into the peaks (open regions) : valleys of undigested / un-transposed DNA→ TF binding sites (TFBS)
• binding sequence can be identified with base-pair resolution
[Greenleaf (2013)]
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Experimental identification of binding sites
• From open regions to transcription factor binding sites→ footprinting
• Zooming into the peaks (open regions) : valleys of undigested / un-transposed DNA→ TF binding sites (TFBS)
• binding sequence can be identified with base-pair resolution
[Greenleaf (2013)]
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Experimental identification of binding sites
• From open regions to transcription factor binding sites→ footprinting
• Zooming into the peaks (open regions) : valleys of undigested / un-transposed DNA→ TF binding sites (TFBS)
• binding sequence can be identified with base-pair resolution
[Greenleaf (2013)]
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Experimental identification of binding sites
• SELEX (= systematic evolution of ligands by exponential enrichment)
๏ random library of k-mers generated๏ incubated with TF of interest๏ elution to separate bound from unbound
targets๏ PCR (=enrichment)๏ next cycle
• in-vitro / medium throughput
• increasing number of cycles leads to over-select high affinity binding sites
[ A.Stenlund The EMBO Journal (2003) ]
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
DNA transcription factor interactions
Sequence Si
Ki
=kon
koff
=[TF · S
i
]
[TF ][Si
]
TF + S TF∙Skon
koff
P (s = 1|Si) =[TF · Si]
[TF · Si] + [Si]=
1
1 + eEi�µ
Equilibrium constant
Binding probability
Ei = � lnKi
µ = ln[TF ]
free energy
chemical potential
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
DNA transcription factor interactions
• How many parameters are needed to describe the DNA-protein interaction ?
Naive approach
• if TF binds regions of size w, then if one Ki per possible sequence
• 4w parameters
• w = 6 : 4096 parameters
• w = 8 : 65536 parameters
[Berg, von Hippel (1987)][Stormo, Bioinformatics (2000)]
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
DNA transcription factor interactions
• How many parameters are needed to describe the DNA-protein interaction ?
Naive approach
• if TF binds regions of size w, then if one Ki per possible sequence
• 4w parameters
• w = 6 : 4096 parameters
• w = 8 : 65536 parameters
Model :
• Assume independent contribution of each position of Si to the binding energy
• contribution proportional to log(freq at position j)→ 3w parameters
• w = 6 : 18 parameters
• w = 8 : 24 parameters
[Berg, von Hippel (1987)][Stormo, Bioinformatics (2000)]
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Characterizing binding affinities
• Example :๏ Evi-1 mouse transcription factor๏ 14 binding sites obtained using
SELEX
• Binding sites of transcription factors are not fixed strings !
GGACAAGATAAAGACAAGATAGAGACAAGATAGGGACAAGATAGTGACAAGATCACGACAAGACAAATACAAGACAATGATAAGATAAAGATAAGATAATGATAAGATAAAGATAAGATAAAGATAAGATAAAGATAAGATAAAGATAAGACAA
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Characterizing binding affinities
• Example :๏ Evi-1 mouse transcription factor๏ 14 binding sites obtained using
SELEX
• Binding sites of transcription factors are not fixed strings !
GGACAAGATAAAGACAAGATAGAGACAAGATAGGGACAAGATAGTGACAAGATCACGACAAGACAAATACAAGACAATGATAAGATAAAGATAAGATAATGATAAGATAAAGATAAGATAAAGATAAGATAAAGATAAGATAAAGATAAGACAA
Some positions are very degenerate
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Characterizing binding affinities
• Example :๏ Evi-1 mouse transcription factor๏ 14 binding sites obtained using
SELEX
• Binding sites of transcription factors are not fixed strings !
GGACAAGATAAAGACAAGATAGAGACAAGATAGGGACAAGATAGTGACAAGATCACGACAAGACAAATACAAGACAATGATAAGATAAAGATAAGATAATGATAAGATAAAGATAAGATAAAGATAAGATAAAGATAAGATAAAGATAAGACAA
Some positions are very degenerate
Some positions are well conserved
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Characterizing binding affinities
• Solution 1:represent binding sites as consensus sequence
• IUPAC conventions ๏ W = A or T ; R = A or G (puRines)๏ K = G or T ; S = C or G๏ Y = C or T (pYrimidine) ; M = A or C๏ B = C, G or T ; D = A,G or T๏ H = A,C or T ; V = A,C or G๏ N = any nucleotide
• Transfac conventions ๏ for a given position, order nucleotides by
frequency : f1>f2>f3>f4๏ single nucleotide if f1 > 50% AND f1>2*f2๏ double if f1 + f2 > 75% AND f1<50 %;
f2<50 %๏ triple if one nucleotide never appears and
previous rules don't apply๏ N in all other cases
GGACAAGATAAAGACAAGATAGAGACAAGATAGGGACAAGATAGTGACAAGATCACGACAAGACAAATACAAGACAATGATAAGATAAAGATAAGATAATGATAAGATAAAGATAAGATAAAGATAAGATAAAGATAAGATAAAGATAAGACAA
How can we represent the binding sites ?
AGAyAAGATAA
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Characterizing binding affinities
• Solution 1:represent binding sites as consensus sequence
• Weakness 1:
๏ column 1 has 8/14 A’s๏ column 10 has 14/14 A’s๏ both are represented by A
• Weakness 2: ๏ other nucleotides in column 1 are
not taken into account
GGACAAGATAAAGACAAGATAGAGACAAGATAGGGACAAGATAGTGACAAGATCACGACAAGACAAATACAAGACAATGATAAGATAAAGATAAGATAATGATAAGATAAAGATAAGATAAAGATAAGATAAAGATAAGATAAAGATAAGACAA
How can we represent the binding sites ?
AGAyAAGATAA
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Characterizing binding affinities
• Solution 2 :
• count frequencies of nucleotides at each position
• obtain a count matrix
• represent the matrix as a logo
G G A C A A G A T A A A G A C A A G A T A GA G A C A A G A T A GG G A C A A G A T A GT G A C A A G A T C AC G A C A A G A C A AA T A C A A G A C A AT G A T A A G A T A AA G A T A A G A T A AT G A T A A G A T A AA G A T A A G A T A AA G A T A A G A T A AA G A T A A G A T A AA G A T A A G A C A A
a| 8 0 14 0 14 14 0 14 0 13 11c| 1 0 0 7 0 0 0 0 3 1 0g| 2 13 0 0 0 0 14 0 0 0 3t| 3 1 0 7 0 0 0 0 11 0 0
How can we represent the binding sites ?
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Characterizing binding affinities
• Solution 2 :
• count frequencies of nucleotides at each position
• obtain a count matrix
• represent the matrix as a logo
G G A C A A G A T A A A G A C A A G A T A GA G A C A A G A T A GG G A C A A G A T A GT G A C A A G A T C AC G A C A A G A C A AA T A C A A G A C A AT G A T A A G A T A AA G A T A A G A T A AT G A T A A G A T A AA G A T A A G A T A AA G A T A A G A T A AA G A T A A G A T A AA G A T A A G A C A A
a| 8 0 14 0 14 14 0 14 0 13 11c| 1 0 0 7 0 0 0 0 3 1 0g| 2 13 0 0 0 0 14 0 0 0 3t| 3 1 0 7 0 0 0 0 11 0 0
How can we represent the binding sites ?
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Characterizing binding affinities
• Solution 2 :
• count frequencies of nucleotides at each position
• obtain a count matrix
• represent the matrix as a logo
G G A C A A G A T A A A G A C A A G A T A GA G A C A A G A T A GG G A C A A G A T A GT G A C A A G A T C AC G A C A A G A C A AA T A C A A G A C A AT G A T A A G A T A AA G A T A A G A T A AT G A T A A G A T A AA G A T A A G A T A AA G A T A A G A T A AA G A T A A G A T A AA G A T A A G A C A A
a| 8 0 14 0 14 14 0 14 0 13 11c| 1 0 0 7 0 0 0 0 3 1 0g| 2 13 0 0 0 0 14 0 0 0 3t| 3 1 0 7 0 0 0 0 11 0 0
Difference beween position 1 and 10 is now obvious !
How can we represent the binding sites ?
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Characterizing binding affinities
• Solution 2 :
• count frequencies of nucleotides at each position
• normalize to obtain position frequency matrix (PFM)
G G A C A A G A T A A A G A C A A G A T A GA G A C A A G A T A GG G A C A A G A T A GT G A C A A G A T C AC G A C A A G A C A AA T A C A A G A C A AT G A T A A G A T A AA G A T A A G A T A AT G A T A A G A T A AA G A T A A G A T A AA G A T A A G A T A AA G A T A A G A T A AA G A T A A G A C A A
a| 0.57 0.00 1.00 0.00 1.00 1.00 0.00 1.00 0.00 0.93 0.79c| 0.07 0.00 0.00 0.50 0.00 0.00 0.00 0.00 0.21 0.07 0.00g| 0.14 0.93 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.21t| 0.21 0.07 0.00 0.50 0.00 0.00 0.00 0.00 0.79 0.00 0.00
How can we represent the binding sites ?
fi,j
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Characterizing binding affinities
• Beware of sampling effects !
• Two different sets of BS for the same TF will give different PFM
• Example : Evi-1๏ set 1 : 14 BS (selex)๏ set 2 : 47 BS (selex)
a| 0.57 0.00 1.00 0.00 1.00 1.00 0.00 1.00 0.00 0.93 0.79c| 0.07 0.00 0.00 0.50 0.00 0.00 0.00 0.00 0.21 0.07 0.00g| 0.14 0.93 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.21t| 0.21 0.07 0.00 0.50 0.00 0.00 0.00 0.00 0.79 0.00 0.00
a| 0.55 0.74 0.00 1.00 0.02 1.00 0.98 0.00 1.00 0.00 0.87 0.85 0.23 0.64c| 0.09 0.04 0.02 0.00 0.43 0.00 0.00 0.00 0.00 0.13 0.04 0.00 0.23 0.21g| 0.19 0.09 0.94 0.00 0.00 0.00 0.02 1.00 0.00 0.00 0.00 0.15 0.30 0.06t| 0.17 0.13 0.04 0.00 0.55 0.00 0.00 0.00 0.00 0.87 0.09 0.00 0.23 0.09
Increasing the sample of binding sites makes some frequencies become > 0 What if we would further increase the dataset ?
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Characterizing binding affinities
• Sampling effects are attenuated by adding pseudo-counts
• at each position, a pseudo-count (usually 1) is distributed among the nucleotides๏ equal amount (0.25)๏ OR : fraction proportional to nucleotide prior frequency
a| 8 0 14 0 14 14 0 14 0 13 11c| 1 0 0 7 0 0 0 0 3 1 0g| 2 13 0 0 0 0 14 0 0 0 3t| 3 1 0 7 0 0 0 0 11 0 0
a| 8.2 0.2 14.2 0.2 14.2 14.2 0.2 14.2 0.2 13.2 11.2c| 1.3 0.3 0.3 7.3 0.3 0.3 0.3 0.3 3.3 1.3 0.3g| 2.3 13.3 0.3 0.3 0.3 0.3 14.3 0.3 0.3 0.3 3.3t| 3.2 1.2 0.2 7.2 0.2 0.2 0.2 0.2 11.2 0.2 0.2
a| 0.55 0.02 0.94 0.02 0.95 0.95 0.02 0.95 0.02 0.88 0.75c| 0.08 0.02 0.02 0.48 0.02 0.02 0.02 0.02 0.22 0.08 0.02g| 0.15 0.88 0.02 0.02 0.02 0.02 0.95 0.02 0.02 0.02 0.22t| 0.22 0.08 0.02 0.48 0.02 0.02 0.02 0.02 0.75 0.02 0.02
add 1 using priors a:t = 0.2 ; c:g = 0.3
normalize to 1 by dividing by 14+1
ni,j
fi,j
f 0i,j
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Polls and pseudo-counts
Polls
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Polls and pseudo-counts
Polls
Final results
They should have added pseudo counts !
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Characterizing binding affinities
• why is it so important to get rid of 0 frequencies ?
• compute the probability of a sequence to represent a binding site for Evi-1
a| 0.57 0.00 1.00 0.00 1.00 1.00 0.00 1.00 0.00 0.93 0.79c| 0.07 0.00 0.00 0.50 0.00 0.00 0.00 0.00 0.21 0.07 0.00g| 0.14 0.93 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.21t| 0.21 0.07 0.00 0.50 0.00 0.00 0.00 0.00 0.79 0.00 0.00
C C A C A A G A C A AP = 0.07 * 0 * 1 * 0.5 * 1 * 1 * 1 * 1 * 0.21 * 0.93 * 0.79 = 0
a| 0.55 0.02 0.94 0.02 0.95 0.95 0.02 0.95 0.02 0.88 0.75c| 0.08 0.02 0.02 0.48 0.02 0.02 0.02 0.02 0.22 0.08 0.02g| 0.15 0.88 0.02 0.02 0.02 0.02 0.95 0.02 0.02 0.02 0.22t| 0.22 0.08 0.02 0.48 0.02 0.02 0.02 0.02 0.75 0.02 0.02
C C A C A A G A C A A
P = 0.08* 0.02* 0.94*0.48* 0.95* 0.95* 0.95*0.95*0.22 *0.88*0.75 = 8.6e-5
This sequence can NEVER be a BS → this sequence has non-0 probability of being a BS
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Generating random sequences
• We often need to bioinformatically generate sequences according a a specific model (“rolling a dice”)
• this model gives the emission probabilities, i.e. the probability P(X,j) at each position j of the sequence to obtain one of the four bases A,C,G,T
• possible models๏ equiprobable model: P(A)=P(C)=P(G)=P(T)=0.25
๏ more realistic model: P(A)=P(T)=pA,T ; P(C)=P(G)=pC,G
๏ position dependent model: P(A,j) = fA,j
A C
A C
A C A C A C
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Information content of a matrix
Would I get the same PFM if I would consider 14 random sequences?
• Generate 14 sequences of length L=11 according to the emission probabilities of the PFM; probability of obtaining exactly the same count matrix:
• Generate 14 random sequences on length L=11; probability of obtaining exactly the same count matrix:
G G A C A A G A T A A A G A C A A G A T A GA G A C A A G A T A GG G A C A A G A T A GT G A C A A G A T C AC G A C A A G A C A AA T A C A A G A C A AT G A T A A G A T A AA G A T A A G A T A AT G A T A A G A T A AA G A T A A G A T A AA G A T A A G A T A AA G A T A A G A T A AA G A T A A G A C A A
a| 8 0 14 0 14 14 0 14 0 13 11c| 1 0 0 7 0 0 0 0 3 1 0g| 2 13 0 0 0 0 14 0 0 0 3t| 3 1 0 7 0 0 0 0 11 0 0
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Information content of a matrix
• Log-likelihood ratio
• Information content (bits)
• Properties๏ if f’ij → pi : IC → 0 : zero information content if the frequencies in the
alignment are equal to the background frequencies ("random matrix”)๏ IC is maximal if the least abundant nucleotide (smallest pi) is the most
represented in the matrix (largest f'ij)๏ information content is related to the Shannon entropy, mesuring the
uncertainty : high information content → low Shannon entropy
[Hertz, Stormo 1999]
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Matrix logos
• Information content of the matrix:
• Information content of a column:
• Conventions:
• height of column represents IC
• relative sizes proportional to frequencies
same frequency (100%)but different IC
Mouse backgroundp(C)=p(G)=0.21 ; p(A) = p(T) = 0.29
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Matrix logos
a| 8 0 14 0 14 14 0 14 0 13 11c| 1 0 0 7 0 0 0 0 3 1 0g| 2 13 0 0 0 0 14 0 0 0 3t| 3 1 0 7 0 0 0 0 11 0 0
P. falciparum backgroundp(C)=p(G)=0.1 ; p(A) = p(T) = 0.4
Anaeromyxobacter backgroundp(C)=p(G)=0.37 ; p(A) = p(T) = 0.13
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
From frequencies (PFM) to weights (PWM)
• position frequency matrices do not contain information about the background distribution
• define position-weight matrices (PWM) as(or Position Specific Scoring Matrix PSSM)
a| 8 0 14 0 14 14 0 14 0 13 11c| 1 0 0 7 0 0 0 0 3 1 0g| 2 13 0 0 0 0 14 0 0 0 3t| 3 1 0 7 0 0 0 0 11 0 0
a| 0.55 0.02 0.94 0.02 0.95 0.95 0.02 0.95 0.02 0.88 0.75c| 0.08 0.02 0.02 0.48 0.02 0.02 0.02 0.02 0.22 0.08 0.02g| 0.15 0.88 0.02 0.02 0.02 0.02 0.95 0.02 0.02 0.02 0.22t| 0.22 0.08 0.02 0.48 0.02 0.02 0.02 0.02 0.75 0.02 0.02
a| 0.75 -2.71 1.30 -2.71 1.30 1.30 -2.71 1.30 -2.71 1.22 1.06c| -1.04 -2.71 -2.71 0.73 -2.71 -2.71 -2.71 -2.71 -0.07 -1.04 -2.71g| -0.46 1.31 -2.71 -2.71 -2.71 -2.71 1.39 -2.71 -2.71 -2.71 -0.10t| -0.22 -1.16 -2.71 0.58 -2.71 -2.71 -2.71 -2.71 1.02 -2.71 -2.71
Count matrix
PFM
PWM
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Motif databases
• JASPAR is an open database of binding profiles
• contains 1564 profiles (2018 version)
http://jaspar.genereg.net/
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Motif databases
• JASPAR is an open database of binding profiles
• contains 1564 profiles (2018 version)
CarlHerrmann UniversitätHeidelberg&DKFZ WS2017/2018
Different sources
ChIP-seq: real bindingsite is hidden in muchlonger sequence→ lower resolution