ERROR, BIAS, PROBLEMS AND PITFALLS IN EPIGENETIC EPIDEMIOLOGY
DR JONATHAN MILL
PSYCHIATRIC EPIGENETICS GROUP
MRC SGDP CENTRE
INSTITUTE OF PSYCHIATRY
KING’S COLLEGE LONDON
www.epigenomicslab.com
Newcastle, 2012
1997: first year with >100 publications
2010: >2,000 publications
2011: almost 2,500 publications
But is there more “interest” than “information”?
An exponential increase in published epigenetics research…
[Figure labels: Reviews; Research articles; Cancer; Neuroscience / mental health; Stem cells]
“…Roll over, Mendel. Watson and Crick? They are so your old man's version of DNA…”
“…The integration of standard massage practices and the knowledge of the biology of adversity will change our minds, our physiology, our epigenetics—and hence our massage practice…”
AGTGCCTCAGCCTCCCTAGTAGCTGGGATTACAGGTGCCCTCCACAATGCCCAGCTAATTT
TTGTGTTTTTAGTAGACACAAGATTTCACTATGTTGCCCAGGCTGGTCTCAACCCCTGACC
TCAAGTGATCCACCTGCCTCAGTCTCCCAGAGTGCTGGGACTGCAGGCGTGAGCCAACAAG
CCCAGGCCACGATGTCTTACTTTTCACCTAAAACCTGCCTAAATGGCATGCCCAGTTAAAA
CAATCTTTTTCTGTTACAATAATCCATGTAAGAGTATGACACATTTTCTGAAAGATTTGTC
TAAAAAAGAGCCTGGTATGTTTACTGTTGCTGCTGAATTGGATTTGACTCTGCTGCTGTAT
CAGGGCCCCTTCTGACAATTCACCTCTTGCTTCCTTTCCTGCTAATTGTCCTGTTGACTAC
TATTTTTTTTTTTTTTTGGTAACAGTGTCTGGCTCTGTCACCCAGCCTAGAGTGCAGTGGC
ACAATCTTGGCTCACTACAACCTCCATCTTCTGGGCTCAAGCTATTCTTCCACCTCAGCCT
CCCAAGTAGCTGAGACTACAGGCATGTGCCACCACACCCAGCTAGATTTTGTATTTTTTGT
AGAGACGGGGTCTTGTGATGTTGCCCAGGCTGGTCTTGAACTCCTGGGCTCAAAGCAATCC
GCCCGCCTCCGCCTCCCAAAGTGCTGAGATGACAGGCGTGAGCAACTGCGCCCAGCCTTGT
GTACTTCTTAGGGCTCTTTTACATGCCTTTCTTTTTTTAACAGCCTTCCCACCACTACCTT
TTACATGTCTTGAGATTTTCCTGTATGCATGTGTATGCGTGCACGTGCACGCACGCACACA
CACACACACACCTGATTTTGTCATTCTGGTGTTTAAAGCATATCATAGTCCTACTTCCAGA
AATACATCCAATGCAATGAACCTGGTAGCCAACACTGCTGAGAAATGACCCAAGGGTCTAC
CTTGAGTAGCCAGCCCCCAAATCCAAAGAATAGCTCCAGACCCCATAGTTTTCTCACCCAC
TAGGTCATGGGACCATGGCAAGAGTGAGAGAGTTCCACTTCCCAGAGGATGCCTGTTATTA
CCTTACCTCAATTTGAAATCTGTACTAAGGTTGAACACATGCATTCTCCTCCTTGACCTCC
ACATCCCCTGTTGTTTCCTTTTTTTGTTGTTTTTGTTTTTTGTTTTTGTTTTGAGACAGAG
TCTCGCTCTGTCGCCCAGGCTGGAGTGCAGTGGCACGATCTCGGCTCACTGCAGTCTCTGC
CTCCCGGGCTCAAGCAATTCTCCTGCCTCAGCCTCCTGAGTAACTGGGATTACAGGTGTGT
GCCACCACGCCCGGCTGATTTTTTGTACTTTTAGTAGAGACGGGGTTTCACCATGTTGGCC
– 1 body, 1 genome: a blood sample is all you need!
– 1 life, 1 genome: you are born with the genome you die with
– Any lifestyle, 1 genome: it doesn’t matter what you’re exposed to
– Any disease, 1 genome: no reverse causation
– A nicely annotated reference genome and catalogue of SNPs is freely available
– Methods that do as they say on the box and give results that are easy to interpret
Some of the (many) issues in epigenetic epidemiology
• Technical / methodological
• Sample related issues
• Study design
• Analysis and interpretation
• Over-interpretation
• Over-simplification
• Biological Significance?
• Confounding factors
• Cause vs effect
What is a ‘normal’ brain methylome?
How do different disease-relevant regions of the brain differ epigenetically?
Are there marked epigenetic differences between the major cell-types in the brain?
Can peripheral tissues be used as a ‘proxy’ for the brain in epigenetic epidemiology?
What is the location of functionally-relevant DMRs?
How is the brain methylome influenced by factors such as age, sex and medication?
Biological, Technological, and Methodological issues
1. We do not know where in the genome to look and what to look for
2. We have to rely on imperfect technology
3. We may be limited by available sample sizes that are optimal for epigenetic epidemiology
4. Whatever we do, it may never be enough to fully account for epigenetic differences between tissues and cells
5. We may be trying to detect inherently small effect sizes using sub-optimal methods and sample cohorts
6. We lack a framework for the analysis of genome-wide epigenetic data
7. We have to manage high expectations
IJE, 2012
Most tissue-variable CGI DMRs: enrichment for intragenic CGIs
[Figure: observed (OBS) vs expected (EXP) counts, y-axis 0 to 1000, by location: Promoter, Intragenic, 3'UTR, Intergenic. Χ2 p = 1E-246; annotated ratios O/E=0.09 and O/E=2.37]
Davies et al (in press)
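The observed/expected (O/E) ratios and the chi-square statistic quoted for the enrichment figure can be sketched as follows. This is a minimal illustration only: the counts below are hypothetical placeholders, not the values behind the slide, and the function names are invented for the example.

```python
# Hypothetical counts for illustration only; NOT the data from the slide.
categories = ["Promoter", "Intragenic", "3'UTR", "Intergenic"]
observed = [50, 700, 50, 200]    # assumed number of DMRs seen per category
expected = [500, 300, 40, 160]   # assumed number expected by chance


def observed_over_expected(obs, exp):
    """Return the O/E ratio for each category."""
    return [o / e for o, e in zip(obs, exp)]


def chi_square_statistic(obs, exp):
    """Pearson chi-square goodness-of-fit statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))


ratios = observed_over_expected(observed, expected)
stat = chi_square_statistic(observed, expected)
for name, ratio in zip(categories, ratios):
    print(f"{name}: O/E = {ratio:.2f}")
print(f"chi-square = {stat:.1f} on {len(observed) - 1} degrees of freedom")
```

An O/E below 1 indicates depletion and above 1 indicates enrichment relative to the chance expectation; the p-value follows from comparing the statistic with a chi-square distribution on (number of categories - 1) degrees of freedom.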
And there’s more to epigenetic gene regulation than DNA methylation!
Zhou et al, Nat Rev Genet, 2011
How many DNA modifications are there??
5-hydroxymethylcytosine (5hmC)
5-formylcytosine (5fC)
5-carboxylcytosine (5caC)
5hmC appears to be particularly important in ES cell differentiation and the CNS and is implicated in postnatal neurodevelopment and aging
Traditional bisulfite-based methods do not distinguish between 5-mC and 5-hmC
CMS (the product of bisulfite conversion of 5-hmC) tends to stall DNA polymerases during PCR - densely hydroxymethylated regions of DNA may be underrepresented in quantitative methylation analyses
Existing 5-mC data sets may require re-evaluation in the context of the possible presence of 5-hmC
Antibodies against 5-mC and 5-hmC can pull out fragments enriched for each mark separately – but can’t quantify at base-pair resolution
Oxidative bisulfite-sequencing (ox-BS-seq) (Booth et al, Science, 2012)
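Because standard bisulfite conversion reads 5-mC and 5-hmC together while oxidative bisulfite reads 5-mC alone, the 5-hmC level at a site can be estimated by subtraction. The sketch below assumes paired per-site methylation fractions from the two assays; the function name and the choice to clip negative differences to zero are illustrative assumptions, not part of any published pipeline.

```python
def estimate_5mc_5hmc(bs_level, oxbs_level):
    """Split a bisulfite signal into 5-mC and 5-hmC estimates.

    bs_level:   fraction called methylated by standard bisulfite (5-mC + 5-hmC)
    oxbs_level: fraction called methylated by oxidative bisulfite (5-mC only)
    """
    for value in (bs_level, oxbs_level):
        if not 0.0 <= value <= 1.0:
            raise ValueError("methylation levels must be fractions in [0, 1]")
    five_mc = oxbs_level
    # Measurement noise can make the difference negative; clipping to zero
    # is a simplifying assumption made for this sketch.
    five_hmc = max(0.0, bs_level - oxbs_level)
    return five_mc, five_hmc


five_mc, five_hmc = estimate_5mc_5hmc(0.80, 0.50)
print(f"5-mC = {five_mc:.2f}, 5-hmC = {five_hmc:.2f}")
```

In practice the difference of two noisy measurements is itself noisy, so small 5-hmC estimates at low read depth should be treated with caution.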
Huang et al, PLoS ONE, 2010
The relationship between DNA methylation and gene
expression is not necessarily straightforward…
(SAM)
Beyond transcriptional silencing:
functions of DNA methylation
• Chromatin compaction
• Genome stability
• Suppression of homologous recombination between repeats
• Genome defense against retroviruses
• Genetic recombination & DNA mutability
• X-chromosome inactivation (in females)
• Genomic imprinting
41 (82%) out of the top 50 ranked cerebellum-cortex DMRs are mirrored by significant gene expression differences
[Figure: DNA methylation and log2 expression in cerebellum vs frontal cortex for EOMES and GRM4; p = 1.90E-33 and p = 1.39E-34]
2. We have to rely upon imperfect technology
• Measuring DNA methylation is not like measuring genotype
• Genome-scale assessment now feasible
• Compromise between coverage and precision
• Huge range of methods (enrichment, measurement and analysis)
• No consensus on analysis method
There is a huge number of methylomic profiling methodologies
Laird et al, 2010
Sensitivity
Reproducibility
Cost
Sample requirements
Coverage & Throughput
High correlation across genome-wide platforms…
But this is mainly driven by the fact that the large majority of the genome is either unmethylated or fully methylated.
Substantial discrepancies between platforms may exist for intermediate-level methylation.
Can these genome-based approaches detect a small % change?
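A small simulation can illustrate why genome-wide correlation between platforms looks high even when agreement at intermediately methylated sites is modest. Everything here is an assumption made for the sketch: the mixture proportions (45% unmethylated, 45% fully methylated, 10% intermediate), the noise level, and the function names.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def platform_correlations(n_sites=5000, noise_sd=0.1, seed=1):
    """Simulate two noisy platforms measuring the same bimodal methylome.

    Returns (correlation over all sites, correlation over intermediate sites).
    """
    rng = random.Random(seed)
    truth = []
    for _ in range(n_sites):
        u = rng.random()
        if u < 0.45:
            truth.append(0.02)                   # unmethylated
        elif u < 0.90:
            truth.append(0.98)                   # fully methylated
        else:
            truth.append(rng.uniform(0.3, 0.7))  # intermediate

    def measure(value):
        # Add platform noise and keep the result inside [0, 1].
        return min(1.0, max(0.0, value + rng.gauss(0.0, noise_sd)))

    platform_a = [measure(t) for t in truth]
    platform_b = [measure(t) for t in truth]
    mid = [i for i, t in enumerate(truth) if 0.3 <= t <= 0.7]
    r_all = pearson(platform_a, platform_b)
    r_mid = pearson([platform_a[i] for i in mid], [platform_b[i] for i in mid])
    return r_all, r_mid


r_all, r_mid = platform_correlations()
print(f"all sites: r = {r_all:.2f}; intermediate sites only: r = {r_mid:.2f}")
```

With identical noise on every site, the overall correlation is dominated by the gap between the two extreme clusters, while the intermediate sites, where small disease-related differences are most likely to sit, agree far less well.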
Many stages when inaccuracies in measuring DNA methylation may occur
• Error (variation) can be introduced during
– Tissue / cell processing
– DNA preparation / storage
– Enrichment / conversion
– Amplification
– Measurement
– QC
– Analysis
• Accurate measurement may be vital if DNA methylation is to be used for clinical (i.e. diagnostic or prognostic) purposes
• But is ‘accuracy’ vital for epidemiological studies?
• Consistency / reliability more important?
Some inconvenient truths
• Nothing you can do with normalization replaces careful experimental design
• A nuisance variable confounded with what you want to test can’t be fixed, e.g. case and control in different batches
• Nothing you can do with normalization replaces rigorous QC
– samples with unusual raw intensity distributions
– multivariate methods such as PCA often identify mislabeled samples
• Problems with outsourcing and core facilities
• Bad data will always be bad data!!
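The point about cases and controls run in different batches can be checked before any array is processed. The sketch below cross-tabulates disease status against batch and flags a fully confounded design; the sample sheet layout and function names are assumptions made for illustration.

```python
from collections import Counter


def crosstab(samples):
    """Count samples per (status, batch) pair.

    samples: list of (status, batch) tuples, e.g. ("case", "plate1").
    """
    return Counter(samples)


def fully_confounded(samples):
    """True if every batch holds a single status while the study has several.

    In that situation batch and status cannot be separated statistically.
    """
    statuses_by_batch = {}
    for status, batch in samples:
        statuses_by_batch.setdefault(batch, set()).add(status)
    all_statuses = set().union(*statuses_by_batch.values())
    one_status_per_batch = all(len(s) == 1 for s in statuses_by_batch.values())
    return len(all_statuses) > 1 and one_status_per_batch


# Hypothetical sample sheets for illustration.
bad_design = [("case", "plate1")] * 4 + [("control", "plate2")] * 4
good_design = [("case", "plate1"), ("control", "plate1"),
               ("case", "plate2"), ("control", "plate2")]

print(crosstab(bad_design))
print("bad design confounded:", fully_confounded(bad_design))
print("good design confounded:", fully_confounded(good_design))
```

A balanced or randomised layout (each batch containing both cases and controls) is what makes later batch correction possible at all; partial imbalance is not flagged by this simple check and would need a proper test of association.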
The Illumina 450K array – some problems…
• Normalisation issues
– Type 1 and 2 probes
– Batch effects
– Array position
• Colour correction
• SNPs on probes
• Cross-hybridising probes
– Sex chromosomes very obvious
No method is perfect…
• Bisulfite-based methods: PCR biases, bisulfite / PCR batch effects, hydroxymethylation
• Affinity-based methods – CG density, IP efficiency / non-specificity, resolution, quantification, CNV confounds
• MSRE-based methods – limited resolution, SNPs in MSRE sites
• And it’s not all bad news especially for bigger effects.
• For smaller differences – does non-verification negate a true difference?
• Data integration and meta-analysis across studies
MeDIP-seq vs 450K Illumina data
[Figure: two bar charts, y-axis 0 to 0.9, comparing Cerebellum, Frontal Cortex and Whole Blood; P = 6.73E-104]
Illumina 450K replication of MeDIP-seq identified DMRs [but NB 20% of TS-DMRs not covered with any probes, and many only by single probes]
• Low-frequency DNA methylation states may not be optimally detected by genome-wide (sequencing-based) approaches because sensitivity is a function of read-depth
• They may also not be detected via site-specific approaches such as Pyrosequencing or Sequenom EpiTYPER (sensitivity 2-5%)
• Ultra-deep bisulfite-sequencing of targeted regions?
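The dependence of sensitivity on read depth can be made concrete with a binomial model. This is a minimal sketch that assumes each read independently samples a methylated molecule with probability equal to the true methylation fraction, and it ignores sequencing and bisulfite conversion errors; the function name and the example depths are illustrative.

```python
from math import comb


def prob_detect(true_fraction, depth, min_reads=1):
    """Probability of seeing at least `min_reads` methylated reads.

    true_fraction: true proportion of methylated molecules at the site
    depth:         number of reads covering the site
    min_reads:     methylated reads required to call the state present
    """
    p_fewer = sum(
        comb(depth, k) * true_fraction ** k * (1 - true_fraction) ** (depth - k)
        for k in range(min_reads)
    )
    return 1.0 - p_fewer


# A methylation state present in 2% of molecules, at several read depths.
for depth in (10, 30, 100, 1000):
    print(f"depth {depth:>4}: P(at least 3 methylated reads) = "
          f"{prob_detect(0.02, depth, min_reads=3):.3f}")
```

Requiring several supporting reads (rather than one) guards against calling noise, which is why rare states effectively demand very deep, targeted sequencing.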
Things to look for when interpreting published data…
• What enrichment/discrimination method has been used?
• Have the relevant controls been used?
• Have reactions been done in replicate (especially sodium bisulfite conversion, bisulfite PCR)?
• What magnitude of difference is being reported?
• Do the differences reported exceed the sensitivity of the platform?
F2RL3 encodes a protein that has functions which are relevant to cardiovascular disease
Implications for confounding effects in epigenetic epidemiological analyses
• Limited number of samples – overlap between studies?
• Pre-mortem factors and pH
• Cause of death
• Peri-mortem factors
• Post-mortem factors
Sample-related technical issues…
After a typical group lunch in the Mill lab…
DNA preparation (phenol/chloroform, columns, etc)
DNA storage (TE, Te, water, -80, -20, 4)
Important to keep these consistent across samples (otherwise they can confound case-control differences, longitudinal changes, etc)
26,320 years ago….
“…Our results suggest that as long as ancient nuclear DNA remains amplifiable, cytosine methylation patterns can be assessed…methylation has been faithfully retained along with the DNA over evolutionary timescales…”
Llamas et al, PLoS ONE, 2012
3. We may be limited by available sample sizes that are optimal for epigenetic epidemiology
The simple brute-force approach that has been used (relatively) successfully in GWAS is not valid for EWAS
Simply running Illumina 450K arrays on your GWAS samples is potentially a waste of ££
Discordant Monozygotic Twins – a powerful tool for epigenetic studies of complex disease
Control for: age, sex, genetics, pre-/peri-natal environment, parental origin
Verification of array data
Replication in brain tissue
Significant hypomethylation (20-30%) in 15% of SZ brains
Dempster et al (2011)
DRD4
SERT
MAOA
Wong et al 2010
Virtually Identical Large Changes
Discordance for
disease phenotypes??
Longitudinal sampling…
Most existing cohorts were not designed with epigenetics in mind
Do sequential samples (of relevant tissues/cells) exist in ongoing longitudinal cohorts / biobanks?
DNA methylation at three CpG sites (in the promoters of the EDARADD, TOM1L1 and NPTX2 genes) is linear with age over a range of five decades. A regression model based on these sites explains 73% of the variance in age and can predict the age of an individual with an average accuracy of 5.2 years!
Bell et al, PLoS Genetics, 2012
Ruth Pidsley (unpublished)
Age is a huge potential confounder in epigenetic studies
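The age-prediction result quoted above rests on ordinary least-squares regression of age on methylation. The sketch below is deliberately simplified to a single CpG site (the reported model uses three), and the methylation values and ages are invented for illustration; it shows how "variance explained" (R²) and average prediction error are computed.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(x, y, slope, intercept):
    """Proportion of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot


def mean_abs_error(x, y, slope, intercept):
    """Average absolute difference between predicted and true y."""
    return sum(abs(b - (intercept + slope * a)) for a, b in zip(x, y)) / len(y)


# Hypothetical data: methylation fraction at one age-associated CpG, and age.
methylation = [0.20, 0.27, 0.33, 0.41, 0.44, 0.55, 0.58, 0.69]
age_years = [22, 30, 31, 45, 41, 58, 55, 70]

slope, intercept = fit_line(methylation, age_years)
print(f"R^2 = {r_squared(methylation, age_years, slope, intercept):.2f}")
print(f"mean absolute error = "
      f"{mean_abs_error(methylation, age_years, slope, intercept):.1f} years")
```

The same relationship that makes methylation a good age predictor is what makes age a confounder: groups that differ in mean age will differ in methylation at such sites regardless of disease status, so age should be matched or included as a covariate.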
4. Whatever we do, it may never be enough to fully account for epigenetic differences between tissues and cells
Machine learning 100% tissue discrimination
HOXA gene cluster
Blood
BA9
BA10
BA8
EntCtx
STG
Cerebellum
5. We may be trying to detect inherently small effect sizes using sub-optimal methods and sample cohorts
Do the small DNA methylation differences often observed between groups translate into differences in gene expression in the relevant tissue??
Gene        P-value       Adjusted P-value   Methylation difference
HOXB8       4.10E-06      0.08               -0.02
NKX2-5      2.54E-05      0.26               -0.03
C17orf100   8.10E-05      0.53               -0.01
SGCE        0.000114176   0.53               -0.01
C4orf38     0.0001863     0.53               -0.01
HAUS2       0.000211401   0.53               -0.01
LOC399815   0.000267029   0.53               0.04
LRP1B       0.000327942   0.53               0.04
FAM92B      0.000347768   0.53               0.02
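The table shows how nominally small p-values can become unremarkable once adjusted for the number of tests. The slide does not state which adjustment was used; the sketch below shows the Benjamini-Hochberg false discovery rate procedure as one common choice, applied to made-up p-values rather than the ones in the table (which cannot be reproduced without the full list of tests).

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Each p-value is scaled by (number of tests / rank), then a running
    minimum from the largest rank downwards enforces monotonicity.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical p-values for illustration.
raw = [0.01, 0.04, 0.03, 0.5]
for p, q in zip(raw, benjamini_hochberg(raw)):
    print(f"p = {p:<5} adjusted = {q:.4f}")
```

With hundreds of thousands of probes on an array, a raw p-value has to be several orders of magnitude smaller than 0.05 to survive this kind of correction, which is difficult when the methylation differences themselves are only 1-4%.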
1500bp upstream of transcription start site; P < 1x10^-8; mean methylation difference = -0.02
Confirming findings in cleaner model systems (e.g. cell and animal models)
Control for confounding factors and environmental influences
But a mouse is not a man, and a cell-line is not a body
6. We lack a framework for the analysis of genome-wide epigenetic data
• Reference epigenomes – across cells and tissues – what is normal?
• Cataloging regions of high inter-individual variation
• Integrating epigenomic data with genetics and other –omics information
Reference Epigenomes
Technology Development
Novel Epigenetic Marks
Epigenomics of Human Health & Disease
Neurodegeneration
Bipolar disorder
Schizophrenia
Autism
Atherosclerosis
Hypertension
SLE
Kidney disease
Asthma
Insulin Resistance
Where is the data: sites with unique features
Consortium homepage: http://roadmapepigenomics.org
• View data on genome
• Protocols
• Standards
NCBI: http://ncbi.nlm.nih.gov/epigenomics and http://ncbi.nlm.nih.gov/geo/roadmap/epigenomics
• View data
• Download data
• Compare samples
Human Epigenome Atlas: http://epigenomeatlas.org
• View data on genome or with Atlas gene browser
• Download data
• Tools at Genboree Workbench
WashU VizHub: http://vizhub.wustl.edu
• Next-gen browser: http://epigenomegateway.wustl.edu
• UCSC visualization hub at http://genome.ucsc.edu
What data are available to me?
Range of cells/tissues covered: currently 125 cell/tissue types represented, including:
• iPS and ES cells, some differentiated forms
• Fetal tissues (heart, brain, kidney, lung, others)
• Adult primary cells and tissues (hematopoietic, brain regions, breast cell types, liver, kidney, colon, muscle, adipocytes, others)
Most samples will have:
• DNA methylation data (RRBS, MRE-seq, MeDIP-seq, whole genome bisulfite seq)
• ChIP-seq data (currently H3K27me3, H3K36me3, H3K4me1, H3K4me3, H3K9me3)
• DNase I hypersensitivity data
• Gene expression data (arrays or RNA-seq)
Some samples will have:
• Expanded panel of histone modifications (currently 20+)
Can download:
• .wig, .bed, some .bam, SRA, peak calls
Two ways to browse: Data Table
Type search terms here to narrow list
Search isn’t literal (…type “lung”, “blood”)
Genome-wide epigenetic data at NCBI: Epigenomics Gateway
http://www.ncbi.nlm.nih.gov/epigenomics
Tutorials Text search or browser search
Compare samples
Compare Samples: Identify genes with significant epigenetic differences
GO Terms and pathways found most frequently
A new epigenomics browser: data and metadata together
http://epigenomegateway.wustl.edu
Click here for browser
Ernst et al, Nature 2011
Top SNPs linked to cell type specific enhancer states in disease relevant cell types
What can I do with the data: interpret GWAS hits
[Figure: significance plotted against genomic locus]
Data from ENCODE:
Human and mouse: http://genome.ucsc.edu/ENCODE
Fly and worm: http://modencode.org
7. Back to hype and bad science reporting – especially with regard to “transgenerational epigenetic inheritance”
We need to manage expectations
Questions?
Lecturer in Epigenetics
Bioinformatic approaches and computational epigenomics
Environmental epigenomics
Functional epigenomics
Postdoctoral Research Workers
Laboratory Technician
PhD students
Contact: [email protected]
www.epigenomicslab.com