ERROR, BIAS, PROBLEMS AND PITFALLS IN EPIGENETIC EPIDEMIOLOGY DR JONATHAN MILL PSYCHIATRIC EPIGENETICS GROUP MRC SGDP CENTRE INSTITUTE OF PSYCHIATRY KING’S COLLEGE LONDON [email protected] www.epigenomicslab.com Newcastle, 2012


1997: first year with >100 publications 2010: >2,000 publications 2011: almost 2,500 publications

But is there more “interest” than “information”?

An exponential increase in published epigenetics research…

[Figure: publication counts for reviews and research articles, with panels for cancer, neuroscience / mental health and stem cells]

“…The victory over the genes… Smarter, healthier, happier... How we can outwit our genome…”

“…Roll over, Mendel. Watson and Crick? They are so your old man's version of DNA…”

“…The integration of standard massage practices and the knowledge of the biology of adversity will change our minds, our physiology, our epigenetics—and hence our massage practice…”

Would you trust these guys with your money?

NATURE Vol 467|9 September 2010

Behavioural epigenetics is highly controversial

AGTGCCTCAGCCTCCCTAGTAGCTGGGATTACAGGTGCCCTCCACAATGCCCAGCTAATTT

TTGTGTTTTTAGTAGACACAAGATTTCACTATGTTGCCCAGGCTGGTCTCAACCCCTGACC

TCAAGTGATCCACCTGCCTCAGTCTCCCAGAGTGCTGGGACTGCAGGCGTGAGCCAACAAG

CCCAGGCCACGATGTCTTACTTTTCACCTAAAACCTGCCTAAATGGCATGCCCAGTTAAAA

CAATCTTTTTCTGTTACAATAATCCATGTAAGAGTATGACACATTTTCTGAAAGATTTGTC

TAAAAAAGAGCCTGGTATGTTTACTGTTGCTGCTGAATTGGATTTGACTCTGCTGCTGTAT

CAGGGCCCCTTCTGACAATTCACCTCTTGCTTCCTTTCCTGCTAATTGTCCTGTTGACTAC

TATTTTTTTTTTTTTTTGGTAACAGTGTCTGGCTCTGTCACCCAGCCTAGAGTGCAGTGGC

ACAATCTTGGCTCACTACAACCTCCATCTTCTGGGCTCAAGCTATTCTTCCACCTCAGCCT

CCCAAGTAGCTGAGACTACAGGCATGTGCCACCACACCCAGCTAGATTTTGTATTTTTTGT

AGAGACGGGGTCTTGTGATGTTGCCCAGGCTGGTCTTGAACTCCTGGGCTCAAAGCAATCC

GCCCGCCTCCGCCTCCCAAAGTGCTGAGATGACAGGCGTGAGCAACTGCGCCCAGCCTTGT

GTACTTCTTAGGGCTCTTTTACATGCCTTTCTTTTTTTAACAGCCTTCCCACCACTACCTT

TTACATGTCTTGAGATTTTCCTGTATGCATGTGTATGCGTGCACGTGCACGCACGCACACA

CACACACACACCTGATTTTGTCATTCTGGTGTTTAAAGCATATCATAGTCCTACTTCCAGA

AATACATCCAATGCAATGAACCTGGTAGCCAACACTGCTGAGAAATGACCCAAGGGTCTAC

CTTGAGTAGCCAGCCCCCAAATCCAAAGAATAGCTCCAGACCCCATAGTTTTCTCACCCAC

TAGGTCATGGGACCATGGCAAGAGTGAGAGAGTTCCACTTCCCAGAGGATGCCTGTTATTA

CCTTACCTCAATTTGAAATCTGTACTAAGGTTGAACACATGCATTCTCCTCCTTGACCTCC

ACATCCCCTGTTGTTTCCTTTTTTTGTTGTTTTTGTTTTTTGTTTTTGTTTTGAGACAGAG

TCTCGCTCTGTCGCCCAGGCTGGAGTGCAGTGGCACGATCTCGGCTCACTGCAGTCTCTGC

CTCCCGGGCTCAAGCAATTCTCCTGCCTCAGCCTCCTGAGTAACTGGGATTACAGGTGTGT

GCCACCACGCCCGGCTGATTTTTTGTACTTTTAGTAGAGACGGGGTTTCACCATGTTGGCC

– 1 body, 1 genome: a blood sample is all you need!

– 1 life, 1 genome: you are born with the genome you die with

– Any lifestyle, 1 genome: it doesn’t matter what you’re exposed to

– Any disease, 1 genome: no reverse causation

– A nicely annotated reference genome and catalogue of SNPs is freely available

– Methods that do as they say on the box and give results that are easy to interpret

Some of the (many) issues in epigenetic epidemiology

• Technical / methodological

• Sample related issues

• Study design

• Analysis and interpretation

• Over-interpretation

• Over-simplification

• Biological Significance?

• Confounding factors

• Cause vs effect

Schizophrenia Bipolar disorder

Autism Spectrum Disorders

Wong (in prep)

Alzheimer’s Disease

Schalkwyk et al (in prep)

QDP ITFG2

What is a ‘normal’ brain methylome?

How do different disease-relevant regions of the brain differ epigenetically?

Are there marked epigenetic differences between the major cell-types in the brain?

Can peripheral tissues be used as a ‘proxy’ for the brain in epigenetic epidemiology?

What is the location of functionally-relevant DMRs?

How is the brain methylome influenced by factors such as age, sex and medication?

Biological, Technological, and Methodological issues

1. We do not know where in the genome to look and what to look for

2. We have to rely on imperfect technology

3. We may be limited by available sample sizes that are optimal for epigenetic epidemiology

4. Whatever we do, it may never be enough to fully account for epigenetic differences between tissues and cells

5. We may be trying to detect inherently small effect sizes using sub-optimal methods and sample cohorts

6. We lack a framework for the analysis of genome-wide epigenetic data

7. We have to manage high expectations

IJE, 2012

1. We do not really know where to look, or what to look for

Promoter CpG islands!!!

Irizarry et al, Nature Genetics, 2009

CpG Island ‘Shores’

[Figure: observed (OBS) vs expected (EXP) counts, axis 0 to 1000, for promoter, intragenic, 3'UTR and intergenic CGIs]

Most tissue-variable CGI DMRs: enrichment for intragenic CGIs

χ² p = 1E-246

O/E=0.09

O/E=2.37
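An observed/expected (O/E) ratio and chi-square statistic of the kind quoted for the CGI categories could be computed roughly as follows. This is a minimal sketch: the counts are invented for illustration and are not the values behind the slide, and the function name `enrichment` is hypothetical.

```python
def enrichment(observed, background):
    """Return per-category O/E ratios and the Pearson chi-square statistic.

    observed:   category -> number of DMRs falling in that category
    background: category -> number of CGIs of that category that were tested
    """
    total_obs = sum(observed.values())
    total_bg = sum(background.values())
    ratios = {}
    chi2 = 0.0
    for category, obs in observed.items():
        # Expected count if DMRs were spread in proportion to the background
        exp = total_obs * background[category] / total_bg
        ratios[category] = obs / exp
        chi2 += (obs - exp) ** 2 / exp
    return ratios, chi2


# Hypothetical counts, for illustration only
observed = {"promoter": 10, "intragenic": 60, "3'UTR": 10, "intergenic": 20}
background = {"promoter": 500, "intragenic": 250, "3'UTR": 50, "intergenic": 200}

ratios, chi2 = enrichment(observed, background)
for category, ratio in ratios.items():
    print(f"{category}: O/E = {ratio:.2f}")
print(f"chi-square = {chi2:.1f} on {len(observed) - 1} degrees of freedom")
```

An O/E below 1 indicates depletion relative to the background and above 1 indicates enrichment; the p-value would then be read from a chi-square distribution with (number of categories - 1) degrees of freedom.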

Davies et al (in press)

HCP LCP

Davies et al (in press)

And there’s more to epigenetic gene regulation than DNA methylation!

Zhou et al, Nat Rev Genet, 2011

Traditional bisulfite-based methods do not distinguish between 5-mC and 5-hmC

CMS (the product of bisulfite conversion of 5-hmC) tends to stall DNA polymerases during PCR, so densely hydroxymethylated regions of DNA may be underrepresented in quantitative methylation analyses

Existing 5-mC data sets may require re-evaluation in the context of the possible presence of 5-hmC

Antibodies against 5-mC and 5-hmC can pull out fragments enriched for each mark separately – but can’t quantify at base-pair resolution

Oxidative bisulfite-sequencing (ox-BS-seq) (Booth et al, Science, 2012)
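Because standard bisulfite data report 5-mC and 5-hmC together while oxidative bisulfite data report 5-mC alone, a per-site 5-hmC estimate can be sketched as the difference between the two. The example below is illustrative only: the site names and methylation levels are invented, the function name is hypothetical, and clamping negative differences to zero is a simplifying assumption rather than a recommended statistical treatment.

```python
def estimate_5hmc(bs_level, oxbs_level):
    """Estimate the 5-hmC fraction at one CpG site.

    bs_level:   fraction called methylated by standard bisulfite (5-mC + 5-hmC)
    oxbs_level: fraction called methylated by oxidative bisulfite (5-mC only)
    """
    difference = bs_level - oxbs_level
    # Assumption: negative differences are measurement noise, reported as 0
    return max(difference, 0.0)


# Hypothetical per-site levels, for illustration only
sites = {
    "site_A": (0.80, 0.60),  # (BS, oxBS)
    "site_B": (0.35, 0.34),
    "site_C": (0.10, 0.12),  # oxBS > BS, treated as noise here
}

for name, (bs, oxbs) in sites.items():
    print(f"{name}: 5-mC = {oxbs:.2f}, 5-hmC = {estimate_5hmc(bs, oxbs):.2f}")
```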

Huang et al, PLoS ONE, 2010

Lunnon et al (in prep)

The relationship between DNA methylation and gene expression is not necessarily straightforward…

(SAM)

Beyond transcriptional silencing: functions of DNA methylation

• Chromatin compaction

• Genome stability

• Suppression of homologous recombination between repeats

• Genome defense against retroviruses

• Genetic recombination & DNA mutability

• X-chromosome inactivation (in females)

• Genomic imprinting

Gene-body DNA methylation

Madeleine et al, Nature Biotechnology, 2009

[Figure: DNA methylation and log2 expression in cerebellum vs frontal cortex for EOMES and GRM4; p = 1.90E-33 and p = 1.39E-34]

41 (82%) out of the top 50 ranked cerebellum-cortex DMRs are mirrored by significant gene expression differences

2. We have to rely upon imperfect technology

• Measuring DNA methylation is not like measuring genotype

• Genome-scale assessment now feasible

• Compromise between coverage and precision

• Huge range of methods (enrichment, measurement and analysis)

• No consensus on analysis method

There is a huge number of methylomic profiling methodologies

Laird et al, 2010

Sensitivity

Reproducibility

Cost

Sample requirements

Coverage & Throughput

Illumina 450K methylation array and EWAS

Still very much focused on CpG Islands….

High correlation across genome-wide platforms…

But… mainly driven by the fact that the large majority of the genome is either unmethylated or fully methylated

…substantial discrepancies between platforms may exist for intermediate-level methylation

Can these genome-based approaches detect a small % change?

Many stages when inaccuracies in measuring DNA methylation may occur

• Error (variation) can be introduced during

– Tissue / cell processing

– DNA preparation / storage

– Enrichment / conversion

– Amplification

– Measurement

– QC

– Analysis

• Accurate measurement may be vital if DNA methylation is to be used for clinical (i.e. diagnostic or prognostic) purposes

• But is ‘accuracy’ vital for epidemiological studies?

• Consistency / reliability more important?

Some inconvenient truths

• nothing you can do with normalization replaces careful experimental design

• a nuisance variable confounded with what you want to test can’t be fixed – eg case and control in different batches

• nothing you can do with normalization replaces rigorous QC

– samples with unusual raw intensity distributions

– multivariate methods such as PCA often identify mislabeled samples

• Problems with outsourcing and core facilities

• Bad data will always be bad data!!
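The point about nuisance variables confounded with the test variable can be checked before any arrays are run. The sketch below flags batches that do not contain every study group; the sample sheet, the group labels and the function name `confounded_batches` are all hypothetical and only illustrate the idea.

```python
from collections import Counter, defaultdict


def confounded_batches(samples):
    """Return the batches that do not contain every study group.

    samples: list of (sample_id, group, batch) tuples
    """
    groups_per_batch = defaultdict(Counter)
    for _sample_id, group, batch in samples:
        groups_per_batch[batch][group] += 1
    all_groups = {group for _, group, _ in samples}
    # A batch missing any group cannot separate group effects from batch effects
    return sorted(
        batch
        for batch, counts in groups_per_batch.items()
        if set(counts) != all_groups
    )


# Hypothetical sample sheets, for illustration only
bad_design = [
    ("s1", "case", "batch1"), ("s2", "case", "batch1"),
    ("s3", "control", "batch2"), ("s4", "control", "batch2"),
]
good_design = [
    ("s1", "case", "batch1"), ("s3", "control", "batch1"),
    ("s2", "case", "batch2"), ("s4", "control", "batch2"),
]

print("bad design, confounded batches:", confounded_batches(bad_design))
print("good design, confounded batches:", confounded_batches(good_design))
```

A check like this only catches complete confounding; unbalanced but mixed batches still need to be handled in the design and the model.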

PCR bias: the effect of annealing temperature

Make sure your PCR machines are calibrated!!

[Figure: PCR block and Illumina batch effects, Blocks 1 to 4]

The Illumina 450K array – some problems…

• Normalisation issues

– Type 1 and 2 probes

– Batch effects

– Array position

• Colour correction

• SNPs on probes

• Cross-hybridising probes

– Sex chromosomes very obvious

6-10% of probes are non-specific

SNPs in array probes are common

Laird, Nat Rev Genet, 2010

Features and sources of bias for DNA methylation technologies

No method is perfect…

• Bisulfite-based methods: PCR biases, bisulfite / PCR batch effects, hydroxymethylation

• Affinity-based methods – CG density, IP efficiency / non-specificity, resolution, quantification, CNV confounds

• MSRE-based methods – limited resolution, SNPs in MSRE sites

• And it’s not all bad news especially for bigger effects.

• For smaller differences – does non-verification negate a true difference?

• Data integration and meta-analysis across studies

MeDIP-seq vs 450K Illumina data

[Figure: two panels showing methylation level (axis 0 to 0.9) in cerebellum, frontal cortex and whole blood; P = 6.73E-104]

Illumina 450K replication of MeDIP-seq identified DMRs [but NB 20% of TS-DMRs not covered with any probes, and many only by single probes]

MeDIP-seq vs bisulfite pyrosequencing data

• Low-frequency DNA methylation states may not be optimally detected by genome-wide (sequencing-based) approaches because sensitivity is a function of read-depth

• They may also not be detected via site-specific approaches such as Pyrosequencing or Sequenom EpiTYPER (sensitivity 2-5%)

• Ultra-deep bisulfite-sequencing of targeted regions?
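The dependence of sensitivity on read depth can be illustrated with a simple binomial model. This is a sketch under stated assumptions: reads are treated as independent draws, the 2% methylation level and the requirement of at least 3 methylated reads are arbitrary illustrative choices, and sequencing or conversion errors are ignored.

```python
from math import comb


def detection_probability(depth, meth_fraction, min_reads=1):
    """Probability of seeing at least `min_reads` methylated reads.

    Assumes each of `depth` reads is independently methylated with
    probability `meth_fraction` (a binomial sampling model).
    """
    p_too_few = sum(
        comb(depth, k) * meth_fraction**k * (1 - meth_fraction) ** (depth - k)
        for k in range(min_reads)
    )
    return 1.0 - p_too_few


# How often would a 2% methylation state be seen in >= 3 reads?
for depth in (10, 30, 100, 1000):
    prob = detection_probability(depth, 0.02, min_reads=3)
    print(f"depth {depth:>4}: detection probability = {prob:.3f}")
```

Under this model, low-frequency states only become reliably detectable at depths well above typical genome-wide coverage, which is the motivation for targeted ultra-deep sequencing.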

If something seems too good to be true….it probably is

Things to look for when interpreting published data…

• What enrichment/discrimination method has been used?

• Have the relevant controls been used?

• Have reactions been done in replicate (especially sodium bisulfite conversion, bisulfite PCR)?

• What magnitude of difference is being reported?

• Do the differences reported exceed the sensitivity of the platform?

F2RL3 encodes a protein that has functions which are relevant to cardiovascular disease

Implications for confounding effects in epigenetic epidemiological analyses

• Limited number of samples – overlap between studies?

• Pre-mortem factors and pH

• Cause of death

• Peri-mortem factors

• Post-mortem factors

Sample-related technical issues…

After a typical group lunch in the Mill lab…

DNA preparation (phenol/chloroform, columns, etc)

DNA storage (TE, Te, water, -80, -20, 4)

It is important to keep these consistent across samples (inconsistency can confound case-control differences, longitudinal changes, etc)

26,320 years ago….

“…Our results suggest that as long as ancient nuclear DNA remains amplifiable, cytosine methylation patterns can be assessed…methylation has been faithfully retained along with the DNA over evolutionary timescales…” Llamas et al, PLoS ONE, 2012

3. We may be limited by available sample sizes that are optimal for epigenetic epidemiology

The simple brute-force approach that has been used (relatively) successfully in GWAS is not valid for EWAS

Simply running Illumina 450K arrays on your GWAS samples is potentially a waste of ££

Rakyan et al, 2011

Discordant Monozygotic Twins – a powerful tool for epigenetic studies of complex disease

Control for: age, sex, genetics, pre-/peri-natal environment, parental origin

Verification of array data

Replication in brain tissue

Significant hypomethylation (20-30%) in 15% of SZ brains

Dempster et al (2011)

Chloe Wong

DRD4

SERT

MAOA

Wong et al 2010

Virtually Identical Large Changes

Discordance for disease phenotypes??

Longitudinal sampling…

Most existing cohorts were not designed with epigenetics in mind

Do sequential samples (of relevant tissues/cells) exist in ongoing longitudinal cohorts / biobanks?

DNA methylation at three CpG sites (in the promoters of the EDARADD, TOM1L1 and NPTX2 genes) is linear with age over a range of five decades. A regression model built on these sites explains 73% of the variance in age and is able to predict the age of an individual with an average accuracy of 5.2 years!
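A linear age predictor of this kind can be sketched as an ordinary least-squares regression of age on methylation at the three sites. Everything numeric below is invented for illustration: the slide does not give the fitted coefficients or the underlying data, and the helper names are hypothetical. The solver is a plain Gaussian elimination so that the example needs no external libraries.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_age_model(methylation, ages):
    """Least squares fit of: age = b0 + b1*cpg1 + b2*cpg2 + b3*cpg3."""
    rows = [[1.0] + list(levels) for levels in methylation]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ages)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict_age(coefs, levels):
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], levels))


def r_squared(coefs, methylation, ages):
    """Fraction of the variance in age explained by the fitted model."""
    mean_age = sum(ages) / len(ages)
    ss_res = sum((y - predict_age(coefs, m)) ** 2 for m, y in zip(methylation, ages))
    ss_tot = sum((y - mean_age) ** 2 for y in ages)
    return 1.0 - ss_res / ss_tot


# Hypothetical methylation levels (EDARADD, TOM1L1, NPTX2) and ages
methylation = [
    (0.10, 0.80, 0.30), (0.20, 0.70, 0.35), (0.30, 0.65, 0.50),
    (0.40, 0.50, 0.45), (0.50, 0.45, 0.70), (0.60, 0.30, 0.65),
]
ages = [20, 31, 39, 52, 58, 71]

coefs = fit_age_model(methylation, ages)
print("variance explained:", round(r_squared(coefs, methylation, ages), 3))
print("predicted age:", round(predict_age(coefs, (0.35, 0.60, 0.45)), 1))
```

With six invented samples and four parameters the fit is bound to look good; the figures quoted on the slide come from a real cohort and should not be compared with this toy output.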

Bell et al, PLoS Genetics, 2012

Ruth Pidsley (unpublished)

Age is a huge potential confounder in epigenetic studies


Boks et al (Epigenetics, in press)

4. Whatever we do, it may never be enough to fully account for epigenetic differences between tissues and cells

Davies et al, in press

Machine learning 100% tissue discrimination

HOXA gene cluster

Blood

BA9

BA10

BA8

EntCtx

STG

Cerebellum

Individual differences – conserved across blood and brain?

r=0.87, p<0.0001

Much more data needed!

Cell-type-specific methylomes: neurons, astrocytes, glia

Katie Lunnon, Jon Cooper

5. We may be trying to detect inherently small effect sizes using sub-optimal methods and sample cohorts

Do the small DNA methylation differences often observed between groups translate into differences in gene expression in the relevant tissue??

Gene         P-value        Adjusted P-value   Methylation difference
HOXB8        4.10E-06       0.08               -0.02
NKX2-5       2.54E-05       0.26               -0.03
C17orf100    8.10E-05       0.53               -0.01
SGCE         0.000114176    0.53               -0.01
C4orf38      0.0001863      0.53               -0.01
HAUS2        0.000211401    0.53               -0.01
LOC399815    0.000267029    0.53                0.04
LRP1B        0.000327942    0.53                0.04
FAM92B       0.000347768    0.53                0.02

1500bp upstream of transcription start site: P < 1×10⁻⁸, mean methylation difference = -0.02
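The gap between nominal and adjusted P-values in a genome-wide scan comes from multiple-testing correction. The slide does not say which adjustment was used; the sketch below shows the Benjamini-Hochberg false discovery rate procedure purely as one common choice, applied to invented P-values. It is not expected to reproduce the adjusted values reported for the genes listed, since those depend on the full set of tests performed.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical nominal P-values, for illustration only
nominal = [0.01, 0.04, 0.03, 0.005]
for p, q in zip(nominal, benjamini_hochberg(nominal)):
    print(f"nominal {p:.3f} -> adjusted {q:.3f}")
```

With tens of thousands of tests, even a nominal P-value of the order of 1E-06 can fail to reach significance after adjustment, which is why small methylation differences are hard to establish in modest cohorts.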

Confirming findings in cleaner model systems (e.g. cell and animal models)

Control for confounding factors and environmental influences

But a mouse is not a man, and a cell-line is not a body

6. We lack a framework for the analysis of genome-wide epigenetic data

• Reference epigenomes – across cells and tissues – what is normal?

• Cataloging regions of high inter-individual variation

• Integrating epigenomic data with genetics and other –omics information

Reference Epigenomes

Technology Development

Novel Epigenetic Marks

Epigenomics of Human Health & Disease

Neurodegeneration

Bipolar disorder

Schizophrenia

Autism

Atherosclerosis

Hypertension

SLE

Kidney disease

Asthma

Insulin Resistance

http://www.roadmapepigenomics.org

Click here to browse data

Where is the data: sites with unique features

Consortium homepage: http://roadmapepigenomics.org
• View data on genome
• Protocols
• Standards

NCBI: http://ncbi.nlm.nih.gov/epigenomics and http://ncbi.nlm.nih.gov/geo/roadmap/epigenomics
• View data
• Download data
• Compare samples

Human Epigenome Atlas: http://epigenomeatlas.org
• View data on genome or with Atlas gene browser
• Download data
• Tools at Genboree Workbench

WashU VizHub: http://vizhub.wustl.edu
• Next-gen browser: http://epigenomegateway.wustl.edu
• UCSC visualization hub at http://genome.ucsc.edu

What data are available to me?

Range of cells/tissues covered: currently 125 cell/tissue types represented, including…
• iPS and ES cells, some differentiated forms
• Fetal tissues (heart, brain, kidney, lung, others)
• Adult primary cells and tissues (hematopoietic, brain regions, breast cell types, liver, kidney, colon, muscle, adipocytes, others)

Most samples will have:
• DNA methylation data (RRBS, MRE-seq, MeDIP-seq, whole genome bisulfite seq)
• ChIP-seq data (currently H3K27me3, H3K36me3, H3K4me1, H3K4me3, H3K9me3)
• DNase I hypersensitivity data
• Gene expression data (arrays or RNA-seq)

Some samples will have:
• Expanded panel of histone modifications (currently 20+)

Can download:
• .wig, .bed, some .bam, SRA, peak calls

Two ways to browse: Data Table

Type search terms here to narrow list

Search isn’t literal (…type “lung”, “blood”)

Two ways to browse: Visual Browser

Mouse over sites, click for table

Genome-wide epigenetic data at NCBI: Epigenomics Gateway

http://www.ncbi.nlm.nih.gov/epigenomics

Tutorials Text search or browser search

Compare samples

Compare Samples: Identify genes with significant epigenetic differences

GO Terms and pathways found most frequently

The Human Epigenome Atlas

Click for data

Genboree workbench

Click to download data, or for metadata

Click to view selected data sets

A new epigenomics browser: data and metadata together

http://epigenomegateway.wustl.edu

Click here for browser

Ernst et al, Nature 2011

Top SNPs linked to cell type specific enhancer states in disease relevant cell types

What can I do with the data: interpret GWAS hits

[Figure: significance by genomic locus]

Data from ENCODE:

Human and mouse: http://genome.ucsc.edu/ENCODE

Fly and worm: http://modencode.org

Beyond GWAS: Integrated genetic-epigenetic approach to common disease

7. Back to hype and bad science reporting – especially with regard to “transgenerational epigenetic inheritance”

We need to manage expectations

Epigenetic profile erased and reset de novo during gametogenesis

Transgenerational longitudinal cohort studies

Questions?

Lecturer in Epigenetics

Bioinformatic approaches and computational epigenomics

Environmental epigenomics

Functional epigenomics

Postdoctoral Research Workers

Laboratory Technician

PhD students

Contact: [email protected] www.epigenomicslab.com