27
Data analytics approaches to enable EWAS Chirag J Patel and Nam Pho Emory Exposome Workshop 06/16/16 [email protected] @chiragjp www.chiragjpgroup.org

Data analytics to support exposome research course slides

Embed Size (px)

Citation preview

Page 1: Data analytics to support exposome research course slides

Data analytics approaches to enable EWAS

Chirag J Patel and Nam PhoEmory Exposome Workshop

06/16/16

[email protected]@chiragjp

www.chiragjpgroup.org

Page 2: Data analytics to support exposome research course slides

Extensible & open-source analytics software library

Freely available exposome data for your research

(XWAS R package)

(NHANES: 40,000 individuals and 1,000 variables)

Materials for teaching and demonstration

Computer “environment” to conduct EWASs(Docker container in RStudio)

goodness!XWAS + + =

Page 3: Data analytics to support exposome research course slides

http://bit.ly/exposome-analytics-course

Please let us know if you are using the resources(or provide feedback)!

@chiragjp @nampho2

Chirag Nam

Page 4: Data analytics to support exposome research course slides

diseasesdiabetescancer

heart disease

functionexpressiontelomeres

metabolome

externalgeographyair pollution

income

internallead (serum)

nutrients (serum)infection (urine)

metabolome

phenomeexposome

Real quick: What is the exposome? What is the phenome?

Page 5: Data analytics to support exposome research course slides

phenome

Exposome associated with the phenome?

exposome

Analytic tools and big data infrastructure required to associate exposome with phenome!

…and vice versa?

Page 6: Data analytics to support exposome research course slides

We can learn a thing or two from genomics investigation…

phenomegenomee.g., GWAS

Page 7: Data analytics to support exposome research course slides

Computational approaches fueled discovery of genetic variants in disease

(example: genome-wide association [GWAS])

A search engine for robust, reproducible genotype-phenotype associations…

580 VOLUME 42 | NUMBER 7 | JULY 2010 NATURE GENETICS

A RT I C L E S

13 autosomal loci exceeded the threshold for genome-wide significance (P ranging from 2.8 × 10−8 to 1.4 × 10−22) with allele-specific odds ratios (ORs) between 1.06 and 1.14 (Table 1 and Fig. 2). All signals remained close to or beyond genome-wide significance thresholds (the least significant P value was 5.2 × 10−8) when we repeated analyses after implementing a second (post meta-analysis) round of genomic control adjustment within stage 1 data (Supplementary Note).

We extended our search for susceptibility signals to the X chromo-some, identifying one further signal in the stage 1 discovery samples meeting our criteria for follow-up (represented by rs5945326, near DUSP9, P = 2.3 × 10−6). This SNP showed strong evidence for replication in 8,535 cases and 12,326 controls (OR (allowing for X-inactivation) 1.32 (95% CI 1.16–1.49), P = 2.3 × 10−5), for a combined association P value of 3.0 × 10−10 (OR 1.27 (95% CI 1.18–1.37)) (Table 1 and Fig. 2).

Fourteen signals reaching genome-wide significanceTwo of the 14 signals reaching genome-wide significance on joint analysis (those near MTNR1B and IRS1) represent loci for which T2D associations have been recently reported in samples which partially overlap with those studied here10,14–16 (Table 1).

A third signal (rs231362) on 11p15 overlaps both intron 11 of KCNQ1 and the KCNQ1OT1 transcript that controls regional imprint-ing17 and influences expression of nearby genes including CDKN1C, a known regulator of beta-cell development18. This signal maps ~150 kb from T2D-associated SNPs in the 3 end of KCNQ1 first identified in East Asian GWA scans8,9. SNPs within the 3 signal were also detected in the current DIAGRAM+ meta-analysis (for example, rs163184, P = 6.8 × 10−5), but they failed to meet the threshold for initiating replication. A SNP in the 3 region (rs2237895) that was reported to reach genome-wide significance in Danish samples9 was neither typed nor imputed in the DIAGRAM+ studies. In our European-descent samples, rs231362 and SNPs in the 3 signal were not correlated

(r2 < 0.05), and conditional analyses (see below) establish these SNPs as independent (Fig. 2 and Supplementary Table 4). Further analysis in Icelandic samples has shown that both associations are restricted to the maternally transmitted allele11. Both T2D loci are independent of the common variant associations with electrocardiographic QT intervals that map at the 5 end of KCNQ1 (r2 < 0.02, D < 0.35 in HapMap European CEU data)19,20 (Supplementary Table 5).

Of the remaining loci, two (near BCL11A and HNF1A) have been highlighted in previous studies7,21–23 but are now shown to reach genome-wide significance. Rare mutations in HNF1A account for a substantial proportion of cases of maturity onset diabetes of the young, and a population-specific variant (G319S) influences T2D risk in Oji-Cree Indians24. Confirmation of a common variant association at HNF1A brings to five the number of loci known to harbor both rare mutations causal for monogenic forms of diabetes and common vari-ants predisposing to multifactorial diabetes, the others being PPARG, KCNJ11, WFS1 and HNF1B. A T2D association in the BCL11A region was suggested by the earlier DIAGRAM meta-analysis (rs10490072, P = 3 × 10−5), but replication was inconclusive7; there is only modest linkage disequilibrium (LD) between rs10490072 and the lead SNP from the present analysis (rs243021, r2 = 0.22, D = 0.73 in HapMap CEU).

The remaining nine signals map near the genes HMGA2, CENTD2, KLF14, PRC1, TP53INP1, ZBED3, ZFAND6, CHCHD9 and DUSP9 (Table 1 and Figs. 1 and 2) and represent new T2D risk loci uncovered by the DIAGRAM+ meta-analysis.

Understanding the genetic architecture of type 2 diabetesCombining newly identified and previously reported loci and assuming a multiplicative model, the sibling relative risk attributable to the 32 T2D susceptibility variants described in this paper is ~1.14. With addition of the five T2D loci recently identified by the Meta-Analysis of Glucose and Insulin-related traits Consortium (MAGIC) investigators12,13 and

50 Locus established previouslyLocus identified by current studyLocus not confirmed by current study

BCL11A

THADANOTCH2

ADAMTS9

IRS1

IGF2BP2

WFS1ZBED3

CDKAL1

HHEX/IDEKCNQ1 (2 signals*: )

TCF7L2

KCNJ11CENTD2MTNR1B

HMGA2 ZFAND6PRC1

FTOHNF1B DUSP9

Conditional analysis

Unconditional analysis

TSPAN8/LGR5HNF1A

CDC123/CAMK1DCHCHD9

CDKN2A/2BSLC30A8

TP53INP1

JAZF1KLF14

PPAR

40

30

–log

10(P

)–l

og10

(P)

20

10

10

1 2 3 4 5 6 7 8Chromosome

9 10 11 12 13 14 15 16 17 18 19 20 21 22 X0

0

Suggestive statistical association (P < 1 10–5)

Association in identified or established region (P < 1 10–4)

Figure 1 Genome-wide Manhattan plots for the DIAGRAM+ stage 1 meta-analysis. Top panel summarizes the results of the unconditional meta-analysis. Previously established loci are denoted in red and loci identified by the current study are denoted in green. The ten signals in blue are those taken forward but not confirmed in stage 2 analyses. The genes used to name signals have been chosen on the basis of proximity to the index SNP and should not be presumed to indicate causality. The lower panel summarizes the results of equivalent meta-analysis after conditioning on 30 previously established and newly identified autosomal T2D-associated SNPs (denoted by the dotted lines below these loci in the upper panel). Newly discovered conditional signals (outside established loci) are denoted with an orange dot if they show suggestive levels of significance (P < 10−5), whereas secondary signals close to already confirmed T2D loci are shown in purple (P < 10−4).

Voight et al, Nature Genetics 2012N=8K T2D, 39K Controls

GWAS in Type 2 Diabetes

Page 8: Data analytics to support exposome research course slides

There are non-trivial data analytic challenges in searching for exposome-phenome associations!

JAMA 2014Pac Symp Biocomp 2015

Page 9: Data analytics to support exposome research course slides

Dense correlational web!

JAMA 2014Pac Symp Biocomp 2015

what causes what?

confounding bias?

Page 10: Data analytics to support exposome research course slides

3-hydroxyfluorene and 2-hydroxyfluorene, were asso-ciated with 3% lower HDL-C (P¼ 0.006 and P¼ 0.004).Cotinine, a serum biomarker for nicotine, was associatedalso with 3% lower HDL-C (P¼ 2" 10#6).

Sensitivity analyses with further adjustmentsFor most questionnaire variable adjustments, we didnot see a sizable difference in estimated coefficients orP-values for the environmental factors (Supplemen-tary Figures 2–4, available as Supplementary Data atIJE online), including questionnaire items regardingself-reported cardiovascular-related disease statusand use of drugs. Interestingly, some adjustmentsincreased the effect size of the environmental factor.For example, the association of cotinine, 3- and2-hydroxyfluorene with HDL-C strengthened after

adjustment for alcohol intake. Adjustment for fishand shellfish consumption strengthened the associ-ation between retinyl stearate and HDL-C andtriglyceride levels. Conversely, the effect of vitaminC and folate in relation to HDL-C decreased whentaking supplement count, total fiber intake and phys-ical activity into account. Adjusting for supplementcount decreased the effect of g-tocopherol on HDL-C.

Simultaneous adjustment for self-reportedcardiovascular-related disease, supplement count,lipid-lowering drugs and physical activity strength-ened the association between tocopherols and pollu-tant factors and triglycerides, while attenuating theassociation to a-carotene (Supplementary Figure 2available as Supplementary Data at IJE online).For HDL-C levels, effects of cotinine, mercury,3- and 2-flourene, folate, vitamin C, vitamin D

CarotenoidsMinerals A

BCDE

PhytoestrogensCotinineHydrocarbons

Volatile compounds

VirusBacteria

PhenolsPhthalates

Polyfluorochemicals

Heavy metals

PCBs

Dioxins

Vitamins

Diakyl

Furans

PesticidesChloroacetanilideOrganochlorineOrganophosphatePhenolPyrethyroid

Car

oten

oids

Min

eral

sA B C D E

Phy

toes

troge

nsC

otin

ine

Hyd

roca

rbon

s

Vol

atile

com

poun

ds

Viru

sB

acte

ria

Phe

nols

Pht

hala

tes

Pol

yflu

oroc

hem

ical

s

Hea

vy m

etal

s

PC

Bs

Dio

xins

Vita

min

s

Dia

kyl

Fur

ans

Pes

ticid

esC

hlor

oace

tani

lide

Org

anoc

hlor

ine

Org

anop

hosp

hate

Phe

nol

Pyr

ethy

roid

0 0.5 1

Value

020

0040

0060

00

Colour keyand histogram

Cou

nt

Figure 2 Partial Pearson’s correlation between environmental factors. Partial Pearson’s correlation, adjusted by age andBMI (and creatinine for factors measured in urine) for each of the 188 factors were computed for each survey separately.We combined correlations between surveys using a meta-analytic random-effects estimate and displayed them in a heatmap(above), and ordered them by environmental ‘class’, coloured as in Figure 1A. Pairs of factors where correlations could notbe computed are shown in grey

836 INTERNATIONAL JOURNAL OF EPIDEMIOLOGY

by guest on February 15, 2016http://ije.oxfordjournals.org/

Dow

nloaded from

IJE 2013

Page 11: Data analytics to support exposome research course slides

Interdependencies of the exposome: Correlation globes paint a complex view of exposure

Pac Symp Biocomput. 2015JECH. 2015http://bit.ly/globebrowse

Page 12: Data analytics to support exposome research course slides

Multiplicity: how to determine signal from noise?type 1 error (spurious findings)

JAMA 2014

Page 13: Data analytics to support exposome research course slides

Suppose you are testing 1000 exposures in case-control study (disease vs. healthy)…

… and there were no difference between the cases and controls…

…how many findings would be “significant” at a p-value threshold of 0.05 (due to chance)?

Page 14: Data analytics to support exposome research course slides

Regime of multiple tests and “signal to noise”:Histogram of p-values in 2 scenarios: no difference and 5% different

pvalues

Frequency

0.0 0.2 0.4 0.6 0.8 1.0

020

4060

80100

No difference (no true associations)

pvalues.ewas

Frequency

0.0 0.2 0.4 0.6 0.8 1.00

2040

6080

100

140

5% exposures different (5% true associations)

A B

Page 15: Data analytics to support exposome research course slides

Estimating the deviation from null:QQplot: -log10(pvalues) in the null and EWAS distributions

0.0 0.5 1.0 1.5 2.0 2.5

010

2030

40

expected −log10(pvalue)

10%

diff

eren

t −lo

g10(

pval

ue)

A

B

Bonferroni False discovery rate

Page 16: Data analytics to support exposome research course slides

The tension between type 1 and type 2 errors:Power and replication for robust associations!

Discovery sample sizes must be large to overcome multiple testing and mitigate winner’s curse

Replication sample size must be large to detect association

DiscoveryN=1K

ReplicationN=1K

Page 17: Data analytics to support exposome research course slides

What will the exposome data structure look like?:

a high-dimensioned 3D matrix of (1) exposure measurements

on (2) individuals as a function of (3) time

time

exposome phenome

pollutants

diet

metabolites . . .

gut flora

height

weightCVD

BPT2D

cancer

xenobiotics . . .

indi

vidu

als

GWAS, RVAS, pathway analysis..etc.

EWAS, PheWAS..etc.

geno

me

(sta

tic)

mixtures of exposures

time

drugs

integrative

mixtures of phenotypes

(A) (C)

(B)

exposome factors

nutrient value for individual i

individual i

in review

Page 18: Data analytics to support exposome research course slides

time

exposome phenome

pollutants

diet

metabolites . . .

gut flora

height

weightCVD

BPT2D

cancer

xenobiotics . . .

indi

vidu

als

GWAS, RVAS, pathway analysis..etc.

EWAS, PheWAS..etc.

geno

me

(sta

tic)

mixtures of exposures

time

drugs

integrative

mixtures of phenotypes

(A) (C)

(B)

longitudinal

system

genome

in review

What will the exposome data structure look like?:

a high-dimensioned 3D matrix of (1) exposure measurements

on (2) individuals as a function of (3) time

Page 19: Data analytics to support exposome research course slides

A schematic of a data-driven search for exposome-phenome associations…

time

exposome phenome

pollutants

diet

metabolites . . .

gut flora

height

weightCVD

BPT2D

cancer

xenobiotics . . .

indi

vidu

als

GWAS, RVAS, pathway analysis..etc.

EWAS, PheWAS..etc.

geno

me

(sta

tic)

mixtures of exposures

time

drugs

integrative

mixtures of phenotypes

(A) (C)

(B)

EWAS

in review

Page 20: Data analytics to support exposome research course slides

Time for you to give it a try!

Page 21: Data analytics to support exposome research course slides

Extensible & open-source analytics software library

Freely available exposome data for your research

(XWAS R Package)

(NHANES: 40,000 individuals and 1,000 variables)

Materials for teaching and demonstration

Computer “environment” to conduct EWASs(Docker container in RStudio)

goodness!XWAS + + =

Page 22: Data analytics to support exposome research course slides

Fully merged dataset:National Health and Nutrition Examination Survey

since the 1960snow biannual: 1999 onwards10,000 participants per survey

Introduction

The National Health and Nutrition Examination Survey (NHANES) is a program of studiesdesigned to assess the health and nutritional status of adults and children in the United States. The survey is unique in that it com-bines interviews and physical examinations. NHANES is a major program of the National Center for Health Statistics (NCHS). NCHS is part of the Centers for Disease Control and Prevention (CDC) and has the responsibility for producing vital and health statistics for the Nation.

The NHANES program began in the early 1960s and has been conducted as a series of sur-veys focusing on different population groups or health topics. In 1999, the survey became a con-tinuous program that has a changing focus on a variety of health and nutrition measurements to meet emerging needs. The survey examines a nationally representative sample of about 5,000 persons each year. These persons are located in counties across the country, 15 of which are visited each year.

The NHANES interview includes demographic, socioeconomic, dietary, and health-related questions. The examination component consists of medical, dental, and physiological measure-ments, as well as laboratory tests administered by highly trained medical personnel.

Findings from this survey will be used to de-termine the prevalence of major diseases and risk factors for diseases. Information will be used to assess nutritional status and its associ-ation with health promotion and disease pre-vention. NHANES findings are also the basis for national standards for such measurements as height, weight, and blood pressure. Data from this survey will be used in epidemiologi-cal studies and health sciences research, which help develop sound public health policy,

direct and design health programs and services, and expand the health knowl-edge for the Nation.

Survey Content

As in past health examination surveys, data will be collected on the prevalence of chron-ic conditions in the population. Estimates for previously undiagnosed conditions, as well as those known to and reported by respon-dents, are produced through the survey. Such information is a particular strength of the NHANES program.

Risk factors, those aspects of a person’s life-style, constitution, heredity, or environment that may increase the chances of developing a certain disease or condition, will be examined. Smoking, alcohol consumption, sexual practices, drug use, physical fitness and activity, weight, and dietary intake will be studied. Data on certain aspects of reproductive health, such as use of oral contraceptives and breastfeeding practices, will also be collected.

The diseases, medical conditions, and health indicators to be studied include:

• Anemia• Cardiovascular disease• Diabetes• Environmental exposures• Eye diseases• Hearing loss• Infectious diseases• Kidney disease• Nutrition• Obesity• Oral health• Osteoporosis

The sample for the survey is selected to represent the U.S. population of all ages. To produce reli-able statistics, NHANES over-samples persons 60 and older, African Americans, and Hispanics.

Since the United States has experienced dramatic growth in the number of older people during this century, the aging population has major impli-cations for health care needs, public policy, and research priorities. NCHS is working with public health agencies to increase the knowledge of the health status of older Americans. NHANES has a primary role in this endeavor.

All participants visit the physician. Dietary inter-views and body measurements are included for everyone. All but the very young have a blood sample taken and will have a dental screening. Depending upon the age of the participant, the rest of the examination includes tests and proce-dures to assess the various aspects of health listed above. In general, the older the individual, the more extensive the examination.

Survey Operations

Health interviews are conducted in respondents’ homes. Health measurements are performed in specially-designed and equipped mobile centers, which travel to locations throughout the country. The study team consists of a physician, medical and health technicians, as well as dietary and health interviewers. Many of the study staff are bilingual (English/Spanish).

An advanced computer system using high-end servers, desktop PCs, and wide-area networking collect and process all of the NHANES data, nearly eliminating the need for paper forms and manual coding operations. This system allows interviewers to use note-book computers with electronic pens. The staff at the mobile center can automatically transmit data into data bases through such devices as digital scales and stadiometers. Touch-sensi-tive computer screens let respondents enter their own responses to certain sensitive ques-tions in complete privacy. Survey information is available to NCHS staff within 24 hours of collection, which enhances the capability of collecting quality data and increases the speed with which results are released to the public.

In each location, local health and government officials are notified of the upcoming survey. Households in the study area receive a letter from the NCHS Director to introduce the survey. Local media may feature stories about the survey.

NHANES is designed to facilitate and en-courage participation. Transportation is provided to and from the mobile center if necessary. Participants receive compensation and a report of medical findings is given to each participant. All information collected in the survey is kept strictly confidential. Privacy is protected by public laws.

Uses of the Data

Information from NHANES is made available through an extensive series of publications and articles in scientific and technical journals. For data users and researchers throughout the world, survey data are available on the internet and on easy-to-use CD-ROMs.

Research organizations, universities, health care providers, and educators benefit from survey information. Primary data users are federal agencies that collaborated in the de-sign and development of the survey. The National Institutes of Health, the Food and Drug Administration, and CDC are among the agencies that rely upon NHANES to provide data essential for the implementation and evaluation of program activities. The U.S. Department of Agriculture and NCHS coop-erate in planning and reporting dietary and nutrition information from the survey.

NHANES’ partnership with the U.S. Environ-mental Protection Agency allows continued study of the many important environmental influences on our health.

• Physical fitness and physical functioning• Reproductive history and sexual behavior• Respiratory disease (asthma, chronic bron- chitis, emphysema)• Sexually transmitted diseases • Vision

>250 exposures (serum + urine)

>85 quantitative clinical traits (e.g., serum glucose, lipids, body mass index)

Death index linkage (cause of death)

Ready to analyze! N=41K with >1000 variables (let us know; we can give you a DOI)

in review

Page 23: Data analytics to support exposome research course slides

http://bit.ly/ewas_nhanes

preterm birth type 2 diabetes

type 2 diabetes genetics lipids

blood pressure income mortality

telomere length methodology (5)

13 EWAS-related manuscripts

Page 24: Data analytics to support exposome research course slides

Associations in Telomere Length: Can you identify the associations in this graph?

IJE, 2016

0

1

2

3

4

−0.2 −0.1 0.0 0.1 0.2effect size

−log

10(p

valu

e)

FDR<5%

shorter telomeres longer telomeres

median N=3000; N range: 300-7000

??

Page 25: Data analytics to support exposome research course slides

Nam will show you how!

Associations in Telomere Length: Can you identify the associations in this graph?

Page 26: Data analytics to support exposome research course slides

http://bit.ly/exposome-analytics-course

Please let us know if you are using the resources(or provide feedback)!

@chiragjp @nampho2

Chirag Nam

Page 27: Data analytics to support exposome research course slides

Harvard DBMI Isaac KohaneSusanne ChurchillStan ShawJenn GrandfieldMichal Preminger

Chirag J [email protected]

@chiragjpwww.chiragjpgroup.org

NIH Common FundBig Data to Knowledge

AcknowledgementsRagGroup Nam Pho Chirag Lakhani Adam Brown Danielle Rasooly Arjun Manrai Grace MahoneyMatthew Roy

Gary Miller Kristine Dennis

Stanford John PA Ioannidis