
The genetics of gene expression: from simulations to the early-life origins of immune diseases

Qinqin Huang ORCID: 0000-0003-3073-717X

Doctor of Philosophy

March, 2019

Department of Clinical Pathology

The University of Melbourne

Systems Genomics Laboratory

Baker Heart & Diabetes Institute

Submitted in Partial Fulfillment of the Requirements of the

Degree of Doctor of Philosophy


Abstract

Human complex traits and diseases are often highly polygenic. Genome-wide association studies (GWAS) have been successful in identifying the underlying genetic components. However, challenges remain, one of which is the biological interpretation of these findings. Genetic variants that are associated with diseases or traits are enriched in regulatory regions of the genome, suggesting that they may have a role in the regulation of intermediate molecular phenotypes, such as mRNA expression. Studies of expression quantitative trait loci (eQTLs), which investigate the genetic architecture of gene expression variation, have aided the interpretation of GWAS findings by providing potential mechanisms through which genetic variants contribute to higher-order phenotypes. In addition, eQTLs identified in disease-relevant tissues, or those that are specific to certain cell types or conditions, are more informative about disease pathogenesis.

This thesis first explored eQTL study design and analysis choices using extensive, empirically driven simulations with varying sample sizes, true effect sizes, and allele frequencies of true eQTLs. False discovery rate (FDR) control applied to the entire collection of tests yielded an inflated FDR for genes with eQTLs (eGenes) in most scenarios; in contrast, hierarchical correction procedures had a well-calibrated FDR. Significant eQTLs with low allele frequencies identified using small sample sizes were enriched for false positives. Overestimation of eQTL effect sizes was common in scenarios with low statistical power, and a bootstrap method (BootstrapQTL) was developed that can lead to more accurate effect size estimation.
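The hierarchical idea described above can be illustrated with a short Python sketch. It assumes one particular combination, chosen here only for illustration: Bonferroni correction of the SNP-level p-values within each gene, followed by Benjamini-Hochberg correction of the resulting gene-level p-values across genes. The function names (bh_adjust, hierarchical_egenes) and the toy p-values are hypothetical; this is not the BootstrapQTL implementation, and it is not necessarily the exact procedure evaluated in the thesis.

```python
import numpy as np


def bh_adjust(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by m / rank.
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest rank.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted


def hierarchical_egenes(pvals_by_gene, alpha=0.05):
    """Two-step correction (illustrative assumption, see lead-in).

    Step 1 (local): Bonferroni within each gene, keeping the smallest
    corrected p-value as the gene-level p-value.
    Step 2 (global): Benjamini-Hochberg across the gene-level p-values.
    """
    genes = list(pvals_by_gene)
    gene_p = [
        min(1.0, min(pvals_by_gene[g]) * len(pvals_by_gene[g])) for g in genes
    ]
    adjusted = bh_adjust(gene_p)
    return {g for g, q in zip(genes, adjusted) if q <= alpha}


# Hypothetical nominal p-values for the cis-SNPs of three genes.
toy = {
    "geneA": [1e-6, 0.2, 0.5],
    "geneB": [0.03, 0.4],
    "geneC": [0.6, 0.9, 0.7, 0.8],
}
print(hierarchical_egenes(toy))  # {'geneA'}
```

In this toy example only geneA is declared an eGene: geneB has a nominally small p-value (0.03), but it does not survive the within-gene and across-gene corrections. A pooled approach would instead apply a single FDR correction to all nine SNP-gene tests at once.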

Based on the insights from the eQTL simulation study, optimal strategies were selected for the subsequent eQTL analysis in two types of neonatal immune cells (monocytes and T cells) under resting and stimulated conditions. A large proportion of cis-eQTLs were specific to a certain cell type or condition, and the majority of them were observed only upon stimulation. Response eQTLs (reQTLs), whose effects on gene expression are modified by immune responses, were identified for 31% of the eGenes in monocytes and 52% of the eGenes in T cells. Trans-eQTL effects that were mediated through the expression of cis-eGenes were also observed.

Lastly, integrative analyses were performed using the early-life eQTLs together with GWAS variants associated with immune-related diseases, obtained from large external cohorts. Significant overlaps between neonatal eQTLs and postnatal disease-associated variants were observed. Some cell type- or condition-specific cis-eQTLs colocalised with disease associations, suggesting that the potential risk genes involved in disease pathogenesis are linked to the stimulation of certain immune cells. Causal effects of genes were evaluated using Mendelian randomisation, and changes in the expression levels of some genes (e.g. BTN3A2) were found to have causal associations with multiple immune-related diseases. Taken together, these findings suggest that early-life genetic variants and gene expression might contribute to later disease development.

In conclusion, this thesis provides a strong evidence base for eQTL study design and guidance on analysis strategies for future studies. The characterisation of the genetic regulation of neonatal immune responses, and of the interaction between regulatory variants and stimulatory conditions, is a useful resource and generates insights into the early-life origins of immune-related diseases that develop later in life.


Declaration

This is to certify that:

(I) This thesis comprises only my original work towards the Doctor of Philosophy degree, except where indicated in the preface;

(II) Due acknowledgement has been made in the text to all other material used;

(III) The thesis is shorter than the maximum word limit (100,000 words), exclusive of tables, maps, bibliographies, and appendices.

Qinqin Huang, B. Sci.

Department of Clinical Pathology,

The University of Melbourne

Systems Genomics Lab,

Baker Heart & Diabetes Institute


Preface

This preface describes the contributions of collaborators and the publication status of each chapter.

Chapter 1: Introduction. This chapter provides a literature review and the rationale for the research performed in this thesis. It is an original work, with advisory comments from my supervisor, Michael Inouye, and assistance with editing and grammar from other collaborators: Guillaume Méric, Artika Nath, Shu Mei Teo, and Alex Tokolyi.

Chapter 2: Power, false discovery rate and Winner’s Curse in eQTL studies. This is an original work, which has been published in a peer-reviewed journal (Nucleic Acids Research). I performed a simulation study to explore expression quantitative trait locus (eQTL) study design and analysis choices. The following publication is included, along with its supplemental materials, in Chapter 2:

Qin Qin Huang, Scott C. Ritchie, Marta Brożyńska, and Michael Inouye. Power, false discovery rate and Winner's Curse in eQTL studies. Nucleic Acids Res 46, e133 (2018).

I have altered the format of this published article to be consistent with the rest of the thesis.

I am the first author of this publication, and the contributions of myself and my co-authors are as follows:

• Michael Inouye conceived and directed this work.

• I performed the simulation analysis and interpreted the results with input from Michael Inouye and Scott Ritchie.

• I wrote the manuscript and plotted the figures with input from all co-authors and James E. Peters. All co-authors helped in the editing of the manuscript.

• I wrote the responses to reviewer comments with input from all co-authors.

• I developed a bootstrap method to correct for the effect size overestimation, and Scott Ritchie was responsible for implementing the software and improving the performance.

Chapter 3: Characterising the genetic basis of neonatal gene expression in immune responses of monocytes and T cells. This is an original unpublished work, in which I investigated the genetic regulation of gene expression in immune cells under resting and stimulated conditions.

The contributions of myself and collaborators to this work are as follows:

• Michael Inouye and Kathryn Holt conceived and directed this work.

• I led all aspects of the data analysis with input from Michael Inouye, Shu Mei Teo, Howard Tang, and Artika Nath. Howard Tang performed the quality control of the microarray genotype data. Scott Ritchie helped with the PEER analysis of the transcriptome data. Agus Salim provided advice on the statistical analysis to identify response eQTLs. Andrew Bakshi provided advice on the microarray transcriptome data analysis.

• I interpreted the results and created the figures with input from Michael Inouye, Kathryn Holt, Howard Tang, and Marta Brożyńska.

• I wrote this chapter with help in editing and grammar from Howard Tang and Marta Brożyńska.

• I contributed to the generation of the microarray gene expression data in collaboration with Chiea Chuen Khor at the Genome Institute of Singapore.

• The study cohort was established by Peter D Sly and Patrick G Holt. Sample collection and stimulation were performed by Danny Mok. RNA extraction was performed by Louise M Judd.

Chapter 4: Investigating the early-life origins of immune-related diseases using neonatal eQTLs. This is an original unpublished work, in which I explored the role of the early-life transcriptome and genetic variants in disease development later in life.

The contributions of myself and collaborators to this work are as follows:

• Michael Inouye conceived and directed this work.

• I led all aspects of the data analysis with input from Michael Inouye, Shu Mei Teo, and Artika Nath. Youwen (Owen) Qin provided advice on the Mendelian randomisation analysis.

• I interpreted the results with input from Michael Inouye, Shu Mei Teo, and Howard Tang.

• I created the figures with input from Michael Inouye and Howard Tang.

• I wrote this chapter with help in editing and grammar from Howard Tang and Artika Nath.

Chapter 5: Conclusions. This is an original work summarising the findings and importance of the work in this thesis. Shu Mei Teo helped with the editing and grammar of this chapter.


Acknowledgements

My PhD has been a sometimes stressful, but always amazing experience. Thanks are owed to many people who have helped and supported me.

Firstly, and most importantly, I would like to thank my principal supervisor, Michael Inouye, for his continuous guidance and support throughout my PhD. He has been a great mentor, tolerant of my mistakes as a new researcher, and always gives me insightful advice when I feel confused. I also greatly appreciate his help in preparing me for the postdoc interview. He always encourages me when I have difficulties, and makes me believe in myself. Without him, I could not have learnt so much and completed everything that I have done during my PhD.

I would like to thank my co-supervisor, Kathryn Holt, for her mentorship and support during my PhD. She is always very calm, and inspires me to become a better female researcher.

I would also like to thank the rest of my PhD advisory committee, Roberto Cappai and Alicia Oshlack, for their support and advice throughout the PhD.

My sincere thanks to many collaborators. The CAS group members, Shu Mei Teo, Howard Tang, Artika Nath, and Marta Brożyńska, have helped me in various aspects, from the data analysis to the interpretation of results. I am an introvert but I feel very comfortable in this team. I did not use to reach out for help at the beginning, but they are always thoughtful and willing to offer help. Shu Mei is both a friend and a mentor. She often gives me a timely push, and I get helpful advice from her on both research and life. In addition, her two daughters are so lovely, and they made me a (slightly) better babysitter.

Many thanks to Scott Ritchie, who helped with the software implementation and made our tool available online. He gave me useful advice on coding and analysis, as well as data visualisation. He also introduced me to our computing clusters and the R language.

Special thanks to those who helped in the editing of my thesis: Guillaume Méric, Howard Tang, Marta Brożyńska, Shu Mei Teo, Artika Nath, and Alex Tokolyi. Thank you for your assistance with the editing and grammar of my thesis; this was really helpful for a non-native speaker.

To the other PhD students in the group, Howard Tang, Youwen (Owen) Qin, Jason Grealey, and Yang Liu: we have been helping and supporting each other. Our Thursday game night, organised by Howard (the DnD master), added a lot of joy to our PhD life.


To other current and previous group members: Gad Abraham, Sean Byars, Andrew Bakshi, Tingting Wang, Agus Salim, Sergio Ruiz Carmona, Rodrigo Canovas, and Alex Smith. Thank you for your help; it is amazing to work with you.

To my friends: although most of them are not in Australia, they are always there when I want to share what happens in my life and work.

To my family: my parents give me the courage to do whatever I like, because they have always supported me throughout my life. Thanks to my cousins in Melbourne for making my first time living overseas much easier.

Finally, many thanks to my partner Huihao Shi for his ongoing patience and calmness. Thank you for helping me to reduce my stress and for understanding me all the time.


Table of Contents

Chapter 1: Literature Review
1.1 Complex diseases and their genetic background
1.1.1 The human genome
1.1.2 Human genetic variation
1.1.3 Genome-wide association studies
1.2 Gene expression, a fundamental intermediate phenotype
1.2.1 Regulation of gene expression
1.2.2 High-throughput transcriptome data
1.3 Genetic basis of gene expression variation
1.3.1 Expression quantitative trait loci (eQTLs)
1.3.2 Cis- and trans-eQTLs
1.3.3 eQTLs and human diseases
1.3.4 Response eQTLs
1.3.5 Beyond eQTLs
1.4 Early-life origins of diseases developed in adulthood
1.4.1 Immune cell populations and perinatal immune system
1.4.2 Early-life origins of chronic diseases developed later in life
1.5 Research objectives

Chapter 2: Power, false discovery rate and Winner's Curse in eQTL studies
2.1 Introduction
2.2 Results
2.2.1 Simulation of cis-eQTL data
2.2.2 Power and false discovery rate between scenarios and multiple testing correction procedures
2.2.3 Identification of the simulated causal eSNP
2.2.4 Winner’s Curse in eQTL effect size estimation
2.3 Discussion
2.4 Materials and Methods
2.4.1 Simulating genotypes and selecting eQTLs
2.4.2 Simulating gene expression
2.4.3 Mapping eQTLs and correcting for multiple testing
2.4.4 Conditional analyses
2.4.5 Correcting for Winner’s Curse
2.5 Supplemental Materials

Chapter 3: Characterising the genetic basis of neonatal gene expression in immune responses of monocytes and T cells
3.1 Introduction
3.2 Results
3.2.1 Study data
3.2.2 Local genetic regulatory variants and condition specificity
3.2.3 Condition-specific genetic regulatory variants
3.2.4 Trans-acting regulatory loci mediated by cis-eGenes
3.3 Discussion
3.4 Materials and Methods
3.4.1 Study cohort and RNA sample preparation
3.4.2 Gene expression profiling and data processing
3.4.3 Genotyping and imputation
3.4.4 Cis-eQTL mapping and conditional analysis
3.4.5 Replication of cis-eQTLs in external datasets
3.4.6 Enrichment analysis
3.4.7 Response eQTL detection
3.4.8 Trans-eQTL identification
3.4.9 Mediation analysis

Chapter 4: Investigating the early-life origins of immune-related diseases using neonatal eQTLs
4.1 Introduction
4.2 Results
4.2.1 Early-life cis-eQTLs are enriched for SNPs associated with immune-related diseases
4.2.2 Colocalisation of early-life regulatory variants with disease associations
4.2.3 Causal evaluation of genes for immune-related diseases
4.3 Discussion
4.4 Materials and Methods
4.4.1 Genetic regulatory variants on early-life gene expression
4.4.2 Processing GWAS summary statistics and enrichment analysis
4.4.3 Colocalisation analysis
4.4.4 Mendelian randomisation analysis
4.5 Supplemental Figures

Chapter 5: Conclusions

References


List of Tables

Table 1.1: List of recent response eQTL studies
Table S2.1: Sample size and minor allele frequency (MAF) cut-off used in recent eQTL studies
Table 3.1: Number of cis-eGenes and independent cis-eQTLs
Table 3.2: Number of cis-eGenes replicated in external datasets
Table 3.3: Mediation analysis to identify trans-associations that are mediated by cis-eGenes
Table 4.1: List of all colocalisations of cis-eQTLs with disease associations


List of Figures

Figure 2.1: Flowchart of eQTL simulation study
Figure 2.2: False discovery rate (FDR) and sensitivity of selected hierarchical multiple testing correction methods
Figure 2.3: Power and eQTL effect size
Figure 2.4: Identification of true causal eSNPs
Figure 2.5: Winner’s Curse in eQTL effect size estimation and correction by bootstrap methods
Figure S2.1: False discovery rate (FDR) and sensitivity of pooled methods for increasing sample sizes
Figure S2.2: False discovery rate (FDR) of all hierarchical multiple testing correction methods
Figure S2.3: Sensitivity/true positive rate (TPR) of all hierarchical multiple testing correction methods
Figure S2.4: False discovery rate (FDR) and sensitivity of hierarchical multiple testing correction using permutation tests
Figure S2.5: False discovery rate (FDR) and sensitivity of selected hierarchical multiple testing correction methods in simulations of log-normal noise
Figure S2.6: False discovery rate (FDR) and sensitivity of selected hierarchical multiple testing correction methods in simulations of correlated gene expression
Figure S2.7: False discovery rate (FDR) and sensitivity of selected hierarchical multiple testing correction methods in simulations of dominant effects
Figure S2.8: False discovery rate (FDR) and sensitivity of selected hierarchical multiple testing correction methods in simulations of recessive effects
Figure S2.9: False discovery rate (FDR) and sensitivity of selected hierarchical multiple testing correction methods in simulations of two causal eSNPs per eGene
Figure S2.10: Simulations of log-normal noise without inverse normal transformation
Figure S2.11: Average number of significant eSNPs per true positive eGene
Figure S2.12: Histogram of linkage disequilibrium (LD) r² between the top SNPs that were not causal and their respective causal eSNPs
Figure S2.13: False discovery rate (FDR) among the independent eQTL signals identified by conditional analyses in simulations of two (A) or three (B) causal eSNPs per eGene
Figure S2.14: Proportion of eGenes that had simulated causal eSNPs with negatively correlated minor allele dosages in simulations of two causal eSNPs per eGene
Figure S2.15: Proportion of simulated causal eSNPs identified in conditional analyses and the initial eQTL mapping step in simulations of two (A) or three (B) causal eSNPs per eGene
Figure S2.16: Identification of causal eSNPs among top SNPs in simulations of two causal eSNPs per eGene
Figure S2.17: Winner’s Curse in eQTL effect size estimation
Figure S2.18: Accuracy of three bootstrap estimators and the naïve estimator
Figure S2.19: Correction for Winner’s Curse by bootstrap method
Figure S2.20: Minor allele frequency (MAF) distribution of causal eSNPs in simulations of two (A) or three (B) causal eSNPs per eGene
Figure 3.1: Study design and analysis work flow
Figure 3.2: Distribution of number of detectable genes per microarray sample
Figure 3.3: Cis-eQTLs in four experimental conditions
Figure 3.4: Absolute effect sizes of cis-eQTLs and distance (kb) from the transcription start site (TSS) of the corresponding gene
Figure 3.5: Enrichment of cis-eQTLs in 3’UTR, 5’UTR, and exon regions
Figure 3.6: Proportion of cis-eQTLs that are significant response eQTLs (reQTLs)
Figure 3.7: Effect sizes of response eQTLs (reQTLs) in monocytes and T cells
Figure 3.8: Trans-eQTL associations
Figure 3.9: Trans-eQTL effects mediated through cis-eGenes
Figure 3.10: Flowchart showing microarray data process
Figure 4.1: Enrichment of early-life cis-eQTLs for genetic variants associated with immune-related diseases
Figure 4.2: Colocalisation of cis-eQTLs with disease associations
Figure 4.3: Colocalisation between the response eQTL (reQTL) of IL13 and allergic diseases
Figure 4.4: Causal effects of BTN3A2 gene expression on multiple immune-related diseases
Figure S4.1: Colocalisation between the response eQTL (reQTL) of CCL20 and childhood-onset asthma association
Figure S4.2: Colocalisations with different diseases are observed for the CTSH cis-eQTLs identified in resting (the left column) and stimulated (the right column) cells
Figure S4.3: Colocalisation between the response eQTL (reQTL) of UBASH3A and multiple diseases
Figure S4.4: Colocalisation between the response eQTL (reQTL) of IL6ST and rheumatoid arthritis
Figure S4.5: Colocalisation between the response eQTL (reQTL) of FAM167A and systemic lupus erythematosus


List of Abbreviations

The following abbreviations have been used in the thesis:

r² R-squared, used to quantify LD between pairs of genetic variants.

BH FDR Benjamini-Hochberg FDR-controlling procedure.

BY FDR Benjamini-Yekutieli FDR-controlling procedure.

BMI Body mass index.

CAD Coronary artery disease.

CAS Childhood Asthma Study.

CD Cluster of differentiation; CD markers are used to distinguish cell surface molecules.

cDNA Complementary DNA.

CNV Copy number variant.

COPD Chronic obstructive pulmonary disease.

DC Dendritic cell.

DICE Database of immune cell expression, eQTLs, and epigenomics.

DILGOM The Dietary, Lifestyle, and Genetic determinants of Obesity and Metabolic syndrome Study.

DNA Deoxyribonucleic acid.

dsQTL DNase I sensitivity quantitative trait locus.

eGene A gene that has eQTLs.

ENCODE The Encyclopedia of DNA Elements project.

eQTL Expression quantitative trait locus.

eQTS Expression quantitative trait score.

eSNP A SNP that is associated with expression levels of an eGene (an eQTL SNP).

FDR False discovery rate.

GRS Genomic risk score.

GTEx The Genotype-Tissue Expression Project.

GWAS Genome-wide association studies.

HapMap The International Haplotype Map Project.

HipSci The Human Induced Pluripotent Stem Cell Initiative.

HLA Human leukocyte antigen.

HRC The Haplotype Reference Consortium.

HWE Hardy-Weinberg Equilibrium.

IFN Interferon.

Ig Immunoglobulin.

IL Interleukin.

ImmVar The Immune Variation project.

indel Insertion or deletion.

INT Inverse normal transformation.

iPSC Induced pluripotent stem cell.

IV Instrumental variable.

kb Kilo base pairs.

LCL Lymphoblastoid cell line.

LD Linkage disequilibrium.

LoF Loss-of-function variant.

LPS Lipopolysaccharide.

MAF Minor allele frequency.

Mb Mega base pairs.

meQTL Methylation quantitative trait locus.

miR-eQTL MicroRNA expression quantitative trait locus.

MR Mendelian randomisation.

mRNA Messenger RNA.

NK Natural killer cell.

OR Odds ratio.

PBMC Peripheral blood mononuclear cell.

PHA Phytohemagglutinin.

PP Posterior probability.

pQTL Protein quantitative trait locus.

QTL Quantitative trait locus.

reQTL Response eQTL.

RNA Ribonucleic acid.

RNA-seq RNA sequencing technology.

SNP Single nucleotide polymorphism.

sQTL Splicing quantitative trait locus.

ST FDR Storey-Tibshirani FDR-controlling procedure.

SV Structural variant.

TF Transcription factor.

TPR True positive rate.

TRS Transcriptional risk score.

TSS Transcription start site.

TWAS Transcriptome-wide association study.

UTR Untranslated region.

WES Whole exome sequencing.

WGS Whole genome sequencing.

Chapter 1

Literature Review

1.1 Complex diseases and their genetic background

Human diseases and conditions vary in complexity. It is estimated that over 10,000 diseases are known

to have a single genetic cause1. These Mendelian disorders, such as cystic fibrosis, are usually caused

by dysfunction of a single gene and follow classic inheritance patterns. However, many common

complex human diseases, such as autoimmune diseases, respiratory diseases, and cardiovascular

diseases, are not monogenic. These complex diseases are influenced by both genetic and environmental

factors, and are often highly polygenic, meaning that many genes with small effect sizes contribute

simultaneously to the risk of developing the disease. Environmental factors, such as diet and exercise,

also contribute to disease pathogenesis and age of onset. The same risk factor may have different effects

on disease susceptibility across individuals, which suggests that interaction between the genetic

background and the environmental exposure influences disease development. Untangling the effect of

genetics and environmental stimuli is a major challenge in health research and genomics. A better

understanding of the human genetic architecture and how it interacts with environmental factors has the

potential to guide better prevention and more effective treatment, ultimately reducing the burden on

health care systems.

In this first section, I will illustrate how genetic factors contribute to disease pathogenesis by discussing

large-scale association studies that have been successful in identifying the genetic components of

common complex diseases, and the molecular mechanisms underlying genetic risk factors in disease

pathogenesis. I will also present how findings from these cohort studies can benefit other areas of research

in genomics.

1.1.1 The human genome

The diploid human genome comprises 22 pairs of autosomal chromosomes and one pair of sex

chromosomes, and hereditary information is encoded in deoxyribonucleic acid (DNA) sequences. DNA

is the hereditary material of all organisms, and also plays a fundamental role in the central dogma of

molecular biology – DNA is first transcribed to messenger ribonucleic acid (mRNA), and then

translated to protein2. There are four types of nucleobases: adenine (A), thymine (T), cytosine (C), and

guanine (G), and the sequence of these bases encodes the genetic information in DNA. In total, there

are approximately three billion base pairs in the human genome, and protein-coding regions only make

up less than 2%3.

The publication of the first draft of the human genome in 2001 by the Human Genome Sequencing

Consortium, as part of the Human Genome Project4, marked the start of a better understanding of the

functions of genes and intergenic regions within the human genome. More than 80% of the genome has

been assigned biochemical functions by the Encyclopedia of DNA Elements (ENCODE) project3,

although only a small proportion encodes proteins. Additionally, genetic diversity in human populations

has been characterised by the 1000 Genomes Project and the Haplotype Reference Consortium, and a

detailed catalogue of various forms of genetic variation in the human genome is publicly available5,6.

Together, these achievements have reshaped genomics research, which has recently shifted towards a

more hypothesis-free and systematic approach7.

1.1.2 Human genetic variation

The genetic basis of phenotype variation across individuals, such as molecular functions, metabolic

processes, and health status, is linked to variation in the genome. Genetic (DNA) variants can occur in

different forms including single nucleotide polymorphisms (SNPs), short insertions or deletions (indels),

copy number variants (CNVs), and other larger structural variants (SVs) including duplications and

inversions. SNPs have been the most commonly used genotype data in human genomics. A SNP is a

single-nucleotide substitution at a specific position in the genome. SNPs and other genetic variants are not

independent of each other: linkage disequilibrium (LD; i.e. non-random association of genetic

variants) often occurs between nearby genetic variants, meaning that certain alleles of two variants tend to

occur together. As a result, if a set of genetic variants are in high LD with each other, which means that

the corresponding genotypes are highly correlated, we can use one of them as a representative of the

rest (which are then said to be “tagged”). SNPs can be assayed in a cost-effective and high-throughput manner, and other

variants, such as CNVs, can also be tagged by SNPs (typically if the LD r2 ≥0.8)8.
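The tagging criterion above can be illustrated with a minimal sketch. It assumes that r2 is approximated by the squared Pearson correlation between unphased allele counts (0, 1, or 2 copies), which is a common shortcut rather than the haplotype-based definition; the function name `ld_r2` and the genotype vectors are hypothetical and exist only for illustration.

```python
from statistics import mean


def ld_r2(g1, g2):
    """Squared Pearson correlation between two allele-count vectors (0/1/2)."""
    m1, m2 = mean(g1), mean(g2)
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    var1 = sum((a - m1) ** 2 for a in g1)
    var2 = sum((b - m2) ** 2 for b in g2)
    if var1 == 0 or var2 == 0:
        raise ValueError("monomorphic variant: r2 is undefined")
    return cov * cov / (var1 * var2)


# Hypothetical allele counts for three variants in eight individuals.
snp_a = [0, 1, 2, 1, 0, 2, 1, 0]
snp_b = [0, 1, 2, 1, 0, 2, 1, 1]  # differs from snp_a in one individual
snp_c = [2, 0, 1, 1, 0, 2, 0, 1]  # largely unrelated to snp_a

r2_ab = ld_r2(snp_a, snp_b)
r2_ac = ld_r2(snp_a, snp_c)

# Apply the conventional tagging threshold mentioned in the text.
a_tags_b = r2_ab >= 0.8
a_tags_c = r2_ac >= 0.8
```

In this toy data, snp_a would be accepted as a tag for snp_b but not for snp_c. Real analyses would normally rely on phased haplotypes or a dedicated tool rather than this approximation.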

Novel high-throughput genotyping technologies have led to a substantial decrease in sequencing costs,

and it is now feasible to analyse genetic variants across thousands of entire genomes from large cohort

studies. The most common and cost-effective approach to collecting data on human genetic variation is

by using SNP array platforms9. SNP arrays contain a few hundred to up to several million probes, each

targeting different SNPs. The probes are designed based on known common variants, guided by

resources of human genetic variants. The International Haplotype Map (HapMap) Project (phase III),

completed in 2009, genotyped 1.6 million common SNPs in 1,183 individuals from diverse ethnic

backgrounds10. By analysing 2,504 individuals from different ethnic groups, the 1000

Genomes Project characterised more than 88 million variants, among which 84.7 million (96%) were

SNPs, 3.6 million were indels, and 60,000 were structural variants5.

Importantly, the number of probes on SNP arrays is limited. To address this, genotype imputation is

commonly used to predict genotypes of variants that are not directly measured using SNP arrays. A

human haplotype is a set of nearby alleles that tend to occur together on one chromosome. Genotypes

are imputed based on haplotypes in a reference panel11. Various imputation methods are available, such

as IMPUTE212 and minimac313; the latter has been implemented into a user-friendly web service – the

Michigan Imputation Server (https://imputationserver.sph.umich.edu/). The affordable SNP array

platforms together with the imputation strategy allow for quantification of various genetic variants in a

large number of samples, making it possible to identify genetic variants, even those with small effects,

that are associated with common complex diseases and traits, which can lead to better understanding of

the genetic basis of disease susceptibility.

Unlike SNP array platforms that depend on prior knowledge of genetic variation, so-called “next-

generation sequencing” technologies enable us to identify and measure additional genetic variation,

especially rare and low-frequency variants that are not necessarily captured by SNP arrays. Whole

genome sequencing (WGS) aims to determine the complete DNA sequence of an individual’s entire

genome. An alternative to sequencing the entire genome is to focus on specific regions of

interest. Whole exome sequencing (WES) targets protein-coding exons, which are

expected to be more likely to harbour causal variants, such as those that may alter protein functions.

Protein-coding exons make up only a small proportion of the total genome (1%–2%)3, thus the cost of

exome sequencing is greatly reduced compared to WGS.

1.1.3 Genome-wide association studies

Historically, linkage analysis has been a successful method in identifying genetic factors associated

with rare Mendelian disorders by detecting genetic variants that are co-inherited with disease within

pedigrees. For example, cystic fibrosis, a Mendelian recessive disease, was found to be caused by

various mutations in both copies of the CFTR gene that encodes the cystic fibrosis transmembrane

conductance regulator protein14,15. Common complex diseases and traits such as inflammatory bowel

disease and body mass index (BMI) are polygenic, and the genetic components underlying these traits

often have much smaller effects than those underlying rare Mendelian disorders. Linkage analysis is

underpowered to detect loci with modest effect sizes, because of the limited sample size due to the

requirement to recruit related individuals, as well as the properties of statistical tests (e.g. tests for

linkage take place in the random effects part of the model, which is less statistically powerful compared

with testing the fixed effects in association analysis).

Recent advances in high-throughput genotyping and sequencing technologies and the development of

genome-wide association studies (GWAS) have offered powerful alternatives to linkage analysis in

understanding genetic components of complex traits. Association studies aim to identify genetic

variants with varying allele frequencies in individuals with different traits or disease status, and these

genetic variants can possibly contribute to the phenotypic differences. Early association studies focused

on specific genomic regions, or a selected set of candidate genes. Reduced cost in genotyping makes it

now feasible to perform a hypothesis-free, genome-wide scan5,10,16. One of the first GWAS was

published in 2005 and focused on age-related macular degeneration, which causes incurable blindness

in the elderly17. This study identified a disease-associated variant located in an intron of the CFH

(complement factor H) gene using 146 samples17. In 2007, the Wellcome Trust Case Control

Consortium (WTCCC) performed one of the first large-scale GWAS, in which seven diseases were

examined using about 2,000 cases per disease and 3,000 shared controls18. Many GWAS have since been published,

accompanied by rapid advancements in this field. The sample size of GWAS is now typically in the

tens of thousands to even hundreds of thousands. An example is the UK Biobank project

(https://www.ukbiobank.ac.uk), which collects health information and genotype data for more than

500,000 individuals. Additionally, a great number of variants associated with an ever-growing number

of various phenotypes have been reported and made publicly available: 71,673 associations from more

than five thousand GWAS were included in the NHGRI-EBI GWAS Catalog

(https://www.ebi.ac.uk/gwas/) as of September 201819. Other web-based resources offering curated collections

of GWAS results are available, such as LD Hub (http://ldsc.broadinstitute.org/ldhub/)20 and

ImmunoBase (https://www.immunobase.org). ImmunoBase focuses on immunologically related

diseases, and includes studies using ImmunoChip microarrays, which are designed for this subset of

diseases21.

In addition to the primary goal of GWAS, which is to identify genetic loci that are significantly

associated with complex traits and diseases, the findings of GWAS can also be utilised in other ways,

such as disease risk prediction. Genomic risk scores (GRSs) are used to estimate an individual’s risk of

developing certain diseases22. GRSs are calculated based on the genotype data of selected variants and

effect sizes obtained from GWAS in independent cohorts. GWAS results can also be used in Mendelian

randomisation (MR) analysis, a causal inference method in which genetic variants are used as

instrumental variables (IVs) to avoid reverse causality and confounding effects23. MR studies have

provided additional insights on causal factors of diseases beyond traditional observational studies. For

example, vitamin D levels are observed to be correlated with the risk of coronary artery disease (CAD);

however, a well-powered MR study showed that vitamin D might not be a causal factor for CAD24.

There are three assumptions for genetic variants to be valid IVs25. Firstly, IVs are correlated with the

risk factor (or exposure). Independent genetic variants that are significantly associated with the

exposure are selected as IVs in MR. Secondly, IVs are not correlated with confounders that affect both

the exposure and the outcome (e.g. income affects the intake of vitamin E and the risk of heart disease,

thus leading to a spurious relationship between these two26). Thirdly, IVs are only potentially associated

with the outcome through the exposure variable (and not associated with the outcome conditional on

the risk factor and confounders). It is not necessary for a genetic IV to be significantly associated with

the outcome. If there is a single IV, the Wald Ratio test is usually applied to perform two-sample MR,

which uses only summary statistics from association studies in two cohorts. Various other methods can

be used if multiple IVs are available, and these methods usually have different assumptions. Inverse

variance weighted methods are generally more powerful, but they cannot be applied to cases of

unbalanced horizontal pleiotropy (horizontal pleiotropy: one genetic locus influences multiple

phenotypes through independent pathways)27. MR-Egger regression enables causal inference even if all IVs are invalid, provided that the

“Instrument Strength Independent of Direct Effect” (InSIDE) assumption holds (i.e. the pleiotropic effects

are not correlated with the effects on the exposure)28. Median-based methods assume that more than

half of the IVs are valid (unweighted methods), or valid IVs have more than half of the weight (weighted

methods)29. In practice, measurements of intermediate molecular phenotypes such as biomarkers or

gene expression traits are often available only in relatively small cohorts. Two-sample MR30 is particularly

useful because we can take advantage of GWAS summary statistics from external large cohorts for

which the intermediate phenotype of interest is not measured. Tools have been developed to include

publicly available GWAS summary statistics on various traits and diseases to perform two-sample MR

analysis, such as MR-Base (http://www.mrbase.org)31, in which the above various MR methods are all

implemented.
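The two-sample estimators described above can be sketched from summary statistics alone. This is a minimal illustration, assuming independent instruments, a first-order standard error for the Wald ratio, and a fixed-effect inverse variance weighted (IVW) model; the function names and all effect sizes below are hypothetical and are not taken from any of the cited tools.

```python
import math


def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Single-instrument causal estimate with a first-order standard error."""
    if beta_exposure == 0:
        raise ValueError("instrument has no effect on the exposure")
    estimate = beta_outcome / beta_exposure
    se = se_outcome / abs(beta_exposure)
    return estimate, se


def ivw(betas_exposure, betas_outcome, ses_outcome):
    """Fixed-effect inverse variance weighted estimate across instruments."""
    rows = list(zip(betas_exposure, betas_outcome, ses_outcome))
    numerator = sum(bx * by / se ** 2 for bx, by, se in rows)
    denominator = sum(bx ** 2 / se ** 2 for bx, _, se in rows)
    return numerator / denominator, math.sqrt(1.0 / denominator)


# Hypothetical summary statistics for three independent instruments:
# SNP-exposure effects from one cohort, SNP-outcome effects from another.
bx = [0.10, 0.20, 0.15]
by = [0.05, 0.10, 0.075]
se_y = [0.01, 0.02, 0.015]

single_estimate, single_se = wald_ratio(bx[0], by[0], se_y[0])
ivw_estimate, ivw_se = ivw(bx, by, se_y)
```

Because every instrument in this toy example implies the same ratio, the IVW estimate equals the single-instrument Wald ratio but has a smaller standard error, which reflects the gain in power from combining instruments. The sketch does not address pleiotropy, for which MR-Egger or median-based estimators would be needed.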

Typical GWAS investigate binary phenotypes and have a case-control design, where the comparison is

made between, for example, one group of individuals with diagnosed diseases (cases), and the other

group of healthy individuals (controls)32. GWAS can also use population-based cohorts, which are often

recruited to investigate continuous traits, such as height or BMI33. Univariate regression models are

commonly used in GWAS to identify significant associations between the trait and allele count of each

genetic variant. Multiple testing correction is also critical, since millions of genetic variants are tested

simultaneously and the probability of observing significant results simply due to chance

is high if a significance level of 0.05 is used for each test. Conventionally, a stringent genome-wide

significance threshold of 5×10⁻⁸ is used in GWAS in European populations. This threshold is equivalent

to a Bonferroni-adjusted P-value threshold (0.05 divided by the number of tests), where the number of tests

is taken to be the estimated number of independent genetic variants (one million)34.
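A single-variant test and the genome-wide threshold can be sketched as follows. This is a minimal illustration, assuming a continuous trait, an additive allele-count model without covariates, and a normal approximation to the test statistic (reasonable only for large samples); the function names and the toy data are hypothetical.

```python
import math


def linear_association(genotypes, trait):
    """Regress a continuous trait on allele count; return (beta, se, p)."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(trait) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, trait))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_g
    rss = sum((y - intercept - beta * g) ** 2 for g, y in zip(genotypes, trait))
    se = math.sqrt(rss / (n - 2) / sxx)
    # Two-sided P-value using a normal approximation to the t statistic.
    p_value = math.erfc(abs(beta / se) / math.sqrt(2))
    return beta, se, p_value


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error at alpha."""
    return alpha / n_tests


# Toy data: 600 individuals, true per-allele effect of 0.5, and a
# deterministic "noise" pattern that is uncorrelated with genotype.
genotypes = [0, 0, 1, 1, 2, 2] * 100
noise = [1.0, -1.0] * 300
trait = [0.5 * g + e for g, e in zip(genotypes, noise)]

beta, se, p_value = linear_association(genotypes, trait)
threshold = bonferroni_threshold(0.05, 1_000_000)  # approximately 5e-8
is_genome_wide_significant = p_value < threshold
```

In practice, GWAS software fits the same kind of model for every variant, usually with covariates such as age, sex, and ancestry components, and uses logistic regression for case-control traits.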

Using GWAS to identify the genetic components of diseases has multiple advantages: larger sample

sizes are possible, quality control procedures are stringent, and false positives can be controlled with

appropriate multiple testing correction procedures. GWAS were initially proposed based on the

common disease/common variants (CD/CV) hypothesis, which assumes that the frequencies of genetic

variants responsible for common diseases are not too low35. Despite the success in identifying common

variants, GWAS have limited power in detecting rare variants (allele frequency <1%) that are associated

with complex diseases or other traits. Potential reasons for this could be the failure in capturing rare

variants using standard genotyping arrays, low imputation accuracy of rare variants, or the lower power

to detect low-frequency variants with small effect sizes using current sample sizes32,36.

Nevertheless, many variants associated with disease risks and complex traits have been well replicated

within and across populations with different ethnic backgrounds37,38, indicating that they are true signals.

However, due to the complex LD structure among nearby variants, it is often not possible to identify

the causal variant from multiple significant variants at a trait-associated locus in observational studies

without further investigation of biological functions. Indeed, it remains very challenging to link the true

associations to the mechanisms by which the associated genetic variants contribute to phenotypes,

because very few associated variants identified in GWAS are missense mutations (point mutations

leading to changes in protein sequence) or loss-of-function (LoF) variants, which are located in coding

regions and can cause changes in the sequence of downstream RNA or protein products. Some missense SNPs

are neutral (i.e. do not affect protein functions), whereas others may alter the function or structure

of proteins, or affect binding affinity of proteins, thus contributing to diseases and traits such as colon

cancer39,40. LoF variants can alter or disrupt transcripts and consequently protein functions in various

ways: (1) introducing a stop codon (nonsense), (2) disrupting a splice site, or (3) disrupting a reading frame

or removing a large coding sequence (indel)41. Synonymous SNPs located in protein-coding regions do

not change the corresponding amino acid of the encoded protein; thus, they are often

considered to be inconsequential. However, they can also have strong effects on many cellular processes42,

such as splicing events and mRNA stability43. The majority of the trait- or disease-associated SNPs

(>90%) identified in GWAS are located in non-coding regions44, for which any biological interpretation

can be challenging. It has been reported that GWAS SNPs are enriched in regulatory regions of the

genome44, which suggests that they may play a role in genetic regulation of downstream genes45.

Regulatory variants may affect gene expression levels, alternative splicing, or mRNA degradation, and

these regulatory effects are often specific to certain cell types or tissues45.

1.2 Gene expression, a fundamental intermediate phenotype

Gene expression, or the expression of messenger RNA (mRNA) transcripts from genes, is a

fundamental molecular trait, which links the DNA sequence to individual-level phenotypes such as

disease susceptibility. In this section, I will discuss spatiotemporal patterns of gene expression and the

molecular mechanisms of gene expression regulation. I will also introduce two major high-throughput

technologies of gene expression quantification.

There are in total approximately 20,000 protein-coding genes in the human genome annotated by the

GENCODE project46,47. A typical eukaryotic gene consists of coding and non-coding sequences

and regulatory elements, including promoters located next to the transcribed sequences, and distal

regulatory elements such as enhancers and silencers48,49. Protein-coding genes produce proteins through

a two-step process: transcription and translation. Transcription is a process that occurs in the nucleus

where one of the DNA strands (the template or antisense strand) of a gene is read by RNA polymerase,

which synthesises a single-stranded RNA complement (precursor messenger RNA) in the 5’ to 3’ direction.

Eukaryotic genes contain exon and intron regions. Introns are spliced out during post-transcriptional

processing by the spliceosome machinery and exons are retained. RNA splicing often results in different

transcripts from the same gene, known as isoforms, and this is called alternative splicing. The most

common form of alternative splicing is exon skipping; it is also possible for introns to be retained. To form

a mature mRNA, a 5’ cap and a poly-adenosine (poly-A) tail are added to the start and the end of the

transcript. The mature mRNA is then transported to the cytoplasm, where it is translated by ribosomes

into proteins. Gene expression is a tightly regulated process, and the regulation can happen in any of

the above steps. Gene expression regulation is of crucial importance; it makes sure genes are expressed

in certain tissues or cell types, at the correct developmental stage, with the desired amount of expression.

A great proportion of the human genome does not encode proteins; however, more than 80% of the

genome has been assigned biochemical functions, and is probably involved in various levels of gene

regulation3.
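The transcription and splicing steps described above can be mimicked with a toy example. This sketch assumes the template strand is written in the 3’ to 5’ direction, so that the RNA can be read off base by base in the 5’ to 3’ direction; the sequence, the exon coordinates, and the function names are all hypothetical, and capping, polyadenylation, and translation are omitted.

```python
# Base pairing used during transcription (DNA template base -> RNA base).
TEMPLATE_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe(template_3_to_5):
    """Return the pre-mRNA (5'->3') complementary to a template strand (3'->5')."""
    return "".join(TEMPLATE_TO_RNA[base] for base in template_3_to_5)


def splice(pre_mrna, exons):
    """Join exon segments, given as (start, end) half-open intervals."""
    return "".join(pre_mrna[start:end] for start, end in exons)


# Hypothetical 18-base template strand, written 3'->5'.
template = "TACCGGCATTCAATCATT"
pre_mrna = transcribe(template)

# Two exons flanking one intron that starts with GU and ends with AG.
exons = [(0, 6), (15, 18)]
mature_mrna = splice(pre_mrna, exons)

# Skipping or retaining segments would yield other isoforms; here the
# intron is retained by treating the whole transcript as one exon.
intron_retained = splice(pre_mrna, [(0, 18)])
```

Here the mature transcript keeps only the two exons, whereas the intron-retaining isoform is identical to the precursor, which mirrors how different isoforms can arise from one gene.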

1.2.1 Regulation of gene expression

The genetic information in most cells in a living organism is stable, but gene expression is dynamic

across cell types and conditions. The types and the abundances of gene products are not the same across

cells. The same cell can have different gene expression profiles at various developmental stages or

under different conditions. These spatiotemporal patterns of gene expression are critical for our

understanding of tissue functions. In addition, dysregulation of gene expression patterns is involved in

disease pathogenesis50. Various approaches have been developed to identify the spatiotemporal patterns

of gene expression, such as reporter genes (e.g. green fluorescent protein) to indicate the expression of

the target gene of interest, and in situ hybridisation to localise a certain nucleotide sequence in fixed

tissues.

With the advent of recent high-throughput gene expression profiling technologies, we have just started

to describe and understand cell type- or tissue-specific patterns in humans. The Genotype-Tissue

Expression (GTEx) project51,52 provides to date the largest multi-tissue resource of gene expression. In

the GTEx pilot data published in 2015, Melé et al. observed transcriptional signatures composed of a

relatively small number of genes in most tissues with few exceptions53. Brain regions showed the most

distinct characteristics among all solid tissues and were enriched for splicing events. Variation in

transcript abundances was larger across tissues than across individuals. Many genes were differentially

expressed across tissues, but only a small number of genes (<200) were specific to one tissue. In the

current V7 data release of the ongoing GTEx project (https://gtexportal.org/home/), there are more than

11,000 post-mortem samples representing 53 regions or cell lines collected from 714 donors. GTEx

provides an extensive resource to characterise tissue-specific regulatory effects on the transcriptome,

but immune cell types are not included in the GTEx project. The Immune Variation (ImmVar) project54

focuses on gene expression variation in innate and adaptive immune cells, and aims to understand

various immune functions in healthy humans. The recent DICE (database of immune cell expression,

expression quantitative trait loci, and epigenomics) project collected 13 immune cell types, two of

which were also assayed after activation, from 91 healthy individuals, and found that sex had major

effects on immune cell gene expression55.

Regulation of gene expression prepares cells to respond to signals from both inside of the organism (e.g.

cytokines) and the external environment. Gene expression regulation is crucial for cellular

differentiation (e.g. activation of immune systems by pathogens). It is a sophisticated program that

involves a variety of mechanisms at different stages of transcription and translation. For example, the

switching-on and off of genes and the production rate of RNAs and proteins can be affected by

chromatin accessibility, transcription initiation which is influenced by methylation status, alternative

splicing, mRNA stability and degradation, and post-translational modification of the protein product.

Gene expression regulation is a process of fundamental importance, and has been exhaustively studied.

Regulatory components can be classified into two types: cis- and trans-regulatory elements, with the

former being adjacent DNA sequences and the latter being DNA-binding proteins including

transcription factors (TFs). Typical cis-regulatory elements include promoters and enhancers; their

epigenetic modifications (e.g. histone modifications) and the proteins that bind them affect transcription

initiation and regulation56. TFs can increase or suppress the transcriptional activity by binding to cis-

regulatory elements of target genes. TFs contain DNA-binding domains that target specific DNA

sequences on the genome57. In addition, genes in the same biochemical pathway interact with each other

and are often co-regulated, and a network framework can provide improved modelling of the

relationships among transcripts.

1.2.2 High-throughput transcriptome data

A transcriptome is defined as the complete set of RNA transcripts in one or a group of cells, including

mRNAs and non-coding RNAs. Unlike the genome, which is fixed in an organism, the transcriptome

varies, sometimes dramatically, across cell types, developmental stages, and external environmental

stimuli. Thus, characterising transcriptomes provides insights on the function of genes, gene expression

variation, and gene regulation under different conditions. Two main high-throughput technologies are

used to quantify gene expression profiles: hybridisation-based approaches such as oligonucleotide

microarrays58 and sequence-based approaches such as RNA sequencing (RNA-seq)59.

Microarray gene expression profiling measures gene expression levels by hybridisation between probes

attached to specific spots on the array and complementary DNA (cDNA) synthesised from mRNA

molecules in given samples. Like other types of DNA microarrays, probes on the array are designed

based on sequences of identified and annotated genes. Gene expression microarray probes are usually

sequences of a certain number of nucleotides (e.g. 50 base pairs for Illumina HT12 microarray

platforms). The probe sequences can uniquely match certain gene transcripts. Meanwhile RNA-seq,

based on recent deep-sequencing technologies, does not rely on prior knowledge of gene sequences. Its

advantages include, for example, the ability to identify novel gene isoforms derived from alternative

splicing and to measure the abundances of different gene isoforms60. RNA-seq can also

provide information on allele-specific expression at heterozygous sites60. Unlike microarrays that are

prone to probe saturation, RNA-seq can detect a broader range of gene expression levels. Microarrays

have been more widely adopted due to their lower cost and relatively mature analytical methods. While

both RNA-seq and microarrays are affected by technical variation such as batch effects, those affecting

microarrays tend to be better understood and can be corrected for statistically. RNA-seq used to be less

affordable compared with microarray platforms; however, the cost of RNA-seq is decreasing and it is

becoming more widely used in this field.

Study design of microarray analysis and major steps in statistical analysis have been reviewed

previously by Allison et al.61, Leung and Cavalieri62, and Quackenbush63, among others. After RNA extraction and

hybridisation of samples to arrays, array digital images are generated where the signal intensity for each

spot (or probe) reflects the amount of expression of a certain transcript. Raw values reflecting the signal

intensity are extracted in image analysis; this step is usually performed by the manufacturer of the array

platform. Next, quality control, such as removal of samples or arrays of low quality, is carried out to

boost the biological signal. Background intensity signal is corrected for in each array. Normalisation is

required to remove variation caused by non-biological effects, and to make gene expression from

different samples or arrays comparable. Quantile normalisation, which aims to make each sample have

the same probe intensity distribution, is a robust and commonly used normalisation method for

microarray data64. Quantile normalisation is an unsupervised approach, meaning that it does not use

information about the study design during normalisation. There are also supervised approaches where biological

variables are fitted65. The normalised data are then ready for downstream analysis to answer biological

questions.
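Quantile normalisation as described above can be sketched in a few lines. This is a minimal illustration, assuming each sample is a list of probe intensities in the same probe order and that there are no tied values within a sample (ties would need averaging, which is omitted here); the function name and the intensities are hypothetical.

```python
def quantile_normalise(samples):
    """Give every sample the same intensity distribution (the mean quantiles)."""
    n_probes = len(samples[0])
    if any(len(sample) != n_probes for sample in samples):
        raise ValueError("all samples must contain the same probes")

    # Reference distribution: mean intensity at each rank across samples.
    sorted_samples = [sorted(sample) for sample in samples]
    reference = [sum(column) / len(samples) for column in zip(*sorted_samples)]

    normalised = []
    for sample in samples:
        # Probe indices ordered from lowest to highest intensity.
        order = sorted(range(n_probes), key=lambda i: sample[i])
        result = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            result[probe_index] = reference[rank]
        normalised.append(result)
    return normalised


# Hypothetical raw intensities: three samples, four probes each.
samples = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 3.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalised = quantile_normalise(samples)
```

After normalisation, every sample contains the same set of values, and only the assignment of values to probes differs, so the rank of each probe within its sample is preserved.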

1.3 Genetic basis of gene expression variation

GWAS have been a success in identifying genetic variants that are associated with common human

diseases and complex traits, aiding in our understanding of the underlying genetic background of these

phenotypes19. However, it still remains a challenge to interpret the GWAS findings, since a great

majority of the trait- or disease-associated variants (estimated at approximately 93%44) are located in

non-coding regions of the human genome. In addition, complex traits are affected by multiple genetic

loci with relatively small effects compared to rare Mendelian diseases45. GWAS variants are often

located in regulatory regions44, and these variants may influence human traits by affecting regulatory

elements, which may result in altered gene expression45. Studies investigating the genetic basis of gene

expression variation have aided in the interpretation of GWAS findings by uncovering their regulatory

effects on gene expression, a critical intermediate cellular phenotype, and linking the human genetic

variants to higher-order phenotypes66. Analysing gene expression levels or integrating eQTLs to

perform fine-mapping of causal variants and improve functional interpretation has become a routine

follow-up of GWAS. In this section, I will discuss the characteristics of eQTLs identified in human

tissues, and how they are linked to disease pathogenesis.

1.3.1 Expression quantitative trait loci (eQTLs)

Expression quantitative trait loci (eQTLs) are genetic regions that regulate gene expression levels. A

typical eQTL study aims to identify genetic variants that are associated with expression variation in a

cohort67,68. Since the first genome-wide eQTL analysis was carried out in human pedigrees in 200469,70,

we have started to understand the genetic basis and heritable components of gene expression variation

across multiple tissues, cell types, and populations with different ancestry. Results from eQTL studies

can also contribute to the identification of risk genes and to the interpretation of disease associations by

uncovering the regulatory effects of disease-associated genetic variants, which sheds light on potential

mechanisms through which these variants contribute to complex traits and diseases.

Early eQTL studies were performed in lymphoblastoid cell lines (LCLs), which were immortalised cell

lines derived from individuals of the HapMap project and the 1000 Genomes project71-77. Diverse

populations were explored for eQTLs in these projects73,74,76. Stranger et al. identified eQTLs in LCLs

from eight HapMap populations, and observed a large proportion of eQTL sharing across populations73.

They also observed that the effect sizes and directions of these shared eQTLs were highly consistent

across populations73. The RNA-seq technology to quantify transcript expression levels was first used in

eQTL analysis in 2010, leading to the characterisation of the genetic regulation on splicing events75,77.

An increasing number of studies now investigate eQTLs in various primary human tissues and cells,

since the immortalised LCLs might not recapitulate in vivo genetic regulation of transcriptomes. One

of the first large scale eQTL studies using natural human tissues was performed by Goring et al. in

2007, in which native peripheral blood lymphocytes were used78. Goring et al. identified an eQTL of

the VNN1 gene located in its promoter, and this eQTL also influenced high-density lipoprotein

cholesterol concentrations, demonstrating the use of eQTLs in identifying potential genes that affect

human traits78. The first eQTL analysis in disease-relevant tissues was published in 2008, in which

human liver samples were used and susceptibility genes were identified for relevant diseases (e.g.

RPS26 for type 1 diabetes)79. Earlier eQTL studies were mostly performed in tissues with a mixture of

different cell types, such as whole blood or peripheral blood80-84, peripheral blood mononuclear cells

(PBMCs)85, and adipose tissues86,87. More tissue and cell types have been investigated to uncover the

cell type- or tissue-specific genetic regulation of gene expression. The GTEx project has described the

patterns of eQTL sharing across 44 human tissues in the V6p data release, and it was observed that

local regulatory variants tended to be either specific to a small number of tissue types, or shared across

most tissues52. The diverse tissue sampling in the GTEx project provides a valuable comprehensive

multi-tissue resource of genetic regulation of gene expression.

Immune cells have pivotal roles in humans, and the dysregulation of immune functions is involved in the

pathogenesis of many diseases. The first eQTL study using multiple cell types from the same individuals

investigated primary fibroblasts, LCLs, and T cells from 75 individuals, and observed that more than

half of the gene regulatory variants were cell-type specific88. Studies have mapped eQTLs in

homogeneous immune cell types such as in monocytes89 and neutrophils90,91, and also characterised the

cell specificity of eQTLs using multiple immune cells from the same subjects55,92-96. Findings from these

eQTL studies in immune cells have uncovered the cell type-specific regulatory effect of genetic variants

associated with risk for immune-related diseases such as rheumatoid arthritis94 as well as

neurodegenerative diseases such as Alzheimer’s disease and Parkinson’s disease95.

1.3.2 Cis- and trans-eQTLs

To identify eQTLs, we perform an association test between genotypes of each genetic variant (encoded

as 0, 1, or 2 indicating the allele dosage) and expression levels of a transcript in a study cohort. High-throughput

technologies now allow for measurement of millions of genetic variants (mostly SNPs) by

genotyping arrays and imputation strategies, and quantification of gene expression levels using

microarrays or RNA-seq. Genes with significant eQTLs are called eGenes. An eQTL usually refers to

one genomic region or locus, and the SNPs on this locus that are significantly associated with gene

expression are called eSNPs (or eVariants).
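
The association test described above can be sketched as a simple linear regression of expression on allele dosage. This is a minimal illustration under stated assumptions: the function name is hypothetical, no covariates are included, and the returned t-statistic would still need to be converted to a p-value and corrected for multiple testing in a real analysis.

```python
import math
from typing import Sequence, Tuple


def eqtl_association(dosage: Sequence[float],
                     expression: Sequence[float]) -> Tuple[float, float, float]:
    """Regress expression on allele dosage (0, 1, or 2) for one variant-gene pair.

    Returns (effect size, standard error, t-statistic). Assumes the dosage
    varies across individuals and the fit is not perfect (non-zero residuals).
    """
    n = len(dosage)
    mean_g = sum(dosage) / n
    mean_e = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in dosage)
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(dosage, expression))
    beta = sxy / sxx                      # change in expression per allele copy
    intercept = mean_e - beta * mean_g
    rss = sum((e - intercept - beta * g) ** 2 for g, e in zip(dosage, expression))
    se = math.sqrt(rss / (n - 2) / sxx)   # standard error of the slope
    return beta, se, beta / se
```

A genome-wide analysis repeats this test for every variant-gene pair of interest, which is why the number of tests, and hence the multiple testing burden, grows quickly.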

Based on the distance from the associated eGenes, cis-eQTLs (or local eQTLs) are defined as regulatory

variants located within the conventional distance threshold of one megabase (1 Mb) of the

transcription start site (TSS). Cis-eQTLs are prevalent across the human genome. The largest eQTL

meta-analysis to date, using more than 30,000 blood samples, reported that 88% of the studied genes

were cis-eGenes84. GTEx identified that 86% of the protein-coding genes had cis-eQTLs in at least one

of the 44 tissues in the V6p data release, and those that did not have a cis-eQTL in any of the tested tissues

tended to be genes intolerant of loss-of-function mutations or genes with low expression levels52. Cis-eQTLs

are enriched for variants located close to the TSS of the associated eGene, and the enrichment of cis-

eQTLs is often observed as a function of distance to the TSS52,83-85. This is consistent with the potential

mechanisms of cis-eQTLs; for example, promoter variants affect the expression of nearby genes by

regulating the binding of TFs.
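
The distance-based definition above amounts to a simple window check around the TSS. The sketch below assumes hypothetical argument names and treats the 1 Mb boundary as inclusive; individual studies may use a different window or reference point.

```python
CIS_WINDOW_BP = 1_000_000  # conventional 1 Mb threshold described in the text


def is_cis(variant_chrom: str, variant_pos: int,
           gene_chrom: str, gene_tss: int,
           window: int = CIS_WINDOW_BP) -> bool:
    """Return True if the variant lies within `window` bp of the gene's TSS
    on the same chromosome (boundary treated as inclusive, an assumption)."""
    return variant_chrom == gene_chrom and abs(variant_pos - gene_tss) <= window
```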

More distant eQTLs or those located on different chromosomes are defined as trans-eQTLs. Unlike

cis-eQTLs, which are common and often shared across tissues, trans-eQTLs are more likely to be cell type

specific84. Trans-eQTLs are also enriched for complex trait- and disease-associated variants97, and are

considered more informative than cis-eQTLs with regard to locating potential risk genes in disease

susceptibility loci84. A larger sample size is often required to identify trans-eQTLs, because they usually

have smaller effect sizes compared with cis-eQTLs98, and the multiple testing burden is heavier in a

genome-wide trans-eQTL analysis since the number of variants considered is much higher. In a recent

eQTL meta-analysis in blood samples with the largest sample size to date (31,684), 6,298 genes were

detected to have trans-eQTLs, while 16,989 genes were observed to have cis-eQTLs using this sample

size84. Given that a typical sample size of an eQTL study ranges from a few dozen (e.g. 52 samples for

primary human coronary artery smooth muscle cells99) to several hundred, most studies focus only on

cis-eQTLs. Larger sample sizes (a few thousand) are available for more accessible tissue types such as

peripheral blood samples81,82,100, which allow for more comprehensive characterisation of trans-

regulation networks.

Cis-eQTLs often regulate gene expression in a cis manner by affecting multiple steps in the gene

expression process, including TF binding, chromatin accessibility, DNA methylation, and mRNA

splicing66. For example, a cis-eQTL can be a variant located in promoter or enhancer regions, which

affects the binding of a TF66. RNA-seq technology allows for the detection of allele-specific expression,

which often happens for cis-eQTLs in heterozygous individuals52,75,77,83. Less is known about the

biological mechanisms of trans-eQTLs, which commonly have different mechanisms from cis-eQTLs.

They might regulate eGene expression by altering expression levels or functions of TFs, which target

the promoter or enhancer regions of the downstream eGene, thus leading to changes in transcriptional

activity. Studies have shown that trans-eQTLs are enriched for cis-eQTLs (i.e. trans-eQTLs are also

associated with nearby genes)97. For example, Bryois et al. identified that cis-eQTLs of two genes,

BATF3 and HMX2, both encoding TFs, were associated with the most trans-eGenes distributed across

multiple chromosomes in 869 LCLs97. These eQTLs that influence the expression of multiple

downstream transcripts across the genome are called eQTL hotspots, which are probably located near

major regulators or master TFs101. The fact that trans-eQTLs are often associated with local gene

expression suggests that the local gene may play a role in the trans-regulation of distant genes. A cis-

eQTL of IRF7 (which encodes a TF) is associated with the expression of multiple distant genes in

dendritic cells activated by influenza102. Experimental overexpression of IRF7 has validated its role in

inducing the expression of this set of genes102. Statistical evidence obtained from mediation analysis

shows that 20%–35% of trans-associations are significantly mediated through expression of cis-eGenes

associated with the same trans-eQTLs103,104.

1.3.3 eQTLs and human diseases

Findings of eQTL studies aid in our understanding of complex human diseases. Genetic variants that

are associated with complex human traits are observed to be enriched for eQTLs92,105,106, and

investigation of their regulatory effects on gene expression helps link genotypes to phenotypes through

intermediate molecular traits. Effects of eQTLs are often tissue- or context-dependent. There have been

eQTL studies in disease-relevant tissues, which can give insights on plausible mechanisms of disease

pathogenesis. For example, eQTL analysis in human kidney compartments integrating GWAS findings

identified putative risk genes (DAB2) for chronic kidney disease107. Similarly, retinal eQTL analysis

identified target genes involved in age-related macular degeneration108, and eQTLs detected in brain

regions generated biological insights on neurological disorders109,110. Tissue-specific eQTLs provide

opportunities to locate novel causal genes and to uncover pathways involved in disease pathogenesis.

Integrative approaches using eQTL datasets help to unveil the functional effects of GWAS variants.

One approach is to use a Bayesian statistical framework (such as coloc) to identify genetic loci where

shared causal variants are underlying both GWAS and eQTL signals, termed colocalisation111, given

that GWAS variants are often observed to have effects on gene expression. This method estimates the

evidence of colocalisation for each genetic locus, which is usually a 400-kb region, using summary

statistics for both traits obtained from two independent cohorts. Posterior probabilities are estimated for

five hypotheses representing all possible configurations (e.g. distinct underlying causal variants, or a

shared causal variant). Other colocalisation methods are also available, such as gwas-pw, which extends

the coloc Bayesian method without requiring priors to be specified112. Most methods assume at most one

causal signal per locus, while eCAVIAR accounts for multiple causal variants in a single locus113. The

colocalisation analysis has shown its power in identifying disease-relevant genes in certain cell types.

Guo et al. investigated colocalisations between GWAS variants associated with ten immune-related

diseases and eQTLs in B cells and monocytes114. They identified six potential susceptibility genes,

including one (CTSH) for type 1 diabetes and narcolepsy, which was specific to monocytes but not

observed in B cells114.
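
To make the five-hypothesis framework more concrete, the sketch below combines per-variant log approximate Bayes factors for the GWAS trait and the expression trait into posterior probabilities, under the single-causal-variant assumption. The hypothesis labels (H0: no association; H1 and H2: association with only one trait; H3: distinct causal variants; H4: a shared causal variant), the function names, and the prior values p1, p2, and p12 are assumptions made for illustration; the way the Bayes factors themselves are derived from summary statistics is not shown.

```python
import math
from typing import Dict, Sequence


def _logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v))) over a sequence."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def _logdiffexp(a: float, b: float) -> float:
    """log(exp(a) - exp(b)); returns -inf if the difference is not positive."""
    m = max(a, b)
    diff = math.exp(a - m) - math.exp(b - m)
    return m + math.log(diff) if diff > 0 else float("-inf")


def coloc_posteriors(labf_gwas: Sequence[float], labf_eqtl: Sequence[float],
                     p1: float = 1e-4, p2: float = 1e-4,
                     p12: float = 1e-5) -> Dict[str, float]:
    """Posterior probabilities of H0..H4 for one locus.

    Inputs are per-variant log approximate Bayes factors for the same ordered
    set of variants; the priors are illustrative assumptions.
    """
    if len(labf_gwas) != len(labf_eqtl):
        raise ValueError("both traits must cover the same variants")
    s1 = _logsumexp(labf_gwas)
    s2 = _logsumexp(labf_eqtl)
    s12 = _logsumexp([a + b for a, b in zip(labf_gwas, labf_eqtl)])
    log_h = {
        "H0": 0.0,
        "H1": math.log(p1) + s1,
        "H2": math.log(p2) + s2,
        # All pairs of variants minus the pairs where both traits share a variant.
        "H3": math.log(p1) + math.log(p2) + _logdiffexp(s1 + s2, s12),
        "H4": math.log(p12) + s12,
    }
    denom = _logsumexp(list(log_h.values()))
    return {h: math.exp(v - denom) for h, v in log_h.items()}
```

When both traits have their strongest evidence at the same variant, most of the posterior mass falls on H4; when the peaks sit at different variants, H3 is favoured over H4.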

Recently, the prediction of gene expression based on genotypes using eQTL effect sizes has become a

new approach as a follow-up experiment for GWAS115-117. For example, an approach called

PrediXcan115 estimates the genetically determined components of gene expression, and identifies risk

genes by correlating the imputed gene expression with diseases or traits. The same group proposed

another method called MultiXcan117, which integrates multiple eQTL reference panels, and thus it

leverages eQTLs shared across multiple tissues, contexts, and developmental stages. They also

developed S-PrediXcan116 and S-MultiXcan117, which require only summary statistics of GWAS rather

than individual level genotype data. Potentially, these methods can avoid the concern of reverse

causality, because the imputed gene expression only captures the genetically regulated variation and is

not affected by external environmental factors or disease status. The performance of these models

depends on the eQTL reference data and, importantly, on the accuracy of eQTL effect size estimation.
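
The prediction step described above can be illustrated as a weighted sum of allele dosages, followed by a correlation between the imputed expression and a trait. This is only a sketch: the function names and the idea of passing weights as a plain list are assumptions, and it does not reproduce how PrediXcan or related tools train or store their prediction models.

```python
import math
from typing import List, Sequence


def predict_expression(dosages: Sequence[Sequence[float]],
                       weights: Sequence[float]) -> List[float]:
    """Genetically predicted expression of one gene for each individual.

    `dosages` holds one row per individual and one column per variant;
    `weights` holds one (hypothetical) eQTL-derived weight per variant.
    """
    return [sum(w * g for w, g in zip(weights, row)) for row in dosages]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation, used here to relate imputed expression to a trait."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

Because the predicted values depend only on genotypes and fixed weights, any error in the estimated eQTL effect sizes propagates directly into the imputed expression.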

Transcriptome-wide association studies (TWAS) identify trait- or disease-associated gene expression

traits by a genome-wide scan for associations between gene expression traits and phenotypes in cohorts

with both data available. However, due to the relatively higher cost of gene expression profiling than

genotyping, studies with both expression and phenotype data available are still orders of magnitude

smaller than typical GWAS. Using prediction models, it is now possible to impute gene expression data

for a large GWAS cohort and to perform TWAS using a large sample size. TWAS have identified genes

involved in various traits such as obesity-related traits118, schizophrenia119, and breast cancer120. On the

other hand, we do not necessarily have phenotype data for participants from large eQTL studies, so

trait-associated genes can be identified by correlating gene expression levels with GRSs for multiple

traits calculated in these participants, and the associated GRSs are termed expression quantitative trait

scores (eQTSs)84.

Findings from eQTL analyses also have value in translational precision medicine. GRSs, which are

calculated based on the genotype data and effects of genetic variants estimated from GWAS, have

proven valuable in predicting disease risks for individuals based on their genetic information. Similarly,

a transcriptional risk score (TRS) for which transcriptomic data are used instead of genomic data, might

also provide accurate risk estimation98. TRS is calculated as a sum of standardised expression levels of

eGenes with eQTLs that are associated with disease risks, and the gene expression is quantified in

relevant tissues. TRSs can utilise findings of eQTL and GWAS analysis of external large cohorts. TRSs

calculated based on gene expression profiles in ileal biopsies were shown to outperform GRSs in

predicting risks for Crohn’s disease121. However, in practice, it is not as easy to obtain gene expression

data from disease-relevant tissues as to collect genotype data from a population, given the difficulty in

collecting tissues and the high cost of gene expression quantification. In addition, we are not always

aware of the pathologically relevant tissues for all diseases.
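
As a rough illustration of the TRS calculation described above, the sketch below standardises the expression of each selected eGene across individuals and sums the standardised values per individual. The function name, the dictionary layout, and the optional `directions` argument (a sign per gene, for cases where lower expression is taken to increase risk) are assumptions added for illustration and are not stated in the text.

```python
import statistics
from typing import Dict, List, Optional


def transcriptional_risk_scores(expression: Dict[str, List[float]],
                                directions: Optional[Dict[str, int]] = None
                                ) -> List[float]:
    """Sum of standardised expression levels across eGenes, per individual.

    `expression` maps each eGene to its expression values, with individuals
    in the same order for every gene. `directions` (hypothetical) maps a gene
    to +1 or -1; genes not listed default to +1.
    """
    n = len(next(iter(expression.values())))
    scores = [0.0] * n
    for gene, values in expression.items():
        mean = statistics.fmean(values)
        sd = statistics.stdev(values)
        sign = directions.get(gene, 1) if directions else 1
        for i, v in enumerate(values):
            scores[i] += sign * (v - mean) / sd
    return scores
```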

1.3.4 Response eQTLs

The architecture of eQTLs is often dependent on the specific cell or tissue type and context. Various cell

types have been investigated for eQTL effects. Studies exploring eQTLs in cell types under resting and

various stimulatory conditions have been recently published. These studies have focused mostly on

resting and activated human immune cells (treated by relevant stimuli such as pathogens and cytokines)

since many human diseases are characterised by dysregulation of immune functions (Table 1.1). For

example, dysregulation of homeostasis and subsequent activation of T cells have been shown to be

involved in autoimmune diseases122 and cancers123, and monocytes (including macrophages derived

from monocytes) have a critical role in autoimmunity through the regulation of T cells139,140.

Characterising cell type- and condition-specific regulatory elements in immune cells can be informative

in identifying genes involved in diseases and understanding the aetiology of immune-related diseases.

Table 1.1: List of recent response eQTL studies. PMID indicates the unique identifier of publications in PubMed. Cohort abbreviations in brackets for some studies are as follows: ImmVar: The Immune Variation project54, HipSci: The Human Induced Pluripotent Stem Cell Initiative124. For studies where multiple stimuli or time points are available, sample size is the number of individuals with at least one experimental condition available for eQTL analysis, except for the Fairfax et al. study in which 228(†) indicates the number of individuals having all experimental conditions. An asterisk next to the sample size indicates that the individuals have a selected set of transcripts measured using NanoString profiling125. The selected sets of genes were determined based on either genome-wide gene expression data in a subset of the cohort (30 in Lee et al. and 15 in Ye et al.), or GWAS SNPs associated with autoimmune diseases in Hu et al. Cell type abbreviations are as follows: iPSCs: induced pluripotent stem cells, PBMCs: peripheral blood mononuclear cells.

| PMID | Year | Author (Consortium) | Sample size | Cell type | Stimuli and duration |
|---|---|---|---|---|---|
| 29737278 | 2018 | Knowles et al.126 | 45 | Cardiomyocytes (derived from iPSCs) | Doxorubicin (five concentrations: 0, 0.625, 1.25, 2.5, or 5 µM) for 24 hours |
| 29988122 | 2018 | Gate et al.127 (ImmVar) | 95 | CD4+ T cells | Anti-CD3/CD28 antibodies for 48 hours |
| 29379200 | 2018 | Alasoo et al.128 (HipSci) | 84 | Macrophages (derived from iPSCs) | IFNγ for 18 hours; Salmonella for 5 hours; IFNγ followed by Salmonella |
| 28793313 | 2017 | Manry et al.129 | 51 | Whole blood cells | Mycobacterium leprae for 26–32 hours |
| 28814792 | 2017 | Kim-Hellmuth et al.130 | 134 | CD14+ monocytes | LPS for 90 minutes and 6 hours; MDP for 90 minutes and 6 hours; 5'-ppp-dsRNA for 90 minutes and 6 hours |
| 27768888 | 2016 | Quach et al.131 | 200 | CD14+ monocytes | LPS for 6 hours; Pam3CSK4 for 6 hours; R848 for 6 hours; Influenza A virus for 6 hours |
| 27768889 | 2016 | Nedelec et al.132 | 168 | Macrophages (derived from CD14+ monocytes) | Listeria monocytogenes (gram-positive) for 2 hours; Salmonella typhimurium (gram-negative) for 2 hours |
| 25874939 | 2015 | Caliskan et al.133 | 98 | PBMCs | Rhinovirus for 24 hours |
| 24604202 | 2014 | Fairfax et al.134 | 228† | CD14+ monocytes | LPS for 2 hours and 24 hours; IFN-γ for 2 hours |
| 24604203 | 2014 | Lee et al.102 (ImmVar) | 534* (30) | Dendritic cells (derived from CD14+ monocytes) | LPS for 5 hours; Influenza virus for 10 hours; IFN-β for 6.5 hours |
| 25214635 | 2014 | Ye et al.135 (ImmVar) | 348* (15) | CD4+ T cells | Anti-CD3/CD28 for 4 hours and 48 hours; Anti-CD3/CD28+IFNβ for 4 hours; Anti-CD3/CD28+Th17 cocktail for 48 hours |
| 24968232 | 2014 | Hu et al.136 | 174* | CD4+ T cells | Anti-CD3/CD28 for 72 hours |
| 25327457 | 2014 | Kim et al.137 | 137 | CD14+ monocytes | LPS for 90 minutes |
| 22233810 | 2012 | Barreiro et al.138 | 65 | Dendritic cells (derived from CD14+ monocytes) | Mycobacterium tuberculosis for 18 hours |

Response eQTLs (reQTLs) are eQTLs that are specifically responsive to certain conditions or

experimental stimuli, or those that show varying effects across conditions. The first genome-wide

reQTL study investigated the immune responses of primary dendritic cells (DCs) to Mycobacterium

tuberculosis, the causative agent of tuberculosis138. This study observed that reQTLs were enriched

for genetic variants associated with pulmonary tuberculosis, and identified susceptibility genes138. A study

using resting and stimulated monocytes treated by lipopolysaccharide (LPS) or interferon-γ (IFN-γ)

observed that more than half of cis-eQTLs were specific to stimulated monocytes, and interestingly, a

large proportion of cis-eQTLs identified in resting cells were no longer detectable after stimulation134.

Studies with multiple time points available have characterised dynamic patterns of reQTLs, e.g. early

or late induced, persistent, and transient effects130,135. In resting and stimulated monocytes,

about half of reQTLs were reported to be time point-specific130. Local reQTLs are enriched for genetic

variants associated with autoimmune diseases127,128,130,134.
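
One simple way to illustrate a response eQTL, assuming expression is measured in the same individuals under resting and stimulated conditions, is to compare the genotype effect on expression between the two conditions. The sketch below is an assumption-laden illustration rather than the method of any study cited here: the function names are hypothetical, no covariates are modelled, and a formal test would also need standard errors and multiple testing correction.

```python
from typing import Dict, Sequence


def ols_slope(x: Sequence[float], y: Sequence[float]) -> float:
    """Least-squares slope of y on x (expression change per allele copy)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx


def reqtl_effect(dosage: Sequence[float], resting: Sequence[float],
                 stimulated: Sequence[float]) -> Dict[str, float]:
    """eQTL effect in each condition and the genotype effect on the response.

    The response is the per-individual difference (stimulated - resting);
    a non-zero slope of the response on dosage suggests a reQTL.
    """
    response = [s - r for s, r in zip(stimulated, resting)]
    return {
        "beta_resting": ols_slope(dosage, resting),
        "beta_stimulated": ols_slope(dosage, stimulated),
        "beta_response": ols_slope(dosage, response),
    }
```

With paired samples, the slope of the response equals the difference between the stimulated and resting slopes, so an eQTL with the same effect in both conditions gives a response slope of zero.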

1.3.5 Beyond eQTLs

The majority of studies investigating genetic regulatory effects in functional genomics focus on

associations between SNPs and mRNA transcript abundances. Some studies focus on different types of

genetic variation such as copy number variants (CNVs)72,141 and short insertions and deletions

(indels)142. These types of genetic variation can provide additional information. For instance, the genetic

variation in gene expression captured by CNVs is not tagged by SNPs72, and CNVs have been reported

to have stronger effects than their neighbouring SNPs141. In addition to gene expression, genetic

regulatory variants also regulate many other quantitative molecular traits, such as microRNA

expression143, chromatin accessibility including DNase I hypersensitive sites128,144, methylation

status110,145,146, splicing events51,83,147-149, ribosome levels150,151, and protein abundances151-154.

Characterisation of genetic regulatory effects on multiple levels of molecular traits provides a better

understanding of multiple steps from transcription activity to protein production.

Other types of QTLs that are associated with other molecular phenotypes often, but not exclusively,

influence mRNA gene expression levels, and vice versa. Many QTLs have been reported to be enriched

for cis-eQTLs, including regulatory variants that are associated with microRNA expression

(cis-miR-eQTLs)143, DNase I sensitivity QTLs (dsQTLs)114, methylation QTLs acting in cis (cis-meQTLs)145,146,

and plasma protein QTLs acting in cis (cis-pQTLs)152. Bayesian statistical methods have provided

evidence of colocalisation of the above QTLs with eQTLs145,152. However, independent genetic effects are also often

observed. For example, pQTLs with little genetic effects on mRNA levels of the corresponding genes

have been observed, suggesting that pQTLs might also affect protein abundances through other

mechanisms, such as regulation of post-translational modification and protein degradation, in addition

to affecting transcription activity151,152. Most splicing QTLs (sQTLs) identified using Yoruba LCLs

were reported to have little or no influence on gene expression147, and sQTLs were enriched for trait-

associated SNPs identified in GWAS148. All these efforts in deciphering genetic regulatory effects on

different levels of molecular traits have provided additional insights on functional characteristics of

human genetic variation.

1.4 Early-life origins of diseases developed in adulthood

The human immune system detects pathogens, such as viruses and bacteria, and protects us from various

infections and diseases by distinguishing non-self and abnormal self molecules from normal self. The

immune system functions in a complex manner, and multiple cell types and biological processes are

involved. The dysregulation of the immune system or the failure of immune homeostasis can lead to

various autoimmune and inflammatory diseases, such as rheumatoid arthritis155. The human immune

system is classified into two categories: the innate immune system and the adaptive immune system,

which are not mutually exclusive and often substantially interact with each other156,157. For example,

the innate immune system can activate adaptive immune responses by detecting microbes by pattern-

recognition receptors on antigen-presenting cells127. In this section, I will discuss the major cell types

involved in innate and adaptive immunity, characteristics of perinatal immune system, and early-life

origins of complex human diseases.

1.4.1 Immune cell populations and perinatal immune system

The innate immune system provides the first line of defence, and can be activated within hours. Innate

immune cells include macrophages and dendritic cells (DCs), which are derived from monocytes, as

well as neutrophils, basophils, eosinophils, and natural killer (NK) cells. Other non-immune cells also

contribute to innate immunity, such as epithelial cells and endothelial cells158. Monocytes are a subtype

of leukocyte, characterised by their large size and kidney-shaped nucleus159. Bone marrow, blood, and spleen

are known to be the reservoirs of monocytes, which are recruited to sites of inflammation and injured

tissues160. There are different subclasses of monocytes, including the classical CD14++CD16- monocytes,

which highly express the CD14 cell surface molecule, the intermediate CD14++CD16+ monocytes

with additional low expression of CD16, and the non-classical CD14+CD16++ monocytes161 with low

expression level of CD14 and high co-expression of CD16162. These subclasses do not represent

different developmental stages of monocytes, and they have been shown to have distinct functions163,164.

Macrophages, which digest foreign substances and abnormal cells, and DCs, which are antigen-

presenting cells, are differentiated from monocytes, and both play important roles in innate immune

responses.

The adaptive immune system, or the acquired immune system, provides pathogen-specific, long-lasting

protection. The activation of adaptive immune functions takes longer than that of the innate immune system.

One characteristic of the adaptive immune system is the immunological memory, which can lead to a

more rapid and enhanced response to subsequent challenges from the same pathogen. The

immunological memory is acquired from the interaction with the external environment throughout life.

Two main classes of lymphocytes are involved in adaptive immune responses, namely T cells that

mature in the thymus and B cells that mature in the bone marrow in adult mammals. CD8+ T cells, also

known as cytotoxic T cells or killer T cells, have important roles in cell-mediated immunity: they kill

infected cells or cancer cells. CD4+ T cells, or T helper cells, secrete cytokines to regulate the activation

of other immune cells such as B cells and cytotoxic T cells. Other T cell subpopulations include memory

T cells and suppressor T cells. B cells secrete antibodies or immunoglobulins (Ig), which can destroy

the cells directly, or label infected cells or microbes so that they can be destroyed by other cells such as

macrophages.

Neonates and infants are more susceptible to infectious diseases than adults, partly because of a reduced

number and a functional impairment of immune cells165. There is a lack of immunological memory in

neonates, given the limited exposure to antigens in utero and the reduced number of immune cells166,167.

However, the higher susceptibility to infection observed in neonates is not entirely due to the immaturity

of the neonatal immune system; it is also considered a result of the specific immunological demands

during the perinatal period – avoiding pro-inflammatory/T helper 1 (TH1)-cell-associated immune

responses, in addition to protection from infections165-167. Briefly, TH1 cytokines and their pathways

involve interferon-γ (IFN-γ), interleukin-2, and interleukin-12, and TH1 cells drive cell-mediated

immune responses, while TH2 cells are responsible for antibody production and humoral immune

responses and interleukin-4, interleukin-13, and interleukin-5 are involved168. The foetal and neonatal

immune systems have an imbalance between TH1 and TH2 cell populations, and the production of

cytokines is biased towards TH2 cytokines, since TH1-type cytokines and immune responses are

associated with spontaneous abortions169,170.

1.4.2 Early-life origins of chronic diseases developed later in life

Early life, which includes the foetal stage and the first years after birth, is a critical period for humans,

and environmental exposures and physiological changes experienced in this period influence immune

system development and future disease risks171-174. Many chronic human diseases that develop later in

life originate from early childhood. For example, the development of respiratory diseases such as

chronic obstructive pulmonary disease (COPD) is influenced by early-life risk factors such as maternal

smoking and postnatal smoke exposure, which have long-lasting influence on lung functions and

respiratory health175,176.

Perhaps the most well-known and extensively replicated epidemiologic observation is the association

between small body size at birth as well as during infancy and multiple phenotypes including disease

risks later in adulthood, such as coronary heart disease177,178, type 2 diabetes179, reduced bone mass180,

and body composition including fat and muscle mass181. From all this, the “developmental origins”

hypothesis emerged, which proposed that adaptive responses to the external environment made during

foetal and early-life stages have long-term impacts on disease risk182,183. It is known that early-life

events such as microbial colonisation influence the development of the host immune system as well as

immune-related diseases such as asthma172. Exposure to antibiotics in early life has been reported to be

associated with higher risks for the presence of asthma and allergy184, inflammatory bowel disease185,186,

and being overweight187.

Studies in functional genomics have also shown evidence to support the developmental origins

hypothesis. Peng et al. performed an eQTL analysis using 159 human foetal placentas, a tissue that

plays a key role in foetal development, and observed that GWAS SNPs that were also placental eQTLs

had larger effect sizes on gene expression in placentas than in other disease-relevant tissues from

adults188. This analysis provided evidence of regulation of postnatal disease susceptibility by placental

genes188. Their follow-up study investigating the roles of placental eQTLs in birth weight and

susceptibility to childhood obesity found that placental eQTLs contributed more to these

phenotypes than eQTLs from relevant adult tissues such as liver and adipose189. TWAS

analysis using imputed gene expression based on placental transcriptome data identified potential risk

genes for the above phenotypes189. Another recent eQTL study in 120 human foetal brains identified

potential causal genes that may mediate risk for neuropsychiatric conditions, and showed that effects

of many neuropsychiatric trait-associated SNPs on relevant traits may become manifest at an early stage

in life190.

1.5 Research objectives

The development of polygenic human diseases involves an individual’s genetic predisposition and its

interaction with external environmental factors. Research has been successful in identifying genetic

variants associated with various human diseases using large scale cohorts. Despite the efforts to

understand the functional roles of these genetic variants, it remains a significant challenge to interpret

the mechanisms through which the genetic variants contribute to disease pathogenesis. Investigating

the effects of these genetic variants on gene expression and subsequent critical phenotypes can greatly

inform on their regulatory roles. Furthermore, the investigation of condition-specific eQTLs, or how

eQTL effects are modified by environmental stimuli (e.g. immune responses) provides potential

molecular mechanisms of gene-environment interactions. However, most studies investigating reQTLs

are performed using adult immune cells, and there remains a gap in deciphering reQTL effects on the

early-life transcriptome, given that many diseases originate in early childhood.

The objectives of this thesis are to explore eQTL analysis methodologies, and to use gene expression

in neonatal immune cells to investigate the early-life origins of immune-related diseases that develop

in adulthood. The investigation of eQTLs has become a focus in functional genomics; however, there

was no extensive simulation study to explore the study design and analysis choices. This thesis first

explored various study parameters in empirically-driven simulated eQTL datasets, and made

recommendations on analysis strategies under various scenarios. Based on the insights from the

simulation study, this thesis next used optimal analysis strategies and explored early-life eQTLs in

neonatal immune cells. Two immune cell types (monocytes and T cells) under resting and stimulated

conditions were obtained from cord blood samples of the Childhood Asthma Study (CAS), a

prospective birth cohort of children at high risk of atopy. Response eQTLs with genetic effects modified

by relevant stimuli were also characterised to understand the molecular basis of gene-by-environment

interactions. Lastly, this thesis applied integrative approaches and leveraged findings from external

large scale GWAS to explore how early-life genetic variants and gene expression mediate risks for later

immune-related diseases.

The specific aims of this thesis were:

(1) To provide an evidence base for eQTL study design and to provide guidance for eQTL analysis

choices.

(2) To characterise the genetic regulation of gene expression in neonatal immune cells and how the

regulatory effects are modified by immune responses.

(3) To uncover the regulatory effects of variants associated with immune-related diseases on early-

life gene expression, and to associate gene expression at birth with disease risk in later adulthood.

Chapter 2

Power, false discovery rate and Winner's Curse in

eQTL studies

2.1 Introduction

Genome-wide association studies (GWAS) have identified thousands of genetic variants associated

with complex phenotypes36, and the vast majority of genome-wide significant SNPs are located in

non-coding regions44, making interpretation challenging. Integration of gene expression and genetic variation

is a ubiquitous approach for uncovering genetic regulatory effects and their ramifications for pathways

relevant to human diseases and traits86,103,191,192, and indeed trait-associated SNPs have been found to be

enriched for expression quantitative trait loci (eQTL) effects105.

Yet, while eQTL analysis has become a focus of functional genomics, the lack of a strong evidence

base for eQTL study design leaves fundamental questions unanswered. In particular, while more and

more eQTLs reach statistical significance, the true proportion of false discoveries and the accuracy of

their effect size estimates have not yet been well characterised. A seminal early study compared multiple

testing correction methods for detecting eQTLs (including Bonferroni correction, false discovery rate

control and permutation) using HapMap data; however, estimates of false discovery rate (FDR) and

sensitivity are not possible without knowledge of all true eQTLs in the data71. Previous eQTL

simulations are typically part of new methodologies, yet these simulations have been limited in their

reflection of real data. Genotype data have typically been simulated with a narrow minor allele

frequency (MAF) range assuming Hardy-Weinberg equilibrium (e.g. MAF 30% in Ref 193, 5% and 20%

in Ref 194, 40% in Ref 195), thus they have not captured realistic patterns of genetic variation, especially

linkage disequilibrium (LD) complexity. Furthermore, MAFs at 1% or greater are typically utilized for

eQTL analysis (Table S2.1). Others have simulated only a fixed sample size195-197. Typically, eQTL

studies have sample sizes of 50 to 1,000, with the accessibility of the tissue or condition a major

determining factor (Table S2.1). A recent trans-eQTL study performed in whole blood had a size of

5,257 samples103 and a study combined data for 2,116 whole blood samples to identify context-specific

eQTLs198. Perhaps the exemplar multiple human tissue resource, the Genotype-Tissue Expression

(GTEx) project51, comprises 44 tissues with a sample size range of 70–361 in its V6p data release52.

While studies have generally converged on linear regression or linear mixed models for eQTL detection,

the multiple testing correction approach is still a source of substantial variability among studies. Various

approaches are available for minimizing type I errors. Often criticised as too conservative, particularly

with complex LD patterns, the Bonferroni correction aims to control the familywise error rate (the

probability of making any type I error) by setting the significance level at α/N, where α is the desired

significance level (0.05 conventionally) and N is the number of tests. FDR-controlling procedures,

which aim to control the expected proportion of false discoveries among all rejected null hypotheses,

are generally considered to provide a better balance between false positives and false negatives.

Benjamini and Hochberg (BH) proposed a procedure199 assuming each statistical test is independent,

which is not the case due to LD. Benjamini and Yekutieli (BY) modified the FDR procedure to one

which, while more conservative, accommodates correlation structure between statistical tests200. The q-

value FDR-controlling approach from Storey and Tibshirani (ST) estimates the proportion of

hypotheses that are truly null (π0), while the BH procedure assumes π0 = 1 which makes ST less

conservative than the BH procedure201.
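
As an illustration of the adjustments described above, the following minimal Python sketch computes Bonferroni, BH, and BY adjusted P-values for a small set of hypothetical P-values. The function names and example values are illustrative assumptions rather than code used in this thesis. The ST q-value procedure is omitted because it additionally requires an estimate of π0.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted P-values: p * N, capped at 1."""
    n = len(pvals)
    return [min(1.0, p * n) for p in pvals]


def benjamini_hochberg(pvals, dependence_factor=1.0):
    """BH step-up adjusted P-values (dependence_factor > 1 gives BY)."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        value = pvals[i] * n * dependence_factor / rank
        running_min = min(running_min, value)
        adjusted[i] = running_min
    return adjusted


def benjamini_yekutieli(pvals):
    """BY adjustment: BH scaled by the harmonic sum c(N) = 1 + 1/2 + ... + 1/N."""
    c_n = sum(1.0 / i for i in range(1, len(pvals) + 1))
    return benjamini_hochberg(pvals, dependence_factor=c_n)


# Hypothetical nominal P-values for five tests.
pvals = [0.001, 0.008, 0.012, 0.03, 0.6]
alpha = 0.05
n_bonf = sum(p <= alpha for p in bonferroni(pvals))
n_bh = sum(p <= alpha for p in benjamini_hochberg(pvals))
n_by = sum(p <= alpha for p in benjamini_yekutieli(pvals))
print(n_bonf, n_bh, n_by)  # 2 4 3
```

In this toy example Bonferroni rejects the fewest hypotheses, BH the most, and BY falls in between, in line with the ordering of conservativeness described above.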

Other approaches have been proposed to deal with multiple testing specifically for eQTL studies.

Locus-restricted permutation testing is widely used to obtain empirical null distributions. To achieve

this, sample labels are randomly shuffled while keeping genotype data constant, with association tests

performed at each permutation step. For each gene, the best SNP association at each permutation is kept

to generate an empirical null distribution of minimum P-values, from which permutation test P-values

are calculated for each cis-SNP. Thousands of permutations are required to achieve accurate results,

thus there is a high computational cost. Approximations have been investigated for calculating

permutation P-values, such as those in FastQTL202 and MVN203. For example, FastQTL provides an

option to approximate the tail of the empirical null distributions of P-values using a beta distribution

thereby reducing the number of permutations required202. In addition to permutation tests, eigenMT

proposed by Davis et al. adjusts P-values in shorter time204. The number of independent tests (typically

SNPs) for each gene is estimated by eigenMT using a genotype correlation matrix, then a Bonferroni

procedure is applied204. Both FastQTL and eigenMT account for LD structure among local variants.

Recently, hierarchical procedures, such as TreeQTL205, have been proposed, which first control for

multiple testing of variants at each gene, before controlling for multiple testing across all genes. Taken

together, with many correction methods available, it is not clear which method is optimal for eQTL

mapping nor what their respective performances are for genetic variants with different characteristics

(allele frequency, effect size, etc).
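
The locus-restricted permutation scheme described above can be sketched as follows. This is a simplified illustration under stated assumptions: the strongest absolute Pearson correlation across cis-SNPs stands in for the minimum P-value (the two give the same ranking in simple linear regression at a fixed sample size), and the data and function names are hypothetical. It does not reproduce FastQTL, eigenMT, or the code used in this thesis.

```python
import random


def pearson_r(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def best_association(expression, genotypes):
    """Strongest |r| over all cis-SNPs of a gene."""
    return max(abs(pearson_r(snp, expression)) for snp in genotypes)


def gene_permutation_pvalue(expression, genotypes, n_perm=1000, seed=0):
    """Empirical P-value of a gene's best cis-SNP association."""
    rng = random.Random(seed)
    observed = best_association(expression, genotypes)
    shuffled = list(expression)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute sample labels; genotypes stay fixed
        if best_association(shuffled, genotypes) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)


# Hypothetical toy data: 40 samples, two cis-SNPs, the first one causal.
rng = random.Random(1)
snp_a = [0, 1, 2, 1, 0, 1, 0, 0, 1, 2] * 4
snp_b = [1, 0, 0, 2, 1, 0, 1, 1, 0, 0] * 4
expression = [0.5 * g + rng.gauss(0.0, 0.2) for g in snp_a]
p_gene = gene_permutation_pvalue(expression, [snp_a, snp_b], n_perm=200)
print(p_gene)
```

Because the smallest attainable empirical P-value is 1/(n_perm + 1), the number of permutations bounds the resolution of the test, which is why thousands of permutations (or a tail approximation) are needed in practice.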

Effect size estimation for eQTLs represents a more complex and less explored problem, yet its

importance is increasing as comparisons of eQTLs across tissues and experimental conditions, and

meta-analyses, become more common. Furthermore, prediction of tissue-specific gene expression from

genotypes, for example using the tool PrediXcan115, is critically dependent on effect size estimation,

particularly cis-eQTL effect sizes obtained from analyses of GTEx and other studies. Conversely, a

method which predicts genotypes at eQTL SNPs (eSNPs) based on measured gene expression levels

has also been proposed206.

A well-recognised and pervasive phenomenon in GWAS is “Winner's Curse”207-210, an ascertainment

bias where the true genetic effect is smaller than its estimate within the discovery cohort. Notably, a

recent paper from Palmer and Pe'er systematically evaluated summary statistics from 100 previously

published quantitative trait studies and showed that Winner’s Curse was a key reason for the non-

replicability of significant loci211. Using a maximum likelihood method, they showed that correction

for Winner’s Curse improved replication211, yet these estimators, based on summary statistics, were

shown to over-correct Winner’s Curse and the downward bias was larger when the sample size was

small. Palmer and Pe'er definitively established the QTL study-level ramifications of Winner's Curse,

yet to our knowledge no study has comprehensively investigated Winner’s Curse for eQTLs or other

QTLs of the expressed genome using individual-level data. To rigorously evaluate each locus and

design follow-up experiments, it is important that we understand Winner's Curse in the context of

sample size, allele frequency and the estimated effect size, as well as design methods for adjusting effect

sizes during eQTL discovery. As with other studies evaluating key genome-wide study design questions,

large-scale simulation, where the true causal variant(s) and their effect(s) are known from the outset, is

a critical tool for quantifying the relative performance of different approaches in diverse settings212,213.
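
The ascertainment bias behind Winner's Curse can be illustrated with a small toy simulation. The sketch below is not the simulation framework of this chapter: it assumes that effect estimates are normally distributed around a single hypothetical true effect with a hypothetical standard error, and that an association is declared significant when its absolute z-score exceeds a hypothetical threshold.

```python
import random
import statistics

rng = random.Random(1)
true_effect = 0.2   # hypothetical true eQTL effect (s.d. per allele)
std_error = 0.1     # hypothetical standard error of the estimate
z_threshold = 3.0   # hypothetical significance threshold on |estimate / SE|

# Repeat the "study" many times and keep only the significant estimates.
estimates = [rng.gauss(true_effect, std_error) for _ in range(20000)]
significant = [b for b in estimates if abs(b / std_error) >= z_threshold]

mean_all = statistics.mean(estimates)      # close to the true effect
mean_sig = statistics.mean(significant)    # inflated by selection
print(round(mean_all, 3), round(mean_sig, 3), len(significant))
```

Averaged over all replicates the estimator is unbiased, but conditioning on significance keeps only estimates of magnitude 0.3 or more, so the average among "discoveries" exceeds the true effect of 0.2. The lower the power, the stronger this selection and the larger the bias.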

The research objective of this chapter is to explore the fundamental questions in eQTL study design

and analysis choices. The specific aims of this chapter were:

(1) To provide an evidence base for cis-eQTL study design. More specifically, we will investigate

the statistical power and false discoveries in different scenarios with varying sample sizes, genetic

effect sizes, and allele frequencies.

(2) To provide guidance for eQTL analysis choices by recommending optimal strategies with the

most power and well-calibrated false discovery rate in various scenarios.

(3) To systematically evaluate the bias in eQTL effect size estimation, and to develop a robust method

and user-friendly tool leading to more accurate effect size estimation.

To address these aims, we used extensive simulations of realistic LD patterns of human genetic variation

and matched gene expression to investigate how various scenarios, including different sample sizes,

allele frequencies and genetic effect sizes, influence statistical power and FDR (Figure 2.1). In each

scenario, we randomly selected SNPs as true causal cis-eQTLs, each associated with expression levels

of a target gene. We performed eQTL mapping and evaluated a variety of multiple testing correction

methods, used both individually and hierarchically, under each scenario. We next investigated the

accuracy of genetic effect size estimation across scenarios, the effect of the Winner’s Curse, and how

bias was affected by study power. Finally, we evaluated the accuracy of a variety of eQTL effect size

estimation procedures.

Figure 2.1: Flowchart of eQTL simulation study.

2.2 Results

2.2.1 Simulation of cis-eQTL data

To assess the power, FDR, and effect size estimation of eQTL studies based on different parameters,

we simulated 36 scenarios with combinations of six sample sizes (N = 100, 200, 500, 1,000, 2,000, and

5,000) and six true minor allele frequencies of eSNPs (MAF = 0.5%, 1%, 5%, 10%, 25%, and 50%).

Realistic LD patterns were simulated using HAPGEN2214 with chromosome 22 of the 1000 Genomes

Project phase 3 data5 as reference. In each scenario, 618 gene expression traits were simulated, among

which 200 were under genetic regulation (true eGenes). Each true eGene was simulated to be regulated

by one cis-eQTL with a genetic effect size randomly drawn from an empirical distribution based on

eQTL analysis of a real dataset215,216.
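
The scenario grid and the additive model for a single eGene can be sketched in a few lines of Python. This is a deliberately simplified illustration: genotypes are drawn independently under Hardy-Weinberg equilibrium rather than from HAPGEN2 haplotypes, so no LD is modelled, and the noise distribution, effect size, and function name are assumptions made for the example.

```python
import itertools
import random

SAMPLE_SIZES = [100, 200, 500, 1000, 2000, 5000]
MAFS = [0.005, 0.01, 0.05, 0.10, 0.25, 0.50]
N_GENES, N_TRUE_EGENES = 618, 200  # genes per scenario, as stated in the text

# Every combination of sample size and causal-eSNP MAF: 6 x 6 = 36 scenarios.
scenarios = list(itertools.product(SAMPLE_SIZES, MAFS))


def simulate_egene(n, maf, effect, rng):
    """Return (dosages, expression) for one eGene under an additive model."""
    # Assumption: two independent alleles per sample (Hardy-Weinberg), no LD.
    dosages = [(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]
    # Assumption: standard normal noise; `effect` is per minor allele.
    expression = [effect * g + rng.gauss(0.0, 1.0) for g in dosages]
    return dosages, expression


rng = random.Random(42)
n, maf = scenarios[14]  # the (500, 0.05) scenario
dosages, expression = simulate_egene(n, maf, effect=0.5, rng=rng)
print(len(scenarios), n, maf, sum(dosages) / (2 * n))
```

The last printed value is the observed minor allele frequency in the simulated sample, which fluctuates around the target MAF and can differ noticeably from it when the MAF is low and the sample is small.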

For each gene, all SNPs located within 1Mb of the transcription start site (TSS) were tested for

association using linear regression models through Matrix eQTL217. We mapped cis-eQTLs for the 36

scenarios separately and evaluated different multiple testing correction methods. Figure 2.1 illustrates

the workflow of our eQTL simulations and methods evaluation.

[Figure 2.1 flowchart content. Simulation parameters: sample size (100, 200, 500, 1000, 2000, 5000); eQTL MAF (0.5%, 1%, 5%, 10%, 25%, 50%); eQTL effect sizes from an empirical distribution. Data simulation: genotypes simulated by HAPGEN2 using chr22 of the 1000 Genomes reference panel; gene expression for 200 “true” eGenes and 418 “null” eGenes. Methods evaluation: eQTL mapping by linear regression; comparison of multiple-testing correction (pooled methods and hierarchical correction procedures); power estimation; false discovery rate assessment; effect size estimation; correction for overestimation (Winner’s Curse). Summary plots: estimated versus true effect size; power versus sample size.]

We used Bonferroni, FDR-controlling procedures, permutation approaches, and eigenMT to correct for multiple testing. The Bonferroni and

FDR procedures were applied alone to all hypotheses (pooled method) and were also used in

combination via a hierarchical correction procedure (2.4.3 Materials and Methods). We repeated the

simulation for each scenario 100 times and calculated the sensitivity and FDR of each multiple testing

correction method based on all simulations.

2.2.2 Power and false discovery rate between scenarios and multiple testing correction

procedures

We first assessed the variability in sensitivity and FDR for the various multiple testing correction

methods for eGene detection across simulation scenarios. A significant eGene was considered a true

positive if: (1) it was among the 200 true eGenes simulated, and (2) the simulated causal eSNP for that

eGene was among the significant eSNPs, or a significant eSNP was in high LD with the causal eSNP

(r² ≥0.8). For each multiple testing correction method, sensitivity, or true positive rate (TPR), was

calculated as the proportion of simulated true eGenes correctly identified as true positives. Conversely,

the FDR was calculated as the proportion of false positives in significant eGenes identified across all

100 simulations.
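
The two criteria for a true positive eGene, and the resulting sensitivity and FDR, can be expressed compactly. The sketch below is illustrative only: the data structures, gene and SNP identifiers, and the decision to count every significant eGene that fails either criterion as a false positive are assumptions made for the example.

```python
def evaluate_egenes(significant, causal, ld_r2, r2_min=0.8):
    """Return (sensitivity, FDR) for one set of eGene calls.

    significant: dict mapping each significant eGene to its set of eSNPs
    causal:      dict mapping each simulated true eGene to its causal eSNP
    ld_r2:       dict mapping (snp_a, snp_b) to r^2, looked up in either order
    """
    def tags_causal(snp, causal_snp):
        if snp == causal_snp:
            return True
        r2 = ld_r2.get((snp, causal_snp), ld_r2.get((causal_snp, snp), 0.0))
        return r2 >= r2_min

    true_pos = sum(
        1
        for gene, snps in significant.items()
        if gene in causal and any(tags_causal(s, causal[gene]) for s in snps)
    )
    false_pos = len(significant) - true_pos
    sensitivity = true_pos / len(causal) if causal else 0.0
    fdr = false_pos / len(significant) if significant else 0.0
    return sensitivity, fdr


# Hypothetical example: four true eGenes, four significant eGenes.
causal = {"G1": "rs1", "G2": "rs5", "G3": "rs9", "G4": "rs12"}
significant = {
    "G1": {"rs1", "rs2"},  # causal eSNP itself is significant
    "G2": {"rs6"},         # rs6 is in high LD with the causal rs5
    "G3": {"rs10"},        # rs10 is only in weak LD with the causal rs9
    "G7": {"rs20"},        # not a simulated eGene
}
ld_r2 = {("rs5", "rs6"): 0.9, ("rs9", "rs10"): 0.4}
print(evaluate_egenes(significant, causal, ld_r2))  # (0.5, 0.5)
```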

We evaluated multiple testing correction methods in two ways: first, applied across all SNP–gene

hypothesis tests (hereafter “pooled methods”), and second, in combination in a hierarchical approach in

which SNP–gene hypothesis tests were partitioned into groups by the gene being tested (hereafter

“hierarchical correction procedures”)218. In the case of hierarchical correction procedures, the multiple

hypothesis tests of eGenes were controlled (Step 2, global correction) based on the multiple testing

adjusted statistics (Step 1, local correction) of each gene’s best association, then SNPs significantly

associated with the significant eGenes were identified based on the locally corrected P-value

corresponding to the threshold of 0.05 after global correction (Step 3; 2.4.3 Materials and Methods).
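
The three steps of a hierarchical correction procedure can be sketched as below, using Bonferroni for the local correction and BH for the global correction (a Bonferroni-BH combination). This is a minimal sketch under stated assumptions, not the implementation described in 2.4.3 Materials and Methods: in particular, Step 3 takes the eSNP threshold to be the largest locally corrected best P-value among the significant eGenes, which is one plausible reading of the description above. All gene and SNP names are hypothetical.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted P-values."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def hierarchical_correction(cis_pvals, alpha=0.05):
    """cis_pvals: dict gene -> dict snp -> nominal P-value."""
    # Step 1 (local): Bonferroni-adjust the SNP P-values within each gene.
    local = {
        gene: {snp: min(1.0, p * len(snps)) for snp, p in snps.items()}
        for gene, snps in cis_pvals.items()
    }
    # Step 2 (global): BH across genes on each gene's best local P-value.
    genes = list(local)
    best = [min(local[g].values()) for g in genes]
    egenes = [g for g, q in zip(genes, bh_adjust(best)) if q <= alpha]
    if not egenes:
        return [], {}
    # Step 3: local P-value threshold matching the global cut-off (assumption).
    threshold = max(min(local[g].values()) for g in egenes)
    esnps = {
        g: sorted(s for s, p in local[g].items() if p <= threshold)
        for g in egenes
    }
    return egenes, esnps


cis_pvals = {
    "geneA": {"rs1": 1e-6, "rs2": 0.002, "rs3": 0.4},
    "geneB": {"rs4": 0.03, "rs5": 0.5},
    "geneC": {"rs6": 0.004, "rs7": 0.2, "rs8": 0.9, "rs9": 0.6},
}
egenes, esnps = hierarchical_correction(cis_pvals)
print(egenes)  # ['geneA', 'geneC']
print(esnps)   # {'geneA': ['rs1', 'rs2'], 'geneC': ['rs6']}
```

Here geneB is nominally significant at its best SNP (P = 0.03) but does not survive the two-stage correction, whereas geneA gains a second eSNP because the Step 3 threshold is shared across all significant eGenes.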

FDR-controlling procedures applied to all hypotheses (pooled FDR methods) failed to control the FDR

of eGenes in nearly all scenarios (Figure S2.1). We applied three FDR-controlling procedures to all

hypotheses: the Storey-Tibshirani (ST)201, Benjamini-Hochberg (BH)199, and Benjamini-Yekutieli

(BY)200 procedures. The ST and BH procedures failed to control FDR at the desired level of 0.05 in the

majority of scenarios, and FDR increased with sample size, reaching more than 0.6 under scenarios

with sample sizes of 2,000 or 5,000 and true eSNP MAFs ≥25% (Figure S2.1A). The BY procedure

was the most conservative method among pooled FDR procedures but still had inflated FDR under

scenarios with large sample sizes (≥1,000) and true eSNP MAFs ≥25%. As expected, a pooled

Bonferroni correction had very low FDR values in most scenarios, with the lowest sensitivity across

MAFs and sample sizes (Figure S2.1). However, even pooled Bonferroni correction failed to control

FDR of rare variant eQTLs (MAF ≤1%) in scenarios with <1,000 samples. Overall, we observed

inflated rates of false positive eGenes for all pooled FDR methods.

In contrast to pooled methods, we observed better calibrated FDR for hierarchical multiple testing

correction procedures, except in scenarios with low statistical power (Figure 2.2A, Figure S2.2). We

compared ST, BH, BY, Bonferroni, eigenMT, and three permutation approaches (discussed in a later

paragraph) for adjusting the cis-SNP P-values for each simulated gene (local correction), combined

with a comparison of the ST, BH, BY, and Bonferroni correction for adjusting the subsequent minimum

adjusted P-value across all genes (global correction).

Figure 2.2: False discovery rate (FDR) and sensitivity of selected hierarchical multiple testing correction methods. Comparison of the FDR (A) and sensitivity/true positive rate (TPR) (B) of six methods (different colours) for controlling multiple testing of SNPs at each gene (local correction), with Benjamini-Hochberg (BH) used to control for multiple testing across all genes (global correction). The six methods compared were Storey-Tibshirani (ST), Benjamini-Hochberg (BH), Benjamini-Yekutieli (BY), Bonferroni correction, eigenMT, and permutation tests based on beta approximation (BPerm1k). Application of BH in the global correction step had the best sensitivity for all methods used in the local correction step of any hierarchical correction procedures. Each dot represents one scenario and plots show different minor allele frequencies (MAFs) of the simulated causal eSNPs. The dashed horizontal lines in panel A indicate the desired FDR level of 5%. Scenarios where no significant eGenes were identified are not shown in panel A.

[Figure 2.2 plot layout: one plot per causal-eSNP MAF (0.5%, 1%, 5%, 10%, 25%, 50%); x-axes: sample size (100, 200, 500, 1000, 2000, 5000); y-axes: FDR and TPR; legend (method): ST-BH, BH-BH, BY-BH, Bonferroni-BH, eigenMT-BH, BPerm1k-BH.]

We observed lower sensitivity as well as lower FDR than ST and BH when applying BY and Bonferroni

to correct across genes, regardless of which multiple testing correction method was used for local

correction (Figure S2.2, Figure S2.3). ST and BH global correction had identical performance, except

when permutation tests were used as local correction method, where ST had higher FDR than BH and

often had FDR slightly higher than 5% (Figure S2.2, Figure S2.3). We therefore subsequently focused

on the BH procedure to control for multiple testing across genes in hierarchical correction procedures.

We compared three different permutation approaches to correct for multiple testing at each gene: (1)

using exact permutation test P-values from 1,000 permutations (Perm1k-BH), (2) using P-values

obtained from beta distribution approximation of each null distribution’s tail after 1,000 permutations

(BPerm1k-BH), and (3) using beta approximation under an adaptive scheme where a minimum of 100

and a maximum of 10,000 permutations were performed for each gene based on the significance level

of this gene (APerm10k-BH). Due to the prohibitive computational time required to run Perm1k and

APerm10k, we ran 10 simulations rather than 100 to compare the three permutation approaches.

Perm1k-BH had lower sensitivity than the other two permutation approaches in scenarios with low

detection power and it also had a higher FDR (Figure S2.4). BPerm1k and APerm10k had similar

performance, indicating 1,000 permutations were sufficient to obtain an accurate approximation of the

P-value null distribution tail. We therefore used BPerm1k-BH as a representative of permutation

approaches to compare with other multiple testing correction methods.

Amongst the hierarchical correction methods with BH as global correction, BY adjustment of multiple

SNPs (BY-BH) had the most conservative FDR among all methods, more so than Bonferroni-BH due

to BY’s heavier correction for the lowest P-values; however, this came at the expense of lower

sensitivity (Figure 2.2). Besides BY-BH, other methods did not show a notable difference in sensitivity.

Perhaps surprisingly, Bonferroni-BH maintained a comparable sensitivity to other methods while

having an FDR well below 0.05. In terms of calibration, eigenMT-BH had an FDR closest to 0.05 and

was relatively stable with respect to sample size, whereas other methods showed an inverse relationship

between FDR and sample size. In the 2.3 Discussion, we explore the trade-offs of FDR calibration

versus minimisation for a given power. Below, we utilise the eigenMT-BH procedure to illustrate the

ramifications of our findings for eQTL study design, while also noting that design differences between

Bonferroni-BH and eigenMT-BH would be minor.

These observations were robust under a variety of more complex simulations. Relative performance of

hierarchical multiple testing procedures in terms of FDR calibration and sensitivity remained the same

when simulating (i) log-normal noise (2.4.2 Materials and Methods; Figure S2.5), (ii) correlated

expression via a shared causal cis-SNP (2.4.2 Materials and Methods; Figure S2.6), (iii) dominant

and recessive causal SNPs (2.4.2 Materials and Methods; Figure S2.7, Figure S2.8), and (iv) multiple

causal cis-SNPs per eGene (2.4.2 Materials and Methods; Figure S2.9). However, there were notable

scenarios where FDR was inflated above 5%. Simulations of log-normal noise without inverse normal

transformation resulted in FDR approaching 1.0 due to pervasive outliers, produced by extreme noise

that coincided with low MAF variants (Figure S2.10). Simulations of correlated gene expression

(Figure S2.6) showed reduced FDR control at low power across all methods compared to uncorrelated

gene expression.

Across all effect sizes and using the eigenMT-BH procedure (Figure 2.2), it was apparent that (i) eSNPs

with ≤0.5% and ≤1% MAF that were detected with <1,000 and <500 samples, respectively, were likely

to be false discoveries, and (ii) for studies with 100 samples, a MAF threshold of 10% is necessary to control

FDR at ≤5% irrespective of hierarchical multiple testing procedure. Recessive eSNPs detected with

standard eQTL analyses (i.e. using linear models) were largely false discoveries when MAF was ≤25%

in 100 samples, or MAF ≤10% in up to 1,000 samples (Figure S2.8). In varying the eSNP effect size

(0.25, 0.5, 1.0, or 1.5 s.d. gene expression per allele), we found that sample sizes up to 200 (quite

common in the eQTL literature) only reached 80% power for eQTLs of ≥5% MAF and effect size 1.5

s.d. per allele or for eQTLs of 50% MAF and effect size of approximately ≥0.6 s.d. per allele (Figure

2.3). The maximum sample size of 5,000 in our simulations still did not reach 80% power to detect

eQTLs with effect size of 0.25 s.d. per allele and <5% MAF. When sample sizes were >1,000 and

MAF >25%, eQTLs with effect size of 0.25 s.d. per allele could be detected at power 80%. Studies of

100 samples were underpowered unless eQTLs were moderately common (at least 25% MAF) and of

large effect size (≥1.0 s.d. per allele).

Figure 2.3: Power and eQTL effect size. A constant genetic effect size (0.25, 0.5, 1.0, or 1.5 s.d. gene expression per allele) was simulated in each scenario. Plots represent different minor allele frequencies (MAFs) of the simulated true eSNPs. Sample size increases from left to right on x-axes. The estimated statistical power for eGene detection from 100 simulations is shown on y-axes. A hierarchical correction procedure using eigenMT for local correction and BH for global correction (eigenMT-BH) was used to correct for multiple testing. The dashed horizontal lines indicate sufficient statistical power (0.8).

[Figure 2.3 plot layout: one plot per MAF (0.5%, 1%, 5%, 10%, 25%, 50%); x-axes: sample size (100, 200, 500, 1000, 2000, 5000); y-axes: power (0.0 to 1.0); legend (effect size): 1.5, 1, 0.5, 0.25.]

2.2.3 Identification of the simulated causal eSNP

When hierarchical multiple testing correction procedures had calibrated FDR for eGenes, we observed

multiple significant eSNPs at each true positive eGene (Figure S2.11) despite simulating only one

causal eSNP for each true eGene, as would be expected given LD. The number of SNPs significantly

associated with a true eGene increased with both sample size and true eSNP MAF, with >1,000

significant eSNPs identified per eGene on average in the scenario with the largest sample size (N =

5,000), true eSNP MAF (50%), and eQTL effect size (1.5 s.d. per allele) (Figure S2.11).

Many studies focus on the eSNP with the strongest association (lowest P-value) with each eGene (top

eSNP) when performing downstream analyses, such as enrichment analysis or effect size

estimation52,198. In our simulations, we found that while the power to detect the presence of an eQTL

increased with increasing MAF, the probability that the true causal eSNP was the top eSNP declined

(Figure 2.4A). However, holding MAF constant and increasing study power (increasing sample size

and effect size) resulted in increasing probability to detect the true causal eSNP (Figure 2.4A). In

scenarios with at least 1% power to detect an eQTL, top eSNPs with MAF 0.5% were nearly always

the true causal eSNP. Given the critical role of LD in fine-mapping, we confirmed our observations

were due to a positive relationship between an eSNP's MAF and the amount of local LD (Figure 2.4B).

For top eSNPs that were not true causal eSNPs, 83% were in high LD (r² ≥0.8) with the true causal

eSNPs (Figure S2.12). Overall, for studies with 80% power to detect a given eQTL of MAF ≤25%, the

top eSNP was the true causal eSNP 90% of the time.

We next investigated the sensitivity and FDR of typical conditional analyses to identify and distinguish

between multiple causal eQTL signals, using the nominal eSNP significance P-value threshold

determined by eigenMT-BH correction (2.4.4 Materials and Methods). FDR among independent

eQTL signals identified by conditional analyses decreased as sample size increased (Figure S2.13).

FDR was slightly inflated when multiple causal eSNPs had MAFs of 50% (Figure S2.13), consistent

with the inflated FDR observed in the initial eQTL scan (Figure S2.9), because of the presence of

negatively correlated minor allele dosages between the causal eSNPs of an eGene, which was more

often observed when causal eSNPs had MAFs of 50% (Figure S2.14). In scenarios where MAFs of

causal eSNPs were ≥25%, conditional analyses identified additional causal eSNPs that were not

significant in the initial eQTL mapping step (Figure S2.15). Among the top SNPs (at each independent

locus) ≥80% were the causal eSNPs (or in perfect LD) when causal eSNPs had MAF of ≤25% (Figure

S2.16). The proportion was lower when the MAF of causal eSNPs was 50%, consistent with scenarios

with a single simulated causal eSNP (Figure 2.4A).

Figure 2.4: Identification of true causal eSNPs. In each scenario, the 200 causal eSNPs have the same effect size and the same minor allele frequency (MAF). For significant true eGenes, the proportion of top eSNPs (minimum P-value) that were true causal eSNPs (or in perfect LD) is shown (y-axes) against either (A) the power to detect eQTLs of the scenario, or (B) the amount of linkage disequilibrium (LD) for true causal eSNPs, i.e. the average number of SNPs within 1Mb and in moderate LD (r² ≥0.5) with the causal eSNP. Scenarios are coloured according to true eSNP MAF. Only scenarios with power ≥0.01 are shown. A hierarchical correction procedure using eigenMT for local correction and BH for global correction (eigenMT-BH) was used to identify eGenes.

2.2.4 Winner’s Curse in eQTL effect size estimation

To systematically evaluate the effect of Winner’s Curse in eQTL studies, we compared beta coefficients

obtained from the Matrix eQTL linear regression models for the top eSNP of each true positive eGene

(the “naïve estimator”) to their simulated true effect sizes. We observed that the median error of the naïve

estimator increased as study power decreased, as expected, and also that the naïve estimator consistently

overestimated the true effect size with overestimation increasing as power to detect an eQTL decreased

(Figure 2.5, Figure S2.17).

To address this, we investigated various methods for re-estimating effect sizes. Methods have been

proposed to correct for Winner’s Curse in GWAS208,219, but to our knowledge, no method has yet been

designed for bias correction in eQTL studies. We adapted a bootstrap resampling method220 for eQTL

studies and compared three bootstrap estimators (a shrinkage estimator, an out-of-sample estimator,

and a weighted estimator; 2.4.5 Materials and Methods) to determine the best approach for adjusting

for Winner’s Curse. All three bootstrap estimators had more accurate effect size estimates (smaller

mean squared error and median error closer to 0) than the naïve estimator when power of eQTL

detection was low to moderate (Figure 2.5B, Figure S2.18). Amongst the three bootstrap estimators,

the shrinkage estimator was closest to the true effect size overall and across all study powers. In

scenarios with high power for eQTL detection, Winner’s Curse was not apparent, and the bootstrap


shrinkage estimator and naïve estimator had similar estimates (Figure S2.19). The bootstrap method

for eQTL studies is freely available at https://github.com/InouyeLab/BootstrapQTL.

Figure 2.5: Winner’s Curse in eQTL effect size estimation and correction by bootstrap methods. (A) shows the phenomenon of Winner’s Curse in three example scenarios, where the sample size is 200 and the minor allele frequencies (MAFs) of causal eSNPs are 5%, 10%, and 25%. Each dot represents one true positive eGene from ten simulations of the scenario. Plots compare the estimated effect size (y-axes) of the top SNP of each true positive eGene to the true effect size (x-axes) of the simulated causal eSNP. Red points show the naïve estimator (beta coefficient from linear regression) and blue points show the bootstrap shrinkage estimator, which was the best estimator (see panel B). Red (or blue) lines are the linear regression fit of the naïve estimator (or the bootstrap estimator) on the simulated effect size for the true positive eGenes. Black dashed lines in panel A indicate where the estimated effect size equals the true value. (B) shows the median error (difference between estimated and true effect size) for all estimators across 10 simulations of scenarios where a constant true effect size (0.25, 0.5, 1, or 1.5 s.d. gene expression per allele) was simulated. A hierarchical correction procedure using eigenMT for local correction and BH for global correction (eigenMT-BH) was used to correct for multiple testing.

2.3 Discussion

In this study, we have utilized extensive, realistic simulations of eQTL data to investigate fundamental

questions in eQTL study design relating to power, FDR and effect size estimation. The most commonly

used MAF cut-offs in recent eQTL studies are 1% or 5% (Table S2.1). For instance, GTEx restricted

the association tests to SNPs with minor allele count ≥10 in the tissue analysed, corresponding to MAFs

of 7% and 1.4% at the minimum (70) and maximum (361) sample sizes, respectively52. In our

simulations, we found that eQTLs with a small MAF identified in low sample sizes were highly likely

to be false positives, regardless of which multiple testing correction strategy was used (Figure 2.2,

Figure S2.1A, Figure S2.2). Based on these results, when 100, 200, and 500 samples are available (typical


in eQTL studies), we recommend MAF cut-offs of 10%, 5% and 1%, respectively. Many studies listed

in Table S2.1 had a lower MAF cut-off than recommended. Detecting rare eQTLs with MAF 0.5% is

possible in ≥2,000 samples, but even 5,000 samples cannot provide sufficient power unless the eQTL

effect size is extremely high: ≥1 s.d. gene expression per allele dosage (Figure 2.2, Figure 2.3).
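
As a worked check of the MAF values quoted above for the GTEx minor allele count (MAC) filter, assuming diploid samples so that a study of $N$ individuals carries $2N$ alleles, the smallest MAF that passes a MAC threshold is:

$$\mathrm{MAF}_{\min} = \frac{\mathrm{MAC}_{\min}}{2N}, \qquad \frac{10}{2 \times 70} \approx 7.1\%, \qquad \frac{10}{2 \times 361} \approx 1.4\%$$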

Recent eQTL studies have used pooled FDR methods to correct for multiple testing85,91,109,221,222. Here,

we show that pooled methods are inappropriate for eQTL studies, as they give inflated (sometimes

substantially inflated) FDR, which worsens as sample size or eSNP MAF increases (Figure S2.1). This suggests

that many eQTLs identified in these studies may be false positives. Hierarchical multiple testing

correction procedures had substantially better calibrated FDR. A hierarchical approach of permutation

as local correction method followed by ST global adjustment is commonly used in eQTL studies (e.g.

by GTEx52). When permutation was used as the local correction method, ST often had FDR slightly higher

than the desired level in our simulations, whereas BH gave better calibrated FDR.

Notably, when any local correction method other than permutation tests was used to correct for the multiple

SNPs at each gene, ST and BH adjustment across genes gave identical results. We therefore recommend using BH

to adjust across genes rather than ST.

Most hierarchical procedures had nearly identical sensitivity when BH was used to correct for multiple

testing across genes, thus FDR was a differentiating factor (Figure 2.2). Here, when studies were

appropriately powered, eigenMT-BH was the most closely calibrated approach for controlling FDR at

5%, and it had the least variable FDR across different sample sizes. Although eigenMT-BH had FDR

inflated above 5% in our simulations of proximal correlated genes and recessive causal eSNPs (Figure

S2.6A, Figure S2.8A), these simulations represent worst case scenarios rather than realistic data. We

expect only a fraction of eQTLs to comprise recessive effects, and we do not expect all causal eSNPs to

regulate all of the highly correlated genes within a 1Mb window. Thus, we expect eigenMT-BH

should control FDR at 5% in real eQTL datasets. On the other hand, Bonferroni-BH had the smallest

FDR with negligibly lower sensitivity. The trade-offs between the use of Bonferroni-BH versus

eigenMT-BH are best considered in the context of the specific study. Statistically, calibration is perhaps

the deciding factor; if the analysis is intended to guide time-consuming experimental follow-up of

specific eQTLs then it may be preferable to minimise FDR for a given detection power.

After eGene detection, identification of the causal eSNP among the significant eSNPs with high LD

remains a challenge. Interestingly, we found that the most significant eSNP was the simulated causal

eSNP approximately 90% of the time. When the top variant was not the causal variant, ~80% of the

time the top eSNP was in high LD (r² ≥0.8) with it. The proportion of sentinel variants that were the causal

eSNP was slightly lower, 80%, in conditional analyses applied in simulations of multiple causal eSNPs,

motivating the use of fine-mapping approaches when there is evidence for multiple independent causal

eSNPs.


Winner’s Curse in eQTL effect size estimation must be taken into account when comparing effect sizes

from different tissue types or conditions, estimating replication sample size, or constructing predictive

models. For example, a recent study compared cis-eQTL effects between blood samples (N = 1,240

samples) and four other tissues (N <85 samples), identifying >2,000 probes with cis-eQTL associations

that were tissue-dependent, and nearly half were with the same eSNP but with a different effect size223.

This may be an artefact of Winner’s Curse. To address eQTL effect over-estimation, we have presented

a bootstrap method and tool for re-estimation which should enable more accurate eQTL comparisons

as well as predictive genetic models for gene expression for less accessible tissues, cell types, conditions

or other situations where power is limited.

Since most eQTL studies focus on cis-eQTL mapping, there are limited findings of trans-eQTLs, thus

realistic simulation of trans-eQTL datasets remains a challenge. Many of the multiple testing correction

methods evaluated in our simulations are designed for cis-eQTL mapping only, such as those involving

FastQTL and eigenMT. To deal with the multiple testing problem in trans-eQTL analysis, permutations

would be time consuming for a whole genome scan, and one might consider estimating the number of

independent gene expression traits and applying a Bonferroni correction. In terms of Winner’s Curse in

effect size estimation, the bootstrap approach to reduce the upward bias would still be applicable in a

trans-eQTL setting.

The investigation of the genetic component of transcriptional variation has become an essential part of

linking genotype to phenotype66. Despite the increasing scale of eQTL studies (e.g. 5,257 samples in

Yao et al.103, and Zhernakova et al.198), fundamental questions about study design and analysis strategies

have remained unanswered. Here, we have investigated the sensitivity and FDR of diverse multiple

testing strategies, the factors contributing to the identification of the causal eSNP, and the correction of

eQTL effect size overestimation using a simple tool, BootstrapQTL. The insights from our simulation

study are likely not limited to eQTL analysis, and may extend to other studies of genome-related

quantitative traits, such as chromatin accessibility, methylation and other epigenetic traits.

2.4 Materials and Methods

2.4.1 Simulating genotypes and selecting eQTLs

Genotype data were simulated using HAPGEN2214 based on the 99 FIN haplotypes of chromosome 22

from the 1000 Genomes Project data (phase 3, GRCh37)5. The simulated genotypes had LD

patterns similar to those of the reference data. Six sets of genotype data were generated at varying sample sizes: 100,

200, 500, 1,000, 2,000, and 5,000 individuals. After filtering out SNPs with MAF less than 0.5% or


Hardy-Weinberg Equilibrium (HWE) P-value less than 5×10⁻⁶, approximately 150,000 SNPs remained

in each data set.

We explored six different true eSNP MAFs (0.5%, 1%, 5%, 10%, 25%, 50%) in each of the six genotype

datasets, resulting in 36 scenarios in total. In each scenario, 200 SNPs at the scenario MAF were

randomly chosen as true causal eSNPs, each regulating the expression of a randomly selected cis gene

(within ±1Mb from transcription start site of the gene). These 200 causal eSNPs were selected from an

LD pruned subset where the pairwise r² was ≤0.3.

2.4.2 Simulating gene expression

To obtain a distribution of cis-eQTL effect sizes, we first performed eQTL mapping in the DILGOM dataset215

using an additive linear model with covariates that accounted for gender, age, and population structure.

Expression data were further scaled to make each gene’s expression across samples follow a standard

normal distribution. To avoid an inflated number of associations due to LD structure among variants,

we kept only the best association (minimum nominal P-value) for each gene. As shown in the

simulation results, only eQTLs with large effect sizes could be identified given a limited sample size.

To reduce the bias caused by limited power, we included all genes to obtain the effect size distribution

and fit it with a gamma distribution, from which we randomly selected true effect sizes.

First, we performed a set of simulations in which the expression of 200 genes was simulated, each

regulated by a single causal eSNP, varying the study sample size, as well as the MAF and effect size of

the causal eSNP. In each scenario, 200 genes out of 618 genes on chromosome 22 were designated as

“true eGenes” regulated by a causal eSNP each and the remaining 418 as “null genes” with no truly

associated eSNPs. The 200 true associations were modelled by a simple linear regression:

$$Y_i = \beta \times X_i + \epsilon_i \quad \text{with} \quad \epsilon_i \sim N(0, 1)$$

where $Y_i$ denotes the expression level of an eGene for individual $i$, $\beta$ the genetic effect size of the corresponding eSNP, $X_i$ the minor allele dosage of the eSNP coded as 0, 1 or 2, and $\epsilon_i$ the error term for the $i$th individual, which followed a standard normal distribution. For the 418 null genes, no genetic effects were simulated ($\beta = 0$) and the simulated expression was normally distributed. True eGene effect sizes were randomly drawn from a gamma distribution derived from a real dataset as described above. In scenarios where causal eSNPs had a constant effect size, $\beta$ was 0.25, 0.5, 1, or 1.5.
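
The single-eSNP model above can be illustrated with a minimal sketch. It is not the thesis pipeline: genotypes here are drawn as two independent allele draws at the scenario MAF (an assumption made for self-containment, whereas the thesis simulated genotypes with HAPGEN2), and the helper names `simulate_dosages`, `simulate_expression` and `ols_slope` are hypothetical.

```python
import random


def simulate_dosages(n_samples, maf, rng):
    """Draw minor allele dosages (0, 1 or 2) as two independent allele draws.

    Assumption for illustration only: the thesis used HAPGEN2 genotypes.
    """
    return [int(rng.random() < maf) + int(rng.random() < maf)
            for _ in range(n_samples)]


def simulate_expression(dosages, beta, rng):
    """Simulate Y_i = beta * X_i + eps_i with eps_i ~ N(0, 1)."""
    return [beta * x + rng.gauss(0.0, 1.0) for x in dosages]


def ols_slope(x, y):
    """Least-squares slope of y on x (the naive effect size estimate)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx


rng = random.Random(2019)
dosages = simulate_dosages(5000, 0.25, rng)
egene_expr = simulate_expression(dosages, 0.5, rng)  # true eGene, beta = 0.5
null_expr = simulate_expression(dosages, 0.0, rng)   # null gene, beta = 0
print(round(ols_slope(dosages, egene_expr), 2),
      round(ols_slope(dosages, null_expr), 2))
```

With a large sample the fitted slope for the true eGene should sit close to the simulated $\beta$, and the slope for the null gene close to zero.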

Additional simulations were performed to examine the consequences of the following for multiple

testing correction: the assumption of error normality in the simulations, correlation structure amongst

gene expression, non-linear eSNP effects, and multiple causal eSNPs. In all simulations, including those


above, 100 replicates were performed to obtain estimates of sensitivity and FDR under each scenario

for each multiple testing correction method described in the next section.

To examine the assumption of error normality, we simulated gene expression as described above, but

with the error term drawn from a log-normal distribution (with the mean and s.d. of the variable’s

natural logarithm being 0 and 1, respectively). Simulations were additionally performed in which gene

expression profiles were inverse rank normalised across samples using the rntransform function in the

GenABEL R package224.

To examine the effect of gene coexpression on eQTL mapping, we simulated correlated expression

amongst adjacent genes arising from a single shared causal eSNP. Chromosome 22 was divided into 35

genomic blocks with a length of 1Mb. Two hundred true eGenes were randomly selected from all 618

genes, and true eGenes within each genomic block were simulated to have correlated gene expression

levels, sharing the same causal eSNP. Correlated expression levels $Y_1, Y_2, \ldots, Y_i$ for each true eGene$_1$, eGene$_2$, …, eGene$_i$ in block $j$ were simulated as follows:

$$Y_1 = \beta_j \times X_j + \epsilon_1$$

$$Y_2 = \beta_j \times X_j + r_2 \times \epsilon_1 + \sqrt{1 - r_2^2} \times \epsilon_2$$

$$\ldots$$

$$Y_i = \beta_j \times X_j + r_i \times \epsilon_1 + \sqrt{1 - r_i^2} \times \epsilon_i$$

All $i$ true eGenes in this block shared a causal eSNP$_j$, which was coded as (0, 1, 2), and had the same genetically regulated component ($\beta_j \times X_j$), where $\beta_j$ was the effect size of the SNP on each true eGene. Error terms ($\epsilon_1, \epsilon_2, \ldots, \epsilon_i$) followed a standard normal distribution. For each eGene$_i$ other than the first (eGene$_1$), the noise component ($r_i \times \epsilon_1 + \sqrt{1 - r_i^2} \times \epsilon_i$) followed a standard normal distribution that was correlated with the error term of eGene$_1$ ($\epsilon_1$) with a correlation coefficient $r_i$, which was randomly drawn from a uniform distribution U(0.6, 0.9).
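
The construction of the correlated noise component can be checked numerically. The sketch below is illustrative only; the function names `correlated_noise`, `pearson` and `variance` are hypothetical, and a large sample is used purely so that the empirical correlation is stable.

```python
import math
import random


def correlated_noise(eps_first, r, rng):
    """Return r * eps_1 + sqrt(1 - r^2) * eps_i with fresh eps_i ~ N(0, 1)."""
    scale = math.sqrt(1.0 - r * r)
    return [r * e + scale * rng.gauss(0.0, 1.0) for e in eps_first]


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)
    return cov / math.sqrt(variance(a) * variance(b))


rng = random.Random(7)
eps_1 = [rng.gauss(0.0, 1.0) for _ in range(20000)]  # error term of eGene 1
r_2 = rng.uniform(0.6, 0.9)                          # correlation for eGene 2
noise_2 = correlated_noise(eps_1, r_2, rng)          # noise component of eGene 2
print(round(r_2, 3), round(pearson(eps_1, noise_2), 3), round(variance(noise_2), 3))
```

The noise component keeps unit variance because $r_i^2 + (1 - r_i^2) = 1$, while its correlation with $\epsilon_1$ equals $r_i$.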

To examine the impact of non-linear eSNP effects on multiple testing correction, two additional simulations

were performed: one in which all causal eSNPs had dominant effects, and one in which all causal

eSNPs had recessive effects. To simulate dominant effects, causal eSNPs were coded as (0, 2, 2) based

on the absence/presence of one or more copies of the minor allele. Conversely, to simulate recessive

effects, causal eSNPs were coded as (0, 0, 2). Apart from the causal eSNP coding, simulations were as

described at the beginning of this section, where 200 true eGenes were randomly selected and simulated

to have a single causal eSNP with a standard normal error term.
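
The dominant and recessive codings described above amount to a simple recoding of the additive dosage. A minimal sketch, with a hypothetical helper name `recode_dosage`:

```python
def recode_dosage(dosage, model):
    """Recode an additive minor allele dosage (0, 1, 2) under a genetic model.

    additive:  (0, 1, 2) unchanged
    dominant:  (0, 2, 2), i.e. any copy of the minor allele has the full effect
    recessive: (0, 0, 2), i.e. only two copies of the minor allele have an effect
    """
    if dosage not in (0, 1, 2):
        raise ValueError("dosage must be 0, 1 or 2")
    if model == "additive":
        return dosage
    if model == "dominant":
        return 2 if dosage >= 1 else 0
    if model == "recessive":
        return 2 if dosage == 2 else 0
    raise ValueError("unknown model: " + model)


for model in ("additive", "dominant", "recessive"):
    print(model, [recode_dosage(d, model) for d in (0, 1, 2)])
```

Expression is then simulated from the recoded dosage, while eQTL mapping still tests the additive dosage, which is what makes these scenarios a test of model misspecification.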


To examine the effect of multiple causal eSNPs on multiple testing correction, two additional simulations were performed: one in which each true eGene was regulated by two causal eSNPs, and one in which each true eGene was regulated by three causal eSNPs, with additive effects on gene expression as follows:

Two causal eSNPs:

$$Y = \beta_1 \times X_1 + \beta_2 \times X_2 + \epsilon$$

Three causal eSNPs:

$$Y = \beta_1 \times X_1 + \beta_2 \times X_2 + \beta_3 \times X_3 + \epsilon$$

where effect sizes $\beta_1$, $\beta_2$, and $\beta_3$ were drawn from the gamma distribution described at the start of the section, based on the distribution of effect sizes observed in a real dataset. The terms $X_1$, $X_2$, and $X_3$ describe the minor allele dosage of each causal eSNP, respectively. In each simulation, the first causal eSNP was randomly selected as described above, based on the desired MAF for each scenario. Additional causal eSNPs at each eGene were randomly selected from nearby variants in LD with $X_1$, based on the distribution of LD correlation between multiple causal eSNPs observed in a conditional eQTL study of around 5,000 peripheral blood samples225, following a beta distribution with the shape parameters 2.6 and 4.5. Using this selection scheme, MAFs tended to be similar across the multiple causal eSNPs at each eGene (Figure S2.20). The error term $\epsilon$ was drawn from a standard normal distribution as described above.

2.4.3 Mapping eQTLs and correcting for multiple testing

For cis-eQTL analysis, we used Matrix eQTL217 to fit linear regression models between each gene and

the minor allele dosage of all SNPs located within 1Mb of their transcription start site. To adjust for

multiple tests, we applied either (1) a correction method to all hypotheses (pooled method), or (2) a

hierarchical correction procedure, where two methods were used in combination to correct for multiple

SNPs tested for each gene and multiple genes separately.

Pooled multiple testing correction was performed using either Bonferroni correction or FDR-controlling

procedures applied to all SNP–gene hypothesis tests. Bonferroni correction (pooled Bonferroni),

Benjamini-Hochberg199 (pooled BH), and Benjamini-Yekutieli200 (pooled BY) FDR procedures were

performed using the p.adjust function in R226, and the Storey-Tibshirani201 (pooled ST) procedure was

performed using the R package qvalue227.


A three-step procedure was employed to perform hierarchical multiple testing correction. In Step 1,

P-values of all cis-SNPs were adjusted for multiple testing for each gene separately (locally adjusted

P-value). In Step 2, the minimum adjusted P-value from Step 1 was taken for each gene, then these

adjusted P-values were further adjusted for multiple testing across all genes (globally adjusted P-value).

Finally, in Step 3, significant eSNPs were identified for each significant eGene as SNPs with a locally

adjusted P-value from Step 1 lower than the locally adjusted minimum P-value corresponding to the

globally adjusted P-value threshold of 0.05.
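
The three steps can be sketched for one combination, Bonferroni for local correction and BH for global correction. This is a minimal illustration rather than the thesis code. Two details are assumptions: the locally adjusted threshold in Step 3 is taken as the largest locally adjusted minimum P-value among significant eGenes, and SNPs at or below that threshold are kept. The function names and the toy P-values are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P-value downwards
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def hierarchical_correction(nominal, alpha=0.05):
    """nominal maps gene -> {snp: nominal P-value}; returns gene -> significant SNPs."""
    # Step 1: local Bonferroni adjustment for the cis-SNPs tested at each gene.
    local = {gene: {snp: min(1.0, p * len(snps)) for snp, p in snps.items()}
             for gene, snps in nominal.items()}
    # Step 2: global BH adjustment of the per-gene minimum locally adjusted P-values.
    genes = list(local)
    min_local = [min(local[gene].values()) for gene in genes]
    global_adj = bh_adjust(min_local)
    significant = [g for g, q in zip(genes, global_adj) if q <= alpha]
    if not significant:
        return {}
    # Step 3 (assumed rule): locally adjusted threshold matching the global cut-off.
    threshold = max(p for p, q in zip(min_local, global_adj) if q <= alpha)
    return {g: sorted(s for s, p in local[g].items() if p <= threshold)
            for g in significant}


toy_pvalues = {
    "geneA": {"rs1": 1e-6, "rs2": 2e-4, "rs3": 0.3},
    "geneB": {"rs4": 0.01, "rs5": 0.5},
    "geneC": {"rs6": 0.2, "rs7": 0.9, "rs8": 0.6, "rs9": 0.4},
}
print(hierarchical_correction(toy_pvalues))
```

Swapping the local step for another method (for example, multiplying by an effective number of tests instead of the number of SNPs) changes only Step 1.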

Hierarchical multiple testing correction was performed using different combinations of multiple testing

correction methods in Step 1 and Step 2 described above. In Step 1, we applied FDR procedures (ST,

BH, or BY), Bonferroni, eigenMT204, or permutation approaches to correct for multiple local SNPs

tested for each gene. In Step 2, we applied one of the three FDR-controlling procedures, or Bonferroni correction,

to control the rate of false positive eGenes. Note that eigenMT and permutation approaches are used

hierarchically by design.

When Bonferroni was used as a local correction method, the adjusted P-value was calculated by

multiplying each linear model P-value by the number of SNPs in the corresponding 1Mb cis window

for the tested gene. When using eigenMT, the linear model P-value was multiplied by the number of

effective independent tests estimated from the genotype correlation matrix by eigenMT (in Python

2.7.3)204. Permutations were performed by shuffling sample labels of expression data. For each gene,

minimum nominal P-values from all permutation tests were kept to obtain the null distribution.

Permutation P-values were calculated as the proportion of permutations showing a more significant

minimum P-value than the observed nominal P-value. The null distribution used to calculate

permutation P-values was either (1) the exact distribution from permutations (exact permutation

scheme), or (2) a beta distribution approximation of the null distribution tail, which is implemented in

FastQTL (version 2.0)202. When using FastQTL, we performed either a fixed number of permutations

(1,000) or, under an adaptive scheme, between 100 and 10,000 permutations, with the number determined

via iterative estimates of gene significance throughout the permutation procedure.
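
The exact permutation scheme can be illustrated as follows. This sketch is not FastQTL and makes one substitution for brevity: instead of the minimum nominal P-value, it uses the maximum absolute correlation across cis-SNPs as the test statistic, which gives the same ordering in a simple linear regression with a fixed sample size. Ties are counted as at least as extreme, which is slightly conservative relative to "more significant". All names are hypothetical.

```python
import math
import random


def abs_corr(x, y):
    """Absolute Pearson correlation; 0.0 if either variable is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return abs(sxy) / math.sqrt(sxx * syy)


def permutation_pvalue(genotypes, expression, n_perm, rng):
    """Gene-level P-value from shuffling the sample labels of the expression data."""
    observed = max(abs_corr(snp, expression) for snp in genotypes)
    shuffled = list(expression)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks the genotype-expression link, keeps LD intact
        best = max(abs_corr(snp, shuffled) for snp in genotypes)
        if best >= observed:
            hits += 1
    return hits / n_perm


rng = random.Random(11)
n_samples, maf = 100, 0.3
cis_snps = [[int(rng.random() < maf) + int(rng.random() < maf)
             for _ in range(n_samples)] for _ in range(5)]
# Expression driven by the first SNP with an effect of 1 s.d. per allele.
expr = [1.0 * x + rng.gauss(0.0, 1.0) for x in cis_snps[0]]
p_gene = permutation_pvalue(cis_snps, expr, n_perm=200, rng=rng)
print(p_gene)
```

Because the whole expression vector is shuffled at once, every cis-SNP is tested against the same permuted phenotype, so the null distribution of the best statistic reflects the LD between SNPs.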

When calculating the sensitivity and FDR of multiple testing correction methods, true positives and

false discoveries were calculated at the gene level. If any significant SNPs were in high LD (r² ≥0.8)

with any simulated causal eSNP, an eGene was considered a true positive. Conversely, if there were

significant SNPs for an eGene but it was not simulated to be a true eGene, or no significant SNPs were

in high LD (r² ≥0.8) with any simulated causal eSNP, it was considered a false discovery.
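
The gene-level definitions above translate into a short calculation. In this sketch the inputs are assumed to be precomputed sets of gene identifiers, and the function and argument names are hypothetical; `tagged_genes` stands for the genes with at least one significant SNP in high LD with a simulated causal eSNP.

```python
def gene_level_metrics(true_egenes, significant_genes, tagged_genes):
    """Return (sensitivity, FDR) with true positives and false discoveries per gene.

    true_egenes:       genes simulated with a causal eSNP
    significant_genes: genes with at least one significant SNP
    tagged_genes:      genes where a significant SNP is in high LD with a causal eSNP
    """
    true_egenes = set(true_egenes)
    significant_genes = set(significant_genes)
    true_positives = significant_genes & true_egenes & set(tagged_genes)
    false_discoveries = significant_genes - true_positives
    sensitivity = len(true_positives) / len(true_egenes)
    # FDR is undefined when nothing is significant.
    fdr = (len(false_discoveries) / len(significant_genes)
           if significant_genes else None)
    return sensitivity, fdr


sensitivity, fdr = gene_level_metrics(
    true_egenes={"g1", "g2", "g3", "g4"},
    significant_genes={"g1", "g2", "g5"},
    tagged_genes={"g1"},
)
print(sensitivity, fdr)
```

In the toy call, g2 is a true eGene whose significant SNPs do not tag the causal eSNP, so it counts as a false discovery alongside the null gene g5.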


2.4.4 Conditional analyses

In simulations of multiple causal eSNPs, a two-stage conditional analysis228 was performed to identify

independent eSNPs for each significant eGene after eQTL mapping and hierarchical multiple testing

correction with eigenMT-BH. The nominal P-value threshold corresponding to the global correction

FDR 0.05 cut-off calculated via eigenMT-BH in the initial eQTL scan was used to determine

significance in the conditional analysis.

The conditional analysis comprised two stages: a forward stage, and a backward stage. The forward

stage consisted of an iterative procedure. At each iteration, cis-eQTL mapping was performed for each

significant eGene, adjusting for the top SNP identified in the initial eQTL mapping. If any SNPs

remained significant after adjusting for this top SNP, the new top SNP was added to the list of

independent eQTL signals and adjusted for in subsequent iterations. If no SNPs were significant, the

iterative procedure terminated and proceeded to the backward stage. In the backward stage, each

independent eQTL signal was tested separately using a leave-out-one model adjusting for all other SNPs

in the list of independent eQTL signals as covariates. The final set of independent eQTLs comprised

the set of eSNPs that remained significant in the backward stage.

When calculating the sensitivity and FDR of the conditional analyses, each independent eQTL signal

was considered a true positive if it was in high LD (r² ≥0.8) with any simulated causal eSNP. Conversely,

an independent eQTL signal was considered a false discovery if it was not in high LD with any

simulated causal eSNP. Where two or more independent eQTL signals were identified and in high LD

with any causal eSNP, only one signal was considered a true positive while all others were considered

false discoveries.

2.4.5 Correcting for Winner’s Curse

To evaluate and correct the effect of the Winner’s Curse, we considered the effect size estimates of the

SNP with the minimum P-value (top eSNP) for each eGene. We use $\hat{\beta}_{\mathrm{N}(e)}$ to denote the “naïve

estimator”: the beta coefficient obtained from the linear regression of each eGene on its top eSNP.

We adapted a bootstrap method220 to re-estimate eQTL effect sizes of significant eGenes determined

by a hierarchical correction procedure (Bonferroni-BH by default; eigenMT-BH is also recommended).

This approach consists of a repeated bootstrap analysis, in which random samples are drawn with

replacement from the study dataset to partition the study samples into two groups: a bootstrap detection

group of identical size to the original dataset comprising samples randomly selected with replacement,

and a bootstrap estimation group comprising the remainder of the study samples. Due to the sampling

with replacement, the bootstrap detection group typically comprised 63.2% of the study samples while


the bootstrap estimation group comprised the other 36.8% of samples. The effect size is then estimated

separately in the bootstrap detection and estimation groups for each eGene and its top eSNP based on

the original dataset.
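
The 63.2% and 36.8% figures follow from sampling $n$ times with replacement from $n$ samples. As a short derivation, the probability that a given sample appears at least once in the bootstrap detection group is:

$$1 - \left(1 - \frac{1}{n}\right)^{n} \;\longrightarrow\; 1 - e^{-1} \approx 0.632 \quad \text{as } n \to \infty$$

so the expected proportion of samples left for the estimation group is approximately $e^{-1} \approx 0.368$. For a finite study, for example $n = 200$, the first expression evaluates to approximately 0.633.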

After performing the above procedure with 200 bootstraps, three bootstrap estimators were calculated and compared for eGene effect size re-estimation:

a shrinkage estimator:

$$\hat{\beta}_{\mathrm{N}(e)} - \frac{1}{N(e)} \sum_{i=1}^{N(e)} \left( \hat{\beta}_{\mathrm{D}(e)i} - \hat{\beta}_{\mathrm{E}(e)i} \right)$$

an out-of-sample estimator:

$$\frac{1}{N(e)} \sum_{i=1}^{N(e)} \hat{\beta}_{\mathrm{E}(e)i}$$

and a weighted estimator:

$$(1 - w)\, \hat{\beta}_{\mathrm{N}(e)} + w\, \frac{1}{N(e)} \sum_{i=1}^{N(e)} \hat{\beta}_{\mathrm{E}(e)i}$$

where $\hat{\beta}_{\mathrm{N}(e)}$ is the naïve estimator for eGene $e$, $\hat{\beta}_{\mathrm{D}(e)i}$ denotes the effect size of eGene $e$ in each bootstrap detection group $i$, $\hat{\beta}_{\mathrm{E}(e)i}$ denotes the effect size of eGene $e$ in each bootstrap estimation group $i$, and $N(e)$ denotes the number of bootstraps in which the association between the eGene $e$ and its top eSNP was significant in the bootstrap detection group (thus $N(e)$ ≤200). An association between an eGene and its top eSNP was considered significant in the bootstrap detection group if its locally adjusted P-value (corrected for multiple cis-SNPs within 1Mb of the respective eGene using e.g. eigenMT or Bonferroni) was smaller than the locally adjusted P-value corresponding to the 0.05 threshold after global adjustment (e.g. BH) in the eGene detection analysis prior to performing the bootstrap procedure. For the weighted estimator, the weight $w$ was 0.632, i.e. the proportion of unique samples in the bootstrap detection group.
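
The three estimators can be sketched for a single eGene and its top eSNP. This is a simplified illustration, not the BootstrapQTL implementation: significance in the detection group is decided by a hypothetical threshold on the absolute t-statistic (`t_threshold`), standing in for the locally adjusted P-value threshold described above, and bootstraps in which either group cannot be fitted are skipped. All function and variable names are assumptions.

```python
import math
import random


def slope_and_t(x, y):
    """Return (slope, t-statistic) of a simple linear regression, or None if undefined."""
    n = len(x)
    if n < 3:
        return None
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:  # SNP is monomorphic in this resample
        return None
    beta = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    rss = sum((b - mean_y - beta * (a - mean_x)) ** 2 for a, b in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return beta, (beta / se if se > 0 else math.inf)


def bootstrap_estimators(x, y, t_threshold, n_boot=200, w=0.632, rng=None):
    """Naive, shrinkage, out-of-sample and weighted estimates for one SNP-gene pair."""
    rng = rng or random.Random()
    n = len(x)
    naive_fit = slope_and_t(x, y)
    if naive_fit is None:
        raise ValueError("effect size cannot be estimated in the full dataset")
    naive = naive_fit[0]
    diffs, out_of_sample = [], []
    for _ in range(n_boot):
        picked = [rng.randrange(n) for _ in range(n)]       # detection group
        picked_set = set(picked)
        left_out = [i for i in range(n) if i not in picked_set]  # estimation group
        detection = slope_and_t([x[i] for i in picked], [y[i] for i in picked])
        estimation = slope_and_t([x[i] for i in left_out], [y[i] for i in left_out])
        if detection is None or estimation is None:
            continue
        if abs(detection[1]) < t_threshold:  # not significant in the detection group
            continue
        diffs.append(detection[0] - estimation[0])
        out_of_sample.append(estimation[0])
    if not out_of_sample:
        return None
    mean_out = sum(out_of_sample) / len(out_of_sample)
    return {
        "naive": naive,
        "shrinkage": naive - sum(diffs) / len(diffs),
        "out_of_sample": mean_out,
        "weighted": (1 - w) * naive + w * mean_out,
        "n_significant": len(out_of_sample),
    }


rng = random.Random(42)
dosage = [int(rng.random() < 0.25) + int(rng.random() < 0.25) for _ in range(200)]
expression = [0.5 * d + rng.gauss(0.0, 1.0) for d in dosage]
estimates = bootstrap_estimators(dosage, expression, t_threshold=3.0, rng=rng)
print(estimates)
```

Only bootstraps in which the association is significant in the detection group contribute, which mirrors the selection step that causes Winner’s Curse in the original analysis.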


2.5 Supplemental Materials

Table S2.1: Sample size and minor allele frequency (MAF) cut-off used in recent eQTL studies.

Study | Tissue/cell type | Sample size | MAF cut-off
Brynedal et al., 2017 | Lymphoblastoid cell lines (LCLs) | 83–185 | 15%
GTEx consortium, 2017 | Various tissues (44 types) | 70–361 | 1% and minor allele count ≥10
Kim-Hellmuth et al., 2017 | Monocytes (resting and stimulated) | 134 | 5%
Ko et al., 2017 | Kidney | 96 | 5%
Manry et al., 2017 | Whole blood (resting and stimulated) | 51 | 10%
Ng et al., 2017 | Brain | 494 | 1%
Pala et al., 2017 | White blood cells | 624 | 1%
Peng et al., 2017 | Placentas | 159 | Minor allele count ≥5 (MAF approx. 1.6%)
Yao et al., 2017 | Whole blood | 5257 | 1%
Chen et al., 2016 | Three immune cell types | 197 | Minor allele count >4 (MAF approx. 1.2%)
Franzen et al., 2016 | Blood and six vascular and metabolic tissues | 600 | 5%
Nedelec et al., 2016 | Macrophages (resting and stimulated) | 172 | 5%
Quach et al., 2016 | Monocytes (resting and stimulated) | 200 | 5%
Sajuthi et al., 2016 | Subcutaneous adipose and skeletal muscle | 256 and 255 | 1%
Zhernakova et al., 2016 | Whole blood | 2116 | 5%
Caliskan et al., 2015 | Peripheral blood mononuclear cells (resting and stimulated) | 98 | 10%
Hulur et al., 2015 | Distal colonic samples | 40 | 5%
Kirsten et al., 2015 | Peripheral blood mononuclear cells | 2112 | 1%
Lock et al., 2015 | Blood and brain | 43 | 5%
Naranbhai et al., 2015 | Peripheral blood CD16+ neutrophils | 101 | 5%
Battle et al., 2014 | Whole blood | 922 | 2.5%
Fairfax et al., 2014 | Monocytes (resting and stimulated) | 432 | 4%
Hu et al., 2014 | T cells (resting and stimulated) | 174 | 1%
Kim et al., 2014 | Monocytes (resting and stimulated) | 137 | 5%
Kim et al., 2014 | Brain | 424 | 1%
Lee et al., 2014 | Dendritic cells (resting and stimulated) | 534 | 5%
Ongen et al., 2014 | Tumour and normal colon mucosa samples | 90 | 5%
Ramasamy et al., 2014 | Brain (10 regions) | 134 | 5%
Wright et al., 2014 | Whole blood | 2494 | 1%
Ye et al., 2014 | CD4+ T cells (resting and stimulated) | 348 | 5%
Garnier et al., 2013 | Monocytes | 758 | 1%
Lappalainen et al., 2013 | LCLs | 462 | 5%
Li et al., 2013 | Tumour samples and normal blood samples | 407 | 5%
Westra et al., 2013 | Peripheral blood | 5311 | 5%
Barreiro et al., 2012 | Dendritic cells (resting and stimulated) | 65 | 10%
Fairfax et al., 2012 | Primary monocytes and primary B cells | 283 | 1%
Fu et al., 2012 | Blood and four non-blood tissues | 1240 and 62–77 | 5%
Grundberg et al., 2012 | Adipose, LCLs, and skin | 667–777 | 5%
Hao et al., 2012 | Lung | 1111 | 1%
Stranger et al., 2012 | LCLs | 726 | 5%

References for Table S2.1:

1. Brynedal, B. et al. Large-Scale trans-eQTLs Affect Hundreds of Transcripts and Mediate Patterns of Transcriptional Co-regulation. Am J Hum Genet 100, 581–591 (2017).
2. GTEx Consortium. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017).
3. Kim-Hellmuth, S. & Lappalainen, T. Concerted Genetic Function in Blood Traits. Cell 167, 1167–1169 (2016).
4. Ko, Y.-A. et al. Genetic-Variation-Driven Gene-Expression Changes Highlight Genes with Important Functions for Kidney Disease. Am J Hum Genet 100, 940–953 (2017).
5. Manry, J. et al. Deciphering the genetic control of gene expression following Mycobacterium leprae antigen stimulation. PLoS Genet 13, e1006952 (2017).
6. Ng, B. et al. An xQTL map integrates the genetic architecture of the human brain's transcriptome and epigenome. Nat Neurosci 20, 1418–1426 (2017).
7. Pala, M. et al. Population- and individual-specific regulatory variation in Sardinia. Nat Genet 49, 700–707 (2017).
8. Peng, S. et al. Expression quantitative trait loci (eQTLs) in human placentas suggest developmental origins of complex diseases. Hum Mol Genet 26, 3432–3441 (2017).
9. Yao, C. et al. Dynamic Role of trans Regulation of Gene Expression in Relation to Complex Traits. Am J Hum Genet 100, 571–580 (2017).
10. Chen, L. et al. Genetic Drivers of Epigenetic and Transcriptional Variation in Human Immune Cells. Cell 167, 1398–1414.e24 (2016).
11. Franzen, O. et al. Cardiometabolic risk loci share downstream cis- and trans-gene regulation across tissues and diseases. Science 353, 827–830 (2016).
12. Nedelec, Y. et al. Genetic Ancestry and Natural Selection Drive Population Differences in Immune Responses to Pathogens. Cell 167, 657–669.e21 (2016).
13. Quach, H. et al. Genetic Adaptation and Neandertal Admixture Shaped the Immune System of Human Populations. Cell 167, 643–656.e17 (2016).
14. Sajuthi, S. P. et al. Mapping adipose and muscle tissue expression quantitative trait loci in African Americans to identify genes for type 2 diabetes and obesity. Hum Genet 135, 869–880 (2016).
15. Zhernakova, D. V. et al. Identification of context-dependent expression quantitative trait loci in whole blood. Nat Genet 49, 139–145 (2017).
16. Caliskan, M., Baker, S. W., Gilad, Y. & Ober, C. Host genetic variation influences gene expression response to rhinovirus infection. PLoS Genet 11, e1005111 (2015).
17. Hulur, I. et al. Enrichment of inflammatory bowel disease and colorectal cancer risk variants in colon expression quantitative trait loci. BMC Genomics 16, 138 (2015).
18. Kirsten, H. et al. Dissecting the genetics of the human transcriptome identifies novel trait-related trans-eQTLs and corroborates the regulatory relevance of non-protein coding loci. Hum Mol Genet 24, 4746–4763 (2015).
19. Lock, E. F. et al. Joint eQTL assessment of whole blood and dura mater tissue from individuals with Chiari type I malformation. BMC Genomics 16, 11 (2015).
20. Naranbhai, V. et al. Genomic modulators of gene expression in human neutrophils. Nat Commun 6, 7545 (2015).
21. Battle, A. et al. Characterizing the genetic basis of transcriptome diversity through RNA-sequencing of 922 individuals. Genome Res 24, 14–24 (2014).
22. Fairfax, B. P. et al. Innate immune activity conditions the effect of regulatory variants upon monocyte gene expression. Science 343, 1246949 (2014).
23. Hu, X. et al. Regulation of gene expression in autoimmune disease loci and the genetic basis of proliferation in CD4+ effector memory T cells. PLoS Genet 10, e1004404 (2014).
24. Kim, S. et al. Characterizing the genetic basis of innate immune response in TLR4-activated human monocytes. Nat Commun 5, 5236 (2014).
25. Kim, Y. et al. A meta-analysis of gene expression quantitative trait loci in brain. Transl Psychiatry 4, e459 (2014).
26. Lee, M. N. et al. Common genetic variants modulate pathogen-sensing responses in human dendritic cells. Science 343, 1246980 (2014).
27. Ongen, H. et al. Putative cis-regulatory drivers in colorectal cancer. Nature 512, 87–90 (2014).
28. Ramasamy, A. et al. Genetic variability in the regulation of gene expression in ten regions of the human brain. Nat Neurosci 17, 1418–1428 (2014).
29. Wright, F. A. et al. Heritability and genomics of gene expression in peripheral blood. Nat Genet 46, 430–437 (2014).
30. Ye, C. J. et al. Intersection of population variation and autoimmunity genetics in human T cell activation. Science 345, 1254665 (2014).
31. Garnier, S. et al. Genome-wide haplotype analysis of cis expression quantitative trait loci in monocytes. PLoS Genet 9, e1003240 (2013).
32. Lappalainen, T. et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506–511 (2013).
33. Li, Q. et al. Integrative eQTL-based analyses reveal the biology of breast cancer risk loci. Cell 152, 633–641 (2013).
34. Westra, H.-J. et al. Systematic identification of trans eQTLs as putative drivers of known disease associations. Nat Genet 45, 1238–1243 (2013).
35. Barreiro, L. B. et al. Deciphering the genetic architecture of variation in the immune response to Mycobacterium tuberculosis infection. Proc Natl Acad Sci U S A 109, 1204–1209 (2012).
36. Fairfax, B. P. et al. Genetics of gene expression in primary immune cells identifies cell type-specific master regulators and roles of HLA alleles. Nat Genet 44, 502–510 (2012).
37. Fu, J. et al. Unraveling the regulatory mechanisms underlying tissue-dependent genetic variation of gene expression. PLoS Genet 8, e1002431 (2012).
38. Grundberg, E. et al. Mapping cis- and trans-regulatory effects across multiple tissues in twins. Nat Genet 44, 1084–1089 (2012).
39. Hao, K. et al. Lung eQTLs to help reveal the molecular underpinnings of asthma. PLoS Genet 8, e1003029 (2012).
40. Stranger, B. E. et al. Patterns of cis regulatory variation in diverse human populations. PLoS Genet 8, e1002639 (2012).


Figure S2.1: False discovery rate (FDR) and sensitivity of pooled methods for increasing sample sizes. Each panel represents a pooled multiple testing correction method, namely the Storey-Tibshirani (ST), Benjamini-Hochberg (BH), Benjamini-Yekutieli (BY) FDR procedures, and the Bonferroni correction procedure. FDR (A) and sensitivity/true positive rate (TPR) (B) are shown on y-axes and sample sizes are shown on x-axes. Each dot represents one simulation scenario and is coloured according to the minor allele frequency (MAF) of the simulated causal eSNPs. The dashed horizontal lines in panel A indicate the desired FDR level of 5%. Missing dots in panel A are scenarios in which no significant eGenes were identified (the corresponding sensitivity in panel B is 0).

[Figure S2.1 plots not reproduced. Panels A (FDR) and B (TPR); plot columns: Pooled ST, Pooled BH, Pooled BY, Pooled Bonf; x-axis: Sample Size (100, 200, 500, 1000, 2000, 5000); y-axis range: 0.00 to 1.00; colour legend: MAF (50%, 25%, 10%, 5%, 1%, 0.5%).]
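As a rough illustration of the quantities plotted in Figure S2.1, the following Python sketch adjusts a pooled vector of P-values with BH or Bonferroni and then computes the observed FDR and TPR against the simulated truth. It is a minimal sketch, not the code used for the thesis: the function names and the toy P-values are hypothetical, the unit of testing is left abstract, and the ST and BY procedures are omitted for brevity.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def bonferroni_adjust(pvalues):
    """Bonferroni adjusted P-values, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def fdr_and_tpr(significant, is_true):
    """Observed FDR and TPR; FDR is None when nothing is significant."""
    n_sig = sum(significant)
    n_true = sum(is_true)
    true_pos = sum(1 for s, t in zip(significant, is_true) if s and t)
    fdr = None if n_sig == 0 else (n_sig - true_pos) / n_sig
    tpr = None if n_true == 0 else true_pos / n_true
    return fdr, tpr


# Toy example (hypothetical values): five tests, two of them truly associated.
pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
truth = [True, False, True, False, False]
bh_sig = [q <= 0.05 for q in bh_adjust(pvals)]
bonf_sig = [q <= 0.05 for q in bonferroni_adjust(pvals)]
bh_fdr, bh_tpr = fdr_and_tpr(bh_sig, truth)
```

Returning `None` for the FDR when no test is significant mirrors the caption's convention of leaving such scenarios out of panel A.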


Figure S2.2: False discovery rate (FDR) of all hierarchical multiple testing correction methods. Each row compares four methods (different colours) applied in the global correction step of hierarchical correction procedures: the Storey-Tibshirani (ST), Benjamini-Hochberg (BH), Benjamini-Yekutieli (BY) FDR procedures, and the Bonferroni correction procedure. Rows represent different methods (ST, BH, BY, Bonferroni, eigenMT, and beta approximation based permutation approach) used in the local correction step of hierarchical correction procedures. Scenarios where true eSNPs had different minor allele frequencies (MAFs) are shown in different plots and sample size is shown on x-axes. The dashed horizontal lines indicate the desired FDR level of 5%.

[Figure S2.2 plots not reproduced. Panels A to F, one per local correction method (eigenMT, Bonferroni, BY, BH, ST, BPerm1k); plot columns: MAF 0.5%, 1%, 5%, 10%, 25%, 50%; x-axis: Sample size (100, 200, 500, 1000, 2000, 5000); y-axis: FDR; colour legend: Method (local-global combinations, each local method paired with ST, BH, BY, or Bonferroni).]
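The two-step structure of the hierarchical procedures compared in Figures S2.2 and S2.3 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: it uses Bonferroni for the local step (the minimum P-value per gene multiplied by the number of SNPs tested for that gene) and BH for the global step, and the function names, gene labels, and P-values are hypothetical rather than taken from the thesis.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def hierarchical_egenes(pvalues_by_gene, alpha=0.05):
    """Local Bonferroni per gene, then global BH across genes.

    Returns a dict mapping each significant eGene to its globally
    adjusted P-value.
    """
    genes = sorted(pvalues_by_gene)
    # Local step: one SNP-corrected P-value per gene.
    local = [
        min(1.0, min(pvalues_by_gene[g]) * len(pvalues_by_gene[g]))
        for g in genes
    ]
    # Global step: correct the gene-level P-values across all genes.
    global_adjusted = bh_adjust(local)
    return {g: q for g, q in zip(genes, global_adjusted) if q <= alpha}


# Toy input (hypothetical): nominal P-values of the cis-SNPs of three genes.
toy = {
    "geneA": [1e-6, 0.2, 0.5],
    "geneB": [0.03, 0.4],
    "geneC": [0.6, 0.9, 0.7, 0.8],
}
egenes = hierarchical_egenes(toy)
```

Swapping the local step for another method (for example, an effective number of tests instead of the raw SNP count) changes only the line that builds `local`; the global step is unaffected.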


Figure S2.3: Sensitivity/true positive rate (TPR) of all hierarchical multiple testing correction methods. Each row compares four methods (different colours) applied in the global correction step of hierarchical correction procedures: the Storey-Tibshirani (ST), Benjamini-Hochberg (BH), Benjamini-Yekutieli (BY) false discovery rate (FDR) procedures, and the Bonferroni correction procedure. Rows represent different methods (ST, BH, BY, Bonferroni, eigenMT, and permutation approach) used in the local correction step of hierarchical correction procedures. Scenarios where true eSNPs had different minor allele frequencies (MAFs) are shown in different plots and sample size is shown on x-axes.

[Figure S2.3 plots not reproduced. Panels A to F, one per local correction method (Bonferroni, ST, BH, BY, eigenMT, BPerm1k); plot columns: MAF 0.5%, 1%, 5%, 10%, 25%, 50%; x-axis: Sample size (100, 200, 500, 1000, 2000, 5000); y-axis: TPR (0.00 to 1.00); colour legend: Method (local-global combinations, each local method paired with ST, BH, BY, or Bonferroni).]


Figure S2.4: False discovery rate (FDR) and sensitivity of hierarchical multiple testing correction using permutation tests. FDR (A) and sensitivity/true positive rate (TPR) (B) of three permutation approaches (different colours) used in the local correction step of hierarchical correction procedures, calculated from 10 simulations, are shown on y-axes: exact permutation P-values obtained from 1,000 permutations (Perm1k), P-values obtained from beta approximation from 1,000 permutations (BPerm1k), and P-values obtained from a beta approximation scheme with a minimum of 100 and maximum of 10,000 permutations (APerm10k). The Benjamini-Hochberg (BH) or Storey-Tibshirani (ST) FDR procedures were used in the global correction step of hierarchical correction procedures to correct for multiple genes. Each dot represents one scenario and plots show scenarios with different causal eSNP MAFs. The horizontal lines in panel A indicate the desired FDR level of 5%. FDR was not calculated for scenarios where no significant eGenes were identified.

[Figure S2.4 plots not reproduced. Panels A (FDR) and B (TPR); plot columns: MAF 0.5%, 1%, 5%, 10%, 25%, 50%; x-axis: Sample size (100, 200, 500, 1000, 2000, 5000); colour legend: Method (Perm1k-ST, Perm1k-BH, APerm10k-ST, APerm10k-BH, BPerm1k-ST, BPerm1k-BH).]


Figure S2.5: False discovery rate (FDR) and sensitivity of selected hierarchical multiple testing correction methods in simulations of log-normal noise. The gene expression data were inverse normal transformed. Comparison of the FDR (A) and sensitivity/true positive rate (TPR) (B) of six methods (different colours) for controlling multiple testing of SNPs at each gene (local correction), with Benjamini-Hochberg (BH) used to control for multiple testing across all genes (global correction). The six methods compared were Storey-Tibshirani (ST), Benjamini-Hochberg (BH), Benjamini-Yekutieli (BY), Bonferroni correction, eigenMT, and permutation tests based on beta approximation (BPerm1k). Each dot represents one scenario and plots show different minor allele frequencies (MAFs) of the simulated causal eSNPs. The dashed horizontal lines in panel A indicate the desired FDR level of 5%. Scenarios where no significant eGenes were identified are not shown in panel A.

[Figure S2.5 plots not reproduced. Panels A (FDR) and B (TPR); plot columns: MAF 0.5%, 1%, 5%, 10%, 25%, 50%; x-axis: Sample size (100, 200, 500, 1000, 2000, 5000); colour legend: Method (ST-BH, BH-BH, BY-BH, Bonferroni-BH, eigenMT-BH, BPerm1k-BH).]
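For readers unfamiliar with the inverse normal transformation mentioned in the caption of Figure S2.5, the following Python sketch shows one common rank-based variant. It is a minimal sketch under assumptions: the quantile formula (rank - 0.5) / n and the averaging of tied ranks are choices made here for illustration, and the thesis may have used a different offset; the function name and toy values are hypothetical.

```python
from statistics import NormalDist


def inverse_normal_transform(values, offset=0.5):
    """Map values to standard normal quantiles based on their ranks.

    Tied values receive the average of their ranks. The offset of 0.5
    is an assumption; other conventions exist.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        # Extend the block while the next sorted value is tied.
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # 1-based average rank
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - offset) / n) for r in ranks]


# Toy example (hypothetical): a skewed expression vector with one outlier.
expression = [2.0, 100.0, 0.5]
transformed = inverse_normal_transform(expression)
```

Because only the ranks enter the transformation, the outlier at 100.0 ends up no further from the centre than the smallest value, which is why the transformation limits the influence of heavy-tailed noise.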


Figure S2.6: False discovery rate (FDR) and sensitivity of selected hierarchical multiple testing correction methods in simulations of correlated gene expression. Comparison of the FDR (A) and sensitivity/true positive rate (TPR) (B) of six methods (different colours) for controlling multiple testing of SNPs at each gene (local correction), with Benjamini-Hochberg (BH) used to control for multiple testing across all genes (global correction). The six methods compared were Storey-Tibshirani (ST), Benjamini-Hochberg (BH), Benjamini-Yekutieli (BY), Bonferroni correction, eigenMT, and permutation tests based on beta approximation (BPerm1k). Each dot represents one scenario and plots show different minor allele frequencies (MAFs) of the simulated causal eSNPs. The dashed horizontal lines in panel A indicate the desired FDR level of 5%. Scenarios where no significant eGenes were identified are not shown in panel A.

[Figure S2.6 plots not reproduced. Panels A (FDR) and B (TPR); plot columns: MAF 0.5%, 1%, 5%, 10%, 25%, 50%; x-axis: Sample size (100, 200, 500, 1000, 2000, 5000); colour legend: Method (ST-BH, BH-BH, BY-BH, Bonferroni-BH, eigenMT-BH, BPerm1k-BH).]


Figure S2.7: False discovery rate (FDR) and sensitivity of selected hierarchical multiple testing correction methods in simulations of dominant effects. Comparison of the FDR (A) and sensitivity/true positive rate (TPR) (B) of six methods (different colours) for controlling multiple testing of SNPs at each gene (local correction), with Benjamini-Hochberg (BH) used to control for multiple testing across all genes (global correction). The six methods compared were Storey-Tibshirani (ST), Benjamini-Hochberg (BH), Benjamini-Yekutieli (BY), Bonferroni correction, eigenMT, and permutation tests based on beta approximation (BPerm1k). Each dot represents one scenario and plots show different minor allele frequencies (MAFs) of the simulated causal eSNPs. The dashed horizontal lines in panel A indicate the desired FDR level of 5%. Scenarios where no significant eGenes were identified are not shown in panel A.

[Figure S2.7 plots not reproduced. Panels A (FDR) and B (TPR); plot columns: MAF 0.5%, 1%, 5%, 10%, 25%, 50%; x-axis: Sample size (100, 200, 500, 1000, 2000, 5000); colour legend: Method (ST-BH, BH-BH, BY-BH, Bonferroni-BH, eigenMT-BH, BPerm1k-BH).]


Figure S2.8: False discovery rate (FDR) and sensitivity of selected hierarchical multiple testing correction methods in simulations of recessive effects. Comparison of the FDR (A) and sensitivity/true positive rate (TPR) (B) of six methods (different colours) for controlling multiple testing of SNPs at each gene (local correction), with Benjamini-Hochberg (BH) used to control for multiple testing across all genes (global correction). The six methods compared were Storey-Tibshirani (ST), Benjamini-Hochberg (BH), Benjamini-Yekutieli (BY), Bonferroni correction, eigenMT, and permutation tests based on beta approximation (BPerm1k). Each dot represents one scenario and plots show different minor allele frequencies (MAFs) of the simulated causal eSNPs. The dashed horizontal lines in panel A indicate the desired FDR level of 5%. Scenarios where no significant eGenes were identified are not shown in panel A.

[Figure S2.8 plots not reproduced. Panels A (FDR) and B (TPR); plot columns: MAF 5%, 10%, 25%, 50%; x-axis: Sample size (100, 200, 500, 1000, 2000, 5000); colour legend: Method (ST-BH, BH-BH, BY-BH, Bonferroni-BH, eigenMT-BH, BPerm1k-BH).]


Figure S2.9: False discovery rate (FDR) and sensitivity of selected hierarchical multiple testing correction methods in simulations of two causal eSNPs per eGene. Comparison of the FDR (A) and sensitivity/true positive rate (TPR) (B) of six methods (different colours) for controlling multiple testing of SNPs at each gene (local correction), with Benjamini-Hochberg (BH) used to control for multiple testing across all genes (global correction). The six methods compared were Storey-Tibshirani (ST), Benjamini-Hochberg (BH), Benjamini-Yekutieli (BY), Bonferroni correction, eigenMT, and permutation tests based on beta approximation (BPerm1k). Each dot represents one scenario and scenarios where most simulated causal eSNPs had different minor allele frequencies (MAFs) are shown in different plots. The dashed horizontal lines in panel A indicate the desired FDR level of 5%. Scenarios where no significant eGenes were identified are not shown in panel A. We also simulated scenarios where each eGene had three causal eSNPs, and the results were the same.

[Figure S2.9 plots not reproduced. Panels A (FDR) and B (TPR); plot columns: MAF 0.5%, 1%, 5%, 10%, 25%, 50%; x-axis: Sample size (100, 200, 500, 1000, 2000, 5000); colour legend: Method (ST-BH, BH-BH, BY-BH, Bonferroni-BH, eigenMT-BH, BPerm1k-BH).]


Figure S2.10: Simulations of log-normal noise without inverse normal transformation. Panels A and B show the false discovery rate (FDR) (A) and sensitivity/true positive rate (TPR) (B) of six methods (different colours) for controlling multiple testing of SNPs at each gene (local correction), with Benjamini-Hochberg (BH) used to control for multiple testing across all genes (global correction). The six methods compared were Storey-Tibshirani (ST), Benjamini-Hochberg (BH), Benjamini-Yekutieli (BY), Bonferroni correction, eigenMT, and permutation tests based on beta approximation (BPerm1k). Each dot represents one scenario and plots show different minor allele frequencies (MAFs) of the simulated causal eSNPs. The dashed horizontal lines in panel A indicate the desired FDR level of 5%. Scenarios where no significant eGenes were identified are not shown in panel A. Here the FDR and sensitivity were averaged across ten replicates. Panel C shows an example of a typical false positive eQTL association: the most significant eGene and its most significant SNP in a scenario with a sample size of 100. Gene expression levels are shown on the y-axis. Each dot on the boxplot represents one individual, stratified by minor allele dosage on the x-axis. The eSNP shown here had a low frequency (0.5%), so no individual was homozygous for the minor allele. Most of the false positive eQTLs had low frequencies; the histogram in panel D shows the MAF distribution of significant eSNPs for false positive eGenes.

[Figure S2.10 plots not reproduced. Panels A (FDR) and B (TPR): plot columns MAF 0.5%, 1%, 5%, 10%, 25%, 50%; x-axis: Sample size (100, 200, 500, 1000, 2000, 5000); colour legend: Method (ST-BH, BH-BH, BY-BH, Bonferroni-BH, eigenMT-BH, BPerm1k-BH). Panel C: boxplot titled "eQTL: rs529746952 ATP6V1E1"; x-axis: minor allele dosage (0, 1); y-axis: Simulated Gene Expression. Panel D: histogram; x-axis: MAF of eSNPs of false positive eGenes (0.0 to 0.5); y-axis: Frequency (0% to 60%).]


Figure S2.11: Average number of significant eSNPs per true positive eGene. For each scenario where a constant effect size (0.25, 0.5, 1.0, or 1.5 s.d. gene expression per allele) was simulated, the average number of eSNPs significantly associated with each true positive eGene was calculated (y-axis). A hierarchical correction procedure using eigenMT for local correction and BH for global correction (eigenMT-BH) was used to correct for multiple testing. Sample size is shown on x-axes and colours of dots indicate varying minor allele frequencies (MAFs) of true eSNPs.

Figure S2.12: Histogram of linkage disequilibrium (LD) r2 between the top SNPs that were not causal and their respective causal eSNPs. For every true positive eGene, we calculated the LD r2 between the most significant SNP and the simulated causal eSNP. Histogram bins show the percentage of non-causal top SNPs with the given r2 LD range to the causal eSNP calculated across 100 simulations of all 144 scenarios where a constant effect size was simulated. A hierarchical correction procedure using eigenMT for local correction and BH for global correction (eigenMT-BH) was used to correct for multiple testing. 83% of non-causal top SNPs had an r2 ≥0.8 with their respective causal eSNP.

[Figure S2.11 plot: one panel per effect size (0.25, 0.5, 1, 1.5); x-axis: sample size (100, 200, 500, 1000, 2000, 5000); y-axis: average number of eSNPs (0 to 1250); colour: MAF (50%, 25%, 10%, 5%, 1%, 0.5%).]

[Figure S2.12 plot: x-axis: correlation between non-causal top SNP and causal SNP (bins from 0-0.1 to 0.9-1); y-axis: frequency (0% to 60%).]


Figure S2.13: False discovery rate (FDR) among the independent eQTL signals identified by conditional analyses in simulations of two (A) or three (B) causal eSNPs per eGene. Each dot shows the rate of false positive eQTL signals among all independent eQTL signals identified in conditional analyses in one scenario. Sample sizes are shown on the x-axes and plots show different minor allele frequencies (MAFs) of the simulated causal eSNPs. The horizontal lines indicate the desired FDR level of 5%. The nominal significance thresholds in each scenario were determined by the eigenMT-BH hierarchical correction procedure in the initial eQTL scan. An independent eQTL signal was taken as a false positive (FP) if no significant SNPs were in high LD (r² ≥ 0.8) with any simulated causal eSNP, or if the eQTL signal was not truly independent of another identified true eQTL signal.

Figure S2.14: Proportion of eGenes that had simulated causal eSNPs with negatively correlated minor allele dosages in simulations of two causal eSNPs per eGene. Each dot represents one of the 36 scenarios and they are coloured based on minor allele frequencies (MAFs) of the causal eSNPs. Sample size is shown on the x-axis. For each eGene, the Pearson correlation between the two minor allele dosages of the causal eSNPs was calculated and the proportion of eGenes that had two causal eSNPs with negatively correlated minor allele dosages is shown on the y-axis. Here we only show the simulations where each eGene had two causal eSNPs; in simulations where each eGene had three causal eSNPs, the results were similar.
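The calculation described in the caption of Figure S2.14 can be sketched as follows. This is a minimal illustration only, assuming that each eGene is represented by two vectors of minor allele dosages (0, 1, or 2 per individual); the function names and the toy data are hypothetical and are not taken from the analysis code used in this thesis.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Undefined when a SNP is monomorphic in the sample.
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def proportion_negative(egenes):
    """Fraction of eGenes whose two causal eSNPs have negatively correlated dosages.

    `egenes` is a list of (dosage_snp1, dosage_snp2) pairs (hypothetical layout).
    """
    r_values = [pearson(d1, d2) for d1, d2 in egenes]
    # NaN compares False against 0, so undefined correlations are not counted.
    negative = sum(1 for r in r_values if r < 0)
    return negative / len(r_values)


# Toy data (hypothetical): three eGenes, five individuals each.
toy = [
    ([0, 1, 2, 0, 1], [2, 1, 0, 2, 1]),  # negatively correlated
    ([0, 1, 2, 0, 1], [0, 1, 2, 0, 1]),  # positively correlated
    ([0, 0, 1, 2, 1], [1, 2, 0, 0, 1]),  # negatively correlated
]
print(proportion_negative(toy))
```

In the toy data, two of the three eGenes carry causal eSNPs with negatively correlated dosages, so the reported proportion is two thirds.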

[Figure S2.13 plot, panels A and B: one panel per MAF (0.5%, 1%, 5%, 10%, 25%, 50%); x-axis: sample size (100, 200, 500, 1000, 2000, 5000); y-axis: FDR; method: eigenMT-BH.]

[Figure S2.14 plot: x-axis: sample size (100, 200, 500, 1000, 2000, 5000); y-axis: proportion of eGenes that had negatively correlated causal eSNPs (0% to 50%); colour: MAF (50%, 25%, 10%, 5%, 1%, 0.5%).]


Figure S2.15: Proportion of simulated causal eSNPs identified in conditional analyses and the initial eQTL mapping step in simulations of two (A) or three (B) causal eSNPs per eGene. Each dot represents one scenario ordered by sample size (x-axes) and plots show different minor allele frequencies (MAFs) of simulated causal eSNPs. The true positive rate (TPR) among all 400 or 600 simulated causal eSNPs is shown on the y-axes. Results based on conditional analyses and the initial eQTL mapping are in pink and green, respectively. The nominal significance thresholds used in conditional analyses were determined by the eigenMT-BH hierarchical correction procedure in the initial eQTL scan.

Figure S2.16: Identification of causal eSNPs among top SNPs in simulations of two causal eSNPs per eGene. Each dot represents one scenario in the order of increasing sample sizes (x-axes) and the plots show different minor allele frequencies (MAFs) of the simulated causal eSNPs. In panel A, the y-axes show the number of top SNPs that were the simulated causal eSNPs or in perfect LD. In panel B, the proportion of the top SNPs that were the causal eSNPs (or in perfect LD) is shown on the y-axes. For conditional analyses (green), we focused on all top SNPs at each independent locus. For the original eQTL scan (red), we focused on only the best SNP for each significant eGene. We also simulated scenarios where each eGene had three causal eSNPs, and the results were the same.

[Figure S2.15 plot, panels A and B: one panel per MAF (0.5%, 1%, 5%, 10%, 25%, 50%); x-axis: sample size (100, 200, 500, 1000, 2000, 5000); y-axis: TPR of causal eSNPs (0.00 to 1.00); analysis: conditional analysis, initial eQTL scan.]

[Figure S2.16 plot, panels A and B: one panel per MAF (0.5%, 1%, 5%, 10%, 25%, 50%); x-axis: sample size (100, 200, 500, 1000, 2000, 5000); y-axis: number of top SNPs that are causal (panel A, 0 to 400; panel B, 70% to 100%); analysis: initial eQTL scan, conditional analysis.]


Figure S2.17: Winner’s Curse in eQTL effect size estimation. Each plot represents one scenario, with sample size increasing from bottom to top and minor allele frequency (MAF) of true eSNPs increasing from left to right. Estimated effect sizes are shown on y-axes (naïve estimator from linear regression) and simulated true effect sizes are shown on x-axes. Each dot represents the top SNP of a true simulated eGene. True positive eGenes are shown in red and grey dots are simulated true eGenes that were not significant after hierarchical multiple testing correction using eigenMT for local correction and BH for global correction (eigenMT-BH). Dashed diagonals indicate where the estimated effect size equals the true value. Red lines are linear regression fits of the naïve estimator on the simulated effect size for the true positive eGenes.
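The selection effect behind the Winner’s Curse can be illustrated with a short simulation. The sketch below is not the simulation framework used in this chapter: it assumes a normal approximation in which the naïve estimate is drawn around the true effect with standard error 1/sqrt(2 × MAF × (1 − MAF) × n), i.e. unit residual variance and Hardy-Weinberg genotype variance, and it uses a fixed z-score cut-off as a stand-in for the multiple testing correction. The function name and all parameter values are hypothetical.

```python
import math
import random


def mean_selected_estimate(beta, n, maf, z_threshold=3.0, reps=20000, seed=1):
    """Mean naive effect estimate among replicates that pass a significance cut-off.

    Assumptions (illustrative only): the estimate is normally distributed around
    the true effect `beta`, residual variance is 1, and genotype variance is
    2 * maf * (1 - maf).
    """
    rng = random.Random(seed)
    se = 1.0 / math.sqrt(2.0 * maf * (1.0 - maf) * n)
    selected = []
    for _ in range(reps):
        beta_hat = rng.gauss(beta, se)
        # Keep only replicates that would be declared significant.
        if abs(beta_hat / se) > z_threshold:
            selected.append(beta_hat)
    if not selected:
        return float("nan")
    return sum(selected) / len(selected)


true_beta = 0.25
low_power = mean_selected_estimate(true_beta, n=100, maf=0.05)
high_power = mean_selected_estimate(true_beta, n=5000, maf=0.05)
print("n=100:  mean significant estimate / true effect =", low_power / true_beta)
print("n=5000: mean significant estimate / true effect =", high_power / true_beta)
```

Under these assumptions, the estimates that survive the cut-off at n = 100 are strongly inflated relative to the true effect, whereas at n = 5000 almost every replicate is significant and the mean estimate is close to the true value.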

[Figure S2.17 plot: grid of panels by sample size (100, 200, 500, 1000, 2000, 5000) and MAF (0.5%, 1%, 5%, 10%, 25%, 50%); x-axis: true effect size (0.0 to 2.0); y-axis: estimated effect size (0 to 2); legend: naive estimator, not identified.]


Figure S2.18: Accuracy of three bootstrap estimators and the naïve estimator. Mean squared error (MSE) with the simulated true effect size (A) and the mean ratio of estimated effect size to the simulated true effect size (B) were calculated from 10 simulations of scenarios where a constant true effect size (0.25, 0.5, 1, or 1.5 s.d. gene expression per allele) was simulated. Three bootstrap estimators as well as the naïve estimator are shown in different colours. A hierarchical correction procedure using eigenMT for local correction and BH for global correction (eigenMT-BH) was used to correct for multiple testing. The horizontal lines in panel B indicate the unbiased estimator (mean ratio of one).
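The two accuracy measures named in the caption of Figure S2.18 can be written out explicitly. The following is a minimal sketch with hypothetical function names and made-up toy estimates; it does not implement the bootstrap estimators themselves, only the summary statistics used to compare them.

```python
def mse(estimates, true_effect):
    """Mean squared error of the estimates against the simulated true effect."""
    return sum((e - true_effect) ** 2 for e in estimates) / len(estimates)


def mean_ratio(estimates, true_effect):
    """Mean ratio of estimated to true effect size; 1.0 indicates no bias."""
    return sum(e / true_effect for e in estimates) / len(estimates)


# Toy values (hypothetical): a true effect of 0.5 s.d. per allele, with
# inflated naive estimates and a corrected set that sits closer to the truth.
true_effect = 0.5
naive = [0.7, 0.8, 0.6, 0.9]
corrected = [0.5, 0.6, 0.4, 0.7]

for label, est in (("naive", naive), ("corrected", corrected)):
    print(label, "MSE =", mse(est, true_effect), "mean ratio =", mean_ratio(est, true_effect))
```

A mean ratio above one indicates upward bias, which is why panel B of the figure marks a ratio of one as the unbiased reference.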

[Figure S2.18 plot, panels A and B: one panel per true effect size (0.25, 0.5, 1, 1.5); x-axis: power (0.00 to 1.00); y-axis: mean squared error (panel A) or mean ratio (panel B, 1.0 to 1.8); estimators: naive estimator, bootstrap weighted estimator, bootstrap out-of-sample estimator, bootstrap shrinkage estimator.]


Figure S2.19: Correction for Winner’s Curse by bootstrap method. Each plot represents one scenario, with sample size increasing from bottom to top and minor allele frequency (MAF) of true eSNPs increasing from left to right. Estimated effect sizes are shown on y-axes and true simulated effect sizes are shown on x-axes. Each dot represents the top SNP of a true simulated eGene. True positive eGenes are shown in red (naïve estimator) or blue (bootstrap shrinkage estimator). Grey dots are simulated true eGenes that were not significant after hierarchical multiple testing correction using eigenMT for local correction and BH for global correction (eigenMT-BH). Dashed diagonals indicate where the estimated effect size equals the true simulated effect size. Red (or blue) lines are linear regression fit of the naïve estimator (or bootstrap shrinkage estimator) on the simulated effect size for the true positive eGenes.

[Figure S2.19 plot: grid of panels by sample size (100, 200, 500, 1000, 2000, 5000) and MAF (0.5%, 1%, 5%, 10%, 25%, 50%); x-axis: true effect size (0.0 to 2.0); y-axis: estimated effect size (0 to 2); legend: naive estimator, bootstrap shrinkage estimator, not identified.]


Figure S2.20: Minor allele frequency (MAF) distribution of causal eSNPs in simulations of two (A) or three (B) causal eSNPs per eGene. Each plot shows the MAF distribution of simulated causal eSNPs in scenarios from a MAF category across all simulated sample sizes. Multiple causal eSNPs for each eGene were selected based on a realistic distribution of linkage disequilibrium, thus the MAFs of the causal eSNPs in one scenario were not always the same.

[Figure S2.20 plot, panels A and B: one panel per MAF category (0.5%, 1%, 5%, 10%, 25%, 50%); x-axis: MAF of simulated causal eSNPs (0.0 to 0.5); y-axis: frequency.]


Chapter 3

Characterising the genetic basis of neonatal gene expression in immune responses of monocytes and T cells


3.1 Introduction

The neonatal immune system has distinct characteristics compared with the adult immune system. Neonates are more susceptible to infections, a result of both the immaturity of the immune system and the special immunological demands of the perinatal period¹⁶⁵⁻¹⁶⁷. The immune cell populations and responses are biased towards the TH2 type, and excessive TH1 activity is potentially harmful for the foetus¹⁶⁹. The majority of studies investigating expression quantitative trait loci (eQTLs) to date were performed using tissues or cell types from adults. The genetic regulation of gene expression in perinatal tissues (e.g. foetal brains¹⁹⁰) is not as well understood. Previous studies investigating how eQTL effects are modified by stimuli, i.e. response eQTLs (reQTLs), were all performed in cell types (mostly immune cells) obtained from adults, as listed in Table 1.1. The genetic regulation of gene expression and how it interacts with certain stimuli in neonatal immune systems remain to be deciphered. In this chapter, I aim to investigate the effects of germline human genetic variation on the neonatal immune system by exploring the early-life transcriptome and detecting eQTLs and reQTLs in immune cells collected at birth.

To explore early-life gene expression and regulation, I used cryobanked cord blood samples collected

at birth from the Childhood Asthma Study (CAS), a prospective birth cohort established by

collaborators from the Telethon Kids Institute, Perth, Western Australia229-234. The CAS cohort comprises

234 deeply phenotyped individuals followed from birth to up to 10 years of age, whose information

was recorded by their parents. These individuals are at high risk of atopy (i.e. with at least one parent

diagnosed with asthma, hay fever, or eczema), with 28% having current wheeze at age 5. From birth to

age 5, parents recorded symptoms of respiratory illness and use of medication, and a study physician

paid home visits to collect information and nasopharyngeal samples during acute respiratory illness

(ARI). Nasopharyngeal samples (control: in the absence of ARI) and blood samples were also collected at

multiple time points during routine visits. Bacterial and viral communities in nasopharyngeal samples

were characterised. Genotyping was performed using DNA microarray platforms. Previous studies on

this cohort have yielded insights into the pathogenesis of allergic diseases. Kusel et al. identified that viral

infections in earlier life were associated with current wheeze and asthma at age 5 in children with early

sensitisation before age 2, but not in those with later or no sensitisation231. Teo et al. described the

airway microbiome in the first 5 years of life, and observed a simple structure with one of six genera

dominating the nasopharynx microbiome233,234. They identified that early colonisation by Streptococcus

was associated with future asthma risk233. Increasing diversity in the airway microbiome was observed

after age two234. Colonisation by illness-associated bacteria in the first two years was associated with

future persistent wheeze only in early sensitised children, indicating that the interaction between atopy

and pathogens in early life promotes later asthma development234. Tang et al. performed clustering

analysis using immunorespiratory phenotypes and other demographic data for CAS individuals to

disentangle the heterogeneity in asthma235. They identified three trajectories which were replicated in

external cohorts, and one was a high-risk cluster with higher prevalence of allergic diseases235. The

above studies from CAS, in addition to similar studies from other research groups, all add to our

understanding of the development of asthma and allergy.

I analysed transcriptome data from immune cells obtained from CAS cord blood samples, in order to

explore how innate and adaptive immune systems are encoded at birth, and to investigate the genetic

basis of gene expression variation in immune responses. The specific aims of this chapter were:

(1) To identify regulatory variants of gene expression (or eQTLs) in neonatal immune cells under

resting and stimulatory conditions.

(2) To investigate how regulatory effects on gene expression are modified by immune responses by

detecting reQTLs with different genetic effects across conditions.

To address these aims, two immune cell populations from CAS cord blood samples were purified –

monocytes and CD8-/CD4+ T cells, representing neonatal innate immunity and adaptive immunity,

respectively. These were exposed to either a bacterial component that can activate innate immune

responses (lipopolysaccharides for monocytes/macrophages), or a pan T cell stimulant (PHA). Gene

expression in each sample was then quantified with a microarray platform (Illumina HT12 v4

BeadChip). I selected the analysis strategies based on the insights from the eQTL simulation study in

Chapter 2. I used a minor allele frequency (MAF) threshold of 10% to filter SNPs given that the sample

size in each experimental condition was around 120. I applied a hierarchical procedure to correct for

multiple testing, which was shown to control the false discovery rate (FDR) of genes with eQTLs

(eGenes) in the eQTL simulation study (Chapter 2). To address the first aim, I mapped cis-eQTLs (i.e.

eQTLs located within 1 Mb of the gene transcription start site, TSS) and trans-eQTLs (i.e. eQTLs located

on a different chromosome from the associated gene) within each experimental condition. I performed enrichment analysis to

provide functional annotations of eQTLs. To understand the mechanisms of trans-eQTL effects, I

applied mediation analysis to further investigate whether the trans-effects were mediated through

expression of local cis-eGenes. To address the second aim, I tested for interactions between significant

cis-eQTLs and the stimulatory conditions for each cell type, to identify eQTLs that demonstrated a

significant interaction, termed response eQTLs (reQTLs).
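
The cis and trans definitions used above can be sketched in a few lines. This is only an illustration of the two rules stated in the text (within 1 Mb of the TSS; different chromosomes); the class names, the function `classify_pair`, and the gene coordinates in the usage note are hypothetical and are not taken from the thesis pipeline.

```python
from dataclasses import dataclass

CIS_WINDOW_BP = 1_000_000  # 1 Mb window around the TSS, as stated in the text


@dataclass(frozen=True)
class Snp:
    chrom: str
    pos: int  # base-pair position


@dataclass(frozen=True)
class Gene:
    chrom: str
    tss: int  # transcription start site, base-pair position


def classify_pair(snp: Snp, gene: Gene, window: int = CIS_WINDOW_BP) -> str:
    """Return 'cis', 'trans', or 'untested' for one SNP-gene pair."""
    if snp.chrom != gene.chrom:
        return "trans"      # SNP and gene on different chromosomes
    if abs(snp.pos - gene.tss) <= window:
        return "cis"        # same chromosome, within the window
    return "untested"       # same chromosome but beyond the window
```

For example, `classify_pair(Snp("12", 56_435_929), Gene("17", 8_500_000))` returns `"trans"` (the TSS coordinate here is made up). Under these two definitions, pairs on the same chromosome but more than 1 Mb apart fall into neither category, which the sketch labels as untested.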

3.2 Results

3.2.1 Study data

I analysed the transcriptome data from in vitro cultures of resting and stimulated neonatal immune cells

from the CAS cohort (Figure 3.1). Microarray gene expression data were available for the following

four experimental conditions: resting monocytes, LPS-stimulated monocytes, resting T cells, and PHA-

stimulated T cells. Samples from monocyte cultures generally had a smaller number of genes with

detectable expression levels (Figure 3.2), and low-quality samples that did not pass quality control were

mostly from monocyte cultures (13 out of 16; 3.4.2 Materials and Methods). I had access to genotype

data for a subset of the CAS individuals, and performed genotype imputation for common genetic

variants with MAF ≥10%. The number of samples available for eQTL analysis was: 116 for resting

monocytes, 125 for LPS-stimulated monocytes, 126 for resting T cells, and 127 for PHA-stimulated T

cells.

Figure 3.1: Study design and analysis work flow. Monocytes and T cell cultures were purified from resting and stimulated cord blood samples from the Childhood Asthma Study (CAS) cohort. Gene expression was quantified using a microarray platform. A subset of the CAS individuals was genotyped and imputation was performed. eQTLs were identified within each experimental condition. Datasets for resting and stimulated samples were merged to detect response eQTLs within each cell type.

[Figure 3.1 flowchart labels: CAS cohort; cord blood PBMCs collected at birth (N = 152); monocytes (resting, LPS) and T cells (resting, PHA); gene expression profiling; genotyping (N = 135); imputation; association test; eQTLs; response eQTLs]

Figure 3.2: Distribution of number of detectable genes per microarray sample. Colour indicates four experimental conditions. Number in brackets indicates the average number of detectable genes within each group.

3.2.2 Local genetic regulatory variants and condition specificity

I tested for associations between gene expression variation and local genetic variants (cis-eQTLs)

within each of the four experimental conditions, and used a conservative hierarchical multiple testing

procedure to identify significant eQTLs at 5% FDR level (3.4.4 Materials and Methods). A larger

number of genes associated with eQTLs (eGenes) were identified in stimulated cells than in resting

cells: 1,347 and 971 in PHA-stimulated and resting T cells, respectively, and 376 and 136 in LPS-

stimulated and resting monocytes, respectively (Figure 3.3A). To rule out the possibility that the

difference in number of cis-eQTLs was due to the difference in statistical power because of different

sample sizes, I randomly sampled 116 samples – the smallest size (resting monocytes) in four

experimental conditions – from the other three groups (10% reduction in sample size) and mapped cis-

eQTLs using the same strategies. I observed the same trend in the number of cis-eQTLs, with 350, 900,

and 1,231 significant cis-eGenes identified in LPS-stimulated monocytes, resting T cells, and

PHA-stimulated T cells, respectively. The smaller number of eQTLs in monocyte cultures was potentially due to

monocyte samples having fewer detectable probes compared to T cell samples (Figure 3.2). Using a

two-stage conditional analysis, I identified only a few eGenes (up to 6.3% in PHA-stimulated T cells;

Table 3.1) with multiple independent eQTL signals (3.4.4 Materials and Methods).
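
The down-sampling check described above can be sketched as follows. This is a minimal illustration assuming simple random sampling without replacement; the sample IDs, the seed, and the helper name `subsample_ids` are hypothetical, and the subsequent eQTL mapping step is not shown.

```python
import random


def subsample_ids(sample_ids, n, seed=0):
    """Draw n sample IDs without replacement (seeded for reproducibility)."""
    ids = list(sample_ids)
    if n > len(ids):
        raise ValueError("cannot draw more samples than are available")
    rng = random.Random(seed)
    return sorted(rng.sample(ids, n))


# Sample sizes reported in 3.2.1; resting monocytes (116) is the smallest group.
group_sizes = {
    "LPS-stimulated monocytes": 125,
    "Resting T cells": 126,
    "PHA-stimulated T cells": 127,
}
target = 116

for name, size in group_sizes.items():
    ids = [f"S{i:03d}" for i in range(size)]  # placeholder IDs
    subset = subsample_ids(ids, target)
    reduction = 100 * (1 - target / size)
    print(f"{name}: kept {len(subset)} of {size} ({reduction:.1f}% reduction)")
```

Fixing the seed keeps the comparison reproducible; repeating the draw with several seeds would show how much the eGene counts vary with the particular subset chosen.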

I observed a large proportion of cis-eQTLs that had effects on gene expression only after stimulation:

60% (262 out of 376) of eGenes in LPS-stimulated monocytes were not significant in resting monocytes,

and 58% (778 out of 1,347) of eGenes in PHA-stimulated T cells had cis-eQTLs only upon stimulation.

I focused on the top eQTL SNPs (eSNPs) for each eGene: if any pair of eSNPs identified from different

cell types or stimulatory conditions were not in high linkage disequilibrium (LD; r2 ≥0.8), they were

treated as distinct eQTL–eGene associations (Figure 3.3B). The great majority (79%) of the eQTL

[Figure 3.2 axes and legend: x-axis: Number of detectable genes per sample (2000 to 10000); y-axis: Number of samples; experimental condition (average): Resting Monocytes (6349), LPS-stimulated Monocytes (7579), Resting T cells (9281), PHA-stimulated T cells (10044)]

signals were condition-specific, i.e. detected in only one cell type or stimulatory condition and not

significant in the other three groups (Figure 3.3B), consistent with previous observations in eQTLs

identified using cells under different conditions134.
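
The LD rule used above to decide whether two top eSNPs represent the same signal can be sketched as below. This is an assumption-laden illustration: it estimates r2 as the squared Pearson correlation of genotype dosages (0/1/2), which may differ from the LD estimator and reference panel used in the thesis, and the dosage vectors and function names are hypothetical.

```python
import numpy as np


def ld_r2(dosage_a, dosage_b):
    """Squared Pearson correlation between two genotype dosage vectors."""
    a = np.asarray(dosage_a, dtype=float)
    b = np.asarray(dosage_b, dtype=float)
    r = np.corrcoef(a, b)[0, 1]
    return float(r ** 2)


def same_signal(dosage_a, dosage_b, threshold=0.8):
    """Treat two eSNPs as one eQTL signal if they are in high LD (r2 >= threshold)."""
    return ld_r2(dosage_a, dosage_b) >= threshold


# Hypothetical dosages for three SNPs in eight individuals.
snp_x = [0, 1, 2, 0, 1, 2, 0, 1]
snp_y = [2, 1, 0, 2, 1, 0, 2, 1]  # same SNP coded on the other allele
snp_z = [1, 1, 1, 0, 2, 0, 2, 0]  # weakly correlated with snp_x

print(same_signal(snp_x, snp_y))  # high LD: counted as a shared signal
print(same_signal(snp_x, snp_z))  # low LD: counted as distinct associations
```

Because r2 ignores the sign of the correlation, allele coding does not affect whether two eSNPs are merged into one signal.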

Figure 3.3: Cis-eQTLs in four experimental conditions. (A) A bar plot shows the number of genes with significant cis-eQTLs (eGenes) identified in each cell type and treatment group (on x-axis). Colours indicate the cell type. (B) A heatmap shows eQTL sharing across four experimental conditions (rows). Columns in the heatmap are unique eQTL–eGene associations. Significant associations are in red. Percentages labelled on the heatmap show the proportion of unique eQTL–eGene associations that are specific to a certain cell type or stimulatory condition.

Table 3.1: Number of cis-eGenes and independent cis-eQTLs.

Cell type  Condition  Sample size  #eGenes  #eGene–eSNP associations  #eGenes with multiple independent eQTLs (proportion)  #Independent eQTL signals
Monocytes  Resting    116          136      9462                      1 (0.7%)                                              137
Monocytes  LPS        125          376      25110                     12 (3.2%)                                             386
T cells    Resting    126          971      68161                     53 (5.5%)                                             1027
T cells    PHA        127          1347     100091                    85 (6.3%)                                             1436

Next, I evaluated how many cis-eQTL signals identified in neonatal immune cells were replicated in

external eQTL datasets acquired using similar cell types and stimulatory conditions. I examined

whether the top eSNP, or any eSNP in high LD with it (r2 ≥0.8), for each eGene in the CAS

datasets was significantly associated with the same eGene in resting and LPS-stimulated CD14+

monocytes from Fairfax et al.134 and resting and stimulated CD4+ T cells from the DICE project55 (3.4.5

Materials and Methods). The above two eQTL studies were performed using cells obtained from

adults. Approximately half of cis-eQTL signals in resting and LPS-stimulated monocytes in CAS were

replicated in the Fairfax et al. dataset (Table 3.2). A smaller proportion (less than one third) of cis-eQTLs

in resting T cells were replicated in the DICE dataset. I observed even fewer replicable cis-eQTLs in

[Figure 3.3 panels: (A) bar plot, x-axis: Resting, LPS, Resting, PHA; y-axis: Number of eGenes (0 to 1250); legend: Cell: Monocyte, T cell. (B) heatmap, rows: PHA-stimulated T cells, Resting T cells, LPS-stimulated monocytes, Resting monocytes; legend: Significant eQTLs, Not significant; percentages labelled on the heatmap: 1%, 8%, 26%, 44%, 48%; Shared eQTLs (21%); Condition-specific eQTLs (79%)]

PHA-stimulated T cells. This was potentially due to a different type of stimulant used in DICE (anti-

CD3/CD28 antibodies) to activate T cells.

Table 3.2: Number of cis-eGenes replicated in external datasets. Four columns indicate four experimental conditions in CAS. The first row shows the number of eGenes in CAS, and the next two rows show the number of replicated signals in two datasets, with proportion in brackets. Numbers are not listed if no similar conditions are available in the two external datasets. In the DICE project55, T cells were stimulated using a different stimulant (anti-CD3/CD28 antibodies) from the CAS study (highlighted by “*”). Fairfax et al.134 used a pooled FDR method to correct for multiple testing, and the eQTLs downloaded from the DICE project55 are significant at a P-value threshold of 1×10^-4.

                                  Resting monocytes  LPS-stimulated monocytes  Resting T cells  PHA-stimulated T cells
#eGenes in CAS                    136                376                       971              1347
Replicated in Fairfax et al.134   65 (48%)           211 (56%)                 –                –
Replicated in the DICE project55  57 (42%)           –                         315 (32%)        236 (18%)*
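
As a quick arithmetic check, the bracketed proportions in Table 3.2 are the replicated counts divided by the number of CAS eGenes in the same condition. The sketch below recomputes them from the counts in the table; the dictionary layout and the function name `replication_rate` are illustrative only.

```python
# Counts copied from Table 3.2.
egenes_in_cas = {
    "Resting monocytes": 136,
    "LPS-stimulated monocytes": 376,
    "Resting T cells": 971,
    "PHA-stimulated T cells": 1347,
}
replicated = {
    "Fairfax et al.": {"Resting monocytes": 65, "LPS-stimulated monocytes": 211},
    "DICE project": {
        "Resting monocytes": 57,
        "Resting T cells": 315,
        "PHA-stimulated T cells": 236,
    },
}


def replication_rate(n_replicated, n_total):
    """Percentage of CAS eGenes replicated, rounded to the nearest integer."""
    return round(100 * n_replicated / n_total)


rates = {
    dataset: {cond: replication_rate(n, egenes_in_cas[cond]) for cond, n in counts.items()}
    for dataset, counts in replicated.items()
}
for dataset, by_condition in rates.items():
    for cond, pct in by_condition.items():
        print(f"{dataset} | {cond}: {pct}%")
```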

Consistent with previous findings84,91, most of the cis-eSNPs (88.2%) were located within 200 kb from

the transcription start site (TSS) of the corresponding gene. For each eGene’s top eSNP and eSNPs in

high LD (r2 ≥0.8), the proportion increased to 95%. The top eSNPs with the largest eQTL effect sizes

also tended to be located close to the gene TSS (Figure 3.4). GARFIELD enrichment analysis236 showed

that the cis-eSNPs were enriched in 3’ untranslated regions (UTR), 5’ UTR, and exon regions (Figure

3.5).

Figure 3.4: Absolute effect sizes of cis-eQTLs and distance (kb) from the transcription start site (TSS) of the corresponding gene. Four point plots show the absolute eQTL effect sizes (lead SNPs for each signal) on y-axes in four experimental conditions (“M” and “T” indicate monocytes and T cells, respectively). SNPs with effect sizes that are among the top 5th percentile within each condition are highlighted in red. The distance (kb) from the gene TSS is shown on x-axes.

[Figure 3.4 panels: Resting M, LPS M, Resting T, PHA T; x-axis: Distance from the corresponding gene TSS (kb), 0 to 1000; y-axis: Absolute effect size (0.4 to 1.6); highlighted points: Top 5 percentile]

Figure 3.5: Enrichment of cis-eQTLs in 3’UTR, 5’UTR, and exon regions. Four plots show the enrichment of cis-eQTL SNPs identified in four experimental conditions (“M” and “T” indicate monocytes and T cells, respectively). Logarithm of Odds Ratio (OR), estimated at the eQTL significance level of 1×10^-5, is shown on y-axes (3.4.6 Materials and Methods), and the 95% confidence intervals are plotted. Tests that are significant after the Bonferroni correction are marked by red asterisks, where a P-value threshold of 0.05/(7×4) is used.

3.2.3 Condition-specific genetic regulatory variants

To investigate how genetic regulation of gene expression was modified by the stimulation, I performed

interaction tests on the top eSNPs of each eGene in two cell populations separately (3.4.7 Materials

and Methods). Simple comparison of overlaps in eQTLs across conditions (Figure 3.3B) did not

account for changes in eQTL effect sizes, and this might cause false positives. For example, a significant

eQTL identified in one condition might not pass the stringent significance threshold in the other condition

(even though the association signal was still relatively strong), and would thus be considered a condition-specific

eQTL despite having very similar effect sizes in the two conditions. Statistical

interaction tests take effect sizes into account. At 5% FDR level based on permutation-adjusted

P-values, I identified 125 significant interactions or response eQTLs (reQTLs) in monocytes involving

125 unique eGenes (31% of 398 monocyte eGenes), and 956 reQTLs in T cells involving 918 unique

eGenes (52% of 1,749 T cell eGenes), among which 38 eGenes had distinct cis-eQTLs in two conditions

and both were reQTLs. The number of reQTLs as well as the proportion of eGenes that had significant

reQTLs were higher in stimulated conditions compared to that in resting conditions (Figure 3.6), which

was expected given that more eQTLs were detected upon stimulation as mentioned above. In LPS-

stimulated monocytes, 31% of the eGenes (117 out of 376) had reQTLs while the proportion in resting

monocytes was 11% (15 out of 136). The same trend was observed in T cells with the proportion being

49% (662 out of 1,347) in PHA condition and 37% (360 out of 971) in resting condition.
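
The idea of a genotype-by-condition interaction test can be illustrated with an ordinary least-squares model that includes a genotype × stimulation term. This is a minimal sketch and not the procedure of 3.4.7: it treats resting and stimulated samples as independent observations, omits covariates and the permutation step, and runs on simulated data. The function name `interaction_test` and all simulated values are hypothetical.

```python
import numpy as np


def interaction_test(expr, genotype, stimulated):
    """Fit expr ~ 1 + genotype + stimulated + genotype:stimulated by OLS.

    Returns the interaction coefficient and its t statistic.
    """
    y = np.asarray(expr, dtype=float)
    g = np.asarray(genotype, dtype=float)
    s = np.asarray(stimulated, dtype=float)
    X = np.column_stack([np.ones_like(y), g, s, g * s])
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = len(y) - X.shape[1]
    sigma2 = resid @ resid / dof                  # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)         # covariance of the estimates
    se = np.sqrt(np.diag(cov))
    return beta[3], beta[3] / se[3]


# Simulated example: effect of +0.5 s.d. per allele at rest and -0.5 after
# stimulation, i.e. a true interaction of -1.0 (a flipped-direction reQTL).
rng = np.random.default_rng(1)
n = 250
genotype = rng.integers(0, 3, size=n)             # allele dosage 0, 1 or 2
stimulated = np.repeat([0, 1], n // 2)            # 0 = resting, 1 = stimulated
expr = 0.5 * genotype - 1.0 * genotype * stimulated + rng.normal(0, 1, size=n)

beta_int, t_stat = interaction_test(expr, genotype, stimulated)
print(f"interaction estimate = {beta_int:.2f}, t = {t_stat:.2f}")
```

An eQTL with similar effect sizes in both conditions gives an interaction estimate near zero even if it passes the significance threshold in only one condition, which is the false-positive scenario that the overlap comparison cannot distinguish.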

[Figure 3.5 panels: Resting M, LPS M, Resting T, PHA T; x-axis annotation categories: 3'UTR, 5'UTR, Exon, Upstream, Downstream, Intron, Intergenic; y-axis: log(OR), −0.5 to 2.0]

Figure 3.6: Proportion of cis-eQTLs that are significant response eQTLs (reQTLs). Height of bars indicates the number of genes with significant cis-eQTLs (eGenes) identified in each cell type and treatment group (on x-axis). The percentages on each bar indicate the proportion of eGenes with significant reQTLs (green).

Most reQTLs were significant in only one condition where an elevated eQTL effect size was observed

(brown and blue dots in Figure 3.7A), and fewer reQTLs were significant in both resting and stimulated

conditions (red dots in Figure 3.7A). Interestingly, two reQTLs with flipped directions of effects on

gene expression across conditions were observed (red dots in the grey quadrants in Figure 3.7A). For

example, the top eSNP (rs5751775) of DDT (D-dopachrome tautomerase) in resting T cells was also

significantly associated with DDT expression in PHA-stimulated T cells, but with opposite directions

of eQTL effects. Allele C was associated with increased expression levels in resting T cells, but

with decreased levels after PHA stimulation (Figure 3.7B). DDT encodes a cytokine

that has been shown to be a structural and functional homolog of migration inhibitory factor (MIF),

which is a critical regulator of both the innate and the adaptive immune response237,238. Flipped

directions of regulatory effects in T cells were also observed for the top eSNP (rs13068288) of ZNF589

(Figure 3.7C). ZNF589 encodes a zinc finger protein, a transcription factor involved in hematopoietic

stem cell differentiation239.

[Figure 3.6 bar plot: x-axis: Resting M, LPS M, Resting T, PHA T; y-axis: Number of eGenes (0 to 1250); bar labels (number of eGenes, proportion with reQTLs): 136 (11%), 376 (31%), 971 (37%), 1347 (49%); legend: ReQTL, Not significant]

Figure 3.7: Effect sizes of response eQTLs (reQTLs) in monocytes and T cells. (A) Two point plots show effect sizes of significant reQTLs in resting (x-axes) and stimulated conditions (y-axes) in two cell populations: monocytes (left) and T cells (right). A gene might have two dots indicating two independent top SNPs (3.4.7 Materials and Methods). Colours indicate the condition where the reQTL was found significant. ReQTLs of DDT (B) and ZNF589 (C) in the grey quadrants show opposite directions of eQTL effects across conditions. (B) Two boxplots show eQTL associations of rs5751775 (top SNP in resting T cells) with DDT gene expression in resting (left) and PHA-stimulated (right) T cells. Dots indicate individuals stratified by genotypes (x-axes) and the rank-normalised gene expression is shown on y-axes. Similarly, panel (C) shows associations of rs13068288 (top SNP in PHA-stimulated T cells) with ZNF589. Chromosome number and genomic position in hg19/GRCh37 are labelled in brackets next to the SNP rsID. Effect sizes of these associations are shown under each box plot, expressed as the change in gene expression (in s.d.) per allele dosage.

3.2.4 Trans-acting regulatory loci mediated by cis-eGenes

To identify eQTLs regulating gene expression in a trans manner (trans-eQTLs), I tested for associations

between each gene and all SNPs that were located on different chromosomes. Significant trans-eQTLs

were identified within each of the four groups at a genome-wide 0.05 FDR level (3.4.8 Materials and

Methods). Most trans-eQTLs were identified in T cells: 10 in resting T cells and 15 in PHA-stimulated

T cells. I identified a single trans-eQTL in both resting and LPS-stimulated monocytes, and this signal was

shared across all four experimental conditions. This was a trans-eQTL on chromosome 12 associated

with MYH10 on chromosome 17. The same locus was also associated with multiple trans-eGenes in T

cells: MIR130A and STX1B in both resting and stimulated T cells, and IP6K2 and MIR1471 specific to

resting T cells (Figure 3.8).

[Figure 3.7 panels: (A) Monocyte, T cell; x-axis: ReQTL effect size in resting cells (−1.5 to 1.5); y-axis: ReQTL effect size in stimulated cells (−1.5 to 1.5); legend: Condition where the SNP is significant: Resting, Stimulated, Both; labelled genes: ZNF589, DDT. (B) rs5751775 (chr22:24266726); genotypes TT, TC, CC; panels Resting T, PHA T; y-axis: Rank Normalised Expression of DDT. (C) rs13068288 (chr3:48331751); genotypes AA, AT, TT; panels Resting T, PHA T; y-axis: Rank Normalised Expression of ZNF589. Effect size: 0.49, -0.85, 0.57, -0.54]

Figure 3.8: Trans-eQTL associations. A circular plot shows trans-eQTL associations in lines, with arrows pointing to trans-eGenes, which are labelled (in black) outside the rim indicating chromosome numbers. Dots on the other end point indicate nearby genes (names in purple) that are associated with the same loci (cis-eGenes). Colours of the lines indicate the experimental conditions where the trans-eQTLs were identified: “Resting T”: resting T cells only, “PHA T”: stimulated T cells only, “Resting T & PHA T”: shared across both conditions of T cells, and “All four conditions”: shared across all four experimental conditions.

It has been reported that trans-eQTLs are enriched for cis-eQTLs97, and about 20–35% of trans-associations

are significantly mediated by expression of cis-eGenes103,240. I also observed SNPs with

trans-eQTL effects that were also associated with local genes in cis (Figure 3.8). For example, the

trans-eQTL on chromosome 12 associated with a trans-eGene MYH10 was also associated with a local

gene RPS26 in three conditions, except in PHA-stimulated T cells where RPS26 did not have significant

cis-eQTLs (RPS26 had a reQTL specific to resting T cells). The top eSNP (rs1131017) was located in

the 5’ UTR region of RPS26. The same locus was also associated with SUOX expression in resting and

stimulated T cells. In addition, the trans-eQTL on chromosome 4 associated with MIR330 expression

was a cis-eQTL of SNHG8.

Mediation analysis was performed to test the hypothesis that the trans-regulation of an eQTL on the

expression of a distant gene was mediated through the expression of a local cis-eGene (Figure 3.9A,

3.4.9 Materials and Methods). After FDR adjustment, I found that in T cells the trans-association with

MIR330 that encodes a microRNA was significantly mediated through the cis-eGene SNHG8, which

[Figure 3.8 circular plot: legend: Condition specificity: Resting T, PHA T, Resting T & PHA T, All four conditions; chromosomes 1 to 22; labelled genes: DLEU2L, STAT1, MIR1471, IP6K2, EAF2, SNHG8, NUP155, ZSCAN12P1, HLA-J, MB21D1, RPS2P32, ENTPD4, GTF2H1, MIR130A, PDE6H, KRT79, SUOX, RPS26, STX1B, MYH10, MIR330, HAS1, DLGAP4, RPRD1B]

encodes a long non-coding RNA (Figure 3.9B, Table 3.3). Although the trans-eQTL on chromosome

12 was a cis-eQTL of both SUOX and RPS26, mediation analysis did not provide evidence of SUOX

having strong mediation effects. Instead, RPS26 was a significant mediator for three trans-associations

(trans-eGenes being MYH10, IP6K2, and MIR130A) in resting T cells (Figure 3.9C, Table 3.3).
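
The mediation model tested here can be sketched with the product-of-coefficients approach: a is the effect of the SNP on the cis-eGene, b is the effect of the cis-eGene on the trans-eGene adjusted for the SNP, a × b is the mediated (indirect) effect, and its ratio to the total effect is the proportion mediated. The code below is an illustration on simulated data under that assumption; it does not reproduce the significance test or covariate handling of 3.4.9, and the function names and simulated effect sizes are hypothetical.

```python
import numpy as np


def ols(y, *covariates):
    """Least-squares fit with an intercept; returns [intercept, coef_1, ...]."""
    X = np.column_stack([np.ones(len(y)), *covariates])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def mediation(snp, cis_expr, trans_expr):
    """Product-of-coefficients mediation for one SNP / cis-eGene / trans-eGene trio."""
    total = ols(trans_expr, snp)[1]                  # total effect of SNP on trans-eGene
    a = ols(cis_expr, snp)[1]                        # SNP -> cis-eGene
    _, direct, b = ols(trans_expr, snp, cis_expr)    # direct effect, cis -> trans
    indirect = a * b                                 # effect mediated by the cis-eGene
    return {
        "total": total, "a": a, "b": b,
        "direct": direct, "indirect": indirect,
        "proportion": indirect / total,
    }


# Simulated trio: the SNP lowers cis expression, which in turn raises
# trans expression, plus a direct SNP effect on the trans-eGene.
rng = np.random.default_rng(7)
n = 120
snp = rng.integers(0, 3, size=n).astype(float)
cis = -1.3 * snp + rng.normal(0, 0.5, size=n)
trans = 0.4 * cis - 0.4 * snp + rng.normal(0, 0.5, size=n)

res = mediation(snp, cis, trans)
print({k: round(float(v), 3) for k, v in res.items()})
```

With least-squares fits on the same samples, the total effect equals the direct effect plus a × b, so the proportion mediated is well defined; it can be negative when the indirect and total effects have opposite signs, as seen for some SUOX trios in Table 3.3.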

Figure 3.9: Trans-eQTL effects mediated through cis-eGenes. (A) A diagram demonstrating the mediation analysis model, where effects of trans-acting eQTL (exposure) on trans-eGene (outcome) are either mediated through a cis-eGene (mediator), or through direct effects (3.4.9 Materials and Methods). (B) and (C) show two examples of cis-eGenes (SNHG8 and RPS26, respectively; green) acting as a mediator for trans-associations (trans-eGenes in yellow). Genes that were not significant in mediation analysis are in grey. Tables show statistics of the mediation tests, and the “Mediation” column indicates the proportion of total effects of the eQTL on the trans-eGene that was mediated through the cis-eGene. Effect size estimates are in Table 3.3. Two models involving SUOX (C) were not tested because the trans-eSNPs of IP6K2 and MIR1471 were not significantly associated with SUOX. Significant mediations (FDR ≤0.05) are highlighted in bold.

[Figure 3.9 panels]
(A) eQTL (Exposure); Cis-eGene (Mediator); Trans-eGene (Outcome)
(B) rs687492 (chr4); cis-eGene SNHG8; trans-eGene MIR330 (chr19)

SNHG8 is a significant mediator
Trans-eGene  Condition  Mediation  FDR
MIR330       Resting T  62.8%      0.008
MIR330       PHA T      45.7%      0.035

(C) rs1131017 (chr12); cis-eGenes RPS26 and SUOX; trans-eGenes MYH10 (chr17), MIR130A (chr11), STX1B (chr16), IP6K2 (chr3), MIR1471 (chr2)

RPS26 is a significant mediator
Trans-eGene  Condition  Mediation  FDR
MYH10        Resting M  77.0%      <0.001
MYH10        LPS M      25.7%      0.035
MYH10        Resting T  21.5%      0.026
IP6K2        Resting T  54.2%      0.032
MIR130A      Resting T  32.5%      0.039
STX1B        Resting T  30.8%      0.066
MIR1471      Resting T  3.0%       0.949

Test SUOX as a mediator
Trans-eGene  Condition  Mediation  FDR
MYH10        Resting T  2.7%       0.643
MYH10        PHA T      -12.3%     0.083
IP6K2        Not tested
MIR130A      Resting T  -1.2%      0.949
MIR130A      PHA T      -15.8%     0.063
STX1B        Resting T  -3.5%      0.819
MIR1471      Not tested

Table 3.3: Mediation analysis to identify trans-associations that are mediated by cis-eGenes. Rows in the table are the 14 mediation trios tested (3.4.9 Materials and Methods), with significant (FDR ≤0.05) mediations highlighted in bold. “M” stands for monocytes and “T” stands for T cells under the “Condition” column. The column “Mediation Effect (%)” shows the mediation effect of the eQTL on the trans-eGene through cis-eGene, with the proportion among total effect in brackets.

Condition  eQTL         Cis-eGene  Trans-eGene  Total Effect  a      b      Mediation Effect (%)  P-value  FDR
Resting M  12:56435929  RPS26      MYH10        -1.00         -1.23  0.62   -0.77 (77.0%)         <0.001   <0.001
LPS M      12:56435929  RPS26      MYH10        -0.94         -1.10  0.22   -0.24 (25.7%)         0.015    0.035
Resting T  12:56435929  RPS26      MYH10        -1.30         -1.38  0.20   -0.28 (21.5%)         0.006    0.026
Resting T  12:56435929  RPS26      MIR130A      -1.10         -1.38  0.26   -0.36 (32.5%)         0.019    0.039
Resting T  12:56435929  RPS26      STX1B        -1.13         -1.38  0.25   -0.35 (30.8%)         0.042    0.066
Resting T  12:56389293  RPS26      MIR1471      0.79          1.28   0.02   0.02 (3.0%)           0.920    0.949
Resting T  12:56435929  RPS26      IP6K2        -0.92         -1.38  0.36   -0.50 (54.2%)         0.009    0.032
Resting T  12:56444632  SUOX       MYH10        1.05          -0.47  -0.06  0.03 (2.7%)           0.506    0.643
PHA T      12:56444632  SUOX       MYH10        0.58          -0.56  0.13   -0.07 (-12.3%)        0.060    0.083
Resting T  12:56444632  SUOX       MIR130A      0.93          -0.47  0.02   -0.01 (-1.2%)         0.949    0.949
PHA T      12:56435412  SUOX       MIR130A      0.46          -0.56  0.13   -0.07 (-15.8%)        0.036    0.063
Resting T  12:56444632  SUOX       STX1B        0.92          -0.47  0.07   -0.03 (-3.5%)         0.702    0.819
Resting T  4:119170256  SNHG8      MIR330       -1.03         -1.33  0.49   -0.65 (62.8%)         0.001    0.008
PHA T      4:119170256  SNHG8      MIR330       -1.16         -1.25  0.42   -0.53 (45.7%)         0.014    0.035
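
The FDR column of Table 3.3 can be compared with a Benjamini–Hochberg adjustment across the 14 trios. This is an assumption for illustration only: the adjustment method is described in 3.4.9, which is outside this section, and the P-value reported as <0.001 is replaced here by an arbitrary placeholder of 1e-5. Because the tabulated P-values are rounded, only approximate agreement should be expected.

```python
import numpy as np


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, starting from the largest P-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted


# P-values and FDR values in the row order of Table 3.3.
# The first P-value is reported only as "<0.001"; 1e-5 is a placeholder.
p_values = [1e-5, 0.015, 0.006, 0.019, 0.042, 0.920, 0.009,
            0.506, 0.060, 0.949, 0.036, 0.702, 0.001, 0.014]
fdr_reported = [None, 0.035, 0.026, 0.039, 0.066, 0.949, 0.032,
                0.643, 0.083, 0.949, 0.063, 0.819, 0.008, 0.035]

fdr_recomputed = bh_adjust(p_values)
for p, q, reported in zip(p_values, fdr_recomputed, fdr_reported):
    shown = "<0.001" if reported is None else f"{reported:.3f}"
    print(f"P = {p:<8g} recomputed FDR = {q:.3f}   reported FDR = {shown}")
```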

3.3 Discussion

Studies investigating genetic regulation of gene expression and its interaction with stimuli have

explored various immune cell types. However, most of them were performed in cell types obtained from

adults (Table 1.1), and to our knowledge, no study has explored eQTLs specific to cell samples at birth.

It is important to understand how neonatal immune systems respond to external environmental factors,

given the distinct characteristics compared with adult immune systems165-167. To explore early-life

immune responses, I investigated the genetic architecture controlling the variable gene expression of

immune cells collected at birth and cultured in vitro. I described the cell type- and condition-specific

genetic regulation, and how immune responses under different stimuli modify the genetic effects on

gene expression. I also identified some trans-eQTL effects mediated through expression of cis-eGenes.

I observed a much greater number of cis-eQTLs identified in T cells than in monocytes (Figure 3.3A),

and this difference was not previously observed in similar immune cell types from adults55,92,95. Using

a subset of 116 samples, I still observed the same trend in the number of cis-eQTLs (i.e. higher in

stimulated than resting cells and higher in T cells than monocytes), suggesting that the small number of

cis-eQTLs identified in resting monocytes was not purely due to a lack of statistical power from using

a smaller sample size. One possible reason is that monocyte samples generally had smaller numbers of

genes with detectable expression levels compared with T cell samples (Figure 3.2). Among 19 excluded

samples with low quality and fewer detectable probes, 16 were from the monocyte cultures (3.4.2

Materials and Methods). In addition, the CAS monocyte samples were an enriched population of

monocytes and macrophages, and we did not use a specific cell marker (CD14+) to sort between these

as was performed in previous studies55,92,95. Therefore, heterogeneity in the cell population might also

contribute to lower eQTL detection. If what I observed was not due to technical factors, it suggests that

genetics may explain a higher proportion of gene expression variation in neonatal T cells than neonatal

monocytes.

Approximately half of the eQTL signals identified in resting and LPS-stimulated monocytes were

replicated in similar cell type and condition collected from adults by Fairfax et al.134 (Table 3.2). The

proportion of eQTLs identified in T cells that were replicated in the DICE project55 was lower, which

is related to lower statistical power to detect eQTLs in the DICE project given the smaller sample size

(N = 91) compared with the Fairfax et al. study (N >300). In DICE, CD4+ T cells were treated with

anti-CD3/CD28 antibodies for four hours, while T cells were exposed to PHA for 24 hours in CAS.

Thus, the low replication rate (236 out of 1,347) indicates the regulatory effects are highly dependent

on cellular context. I used more stringent P-value thresholds in CAS, while Fairfax et al. used a pooled

FDR correction method, which was shown to have inflated FDR of eGene in Chapter 2. The eQTL list

downloaded from the DICE project website included all associations with P-values ≤1×10⁻⁴.

Considering the potential higher proportion of false positives in these two datasets, I did not investigate

how many eQTLs identified in these studies were replicated in CAS.

Two reQTLs showed opposite directions of effects on gene expression in resting and PHA-stimulated

T cells (Figure 3.7). The T allele of rs13068288 (the top SNP of ZNF589 in stimulated T cells) was

significantly associated with increased expression levels of ZNF589 in resting T cells, and the same

direction for this association was replicated in naïve CD4+ T cells from the DICE project55. The same

allele was associated with decreased ZNF589 levels in PHA-stimulated T cells; however, ZNF589 did

not have significant eQTLs in activated T cells from DICE, which again might be related to the different

stimuli used in CAS and DICE. ZNF589, also known as the Stem Cell Zinc Finger 1 (SZF1), encodes

a zinc finger protein, which is involved in hematopoietic stem cell differentiation239, and knockdown of

this gene resulted in reduced survival of hematopoietic cells by altering expression of apoptosis-

controlling genes241. rs13068288 is an intronic SNP of gene NME6 (nucleoside diphosphate kinase 6),

which is located downstream of ZNF589. rs13068288 was significantly associated with

NME6 in CAS PHA-stimulated T cells (LD r² = 0.79 with the top SNP). This SNP was associated with four

genes in multiple immune cells from the DICE project55, indicating its important regulatory role in

immune cells. The other example of flipped directions of reQTL (rs5751775) was observed for DDT

(D-dopachrome tautomerase). Expression of DDT was not detected in immune cells from the DICE


project. DDT was highly expressed in liver in the GTEx project. In GTEx, both directions of the eQTL

effects of rs5751775 have been reported: in tissues such as liver, pancreas, and stomach, the direction

is consistent with that in CAS PHA-stimulated T cells, and in more tissues including testis, brain, and

muscle, the opposite direction is observed. Peters et al. observed eQTLs with opposite directions in

different immune cell types, and interestingly, those eQTLs usually had smaller effect sizes compared

with eQTLs with consistent direction across cell types, suggesting different underlying regulatory

mechanisms242. Opposite directions of eQTLs were also reported in other studies243; however, the

potential biological mechanisms of the opposite directions of genetic effects are still unclear, and further

investigation in relevant cellular context is needed to better understand this phenomenon.

I observed two loci where SNPs with trans effects on distant genes also had cis effects on nearby genes.

Mediation analysis identified that RPS26 mediated the effect of the eQTL on multiple trans-eGenes, and

there was a lack of evidence in this study to support the mediation role of SUOX, which was associated

with the same eQTL. Previous mediation analysis has shown the mediator effects of RPS26104,240. Pierce

et al. investigated cis mediation of trans-eQTLs using mononuclear cells from 1,799 South Asians, and

RPS26 was one of the seven loci where the cis-eGene significantly mediated multiple trans-

associations240. They identified trans-associations with three distant genes that were mediated through

RPS26, among which IP6K2 was replicated in CAS, STX1B was significant at P-value 0.05 but did not

pass multiple testing correction in CAS, and the third gene was not quantified in CAS. The other trans-

associations in which RPS26 was the mediator identified in CAS were not significant in Pierce et al.,

and these trios were not tested. Yang et al. identified RPS26 as a significant cis-mediator in multiple

tissues from the GTEx project, though the trans-eGenes were not the same as those I identified in

CAS104. They also identified the mediation effects of SUOX; however, within the same tissue, the trans-

associations mediated through SUOX were not those through RPS26, indicating distinct effects of

SUOX and RPS26 on downstream distant genes. RPS26 encodes a ribosomal subunit protein, which has

a critical role in ribosome assembly and protein translation244,245. In addition, RPS26 also has regulatory

roles in various other cellular processes, such as nonsense-mediated mRNA decay246 and p53

transcriptional activity247, suggesting that it may be involved in the regulation of other genes.

This chapter presents the first genome-wide eQTL and reQTL mapping in neonatal immune cells under

resting and activated conditions. The strong cell type- and condition-specific nature of genetic

regulation demonstrates the importance of considering the cell types and relevant stimulatory conditions

to fully understand the functional roles of genetic variation. Characterisation of genetic regulation

commonly modified by stimuli aids in the understanding of the genetic basis of individual variation in

neonatal immune responses. The findings of this study provide a useful resource to explore neonatal

gene regulation, leading us to a better understanding of the impact of human genetic variation on early-

life immune responses.


3.4 Materials and Methods

3.4.1 Study cohort and RNA sample preparation

The study population is a subset of the Childhood Asthma Study (CAS), which is a prospective birth

cohort of 234 individuals with high risk of atopy followed from birth to up to 10 years of age229-234.

Sample collection and preparation were performed by Danny Mok, a collaborator from the Telethon

Kids Institute. Cord blood samples were collected for 152 individuals at birth. One million peripheral

blood mononuclear cells (PBMCs) from each individual were stimulated with either an innate immune

system stimulant (LPS: lipopolysaccharides, a membrane component in Gram-negative bacteria248), or

a pan T cell stimulant (PHA: phytohemagglutinin) for 24 hours (Figure 3.1). Unstimulated control

PBMC samples were also available. Non-adherent cells in suspension from resting and PHA-treated

cultures were removed and purified for CD8-/CD4+ T cells using Dynabeads (Invitrogen) and stored in

RNAprotect Cell reagent (Qiagen). Cells remaining in suspension in resting and LPS-treated cultures

were aspirated, leaving an enriched population of monocytes and macrophages adhered to the culture

wells. These adherent cells were then resuspended and transferred into RNAprotect. All cells in

RNAprotect Cell reagent were banked at -80°C.

RNA extraction was performed by Louise Judd from Monash University. The cells were thawed and

centrifuged briefly. The reagent was removed and total RNA was extracted from pelleted cells by an

established in-house procedure using TRIzol (Life Technologies) in combination with RNeasy

MinElute columns (Qiagen). The aqueous phase containing the RNA was then loaded onto an RNeasy

MinElute column (Qiagen) in order to purify and concentrate the RNA. RNA quality was assessed on

a Bioanalyzer 2100 using the RNA 6000 Nano kit (Agilent). In total there were 607 samples (one

missing sample) from 152 individuals for gene expression profiling.

3.4.2 Gene expression profiling and data processing

Total RNA from four cell culture conditions (resting and LPS-stimulated monocytes, and resting and

PHA-stimulated T cells) was quantified with Illumina HumanHT-12 v4 BeadChip gene expression

array at the Genome Institute of Singapore. This microarray platform contains 47,231 probes that target

gene transcripts and 770 negative controls that capture background noise. After excluding 31 samples

with suspected cross-contamination or insufficient quantity of complementary DNA (cDNA: DNA

synthesised from RNA in samples), 576 samples were successfully scanned. The raw microarray data

were exported by the Illumina software BeadStudio. For each sample that went through the Illumina

microarray platform, probe detection P-values were calculated as the proportion of negative control

probes that had higher intensity values than each analysed probe. The probe detection P-value reflects


how strong the signal was compared with the array background noise. A probe with a detection P-value

<0.05 is defined as a “detectable probe”. All analyses were performed in R226 (Figure 3.10). I first

removed three samples with zero intensity for almost all probes including negative controls and probes

targeting housekeeping genes (2 resting monocyte samples and 1 LPS-stimulated monocyte sample). I

further removed 16 outlier samples (8 resting monocyte samples, 5 LPS-stimulated monocyte samples,

and 3 resting T cell samples) with a low number of detectable probes (lying outside median ± 2 × interquartile

range). Compared with other samples, these excluded samples had much lower intensity for

positive controls including those targeting housekeeping genes.
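As an illustration only, the detection P-value definition and the outlier rule described above can be sketched in Python. The function names and toy inputs are hypothetical, and the quartile method used by `statistics.quantiles` is an assumption; this is not the R code used in the study.

```python
from statistics import median, quantiles

def detection_p_values(probe_intensities, negative_controls):
    """Detection P-value per probe: the proportion of negative control
    probes with a higher intensity than the analysed probe."""
    n_neg = len(negative_controls)
    return [sum(neg > x for neg in negative_controls) / n_neg
            for x in probe_intensities]

def count_detectable(p_values, alpha=0.05):
    """Number of probes called detectable (detection P-value < alpha)."""
    return sum(p < alpha for p in p_values)

def flag_outliers(detectable_counts, k=2.0):
    """Flag samples whose detectable-probe count lies outside
    median +/- k * interquartile range (k = 2 in the text)."""
    q1, _, q3 = quantiles(detectable_counts, n=4)
    iqr = q3 - q1
    centre = median(detectable_counts)
    lower, upper = centre - k * iqr, centre + k * iqr
    return [c < lower or c > upper for c in detectable_counts]
```

In this sketch a sample would be dropped when its flag is `True`; the study additionally inspected positive control intensities, which is not modelled here.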

Figure 3.10: Flowchart showing microarray data processing. Steps where data processing was performed for four experimental groups separately are shown in green boxes.

[Flowchart boxes recovered from the figure: data from GIS (sample probe profile: 47,231 probes, 576 samples; control probe profile including 770 negative control probes; excluded and imputed table: 18,903 probes); exclude 19 samples (3 samples with ≥99.9% of probes at zero intensity; samples with a low number of detectable probes, lying outside median ± 2 × IQR; low positive controls including housekeeping genes), leaving 47,231 probes and 557 samples; background correction, quantile normalisation, and log2 transformation (limma); reliable probes on autosomes (33,436; Arloth et al., PLoS One 2015), excluding probes aligned to multiple positions (≥25 bp); exclude 15 probes with missing data in ≥5 samples; exclude 12,889 undetectable probes, keeping 20,532 probes with detection P-value ≤0.01 in ≥2.5% of samples from a specific treatment, or in ≥5% of all samples; GENCODE annotation (exclude 1,293 probes with no annotation and 9 probes with inconsistent chromosome, 8 of which are snoRNA), leaving 19,230 probes (13,109 genes); collapse multiple probes, keeping the probe with the highest mean intensity for each gene (13,109 autosomal genes, 557 samples); inverse normal transformation so that expression of every gene follows a standard normal distribution; 135 (out of 152) individuals have genotype data (494 samples); PEER analysis on 494 samples (135 individuals).]

After quality control, 557 samples remained for normalisation (resting monocytes: 130, LPS-stimulated monocytes: 141, resting T cells: 142, and PHA-stimulated T cells: 144). Background correction was based on the 770 negative control probes on the Illumina HT12 array, and then quantile normalisation and log2 transformation were performed within each cell type and condition using the neqc function from the limma R package249 (Figure 3.10). Updated probe annotation data were used and analysis was restricted to 33,436 reliable probes, excluding unaligned probes and probes aligned to multiple regions that were more than 25 bp apart250. Fifteen probes with missing data in ≥5 samples were removed. Detectable probes targeting autosomal genes (N = 20,532) were kept, comprising probes with BeadStudio detection P-value ≤0.01 in ≥2.5% of the samples from a specific condition group, or in ≥5% of all samples (criteria used in the Fairfax et al. study134). Gene annotation was obtained from GENCODE46 release 19 (GRCh37 alignment, downloaded in October 2017). Among detectable probes, 19,230 had gene annotation in the GENCODE reference data. For genes that had multiple probes, I kept the probe with the highest mean intensity251, resulting in 13,109 autosomal genes. For eQTL analysis, I

performed a rank-based inverse normal transformation (INT) within each group using the rntransform

function in the GenABEL R package224, so that each gene expression followed a standard normal

distribution. It was previously reported that applying INT before correcting for covariate effects (e.g.

gender) performed better than applying INT on the residuals after regressing out covariates, in terms of

decreased type-I errors and increased power252. Therefore, I adjusted covariates during the association

analysis, and not before INT.
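A minimal Python sketch of a rank-based inverse normal transformation is given below for illustration. It assumes the common (rank − 0.5)/n offset and average ranks for ties; these choices and the function name are assumptions and may differ in detail from the GenABEL implementation used in the study.

```python
from statistics import NormalDist

def inverse_normal_transform(values, offset=0.5):
    """Map values to standard normal quantiles through their ranks.

    Ties receive the average of their 1-based ranks; each rank r is
    converted to the probability (r - offset) / n before applying the
    inverse normal CDF.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    normal = NormalDist()
    return [normal.inv_cdf((r - offset) / n) for r in ranks]
```

Applied within one cell type and condition, the transformed values for a gene are symmetric around zero regardless of the original scale, which is why covariates are then adjusted for in the association model itself.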

3.4.3 Genotyping and imputation

Genomic DNA was extracted from blood samples collected from 218 individuals in CAS. Genotyping

was performed with Illumina Omni2.5 BeadChip array, with coverage of approximately 2.5 million

markers. Genotype quality control was performed by Howard Tang from the Baker Heart & Diabetes

Institute. Variants with missing call rates >1%, MAF <1%, or Hardy-Weinberg Equilibrium (HWE)

test P-value <1×10⁻⁶ were excluded, and individuals with missing call rates >1% were removed. This

produced an initial count of 1.4 million SNPs for 215 genotyped individuals. Of these, a total of 135

children also had gene expression data from cord blood (i.e. overlap with the 152 individuals described

previously).

I performed imputation using the Michigan Imputation Server13 with Haplotype Reference Consortium6

(HRC) release r1.1 as the reference panel. After filtering out variants with low imputation accuracy (R²

<0.3), 12.7 million SNPs remained. For eQTL analysis, I focused on 4.3 million SNPs with MAFs

≥10%. The MAF cut-off used here was suggested by the eQTL simulation study253 in Chapter 2 in

order to avoid inflated false positives in low-frequency variants given our limited sample size.

3.4.4 Cis-eQTL mapping and conditional analysis

To identify cis-eQTLs within each cell type and treatment group, I performed linear additive regression

to model the effect of each SNP located within 1 Mb of the transcription start site (TSS) of the

corresponding gene using the Matrix eQTL217 R package. The sample size for eQTL mapping in each

experimental condition was: 116 for resting monocytes, 125 for LPS-stimulated monocytes, 126 for

resting T cells, and 127 for PHA-stimulated T cells. Genotype data were recoded as 0, 1, 2 based on the

dosage of the HRC alternative allele. Gender, the first three genotype PCs254, and the first ten PEER255 factors

that captured technical variation in transcriptomes were included as covariates in the linear model.
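The cis window and the additive coding can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the helper names and data layout are hypothetical, covariates are omitted, and the actual analysis was run with Matrix eQTL in R.

```python
CIS_WINDOW_BP = 1_000_000  # 1 Mb either side of the TSS, as in the text

def cis_snps(gene_chrom, gene_tss, snp_positions, window=CIS_WINDOW_BP):
    """Return IDs of SNPs on the gene's chromosome within `window` bp of its TSS.

    snp_positions maps SNP ID -> (chromosome, base-pair position).
    """
    return [snp_id for snp_id, (chrom, pos) in snp_positions.items()
            if chrom == gene_chrom and abs(pos - gene_tss) <= window]

def additive_slope(dosages, expression):
    """Least-squares slope of expression on allele dosage (0, 1, 2).

    This is the per-allele effect in an additive model without covariates.
    """
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    return sxy / sxx
```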

I applied a hierarchical correction procedure to correct for multiple testing. Firstly, nominal P-values

for all cis-SNPs from Matrix eQTL were adjusted by multiplying by the number of effective independent

SNPs for each gene (local correction), which was estimated by eigenMT based on genotype correlation


matrix204. Secondly, the minimum locally adjusted P-value for each gene was kept and the FDR of

significant genes was controlled at 5% using the Benjamini-Hochberg (BH) FDR-controlling procedure

(global correction)199. Genes with global FDR ≤0.05 were significant eGenes. Thirdly, to obtain the list

of significant eSNPs for each eGene, the locally adjusted minimum P-value corresponding to the global

FDR threshold of 0.05 was calculated, and SNPs with a locally adjusted P-value lower than the

threshold were significant eSNPs.
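The three steps of the hierarchical correction can be sketched in Python as follows. This is an illustrative sketch: the effective number of independent SNPs per gene is taken as given (eigenMT estimated it in the study), the function names are hypothetical, and the locally adjusted threshold is assumed to be the largest locally adjusted minimum P-value among significant eGenes.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted P-values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def hierarchical_correction(nominal_p, n_eff, fdr=0.05):
    """nominal_p: {gene: {snp: nominal P}}; n_eff: {gene: effective SNP count}.

    Returns (eGenes, {eGene: eSNPs}, {eGene: nominal P-value threshold}).
    """
    # Step 1: local correction within each gene
    local = {g: {s: min(1.0, p * n_eff[g]) for s, p in snps.items()}
             for g, snps in nominal_p.items()}
    # Step 2: global BH correction on the per-gene minimum adjusted P-values
    genes = list(local)
    gene_min = [min(local[g].values()) for g in genes]
    q_values = bh_adjust(gene_min)
    egenes = [g for g, q in zip(genes, q_values) if q <= fdr]
    if not egenes:
        return [], {}, {}
    # Step 3: locally adjusted threshold (assumed definition, see lead-in)
    threshold = max(min(local[g].values()) for g in egenes)
    esnps = {g: [s for s, p in local[g].items() if p <= threshold]
             for g in egenes}
    nominal_threshold = {g: threshold / n_eff[g] for g in egenes}
    return egenes, esnps, nominal_threshold
```

The per-gene nominal threshold returned last is the locally adjusted threshold divided by that gene's effective number of SNPs, which is the quantity a conditional scan on nominal P-values would compare against.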

Next, I performed conditional analyses to identify additional independent eQTL signals for each eGene.

The gene-level nominal P-value thresholds calculated in the hierarchical multiple-testing correction

(eigenMT-BH) were used to determine significant associations: the locally adjusted minimum P-value

corresponding to the global FDR threshold of 0.05 divided by the number of estimated independent

SNPs for each gene. I used a two-stage conditional analysis scheme as follows228:

(1) Forward stage. For each eGene, the number of independent cis-eQTL signals was learnt from an

iterative procedure. I started from the top SNP with the minimum P-value for the eGene, which was

added as a covariate in the linear model to test for cis-eQTLs. If any significant SNPs (with P-values

smaller than the gene’s nominal threshold) were identified, the new top SNP identified in this iteration

was added to the list of independent eQTL signals. In the next iteration of eQTL mapping, all previously

identified eSNPs were adjusted for as covariates. The forward stage terminated if no additional

significant associations were identified.

(2) Backward stage. In this stage, the final list of significant SNPs representing each independent eQTL

signal was determined. Let the list of independent SNPs for each eGene obtained from the forward stage be $SNP_1, SNP_2, SNP_3, \ldots, SNP_M$, where M was the number of independent eQTL signals. Each of the independent eQTL signals was tested separately using a leave-one-out model adjusting for all other SNPs in the list as covariates. For example, when the ith eQTL signal was tested, $SNP_1, \ldots, SNP_{i-1}, SNP_{i+1}, \ldots, SNP_M$ were added as covariates together with other covariates used in the original eQTL scan. The final set of independent eQTLs comprised the eSNPs that remained

significant in the backward stage.
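The control flow of the two-stage scheme can be sketched in Python as below. The regression itself is abstracted behind a hypothetical `scan` callback that returns nominal P-values for one eGene after adjusting for a given set of SNPs; the `max_signals` guard is also an assumption added for safety.

```python
def conditional_analysis(scan, threshold, max_signals=20):
    """Forward-backward selection of independent eQTL signals for one eGene.

    scan(conditioned_on) -> {snp: nominal P-value}, where `conditioned_on`
    is a tuple of SNP IDs added as covariates. `threshold` is the gene's
    nominal P-value threshold.
    """
    # Forward stage: keep adding the top SNP while it stays significant
    signals = []
    while len(signals) < max_signals:
        p_values = {s: p for s, p in scan(tuple(signals)).items()
                    if s not in signals}
        if not p_values:
            break
        top = min(p_values, key=p_values.get)
        if p_values[top] >= threshold:
            break
        signals.append(top)

    # Backward stage: test each signal adjusting for all the others
    final = []
    for snp in signals:
        others = tuple(s for s in signals if s != snp)
        if scan(others).get(snp, 1.0) < threshold:
            final.append(snp)
    return final
```

In practice `scan` would refit the cis-eQTL model with the listed SNPs added to the original covariates; here it is left abstract so that only the iteration logic is shown.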

3.4.5 Replication of cis-eQTLs in external datasets

I downloaded summary statistics of significant cis-eQTLs from a response eQTL study (Supplemental

Table S2 of the Fairfax et al. study134) and the DICE (database of immune cell expression, eQTLs, and

epigenomics) project (https://www.dice-database.org)55. Fairfax et al. mapped cis-eQTLs in resting and

LPS-stimulated CD14+ monocytes (two durations: 2 hours and 24 hours) obtained from adults aged

from 19 to 56 years, with the sample size being 414 for resting monocytes, and 261 and 322 for

monocytes treated with LPS for 2 hours and 24 hours, respectively134. The downloaded cis-eQTLs were significant at 5% FDR using the pooled FDR method. In the DICE project, 13 immune cell types were

collected from 91 subjects, among which CD4+ T cells and CD8+ T cells also had activated conditions55.

I downloaded eQTLs identified using CD14+ monocytes, CD4+ T cells, and activated CD4+ T cells (4-hour treatment with anti-CD3/CD28 antibodies), and the eQTL associations were significant with P-values ≤1×10⁻⁴. For each eGene identified in each of the four experimental conditions in CAS, I took the top eSNPs and eSNPs in high LD (r² ≥0.8), and if any of these eSNPs were significantly associated

with the same eGene in the corresponding conditions from the Fairfax et al. dataset and the DICE

dataset, this eQTL signal was considered as replicated.
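The replication rule above amounts to a set lookup, sketched here in Python. The data layout (a set of SNP and gene pairs for the external dataset) and the function name are hypothetical.

```python
def is_replicated(egene, top_snp, ld_proxies, external_eqtls):
    """True if the top eSNP or any SNP in high LD with it (r2 >= 0.8)
    is a significant eQTL for the same eGene in the external dataset.

    external_eqtls is a set of (snp_id, gene) pairs from the matching
    cell type and condition.
    """
    candidates = {top_snp} | set(ld_proxies)
    return any((snp, egene) in external_eqtls for snp in candidates)
```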

3.4.6 Enrichment analysis

I performed enrichment analyses using GARFIELD (version 2) to investigate the enrichment patterns

of cis-eQTLs using predefined features (“annotation data”) such as genic annotations from ENCODE,

GENCODE, and Roadmap Epigenomics project provided by this tool236. GARFIELD evaluates

enrichment using generalised linear regression models that account for allele frequency, distance to the

nearest transcription start site, and linkage disequilibrium (LD). LD correlation calculated based on the

UK10K dataset is also provided by the software. For each experimental condition, I took the P-values

for all SNPs tested in cis-eQTL analysis, and the smallest P-value was kept if a SNP was tested for

associations with multiple genes. Enrichment odds ratios were calculated at various eQTL significance

thresholds: 1×10⁻³, 1×10⁻⁴, …, 1×10⁻⁸.

3.4.7 Response eQTL detection

Response eQTLs (reQTLs) are eQTLs with context-specific effects on gene expression. To identify

reQTLs for each cell type, I focused on top SNPs of all eGenes that were significant in either resting or

stimulated conditions. For eGenes that were significant in both conditions and the top eSNPs were

independent (LD r² <0.8), I tested both of the top eSNPs; if the two top eSNPs were in high LD, I treated

them as the same eQTL signal and the more significant one was tested. In monocytes, 417 interaction

tests involving 398 eGenes were performed, and 1,959 tests involving 1,749 eGenes were performed in

T cells. The gene expression data in two conditions from one cell type were combined, and the following

linear mixed-effects model was tested for each eGene–top eSNP pair using the lmer function in the

lme4 R package256:

$$y_i \sim x_i + c_i + x_i \times c_i + \mathrm{cov}_i^{1} + \cdots + \mathrm{cov}_i^{14} + \mathrm{cov}_i^{1} \times c_i + \cdots + \mathrm{cov}_i^{14} \times c_i + (1 \mid \mathrm{subj}_i)$$

where $y_i$ indicates the expression level of an eGene for the ith sample, $x_i$ the SNP allele dosage, $c_i$ the condition (resting: 0 and stimulated: 1) where the gene expression was measured, $\mathrm{cov}_i^{1}, \ldots, \mathrm{cov}_i^{14}$ the 14 covariates used in the original eQTL mapping including gender, 3 genotype PCs, and 10 PEER factors, and $\mathrm{subj}_i$ the individual from which the ith sample was taken. The term $x_i \times c_i$ models the interaction between the genotype and the condition, and $(1 \mid \mathrm{subj}_i)$ indicates the individual-specific random effect for this paired study design.

I applied a permutation approach to estimate empirical P-values for the interaction term. In each

permutation step, the condition variable was shuffled within each individual, and the same linear mixed-

effects model was tested to get the permuted statistics for the interaction term. The permutation adjusted

P-value for each interaction test was calculated as follows:

$$P = \frac{s + 1}{n + 1}$$

where n was the total number of permutations (1,000) and s was the number of cases where the

permuted statistics were more significant than the original observed ones. I added 1 to both the

numerator and the denominator to avoid underestimating permutation P-values257. BH FDR-controlling

procedure was applied to the permutation adjusted P-values and significant interactions were identified

at 5% FDR level.
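The permutation scheme and the adjusted P-value formula can be illustrated in Python. This sketch does not fit the mixed model; it only shows the within-individual shuffling and the (s + 1)/(n + 1) calculation. Comparing absolute test statistics with "at least as extreme" counting is an assumption, as are the function names.

```python
import random

def shuffle_condition_within_individual(conditions_by_individual, rng):
    """Shuffle condition labels (0 = resting, 1 = stimulated) separately
    within each individual, preserving the paired design."""
    permuted = {}
    for individual, labels in conditions_by_individual.items():
        labels = list(labels)
        rng.shuffle(labels)
        permuted[individual] = labels
    return permuted

def permutation_p_value(observed_stat, permuted_stats):
    """Empirical P-value (s + 1) / (n + 1), where n is the number of
    permutations and s counts permuted statistics at least as extreme
    as the observed one."""
    n = len(permuted_stats)
    s = sum(abs(t) >= abs(observed_stat) for t in permuted_stats)
    return (s + 1) / (n + 1)
```

With 1,000 permutations the smallest attainable value is 1/1001, which is why adding 1 to the numerator and denominator avoids reporting a P-value of zero.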

3.4.8 Trans-eQTL identification

To detect trans-acting genetic regulation of gene expression in each condition, I tested SNPs that were

located on a different chromosome from the corresponding gene using the same linear model and

covariates as in the cis-eQTL mapping. I tried the following different approaches to deal with the

multiple testing:

(1) Genome-wide FDR correction. BH FDR-controlling procedure was applied to nominal P-values

from linear regression from all trans-association tests52. Significant trans-associations had an FDR

≤0.05.

(2) Gene-level FDR correction. For each gene, the P-value of the top SNP was multiplied by 1×10⁶, which was the estimated number of independent SNPs across the genome (calculated as 0.05 divided by the commonly used genome-wide significance threshold 5×10⁻⁸). To control gene-level FDR, the BH FDR-controlling procedure was then applied to the minimum adjusted P-values for each gene.

(3) Gene-level Bonferroni correction. The Bonferroni correction was used to control the gene-level family-wise error rate, by using a significance P-value threshold of 3.8×10⁻¹² (5×10⁻⁸/13,109, where the denominator indicates the number of genes). It is known that the Bonferroni correction is overly conservative, because the tests (or genes) were not independent of each other.


In resting monocytes as well as in LPS-stimulated monocytes, one trans-eQTL signal was significant

in all three methods. I observed 10 and 15 eGenes with significant trans-eQTLs (trans-eGenes) at 5%

genome-wide FDR level, corresponding to a nominal P-value threshold of 1.9×10⁻¹⁰ in resting T cells

and 2.5×10⁻¹⁰ in PHA-stimulated T cells. The number of significant trans-eGenes dropped to six and

eight by using the gene-level FDR correction where a nominal P-value threshold of 5.3×10⁻¹² was used

in both conditions, and to five and seven by using the gene-level Bonferroni correction. The limited

power was the major issue given the sample size; thus I used genome-wide FDR correction, the least

conservative method to determine significant trans-eQTLs used in the downstream analysis.
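The fixed quantities used by the gene-level approaches follow from simple arithmetic, reproduced here in Python as a check. The constant and function names are hypothetical; the numbers are those stated in the text.

```python
GENOME_WIDE_THRESHOLD = 5e-8   # commonly used genome-wide significance level
N_GENES = 13_109               # autosomal genes tested

# Estimated number of independent SNPs across the genome (approach 2)
n_independent_snps = 0.05 / GENOME_WIDE_THRESHOLD      # about 1e6

# Gene-level Bonferroni threshold (approach 3)
bonferroni_threshold = GENOME_WIDE_THRESHOLD / N_GENES  # about 3.8e-12

def gene_level_adjusted_p(top_snp_p):
    """Locally adjusted P-value for a gene's top trans-SNP (approach 2),
    capped at 1 before BH correction across genes."""
    return min(1.0, top_snp_p * n_independent_snps)
```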

3.4.9 Mediation analysis

I hypothesised that trans-eQTLs regulated gene expression of distant genes through cis-mediators, or

local genes whose expression was regulated by the same trans-eQTLs. To test this hypothesis, I focused

on the trans-eQTLs that were also associated with adjacent cis-eGenes, meaning that the trans-eQTLs

were also cis-eQTLs. For each trans-eGene–cis-eGene pair, I took the trans-eSNP with the smallest P-

value, and the mediation trio consisted of an eSNP as the exposure, a cis-eGene as the mediator, and a

trans-eGene as the outcome (Figure 3.9A). In total, I tested fourteen mediation trios: one from resting

monocytes, one from LPS-stimulated monocytes, nine from resting T cells, and three from PHA-

stimulated T cells.

I performed mediation tests for the 14 trios using the mediation R package258. The effect of the exposure on the mediator was estimated in cis-eQTL mapping, which is denoted as $a$. The effect of the mediator on the outcome ($b$) adjusting for the exposure and the effect of the exposure on the outcome ($c'$) adjusting for the mediator were estimated by the following multiple regression:

$$y_i \sim x_i + x_i^{med} + \mathrm{cov}_i^{1} + \cdots + \mathrm{cov}_i^{14}$$

where $y_i$ indicates the value of the outcome (or the expression level of the trans-eGene) for the ith sample, $x_i$ the exposure (or the eSNP allele dosage), $x_i^{med}$ the mediator (or the cis-eGene expression), and $\mathrm{cov}_i^{1}, \ldots, \mathrm{cov}_i^{14}$ the 14 covariates used in eQTL mapping as mentioned in previous sections. The estimates of $b$ and $c'$ were beta coefficients for $x_i^{med}$ and $x_i$, respectively. The “direct effect” of the exposure on the outcome was quantified as $c'$, the “indirect effect” of the exposure on the outcome through the mediator was quantified as $a \times b$, and the “total effect” was the sum of the previous two effects259. Complete mediation occurs when the direct effect $c'$ is zero after controlling for the mediator, and partial mediation happens when the direct effect is different from zero. To identify significant mediation trios (the null hypothesis $H_0: ab = 0$), I used a nonparametric bootstrap method (10,000

simulations) implemented in the mediation R package for variance estimation and P-value calculation.

BH-FDR controlling procedure was applied to correct for multiple testing.
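For illustration, the product-of-coefficients decomposition and a percentile bootstrap of the indirect effect can be sketched in plain Python. This is a simplified sketch under stated assumptions: covariates are omitted, the function names are hypothetical, the bootstrap returns a 95% percentile interval rather than a P-value, and it does not reproduce the mediation R package used in the study.

```python
import random

def ols(columns, y):
    """Least-squares coefficients for y ~ 1 + columns (intercept first),
    solved from the normal equations by Gauss-Jordan elimination."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [u - f * v for u, v in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def mediation_effects(snp, mediator, outcome):
    """Direct (c'), indirect (a * b), and total effects for one trio."""
    a = ols([snp], mediator)[1]                     # exposure -> mediator
    _, c_prime, b = ols([snp, mediator], outcome)   # outcome model
    return {"a": a, "b": b, "direct": c_prime,
            "indirect": a * b, "total": c_prime + a * b}

def bootstrap_indirect(snp, mediator, outcome, n_boot=1000, seed=0):
    """95% percentile bootstrap interval for the indirect effect,
    resampling individuals with replacement."""
    rng = random.Random(seed)
    n = len(snp)
    draws = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        draws.append(mediation_effects([snp[i] for i in idx],
                                       [mediator[i] for i in idx],
                                       [outcome[i] for i in idx])["indirect"])
    draws.sort()
    return draws[int(0.025 * n_boot)], draws[int(0.975 * n_boot) - 1]
```

An interval that excludes zero would point to a non-zero indirect effect for that trio; multiple-testing correction across trios would still be needed, as described above.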


Chapter 4

Investigating the early-life origins of immune-

related diseases using neonatal eQTLs


4.1 Introduction

Epidemiologic studies have shown that many complex human diseases originate from early life177,178.

Early-life events (e.g. microbial colonisation) impact the development of the immune system and

influence future disease susceptibility171-174. Gene-by-environment interactions at an early stage in life

affect the development of many diseases, such as asthma260. For example, smoke exposure in early life

interacts with genetic factors and contributes to asthma development261, and pet exposure interacting

with genetic variants attenuates the development of atopy262.

Studies of early-life traits have provided insights on the genetic contributions to phenotypic variation

and how these genetic effects begin early in life. Hoffjan et al. observed that genetic variation in genes

involved in immune regulation was associated with PHA-induced cytokines and atopic phenotypes in

the first year of life263. Thompson et al. identified genetic variation in the integrin beta3 gene (ITGB3)

associated with early-life asthma and allergy phenotypes264. Moffatt et al. performed genome-wide

association studies (GWAS), and identified that genetic variants at the ORMDL3/GSDMB locus were

associated with childhood-onset asthma265,266. Integrative analysis using expression quantitative trait

locus (eQTL) mapping data further suggested that these variants might contribute to asthma risk through

the regulation of expression levels of ORMDL3 and GSDMB265,266. Childhood-onset asthma generally

has stronger underlying genetic factors compared to adult-onset asthma. Using the UK Biobank data,

Pividori et al. identified a greater number of loci specific to childhood-onset (23 loci) than to adult-

onset (1 locus) asthma, and they also observed stronger genetic effects on childhood asthma267. GWAS

have also uncovered susceptibility loci associated with other early-life diseases, such as childhood

recalcitrant atopic dermatitis (an inflammatory skin disease)268, and viral respiratory illnesses and

asthma in childhood269. By analysing timing of parturition in 84,689 infants, Liu et al. identified a locus

located near genes encoding interleukin-1 cytokines on 2q13, and they showed that this association had

foetal rather than maternal origin270.

Recently, studies investigating genetic effects on gene expression in human perinatal tissues have

started to uncover the developmental origins of complex diseases. Hannon et al. observed that the

genetic variants associated with methylation status (methylation QTLs) in prenatal foetal brain tissues

were enriched for schizophrenia-associated variants, supporting the hypothesis of schizophrenia having

an early neurodevelopmental component146. Genetic regulatory variants affecting gene expression

(eQTLs) identified in foetal brains190 and placentas188 were also reported to be enriched for genetic

variants associated with relevant postnatal traits, suggesting that the early-life regulatory variants

plausibly influence disease risks later in life. Many methylation146 and expression QTLs190 in foetal

brains were conserved in adult brain tissues, while their impact on related traits may start at an early

stage in life. All these studies describing the early-life genetic regulation provide direct evidence on the


molecular level of genetic components underlying diseases of adulthood operating in perinatal tissues

early in life.

GWAS have identified numerous genetic loci that are associated with complex traits and diseases, but

the biological mechanisms remain elusive. Integrative approaches using eQTL datasets have provided

insights on the functional roles of genetic variants as well as pathophysiology of polygenic disease105,106.

Methods have been developed to identify risk genes involved in disease pathogenesis, such as

colocalisation analysis to identify loci with shared causal variants driving both eQTL and GWAS

signals111, and Mendelian randomisation in which genetic variants are used to make causal inferences

between gene expression traits and higher-order phenotypes23,31 (Chapter 1). In addition, more recent

studies have started to investigate how regulatory effects on gene expression are modified by certain

stimuli and response eQTLs (reQTLs) identified in these studies can potentially link disease

susceptibility genes with environmental conditions that modify their expression and underlying genetic

regulation102,128,134,135.

There is increasing recognition that early childhood is a critical period when various events may

influence future disease development. However, the way that genetic regulatory variants in neonatal

immune cells operating in this period alter susceptibility of immune-related diseases that develop later

in life is not well-studied. In addition, investigation is needed to better understand the molecular

mechanisms through which gene-by-environment interactions affect disease pathophysiology. I

hypothesised that the effects of genetic variants associated with immune-related diseases could become

manifest at an early stage of life in immune cells, and that the early-life transcriptome is potentially associated

with disease risk later in life.

The research objective of this chapter is to explore the early-life origins of immune-related diseases

(where pathogenesis is based on some type of immune dysfunction; e.g. allergy, autoimmunity) that

often manifest later in life. The specific aims of this chapter were:

(1) To investigate the effects of genetic variants associated with immune-related diseases on early-

life gene expression in neonatal immune cells.

(2) To identify loci where early-life eQTL and GWAS signals are driven by shared causal variants,

and to use condition-specific eQTLs to understand disease pathophysiology under relevant cellular

context.

(3) To identify gene expression traits at birth that potentially influence risks for immune-related

diseases that develop later in life.


To address these aims, I used findings from previous GWAS investigating immune-related diseases in

external large cohorts, including findings that were derived from specific resources (ImmunoBase) and

microarrays (ImmunoChip) targeting this subset of diseases, in addition to other publicly available

resources (GWAS Catalog and LD Hub). I leveraged one of the first resources to characterise gene

expression in neonatal immune responses and functional roles of early-life regulatory variants from

Chapter 3, and applied integrative approaches to explore the role of early-life transcriptome and genetic

regulation in mediating later disease risks. To address the first aim, I performed enrichment analyses to

investigate overlaps between early-life eQTLs in neonatal immune cells, and genetic variants associated

with immune-related diseases that develop later in life. To address the second aim, a Bayesian statistical

method was used to identify specific loci where shared causal variants were underlying both the eQTL

and disease associations (colocalisation of two signals). To address the third aim, I performed

Mendelian randomisation analysis to evaluate the causal role of early-life gene expression in disease

development later in life.

4.2 Results

4.2.1 Early-life cis-eQTLs are enriched for SNPs associated with immune-related diseases

It has been established that genetic variants that are associated with complex traits and diseases are

enriched for eQTLs, and vice versa92,105,106,190,271. To investigate whether early-life gene expression

regulatory variants were more likely to affect risk for diseases later in life, I performed enrichment

analyses to test for significant overlaps between cis-eQTLs identified in each of four cell type- and

condition-specific CAS datasets at birth (4.4.1 Materials and Methods), and GWAS SNPs associated

with various immune-related diseases from previous association studies (4.4.2 Materials and

Methods). I observed that the four sets of cis-eQTLs were significantly enriched for GWAS SNPs

associated with many immune-related diseases, such as allergic disease (asthma, hay fever, or

eczema)272 and inflammatory bowel disease273 (Figure 4.1). A stronger enrichment was observed in

monocyte eQTLs than in T cell eQTLs, which might be related to the smaller number of significant eQTLs

identified in monocytes than in T cells (Chapter 3).
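
To make the quantity plotted in Figure 4.1 concrete, the following minimal sketch computes a log odds ratio (beta) and its 95% confidence interval from a 2×2 table of SNP counts. It is illustrative only: the enrichment method actually used is described in 4.4.2 Materials and Methods and may differ (for example, by accounting for LD or allele frequency), and the function name and the counts are hypothetical.

```python
import math

def log_odds_ratio(both, eqtl_only, gwas_only, neither, z=1.96):
    """Log odds ratio (beta) and Wald confidence interval for a 2x2 table.

    both      : SNPs that are eQTLs and GWAS SNPs
    eqtl_only : SNPs that are eQTLs but not GWAS SNPs
    gwas_only : SNPs that are GWAS SNPs but not eQTLs
    neither   : SNPs that are neither
    """
    cells = [both, eqtl_only, gwas_only, neither]
    # Haldane-Anscombe correction: add 0.5 to every cell if any count is zero
    if min(cells) == 0:
        cells = [c + 0.5 for c in cells]
    a, b, c, d = cells
    beta = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return beta, (beta - z * se, beta + z * se)

# Hypothetical counts, for illustration only (not taken from this chapter)
beta, (lower, upper) = log_odds_ratio(30, 970, 100, 98900)
print(f"beta = {beta:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

A positive beta with a confidence interval that excludes zero corresponds to a bar in Figure 4.1 whose error bar does not cross the horizontal axis.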


Figure 4.1: Enrichment of early-life cis-eQTLs for genetic variants associated with immune-related diseases. Each plot represents one disease, with “ic” indicating that the study was performed using ImmunoChip array (4.4.2 Materials and Methods). Disease abbreviations are as follows: “IBD”: inflammatory bowel disease, “ATD”: autoimmune thyroid disease, “JIA”: juvenile idiopathic arthritis, “PBC”: primary biliary cirrhosis, “PSC”: primary sclerosing cholangitis, “RA”: rheumatoid arthritis, and “SLE”: systemic lupus erythematosus. Enrichment was tested for each of the four sets of cis-eQTLs separately (x-axes: “M” and “T” indicate monocytes and T cells, respectively). Height of each bar indicates the beta coefficient (or log of the odds ratio) from enrichment analysis, with 95% confidence intervals (CIs) shown as the error bars. Tests that are not significant at P-value 0.05 are not shown since the beta estimates are not reliable and the CIs are too large. Colour of each bar indicates the significance level based on minus log10 P-values; the same colour is used if the P-value is smaller than 1×10⁻²⁵. Asterisks indicate that the tests are significant after Bonferroni correction for multiple testing.

[Figure 4.1 image not reproduced. Panels: Allergic disease, Allergic rhinitis, Allergic sensitisation, Asthma, Asthma (adult), Asthma (childhood), IBD, Crohn's disease, Ulcerative colitis, Celiac disease, Celiac disease (ic), ATD (ic), JIA (ic), Multiple sclerosis (ic), Narcolepsy (ic), PBC, PBC (ic), PSC, Psoriasis (ic), RA, RA (ic), SLE, Type 1 diabetes, Type 1 diabetes (ic). x-axis: Resting M, LPS M, Resting T, PHA T. y-axis: Beta (0 to 9). Colour scale: −log10 P-value (5 to >25).]


4.2.2 Colocalisation of early-life regulatory variants with disease associations

To further provide evidence of shared genetic causal variants driving both eQTL and GWAS signals, I

used a Bayesian method111 to investigate each locus where overlaps of eQTLs and GWAS hits were

observed (4.4.3 Materials and Methods). In total, I observed 68 colocalisations, involving 5, 9, 15, and 17

cis-eQTLs in resting monocytes, LPS-stimulated monocytes, resting T cells, and PHA-stimulated T

cells, respectively (Figure 4.2, Table 4.1; the four numbers do not sum up to 68 because a single cis-

eQTL can colocalise with multiple disease associations). Three cis-eQTLs that were shared across

resting and LPS-stimulated monocytes colocalised with the same disease associations; in T cells there

were five shared cis-eQTLs. Response eQTLs (reQTLs) are those with significantly different effects on

gene expression across conditions. I observed 17 colocalisations involving 12 reQTLs: 1 monocyte

reQTL specific to LPS stimulation (eQTL of CTSH), and 11 T cell reQTLs, among which 8 were

specific to PHA stimulation and 3 to resting T cells.
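
The logic behind the posterior probabilities reported in Table 4.1 (PP3 for distinct causal variants, PP4 for a shared causal variant) can be sketched as follows. This is a simplified illustration that assumes at most one causal variant per trait in the region and takes per-SNP log approximate Bayes factors as given; the prior probabilities `p1`, `p2`, and `p12` and all input values are illustrative assumptions, not the settings used in this chapter (see 4.4.3 Materials and Methods).

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) for a list of log values."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def logdiffexp(a, b):
    """log(exp(a) - exp(b)); returns -inf when the difference is not positive."""
    if b >= a:
        return float("-inf")
    return a + math.log1p(-math.exp(b - a))

def coloc_posteriors(labf_eqtl, labf_gwas, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities [PP0, PP1, PP2, PP3, PP4] for one locus.

    labf_eqtl, labf_gwas: per-SNP log approximate Bayes factors for the same
    SNPs in the same order (assumed inputs). Priors are illustrative.
    """
    if len(labf_eqtl) != len(labf_gwas):
        raise ValueError("both traits must be evaluated on the same SNPs")
    s1 = logsumexp(labf_eqtl)
    s2 = logsumexp(labf_gwas)
    s12 = logsumexp([a + b for a, b in zip(labf_eqtl, labf_gwas)])
    log_h = [
        0.0,                                                   # H0: no association
        math.log(p1) + s1,                                     # H1: eQTL only
        math.log(p2) + s2,                                     # H2: GWAS only
        math.log(p1) + math.log(p2) + logdiffexp(s1 + s2, s12),  # H3: distinct variants
        math.log(p12) + s12,                                   # H4: shared variant
    ]
    total = logsumexp(log_h)
    return [math.exp(x - total) for x in log_h]

# Toy locus with two SNPs: both signals peak at the first SNP
pp = coloc_posteriors([10.0, 0.0], [12.0, 0.0])
print("PP3 = %.4f, PP4 = %.4f" % (pp[3], pp[4]))
```

In this framing, a high PP4 together with a high PP4-to-PP3 ratio (the “Ratio” column of Table 4.1) favours a shared causal variant over two distinct ones.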

For cis-eQTLs of genes such as BACH2, colocalisation was observed with multiple diseases (Figure

4.2, Table 4.1). The cis-eQTL of BACH2, which was specific to resting T cells, colocalised with GWAS

hits for autoimmune thyroid disease (ImmunoChip)274, celiac disease (ImmunoChip)275, multiple

sclerosis (ImmunoChip)276, rheumatoid arthritis (ImmunoChip)277, and type 1 diabetes

(ImmunoChip)278. BACH2 encodes a transcription repressor that restrains terminal differentiation and

promotes the development of memory lymphocytes including CD8+ T cells279 and B cells280. The top

eSNP of BACH2 (rs72928038; intronic) was also the most significant SNP associated with the diseases

at this locus. The disease risk allele A was associated with decreased expression levels of the BACH2 gene,

which was consistent with previous findings showing that the knock-out or mutation of BACH2 resulted

in immunodeficiency or disruption to regulatory T cells with subsequent autoimmunity281,282.

Colocalisation of reQTLs is particularly informative because it suggests that the genetic variants may

exert effects in disease pathogenesis through pathways that only activate in specific cells under specific

conditions. For example, the cis-eQTL of IL13 in PHA-stimulated T cells, which was a significant

reQTL specific to PHA stimulation, colocalised with the GWAS hits associated with allergic disease

(asthma, hay fever, or eczema)272, asthma283, and allergic sensitisation284 (Figure 4.2, Figure 4.3, Table

4.1). The risk allele for all three diseases was associated with higher IL13 expression in PHA-stimulated

T cells (risk allele T of top eSNP rs1295686 in Figure 4.3B). The top eSNP was an intronic SNP;

among the other four eSNPs that were in high LD (r² > 0.98) with the top eSNP, there was a missense

SNP (rs20541) located in the last (5th) exon of the IL13 gene, with the risk allele A corresponding to

glutamine and the other allele G corresponding to arginine. The other colocalisation of reQTL with

asthma occurred between the cis-eQTL of CCL20 in PHA-stimulated T cells and a GWAS signal

specific to childhood-onset asthma267 (Figure 4.2, Figure S4.1, Table 4.1). The eQTL of CCL20 was

a significant reQTL specific to PHA-stimulated T cells, with a significant increase in the effect size of


the top eSNP rs13034664 after T cell activation. CCL20 did not have significant cis-eQTLs in resting

T cells. CCL20 expression was elevated after T cell activation. The childhood asthma risk allele A

was associated with lower expression levels of CCL20 in PHA-stimulated T cells (Figure S4.1B).

Figure 4.2: Colocalisation of cis-eQTLs with disease associations. This heatmap shows all cases with strong evidence of colocalisations between cis-eQTLs of corresponding genes (eGenes) in rows and GWAS hits associated with allergic and autoimmune diseases in columns (“ic” indicates that the study was performed using ImmunoChip array). Colours indicate the cell type where the significant colocalisation was observed. Asterisks indicate that the colocalised eQTLs are response eQTLs (reQTLs).

[Figure 4.2 image not reproduced. Rows (eGene): FAM167A, PLEKHM1, BTN3A2, MUC1, TSPAN14, CARD9, RGS19, ITGAL, LSP1, HLA-DRB6, HLA-DRB1, RERE, HIST1H2BD, IL13, ERMP1, CCL20, CLECL1, ADCY3, PHOSPHO2, SUOX, HLA-G, HLA-A, IL6ST, AFF3, UBASH3A, VARS2, BACH2, CTSH, DGKQ, UBE2L3. Columns (Disease): Allergic disease, Allergic sensitisation, Asthma, Asthma childhood onset, Ulcerative colitis, Inflammatory bowel disease, Crohn's disease, Celiac disease, Celiac disease (ic), Primary biliary cirrhosis (ic), Primary biliary cirrhosis, Narcolepsy (ic), Multiple sclerosis (ic), Type 1 diabetes (ic), Type 1 diabetes, Autoimmune thyroid disease (ic), Rheumatoid arthritis (ic), Rheumatoid arthritis, Primary sclerosing cholangitis, Systemic lupus erythematosus. Colour legend: Monocyte, T cell, Both. *: reQTL.]


Table 4.1: List of all colocalisations of cis-eQTLs with disease associations. Colocalised response eQTLs are highlighted in blue. Diseases with "ic" indicate that the GWAS was performed using ImmunoChip array (4.4.2 Materials and Methods). The “Cis-eGene” column shows the genes that are associated with the corresponding colocalised cis-eQTLs identified in a certain condition (the “Condition” column: “M” and “T” indicate monocytes and T cells, respectively). The “Top eSNP” (chromosome number followed by GRCh37 genomic position) and “RSID” columns show the top cis-eQTL SNPs for each gene. The “GWAS Pval” column shows the minimum P-value of the corresponding locus. The number of SNPs in eQTL and GWAS datasets (“#SNPs eQTL” and “#SNPs GWAS”), and the number of overlapping SNPs (“#SNP tested”) for each locus are listed. The “PP3” and “PP4” columns show the posterior probabilities for distinct and shared causal variants, respectively (4.4.3 Materials and Methods). The “Ratio” column indicates the ratio of PP4 to PP3.

Disease | Cis-eGene | Condition | Top eSNP | RSID | GWAS Pval | #SNPs eQTL | #SNPs GWAS | #SNP tested | PP3 | PP4 | Ratio
Allergic disease | HIST1H2BD | LPS M | 6:26167951 | rs9379828 | 2.31E-07 | 777 | 1432 | 773 | 0.030 | 0.954 | 31.8
Allergic disease | IL13 | PHA T | 5:131995843 | rs1295686 | 1.62E-16 | 451 | 714 | 430 | 0.044 | 0.949 | 21.5
Allergic disease | RERE | Resting T | 1:8497307 | rs301802 | 1.33E-16 | 508 | 835 | 504 | 0.077 | 0.920 | 11.9
Allergic sensitisation | IL13 | PHA T | 5:131995843 | rs1295686 | 5.05E-07 | 451 | 1123 | 449 | 0.082 | 0.900 | 10.9
Asthma | ERMP1 | Resting T | 9:5847602 | rs35598997 | 8.13E-07 | 513 | 270 | 224 | 0.011 | 0.973 | 91.4
Asthma | HLA-DRB1 | Resting M | 6:32612339 | rs9273082 | 4.80E-28 | 3776 | 794 | 626 | 0.010 | 0.989 | 94.7
Asthma | HLA-DRB6 | LPS M | 6:32608014 | rs9272625 | 4.80E-28 | 3805 | 812 | 645 | 0.010 | 0.989 | 95.1
Asthma | IL13 | PHA T | 5:131995843 | rs1295686 | 1.36E-14 | 451 | 215 | 175 | 0.043 | 0.950 | 22.3
Asthma childhood onset | CCL20 | PHA T | 2:228672579 | rs13034664 | 4.31E-12 | 1006 | 2132 | 998 | 0.087 | 0.913 | 10.5
Autoimmune thyroid disease (ic) | BACH2 | Resting T | 6:90976768 | rs72928038 | 1.23E-07 | 521 | 195 | 148 | 0.006 | 0.988 | 178.3
Celiac disease | UBE2L3 | PHA T | 22:21979096 | rs11089637 | 2.49E-07 | 384 | 40 | 35 | 0.041 | 0.880 | 21.6
Celiac disease | UBE2L3 | Resting T | 22:21979584 | rs12158299 | 2.49E-07 | 384 | 40 | 35 | 0.019 | 0.960 | 49.3
Celiac disease (ic) | BACH2 | Resting T | 6:90976768 | rs72928038 | 2.71E-07 | 521 | 311 | 122 | 0.091 | 0.860 | 9.4
Celiac disease (ic) | CTSH | Resting M | 15:79234957 | rs34593439 | 6.50E-07 | 865 | 346 | 208 | 0.016 | 0.978 | 60.6
Celiac disease (ic) | CTSH | Resting T | 15:79234957 | rs34593439 | 6.50E-07 | 865 | 346 | 208 | 0.015 | 0.979 | 65.0
Celiac disease (ic) | UBASH3A | Resting T | 21:43855067 | rs1893592 | 2.96E-09 | 974 | 223 | 116 | 0.000 | 1.000 | 51780.5
Celiac disease (ic) | VARS2 | PHA T | 6:30888161 | rs2249464 | 9.88E-324 | 1804 | 529 | 434 | 0.142 | 0.855 | 6.0
Celiac disease (ic) | VARS2 | Resting T | 6:30888161 | rs2249464 | 9.88E-324 | 1804 | 529 | 434 | 0.132 | 0.868 | 6.6
Crohn's disease | CARD9 | LPS M | 9:139266496 | rs4077515 | 3.15E-30 | 629 | 1425 | 594 | 0.153 | 0.847 | 5.5
Crohn's disease | MUC1 | PHA T | 1:155033308 | rs11589479 | 1.49E-09 | 338 | 819 | 325 | 0.008 | 0.900 | 112.7
Crohn's disease | TSPAN14 | LPS M | 10:82280137 | rs1878036 | 5.44E-11 | 743 | 1601 | 714 | 0.152 | 0.848 | 5.6
Crohn's disease | TSPAN14 | Resting T | 10:82280137 | rs1878036 | 5.44E-11 | 743 | 1601 | 714 | 0.159 | 0.841 | 5.3
Inflammatory bowel disease | CARD9 | LPS M | 9:139267533 | rs4078099 | 3.52E-36 | 629 | 1453 | 596 | 0.152 | 0.848 | 5.6
Inflammatory bowel disease | HLA-DRB1 | Resting M | 6:32612339 | rs9273082 | 1.21E-51 | 3776 | 4085 | 2006 | 0.022 | 0.977 | 43.5
Inflammatory bowel disease | HLA-DRB6 | LPS M | 6:32608014 | rs9272625 | 1.21E-51 | 3805 | 4114 | 2035 | 0.107 | 0.892 | 8.3
Inflammatory bowel disease | HLA-DRB6 | Resting M | 6:32571845 | rs9270893 | 2.17E-61 | 3903 | 4135 | 2126 | 0.021 | 0.979 | 47.6
Inflammatory bowel disease | LSP1 | PHA T | 11:1874072 | rs907611 | 2.34E-07 | 866 | 1724 | 825 | 0.128 | 0.858 | 6.7
Inflammatory bowel disease | TSPAN14 | LPS M | 10:82280137 | rs1878036 | 2.99E-11 | 743 | 1621 | 714 | 0.150 | 0.850 | 5.7
Inflammatory bowel disease | TSPAN14 | Resting T | 10:82280137 | rs1878036 | 2.99E-11 | 743 | 1621 | 714 | 0.158 | 0.842 | 5.3
Multiple sclerosis (ic) | BACH2 | Resting T | 6:90976768 | rs72928038 | 7.63E-07 | 521 | 356 | 139 | 0.014 | 0.970 | 66.9
Multiple sclerosis (ic) | HLA-A | LPS M | 6:29832919 | rs2975031 | 2.00E-82 | 2545 | 772 | 431 | 0.008 | 0.992 | 127.0
Multiple sclerosis (ic) | HLA-A | Resting M | 6:29878738 | rs440908 | 2.00E-82 | 2411 | 892 | 472 | 0.012 | 0.987 | 79.8
Multiple sclerosis (ic) | HLA-A | PHA T | 6:29916391 | rs2517718 | 2.00E-82 | 2276 | 983 | 508 | 0.007 | 0.993 | 133.4
Multiple sclerosis (ic) | HLA-G | PHA T | 6:29909840 | rs2734904 | 2.00E-82 | 2279 | 958 | 490 | 0.013 | 0.987 | 74.0
Narcolepsy (ic) | CTSH | Resting M | 15:79234957 | rs34593439 | 1.58E-09 | 865 | 296 | 227 | 0.011 | 0.989 | 87.6
Narcolepsy (ic) | CTSH | Resting T | 15:79234957 | rs34593439 | 1.58E-09 | 865 | 296 | 227 | 0.010 | 0.990 | 97.1
Primary biliary cirrhosis | CTSH | LPS M | 15:79231518 | rs11638844 | 1.63E-07 | 871 | 158 | 136 | 0.034 | 0.966 | 28.3
Primary biliary cirrhosis | CTSH | PHA T | 15:79232319 | rs11072815 | 1.63E-07 | 871 | 158 | 136 | 0.060 | 0.939 | 15.6
Primary biliary cirrhosis | DGKQ | LPS M | 4:980464 | rs4690220 | 2.71E-07 | 856 | 200 | 169 | 0.009 | 0.991 | 115.8
Primary biliary cirrhosis | DGKQ | PHA T | 4:965779 | rs11724804 | 2.71E-07 | 883 | 204 | 173 | 0.018 | 0.982 | 54.6
Primary biliary cirrhosis | DGKQ | Resting T | 4:954275 | rs3733346 | 2.71E-07 | 886 | 207 | 173 | 0.008 | 0.992 | 131.1
Primary biliary cirrhosis (ic) | BTN3A2 | LPS M | 6:26367833 | rs9393710 | 1.13E-09 | 913 | 55 | 52 | 0.133 | 0.867 | 6.5
Primary biliary cirrhosis (ic) | BTN3A2 | Resting M | 6:26367833 | rs9393710 | 1.13E-09 | 913 | 55 | 52 | 0.124 | 0.876 | 7.0
Primary biliary cirrhosis (ic) | BTN3A2 | PHA T | 6:26485573 | rs9348721 | 1.13E-09 | 948 | 55 | 52 | 0.157 | 0.843 | 5.4
Primary biliary cirrhosis (ic) | BTN3A2 | Resting T | 6:26367833 | rs9393710 | 1.13E-09 | 913 | 55 | 52 | 0.124 | 0.876 | 7.1
Primary biliary cirrhosis (ic) | PLEKHM1 | PHA T | 17:43803189 | rs1358071 | 2.22E-09 | 1339 | 29 | 27 | 0.114 | 0.710 | 6.2
Primary sclerosing cholangitis | UBASH3A | Resting T | 21:43855067 | rs1893592 | 1.90E-07 | 974 | 1572 | 851 | 0.011 | 0.987 | 89.7
Rheumatoid arthritis | AFF3 | PHA T | 2:100806940 | rs11676922 | 3.60E-12 | 702 | 1478 | 698 | 0.139 | 0.828 | 5.9
Rheumatoid arthritis | IL6ST | Resting T | 5:55444683 | rs7731626 | 7.90E-23 | 659 | 1346 | 657 | 0.000 | 1.000 | 98704.6
Rheumatoid arthritis | UBASH3A | Resting T | 21:43855067 | rs1893592 | 9.80E-09 | 974 | 1918 | 969 | 0.131 | 0.869 | 6.6
Rheumatoid arthritis (ic) | BACH2 | Resting T | 6:90976768 | rs72928038 | 8.23E-07 | 521 | 292 | 141 | 0.012 | 0.972 | 81.6
Systemic lupus erythematosus | FAM167A | PHA T | 8:11338370 | rs2618444 | 4.83E-18 | 1047 | 1619 | 1031 | 0.131 | 0.695 | 5.3
Type 1 diabetes | ADCY3 | PHA T | 2:25147289 | rs73920612 | 1.12E-07 | 636 | 279 | 251 | 0.086 | 0.912 | 10.6
Type 1 diabetes | CLECL1 | LPS M | 12:9869271 | rs2268146 | 6.04E-09 | 695 | 359 | 254 | 0.120 | 0.879 | 7.3
Type 1 diabetes | CLECL1 | PHA T | 12:9885999 | rs10492166 | 6.04E-09 | 715 | 375 | 267 | 0.106 | 0.893 | 8.4
Type 1 diabetes | PHOSPHO2 | PHA T | 2:170517594 | rs7575494 | 2.75E-13 | 804 | 275 | 257 | 0.050 | 0.950 | 18.9
Type 1 diabetes | PHOSPHO2 | Resting T | 2:170569672 | rs13009840 | 2.75E-13 | 817 | 270 | 262 | 0.065 | 0.930 | 14.4
Type 1 diabetes | SUOX | Resting T | 12:56474379 | rs3741499 | 4.31E-31 | 141 | 89 | 57 | 0.054 | 0.942 | 17.4
Type 1 diabetes (ic) | BACH2 | Resting T | 6:90976768 | rs72928038 | 4.48E-12 | 521 | 298 | 142 | 0.006 | 0.989 | 172.4
Type 1 diabetes (ic) | CTSH | Resting M | 15:79234957 | rs34593439 | 2.37E-13 | 865 | 334 | 255 | 0.011 | 0.989 | 88.7
Type 1 diabetes (ic) | CTSH | Resting T | 15:79234957 | rs34593439 | 2.37E-13 | 865 | 334 | 255 | 0.010 | 0.990 | 98.3
Type 1 diabetes (ic) | SUOX | Resting T | 12:56474379 | rs3741499 | 7.69E-21 | 141 | 226 | 49 | 0.053 | 0.944 | 17.8
Ulcerative colitis | HLA-DRB1 | Resting M | 6:32612339 | rs9273082 | 4.20E-91 | 3776 | 4050 | 2002 | 0.022 | 0.977 | 43.5
Ulcerative colitis | HLA-DRB6 | LPS M | 6:32608014 | rs9272625 | 4.20E-91 | 3805 | 4080 | 2031 | 0.107 | 0.892 | 8.3
Ulcerative colitis | HLA-DRB6 | Resting M | 6:32571845 | rs9270893 | 4.20E-91 | 3903 | 4110 | 2125 | 0.021 | 0.979 | 47.5
Ulcerative colitis | ITGAL | Resting T | 16:30482540 | rs12598978 | 6.91E-07 | 236 | 563 | 222 | 0.109 | 0.873 | 8.0
Ulcerative colitis | LSP1 | PHA T | 11:1874072 | rs907611 | 1.36E-07 | 866 | 1691 | 811 | 0.082 | 0.915 | 11.1
Ulcerative colitis | RGS19 | Resting T | 20:62695931 | rs6062343 | 4.33E-07 | 566 | 1277 | 548 | 0.085 | 0.904 | 10.7


Figure 4.3: Colocalisation between the response eQTL (reQTL) of IL13 and allergic diseases. (A) Regional plots show eQTL association with gene expression of IL13 in PHA-stimulated T cells (purple background), and GWAS associations with allergic disease (asthma, hay fever, or eczema)272, asthma283, and allergic sensitisation284. The minus log10 P-value is plotted on y-axes for all SNPs located within 200 kb from the top eSNP of IL13 (rs1295686). Colours of dots indicate the LD correlation with the top eSNP in purple. Positions of genes located on this locus are shown on the bottom. (B) Box plots show the rank-normalised gene expression of IL13 (y-axes) in resting T cells (left) and in PHA-stimulated T cells (right) stratified by genotypes of the reQTL rs1295686 (x-axes). In resting T cells, no SNP was significantly associated with IL13.

[Figure 4.3 image not reproduced. (A) Panels: Cis-eQTLs (rs1295686, PHA T), Allergic disease, Asthma, Allergic sensitisation; x-axis: genomic position (131.8 to 132.2 Mb); y-axis: −log10 P-value; legend: LD with the top eSNP ([0.8, 1), [0.4, 0.6), [0.2, 0.4), [0, 0.2)); genes shown: IRF1, RAD50, IL5, IL13, IL4, KIF3A, CCNI2. (B) x-axis: rs1295686 (chr5:131995843) genotypes TT, TC, CC in resting T cells and PHA-stimulated T cells; y-axis: rank-normalised expression of IL13 (−2 to 2).]


Cis-eQTLs of the CTSH gene identified in resting and stimulated cells colocalised with different disease

associations (Figure S4.2, Table 4.1). The same cis-eQTL signal (top eSNP: rs34843303) was

identified for CTSH in resting monocytes as well as in resting T cells. The top eSNPs in LPS-stimulated

monocytes and in PHA-stimulated T cells were in high LD with each other (r² = 0.95), and they were

both independent of the top eSNP identified in resting cells (LD r² < 0.3). CTSH cis-eQTLs identified

in resting monocytes or T cells colocalised with GWAS hits for celiac disease (ImmunoChip)275,

narcolepsy (ImmunoChip)285, and type 1 diabetes (ImmunoChip)278; while the cis-eQTLs identified in

LPS-stimulated monocytes and in PHA-stimulated T cells colocalised with primary biliary cirrhosis286

(Figure S4.2). CTSH (Cathepsin H) encodes a lysosomal protease involved in antigen presentation,

among other functions. The top eSNP rs11638844 in LPS-stimulated monocytes was a significant

reQTL. The top eSNP rs11072815 identified in PHA-stimulated T cells also showed an increased effect

size after PHA stimulation, but it was not significant at the FDR 0.05 significance level.

4.2.3 Causal evaluation of genes for immune-related diseases

Colocalisation between eQTLs and disease associations does not necessarily imply that the genes are

causal for the diseases. For instance, colocalisation can be observed for a locus where the same causal

variant drives two unrelated traits. To further investigate the causal effects of gene expression, I

performed Mendelian randomisation (MR) analysis using pruned cis-eQTLs (LD r² < 0.1) as

instrumental variables (IVs; 4.4.4 Materials and Methods). I focused on genes with multiple IVs since

the causal inference was more reliable. BTN3A2 was identified to have significant causal associations

with multiple diseases (Figure 4.4). BTN3A2 had significant cis-eQTLs in all four experimental

conditions; the same top eSNP was shared across three conditions, and the one in PHA-stimulated T

cells was in high LD (r² = 0.92) with it. None of the top eSNPs had significant interactions with

the condition, and all had the same direction of eQTL effects on gene expression. Increased

expression of BTN3A2 was causally associated with decreased risks for asthma283 including both

childhood- and adult-onset asthma267, allergic rhinitis284, primary sclerosing cholangitis287, and systemic

lupus erythematosus288, but associated with increased risks for inflammatory bowel disease including

Crohn’s disease273, and primary biliary cirrhosis286 (ImmunoChip289). In resting and LPS-stimulated

monocytes where only one IV was available, the causal effects were also significant using the Wald

ratio test and the direction was consistent with that inferred by the inverse variance weighted method

in two T cell cultures.
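
The two estimators mentioned above can be written compactly. The sketch below is a minimal illustration, assuming summary statistics (SNP effect on expression `bx`, SNP effect on disease `by`, and the standard error of `by`) for independent instruments; it uses a first-order standard error for the Wald ratio and the fixed-effect inverse variance weighted estimator. The function names and numbers are hypothetical, and the settings actually used are described in 4.4.4 Materials and Methods.

```python
import math

def wald_ratio(bx, by, se_by):
    """Causal estimate from a single instrument (Wald ratio).

    Uses a first-order approximation that ignores uncertainty in bx.
    """
    estimate = by / bx
    se = se_by / abs(bx)
    return estimate, se

def ivw(bx, by, se_by):
    """Fixed-effect inverse variance weighted estimate across instruments.

    bx, by, se_by are equal-length lists, one entry per independent SNP.
    """
    weights = [x * x / (s * s) for x, s in zip(bx, se_by)]
    numerator = sum(x * y / (s * s) for x, y, s in zip(bx, by, se_by))
    estimate = numerator / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return estimate, se

# Hypothetical summary statistics for two independent cis-eQTL instruments
bx = [0.5, 0.4]         # effect on expression (sd units)
by = [-0.10, -0.08]     # effect on disease (log odds)
se_by = [0.02, 0.02]    # standard error of by

print("Wald ratio (first SNP):", wald_ratio(bx[0], by[0], se_by[0]))
print("IVW (both SNPs):", ivw(bx, by, se_by))
```

A negative estimate here would correspond to increased expression being associated with decreased disease risk, matching the sign convention of Figure 4.4.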

I observed that the BTN3A2 eQTLs (identified in four experimental conditions) colocalised with the

GWAS hit associated with primary biliary cirrhosis (ImmunoChip289; Figure 4.2). Low posterior

probability supporting colocalisation (PP4: 0.1–0.4; 4.4.3 Materials and Methods) was observed with

asthma association, and no evidence was observed (PP4 <0.01) for primary sclerosing cholangitis and


systemic lupus erythematosus. Colocalisation was not tested for childhood- or adult-onset asthma,

allergic rhinitis, or inflammatory bowel disease including its subtype Crohn’s disease, since this locus

was not significantly (at P-value threshold of 1×10⁻⁶) associated with any of these diseases. However,

IVs were not required to be significantly associated with the outcome for MR analysis to be performed.

Figure 4.4: Causal effects of BTN3A2 gene expression on multiple immune-related diseases. Effects of increased expression of BTN3A2 in either resting or PHA-stimulated T cells were estimated using Mendelian randomisation analysis (4.4.4 Materials and Methods), and 95% confidence intervals are shown. Effect estimates and P-values were obtained using inverse variance weighted method. A positive effect means that increased gene expression is causally associated with increased disease risk.

4.3 Discussion

The majority of the genetic variants that are associated with complex traits and diseases are located in

non-coding regions, and rarely tag variants in coding regions44. The frequently observed

enrichment for eQTLs suggests that these GWAS variants have potential regulatory effects on gene

expression, through which they may ultimately influence disease susceptibility105,106. This research

chapter presents an integrative analysis to explore the early-life origins of later disease risks, in which

one of the first resources of early-life gene expression and genetic regulation in neonatal immune

cells is integrated with findings from large-scale GWAS investigating immune-related diseases that

develop later in life. This chapter shows that many genetic variants associated with diseases of

adulthood influence gene expression in immune cells at an early stage of life. It also demonstrates that

changes in gene expression at birth might have causal effects on disease risks later in life.

[Figure 4.4 image not reproduced. x-axis: effect on disease risk per sd in gene expression (−0.475 to 0.2). P-values shown in the plot, listed per disease as (Resting T, PHA T):
Primary sclerosing cholangitis: 1.4e−24, 9.6e−11
Systemic lupus erythematosus: 1e−19, 1.1e−07
Asthma: 3.5e−10, 3.8e−08
Asthma childhood onset: 2.4e−05, 1e−05
Asthma adult onset: 8.5e−07, 0.0025
Allergic rhinitis: 0.0086, 0.037
Inflammatory bowel disease: 0.00038, 0.00089
Crohn's disease: 6.3e−06, 0.00017
Primary biliary cirrhosis: 3.5e−10, 1.8e−09]


I observed significant overlaps between CAS eQTLs and variants associated with multiple immune-

related diseases (Figure 4.1). Significant enrichment was not observed for GWAS variants associated

with allergic rhinitis and allergic sensitisation284; this might be related to the reduced sample size of the

publicly-available summary statistics after excluding the 23andMe cohort.

Colocalisation of eQTLs and GWAS hits suggests that these genes might be involved in the

pathogenesis of corresponding diseases, and the use of reQTLs aids in uncovering the specific condition

where the causal SNPs regulate gene expression. I observed more colocalisations of reQTLs specific to

stimulation (N = 9) than those specific to resting conditions (N = 3; Table 4.1). For example, the IL13

reQTL that was specific to PHA stimulation in T cells colocalised with GWAS variants associated with

allergic diseases272,284 and asthma283 (Figure 4.2, Figure 4.3). Interleukin 13 (IL-13) is a cytokine

produced by activated CD4+ and CD8+ T cells290, among others, and it promotes immunoglobulin E

(IgE) production291. IL-13 is detectable in the human placenta and neonatal cells292. IL-13 has been

shown to induce asthma symptoms including airway hyperresponsiveness, increased total serum IgE,

and increased mucus production in murine models293. Increased expression of IL-13 has been observed

in sputum and bronchial biopsy in mild294 and severe295 asthma and it can serve as a biomarker for

severe refractory asthma296. Therapies that target IL-13 (anti-IL-13 antibodies) have been developed,

such as lebrikizumab297 and tralokinumab298; however, neither of them showed consistent significant

improvement in severe asthma exacerbations in phase 3 clinical trials. Further investigation is needed

to determine whether blockade of interleukins is only effective in specific subsets of individuals with

certain genetic variants.

Another asthma-associated locus, which was specific to childhood-onset asthma267, was observed

to colocalise with the reQTL (top eSNP: rs13034664) of CCL20 (Figure 4.2, Figure S4.1). The reQTL

was specific to PHA stimulation, and CCL20 was not a significant eGene in resting T cells. This reQTL

was previously observed in the DICE project55, where no cis-eQTLs at P-value 1×10⁻⁴ were identified

for CCL20 in naïve CD4+ T cells. In activated CD4+ T cells, rs13034664 was significantly associated

with CCL20 expression (P-value = 3.9×10⁻⁹), and the direction of the eQTL effect was also consistent

with what was observed55. CCL20 encodes a C-C chemokine ligand which binds to a G protein-coupled

receptor. Increased expression of CCL20 was observed in airways of patients with chronic obstructive

pulmonary disease (COPD)299 and asthma300,301. CCL20 expression was shown to induce mucin

(MUC5AC) production by binding to its unique receptor encoded by CCR6 in human airway epithelial

cells302. However, in the CAS dataset, the childhood asthma risk allele A was associated with decreased

CCL20 expression levels in PHA-stimulated T cells. A possible explanation for this apparent

discrepancy is that the effect of the risk allele on CCL20 expression differs between tissue types.

The CTSH eQTLs identified in resting and in stimulated cells (monocyte or T cell) were two distinct

signals, and these two eQTL signals were observed to colocalise with different diseases (Figure S4.2).


The resting eQTLs colocalised with GWAS hits for celiac disease (ImmunoChip)275, narcolepsy

(ImmunoChip)285, and type 1 diabetes (ImmunoChip)278, while the eQTLs identified in stimulated

conditions colocalised with primary biliary cirrhosis286. Guo et al. previously observed the

colocalisation of CTSH eQTLs in monocytes with GWAS hits associated with narcolepsy and type 1

diabetes114. Interestingly, they observed the colocalisation in (2-hour and 24-hour) LPS-stimulated

monocytes as well as in resting monocytes114, while I identified a distinct eQTL signal after the LPS

stimulation which did not colocalise with the above two disease associations. The monocyte eQTL data

used by Guo et al. were obtained mostly from adults with a median age of 30 years (interquartile range:

19 to 56 years)134. In this analysis, monocytes from cord blood samples may respond to LPS

stimulation differently, with alternative genetic regulation of CTSH expression.

The cis-eQTL of UBASH3A identified in resting T cells (top eSNP: rs1893592) was a significant reQTL,

which colocalised with celiac disease (ImmunoChip)275, rheumatoid arthritis303, and primary sclerosing

cholangitis287 (Figure S4.3). The UBASH3A cis-eQTL identified in PHA-stimulated T cells was a

distinct signal from the resting cis-eQTL (LD r² between the two top eSNPs was 0.008), and was also

a significant reQTL, but it did not colocalise with any diseases. UBASH3A encodes a protein that

belongs to the T cell ubiquitin ligand (TULA) family; it is also called TULA-1, TULA, or STS-2

(suppressor of T cell signalling). UBASH3A is mostly expressed in lymphocytes, and it negatively

regulates T cell receptor signalling304. The risk allele (A) of rs1893592 for the above three diseases

was associated with decreased expression levels of UBASH3A in resting T cells, which is consistent

with its function as a T cell signalling suppressor. It has been reported that UBASH3A knock-out mice

were more likely to develop arthritis, and it was linked to a higher IL-2 production in CD4+ T cells305.

To interpret the disease-associated SNPs identified in GWAS, the nearest genes to the GWAS variants

are often considered as potential disease risk genes, although this is not always the case. Integration of eQTL

data aids in characterising the regulatory effects of disease-associated loci. For example, I observed

colocalisation between GWAS variants associated with rheumatoid arthritis303 and the cis-eQTL of

IL6ST in resting T cells, and the significant SNPs were located within ANKRD55, in the region upstream

of IL6ST (Figure S4.4). The IL6ST cis-eQTL was a significant reQTL specific to the resting condition,

and no significant eQTL was identified in PHA-stimulated T cells. ANKRD55 itself did not have any

significant cis-eQTLs in CAS. The risk allele G of the top SNP rs7731626 was associated with increased

gene expression levels of IL6ST. Though IL6ST was expressed in multiple tissues from the GTEx

project, rs7731626 was not an eQTL across 48 tissues in the GTEx V7 Release dataset52. In the DICE

eQTL dataset55, rs7731626 was significantly associated with ANKRD55 gene expression, while IL6ST

did not have significant cis-eQTLs. IL6ST (interleukin-6 signal transducer) encodes the protein IL-6

receptor subunit β, or β-receptor glycoprotein 130 (gp130), which is shared by many cytokines

including IL-6 for signal transmission. IL-6 has a critical role in rheumatoid arthritis, and therapy that

inhibits IL-6 trans-signalling is currently in development – for instance, LMT-28 interacts with gp130

and reduces its binding with IL-6/IL-6R (IL-6 receptor) complex306.

Another example where eQTL data helped the characterisation of potential disease mechanisms was

the colocalisation between a GWAS locus associated with systemic lupus erythematosus288 and the

stimulation-specific reQTL of FAM167A (Figure S4.5). The locus is located in the interval between

FAM167A (or C8orf13) and BLK, and is located upstream of both genes which are transcribed in

opposite directions. This FAM167A-BLK locus has been reported to be associated with autoimmune

diseases such as systemic lupus erythematosus288,307,308, systemic sclerosis309, Sjögren’s syndrome310,

and rheumatoid arthritis311. In PHA-stimulated T cells, both FAM167A and BLK had cis-eQTLs in the

FAM167A-BLK locus (the two top eSNPs were in high LD: r² = 0.84). Consistent with the literature307,309,312,

the eSNPs showed opposite directions of eQTL effects on the two genes: the disease risk allele C of the

FAM167A top eSNP rs2409780 was associated with increased expression levels of FAM167A and

decreased expression levels of BLK. Colocalisation analysis did not show strong evidence of

colocalisation between the BLK eQTL and lupus-associated variants (posterior probability for

colocalisation: 0.52). BLK (B lymphocyte kinase) encodes a tyrosine kinase that belongs to the Src

family, and it is involved in B cell receptor signalling and B cell development313. Unlike BLK, the

functions of FAM167A (family with sequence similarity 167 member A) remained unknown until

recently312. By comparing expression profiling in mouse organs, FAM167A was found to be expressed

predominantly in the lung and spleen, and the protein expression of DIORA-1 coded by FAM167A was

confirmed in human lung biopsies312. The percentage of cells in focal infiltrates expressing DIORA-1

in salivary gland (the target organ) biopsies from patients with primary Sjögren’s syndrome was

significantly higher than in non-diseased controls, indicating that DIORA-1 may be involved in

inflammatory pathology driven by B cells314.

In order to evaluate the causal role of gene expression in diseases, I performed Mendelian randomisation

(MR) analysis using independent cis-eQTLs as instrumental variables (IVs). The MR analysis was

limited by the small number of available IVs. BTN3A2, which had multiple IVs, was found to have

causal associations with different immune-related diseases (Figure 4.4). This suggests that increased BTN3A2 expression is

potentially protective against asthma283 (childhood- and adult-onset asthma267), allergic rhinitis284, primary

sclerosing cholangitis287, and systemic lupus erythematosus288; and increased expression levels might

increase the risk for Crohn’s disease (a subtype of inflammatory bowel disease)273 and primary biliary

cirrhosis286,289. BTN3A2 (butyrophilin subfamily 3 member A2), which is located near the human

leukocyte antigen (HLA) region on chromosome 6, encodes a member of the immunoglobulin

superfamily. Butyrophilin (BTN) family members are immune checkpoint regulators315. Increased

BTN3A2 protein expression was reported to be associated with a good prognosis in epithelial ovarian

cancer (EOC) patients, and with a higher density of intraepithelial infiltration of T cells316. There are

two other members in the BTN3 family, BTN3A1 and BTN3A3, which lie close to BTN3A2. BTN3A1

was reported to be an antigen-presenting molecule, which was critical to human γδ T cell activation317,318.

A recent study showed that BTN3A2 regulated the subcellular localisation of BTN3A1 and both of

them were required for T cell activation319. BTN3A2 was previously identified as a causal gene of

chronic obstructive pulmonary disease (COPD) in a study using an MR-based method (SMR) to

investigate effects of gene expression in lungs on COPD risk320.

I performed sensitivity analysis to test for heterogeneity in the causal effect size when three or more

IVs were available. Almost all HLA genes showed significant evidence of heterogeneity (Cochran’s Q P-value

≤ 0.05) for at least one disease, except for HLA-G; thus, the interpretation of causal effects of HLA genes

requires caution. The causal effect sizes of BTN3A2 for primary sclerosing cholangitis and systemic lupus

erythematosus also showed evidence of heterogeneity (Q P-value = 1×10⁻⁷ and 2×10⁻⁴, respectively).

Significant heterogeneity in effect size may indicate horizontal pleiotropy; however, many other factors

may also contribute to heterogeneity in causal effect size. For example, heterogeneity also occurs when

the samples used to calculate the statistics of the instrument-outcome and instrument-exposure

associations are not homogeneous, or when the relationship between the causal factor and the outcome

is not the same in two cohorts321-323.

In conclusion, this chapter demonstrates how regulatory variants for gene expression in early life

contribute to disease risk in later adulthood, by investigating the overlap between cis-eQTLs in neonatal

immune cells and variants associated with disease status obtained from large external GWAS. The influence

of genetic variants on immune-related diseases may begin in the early years of life. This chapter provides

evidence of potential causal effects of changes in early-life gene expression (e.g. BTN3A2) on multiple

immune-related diseases developed later in life, and this analysis may aid in future endeavours to

explore this research space by providing gene candidates for further experiments.

4.4 Materials and Methods

4.4.1 Genetic regulatory variants on early-life gene expression

The genetic regulatory variants on gene expression in neonatal immune cells were obtained from the

datasets generated in Chapter 3. Cord blood peripheral blood mononuclear cell (PBMC) samples were

collected from a prospective birth cohort, the Childhood Asthma Study (CAS), and in vitro cultures of

resting and stimulated cells were profiled on an Illumina microarray platform to quantify the

transcriptome (Figure 3.1). Rank-based inverse normalised gene expression data and lists of cis-eQTLs were

available for four experimental conditions: resting monocytes (N = 116), LPS-stimulated monocytes

(N = 125), resting T cells (N = 126), and PHA-stimulated T cells (N = 127), with 136, 376, 971, and

1,347 genes detected to have significant cis-eQTLs (cis-eGenes) at the 5% FDR level, respectively. I did

not observe multiple independent eQTL signals for most genes (Table 3.1). In monocytes and T cells,

125 and 956 response eQTLs with significantly different genetic effects across conditions were

identified, respectively. GRCh37 genomic coordinates were available for genetic variants and genes.

4.4.2 Processing GWAS summary statistics and enrichment analysis

I used publicly-available GWAS summary statistics from the following three resources: ImmunoBase

(https://www.immunobase.org/), LD Hub20 (http://ldsc.broadinstitute.org/ldhub/), and GWAS

Catalog19 (https://www.ebi.ac.uk/gwas/). GWAS were carried out using either a genome-wide SNP array

or the ImmunoChip array. Fewer genetic variants were measured using the ImmunoChip array (around 100,000

to 200,000 variants after quality control), which was designed for immunogenetics studies and could

capture more variants at immune-relevant genetic loci21. If one disease was investigated in multiple

studies, I downloaded the most recent one, which usually had the largest sample size.

I also used data from GWAS using the ImmunoChip array (labelled “ImmunoChip” or “ic”), since this

platform captured different genetic information.

I downloaded and processed summary statistics obtained using European populations for the following

immune-related diseases: allergic disease (asthma, hay fever, or eczema)272, allergic rhinitis284, allergic

sensitisation284, asthma283, childhood onset asthma267, adult onset asthma267, inflammatory bowel

disease including its two subtypes – Crohn’s disease and ulcerative colitis273, celiac disease324 (and the

ImmunoChip study275), autoimmune thyroid disease (ImmunoChip)274, juvenile idiopathic arthritis

(ImmunoChip)325, multiple sclerosis (ImmunoChip)276, narcolepsy (ImmunoChip)285, primary biliary

cirrhosis286 (and the ImmunoChip study289), primary sclerosing cholangitis287, psoriasis

(ImmunoChip)326, rheumatoid arthritis303 (and the ImmunoChip study277), systemic lupus

erythematosus288, and type 1 diabetes327 (and the ImmunoChip study278). These datasets contained

statistics for both significant and non-significant genetic variants, and GRCh37 genomic coordinates

were available.

I performed enrichment analyses using GARFIELD (version 2)236, which was also used in Chapter 3

to obtain functional annotations of eQTLs (3.4.6 Materials and Methods). Here the annotation data

were the SNPs significantly associated with gene expression identified in the cis-eQTL analysis

performed in Chapter 3. I performed enrichment tests for each of the four sets of significant eQTLs

from the following experimental conditions: resting monocytes, LPS-stimulated monocytes, resting T

cells, and PHA-stimulated T cells. Generalised linear models were applied to test for enrichment in

eQTLs of variants associated with the above diseases at a significance threshold of 1×10⁻⁶. Bonferroni

correction was applied to correct for multiple testing, where the number of tests was the number of

GWAS datasets (24) multiplied by the number of eQTL datasets (4), and the Bonferroni-adjusted

P-value threshold was 0.05/(4×24) ≈ 5.2×10⁻⁴.
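
As a quick check of the arithmetic, the Bonferroni threshold described above can be reproduced in a few lines of Python. This is only a minimal sketch: the function name and the example P-values are hypothetical and are not taken from the analysis.

```python
def bonferroni_threshold(alpha, n_gwas_datasets, n_eqtl_datasets):
    """Per-test P-value threshold when every GWAS dataset is tested against every eQTL set."""
    n_tests = n_gwas_datasets * n_eqtl_datasets
    return alpha / n_tests


# 24 GWAS datasets x 4 eQTL datasets = 96 enrichment tests (numbers from the text).
threshold = bonferroni_threshold(0.05, 24, 4)

# Hypothetical enrichment P-values, for illustration only.
example_pvalues = {
    "disease_1 / resting T cells": 1e-5,
    "disease_2 / resting monocytes": 0.01,
}
significant = [name for name, p in example_pvalues.items() if p <= threshold]
print(f"threshold = {threshold:.2e}; significant: {significant}")
```
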

4.4.3 Colocalisation analysis

I applied a Bayesian method implemented in the coloc v3.1 R package111 to test whether any of the

disease-associated GWAS loci shared the same causal variants with early-life cis-eQTLs identified in

neonatal immune cells. Full summary statistics were required to run the colocalisation analysis using

coloc. For loci where cis-eQTLs were also associated with diseases at a P-value threshold of 1×10⁻⁶,

a colocalisation test was performed on a 400-kb window centred on the top cis-eQTL SNP. For each

locus, the test was performed on overlapping SNPs for which both eQTL and GWAS summary

statistics were available. I excluded regions where too few SNPs (<25) were available for the

colocalisation test. As shown in Guo et al.114, the choice of the prior probability of a SNP being

causal for both traits (p₁₂) affected the posterior support for colocalisation. To be

conservative, I thus used a lower p₁₂ of 1×10⁻⁶ instead of the default value of 1×10⁻⁵.
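
The region-selection step described above (a 400-kb window, SNPs shared between the two datasets, and a minimum of 25 SNPs) could be sketched roughly as follows. The function name, the data structures, and the toy SNP ids are assumptions made for illustration; the actual analysis was run with the coloc R package, and this sketch covers only the pre-filtering of SNPs, not the Bayesian test itself.

```python
WINDOW_BP = 400_000   # total window width, centred on the top cis-eQTL SNP
MIN_SNPS = 25         # regions with fewer shared SNPs are skipped


def snps_for_coloc(top_snp_pos, eqtl_positions, gwas_snp_ids):
    """Return the SNP ids usable for one colocalisation test, or None if too few.

    eqtl_positions: dict mapping SNP id -> base-pair position (one chromosome,
                    same genome build as the GWAS data).
    gwas_snp_ids:   set of SNP ids that have GWAS summary statistics.
    """
    half_window = WINDOW_BP // 2
    shared = [
        snp for snp, pos in eqtl_positions.items()
        if abs(pos - top_snp_pos) <= half_window and snp in gwas_snp_ids
    ]
    if len(shared) < MIN_SNPS:
        return None
    return sorted(shared, key=eqtl_positions.get)


# Toy data: 120 SNPs spaced 5 kb apart; every second one also has GWAS statistics.
eqtl_positions = {f"snp{i}": 1_000_000 + 5_000 * i for i in range(120)}
gwas_snp_ids = {f"snp{i}" for i in range(0, 120, 2)}

region = snps_for_coloc(1_300_000, eqtl_positions, gwas_snp_ids)
sparse_region = snps_for_coloc(1_300_000, eqtl_positions, {"snp60"})
```

In this toy example the window spans snp20 to snp100, of which the 41 even-numbered SNPs are shared, so the region passes the filter; the second call has a single shared SNP and is skipped.
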

For each locus, the Bayesian method assessed the support for the following five exclusive hypotheses:

no causal variants for either of the two traits (H0), a causal variant for one trait only, either gene

expression or disease risk (H1 and H2), distinct causal variants for two traits (H3), and the same shared

causal variant for both traits (H4). The package estimated posterior probabilities (PP0, PP1, PP2, PP3,

PP4) to summarise the evidence for the above five hypotheses. I selected loci where signals for both

traits were observed, thus PP0 was low. High PP1 or PP2 and low PP3 + PP4 indicate a lack of power to

identify the causal signals111, and I excluded loci where PP3 + PP4 < 0.8. I focused on loci with strong

evidence supporting shared causal variants (H4), i.e. a ratio of PP4 to PP3 ≥ 5 (Table 4.1).
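
The two filters on the posterior probabilities can be written as a small decision rule. The sketch below is illustrative only: the function name, the returned labels, and the example values are hypothetical, and the posterior probabilities themselves would come from coloc.

```python
def classify_locus(pp3, pp4, min_signal=0.8, min_ratio=5.0):
    """Apply the two posterior-probability filters described in the text.

    pp3: posterior probability of distinct causal variants (H3).
    pp4: posterior probability of a shared causal variant (H4).
    """
    if pp3 + pp4 < min_signal:
        # Too little support for a signal in both traits (likely low power).
        return "excluded"
    # pp4 / pp3 >= min_ratio, written without division so that pp3 == 0 is safe.
    if pp4 >= min_ratio * pp3:
        return "strong evidence of colocalisation"
    return "no strong evidence of colocalisation"


# Hypothetical posterior probabilities, for illustration only.
examples = {
    "locus_a": classify_locus(pp3=0.10, pp4=0.85),
    "locus_b": classify_locus(pp3=0.40, pp4=0.50),
    "locus_c": classify_locus(pp3=0.10, pp4=0.30),
}
```
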

4.4.4 Mendelian randomisation analysis

To infer whether genes had causal effects on the immune-related diseases listed in the above section, I

performed a two-sample Mendelian randomisation (MR) analysis using the TwoSampleMR R package31.

In MR, genetic variants are used as instrumental variables (IVs) to make causal inferences. The

assumptions for a genetic variant to be a valid instrumental variable (IV) are: (1) the variant is associated

with the exposure or the risk factor (or gene expression in my case), (2) it is not associated with

confounders of exposure-outcome association, and (3) it is not associated with the outcome conditional

on the exposure (or it does not affect the outcome through any other pathways that do not involve the

exposure)25. I evaluated the causal effects of significant eGenes in each condition and selected

independent cis-eSNPs after LD pruning at LD r² < 0.1 as genetic IVs. The IVs were not necessarily

associated with the outcomes (or diseases) at genome-wide significance level. Summary statistics from

both eQTL and GWAS studies were required, including the beta coefficient and its standard error, the effect

allele (based on which the beta was estimated), the other allele, and the P-value. GWAS datasets where

these data were not available (e.g. no allele information) were excluded.
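
The selection of independent instruments by LD pruning could look roughly like the greedy filter below. This is a sketch under stated assumptions: the text does not say which pruning algorithm or tool was used, so the greedy order (most significant SNP first), the function names, and the toy r² values are all hypothetical.

```python
def greedy_ld_prune(snps_by_significance, r2, max_r2=0.1):
    """Keep SNPs in order of significance, dropping any in LD with a kept SNP.

    snps_by_significance: SNP ids sorted from most to least significant.
    r2: function (snp_a, snp_b) -> pairwise LD r-squared.
    """
    kept = []
    for snp in snps_by_significance:
        if all(r2(snp, other) < max_r2 for other in kept):
            kept.append(snp)
    return kept


# Toy pairwise LD values, for illustration only.
toy_r2 = {
    frozenset(("rsA", "rsB")): 0.85,
    frozenset(("rsA", "rsC")): 0.02,
    frozenset(("rsB", "rsC")): 0.05,
}


def lookup_r2(snp_a, snp_b):
    return toy_r2[frozenset((snp_a, snp_b))]


instruments = greedy_ld_prune(["rsA", "rsB", "rsC"], lookup_r2)
print(instruments)
```

Here rsB is dropped because it is in high LD with the more significant rsA, while rsC is retained as a second independent instrument.
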

I used the TwoSampleMR R package to harmonise summary statistics before performing MR analysis,

to make sure the effect estimates of a genetic variant on the exposure and on the outcome correspond

to the same allele, and ambiguous variants (e.g. two alleles from eQTL and GWAS datasets do not

correspond to the same genetic variant) were removed31. I applied the Wald ratio test if there was only

one IV, and inverse variance weighted method if there were multiple IVs. When three or more IVs were

available, four other methods were also applied, namely MR Egger, Weighted median, Simple mode,

and Weighted mode. For most cis-eGenes, only one independent eSNP was available and used as the

IV. I focused on eGenes with multiple IVs available in at least one of the four conditions, and for which at least

three out of the five methods were significant at a P-value threshold of 0.05. In Figure 4.4, the P-values and

effect size estimates were obtained using the inverse variance weighted method.
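
To make the two main estimators concrete, the sketch below computes a Wald ratio for a single IV and a fixed-effect inverse variance weighted (IVW) estimate for several IVs from harmonised summary statistics. It is an illustration, not the TwoSampleMR implementation: the function names and toy numbers are hypothetical, the Wald ratio standard error uses the common first-order approximation that ignores uncertainty in the SNP-exposure estimate, and the package may scale the IVW standard error differently.

```python
import math


def two_sided_p(z):
    """Two-sided P-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def wald_ratio(beta_exp, beta_out, se_out):
    """Causal estimate from one IV: SNP-outcome effect over SNP-exposure effect."""
    estimate = beta_out / beta_exp
    se = se_out / abs(beta_exp)  # first-order approximation
    return estimate, se, two_sided_p(estimate / se)


def ivw_fixed(beta_exp, beta_out, se_out):
    """Fixed-effect IVW estimate: weighted regression through the origin."""
    weights = [1.0 / s ** 2 for s in se_out]
    num = sum(w * bx * by for w, bx, by in zip(weights, beta_exp, beta_out))
    den = sum(w * bx * bx for w, bx in zip(weights, beta_exp))
    estimate = num / den
    se = math.sqrt(1.0 / den)
    return estimate, se, two_sided_p(estimate / se)


# Toy harmonised effects (eQTL beta, GWAS beta, GWAS SE), for illustration only.
single_iv = wald_ratio(0.5, 0.10, 0.02)
multi_iv = ivw_fixed([0.5, 0.4, 0.6], [0.10, 0.08, 0.12], [0.02, 0.02, 0.03])
```

With a single IV the two estimators coincide, which is why the Wald ratio is the natural fallback when only one independent eSNP is available.
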

I used the mr_heterogeneity function in the TwoSampleMR package to test for heterogeneity of the

causal effect size using Cochran’s Q test. This sensitivity analysis was performed for the inverse

variance weighted method and MR Egger when three or more IVs were available.
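
For the IVW case, Cochran’s Q can be sketched as the weighted sum of squared deviations of the per-IV ratio estimates from the pooled estimate. The code below is a minimal fixed-effect version with a hypothetical function name and toy inputs; whether mr_heterogeneity computes the statistic in exactly this form is not confirmed here, and the P-value (upper tail of a chi-square distribution with the returned degrees of freedom) is left to a statistics library.

```python
def cochran_q_ivw(beta_exp, beta_out, se_out):
    """Cochran's Q and degrees of freedom around the fixed-effect IVW estimate."""
    # Inverse-variance weight of each ratio estimate (first-order approximation).
    weights = [(bx / s) ** 2 for bx, s in zip(beta_exp, se_out)]
    ratios = [by / bx for bx, by in zip(beta_exp, beta_out)]
    ivw = sum(w * r for w, r in zip(weights, ratios)) / sum(weights)
    q = sum(w * (r - ivw) ** 2 for w, r in zip(weights, ratios))
    return q, len(ratios) - 1


# Toy inputs, for illustration only: three IVs with ratio estimates 0, 1 and 2.
q_het, df_het = cochran_q_ivw([1.0, 1.0, 1.0], [0.0, 1.0, 2.0], [1.0, 1.0, 1.0])

# Three IVs that all imply the same causal effect give Q close to zero.
q_hom, df_hom = cochran_q_ivw([0.5, 0.4, 0.6], [0.10, 0.08, 0.12], [0.02, 0.02, 0.03])
```
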

4.5 Supplemental Figures

Figure S4.1: Colocalisation between the response eQTL (reQTL) of CCL20 and childhood-onset asthma association. (A) Regional plots show eQTL association with gene expression of CCL20 in PHA-stimulated T cells (purple background), and GWAS association with childhood-onset asthma267. The minus log10 P-value is plotted on y-axes for all SNPs located within 200 kb from the top eSNP of CCL20 (rs13034664). Colours of dots indicate the LD correlation with the top eSNP in purple. Positions of genes located on this locus are shown on the bottom. (B) Box plots show the rank-normalised gene expression of CCL20 (y-axes) in resting T cells (left) and in PHA-stimulated T cells (right) stratified by genotypes of the reQTL rs13034664 (x-axes). In resting T cells, no SNP was significantly associated with CCL20.

[Figure S4.1 image. Panel A: regional plots of −log10 P-value against genomic position (chr2, 228.5 to 228.8 Mb) for cis-eQTLs in PHA-stimulated T cells (PHA T) and for childhood asthma, with points coloured by LD with the top eSNP rs13034664 and a gene track showing CCL20. Panel B: box plots of rank-normalised expression of CCL20 by rs13034664 (chr2:228672579) genotype (AA, AG, GG) in resting (Ctrl) and PHA-stimulated T cells.]

Figure S4.2: Colocalisations with different diseases are observed for the CTSH cis-eQTLs identified in resting (the left column) and stimulated (the right column) cells. On the left, regional plots with purple background show eQTL associations with CTSH identified in resting monocytes (Resting M) and resting T cells (Resting T). The two eQTL signals have the same top eSNP rs34593439, and both colocalised with GWAS hits associated with celiac disease (ic: ImmunoChip)275, narcolepsy (ImmunoChip)285, and type 1 diabetes (ImmunoChip)278. On the right, regional plots with purple background show eQTL associations with CTSH identified in LPS-stimulated monocytes (LPS M; top eSNP rs11638844) and PHA-stimulated T cells (PHA T). They both colocalised with GWAS signal for primary biliary cirrhosis (PBC)286. The minus log10 P-value is plotted on y-axes for all SNPs located within 200 kb from the corresponding top eSNP (left: rs34593439; right: rs11638844). Colours of dots indicate the LD correlation with the top eSNP in purple. Positions of genes located on this locus are shown on the bottom.

[Figure S4.2 image. Left column: regional plots of −log10 P-value against genomic position (79.1 to 79.4 Mb) for cis-eQTLs in resting monocytes (Ctrl M) and resting T cells (Ctrl T), and for celiac disease (ic), narcolepsy (ic), and type 1 diabetes (ic); top eSNP rs34593439. Right column: the same for cis-eQTLs in LPS-stimulated monocytes (LPS M) and PHA-stimulated T cells (PHA T), and for PBC; top eSNP rs11638844. Points are coloured by LD with the top eSNP; gene tracks show MORF4L1, CTSH, and RASGRF1.]

Figure S4.3: Colocalisation between the response eQTL (reQTL) of UBASH3A and multiple diseases. (A) Regional plots show eQTL association with gene expression of UBASH3A in resting T cells (purple background), and GWAS associations with celiac disease (ic: ImmunoChip)275, rheumatoid arthritis303, and primary sclerosing cholangitis (PSC)287. The minus log10 P-value is plotted on y-axes for all SNPs located within 200 kb from the top eSNP of UBASH3A (rs1893592). Colours of dots indicate the LD correlation with the top eSNP in purple. Positions of genes located on this locus are shown on the bottom. (B) Box plots show the rank-normalised gene expression of UBASH3A (y-axes) in resting T cells (left) and in PHA-stimulated T cells (right) stratified by genotypes of the reQTL rs1893592 (x-axes). In PHA-stimulated T cells, rs1893592 was not significantly associated with UBASH3A.

[Figure S4.3 image. Panel A: regional plots of −log10 P-value against genomic position (chr21, 43.7 to 44.0 Mb) for cis-eQTLs in resting T cells and for celiac disease (ic), rheumatoid arthritis, and PSC, with points coloured by LD with the top eSNP rs1893592 and a gene track showing TMPRSS3, UBASH3A, and SLC37A1. Panel B: box plots of rank-normalised expression of UBASH3A by rs1893592 (chr21:43855067) genotype (AA, AC, CC) in resting (Ctrl) and PHA-stimulated T cells.]

Figure S4.4: Colocalisation between the response eQTL (reQTL) of IL6ST and rheumatoid arthritis. (A) Regional plots show eQTL association with gene expression of IL6ST in resting T cells (purple background), and GWAS association with rheumatoid arthritis303. The minus log10 P-value is plotted on y-axes for all SNPs located within 200 kb from the top eSNP of IL6ST (rs7731626). Colours of dots indicate the LD correlation with the top eSNP in purple. Positions of genes located on this locus are shown on the bottom. (B) Box plots show the rank-normalised gene expression of IL6ST (y-axes) in resting T cells (left) and in PHA-stimulated T cells (right) stratified by genotypes of the reQTL rs7731626 (x-axes). In PHA-stimulated T cells, no SNP was significantly associated with IL6ST.

[Figure S4.4 image. Panel A: regional plots of −log10 P-value against genomic position (chr5, 55.3 to 55.6 Mb) for cis-eQTLs in resting T cells and for rheumatoid arthritis, with points coloured by LD with the top eSNP rs7731626 and a gene track showing IL6ST and ANKRD55. Panel B: box plots of rank-normalised expression of IL6ST by rs7731626 (chr5:55444683) genotype (GG, GA, AA) in resting (Ctrl) and PHA-stimulated T cells.]

Figure S4.5: Colocalisation between the response eQTL (reQTL) of FAM167A and systemic lupus erythematosus. (A) Regional plots show eQTL association with gene expression of FAM167A in PHA-stimulated T cells (purple background), and GWAS association with systemic lupus erythematosus288. The minus log10 P-value is plotted on y-axes for all SNPs located within 200 kb from the top eSNP of FAM167A (rs2618444). Colours of dots indicate the LD correlation with the top eSNP in purple. Positions of genes located on this locus are shown on the bottom. (B) Box plots show the rank-normalised gene expression of FAM167A (y-axes) in resting T cells (left) and in PHA-stimulated T cells (right) stratified by genotypes of the reQTL rs2618444 (x-axes). In resting T cells, no SNP was significantly associated with FAM167A.

[Figure S4.5 image. Panel A: regional plots of −log10 P-value against genomic position (chr8, 11.2 to 11.5 Mb) for cis-eQTLs in PHA-stimulated T cells and for systemic lupus erythematosus, with points coloured by LD with the top eSNP rs2618444 and a gene track showing MTMR9, SLC35G5, C8orf12, FAM167A, and BLK. Panel B: box plots of rank-normalised expression of FAM167A by rs2618444 (chr8:11338370) genotype (AA, AC, CC) in resting (Ctrl) and PHA-stimulated T cells.]

Chapter 5

Conclusions

Gene expression is a fundamental molecular phenotype that links our genetic information to various

phenotypes. Studies have observed variation in gene expression across individuals, and there is a

genetic basis underlying this heritable trait67. The genetic components of gene expression can be

characterised by association studies that identify eQTLs. Genome-wide eQTL studies began more

than a decade ago69-71, and eQTLs have been investigated in various tissues and cell types: from

immortalised cell lines (LCLs)71-77 to primary human tissues52,78,79, from mixed cell populations (e.g.

peripheral blood81,82) to purified cell types (e.g. specific immune cells55,92,93,95). These eQTLs have

important ramifications for risk genes and pathways involved in disease pathogenesis86,103,192.

The number of eQTL studies continues to increase, yet an evidence base for eQTL study

design was lacking, and the rate of false positives in existing eQTL datasets remained unknown. It is critical that

extensive simulations are used to investigate the effects of key study design parameters on eQTL

detection, and to evaluate the performance of various analysis strategies. Also, most eQTL studies to

date have used tissues or cell types from adults. The early-life genetic regulation of gene expression is

not well-studied. Moreover, characterising response eQTLs (reQTLs) is key to more extensive

description of the functional roles of genetic variation under different conditions and better

understanding of the interactions between genetic and environmental factors. In addition, growing

evidence has shown that many chronic diseases originate from early childhood, and early-life gene-by-

environment interactions impact disease development171-174,260. However, no previous studies had

integrated neonatal eQTLs in immune cells with GWAS findings to explore the early-life origins of

immune-related diseases later in life.

To fill in these research gaps, this thesis has focused on answering the following three questions:

(1) What are the best study design and analysis choices in eQTL studies under various scenarios?

(2) How is gene expression regulated in neonatal immune responses?

(3) How do early-life genetic variants and gene expression mediate risks for immune-related diseases

that develop later in life?

This thesis sought to answer the first question by empirically-driven eQTL simulations. In Chapter 2,

I simulated various scenarios with different study parameters: sample size, eQTL allele frequency, and

genetic effect size. I also simulated more complex scenarios with correlated gene expression, non-

Gaussian distribution of gene expression, dominant and recessive genetic effects, and multiple causal

cis-eQTLs for each gene. I observed that applying FDR-controlling procedures to all hypotheses

(pooled FDR methods) failed to control the rate of false positive cis-eQTLs regardless of sample size

and minor allele frequency (MAF) of true eQTLs. Pooled FDR is still being used to correct for multiple

testing in recent cis-eQTL studies85,91,222,328,329. Based on the observation from the simulations, the FDR

of eQTLs in these studies is likely to be higher than the desired 5% level. A two-step hierarchical correction

procedure, in which multiple genes are controlled (global correction) based on each gene’s best statistics

adjusted for multiple SNPs in cis (local correction), showed well-calibrated FDR in most scenarios. The

exceptions were scenarios with low sample sizes and low eQTL MAFs, where all methods had inflated

FDR. I thus recommend that a MAF cut-off of at least 10% should be used if the sample size is around

100. This MAF cut-off is higher than what is commonly used in this field, 5% or even lower (e.g. a

MAF cut-off of 1% was used in a recent study with 106 samples330). We need to exercise caution in

interpreting low-frequency eQTLs identified in previous studies that used improper MAF cut-offs. I

also observed overestimation of effect sizes (or “Winner’s Curse”) in low to moderate power settings.

More accurate effect size estimation can be obtained using a bootstrap method. This method, together with the

hierarchical correction procedures, was implemented in the open-access software BootstrapQTL, which

was developed in this thesis253.
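
The two-step hierarchical correction described above can be illustrated with a short sketch. It assumes Bonferroni correction as the local step and the Benjamini-Hochberg procedure as the global step, which is only one possible pairing; it does not reproduce BootstrapQTL, and the function names and toy P-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def hierarchical_egenes(cis_pvalues, fdr=0.05):
    """Local Bonferroni per gene, then global BH across genes.

    cis_pvalues: dict mapping gene -> list of nominal P-values of its cis-SNPs.
    Returns a dict of significant genes -> globally adjusted P-value.
    """
    genes = list(cis_pvalues)
    # Local step: adjust each gene's best P-value for the number of cis-SNPs tested.
    local = [min(1.0, min(cis_pvalues[g]) * len(cis_pvalues[g])) for g in genes]
    # Global step: control the FDR across genes using the locally adjusted values.
    global_adjusted = benjamini_hochberg(local)
    return {g: q for g, q in zip(genes, global_adjusted) if q <= fdr}


# Toy nominal P-values, for illustration only.
toy = {
    "geneA": [1e-6, 0.2, 0.5, 0.9],
    "geneB": [0.03, 0.4],
    "geneC": [0.6],
}
egenes = hierarchical_egenes(toy)
```

In this toy example only geneA is called an eGene: geneB's best P-value of 0.03 would look significant in isolation but not after adjusting for its two cis-SNPs and for the three genes tested.
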

Next, I investigated eQTLs and reQTLs in neonatal immune cells under resting and stimulated

conditions (Chapter 3). Two immune cell populations (monocytes and T cells) were obtained from a

prospective birth cohort, Childhood Asthma Study (CAS)229-234, which was established in Perth,

Australia. These cells were exposed to relevant stimuli: either a bacterial component or a pan T cell

stimulant. I selected analysis strategies based on the insights from the simulation study in Chapter 2,

considering the limited sample size of around 120. I observed that the majority of the cis-eQTLs were

specific to a certain cell type or stimulatory condition. I identified reQTLs (that had different effects

across conditions) for a great proportion of the eGenes (genes with eQTLs). More condition-specific

eQTLs were detectable only upon stimulation. Genetic effects with flipped directions on gene

expression across conditions were also observed. Using mediation analysis258, I identified evidence of

trans-eQTL effects acting through the expression of nearby cis-eGenes, providing potential

mechanisms of trans-eQTLs. This chapter shows that early-life genetic effects on gene expression are

often modified by environmental factors, and the cell type- and condition-specific nature of eQTLs

demonstrates the importance of understanding the regulatory roles of genetic variation in relevant

cellular context.

Lastly, I applied integrative approaches to explore the early-life origins of immune-related diseases. In

Chapter 4, I used the early-life genetic regulatory variants identified in neonatal immune cells in

Chapter 3, as well as GWAS findings from external large cohorts. I observed significant overlaps

between neonatal eQTLs and variants associated with postnatal immune-related diseases such as asthma

and type 1 diabetes. This suggests that the effects of many genetic variants underlying immune-related

diseases on gene expression manifest at this early stage in life. Further colocalisation analysis identified

genetic loci where early-life eQTL and GWAS signals were driven by shared causal variants. Many of

the colocalised eQTLs were response eQTLs (e.g. the reQTLs of IL13 and CTSH), suggesting that some

susceptibility genes might mediate disease risks in a condition-specific manner. However, I observed a

relatively low number of colocalisations with strong evidence in this study as compared to the number

of eQTLs and disease associations. This could be due to the limited power caused by the low sample

size in the CAS eQTL study. It does not necessarily indicate that gene expression traits do not

substantially contribute to immune-related disease susceptibility. More cell types and conditions in

larger sample sizes are needed to better understand the role of gene expression in diseases. Finally, I

used genetic instrumental variables to make causal inferences, and identified that changes in expression

of BTN3A2 had potential causal associations with multiple immune-related diseases. This chapter

provides evidence that early-life genetic variants and gene expression mediate risk for later

immune-related diseases.

There are several limitations in this thesis. First, the simulation study in Chapter 2 did not incorporate

the comparison of analysis strategies in real datasets. Secondly, the cord blood used in Chapter 3 may

be under the influence of in utero exposures, and this environmental factor was not measured or

accounted for in the eQTL analysis. Further analysis is needed to investigate the role of maternal

genetics and in utero exposures, which may affect gene expression in cord blood cell populations

through mechanisms such as DNA methylation. Thirdly, maternal cell contamination is frequently

observed in cord blood samples; however, the vast majority of cells in cord blood are from neonates,

and maternal cells make up only a small fraction, estimated to be between 10^-4 and 10^-5 of nucleated foetal

blood cells331,332. In addition, a proper detection of eQTLs specific to neonates is lacking. Lastly, the

investigation of the causal role of neonatal expression in Chapter 4 using MR was limited by the small

number of genetic instruments, and the heterogeneity in causal effect size, especially for HLA genes,

indicating that there might be horizontal pleiotropy, or the modelling assumptions might be wrong.
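
The heterogeneity described above can be illustrated with a minimal sketch, which is not the analysis code used in this thesis. It assumes summary-level data with uncorrelated instruments, builds a fixed-effect inverse-variance weighted (IVW) estimate from per-instrument Wald ratios with first-order standard errors, and uses Cochran's Q as the heterogeneity statistic. The function name and the toy input values are hypothetical.

```python
import math


def ivw_with_heterogeneity(beta_exp, beta_out, se_out):
    """Fixed-effect IVW estimate and Cochran's Q from summary statistics.

    beta_exp: SNP effects on expression (the exposure)
    beta_out: SNP effects on disease (the outcome)
    se_out:   standard errors of the SNP effects on the outcome
    """
    if any(b == 0 for b in beta_exp):
        raise ValueError("instrument with zero effect on the exposure")

    # Wald ratio per instrument and its first-order standard error
    ratios = [bo / be for be, bo in zip(beta_exp, beta_out)]
    ratio_se = [so / abs(be) for be, so in zip(beta_exp, se_out)]

    # Inverse-variance weights
    weights = [1.0 / s ** 2 for s in ratio_se]
    total_w = sum(weights)

    ivw = sum(w * r for w, r in zip(weights, ratios)) / total_w
    ivw_se = math.sqrt(1.0 / total_w)

    # Cochran's Q: weighted squared deviation of each ratio from the pooled estimate
    q = sum(w * (r - ivw) ** 2 for w, r in zip(weights, ratios))
    df = len(ratios) - 1
    return ivw, ivw_se, q, df


if __name__ == "__main__":
    # Toy values only: the third instrument points in the opposite direction
    print(ivw_with_heterogeneity([1.0, 2.0, 0.5], [0.5, 1.0, -0.25], [0.1, 0.1, 0.1]))
```

If all instruments share a single causal effect, Q approximately follows a chi-squared distribution with df degrees of freedom, so a Q far above df is consistent with horizontal pleiotropy or a misspecified model. With only a handful of instruments, as was the case here, such a test has limited power.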

Future work is required to address some of the limitations of the thesis and extend the ideas that have

been explored here. The simulation study only evaluated the analysis choices of eQTL mapping in a

single condition (or tissue type), and simulations of multiple conditions such as reQTL or longitudinal

eQTL datasets will further shed light on the optimal strategies in these settings, in which additional

power can be gained by combining datasets. ReQTLs derived from two conditions can be easily

modelled by a gene-by-condition interaction term, while mapping reQTLs from multiple conditions

(different stimulated conditions and/or multiple time points) is not straightforward, and further

extensive simulations will give new insights. Neonatal cis-eQTLs were mapped and their causal effects

on adult diseases were investigated in Chapters 3 and 4, but a proper comparison between neonatal and

adult eQTLs is lacking. There are methods available to perform such analysis, e.g. a multivariate

adaptive shrinkage (mash) model333, which requires full summary eQTL statistics. Full cis-eQTL

datasets in resting immune cells from adults are available, but those generated from activated immune

cells under similar conditions are not yet available. Neonate-specific (r)eQTLs and causal expression

changes will be informative in understanding the genetic regulation of gene expression specific to

neonatal immune responses, as well as the early-life specific origins of disease risk. For the CAS cohort,

blood samples were also collected at other time points including year 2 and year 10, and the

investigation of eQTLs at multiple time points will lead to a better understanding of the developmental

change in gene expression and its underlying genetic factors, and it will perhaps also shed light on the

progress of atopic diseases.
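
For the two-condition case mentioned above, the interaction term can be sketched as follows. This is an illustrative example only, with hypothetical function names and toy data, and it is not the pipeline used in this thesis. It relies on the fact that, for a binary condition indicator, the ordinary least squares model expression ~ genotype + condition + genotype x condition fits a separate line in each condition, so the interaction coefficient equals the difference between the two per-condition genotype slopes. Covariates and the pairing of samples from the same individual are ignored here; a real analysis would need to handle both.

```python
def slope(x, y):
    """Least-squares slope of y on x (simple linear regression)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def reqtl_interaction(genotype, expr_resting, expr_stimulated):
    """Genotype effect in each condition and the genotype-by-condition term.

    genotype:        allele dosages (0, 1 or 2), same order in both conditions
    expr_resting:    expression of one gene in the resting condition
    expr_stimulated: expression of the same gene after stimulation
    """
    b_rest = slope(genotype, expr_resting)
    b_stim = slope(genotype, expr_stimulated)
    # Equals the interaction coefficient of the saturated two-condition model
    return b_rest, b_stim, b_stim - b_rest


if __name__ == "__main__":
    dosage = [0, 1, 2, 0, 1, 2]
    resting = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]      # no genetic effect at rest
    stimulated = [1.0, 1.5, 2.0, 1.0, 1.5, 2.0]   # effect appears on stimulation
    print(reqtl_interaction(dosage, resting, stimulated))
```

A non-zero interaction term with a near-zero resting slope corresponds to an eQTL that is detectable only upon stimulation, while slopes of opposite sign correspond to the flipped-direction effects reported in Chapter 3. With more than two conditions or time points there is no single difference of slopes to test, which is one reason the multi-condition case is harder.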

This thesis has provided a strong evidence base for eQTL study design, which will guide future eQTL

as well as other QTL studies investigating the effects of local regulatory variants on quantitative traits,

such as methylation status110,145,146 and splicing events51,83,147-149. This thesis presents one of the first

studies to identify genome-wide eQTLs and reQTLs in neonatal primary immune cells, and this dataset will be

a useful resource for future studies to explore early-life genetic regulation in immune responses. It has

also contributed to understanding the genetic basis of variation across individuals in responses to

external environmental factors. Finally, the integrative analysis in this thesis has aided in our

understanding of the role of early-life gene expression and genetic variants in later disease development,

and provided potential disease susceptibility genes and drug targets for future investigation.

References:

1. World Health Organization. Genes and human disease. <https://www.who.int/genomics/public/geneticdiseases/en/index2.html>.

2. Crick, F. Central dogma of molecular biology. Nature 227, 561-3 (1970).

3. Encode Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57-74 (2012).

4. Lander, E.S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860-921 (2001).

5. The 1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature 526, 68-74 (2015).

6. McCarthy, S. et al. A reference panel of 64,976 haplotypes for genotype imputation. Nat Genet 48, 1279-83 (2016).

7. Lander, E.S. Initial impact of the sequencing of the human genome. Nature 470, 187-97 (2011).

8. Wellcome Trust Case Control Consortium et al. Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls. Nature 464, 713-20 (2010).

9. Heller, M.J. DNA microarray technology: devices, systems, and applications. Annu Rev Biomed Eng 4, 129-53 (2002).

10. International HapMap, C. et al. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52-8 (2010).

11. Marchini, J. & Howie, B. Genotype imputation for genome-wide association studies. Nat Rev Genet 11, 499-511 (2010).

12. Howie, B.N., Donnelly, P. & Marchini, J. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet 5, e1000529 (2009).

13. Das, S. et al. Next-generation genotype imputation service and methods. Nat Genet 48, 1284-1287 (2016).

14. Riordan, J.R. et al. Identification of the cystic fibrosis gene: cloning and characterization of complementary DNA. Science 245, 1066-73 (1989).

15. O'Sullivan, B.P. & Freedman, S.D. Cystic fibrosis. Lancet 373, 1891-904 (2009).

16. Hall, M.A., Moore, J.H. & Ritchie, M.D. Embracing Complex Associations in Common Traits: Critical Considerations for Precision Medicine. Trends Genet 32, 470-484 (2016).

17. Klein, R.J. et al. Complement factor H polymorphism in age-related macular degeneration. Science 308, 385-9 (2005).

18. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661-78 (2007).

19. Buniello, A. et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res 47, D1005-D1012 (2019).

20. Zheng, J. et al. LD Hub: a centralized database and web interface to perform LD score regression that maximizes the potential of summary level GWAS data for SNP heritability and genetic correlation analysis. Bioinformatics 33, 272-279 (2017).

21. Cortes, A. & Brown, M.A. Promise and pitfalls of the Immunochip. Arthritis Res Ther 13, 101 (2011).

22. Abraham, G. & Inouye, M. Genomic risk prediction of complex human disease and its clinical application. Curr Opin Genet Dev 33, 10-6 (2015).

23. Davey Smith, G. & Hemani, G. Mendelian randomization: genetic anchors for causal inference in epidemiological studies. Hum Mol Genet 23, R89-98 (2014).

24. Manousaki, D., Mokry, L.E., Ross, S., Goltzman, D. & Richards, J.B. Mendelian Randomization Studies Do Not Support a Role for Vitamin D in Coronary Artery Disease. Circ Cardiovasc Genet 9, 349-56 (2016).

25. Burgess, S., Foley, C.N. & Zuber, V. Inferring Causal Relationships Between Risk Factors and Outcomes from Genome-Wide Association Study Data. Annu Rev Genomics Hum Genet 19, 303-327 (2018).

26. Pingault, J.B. et al. Using genetic data to strengthen causal inference in observational research. Nat Rev Genet 19, 566-580 (2018).

27. Burgess, S., Butterworth, A. & Thompson, S.G. Mendelian randomization analysis with multiple genetic variants using summarized data. Genet Epidemiol 37, 658-65 (2013).

28. Bowden, J., Davey Smith, G. & Burgess, S. Mendelian randomization with invalid instruments: effect estimation and bias detection through Egger regression. Int J Epidemiol 44, 512-25 (2015).

29. Bowden, J., Davey Smith, G., Haycock, P.C. & Burgess, S. Consistent Estimation in Mendelian Randomization with Some Invalid Instruments Using a Weighted Median Estimator. Genet Epidemiol 40, 304-14 (2016).

30. Pierce, B.L. & Burgess, S. Efficient design for Mendelian randomization studies: subsample and 2-sample instrumental variable estimators. Am J Epidemiol 178, 1177-84 (2013).

31. Hemani, G. et al. The MR-Base platform supports systematic causal inference across the human phenome. Elife 7 (2018).

32. McCarthy, M.I. et al. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet 9, 356-69 (2008).

33. Yengo, L. et al. Meta-analysis of genome-wide association studies for height and body mass index in approximately 700000 individuals of European ancestry. Hum Mol Genet 27, 3641-3649 (2018).

34. Risch, N. & Merikangas, K. The future of genetic studies of complex human diseases. Science 273, 1516-7 (1996).

35. Reich, D.E. & Lander, E.S. On the allelic spectrum of human disease. Trends Genet 17, 502-10 (2001).

36. Visscher, P.M. et al. 10 Years of GWAS Discovery: Biology, Function, and Translation. Am J Hum Genet 101, 5-22 (2017).

37. Marigorta, U.M. & Navarro, A. High trans-ethnic replicability of GWAS results implies common causal variants. PLoS Genet 9, e1003566 (2013).

38. Marigorta, U.M., Rodríguez, J.A., Gibson, G. & Navarro, A. Replicability and Prediction: Lessons and Challenges from GWAS. Trends in Genetics 34, 504-517 (2018).

39. Aoki, K. & Taketo, M.M. Adenomatous polyposis coli (APC): a multi-functional tumor suppressor gene. J Cell Sci 120, 3327-35 (2007).

40. Segditsas, S. & Tomlinson, I. Colorectal cancer and genetic alterations in the Wnt pathway. Oncogene 25, 7531-7 (2006).

41. MacArthur, D.G. et al. A systematic survey of loss-of-function variants in human protein-coding genes. Science 335, 823-8 (2012).

42. Plotkin, J.B. & Kudla, G. Synonymous but not the same: the causes and consequences of codon bias. Nat Rev Genet 12, 32-42 (2011).

43. Chamary, J.V., Parmley, J.L. & Hurst, L.D. Hearing silence: non-neutral evolution at synonymous sites in mammals. Nat Rev Genet 7, 98-108 (2006).

44. Maurano, M.T. et al. Systematic localization of common disease-associated variation in regulatory DNA. Science 337, 1190-5 (2012).

45. Gallagher, M.D. & Chen-Plotkin, A.S. The Post-GWAS Era: From Association to Function. Am J Hum Genet 102, 717-730 (2018).

46. Harrow, J. et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res 22, 1760-74 (2012).

47. Derrien, T. et al. The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression. Genome Res 22, 1775-89 (2012).

48. Maston, G.A., Evans, S.K. & Green, M.R. Transcriptional regulatory elements in the human genome. Annu Rev Genomics Hum Genet 7, 29-59 (2006).

49. Pennacchio, L.A., Bickmore, W., Dean, A., Nobrega, M.A. & Bejerano, G. Enhancers: five essential questions. Nat Rev Genet 14, 288-95 (2013).

50. Aguet, F. & Ardlie, K.G. Tissue Specificity of Gene Expression. Current Genetic Medicine Reports 4, 163-169 (2016).

51. GTEx Consortium. Human genomics. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science 348, 648-60 (2015).

52. GTEx Consortium et al. Genetic effects on gene expression across human tissues. Nature 550, 204-213 (2017).

53. Mele, M. et al. Human genomics. The human transcriptome across tissues and individuals. Science 348, 660-5 (2015).

54. De Jager, P.L. et al. ImmVar project: Insights and design considerations for future studies of "healthy" immune variation. Semin Immunol 27, 51-7 (2015).

55. Schmiedel, B.J. et al. Impact of Genetic Polymorphisms on Human Immune Cell Gene Expression. Cell 175, 1701-1715 e16 (2018).

56. Kim, T.K. & Shiekhattar, R. Architectural and Functional Commonalities between Enhancers and Promoters. Cell 162, 948-59 (2015).

57. Mitchell, P.J. & Tjian, R. Transcriptional regulation in mammalian cells by sequence-specific DNA binding proteins. Science 245, 371-8 (1989).

58. Lockhart, D.J. & Winzeler, E.A. Genomics, gene expression and DNA arrays. Nature 405, 827-36 (2000).

59. Wang, Z., Gerstein, M. & Snyder, M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10, 57-63 (2009).

60. Oshlack, A., Robinson, M.D. & Young, M.D. From RNA-seq reads to differential expression results. Genome Biol 11, 220 (2010).

61. Allison, D.B., Cui, X., Page, G.P. & Sabripour, M. Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet 7, 55-65 (2006).

62. Leung, Y.F. & Cavalieri, D. Fundamentals of cDNA microarray data analysis. Trends Genet 19, 649-59 (2003).

63. Quackenbush, J. Microarray data normalization and transformation. Nat Genet 32 Suppl, 496-501 (2002).

64. Bolstad, B.M., Irizarry, R.A., Astrand, M. & Speed, T.P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19, 185-93 (2003).

65. Mecham, B.H., Nelson, P.S. & Storey, J.D. Supervised normalization of microarrays. Bioinformatics 26, 1308-15 (2010).

66. Albert, F.W. & Kruglyak, L. The role of regulatory variation in complex traits and disease. Nat Rev Genet 16, 197-212 (2015).

67. Stranger, B.E. & Raj, T. Genetics of human gene expression. Curr Opin Genet Dev 23, 627-34 (2013).

68. Nica, A.C. & Dermitzakis, E.T. Expression quantitative trait loci: present and future. Philos Trans R Soc Lond B Biol Sci 368, 20120362 (2013).

69. Schadt, E.E. et al. Genetics of gene expression surveyed in maize, mouse and man. Nature 422, 297-302 (2003).

70. Morley, M. et al. Genetic analysis of genome-wide variation in human gene expression. Nature 430, 743-7 (2004).

71. Stranger, B.E. et al. Genome-wide associations of gene expression variation in humans. PLoS Genet 1, e78 (2005).

72. Stranger, B.E. et al. Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 315, 848-53 (2007).

73. Stranger, B.E. et al. Patterns of cis regulatory variation in diverse human populations. PLoS Genet 8, e1002639 (2012).

74. Stranger, B.E. et al. Population genomics of human gene expression. Nat Genet 39, 1217-24 (2007).

75. Montgomery, S.B. et al. Transcriptome genetics using second generation sequencing in a Caucasian population. Nature 464, 773-7 (2010).

76. Lappalainen, T. et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506-11 (2013).

77. Pickrell, J.K. et al. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature 464, 768-72 (2010).

78. Goring, H.H. et al. Discovery of expression QTLs using large-scale transcriptional profiling in human lymphocytes. Nat Genet 39, 1208-16 (2007).

79. Schadt, E.E. et al. Mapping the genetic architecture of gene expression in human liver. PLoS Biol 6, e107 (2008).

80. Fehrmann, R.S. et al. Trans-eQTLs reveal that independent genetic variants associated with a complex phenotype converge on intermediate genes, with a major role for the HLA. PLoS Genet 7, e1002197 (2011).

81. Westra, H.J. et al. Systematic identification of trans eQTLs as putative drivers of known disease associations. Nat Genet 45, 1238-1243 (2013).

82. Wright, F.A. et al. Heritability and genomics of gene expression in peripheral blood. Nat Genet 46, 430-7 (2014).

83. Battle, A. et al. Characterizing the genetic basis of transcriptome diversity through RNA-sequencing of 922 individuals. Genome Res 24, 14-24 (2014).

84. Võsa, U. et al. Unraveling the polygenic architecture of complex traits using blood eQTL meta-analysis. bioRxiv, 447367 (2018).

85. Kirsten, H. et al. Dissecting the genetics of the human transcriptome identifies novel trait-related trans-eQTLs and corroborates the regulatory relevance of non-protein coding loci. Hum Mol Genet 24, 4746-63 (2015).

86. Emilsson, V. et al. Genetics of gene expression and its effect on disease. Nature 452, 423-8 (2008).

87. Grundberg, E. et al. Mapping cis- and trans-regulatory effects across multiple tissues in twins. Nat Genet 44, 1084-9 (2012).

88. Dimas, A.S. et al. Common regulatory variation impacts gene expression in a cell type-dependent manner. Science 325, 1246-50 (2009).

89. Garnier, S. et al. Genome-wide haplotype analysis of cis expression quantitative trait loci in monocytes. PLoS Genet 9, e1003240 (2013).

90. Andiappan, A.K. et al. Genome-wide analysis of the genetic regulation of gene expression in human neutrophils. Nat Commun 6, 7971 (2015).

91. Naranbhai, V. et al. Genomic modulators of gene expression in human neutrophils. Nat Commun 6, 7545 (2015).

92. Chen, L. et al. Genetic Drivers of Epigenetic and Transcriptional Variation in Human Immune Cells. Cell 167, 1398-1414 e24 (2016).

93. Fairfax, B.P. et al. Genetics of gene expression in primary immune cells identifies cell type-specific master regulators and roles of HLA alleles. Nat Genet 44, 502-10 (2012).

94. Ishigaki, K. et al. Polygenic burdens on cell-specific pathways underlie the risk of rheumatoid arthritis. Nat Genet 49, 1120-1125 (2017).

95. Raj, T. et al. Polarization of the effects of autoimmune and neurodegenerative risk alleles in leukocytes. Science 344, 519-23 (2014).

96. Kasela, S. et al. Pathogenic implications for autoimmune mechanisms derived by comparative eQTL analysis of CD4+ versus CD8+ T cells. PLoS Genet 13, e1006643 (2017).

97. Bryois, J. et al. Cis and trans effects of human genomic variants on gene expression. PLoS Genet 10, e1004461 (2014).

98. Gibson, G., Powell, J.E. & Marigorta, U.M. Expression quantitative trait locus analysis for translational medicine. Genome Med 7, 60 (2015).

99. Liu, B. et al. Genetic Regulatory Mechanisms of Smooth Muscle Cells Map to Coronary Artery Disease Risk Loci. Am J Hum Genet 103, 377-388 (2018).

100. Lloyd-Jones, L.R. et al. The Genetic Architecture of Gene Expression in Peripheral Blood. Am J Hum Genet 100, 228-237 (2017).

101. Breitling, R. et al. Genetical genomics: spotlight on QTL hotspots. PLoS Genet 4, e1000232 (2008).

102. Lee, M.N. et al. Common genetic variants modulate pathogen-sensing responses in human dendritic cells. Science 343, 1246980 (2014).

103. Yao, C. et al. Dynamic Role of trans Regulation of Gene Expression in Relation to Complex Traits. Am J Hum Genet 100, 571-580 (2017).

104. Yang, F., Wang, J., Consortium, G.T., Pierce, B.L. & Chen, L.S. Identifying cis-mediators for trans-eQTLs across many human tissues using genomic mediation analysis. Genome Res 27, 1859-1871 (2017).

105. Nicolae, D.L. et al. Trait-associated SNPs are more likely to be eQTLs: annotation to enhance discovery from GWAS. PLoS Genet 6, e1000888 (2010).

106. Nica, A.C. et al. Candidate causal regulatory effects by integration of expression QTLs with complex trait genetic associations. PLoS Genet 6, e1000895 (2010).

107. Qiu, C. et al. Renal compartment-specific genetic variation analyses identify new pathways in chronic kidney disease. Nat Med 24, 1721-1731 (2018).

108. Ratnapriya, R. et al. Retinal transcriptome and eQTL analyses identify genes associated with age-related macular degeneration. Nat Genet (2019).

109. Ramasamy, A. et al. Genetic variability in the regulation of gene expression in ten regions of the human brain. Nat Neurosci 17, 1418-1428 (2014).

110. Ng, B. et al. An xQTL map integrates the genetic architecture of the human brain's transcriptome and epigenome. Nat Neurosci 20, 1418-1426 (2017).

111. Giambartolomei, C. et al. Bayesian test for colocalisation between pairs of genetic association studies using summary statistics. PLoS Genet 10, e1004383 (2014).

112. Pickrell, J.K. et al. Detection and interpretation of shared genetic influences on 42 human traits. Nat Genet 48, 709-17 (2016).

113. Hormozdiari, F. et al. Colocalization of GWAS and eQTL Signals Detects Target Genes. Am J Hum Genet 99, 1245-1260 (2016).

114. Guo, H. et al. Integration of disease association and eQTL data using a Bayesian colocalisation approach highlights six candidate causal genes in immune-mediated diseases. Hum Mol Genet 24, 3305-13 (2015).

115. Gamazon, E.R. et al. A gene-based association method for mapping traits using reference transcriptome data. Nat Genet 47, 1091-8 (2015).

116. Barbeira, A.N. et al. Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics. bioRxiv, 045260 (2017).

117. Barbeira, A.N. et al. Integrating predicted transcriptome from multiple tissues improves association detection. PLoS Genet 15, e1007889 (2019).

118. Gusev, A. et al. Integrative approaches for large-scale transcriptome-wide association studies. Nat Genet 48, 245-52 (2016).

119. Gusev, A. et al. Transcriptome-wide association study of schizophrenia and chromatin activity yields mechanistic disease insights. Nat Genet 50, 538-548 (2018).

120. Wu, L. et al. A transcriptome-wide association study of 229,000 women identifies new candidate susceptibility genes for breast cancer. Nat Genet 50, 968-978 (2018).

121. Marigorta, U.M. et al. Transcriptional risk scores link GWAS to eQTLs and predict complications in Crohn's disease. Nat Genet 49, 1517-1521 (2017).

122. Ohashi, P.S. T-cell signalling and autoimmunity: molecular mechanisms of disease. Nat Rev Immunol 2, 427-38 (2002).

123. Speiser, D.E., Ho, P.C. & Verdeil, G. Regulatory circuits of T cell function in cancer. Nat Rev Immunol 16, 599-611 (2016).

124. Kilpinen, H. et al. Common genetic variation drives molecular heterogeneity in human iPSCs. Nature 546, 370-375 (2017).

125. Kulkarni, M.M. Digital multiplexed gene expression analysis using the NanoString nCounter system. Curr Protoc Mol Biol Chapter 25, Unit25B 10 (2011).

126. Knowles, D.A. et al. Determining the genetic basis of anthracycline-cardiotoxicity by molecular response QTL mapping in induced cardiomyocytes. Elife 7 (2018).

127. Gate, R.E. et al. Genetic determinants of co-accessible chromatin regions in activated T cells across humans. Nat Genet (2018).

128. Alasoo, K. et al. Shared genetic effects on chromatin and gene expression indicate a role for enhancer priming in immune response. Nat Genet 50, 424-431 (2018).

129. Manry, J. et al. Deciphering the genetic control of gene expression following Mycobacterium leprae antigen stimulation. PLoS Genet 13, e1006952 (2017).

130. Kim-Hellmuth, S. et al. Genetic regulatory effects modified by immune activation contribute to autoimmune disease associations. Nat Commun 8, 266 (2017).

131. Quach, H. et al. Genetic Adaptation and Neandertal Admixture Shaped the Immune System of Human Populations. Cell 167, 643-656 e17 (2016).

132. Nedelec, Y. et al. Genetic Ancestry and Natural Selection Drive Population Differences in Immune Responses to Pathogens. Cell 167, 657-669 e21 (2016).

133. Caliskan, M., Baker, S.W., Gilad, Y. & Ober, C. Host genetic variation influences gene expression response to rhinovirus infection. PLoS Genet 11, e1005111 (2015).

134. Fairfax, B.P. et al. Innate immune activity conditions the effect of regulatory variants upon monocyte gene expression. Science 343, 1246949 (2014).

135. Ye, C.J. et al. Intersection of population variation and autoimmunity genetics in human T cell activation. Science 345, 1254665 (2014).

136. Hu, X. et al. Regulation of gene expression in autoimmune disease loci and the genetic basis of proliferation in CD4+ effector memory T cells. PLoS Genet 10, e1004404 (2014).

137. Kim, S. et al. Characterizing the genetic basis of innate immune response in TLR4-activated human monocytes. Nat Commun 5, 5236 (2014).

138. Barreiro, L.B. et al. Deciphering the genetic architecture of variation in the immune response to Mycobacterium tuberculosis infection. Proc Natl Acad Sci U S A 109, 1204-9 (2012).

139. Weber, M.S. et al. Type II monocytes modulate T cell-mediated central nervous system autoimmune disease. Nat Med 13, 935-43 (2007).

140. Fairweather, D. & Cihakova, D. Alternatively activated macrophages in infection and autoimmunity. J Autoimmun 33, 222-30 (2009).

141. Schlattl, A., Anders, S., Waszak, S.M., Huber, W. & Korbel, J.O. Relating CNVs to transcriptome data at fine resolution: assessment of the effect of variant size, type, and overlap with functional regions. Genome Res 21, 2004-13 (2011).

142. Huang, J. et al. eQTL mapping identifies insertion- and deletion-specific eQTLs in multiple tissues. Nat Commun 6, 6821 (2015).

143. Huan, T. et al. Genome-wide identification of microRNA expression quantitative trait loci. Nat Commun 6, 6601 (2015).

144. Degner, J.F. et al. DNase I sensitivity QTLs are a major determinant of human expression variation. Nature 482, 390-4 (2012).

145. Pierce, B.L. et al. Co-occurring expression and methylation QTLs allow detection of common causal variants and shared biological mechanisms. Nat Commun 9, 804 (2018).

146. Hannon, E. et al. Methylation QTLs in the developing brain and their enrichment in schizophrenia risk loci. Nat Neurosci 19, 48-54 (2016).

147. Li, Y.I. et al. RNA splicing is a primary link between genetic variation and disease. Science 352, 600-4 (2016).

148. Zhang, X. et al. Identification of common genetic variants controlling transcript isoform variation in human whole blood. Nat Genet 47, 345-52 (2015).

149. Ongen, H. & Dermitzakis, E.T. Alternative Splicing QTLs in European and African Populations. Am J Hum Genet 97, 567-75 (2015).

150. Li, Q. et al. Genome-wide search for exonic variants affecting translational efficiency. Nat Commun 4, 2260 (2013).

151. Battle, A. et al. Genomic variation. Impact of regulatory variation from RNA to protein. Science 347, 664-7 (2015).

152. Sun, B.B. et al. Genomic atlas of the human plasma proteome. Nature 558, 73-79 (2018).

153. Emilsson, V. et al. Co-regulatory networks of human serum proteins link genetics to disease. Science 361, 769-773 (2018).

154. Yao, C. et al. Genome-wide mapping of plasma protein QTLs identifies putatively causal genes and pathways for cardiovascular disease. Nat Commun 9, 3268 (2018).

155. McInnes, I.B. & Schett, G. The pathogenesis of rheumatoid arthritis. N Engl J Med 365, 2205-19 (2011).

156. Iwasaki, A. & Medzhitov, R. Regulation of adaptive immunity by the innate immune system. Science 327, 291-5 (2010).

157. Iwasaki, A. & Medzhitov, R. Control of adaptive immunity by the innate immune system. Nat Immunol 16, 343-53 (2015).

158. Takeuchi, O. & Akira, S. Pattern recognition receptors and inflammation. Cell 140, 805-20 (2010).

159. Nichols, B.A., Bainton, D.F. & Farquhar, M.G. Differentiation of monocytes. Origin, nature, and fate of their azurophil granules. J Cell Biol 50, 498-515 (1971).

160. Swirski, F.K. et al. Identification of splenic reservoir monocytes and their deployment to inflammatory sites. Science 325, 612-6 (2009).

161. Passlick, B., Flieger, D. & Ziegler-Heitbrock, H.W. Identification and characterization of a novel monocyte subpopulation in human peripheral blood. Blood 74, 2527-34 (1989).

162. Ziegler-Heitbrock, L. The CD14+ CD16+ blood monocytes: their role in infection and inflammation. J Leukoc Biol 81, 584-92 (2007).

163. Auffray, C. et al. Monitoring of blood vessels and tissues by a population of monocytes with patrolling behavior. Science 317, 666-70 (2007).

164. Cros, J. et al. Human CD14dim monocytes patrol and sense nucleic acids and viruses via TLR7 and TLR8 receptors. Immunity 33, 375-86 (2010).

165. Kollmann, T.R., Kampmann, B., Mazmanian, S.K., Marchant, A. & Levy, O. Protecting the Newborn and Young Infant from Infectious Diseases: Lessons from Immune Ontogeny. Immunity 46, 350-363 (2017).

166. Adkins, B., Leclerc, C. & Marshall-Clarke, S. Neonatal adaptive immunity comes of age. Nat Rev Immunol 4, 553-64 (2004).

167. Levy, O. Innate immunity of the newborn: basic mechanisms and clinical correlates. Nat Rev Immunol 7, 379-90 (2007).

168. Kidd, P. Th1/Th2 balance: the hypothesis, its limitations, and implications for health and disease. Altern Med Rev 8, 223-46 (2003).

169. Makhseed, M. et al. Th1 and Th2 cytokine profiles in recurrent aborters with successful pregnancy and with subsequent abortions. Hum Reprod 16, 2219-26 (2001).

170. Vitoratos, N. et al. Elevated circulating IL-1beta and TNF-alpha, and unaltered IL-6 in first-trimester pregnancies complicated by threatened abortion with an adverse outcome. Mediators Inflamm 2006, 30485 (2006).

171. Reynolds, L.A. & Finlay, B.B. Early life factors that affect allergy development. Nat Rev Immunol 17, 518-528 (2017).

172. Gensollen, T., Iyer, S.S., Kasper, D.L. & Blumberg, R.S. How colonization by microbiota in early life shapes the immune system. Science 352, 539-44 (2016).

173. Renz, H. et al. An exposome perspective: Early-life events and immune development in a changing world. J Allergy Clin Immunol 140, 24-40 (2017).

174. Gluckman, P.D., Hanson, M.A., Cooper, C. & Thornburg, K.L. Effect of in utero and early-life conditions on adult health and disease. N Engl J Med 359, 61-73 (2008).

175. Carraro, S., Scheltema, N., Bont, L. & Baraldi, E. Early-life origins of chronic respiratory diseases: understanding and promoting healthy ageing. Eur Respir J 44, 1682-96 (2014).

176. Postma, D.S., Bush, A. & van den Berge, M. Risk factors and early origins of chronic obstructive pulmonary disease. Lancet 385, 899-909 (2015).

177. Barker, D.J., Winter, P.D., Osmond, C., Margetts, B. & Simmonds, S.J. Weight in infancy and death from ischaemic heart disease. Lancet 2, 577-80 (1989).

178. Barker, D.J. et al. Fetal nutrition and cardiovascular disease in adult life. Lancet 341, 938-41 (1993).

179. Hales, C.N. & Barker, D.J. Type 2 (non-insulin-dependent) diabetes mellitus: the thrifty phenotype hypothesis. Diabetologia 35, 595-601 (1992).

180. Cooper, C. et al. Growth in infancy and bone mass in later life. Ann Rheum Dis 56, 17-21 (1997).

181. Kensara, O.A. et al. Fetal programming of body composition: relation between birth weight and body composition measured with dual-energy X-ray absorptiometry and anthropometric methods in older Englishmen. Am J Clin Nutr 82, 980-7 (2005).

182. Barker, D.J. The origins of the developmental origins theory. J Intern Med 261, 412-7 (2007).

183. Wadhwa, P.D., Buss, C., Entringer, S. & Swanson, J.M. Developmental origins of health and disease: brief history of the approach and current focus on epigenetic mechanisms. Semin Reprod Med 27, 358-68 (2009).

184. Risnes, K.R., Belanger, K., Murk, W. & Bracken, M.B. Antibiotic exposure by 6 months and asthma and allergy at 6 years: Findings in a cohort of 1,401 US children. Am J Epidemiol 173, 310-8 (2011).

185. Shaw, S.Y., Blanchard, J.F. & Bernstein, C.N. Association between the use of antibiotics in the first year of life and pediatric inflammatory bowel disease. Am J Gastroenterol 105, 2687-92 (2010).

186. Kronman, M.P., Zaoutis, T.E., Haynes, K., Feng, R. & Coffin, S.E. Antibiotic exposure and IBD development among children: a population-based cohort study. Pediatrics 130, e794-803 (2012).

187. Azad, M.B., Bridgman, S.L., Becker, A.B. & Kozyrskyj, A.L. Infant antibiotic exposure and the development of childhood overweight and central adiposity. Int J Obes (Lond) 38, 1290-8 (2014).

188. Peng, S. et al. Expression quantitative trait loci (eQTLs) in human placentas suggest developmental origins of complex diseases. Hum Mol Genet 26, 3432-3441 (2017).

189. Peng, S. et al. Genetic regulation of the placental transcriptome underlies birth weight and risk of childhood obesity. PLoS Genet 14, e1007799 (2018).

190. O'Brien, H.E. et al. Expression quantitative trait loci in the developing human brain and their enrichment in neuropsychiatric disorders. Genome Biol 19, 194 (2018).

191. Hoglinger, G.U. et al. Identification of common variants influencing risk of the tauopathy progressive supranuclear palsy. Nat Genet 43, 699-705 (2011).

192. Franzen, O. et al. Cardiometabolic risk loci share downstream cis- and trans-gene regulation across tissues and diseases. Science 353, 827-30 (2016).

193. Flutre, T., Wen, X., Pritchard, J. & Stephens, M. A statistical framework for joint eQTL analysis in multiple tissues. PLoS Genet 9, e1003486 (2013).

194. Sun, W. A statistical framework for eQTL mapping using RNA-seq data. Biometrics 68, 1-11 (2012).

195. Stegle, O., Parts, L., Durbin, R. & Winn, J. A Bayesian framework to account for complex non-genetic factors in gene expression levels greatly increases power in eQTL studies. PLoS Comput Biol 6, e1000770 (2010).

196. Fusi, N., Stegle, O. & Lawrence, N.D. Joint modelling of confounding factors and prominent genetic regulators provides increased accuracy in genetical genomics studies. PLoS Comput Biol 8, e1002330 (2012).

197. Zhang, L. & Kim, S. Learning gene networks under SNP perturbations using eQTL datasets. PLoS Comput Biol 10, e1003420 (2014).

198. Zhernakova, D.V. et al. Identification of context-dependent expression quantitative trait loci in whole blood. Nat Genet 49, 139-145 (2017).

199. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B 57, 289-300 (1995).

200. Benjamini, Y. & Yekutieli, D. The control of the false discovery rate in multiple testing under dependency. Ann. Statist. 29, 1165-1188 (2001).

201. Storey, J.D. & Tibshirani, R. Statistical significance for genomewide studies. Proc Natl Acad Sci U S A 100, 9440-5 (2003).

202. Ongen, H., Buil, A., Brown, A.A., Dermitzakis, E.T. & Delaneau, O. Fast and efficient QTL mapper for thousands of molecular phenotypes. Bioinformatics 32, 1479-85 (2016).

203. Sul, J.H. et al. Accurate and fast multiple-testing correction in eQTL studies. Am J Hum Genet 96, 857-68 (2015).

204. Davis, J.R. et al. An Efficient Multiple-Testing Adjustment for eQTL Studies that Accounts for Linkage Disequilibrium between Variants. Am J Hum Genet 98, 216-24 (2016).

205. Peterson, C.B., Bogomolov, M., Benjamini, Y. & Sabatti, C. TreeQTL: hierarchical error control for eQTL findings. Bioinformatics 32, 2556-8 (2016).

206. Schadt, E.E., Woo, S. & Hao, K. Bayesian method to predict individual SNP genotypes from gene expression data. Nat Genet 44, 603-8 (2012).

207. Garner, C. Upward bias in odds ratio estimates from genome-wide association studies. Genet Epidemiol 31, 288-95 (2007).

208. Zollner, S. & Pritchard, J.K. Overcoming the winner's curse: estimating penetrance parameters from case-control data. Am J Hum Genet 80, 605-15 (2007).

209. Ioannidis, J.P., Thomas, G. & Daly, M.J. Validating, augmenting and refining genome-wide association signals. Nat Rev Genet 10, 318-29 (2009).

210. Forstmeier, W. & Schielzeth, H. Cryptic multiple hypotheses testing in linear models: overestimated effect sizes and the winner's curse. Behav Ecol Sociobiol 65, 47-55 (2011).

211. Palmer, C. & Pe'er, I. Statistical correction of the Winner's Curse explains replication variability in quantitative trait genome-wide association studies. PLoS Genet 13, e1006916 (2017).

212. Spencer, C.C., Su, Z., Donnelly, P. & Marchini, J. Designing genome-wide association studies: sample size, power, imputation, and the choice of genotyping chip. PLoS Genet 5, e1000477 (2009).

213. Skol, A.D., Scott, L.J., Abecasis, G.R. & Boehnke, M. Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat Genet 38, 209-13 (2006).

214. Su, Z., Marchini, J. & Donnelly, P. HAPGEN2: simulation of multiple disease SNPs. Bioinformatics 27, 2304-5 (2011).

215. Inouye, M. et al. An immune response network associated with blood lipid levels. PLoS Genet 6, e1001113 (2010).

216. Inouye, M. et al. Metabonomic, transcriptomic, and genomic variation of a population cohort. Mol Syst Biol 6, 441 (2010).

217. Shabalin, A.A. Matrix eQTL: ultra fast eQTL analysis via large matrix operations. Bioinformatics 28, 1353-8 (2012).

218. Peterson, C.B., Bogomolov, M., Benjamini, Y. & Sabatti, C. Many Phenotypes Without Many False Discoveries: Error Controlling Strategies for Multitrait Association Studies. Genet Epidemiol 40, 45-56 (2016).

219. Sun, L. et al. BR-squared: a practical solution to the winner's curse in genome-wide scans. Hum Genet 129, 545-52 (2011).

220. Sun, L. & Bull, S.B. Reduction of selection bias in genomewide studies by resampling. Genet Epidemiol 28, 352-67 (2005).

221. Sajuthi, S.P. et al. Mapping adipose and muscle tissue expression quantitative trait loci in African Americans to identify genes for type 2 diabetes and obesity. Hum Genet 135, 869-80 (2016).

222. Kim, Y. et al. A meta-analysis of gene expression quantitative trait loci in brain. Transl Psychiatry 4, e459 (2014).

223. Fu, J. et al. Unraveling the regulatory mechanisms underlying tissue-dependent genetic variation of gene expression. PLoS Genet 8, e1002431 (2012).

224. Aulchenko, Y.S., Ripke, S., Isaacs, A. & van Duijn, C.M. GenABEL: an R library for genome-wide association analysis. Bioinformatics 23, 1294-6 (2007).

225. Jansen, R. et al. Conditional eQTL analysis reveals allelic heterogeneity of gene expression. Hum Mol Genet 26, 1444-1451 (2017).

226. R Core Team. R: A language and environment for statistical computing. (R Foundation for Statistical Computing, Vienna, Austria, 2018).

227. Storey, J.D., Bass, A.J., Dabney, A. & Robinson, D. qvalue: Q-value estimation for false discovery rate control. (2015).

228. Delaneau, O. et al. A complete tool set for molecular QTL discovery and analysis. Nat Commun 8, 15452 (2017).

229. Kusel, M.M., de Klerk, N., Holt, P.G. & Sly, P.D. Antibiotic use in the first year of life and risk of atopic disease in early childhood. Clin Exp Allergy 38, 1921-8 (2008).

230. Kusel, M.M. et al. Role of respiratory viruses in acute upper and lower respiratory tract illness in the first year of life: a birth cohort study. Pediatr Infect Dis J 25, 680-6 (2006).

231. Kusel, M.M. et al. Early-life respiratory viral infections, atopic sensitization, and risk of subsequent development of persistent asthma. J Allergy Clin Immunol 119, 1105-10 (2007).

232. Kusel, M.M., Kebadze, T., Johnston, S.L., Holt, P.G. & Sly, P.D. Febrile respiratory illnesses in infancy and atopy are risk factors for persistent asthma and wheeze. Eur Respir J 39, 876-82 (2012).

233. Teo, S.M. et al. The infant nasopharyngeal microbiome impacts severity of lower respiratory infection and risk of asthma development. Cell Host Microbe 17, 704-15 (2015).

234. Teo, S.M. et al. Airway Microbiota Dynamics Uncover a Critical Window for Interplay of Pathogenic Bacteria and Allergy in Childhood Respiratory Disease. Cell Host Microbe 24, 341-352 e5 (2018).

235. Tang, H.H. et al. Trajectories of childhood immune development and respiratory health relevant to asthma and allergy. Elife 7 (2018).

236. Iotchkova, V. et al. GARFIELD classifies disease-relevant genomic features through integration of functional annotations with association signals. Nat Genet 51, 343-353 (2019).

237. Merk, M., Mitchell, R.A., Endres, S. & Bucala, R. D-dopachrome tautomerase (D-DT or MIF-2): doubling the MIF cytokine family. Cytokine 59, 10-7 (2012).

238. Merk, M. et al. The D-dopachrome tautomerase (DDT) gene product is a cytokine and functional homolog of macrophage migration inhibitory factor (MIF). Proc Natl Acad Sci U S A 108, E577-85 (2011).

239. Liu, C. et al. SZF1: a novel KRAB-zinc finger gene expressed in CD34+ stem/progenitor cells. Exp Hematol 27, 313-25 (1999).

240. Pierce, B.L. et al. Mediation analysis demonstrates that trans-eQTLs are often explained by cis-mediation: a genome-wide analysis among 1,800 South Asians. PLoS Genet 10, e1004818 (2014).

241. Venturini, L. et al. The stem cell zinc finger 1 (SZF1)/ZNF589 protein has a human-specific evolutionary nucleotide DNA change and acts as a regulator of cell viability in the hematopoietic system. Exp Hematol 44, 257-68 (2016).

242. Peters, J.E. et al. Insight into Genotype-Phenotype Associations through eQTL Mapping in Multiple Cell Types in Health and Immune-Mediated Disease. PLoS Genet 12, e1005908 (2016).

243. Hauberg, M.E. et al. Large-Scale Identification of Common Trait and Disease Variants Affecting Gene Expression. Am J Hum Genet 100, 885-894 (2017).

244. Belyy, A., Levanova, N., Tabakova, I., Rospert, S. & Belyi, Y. Ribosomal Protein Rps26 Influences 80S Ribosome Assembly in Saccharomyces cerevisiae. mSphere 1 (2016).

245. Ferretti, M.B., Ghalei, H., Ward, E.A., Potts, E.L. & Karbstein, K. Rps26 directs mRNA-specific translation by recognition of Kozak sequence elements. Nat Struct Mol Biol 24, 700-707 (2017).

246. Min, E.E., Roy, B., Amrani, N., He, F. & Jacobson, A. Yeast Upf1 CH domain interacts with Rps26 of the 40S ribosomal subunit. RNA 19, 1105-15 (2013).

247. Cui, D. et al. The ribosomal protein S26 regulates p53 activity in response to DNA damage. Oncogene 33, 2225-35 (2014).

248. Alexander, C. & Rietschel, E.T. Bacterial lipopolysaccharides and innate immunity. J Endotoxin Res 7, 167-202 (2001).

249. Ritchie, M.E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res 43, e47 (2015).

250. Arloth, J., Bader, D.M., Roh, S. & Altmann, A. Re-Annotator: Annotation Pipeline for Microarray Probe Sequences. PLoS One 10, e0139516 (2015).

251. Miller, J.A. et al. Strategies for aggregating gene expression data: the collapseRows R function. BMC Bioinformatics 12, 322 (2011).

252. Pain, O., Dudbridge, F. & Ronald, A. Are your covariates under control? How normalization can re-introduce covariate effects. Eur J Hum Genet 26, 1194-1201 (2018).

253. Huang, Q.Q., Ritchie, S.C., Brozynska, M. & Inouye, M. Power, false discovery rate and Winner's Curse in eQTL studies. Nucleic Acids Res 46, e133 (2018).

254. Abraham, G., Qiu, Y. & Inouye, M. FlashPCA2: principal component analysis of Biobank-scale genotype datasets. Bioinformatics 33, 2776-2778 (2017).

255. Stegle, O., Parts, L., Piipari, M., Winn, J. & Durbin, R. Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses. Nat Protoc 7, 500-7 (2012).

256. Bates, D., Machler, M., Bolker, B.M. & Walker, S.C. Fitting linear mixed-effects models using lme4. J Stat Softw 67, 1-48 (2015).

257. Phipson, B. & Smyth, G.K. Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn. Stat Appl Genet Mol Biol 9, Article39 (2010).

258. Tingley, D., Yamamoto, T., Hirose, K., Keele, L. & Imai, K. mediation: R Package for Causal Mediation Analysis. J Stat Softw 59, 1-38 (2014).

259. Hayes, A.F. & Scharkow, M. The relative trustworthiness of inferential tests of the indirect effect in statistical mediation analysis: does method really matter? Psychol Sci 24, 1918-27 (2013).

260. Bonnelykke, K. & Ober, C. Leveraging gene-environment interactions and endotypes for asthma gene discovery. J Allergy Clin Immunol 137, 667-79 (2016).

261. Bouzigon, E. et al. Effect of 17q21 variants and smoking exposure in early-onset asthma. N Engl J Med 359, 1985-94 (2008).

262. Gern, J.E. et al. Effects of dog ownership and genotype on immune development and atopy in infancy. J Allergy Clin Immunol 113, 307-14 (2004).

263. Hoffjan, S. et al. Genetic variation in immunoregulatory pathways and atopic phenotypes in infancy. J Allergy Clin Immunol 113, 511-8 (2004).

264. Thompson, E.E. et al. Integrin beta 3 genotype influences asthma and allergy phenotypes in the first 6 years of life. J Allergy Clin Immunol 119, 1423-9 (2007).

265. Moffatt, M.F. et al. Genetic variants regulating ORMDL3 expression contribute to the risk of childhood asthma. Nature 448, 470-3 (2007).

266. Moffatt, M.F. et al. A large-scale, consortium-based genomewide association study of asthma. N Engl J Med 363, 1211-1221 (2010).

267. Pividori, M., Schoettler, N., Nicolae, D.L., Ober, C. & Im, H.K. Shared and distinct genetic risk factors for childhood onset and adult onset asthma. bioRxiv, 427427 (2018).

268. Kim, K.W. et al. Genome-wide association study of recalcitrant atopic dermatitis in Korean children. J Allergy Clin Immunol 136, 678-684 e4 (2015).

269. Loisel, D.A. et al. Genetic associations with viral respiratory illnesses and asthma control in children. Clin Exp Allergy 46, 112-24 (2016).

270. Liu, X. et al. Variants in the fetal genome near pro-inflammatory cytokine genes on 2q13 are associated with gestational duration. bioRxiv, 423897 (2018).

271. Hormozdiari, F. et al. Leveraging molecular quantitative trait loci to understand the genetic architecture of diseases and complex traits. Nat Genet 50, 1041-1047 (2018).

272. Ferreira, M.A. et al. Shared genetic origin of asthma, hay fever and eczema elucidates allergic disease biology. Nat Genet 49, 1752-1757 (2017).

273. de Lange, K.M. et al. Genome-wide association study implicates immune activation of multiple integrin genes in inflammatory bowel disease. Nat Genet 49, 256-261 (2017).

274. Cooper, J.D. et al. Seven newly identified loci for autoimmune thyroid disease. Hum Mol Genet 21, 5202-8 (2012).

275. Trynka, G. et al. Dense genotyping identifies and localizes multiple common and rare variant association signals in celiac disease. Nat Genet 43, 1193-201 (2011).

276. International Multiple Sclerosis Genetics Consortium et al. Analysis of immune-related loci identifies 48 new susceptibility variants for multiple sclerosis. Nat Genet 45, 1353-60 (2013).

277. Eyre, S. et al. High-density genetic mapping identifies new susceptibility loci for rheumatoid arthritis. Nat Genet 44, 1336-40 (2012).

278. Onengut-Gumuscu, S. et al. Fine mapping of type 1 diabetes susceptibility loci and evidence for colocalization of causal variants with lymphoid gene enhancers. Nat Genet 47, 381-6 (2015).

279. Roychoudhuri, R. et al. BACH2 regulates CD8(+) T cell differentiation by controlling access of AP-1 factors to enhancers. Nat Immunol 17, 851-860 (2016).

280. Shinnakasu, R. et al. Regulated selection of germinal-center cells into the memory B cell compartment. Nat Immunol 17, 861-9 (2016).

281. Afzali, B. et al. BACH2 immunodeficiency illustrates an association between super-enhancers and haploinsufficiency. Nat Immunol 18, 813-823 (2017).

282. Roychoudhuri, R. et al. BACH2 represses effector programs to stabilize T(reg)-mediated immune homeostasis. Nature 498, 506-10 (2013).

283. Demenais, F. et al. Multiancestry association study identifies new asthma risk loci that colocalize with immune-cell enhancer marks. Nat Genet 50, 42-53 (2018).

284. Waage, J. et al. Genome-wide association and HLA fine-mapping studies identify risk loci and genetic pathways underlying allergic rhinitis. Nat Genet 50, 1072-1080 (2018).

285. Faraco, J. et al. ImmunoChip study implicates antigen presentation to T cells in narcolepsy. PLoS Genet 9, e1003270 (2013).

286. Cordell, H.J. et al. International genome-wide meta-analysis identifies new primary biliary cirrhosis risk loci and targetable pathogenic pathways. Nat Commun 6, 8019 (2015).

287. Ji, S.G. et al. Genome-wide association study of primary sclerosing cholangitis identifies new risk loci and quantifies the genetic relationship with inflammatory bowel disease. Nat Genet 49, 269-273 (2017).

288. Bentham, J. et al. Genetic association analyses implicate aberrant regulation of innate and adaptive immunity genes in the pathogenesis of systemic lupus erythematosus. Nat Genet 47, 1457-1464 (2015).

289. Liu, J.Z. et al. Dense fine-mapping study identifies new susceptibility loci for primary biliary cirrhosis. Nat Genet 44, 1137-41 (2012).

290. de Waal Malefyt, R. et al. Differential regulation of IL-13 and IL-4 production by human CD8+ and CD4+ Th0, Th1 and Th2 T cell clones and EBV-transformed B cells. Int Immunol 7, 1405-16 (1995).

291. Punnonen, J. et al. Interleukin 13 induces interleukin 4-independent IgG4 and IgE synthesis and CD23 expression by human B cells. Proc Natl Acad Sci U S A 90, 3730-4 (1993).

292. Williams, T.J., Jones, C.A., Miles, E.A., Warner, J.O. & Warner, J.A. Fetal and neonatal IL-13 production during pregnancy and at birth and subsequent development of atopic symptoms. J Allergy Clin Immunol 105, 951-9 (2000).

293. Wills-Karp, M. et al. Interleukin-13: central mediator of allergic asthma. Science 282, 2258-61 (1998).

294. Berry, M.A. et al. Sputum and bronchial submucosal IL-13 expression in asthma and eosinophilic bronchitis. J Allergy Clin Immunol 114, 1106-9 (2004).

295. Saha, S.K. et al. Increased sputum and bronchial biopsy IL-13 expression in severe asthma. J Allergy Clin Immunol 121, 685-91 (2008).

296. Tsilogianni, Z. et al. Sputum interleukin-13 as a biomarker for the evaluation of asthma control. Clin Exp Allergy 46, 1498 (2016).

297. Hanania, N.A. et al. Efficacy and safety of lebrikizumab in patients with uncontrolled asthma (LAVOLTA I and LAVOLTA II): replicate, phase 3, randomised, double-blind, placebo-controlled trials. Lancet Respir Med 4, 781-796 (2016).

298. Panettieri, R.A., Jr. et al. Tralokinumab for severe, uncontrolled asthma (STRATOS 1 and STRATOS 2): two randomised, double-blind, placebo-controlled, phase 3 clinical trials. Lancet Respir Med 6, 511-525 (2018).

299. Demedts, I.K. et al. Accumulation of dendritic cells and increased CCL20 levels in the airways of patients with chronic obstructive pulmonary disease. Am J Respir Crit Care Med 175, 998-1005 (2007).

300. Faiz, A. et al. Profiling of healthy and asthmatic airway smooth muscle cells following interleukin-1beta treatment: a novel role for CCL20 in chronic mucus hypersecretion. Eur Respir J 52 (2018).

301. Pichavant, M. et al. Asthmatic bronchial epithelium activated by the proteolytic allergen Der p 1 increases selective dendritic cell recruitment. J Allergy Clin Immunol 115, 771-8 (2005).

302. Kim, S., Lewis, C. & Nadel, J.A. CCL20/CCR6 feedback exaggerates epidermal growth factor receptor-dependent MUC5AC mucin production in human airway epithelial (NCI-H292) cells. J Immunol 186, 3392-400 (2011).

303. Okada, Y. et al. Genetics of rheumatoid arthritis contributes to biology and drug discovery. Nature 506, 376-81 (2014).

304. Carpino, N. et al. Regulation of ZAP-70 activation and TCR signaling by two related proteins, Sts-1 and Sts-2. Immunity 20, 37-46 (2004).

305. Okabe, N. et al. Suppressor of TCR signaling-2 (STS-2) suppresses arthritis development in mice. Mod Rheumatol 28, 626-636 (2018).

306. Hong, S.S. et al. A Novel Small-Molecule Inhibitor Targeting the IL-6 Receptor beta Subunit, Glycoprotein 130. J Immunol 195, 237-45 (2015).

307. Hom, G. et al. Association of systemic lupus erythematosus with C8orf13-BLK and ITGAM-ITGAX. N Engl J Med 358, 900-9 (2008).

308. Demirci, F.Y. et al. Multiple signals at the extended 8p23 locus are associated with susceptibility to systemic lupus erythematosus. J Med Genet 54, 381-389 (2017).

309. Gourh, P. et al. Association of the C8orf13-BLK region with systemic sclerosis in North-American and European populations. J Autoimmun 34, 155-62 (2010).

310. Lessard, C.J. et al. Variants at multiple loci implicated in both innate and adaptive immune responses are associated with Sjogren's syndrome. Nat Genet 45, 1284-92 (2013).

311. Gregersen, P.K. et al. REL, encoding a member of the NF-kappaB family of transcription factors, is a newly defined risk locus for rheumatoid arthritis. Nat Genet 41, 820-3 (2009).

312. Mentlein, L. et al. The rheumatic disease-associated FAM167A-BLK locus encodes DIORA-1, a novel disordered protein expressed highly in bronchial epithelium and alveolar macrophages. Clin Exp Immunol 193, 167-177 (2018).

313. Samuelson, E.M., Laird, R.M., Maue, A.C., Rochford, R. & Hayes, S.M. Blk haploinsufficiency impairs the development, but enhances the functional responses, of MZ B cells. Immunol Cell Biol 90, 620-9 (2012).

314. Aqrawi, L.A. et al. Clinical associations and expression pattern of the autoimmunity susceptibility factor DIORA-1 in patients with primary Sjogren's syndrome. Ann Rheum Dis 77, 1840-1842 (2018).

315. Guo, Y. & Wang, A.Y. Novel Immune Check-Point Regulators in Tolerance Maintenance. Front Immunol 6, 421 (2015).

316. Le Page, C. et al. BTN3A2 expression in epithelial ovarian cancer is associated with higher tumor infiltrating T cells and a better prognosis. PLoS One 7, e38541 (2012).

317. Vavassori, S. et al. Butyrophilin 3A1 binds phosphorylated antigens and stimulates human gammadelta T cells. Nat Immunol 14, 908-16 (2013).

318. Wang, H. et al. Butyrophilin 3A1 plays an essential role in prenyl pyrophosphate stimulation of human Vgamma2Vdelta2 T cells. J Immunol 191, 1029-42 (2013).

319. Vantourout, P. et al. Heteromeric interactions regulate butyrophilin (BTN) and BTN-like molecules governing gammadelta T cell biology. Proc Natl Acad Sci U S A 115, 1039-1044 (2018).

320. Lamontagne, M. et al. Leveraging lung tissue transcriptome to uncover candidate causal genes in COPD genetic associations. Hum Mol Genet 27, 1819-1829 (2018).

321. Hemani, G., Bowden, J. & Davey Smith, G. Evaluating the potential role of pleiotropy in Mendelian randomization studies. Hum Mol Genet 27, R195-R208 (2018).

322. Zhao, Q., Wang, J., Hemani, G., Bowden, J. & Small, D.S. Statistical inference in two-sample summary-data Mendelian randomization using robust adjusted profile score. arXiv:1801.09652 (2018).

323. Swanson, S.A. & Hernan, M.A. The challenging interpretation of instrumental variable estimates under monotonicity. Int J Epidemiol 47, 1289-1297 (2018).

324. Dubois, P.C. et al. Multiple common variants for celiac disease influencing immune gene expression. Nat Genet 42, 295-302 (2010).

325. Hinks, A. et al. Dense genotyping of immune-related disease regions identifies 14 new susceptibility loci for juvenile idiopathic arthritis. Nat Genet 45, 664-9 (2013).

326. Tsoi, L.C. et al. Identification of 15 new psoriasis susceptibility loci highlights the role of innate immunity. Nat Genet 44, 1341-8 (2012).

327. Bradfield, J.P. et al. A genome-wide meta-analysis of six type 1 diabetes cohorts identifies multiple associated loci. PLoS Genet 7, e1002293 (2011).

328. Gong, J. et al. PancanQTL: systematic identification of cis-eQTLs and trans-eQTLs in 33 cancer types. Nucleic Acids Res 46, D971-D976 (2018).

329. Singh, T. et al. Characterization of expression quantitative trait loci in the human colon. Inflamm Bowel Dis 21, 251-6 (2015).

330. Zhang, T. et al. Cell-type-specific eQTL of primary melanocytes facilitates identification of melanoma susceptibility genes. Genome Res 28, 1621-1635 (2018).

331. Petit, T. et al. Detection of maternal cells in human fetal blood during the third trimester of pregnancy using allele-specific PCR amplification. Br J Haematol 98, 767-71 (1997).

332. Morin, A.M. et al. Maternal blood contamination of collected cord blood can be identified using DNA methylation at three CpGs. Clin Epigenetics 9, 75 (2017).

333. Urbut, S.M., Wang, G., Carbonetto, P. & Stephens, M. Flexible statistical methods for estimating and testing effects in genomic studies with multiple conditions. Nat Genet 51, 187-195 (2019).

Minerva Access is the Institutional Repository of The University of Melbourne

Author/s: Huang, Qinqin

Title: The genetics of gene expression: from simulations to the early-life origins of immune diseases

Date: 2019

Persistent Link: http://hdl.handle.net/11343/233195

Terms and Conditions: Copyright in works deposited in Minerva Access is retained by the copyright owner. The work may not be altered without permission from the copyright owner. Readers may only download, print and save electronic copies of whole works for their own personal non-commercial use. Any use that exceeds these limits requires permission from the copyright owner. Attribution is essential when quoting or paraphrasing from these works.