

Identification of allele-specific expression quantitative trait loci using a Poisson generalized linear model

Agata Foryciarz¹, Genna Gliner², YoSon Park³, Christopher D Brown³, Barbara Engelhardt¹

¹ Computer Science Department and Center for Statistics and Machine Learning, Princeton University
² Operations Research and Financial Engineering Department, Princeton University
³ Department of Genetics, Perelman School of Medicine, University of Pennsylvania

[Poster panels: Background, Methods, Results, Model, Data, References]

[Figure: method workflow. Step 1, Inference Procedure, outputs the distribution of effect sizes β. Step 2, Hypothesis Test, outputs the list of significant relationships. Step 3, Identify aseQTLs, outputs the aseQTLs.]

Inference procedure

•  MAP estimates on β_{0kl}, e_{ijkl}
•  Variational inference used to find the β_{kl} distribution
•  Iterates through β values until the log likelihood converges
•  Implemented in Python using the Edward library [3]
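The idea of iterating until the log likelihood converges can be illustrated with a much simpler stand-in. The sketch below is not the Edward-based variational inference used on the poster: it assumes a reduced model with only an intercept and one effect size (no varying effects or error term), and it finds a MAP estimate with Newton steps in NumPy. The function names, the prior standard deviation, and the simulated data are all hypothetical.

```python
import numpy as np

def log_posterior(params, X, y, prior_sd):
    """Poisson log likelihood (up to a constant) plus a Gaussian log prior."""
    eta = X @ params
    return np.sum(y * eta - np.exp(eta)) - np.sum(params ** 2) / (2 * prior_sd ** 2)

def fit_map(x, y, prior_sd=10.0, tol=1e-8, max_iter=100):
    """Newton iterations that stop once the log posterior stops changing."""
    X = np.column_stack([np.ones(len(x)), x])           # columns: intercept, haplotype
    params = np.array([np.log(y.mean() + 1e-8), 0.0])   # start at the mean rate, zero effect
    history = [log_posterior(params, X, y, prior_sd)]
    for _ in range(max_iter):
        lam = np.exp(X @ params)
        grad = X.T @ (y - lam) - params / prior_sd ** 2
        hess = -(X.T * lam) @ X - np.eye(2) / prior_sd ** 2
        params = params - np.linalg.solve(hess, grad)
        history.append(log_posterior(params, X, y, prior_sd))
        if abs(history[-1] - history[-2]) < tol:         # convergence check
            break
    return params, history

# Simulated example with a known effect size (hypothetical values).
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=2000)
y = rng.poisson(np.exp(1.0 + 0.5 * x))
estimate, history = fit_map(x, y)
print("beta0, beta:", estimate, "iterations:", len(history) - 1)
```

Variational inference returns a full approximate posterior for β_{kl} rather than a single point, which is what the later HDI and ROPE steps operate on.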

Hypothesis test: Region of Practical Equivalence (ROPE) around the null value 0 [2]

* Under the Varying Effect (VE) model only

3. Identifying variants as aseQTLs

We generate multiple associations for each gene-variant pair, one for each coding SNP inside the gene. We identify an aseQTL as a gene-variant pair with at least one significant association, β_{kl} ≠ 0.
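The aggregation rule described here can be sketched in a few lines. The record layout (gene, variant, coding SNP, significance flag) and the identifiers are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def identify_aseqtls(associations):
    """Group significant associations by (gene, variant) pair.

    `associations` is an iterable of (gene, variant, coding_snp, is_significant).
    A pair is reported as an aseQTL if at least one coding SNP is significant.
    """
    hits = defaultdict(list)
    for gene, variant, coding_snp, is_significant in associations:
        if is_significant:
            hits[(gene, variant)].append(coding_snp)
    return dict(hits)

# Toy input: geneA/var1 has two coding SNPs, only one of them significant.
example = [
    ("geneA", "var1", "snp1", False),
    ("geneA", "var1", "snp2", True),
    ("geneA", "var2", "snp1", False),
    ("geneB", "var3", "snp9", True),
]
aseqtls = identify_aseqtls(example)
print(sorted(aseqtls))
```

Keeping the supporting coding SNPs per pair makes it easy to see how many associations back each reported aseQTL.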

Inputs: genotypes (x), expression counts (y), varying effects*

Hypotheses: H_0: β = 0, H_1: β ≠ 0

1. Find the 95% Highest Density Interval (HDI) of the distribution of β

2. Determine the ROPE around 0 by controlling the false discovery rate at < 10%

3. Perform the hypothesis test based on ROPE
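The three steps above can be sketched on posterior samples of β. The poster does not spell out the decision rule, so this example assumes the rule described by Kruschke and Liddell [2]: reject the null when the 95% HDI lies entirely outside the ROPE. The function names and the ROPE half-width of 0.2 are hypothetical.

```python
import numpy as np

def hdi(samples, mass=0.95):
    """Narrowest interval that contains `mass` of the samples."""
    s = np.sort(np.asarray(samples, dtype=float))
    n = s.size
    k = int(np.ceil(mass * n))                 # number of samples inside the interval
    widths = s[k - 1:] - s[: n - k + 1]        # width of every window of k samples
    start = int(np.argmin(widths))
    return s[start], s[start + k - 1]

def is_significant(samples, rope_half_width, mass=0.95):
    """True when the HDI falls completely outside [-rope, +rope] (assumed rule)."""
    lo, hi = hdi(samples, mass)
    return bool(lo > rope_half_width or hi < -rope_half_width)

rng = np.random.default_rng(0)
strong = rng.normal(0.8, 0.1, size=4000)       # posterior far from zero
null = rng.normal(0.0, 0.1, size=4000)         # posterior centred on zero
strong_call = is_significant(strong, rope_half_width=0.2)
null_call = is_significant(null, rope_half_width=0.2)
print(hdi(strong), strong_call, null_call)
```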

[Figure: log likelihood (−8e4 to −2e4) versus iteration (0 to 500).]

[1] Andrey A. Shabalin. Matrix eQTL: ultra fast eQTL analysis via large matrix operations. Bioinformatics, 28(10):1353–1358, 2012.
[2] John K. Kruschke and Torrin M. Liddell. The Bayesian new statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review, pages 1–29, 2017.
[3] Dustin Tran, Alp Kucukelbir, Adji B. Dieng, Maja Rudolph, Dawen Liang, and David M. Blei. Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787, 2016.
[4] Xiang Zhou and Matthew Stephens. Genome-wide efficient mixed-model analysis for association studies. Nature Genetics, 44(7):821–824, 2012.
[5] Samuel Müller, Janice L. Scealy, and Alan H. Welsh. Model selection in linear mixed models. Statistical Science, 28(2):135–167, 2013.
[6] Rachel L. Kember et al. Genetic pleiotropy between mood disorders, metabolic, and endocrine traits in a multigenerational pedigree. bioRxiv, 196055, 2017.
[7] Xiang Zhou and Matthew Stephens. Genome-wide efficient mixed-model analysis for association studies. Nature Genetics, 44(7):821–824, 2012.

Poisson Generalized Linear Model

\begin{align*}
y_{ijk} &\sim \mathrm{Poisson}(\lambda_{ijkl}) \\
\log(\lambda_{ijkl}) &= \beta_{0kl} + x_{ijk}\,\beta_{kl} + G_i u_i + e_{ijkl} \\
(u_1, u_2, \ldots, u_n) &\sim \mathrm{MVN}_n(0, \sigma_u^2 I) \\
e_{ijkl} &\sim N(0, \sigma_e^2) \\
\beta_{kl} &\sim N(0, \sigma_\beta^2) \\
\beta_{0kl} &\sim N(0, \sigma_{\beta_0}^2)
\end{align*}

•  i: individual
•  j: chromosome copy, {1, 2}
•  k: eQTL position
•  l: coding SNP position
•  β_{kl}: effect size; a nonzero value suggests an aseQTL relationship
•  y_{ijk}: observed expression level
•  x_{ijk}: individual's haplotype
•  u_i: random effect term
•  K: population structure (kinship matrix)
•  e_{ijkl}: error term
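To make the generative reading of the model concrete, the following sketch simulates counts for a single eQTL position and coding SNP (fixed k and l). Everything numeric here is hypothetical: the variances, the effect sizes, and the identity matrix used as a placeholder for G. It also assumes that the varying-effect term for individual i is row i of G multiplied by the vector u, which is one reading of the G_i u_i notation.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 99                          # individuals (index i)
sigma_u, sigma_e = 0.3, 0.1     # hypothetical standard deviations
beta0, beta = 2.0, 0.5          # hypothetical intercept and effect size

x = rng.integers(0, 2, size=(n, 2))        # allele carried on each chromosome copy j
G = np.eye(n)                              # placeholder for G = U S from the kinship matrix
u = rng.normal(0.0, sigma_u, size=n)       # varying effects, one per individual
e = rng.normal(0.0, sigma_e, size=(n, 2))  # error term

# log(lambda) = beta0 + x * beta + (row i of G) . u + e
log_lam = beta0 + x * beta + (G @ u)[:, None] + e
y = rng.poisson(np.exp(log_lam))           # allele-specific expression counts

def poisson_loglik(y, log_lam):
    """Poisson log likelihood, dropping the log(y!) constant."""
    return float(np.sum(y * log_lam - np.exp(log_lam)))

print(y.shape, poisson_loglik(y, log_lam))
```

Setting `sigma_u` to zero gives a simulation without the population-structure term.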

We apply a singular value decomposition (SVD) to the kinship matrix to create uncorrelated varying effects.

Any two humans share 99.5% of their DNA [4]. The remaining 0.5% of positions in the genome, known as variants, hold the key to understanding the biological differences between individuals.

Expression quantitative trait loci (eQTL) studies have mapped hundreds of thousands of variants that affect gene expression variation, extending our understanding of the functional roles of non-coding sequences. However, eQTL studies require large sample sizes and dense genotyping of each sample.

In contrast, identifying genetic variants associated with allele-specific expression (ASE), known as aseQTLs, is a powerful approach to mapping genetic variation that regulates gene expression. Here we present a Bayesian method to detect ASE and aseQTLs using a Poisson generalized linear model (GLM).

We identify 226,749 significant associations using the Fixed Effect model and 207,246 significant associations using the Varying Effect model across 22 genes. As anticipated, the Varying Effect model identified fewer associations than the Fixed Effect model, since it discounted associations accounted for by population structure.

The goal of the hypothesis test is to find a threshold at which the effect size β should be considered significant. We use the Region of Practical Equivalence (ROPE) method to define a range of β values close to 0 that are considered not significant. We apply each model to both a regular and a permuted dataset, in order to identify a threshold that makes 90% of all permuted associations not significant.
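One way to calibrate such a threshold is sketched below. It assumes the decision rule "significant when the 95% HDI lies entirely outside the ROPE" and a symmetric ROPE [-r, r]. For each permuted association it computes the smallest r that would make that association not significant, then takes the value that covers 90% of them. The function name and the simulated HDIs are hypothetical.

```python
import numpy as np

def calibrate_rope(permuted_hdis, null_fraction=0.90):
    """Smallest half-width r such that at least `null_fraction` of the
    permuted HDIs are not entirely outside [-r, r]."""
    hdis = np.asarray(permuted_hdis, dtype=float)
    lo, hi = hdis[:, 0], hdis[:, 1]
    # An HDI is called significant when lo > r or hi < -r, so it stops
    # being significant once r >= max(lo, -hi); zero if it already spans 0.
    needed = np.sort(np.maximum(np.maximum(lo, -hi), 0.0))
    k = int(np.ceil(null_fraction * needed.size)) - 1
    return float(needed[k])

# Hypothetical HDIs from a permuted dataset: centres near zero, fixed width.
rng = np.random.default_rng(0)
centres = rng.normal(0.0, 0.3, size=1000)
permuted = np.column_stack([centres - 0.2, centres + 0.2])

rope = calibrate_rope(permuted)
not_significant = np.maximum(permuted[:, 0], -permuted[:, 1]) <= rope
print(rope, not_significant.mean())
```

The same half-width would then be applied to the HDIs from the regular (unpermuted) dataset.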

We applied the Varying Effect and the Fixed Effect models to 99 individuals from the Old Order Amish RNA-sequencing dataset [6]. We ran associations for a window of 1 kbp from the start of each gene position for chromosomes 1 through 22. We tested a total of 14,872 genes and ran 800,000 associations per model. The computation time per association varied from 60 s to 3,600 s, depending on the number of eQTLs tested.

We test two models: the Varying Effect model, which accounts for population structure, and the Fixed Effect model, which does not.

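\begin{align*}
K &= U S S U^T, \qquad G = U S \\
\mathrm{var}(\log \lambda) &= \sigma_u^2 G G^T + \sigma_e^2 I \\
&= \sigma_u^2 U S S U^T + \sigma_e^2 I \\
&= \sigma_u^2 K + \sigma_e^2 I
\end{align*}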
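The decomposition can be checked numerically on a toy matrix. The kinship matrix below is a hypothetical symmetric positive semidefinite matrix, not real data, and S is taken to be the diagonal matrix of square roots of the singular values so that K = U S S U^T holds.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 8))
K = A @ A.T / 8                      # toy symmetric positive semidefinite "kinship" matrix

U, singular_values, _ = np.linalg.svd(K)
S = np.diag(np.sqrt(singular_values))
G = U @ S                            # G = U S, so G G^T = U S S U^T = K

sigma_u, sigma_e = 0.3, 0.1          # hypothetical standard deviations
cov_from_G = sigma_u ** 2 * G @ G.T + sigma_e ** 2 * np.eye(5)
cov_from_K = sigma_u ** 2 * K + sigma_e ** 2 * np.eye(5)

print(np.allclose(G @ G.T, K), np.allclose(cov_from_G, cov_from_K))
```

Because the entries of u are independent with variance σ_u², multiplying by G carries the kinship structure into the linear predictor.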

We study the relationship between the effect size and the distance from the transcription start site (TSS) for the significant associations. The relationship between distance from the TSS and association strength differs between the two models.

[Figure panels: Fixed Effect model; Varying Effect model]

We compare the significant associations found by the Varying Effect model with those identified by the GEMMA method [7]. Of the significant associations we identify, 75.65% are also identified by GEMMA.
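An overlap percentage of this kind is the share of one set that also appears in another. A minimal sketch, with hypothetical association identifiers standing in for real results:

```python
def overlap_fraction(ours, reference):
    """Fraction of `ours` that also appears in `reference`."""
    ours, reference = set(ours), set(reference)
    if not ours:
        return 0.0
    return len(ours & reference) / len(ours)

# Hypothetical (gene, variant) identifiers.
glm_hits = {("geneA", "var1"), ("geneA", "var2"), ("geneB", "var3"), ("geneC", "var4")}
gemma_hits = {("geneA", "var2"), ("geneB", "var3"), ("geneC", "var4"), ("geneD", "var5")}

shared = overlap_fraction(glm_hits, gemma_hits)
print(f"{100 * shared:.2f}% of GLM associations are also found by GEMMA")
```

The measure is not symmetric: dividing by the size of the GEMMA set instead would answer a different question.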

[Figure: GLM effect size versus GEMMA effect size.]

[Figure: number of associations per chromosome.]