19
Heritability estimation in modern genetics and connections to some new results for quadratic forms in statistics Lee H. Dicker Rutgers University and Amazon, NYC * Based on joint work with Ruijun Ma (Rutgers), Murat A. Erdogdu (Stanford/MSR), and Yakir Reshef (Harvard) Borchard Colloquium July 4, 2017 * Work conducted prior to joining Amazon

Heritability estimation in modern genetics and connections

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Heritability estimation in modern genetics andconnections to some new results for quadratic

forms in statistics

Lee H. DickerRutgers University and Amazon, NYC∗

Based on joint work withRuijun Ma (Rutgers), Murat A. Erdogdu (Stanford/MSR),

and Yakir Reshef (Harvard)

Borchard ColloquiumJuly 4, 2017

∗Work conducted prior to joining Amazon

Heritability in genetics

I Heritability is a fundamental concept in genetics. It represents the extentto which an exhibited trait in an individual is attributable to genetics.

I (Slightly) more technical formulation: The proportion of variation in aphenotype that can be explained by the genotype.

I Population-level estimates of heritability can be obtained from phenotypedata (e.g. human height or milk fat percentage in cows) and genotype data(e.g. pedigree information or GWAS data).

I Heritability estimation has a long history in statistics going back to R.A.Fisher.

I Currently, LMM-based methods are probably the most prevalant approachfor estimating heritability with GWAS data.

I LMM-based methods for heritability estimation are not new (Henderson, 1950,Ann. Math. Stat.).

I Modern LMM-based methods for GWAS data took off after (Yang et al., 2010,Nat. Genet.), which gave estimates of heritability for human height.

I Thousands of papers on LMM-based heritability for GWAS data since 2010.

1 / 15

Heritability in genetics

Hayes et al. (2010) PLoS Genet.

I Much of the recent work on LMM-based methods for heritability estimationis built on older research in to cattle breeding.

2 / 15

This talk

I Some new statistical perspectives on heritability estimation.I Main messages:

1. Standardize your predictors.2. There’s a big opportunity for statistics to make an impact in this

area with thoughtful modeling and technical expertise.

3 / 15

This talk

I Some new statistical perspectives on heritability estimation.I Main messages:

1. Standardize your predictors.

2. There’s a big opportunity for statistics to make an impact in thisarea with thoughtful modeling and technical expertise.

3 / 15

This talk

I Some new statistical perspectives on heritability estimation.I Main messages:

1. Standardize your predictors.2. There’s a big opportunity for statistics to make an impact in this

area with thoughtful modeling and technical expertise.

3 / 15

Genetic relatedness and heritabilityI Let y = (y1, . . . , yn)> ∈ Rn be a vector of centered real-valued outcomes,

where yi is the phenotype value for individual i in some population.

I Assume thaty = g + e (1)

can be decomposed into an additive genetic effect g ∼ MV(0, σ2gK) and an

uncorrelated noise vector e ∼ MV(0, σ2e I).

I K = (Kij) is the genetic relationship matrix (GRM) and Kij measures geneticsimilarity between individuals i and j; the GRM is standardized so that Kii = 1.

I The noise vector e may contain environmental noise, measurement error, andother non-additive genetic effects.

I Frequently, the data is transformed before reaching the representation (1), e.g.project out covariates or other fixed effects.

I The heritability coefficient is

h2 =σ2

g

σ2g + σ2

e.

I Since E(yy>) = σ2gK + σ2

e I, the heritability coefficient is the R2 forregressing yiyj on Kij , i.e. regressing phenotype similarity on geneticsimilarity.

4 / 15

Genetic relatedness and heritability

I To estimate h2, Henderson (1950) used least squares for regressing yiyj onKij .

I Maximum likelihood is also widely used: Let

`(s2g , s

2e ) =

12

log det(s2gK + s2

e I) +12

y>(s2gK + s2

e I)−1y

be the log-likelihood for (σ2g , σ

2e) under the assumption that g, e are

Gaussian; minimize ` to get the MLE for (σ2g , σ

2e) and h2.

I Henderson’s estimator and the MLE are both reasonable estimators for agiven GRM K — how should we choose K?

I Pre-GWAS: Pedigree/familial information determines K and describes howindividuals are related.

I Post-GWAS: Molecular genetic data gives fine-grained measure of geneticsimilarity.

5 / 15

Genetic relatedness and LMMs for GWAS data

I For GWAS data, each individual’s genetic data can be encoded in a vectorxi = (xi1, . . . , xim)> ∈ Rm, where xij represents the (frequently standardized)minor allele count for the j-th single nucelotide polymorphism (SNP).

I xij is tri-nary.I m can be in the 100K-1M’s; n may be in the 1-10K’s.

I The GRM is determined by a kernel function K with Kij = K(xi , xj). Thelinear kernel

K(xi , xj) =1m

x>i xj

is the most widely used in practice.

I With the linear kernel, the original model y = g + e can be rewritten as alinear random-effects model (LMM)

y = Xb + e,

where X = (x1, . . . , xn)>, b = (b1, . . . , bm)> ∈ Rm, andb1, . . . , bm ∼ MV(0, σ2

g/m) are iid.

6 / 15

LMMs for heritability estimation: Questions

y = Xb + e, (2)

I Variance components methods for estimating h2 under (2) have emergedas one of the most popular strategies for heritability estimation in GWAS.

I However, a number of challenges have emerged.I Causal SNPs and linkage disequilibrium. In most generative models for

linking GWAS data and outcomes, there is a fixed collection of causal SNPsC ⊆ [m] and b is assumed to be a sparse vector supported on C. If the SNPsxi are highly correlated — i.e. they are in linkage disequilibrium (LD) — theLMM approach can give badly biased estimates for h2 (Speed et al., 2012,Am. J. Hum. Genet.)

I Partitioning heritability. Sometimes it’s desirable to estimate the heritabilityattributable to a subset of SNPs, S ⊆ [m]. To date, there’s no consensus abouthow this should be done and existing solutions have significant drawbacks.

7 / 15

Causal SNPs and linkage disequilibrium

I Simulations show standard LMM estimators for h2 are biased when causalSNPs are located in regions with low (or high) LD.

I Settings:I n = 500, m = 1000.I σ2

e = 0.5.

I b =√

1m (z1, . . . , zm/2, 0, . . . , 0)> ∈ Rm with z1, . . . , zm/2 ∼ N(0, 1) iid. So

C = {1, . . . ,m/2}.

I xi ∼ N(0,Σ) with Σ =

(AR(0.2) 0

0 AR(0.8)

).

I Then σ2g = 0.5 and h2 = 0.5.

h2 h2

0.5 Mean: 0.43295% CI: (0.404,0.460)

Table: Summary of results based on simulating 50 independent datasets.

8 / 15

Partitioning heritabilityI One method for partitioning heritability is to assume a LMM with multiple

variance components (Finucane et al., 2015, Nat. Genet.):

y = XSbS + XSc bSc + e, (3)

where

bi ∼

MV

(0,

σ2S

|S|

), if i ∈ S,

MV(0,

σ2Sc

m−|S|

)if i < S

are all independent.

I The S-partitioned heritability is

h2S

=σ2S

σ2S

+ σ2Sc + σ2

e.

I h2S

can be estimated, for instance, using maximum for the model (3) under aGaussian random-effects assumption.

I However, estimating the total heritability h2 = (σ2S

+ σ2Sc )/(σ2

S+ σ2

Sc + σ2e)

under this model has the same issues noted in the previousslide/simulation.

9 / 15

LMMs for heritability estimation

I Challenges arise when the location of causal SNPs iscorrelated with LD structure.

I Solution:

Remove LD structure, i.e. whiten.

10 / 15

LMMs for heritability estimation

I Challenges arise when the location of causal SNPs iscorrelated with LD structure.

I Solution: Remove LD structure, i.e. whiten.

10 / 15

LMMs for heritability estimation: Mahalanobis kernel

I Our proposal: Use the Mahalanobis kernel to measure genetic similarity.

Kij = x>i Σ−1xj ,

where Cov(xi) = Σ.

I This resolves the causal SNPs/linkage disequilibrium and partitioningheritability problems, and clears a path for more modeling progress andresults.

I Argument:

Kij = x>i Σ−1xj ↔y = Xb + e,with b ∼ MV(0, σ2

gΣ−1/m)↔ ∗ y = XΣbΣ + e,

fixed-effects model,

where XΣ = XΣ−1/2 and bΣ = Σ1/2b.

I The Mahalanobis kernel has been used extensively elsewhere in genetics(e.g. for association testing), but not to our knowledge for heritabilityestimation.

11 / 15

Revisiting Causal SNPs/LD and partitioning heritability

I To find the Mahalanobis-MLE h2Σ, replace X by XΣ−1/2 and then find the

MLE for h2 using the linear kernel, i.e. whiten/decorrelate/standardize thepredictors and compute the usual MLE.

h2 h2 h2Σ

0.5 Mean: 0.432 Mean: 0.49995% CI: (0.404,0.460) 95% CI: (0.473, 0.525)

I Let

Σ =

(ΣS,S ΣS,Sc

Σ>S,Sc ΣSc ,Sc

).

The partitioned heritability is defined via the decomposition

y = gS + gSc + e,

wheregS = XS(bS + Σ−1

S,SΣS,Sc bSc ),

gSc = (XSc − XSΣ−1S,SΣS,Sc )bSc .

I Under the LMM y = Xb + e with b ∼ MV(0, σ2gΣ−1/m), gS and gSc are

uncorrelated.

12 / 15

Why does this work?

I Model misspecification results for variance components estimationproblems (D & Erdogdu, 2016, 2017).

I Meta-result. Assume:

(i) y = Xb + e ∈ Rn

(ii) Cov(e) = σ2e .

(iii) b ∈ Rm is any (fixed or random) vector with ‖b‖2 � 1.

If xi ∼ N(0,Σ) are iid, then the variance component estimators for theMahalanobis kernel (σ2

g , σ2e) are nice:

(i) (σ2g , σ

2e)→ (σ2

g , σ2e), where σ2

g = lim ‖b‖2.(ii) (σ2

g , σ2e) are approximately normal.

I Implication: If genotypes are random, then the Mahalanobis kernel forestimating heritability works even with fixed genetic effects – and, inparticular, when the location of causal SNPs is associated with LDstructure.

13 / 15

Why does this work?I Proof idea: In the fixed-effects model with random genotypes, b inherits

randomness from X .

I Let θ = (σ2g , σ

2e) be the MLE and assume WLOG that xi ∼ N(0, I).

I Key point: Since X D= XU for any m ×m orthogonal matrix U,

θ(y,X)D= θ(y,X),

wherey = X b + e, b ∼ uniform{Sm−1(σ2

g)}.

I In other words, θ = θ(y,X) has the same distribution as θ(y,X), where thedata are drawn from a random-effects model.

I Since b is approximately Gaussian for large m, we can use tools forvariance components estimation in Gaussian random-effects models toapproximate the distribution of θ.

I Finite sample normal approximation results for quadratic forms.

I Other comments:I We can get closed form expressions for the asymptotic variance of θ using

results on the Marcenko-Pastur distribution.

14 / 15

Open questions

I Results for non-Gaussian xi .I SNP genotype data is always non-Gaussian, but simulations

suggest results may still hold.I Binary outcomes.

I Erdogdu, Bayati & D (2016).

I Applications to association testing problems.

15 / 15