Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
Heritability estimation in modern genetics andconnections to some new results for quadratic
forms in statistics
Lee H. DickerRutgers University and Amazon, NYC∗
Based on joint work withRuijun Ma (Rutgers), Murat A. Erdogdu (Stanford/MSR),
and Yakir Reshef (Harvard)
Borchard ColloquiumJuly 4, 2017
∗Work conducted prior to joining Amazon
Heritability in genetics
I Heritability is a fundamental concept in genetics. It represents the extentto which an exhibited trait in an individual is attributable to genetics.
I (Slightly) more technical formulation: The proportion of variation in aphenotype that can be explained by the genotype.
I Population-level estimates of heritability can be obtained from phenotypedata (e.g. human height or milk fat percentage in cows) and genotype data(e.g. pedigree information or GWAS data).
I Heritability estimation has a long history in statistics going back to R.A.Fisher.
I Currently, LMM-based methods are probably the most prevalant approachfor estimating heritability with GWAS data.
I LMM-based methods for heritability estimation are not new (Henderson, 1950,Ann. Math. Stat.).
I Modern LMM-based methods for GWAS data took off after (Yang et al., 2010,Nat. Genet.), which gave estimates of heritability for human height.
I Thousands of papers on LMM-based heritability for GWAS data since 2010.
1 / 15
Heritability in genetics
Hayes et al. (2010) PLoS Genet.
I Much of the recent work on LMM-based methods for heritability estimationis built on older research in to cattle breeding.
2 / 15
This talk
I Some new statistical perspectives on heritability estimation.I Main messages:
1. Standardize your predictors.2. There’s a big opportunity for statistics to make an impact in this
area with thoughtful modeling and technical expertise.
3 / 15
This talk
I Some new statistical perspectives on heritability estimation.I Main messages:
1. Standardize your predictors.
2. There’s a big opportunity for statistics to make an impact in thisarea with thoughtful modeling and technical expertise.
3 / 15
This talk
I Some new statistical perspectives on heritability estimation.I Main messages:
1. Standardize your predictors.2. There’s a big opportunity for statistics to make an impact in this
area with thoughtful modeling and technical expertise.
3 / 15
Genetic relatedness and heritabilityI Let y = (y1, . . . , yn)> ∈ Rn be a vector of centered real-valued outcomes,
where yi is the phenotype value for individual i in some population.
I Assume thaty = g + e (1)
can be decomposed into an additive genetic effect g ∼ MV(0, σ2gK) and an
uncorrelated noise vector e ∼ MV(0, σ2e I).
I K = (Kij) is the genetic relationship matrix (GRM) and Kij measures geneticsimilarity between individuals i and j; the GRM is standardized so that Kii = 1.
I The noise vector e may contain environmental noise, measurement error, andother non-additive genetic effects.
I Frequently, the data is transformed before reaching the representation (1), e.g.project out covariates or other fixed effects.
I The heritability coefficient is
h2 =σ2
g
σ2g + σ2
e.
I Since E(yy>) = σ2gK + σ2
e I, the heritability coefficient is the R2 forregressing yiyj on Kij , i.e. regressing phenotype similarity on geneticsimilarity.
4 / 15
Genetic relatedness and heritability
I To estimate h2, Henderson (1950) used least squares for regressing yiyj onKij .
I Maximum likelihood is also widely used: Let
`(s2g , s
2e ) =
12
log det(s2gK + s2
e I) +12
y>(s2gK + s2
e I)−1y
be the log-likelihood for (σ2g , σ
2e) under the assumption that g, e are
Gaussian; minimize ` to get the MLE for (σ2g , σ
2e) and h2.
I Henderson’s estimator and the MLE are both reasonable estimators for agiven GRM K — how should we choose K?
I Pre-GWAS: Pedigree/familial information determines K and describes howindividuals are related.
I Post-GWAS: Molecular genetic data gives fine-grained measure of geneticsimilarity.
5 / 15
Genetic relatedness and LMMs for GWAS data
I For GWAS data, each individual’s genetic data can be encoded in a vectorxi = (xi1, . . . , xim)> ∈ Rm, where xij represents the (frequently standardized)minor allele count for the j-th single nucelotide polymorphism (SNP).
I xij is tri-nary.I m can be in the 100K-1M’s; n may be in the 1-10K’s.
I The GRM is determined by a kernel function K with Kij = K(xi , xj). Thelinear kernel
K(xi , xj) =1m
x>i xj
is the most widely used in practice.
I With the linear kernel, the original model y = g + e can be rewritten as alinear random-effects model (LMM)
y = Xb + e,
where X = (x1, . . . , xn)>, b = (b1, . . . , bm)> ∈ Rm, andb1, . . . , bm ∼ MV(0, σ2
g/m) are iid.
6 / 15
LMMs for heritability estimation: Questions
y = Xb + e, (2)
I Variance components methods for estimating h2 under (2) have emergedas one of the most popular strategies for heritability estimation in GWAS.
I However, a number of challenges have emerged.I Causal SNPs and linkage disequilibrium. In most generative models for
linking GWAS data and outcomes, there is a fixed collection of causal SNPsC ⊆ [m] and b is assumed to be a sparse vector supported on C. If the SNPsxi are highly correlated — i.e. they are in linkage disequilibrium (LD) — theLMM approach can give badly biased estimates for h2 (Speed et al., 2012,Am. J. Hum. Genet.)
I Partitioning heritability. Sometimes it’s desirable to estimate the heritabilityattributable to a subset of SNPs, S ⊆ [m]. To date, there’s no consensus abouthow this should be done and existing solutions have significant drawbacks.
7 / 15
Causal SNPs and linkage disequilibrium
I Simulations show standard LMM estimators for h2 are biased when causalSNPs are located in regions with low (or high) LD.
I Settings:I n = 500, m = 1000.I σ2
e = 0.5.
I b =√
1m (z1, . . . , zm/2, 0, . . . , 0)> ∈ Rm with z1, . . . , zm/2 ∼ N(0, 1) iid. So
C = {1, . . . ,m/2}.
I xi ∼ N(0,Σ) with Σ =
(AR(0.2) 0
0 AR(0.8)
).
I Then σ2g = 0.5 and h2 = 0.5.
h2 h2
0.5 Mean: 0.43295% CI: (0.404,0.460)
Table: Summary of results based on simulating 50 independent datasets.
8 / 15
Partitioning heritabilityI One method for partitioning heritability is to assume a LMM with multiple
variance components (Finucane et al., 2015, Nat. Genet.):
y = XSbS + XSc bSc + e, (3)
where
bi ∼
MV
(0,
σ2S
|S|
), if i ∈ S,
MV(0,
σ2Sc
m−|S|
)if i < S
are all independent.
I The S-partitioned heritability is
h2S
=σ2S
σ2S
+ σ2Sc + σ2
e.
I h2S
can be estimated, for instance, using maximum for the model (3) under aGaussian random-effects assumption.
I However, estimating the total heritability h2 = (σ2S
+ σ2Sc )/(σ2
S+ σ2
Sc + σ2e)
under this model has the same issues noted in the previousslide/simulation.
9 / 15
LMMs for heritability estimation
I Challenges arise when the location of causal SNPs iscorrelated with LD structure.
I Solution:
Remove LD structure, i.e. whiten.
10 / 15
LMMs for heritability estimation
I Challenges arise when the location of causal SNPs iscorrelated with LD structure.
I Solution: Remove LD structure, i.e. whiten.
10 / 15
LMMs for heritability estimation: Mahalanobis kernel
I Our proposal: Use the Mahalanobis kernel to measure genetic similarity.
Kij = x>i Σ−1xj ,
where Cov(xi) = Σ.
I This resolves the causal SNPs/linkage disequilibrium and partitioningheritability problems, and clears a path for more modeling progress andresults.
I Argument:
Kij = x>i Σ−1xj ↔y = Xb + e,with b ∼ MV(0, σ2
gΣ−1/m)↔ ∗ y = XΣbΣ + e,
fixed-effects model,
where XΣ = XΣ−1/2 and bΣ = Σ1/2b.
I The Mahalanobis kernel has been used extensively elsewhere in genetics(e.g. for association testing), but not to our knowledge for heritabilityestimation.
11 / 15
Revisiting Causal SNPs/LD and partitioning heritability
I To find the Mahalanobis-MLE h2Σ, replace X by XΣ−1/2 and then find the
MLE for h2 using the linear kernel, i.e. whiten/decorrelate/standardize thepredictors and compute the usual MLE.
h2 h2 h2Σ
0.5 Mean: 0.432 Mean: 0.49995% CI: (0.404,0.460) 95% CI: (0.473, 0.525)
I Let
Σ =
(ΣS,S ΣS,Sc
Σ>S,Sc ΣSc ,Sc
).
The partitioned heritability is defined via the decomposition
y = gS + gSc + e,
wheregS = XS(bS + Σ−1
S,SΣS,Sc bSc ),
gSc = (XSc − XSΣ−1S,SΣS,Sc )bSc .
I Under the LMM y = Xb + e with b ∼ MV(0, σ2gΣ−1/m), gS and gSc are
uncorrelated.
12 / 15
Why does this work?
I Model misspecification results for variance components estimationproblems (D & Erdogdu, 2016, 2017).
I Meta-result. Assume:
(i) y = Xb + e ∈ Rn
(ii) Cov(e) = σ2e .
(iii) b ∈ Rm is any (fixed or random) vector with ‖b‖2 � 1.
If xi ∼ N(0,Σ) are iid, then the variance component estimators for theMahalanobis kernel (σ2
g , σ2e) are nice:
(i) (σ2g , σ
2e)→ (σ2
g , σ2e), where σ2
g = lim ‖b‖2.(ii) (σ2
g , σ2e) are approximately normal.
I Implication: If genotypes are random, then the Mahalanobis kernel forestimating heritability works even with fixed genetic effects – and, inparticular, when the location of causal SNPs is associated with LDstructure.
13 / 15
Why does this work?I Proof idea: In the fixed-effects model with random genotypes, b inherits
randomness from X .
I Let θ = (σ2g , σ
2e) be the MLE and assume WLOG that xi ∼ N(0, I).
I Key point: Since X D= XU for any m ×m orthogonal matrix U,
θ(y,X)D= θ(y,X),
wherey = X b + e, b ∼ uniform{Sm−1(σ2
g)}.
I In other words, θ = θ(y,X) has the same distribution as θ(y,X), where thedata are drawn from a random-effects model.
I Since b is approximately Gaussian for large m, we can use tools forvariance components estimation in Gaussian random-effects models toapproximate the distribution of θ.
I Finite sample normal approximation results for quadratic forms.
I Other comments:I We can get closed form expressions for the asymptotic variance of θ using
results on the Marcenko-Pastur distribution.
14 / 15