
Statistics for high-dimensional data: p-values and confidence intervals

Peter Bühlmann

Seminar für Statistik, ETH Zürich

June 2014

High-dimensional data
Behavioral economics and genetics (with Ernst Fehr, U. Zurich)

• n = 1'525 persons
• genetic information (SNPs): p ≈ 10^6
• 79 response variables, measuring "behavior"

p ≫ n

goal: find significant associations between behavioral responses and genetic markers

[Figure: number of significant target SNPs per phenotype; x-axis: phenotype index, y-axis: number of significant target SNPs.]

In high-dimensional statistics, a lot of progress has been achieved over the last 8-10 years for

• point estimation
• rates of convergence

but very little work on assigning measures of uncertainty: p-values, confidence intervals

we need uncertainty quantification! (the core of statistics)

goal (regarding the title of the talk):

p-values/confidence intervals for a high-dimensional linear model

(and we can then generalize to other models)

Motif regression and variable selection

for finding HIF1α transcription factor binding sites in DNA sequences (Müller, Meier, PB & Ricci)

for coarse DNA segments i = 1, ..., n:

• predictor X_i = (X_i^(1), ..., X_i^(p)) ∈ R^p: abundance scores of candidate motifs j = 1, ..., p in DNA segment i (using sequence data and computational biology algorithms, e.g. MDSCAN)

• univariate response Y_i ∈ R: binding intensity of HIF1α to the coarse DNA segment (from ChIP-chip experiments)

question: relation between the binding intensity Y and the abundance of short candidate motifs?

→ a linear model is often reasonable: "motif regression" (Conlon, X.S. Liu, Lieb & J.S. Liu, 2003)

Y_i = Σ_{j=1}^p β^0_j X_i^(j) + ε_i,   i = 1, ..., n = 143,   p = 195

goal: variable selection and significance of variables → find the relevant motifs among the p = 195 candidates

Lasso for variable selection:

S(λ) = {j; β_j(λ) ≠ 0}, an estimate for S_0 = {j; β^0_j ≠ 0}

no significance testing involved; it's convex optimization only! and it's very popular (Meinshausen & PB, 2006; Zhao & Yu, 2006; Wainwright, 2009; ...)

for motif regression (finding HIF1α transcription factor binding sites), n = 143, p = 195

→ the Lasso selects 26 covariates when choosing λ = λ_CV via cross-validation, with resulting R^2 ≈ 50%

i.e. 26 interesting candidate motifs
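For concreteness, a minimal R sketch of this selection step using glmnet; X (the n × p matrix of motif abundance scores) and y (the binding intensities) are placeholders for the actual data, which are not shown here, and the exact number of selected variables depends on the cross-validation run:

library(glmnet)

## Lasso with lambda chosen by 10-fold cross-validation
cv.fit   <- cv.glmnet(X, y, alpha = 1)                      # alpha = 1: Lasso penalty
beta.hat <- as.numeric(coef(cv.fit, s = "lambda.min"))[-1]  # drop the intercept
S.hat    <- which(beta.hat != 0)                            # estimated active set S(lambda_CV)
length(S.hat)                                               # number of selected motifs (26 in the talk)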

how significant are the findings?


estimated coefficients β(λ_CV)

[Figure: estimated coefficients (between 0 and about 0.2) plotted against the variable index, original data.]

p-values for H_{0,j}: β^0_j = 0 ?

P-values for high-dimensional linear models

Y = Xβ^0 + ε

goal: statistical hypothesis testing

H_{0,j}: β^0_j = 0   or   H_{0,G}: β^0_j = 0 for all j ∈ G ⊆ {1, ..., p}

background: if we could handle the asymptotic distribution of the Lasso β(λ) under the null hypothesis → we could construct p-values

this is very difficult! the asymptotic distribution of β has some point mass at zero, ... (Knight and Fu (2000) for p < ∞ and n → ∞)

→ standard bootstrapping and subsampling cannot be used either

but there are recent proposals using adaptations of standard resampling methods (Chatterjee & Lahiri, 2013; Liu & Yu, 2013) → non-uniformity/super-efficiency issues remain ...

Low-dimensional projections and bias correction, or: de-sparsifying the Lasso estimator (related work by Zhang and Zhang, 2011)

motivation:

β_{OLS,j} = projection of Y onto the residuals (X_j − X_{−j} γ^(j)_{OLS})

the projection is not well defined if p > n → use "regularized" residuals from the Lasso on the X-variables

Z_j = X_j − X_{−j} γ^(j)_{Lasso},   γ^(j) = argmin_γ ‖X_j − X_{−j}γ‖ + λ_j‖γ‖_1

using Y = Xβ^0 + ε:

Z_j^T Y = Z_j^T X_j β^0_j + Σ_{k≠j} Z_j^T X_k β^0_k + Z_j^T ε

and hence

Z_j^T Y / (Z_j^T X_j) = β^0_j + Σ_{k≠j} (Z_j^T X_k / Z_j^T X_j) β^0_k   [bias]   +   Z_j^T ε / (Z_j^T X_j)   [noise component]

→   b_j = Z_j^T Y / (Z_j^T X_j) − Σ_{k≠j} (Z_j^T X_k / Z_j^T X_j) β_{Lasso;k}   [Lasso-estimated bias correction]

b_j is not sparse! ... and this is crucial to obtain the Gaussian limit; nevertheless it is "optimal" (see later)

• target: the low-dimensional component β^0_j
• η := {β^0_k; k ≠ j} is a high-dimensional nuisance parameter

→ exactly as in semiparametric modeling! and sparsely estimated (e.g. with the Lasso)
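A minimal R sketch of this construction for a single coordinate j; X, y are placeholders for the data, beta.lasso is an initial Lasso fit (e.g. beta.hat from the cv.glmnet sketch above), and the tuning parameter λ_j of the nodewise regression is chosen by cross-validation purely for illustration (the theory prescribes its own choice):

library(glmnet)

desparsify_one <- function(X, y, beta.lasso, j) {
  ## "regularized" residuals Z_j: Lasso of X_j on X_{-j}
  cv.j    <- cv.glmnet(X[, -j], X[, j], alpha = 1)
  gamma.j <- as.numeric(coef(cv.j, s = "lambda.min"))[-1]
  Z.j     <- X[, j] - X[, -j] %*% gamma.j
  ## projection of Y onto Z_j, with Lasso-estimated bias correction
  den  <- sum(Z.j * X[, j])                                 # Z_j^T X_j
  bias <- sum(crossprod(Z.j, X[, -j]) * beta.lasso[-j]) / den
  sum(Z.j * y) / den - bias                                 # b_j
}

## illustrative call for the 5th variable:
## b5 <- desparsify_one(X, y, beta.lasso = beta.hat, j = 5)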


=⇒ let’s turn to the blackboard!

A general principle: de-sparsifying via “inversion” of KKT

KKT conditions: sub-differential of the objective function ‖Y − Xβ‖_2^2/n + λ‖β‖_1:

−X^T (Y − Xβ)/n + λτ = 0   (with Y = Xβ^0 + ε),
‖τ‖_∞ ≤ 1,   τ_j = sign(β_j) if β_j ≠ 0.

with Σ = X^T X/n:   Σ(β − β^0) + λτ = X^T ε/n

"regularized inverse" of Σ, denoted by Θ (not e.g. the GLasso):

β − β^0 + Θλτ = Θ X^T ε/n − ∆,   ∆ = (ΘΣ − I)(β − β^0)

new estimator: b = β + Θλτ = β + Θ X^T (Y − Xβ)/n

→ b is exactly the same estimator as before (based on the low-dimensional projections using the residual vectors Z_j)

... when taking Θ (the "regularized inverse" of Σ) with rows built from the ("nodewise") Lasso-estimated coefficients of X_j versus X_{−j}:

γ_j = argmin_{γ ∈ R^{p−1}} ‖X_j − X_{−j}γ‖_2^2/n + 2λ_j‖γ‖_1

Denote by

C = (  1          −γ_{1,2}   · · ·   −γ_{1,p}
      −γ_{2,1}     1         · · ·   −γ_{2,p}
       ...          ...       . . .    ...
      −γ_{p,1}    −γ_{p,2}   · · ·    1      )

and let

T^2 = diag(τ_1^2, ..., τ_p^2),   τ_j^2 = ‖X_j − X_{−j} γ_j‖_2^2/n + λ_j‖γ_j‖_1,

Θ_Lasso = T^{−2} C   (not symmetric ...!)
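The same construction in matrix form, as a hedged R sketch: it runs all p nodewise Lasso regressions (again with cross-validated λ_j, purely for illustration), builds Θ = T^{−2} C, and returns b = β + Θ X^T (Y − Xβ)/n; a naive loop like this is slow for large p, and the hdi package implements the construction properly.

library(glmnet)

desparsified_lasso <- function(X, y, beta.lasso) {
  n <- nrow(X); p <- ncol(X)
  C    <- diag(p)                   # rows: 1 on the diagonal, -gamma_{j,k} off-diagonal
  tau2 <- numeric(p)
  for (j in 1:p) {
    cv.j    <- cv.glmnet(X[, -j], X[, j], alpha = 1)
    lam.j   <- cv.j$lambda.min
    gamma.j <- as.numeric(coef(cv.j, s = "lambda.min"))[-1]
    C[j, -j] <- -gamma.j
    resid.j  <- X[, j] - X[, -j] %*% gamma.j
    tau2[j]  <- sum(resid.j^2) / n + lam.j * sum(abs(gamma.j))  # tau_j^2
  }
  Theta <- sweep(C, 1, tau2, "/")   # Theta = T^{-2} C (not symmetric)
  b <- as.numeric(beta.lasso + Theta %*% crossprod(X, y - X %*% beta.lasso) / n)
  list(b = b, Theta = Theta)
}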

“inverting” the KKT conditions is a very general principle; the principle can be used for GLMs and many other models

Asymptotic pivot and optimality

Theorem (van de Geer, PB & Ritov, 2013)

√n (b_j − β^0_j) ⇒ N(0, σ_ε^2 Ω_jj)   (j = 1, ..., p)

Ω_jj has an explicit expression ≈ (Σ^{-1})_jj: optimal! reaching the semiparametric information bound

→ asymptotically optimal p-values and confidence intervals, if we assume:

• sub-Gaussian design (i.i.d. sub-Gaussian rows of X) whose population covariance Cov(X) = Σ has minimal eigenvalue ≥ M > 0
• sparsity of the regression Y vs. X: s_0 = o(√(n)/log(p))   ("quite sparse")
• sparsity of the design: Σ^{-1} sparse, i.e. sparse regressions X_j vs. X_{−j}: s_j = o(n/log(p))   ("maybe restrictive")

It is optimal! (Cramér-Rao)
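From the theorem, p-values and confidence intervals follow directly from the Gaussian pivot; a short sketch, assuming b and Theta come from the construction above, sigma.hat is an estimate of σ_ε (e.g. from the scaled Lasso), and Ω is taken as the plug-in Θ Σ Θ^T:

Sigma.hat <- crossprod(X) / nrow(X)                   # X^T X / n
Omega     <- Theta %*% Sigma.hat %*% t(Theta)         # plug-in estimate of Omega
se   <- sigma.hat * sqrt(diag(Omega) / nrow(X))       # standard errors of b_j
pval <- 2 * pnorm(abs(b) / se, lower.tail = FALSE)    # p-values for H_{0,j}: beta^0_j = 0
ci   <- cbind(lower = b - qnorm(0.975) * se,
              upper = b + qnorm(0.975) * se)          # 95% confidence intervals
pval.bonf <- pmin(pval * length(b), 1)                # crude FWER adjustment (Bonferroni)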

for designs with Σ^{-1} non-sparse:
• Ridge projection (PB, 2013): good type I error control, but not optimal in terms of power
• a convex program instead of the Lasso for Z_j (Javanmard & Montanari, 2013; MSc. thesis Dezeure, 2013); Javanmard & Montanari prove optimality

so far: no convincing empirical evidence that we can deal well with such scenarios (Σ^{-1} non-sparse)

Uniform convergence:

√n (b_j − β^0_j) ⇒ N(0, σ_ε^2 Ω_jj)   (j = 1, ..., p)

the convergence is uniform over B(s_0) = {β; ‖β‖_0 ≤ s_0}

→ honest tests and confidence regions!

and we can avoid post-model-selection inference (cf. Pötscher and Leeb)

Simultaneous inference over all components:

√n (b − β^0) ≈ (W_1, ..., W_p) ∼ N_p(0, σ_ε^2 Ω)

→ can construct p-values for:
H_{0,G} with any G, using the test statistic max_{j∈G} |b_j|, since the covariance structure Ω is known

and we can easily do efficient multiple testing adjustment, since the covariance structure Ω is known!

Alternatives?
• versions of bootstrapping (Chatterjee & Lahiri, 2013)
  → super-efficiency phenomenon, i.e. non-uniform convergence (Joe Hodges):
  • good for estimating the zeroes (i.e., j ∈ S_0^c with β^0_j = 0)
  • bad for estimating the non-zeroes (i.e., j ∈ S_0 with β^0_j ≠ 0)

• multiple sample splitting (Meinshausen, Meier & PB, 2009): split the sample repeatedly into two halves:
  • select variables on the first half
  • compute p-values on the second half, based on the selected variables
  → avoids (because of the sample splitting) over-optimistic p-values, but potentially suffers in terms of power; a single-split sketch follows below
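A single-split sketch of the idea in R (the actual procedure repeats the random split many times and aggregates the resulting p-values; here just one split, with Bonferroni correction inside the selected set, and X, y as before):

library(glmnet)

set.seed(1)
n     <- nrow(X)
half1 <- sample(n, floor(n / 2)); half2 <- setdiff(1:n, half1)

## 1) variable selection on the first half (Lasso with cross-validation)
cv1   <- cv.glmnet(X[half1, ], y[half1], alpha = 1)
S.hat <- which(as.numeric(coef(cv1, s = "lambda.min"))[-1] != 0)

## 2) classical p-values on the second half, using only the selected variables
fit2     <- lm(y[half2] ~ X[half2, S.hat, drop = FALSE])
pval.raw <- summary(fit2)$coefficients[-1, 4]       # t-test p-values, intercept dropped
pval     <- pmin(pval.raw * length(S.hat), 1)       # Bonferroni within S.hat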

• covariance test (Lockhart, Taylor, Tibshirani & Tibshirani, 2014)
• no sparsity assumption on Σ^{-1} (Javanmard and Montanari, 2014)

Some empirical results (Dezeure, PB, Meier & Meinshausen, in progress)

compare power and control of the familywise error rate (FWER); always p = 500, n = 100 and s_0 = 15

[Figure: FWER and power (each on a 0 to 1 scale) for the methods Covtest, JM, MS-Split, Ridge and Despars-Lasso.]

confidence intervals
• for β^0_j (j ∈ S_0)
• for β^0_j = 0 (j ∈ S_0^c), where the intervals exhibit the worst coverage (for each method)

[Figure: empirical coverage (in %) of confidence intervals for the methods Jm2013, liuyu, Res-Boot, MS-Split, Ridge, Lasso-Pro Z&Z and Lasso-Pro; scenario: Toeplitz design, s_0 = 3, coefficients from U[0,2].]

Motif regression example

one significant variable with both the "de-sparsified Lasso" and multi sample splitting

[Figure: motif regression, estimated coefficients against the variable index; one highlighted variable/motif has an FWER-adjusted p-value of 0.006, while the other marked variable has a p-value clearly larger than 0.05.]

(the significant variable corresponds to a known true motif)

for data sets with p ≈ 4'000-10'000 and n ≈ 100 → often no significant variable, because the ratio log(p)/n is too extreme

Behavioral economics and genome-wide association (with Ernst Fehr, University of Zurich)

• n = 1'525 probands (all students!)
• m = 79 response variables measuring various behavioral characteristics (e.g. risk aversion) from well-designed experiments
• 460 target SNPs (as a proxy for ≈ 10^6 SNPs): 1'380 parameters per response (but only 1'341 meaningful parameters)

model: multivariate linear model

Y_{n×m} = X_{n×p} β_{p×m} + ε_{n×m},   with Y the responses, X the SNP data, and ε the error

although p < n, the design matrix X (with categorical values ∈ {1,2,3}) does not have full rank

Y_{n×m} = X_{n×p} β_{p×m} + ε_{n×m}

interested in p-values for

H_{0,jk}: β_{jk} = 0 versus H_{A,jk}: β_{jk} ≠ 0,
H_{0,G}: β_{jk} = 0 for all (j,k) ∈ G versus H_{A,G} = H_{0,G}^c

adjusted to control the familywise error rate (i.e. a conservative criterion); in total we consider 110'857 hypotheses

we test for non-marginal regression coefficients → "predictive" GWAS

there is structure!

• 79 response experiments
• 23 chromosomes per response experiment
• 20 target SNPs per chromosome = 460 target SNPs

[Tree diagram: a global hypothesis at the top; below it the 79 responses, below each response its 23 chromosomes, and below each chromosome its 20 target SNPs.]

do hierarchical FWER adjustment (Meinshausen, 2008)

[Tree diagram as above, with each node marked as significant or not significant.]

1. test the global hypothesis
2. if significant: test all single-response hypotheses
3. for the significant responses: test all single-chromosome hypotheses
4. for the significant chromosomes: test all target SNPs

→ powerful multiple testing with data-dependent adaptation of the resolution level (our analysis with 20 target SNPs per chromosome is ad hoc)

cf. general sequential testing principle (Goeman & Solari, 2010)
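A generic top-down sketch of this sequential scheme in R; test.group is a placeholder for any group-level p-value (e.g. the max-type statistic described next), the tree is a nested list of groups, and the proportional level adjustment is my reading of Meinshausen (2008), so treat it as an assumption rather than the exact rule used in the analysis:

## hierarchical testing: a child group is only tested if its parent is rejected
## node: list(members = <indices>, children = <list of nodes or NULL>)
## test.group(members): returns an unadjusted p-value for H_{0,G}
hierarchical_test <- function(node, test.group, p.total, alpha = 0.05) {
  p.G   <- test.group(node$members)
  p.adj <- min(1, p.G * p.total / length(node$members))  # assumed adjustment: factor p.total / |G|
  out <- list(list(members = node$members, p.adj = p.adj))
  if (p.adj <= alpha && length(node$children) > 0) {
    for (child in node$children)
      out <- c(out, hierarchical_test(child, test.group, p.total, alpha))
  }
  out
}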

testing a group hypothesis:

H_{0,G}: β^0_j = 0 for all j ∈ G

test statistic: max_{j∈G} |b_j|

since under H_{0,G}:

√n b_G = N_G(0, σ_ε^2 Ω_G) + ∆_G,   ∆_G = (∆_j; j ∈ G),   √n ‖∆‖_∞ = o_P(1)

thus:

max_{j∈G} √n |b_j| ⇒ σ_ε max_{j∈G} |W_j|,   (W_1, ..., W_p) ∼ N_p(0, Ω)

and we can easily simulate max_{j∈G} |W_j|
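A Monte Carlo sketch of this group test in R; b, Theta (and hence Omega), sigma.hat and n are assumed available from the earlier construction, and MASS::mvrnorm is used for the multivariate Gaussian draws:

library(MASS)

group_pvalue <- function(b, G, Omega, sigma.hat, n, B = 10000) {
  stat <- max(sqrt(n) * abs(b[G]))                     # observed max_{j in G} sqrt(n)|b_j|
  W    <- mvrnorm(B, mu = rep(0, length(G)),
                  Sigma = Omega[G, G, drop = FALSE])   # W_G ~ N(0, Omega_G)
  null.stat <- sigma.hat * apply(abs(W), 1, max)       # sigma_eps * max_{j in G} |W_j|
  mean(null.stat >= stat)                              # Monte Carlo p-value for H_{0,G}
}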

number of significant SNP parameters per response

[Figure: number of significant target SNPs per phenotype; x-axis: phenotype index, y-axis: number of significant target SNPs.]

response 40 has the most significant (levels of) target SNPs

Conclusions

we can construct asymptotically optimal p-values and confidence intervals for low-dimensional targets in high-dimensional models

R-package hdi ("high-dimensional inference"; Meier, 2013)

assuming/based on suitable conditions:
• sparsity of Y vs. X: s_0 = o(√(n)/log(p))
• sparsity of X_j vs. X_{−j} (j = 1, ..., p): max_j s_j = o(n/log(p))
• the design matrix X is not too ill-posed (e.g. restricted eigenvalue assumption, or a nice population covariance)

these conditions are typically uncheckable ... → confirmatory high-dimensional inference remains challenging

Thank you!



R-package: hdi (Meier, 2013)
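A hedged usage sketch; the function names below (lasso.proj, multi.split) and the confint method are taken from the CRAN releases of hdi as I recall them and may differ from the 2013 R-Forge version, so check them against the installed package. X and y are the design matrix and response, as before:

library(hdi)

## de-sparsified ("projected") Lasso: p-values and confidence intervals
fit.proj <- lasso.proj(x = X, y = y)
fit.proj$pval.corr                       # multiplicity-adjusted p-values
confint(fit.proj, level = 0.95)          # confidence intervals for the beta^0_j

## multiple sample splitting
fit.split <- multi.split(x = X, y = y)
fit.split$pval.corr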

References:

• Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methodology, Theory and Applications. Springer.
• Meinshausen, N., Meier, L. and Bühlmann, P. (2009). P-values for high-dimensional regression. Journal of the American Statistical Association 104, 1671-1681.
• Bühlmann, P. (2013). Statistical significance in high-dimensional linear models. Bernoulli 19, 1212-1242.
• van de Geer, S., Bühlmann, P. and Ritov, Y. (2013). On asymptotically optimal confidence regions and tests for high-dimensional models. Preprint arXiv:1303.0518v1.
• Meier, L. (2013). hdi: High-dimensional inference. R package available from R-Forge.
