
Statistics for high-dimensional data: p-values and confidence intervals

Peter Bühlmann

Seminar für Statistik, ETH Zürich

June 2014

High-dimensional data
Behavioral economics and genetics (with Ernst Fehr, U. Zurich)

• n = 1'525 persons
• genetic information (SNPs): p ≈ 10^6
• 79 response variables, measuring "behavior"

p ≫ n

goal: find significant associations between behavioral responses and genetic markers

[Figure: number of significant target SNPs per phenotype; x-axis: phenotype index, y-axis: number of significant target SNPs.]

In high-dimensional statistics, a lot of progress has been achieved over the last 8-10 years for

• point estimation
• rates of convergence

but very little work on assigning measures of uncertainty: p-values, confidence intervals

we need uncertainty quantification! (the core of statistics)

goal (regarding the title of the talk):

p-values/confidence intervals for a high-dimensional linear model

(and we can then generalize to other models)

Motif regression and variable selection

for finding HIF1α transcription factor binding sites in DNA sequences (Müller, Meier, PB & Ricci)

for coarse DNA segments i = 1, ..., n:

• predictor X_i = (X_i^(1), ..., X_i^(p)) ∈ R^p: abundance scores of candidate motifs j = 1, ..., p in DNA segment i (using sequence data and computational biology algorithms, e.g. MDSCAN)

• univariate response Y_i ∈ R: binding intensity of HIF1α to the coarse DNA segment (from ChIP-chip experiments)

question: relation between the binding intensity Y and the abundance of short candidate motifs?

→ a linear model is often reasonable: "motif regression" (Conlon, X.S. Liu, Lieb & J.S. Liu, 2003)

Y_i = Σ_{j=1}^p β^0_j X_i^(j) + ε_i,   i = 1, ..., n = 143,   p = 195

goal: variable selection and significance of variables → find the relevant motifs among the p = 195 candidates

Lasso for variable selection:

S(λ) = {j; β_j(λ) ≠ 0}, an estimate for S_0 = {j; β^0_j ≠ 0}

no significance testing involved; it's convex optimization only! and it's very popular (Meinshausen & PB, 2006; Zhao & Yu, 2006; Wainwright, 2009; ...)

for motif regression (finding HIF1α transcription factor binding sites), n = 143, p = 195

→ the Lasso selects 26 covariates when choosing λ = λ_CV via cross-validation, with resulting R^2 ≈ 50%

i.e. 26 interesting candidate motifs
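For concreteness, a minimal R sketch of this selection step using glmnet; X (the n × p matrix of motif abundance scores) and y (the binding intensities) are placeholders for the actual data, which are not shown here, and the exact number of selected variables depends on the cross-validation run:

library(glmnet)

## Lasso with lambda chosen by 10-fold cross-validation
cv.fit   <- cv.glmnet(X, y, alpha = 1)                      # alpha = 1: Lasso penalty
beta.hat <- as.numeric(coef(cv.fit, s = "lambda.min"))[-1]  # drop the intercept
S.hat    <- which(beta.hat != 0)                            # estimated active set S(lambda_CV)
length(S.hat)                                               # number of selected motifs (26 in the talk)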

how significant are the findings?


estimated coefficients β(λ_CV)

[Figure: estimated coefficients (between 0 and about 0.2) plotted against the variable index, original data.]

p-values for H_{0,j}: β^0_j = 0 ?

P-values for high-dimensional linear models

Y = Xβ^0 + ε

goal: statistical hypothesis testing

H_{0,j}: β^0_j = 0   or   H_{0,G}: β^0_j = 0 for all j ∈ G ⊆ {1, ..., p}

background: if we could handle the asymptotic distribution of the Lasso β(λ) under the null hypothesis → we could construct p-values

this is very difficult! the asymptotic distribution of β has some point mass at zero, ... (Knight and Fu (2000) for p < ∞ and n → ∞)

→ standard bootstrapping and subsampling cannot be used either

but there are recent proposals using adaptations of standard resampling methods (Chatterjee & Lahiri, 2013; Liu & Yu, 2013) → non-uniformity/super-efficiency issues remain ...

Low-dimensional projections and bias correction, or: de-sparsifying the Lasso estimator (related work by Zhang and Zhang, 2011)

motivation:

β_{OLS,j} = projection of Y onto the residuals (X_j − X_{−j} γ^(j)_{OLS})

the projection is not well defined if p > n → use "regularized" residuals from the Lasso on the X-variables

Z_j = X_j − X_{−j} γ^(j)_{Lasso},   γ^(j) = argmin_γ ‖X_j − X_{−j}γ‖ + λ_j‖γ‖_1

using Y = Xβ^0 + ε:

Z_j^T Y = Z_j^T X_j β^0_j + Σ_{k≠j} Z_j^T X_k β^0_k + Z_j^T ε

and hence

Z_j^T Y / (Z_j^T X_j) = β^0_j + Σ_{k≠j} (Z_j^T X_k / Z_j^T X_j) β^0_k   [bias]   +   Z_j^T ε / (Z_j^T X_j)   [noise component]

→   b_j = Z_j^T Y / (Z_j^T X_j) − Σ_{k≠j} (Z_j^T X_k / Z_j^T X_j) β_{Lasso;k}   [Lasso-estimated bias correction]

b_j is not sparse! ... and this is crucial to obtain the Gaussian limit; nevertheless it is "optimal" (see later)

• target: the low-dimensional component β^0_j
• η := {β^0_k; k ≠ j} is a high-dimensional nuisance parameter

→ exactly as in semiparametric modeling! and sparsely estimated (e.g. with the Lasso)
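A minimal R sketch of this construction for a single coordinate j; X, y are placeholders for the data, beta.lasso is an initial Lasso fit (e.g. beta.hat from the cv.glmnet sketch above), and the tuning parameter λ_j of the nodewise regression is chosen by cross-validation purely for illustration (the theory prescribes its own choice):

library(glmnet)

desparsify_one <- function(X, y, beta.lasso, j) {
  ## "regularized" residuals Z_j: Lasso of X_j on X_{-j}
  cv.j    <- cv.glmnet(X[, -j], X[, j], alpha = 1)
  gamma.j <- as.numeric(coef(cv.j, s = "lambda.min"))[-1]
  Z.j     <- X[, j] - X[, -j] %*% gamma.j
  ## projection of Y onto Z_j, with Lasso-estimated bias correction
  den  <- sum(Z.j * X[, j])                                 # Z_j^T X_j
  bias <- sum(crossprod(Z.j, X[, -j]) * beta.lasso[-j]) / den
  sum(Z.j * y) / den - bias                                 # b_j
}

## illustrative call for the 5th variable:
## b5 <- desparsify_one(X, y, beta.lasso = beta.hat, j = 5)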


=⇒ let’s turn to the blackboard!

A general principle: de-sparsifying via “inversion” of KKT

KKT conditions: sub-differential of the objective function ‖Y − Xβ‖_2^2/n + λ‖β‖_1:

−X^T (Y − Xβ)/n + λτ = 0   (with Y = Xβ^0 + ε),
‖τ‖_∞ ≤ 1,   τ_j = sign(β_j) if β_j ≠ 0.

with Σ = X^T X/n:   Σ(β − β^0) + λτ = X^T ε/n

"regularized inverse" of Σ, denoted by Θ (not e.g. the GLasso):

β − β^0 + Θλτ = Θ X^T ε/n − ∆,   ∆ = (ΘΣ − I)(β − β^0)

new estimator: b = β + Θλτ = β + Θ X^T (Y − Xβ)/n

→ b is exactly the same estimator as before (based on the low-dimensional projections using the residual vectors Z_j)

... when taking Θ (the "regularized inverse" of Σ) with rows built from the ("nodewise") Lasso-estimated coefficients of X_j versus X_{−j}:

γ_j = argmin_{γ ∈ R^{p−1}} ‖X_j − X_{−j}γ‖_2^2/n + 2λ_j‖γ‖_1

Denote by

C = (  1          −γ_{1,2}   · · ·   −γ_{1,p}
      −γ_{2,1}     1         · · ·   −γ_{2,p}
       ...          ...       . . .    ...
      −γ_{p,1}    −γ_{p,2}   · · ·    1      )

and let

T^2 = diag(τ_1^2, ..., τ_p^2),   τ_j^2 = ‖X_j − X_{−j} γ_j‖_2^2/n + λ_j‖γ_j‖_1,

Θ_Lasso = T^{−2} C   (not symmetric ...!)
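The same construction in matrix form, as a hedged R sketch: it runs all p nodewise Lasso regressions (again with cross-validated λ_j, purely for illustration), builds Θ = T^{−2} C, and returns b = β + Θ X^T (Y − Xβ)/n; a naive loop like this is slow for large p, and the hdi package implements the construction properly.

library(glmnet)

desparsified_lasso <- function(X, y, beta.lasso) {
  n <- nrow(X); p <- ncol(X)
  C    <- diag(p)                   # rows: 1 on the diagonal, -gamma_{j,k} off-diagonal
  tau2 <- numeric(p)
  for (j in 1:p) {
    cv.j    <- cv.glmnet(X[, -j], X[, j], alpha = 1)
    lam.j   <- cv.j$lambda.min
    gamma.j <- as.numeric(coef(cv.j, s = "lambda.min"))[-1]
    C[j, -j] <- -gamma.j
    resid.j  <- X[, j] - X[, -j] %*% gamma.j
    tau2[j]  <- sum(resid.j^2) / n + lam.j * sum(abs(gamma.j))  # tau_j^2
  }
  Theta <- sweep(C, 1, tau2, "/")   # Theta = T^{-2} C (not symmetric)
  b <- as.numeric(beta.lasso + Theta %*% crossprod(X, y - X %*% beta.lasso) / n)
  list(b = b, Theta = Theta)
}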

“inverting” the KKT conditions is a very general principle; the principle can be used for GLMs and many other models

Asymptotic pivot and optimality

Theorem (van de Geer, PB & Ritov, 2013)

√n (b_j − β^0_j) ⇒ N(0, σ_ε^2 Ω_jj)   (j = 1, ..., p)

Ω_jj has an explicit expression ≈ (Σ^{-1})_jj: optimal! reaching the semiparametric information bound

→ asymptotically optimal p-values and confidence intervals, if we assume:

• sub-Gaussian design (i.i.d. sub-Gaussian rows of X) whose population covariance Cov(X) = Σ has minimal eigenvalue ≥ M > 0
• sparsity of the regression Y vs. X: s_0 = o(√(n)/log(p))   ("quite sparse")
• sparsity of the design: Σ^{-1} sparse, i.e. sparse regressions X_j vs. X_{−j}: s_j = o(n/log(p))   ("maybe restrictive")

It is optimal! (Cramér-Rao)
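From the theorem, p-values and confidence intervals follow directly from the Gaussian pivot; a short sketch, assuming b and Theta come from the construction above, sigma.hat is an estimate of σ_ε (e.g. from the scaled Lasso), and Ω is taken as the plug-in Θ Σ Θ^T:

Sigma.hat <- crossprod(X) / nrow(X)                   # X^T X / n
Omega     <- Theta %*% Sigma.hat %*% t(Theta)         # plug-in estimate of Omega
se   <- sigma.hat * sqrt(diag(Omega) / nrow(X))       # standard errors of b_j
pval <- 2 * pnorm(abs(b) / se, lower.tail = FALSE)    # p-values for H_{0,j}: beta^0_j = 0
ci   <- cbind(lower = b - qnorm(0.975) * se,
              upper = b + qnorm(0.975) * se)          # 95% confidence intervals
pval.bonf <- pmin(pval * length(b), 1)                # crude FWER adjustment (Bonferroni)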

for designs with Σ^{-1} non-sparse:
• Ridge projection (PB, 2013): good type I error control, but not optimal in terms of power
• a convex program instead of the Lasso for Z_j (Javanmard & Montanari, 2013; MSc. thesis Dezeure, 2013); Javanmard & Montanari prove optimality

so far: no convincing empirical evidence that we can deal well with such scenarios (Σ^{-1} non-sparse)

Uniform convergence:

√n (b_j − β^0_j) ⇒ N(0, σ_ε^2 Ω_jj)   (j = 1, ..., p)

the convergence is uniform over B(s_0) = {β; ‖β‖_0 ≤ s_0}

→ honest tests and confidence regions!

and we can avoid post-model-selection inference (cf. Pötscher and Leeb)

Simultaneous inference over all components:

√n (b − β^0) ≈ (W_1, ..., W_p) ∼ N_p(0, σ_ε^2 Ω)

→ can construct p-values for:
H_{0,G} with any G, using the test statistic max_{j∈G} |b_j|, since the covariance structure Ω is known

and we can easily do efficient multiple testing adjustment, since the covariance structure Ω is known!

Alternatives?
• versions of bootstrapping (Chatterjee & Lahiri, 2013)
  → super-efficiency phenomenon, i.e. non-uniform convergence (Joe Hodges):
  • good for estimating the zeroes (i.e., j ∈ S_0^c with β^0_j = 0)
  • bad for estimating the non-zeroes (i.e., j ∈ S_0 with β^0_j ≠ 0)

• multiple sample splitting (Meinshausen, Meier & PB, 2009): split the sample repeatedly into two halves:
  • select variables on the first half
  • compute p-values on the second half, based on the selected variables
  → avoids (because of the sample splitting) over-optimistic p-values, but potentially suffers in terms of power; a single-split sketch follows below
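A single-split sketch of the idea in R (the actual procedure repeats the random split many times and aggregates the resulting p-values; here just one split, with Bonferroni correction inside the selected set, and X, y as before):

library(glmnet)

set.seed(1)
n     <- nrow(X)
half1 <- sample(n, floor(n / 2)); half2 <- setdiff(1:n, half1)

## 1) variable selection on the first half (Lasso with cross-validation)
cv1   <- cv.glmnet(X[half1, ], y[half1], alpha = 1)
S.hat <- which(as.numeric(coef(cv1, s = "lambda.min"))[-1] != 0)

## 2) classical p-values on the second half, using only the selected variables
fit2     <- lm(y[half2] ~ X[half2, S.hat, drop = FALSE])
pval.raw <- summary(fit2)$coefficients[-1, 4]       # t-test p-values, intercept dropped
pval     <- pmin(pval.raw * length(S.hat), 1)       # Bonferroni within S.hat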

• covariance test (Lockhart, Taylor, Tibshirani & Tibshirani, 2014)
• no sparsity assumption on Σ^{-1} (Javanmard and Montanari, 2014)

Some empirical results (Dezeure, PB, Meier & Meinshausen, in progress)

compare power and control of the familywise error rate (FWER); always p = 500, n = 100 and s_0 = 15

[Figure: FWER and power (each on a 0 to 1 scale) for the methods Covtest, JM, MS-Split, Ridge and Despars-Lasso.]

confidence intervals
• for β^0_j (j ∈ S_0)
• for β^0_j = 0 (j ∈ S_0^c), where the intervals exhibit the worst coverage (for each method)

[Figure: empirical coverage (in %) of confidence intervals for the methods Jm2013, liuyu, Res-Boot, MS-Split, Ridge, Lasso-Pro Z&Z and Lasso-Pro; scenario: Toeplitz design, s_0 = 3, coefficients from U[0,2].]

Motif regression example

one significant variable with both the "de-sparsified Lasso" and multi sample splitting

[Figure: motif regression, estimated coefficients against the variable index; one highlighted variable/motif has an FWER-adjusted p-value of 0.006, while the other marked variable has a p-value clearly larger than 0.05.]

(the significant variable corresponds to a known true motif)

for data sets with p ≈ 4'000-10'000 and n ≈ 100 → often no significant variable, because the ratio log(p)/n is too extreme

Behavioral economics and genome-wide association (with Ernst Fehr, University of Zurich)

• n = 1'525 probands (all students!)
• m = 79 response variables measuring various behavioral characteristics (e.g. risk aversion) from well-designed experiments
• 460 target SNPs (as a proxy for ≈ 10^6 SNPs): 1'380 parameters per response (but only 1'341 meaningful parameters)

model: multivariate linear model

Y_{n×m} = X_{n×p} β_{p×m} + ε_{n×m},   with Y the responses, X the SNP data, and ε the error

although p < n, the design matrix X (with categorical values ∈ {1,2,3}) does not have full rank

Y_{n×m} = X_{n×p} β_{p×m} + ε_{n×m}

interested in p-values for

H_{0,jk}: β_{jk} = 0 versus H_{A,jk}: β_{jk} ≠ 0,
H_{0,G}: β_{jk} = 0 for all (j,k) ∈ G versus H_{A,G} = H_{0,G}^c

adjusted to control the familywise error rate (i.e. a conservative criterion); in total we consider 110'857 hypotheses

we test for non-marginal regression coefficients → "predictive" GWAS

there is structure!

• 79 response experiments
• 23 chromosomes per response experiment
• 20 target SNPs per chromosome = 460 target SNPs

[Tree diagram: a global hypothesis at the top; below it the 79 responses, below each response its 23 chromosomes, and below each chromosome its 20 target SNPs.]

do hierarchical FWER adjustment (Meinshausen, 2008)

[Tree diagram as above, with each node marked as significant or not significant.]

1. test the global hypothesis
2. if significant: test all single-response hypotheses
3. for the significant responses: test all single-chromosome hypotheses
4. for the significant chromosomes: test all target SNPs

→ powerful multiple testing with data-dependent adaptation of the resolution level (our analysis with 20 target SNPs per chromosome is ad hoc)

cf. general sequential testing principle (Goeman & Solari, 2010)
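A generic top-down sketch of this sequential scheme in R; test.group is a placeholder for any group-level p-value (e.g. the max-type statistic described next), the tree is a nested list of groups, and the proportional level adjustment is my reading of Meinshausen (2008), so treat it as an assumption rather than the exact rule used in the analysis:

## hierarchical testing: a child group is only tested if its parent is rejected
## node: list(members = <indices>, children = <list of nodes or NULL>)
## test.group(members): returns an unadjusted p-value for H_{0,G}
hierarchical_test <- function(node, test.group, p.total, alpha = 0.05) {
  p.G   <- test.group(node$members)
  p.adj <- min(1, p.G * p.total / length(node$members))  # assumed adjustment: factor p.total / |G|
  out <- list(list(members = node$members, p.adj = p.adj))
  if (p.adj <= alpha && length(node$children) > 0) {
    for (child in node$children)
      out <- c(out, hierarchical_test(child, test.group, p.total, alpha))
  }
  out
}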

testing a group hypothesis:

H_{0,G}: β^0_j = 0 for all j ∈ G

test statistic: max_{j∈G} |b_j|

since under H_{0,G}:

√n b_G = N_G(0, σ_ε^2 Ω_G) + ∆_G,   ∆_G = (∆_j; j ∈ G),   √n ‖∆‖_∞ = o_P(1)

thus:

max_{j∈G} √n |b_j| ⇒ σ_ε max_{j∈G} |W_j|,   (W_1, ..., W_p) ∼ N_p(0, Ω)

and we can easily simulate max_{j∈G} |W_j|
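A Monte Carlo sketch of this group test in R; b, Theta (and hence Omega), sigma.hat and n are assumed available from the earlier construction, and MASS::mvrnorm is used for the multivariate Gaussian draws:

library(MASS)

group_pvalue <- function(b, G, Omega, sigma.hat, n, B = 10000) {
  stat <- max(sqrt(n) * abs(b[G]))                     # observed max_{j in G} sqrt(n)|b_j|
  W    <- mvrnorm(B, mu = rep(0, length(G)),
                  Sigma = Omega[G, G, drop = FALSE])   # W_G ~ N(0, Omega_G)
  null.stat <- sigma.hat * apply(abs(W), 1, max)       # sigma_eps * max_{j in G} |W_j|
  mean(null.stat >= stat)                              # Monte Carlo p-value for H_{0,G}
}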

number of significant SNP parameters per response

[Figure: number of significant target SNPs per phenotype; x-axis: phenotype index, y-axis: number of significant target SNPs.]

response 40 has the most significant (levels of) target SNPs

Conclusions

we can construct asymptotically optimal p-values and confidence intervals for low-dimensional targets in high-dimensional models

R-package hdi ("high-dimensional inference"; Meier, 2013)

assuming/based on suitable conditions:
• sparsity of Y vs. X: s_0 = o(√(n)/log(p))
• sparsity of X_j vs. X_{−j} (j = 1, ..., p): max_j s_j = o(n/log(p))
• the design matrix X is not too ill-posed (e.g. restricted eigenvalue assumption, or a nice population covariance)

these conditions are typically uncheckable ... → confirmatory high-dimensional inference remains challenging

Thank you!



R-package: hdi (Meier, 2013)
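A hedged usage sketch; the function names below (lasso.proj, multi.split) and the confint method are taken from the CRAN releases of hdi as I recall them and may differ from the 2013 R-Forge version, so check them against the installed package. X and y are the design matrix and response, as before:

library(hdi)

## de-sparsified ("projected") Lasso: p-values and confidence intervals
fit.proj <- lasso.proj(x = X, y = y)
fit.proj$pval.corr                       # multiplicity-adjusted p-values
confint(fit.proj, level = 0.95)          # confidence intervals for the beta^0_j

## multiple sample splitting
fit.split <- multi.split(x = X, y = y)
fit.split$pval.corr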

References:

• Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methodology, Theory and Applications. Springer.
• Meinshausen, N., Meier, L. and Bühlmann, P. (2009). P-values for high-dimensional regression. Journal of the American Statistical Association 104, 1671-1681.
• Bühlmann, P. (2013). Statistical significance in high-dimensional linear models. Bernoulli 19, 1212-1242.
• van de Geer, S., Bühlmann, P. and Ritov, Y. (2013). On asymptotically optimal confidence regions and tests for high-dimensional models. Preprint arXiv:1303.0518v1.
• Meier, L. (2013). hdi: High-dimensional inference. R package available from R-Forge.
