50
Bayesian Regression Using a Prior on the Model Fit: The R2-D2 Shrinkage Prior Yan Dora Zhang 1 , Brian P. Naughton 2 , Howard D. Bondell 3 , and Brian J. Reich 2 1 Department of Statistics and Actuarial Science, The University of Hong Kong 2 Department of Statistics, North Carolina State University 3 School of Mathematics and Statistics, University of Melbourne July 9, 2020 Abstract Prior distributions for high-dimensional linear regression require specifying a joint distribution for the unobserved regression coefficients, which is inherently difficult. We instead propose a new class of shrinkage priors for linear regression via specifying a prior first on the model fit, in particular, the coefficient of determination, and then distributing through to the coefficients in a novel way. The proposed method compares favorably to previous approaches in terms of both concentration around the origin and tail behavior, which leads to improved performance both in posterior contraction and in empirical performance. The limiting behavior of the proposed prior is 1/x, both around the origin and in the tails. This behavior is optimal in the sense that it simultaneously lies on the boundary of being an improper prior both in the tails and around the origin. None of the existing shrinkage priors obtain this behavior in both regions simultaneously. We also demonstrate that our proposed prior leads to the same near-minimax posterior contraction rate as the spike-and-slab prior. Keywords: Global-Local Shrinkage, High-dimensional regression, Beta-prime distribution, Coefficient of Determination 1 arXiv:1609.00046v3 [stat.ME] 8 Jul 2020

The R2-D2 Shrinkage Prior - arXiv

Embed Size (px)

Citation preview

Bayesian Regression Using a Prior on theModel Fit: The R2-D2 Shrinkage Prior

Yan Dora Zhang1, Brian P. Naughton2, Howard D. Bondell3, and Brian J. Reich2

1Department of Statistics and Actuarial Science, The University of Hong Kong2Department of Statistics, North Carolina State University

3School of Mathematics and Statistics, University of Melbourne

July 9, 2020

Abstract

Prior distributions for high-dimensional linear regression require specifying a jointdistribution for the unobserved regression coefficients, which is inherently difficult.We instead propose a new class of shrinkage priors for linear regression via specifyinga prior first on the model fit, in particular, the coefficient of determination, andthen distributing through to the coefficients in a novel way. The proposed methodcompares favorably to previous approaches in terms of both concentration aroundthe origin and tail behavior, which leads to improved performance both in posteriorcontraction and in empirical performance. The limiting behavior of the proposedprior is 1/x, both around the origin and in the tails. This behavior is optimal in thesense that it simultaneously lies on the boundary of being an improper prior bothin the tails and around the origin. None of the existing shrinkage priors obtain thisbehavior in both regions simultaneously. We also demonstrate that our proposedprior leads to the same near-minimax posterior contraction rate as the spike-and-slabprior.

Keywords: Global-Local Shrinkage, High-dimensional regression, Beta-prime distribution,Coefficient of Determination

1

arX

iv:1

609.

0004

6v3

[st

at.M

E]

8 J

ul 2

020

1 Introduction

Consider the linear regression model,

Yi = xTi β + εi, i = 1, · · · , n, (1)

where Yi is the ith response, xi is the p-dimensional vector of covariates for the ith obser-

vation, β = (β1, · · · , βp)T is the coefficient vector, and the εi’s are the error terms assumed

be normal and independent with E(εi) = 0 and var(εi) = σ2. High-dimensional data with

p > n in this context is common in diverse application areas. It is well known that max-

imum likelihood estimation performs poorly in this setting, and this motivates a number

of approaches in shrinkage estimation and variable selection. In the Bayesian framework,

there are two main approaches to address such problems: two component discrete mixture

prior (also referred as spike and slab prior) and continuous shrinkage priors. The discrete

mixture priors (George & McCulloch 1993, Ishwaran & Rao 2005, Mitchell & Beauchamp

1988, Narisetty & He 2014) put a point mass (spike) at βj = 0 and a continuous prior

(slab) for the terms with βj 6= 0. Although these priors have an intuitive and appealing

representation, they lead to computational issues due to the spread of posterior probability

over the 2p models formed by including subsets of the coefficients to zero. Implementation

instead can proceed instead by applying approximation methods, such as stochastic search

variable selection (George & McCulloch 1993), shotgun stochastic search (Hans et al. 2007),

variational Bayes (Ormerod et al. 2017), and EM (Rockova & George 2014) all of which

have improved the computational feasibility and include theoretical underpinnings.

The computation issues with discrete mixture priors motivate continuous shrinkage

priors. The shrinkage priors are essentially written as global-local scale mixture Gaussian

family as summarized in Polson & Scott (2010), i.e.,

βj | φj, ω ∼ N(0, ωφj), φj ∼ π(φj), (ω, σ2) ∼ π(ω, σ2),

where ω represents the global shrinkage, while φj’s are the local variance components. Cur-

rent existing global-local priors exhibit desirable theoretic and empirical properties. They

can shrink the overall signal, while varying the amount of shrinkage on different compo-

nents. These continuous priors exhibit both heavy tails and high concentration around

2

zero. The heavy tail reduces the bias in estimation of large coefficients, while the high

concentration around zero shrinks the irrelevant coefficients heavily to zero, thus reduc-

ing the noise. Some examples include Normal-Gamma mixtures (Griffin & Brown 2010),

Horseshoe (Carvalho et al. 2009, 2010), generalized Beta (Armagan et al. 2011), general-

ized double Pareto (Armagan, Dunson & Lee 2013), Dirichlet-Laplace (Bhattacharya et al.

2015), Horseshoe+ (Bhadra et al. 2016), normal-beta prime prior (Bai & Ghosh 2019).

In general, it is difficult to specify a p-dimensional prior on β, particularly with high

dimensional data. Instead, we propose to first construct a prior on the coefficient of deter-

mination, R2, for which the model-based version is defined as the square of the correlation

coefficient between the dependent variable and its modeled expectation. A prior on this

one-number summary forms a prior on a function of the parameter vector, and is then

distributed through to the individual parameters in a natural way. We develop a class of

priors that are constructed via marginalizing over the design, as well as those condition-

ing on the design. By viewing things in this framework, our proposed class of priors are

induced by a Beta(a, b) prior on R2 and lead to priors having desirable properties both

asymptotically and in finite samples.

We show that our class of priors, which we term the R2-induced Dirichlet Decomposition

(R2-D2) priors, simultaneously obtain both heavier tails and tighter concentration around

zero than all previously proposed approaches. This optimal result translates into improved

performance in estimation and inference. We also offer a theoretical framework to compare

different global-local priors. The proposed method compares favorably to the other global-

local shrinkage priors in terms of both its concentration around the origin and its tail

behavior obtaining a limiting behavior of 1/x in both regions. This behavior is optimal in

the sense that it simultaneously lies on the boundary of being an improper prior in both

areas, and translates into improved theoretical and empirical performance.

The rest of the paper is outlined as follows. Section 2 motivates the idea of inducing

a prior via R2, and distinguishes between a marginal and conditional version. Section

3 presents the details of the conditional version in both the low- and high-dimensional

settings. Section 4 details the marginal version and provides theoretical properties of both

3

the prior and the posterior. Section 5 discusses novel MCMC algorithms for computation

of both the conditional and marginal versions, while Section 6 provides simulation results.

Section 7 provides real data examples. All proofs are given in the Appendix.

2 Motivation

The typical Bayesian approach specifies a joint distribution on the model parameters,

namely for the regression coefficients and error variance. Instead, we specify a distribution

for R2 with practical meaning, and then induce a prior on the p-dimensional β.

Suppose that the predictor vector for each observation x ∼ H(·), with E(x) = µ and

cov(x) = Σ. Assume that x is independent of the error, ε, and then the marginal variance

of y = xTβ + ε is var(xTβ) + σ2. For simplicity, we assume that the response is centered

and covariates are standardized so that µ = 0, there is no intercept term in (1), and all

diagonal elements of Σ are 1. The coefficient of determination, R2, can be calculated as the

square of the correlation coefficient between the dependent variable, y, and the modeled

value, xTβ, i.e.,

R2 =cov2(y,xTβ)

var(y)var(xTβ)=

cov2(xTβ + ε,xTβ)

var(xTβ + ε)var(xTβ)=

var(xTβ)

var(xTβ) + σ2. (2)

A hypothesized value of R2 has been used previously to tune informative priors, and

to select hyper-parameters for regularization problems. Scott & Varian (2014) elicit an

informative distribution for the error variance, σ2, based on elicitation of the expected

R2, and the response. Zhang & Bondell (2018) proposed to choose hyper-parameters for

shrinkage priors by empirically minimizing the Kullback-Leibler divergence between the

expected distribution of R2 and a Beta distribution. Here, in contrast, we develop our

approach from first principles via placing a prior distribution on R2 directly, rather than

using a hypothesized value as a tool to tune parameters in already existing priors.

Based on this representation of R2, two alternative approaches can be taken to construc-

tion of the prior. A conditional version places a Beta prior on the conditional distribution

of R2 which depends on the model design, X. Conversely, a marginal version assumes that

the marginal distribution of R2 (after integrating out β and X) has a Beta distribution.

4

The former has the interpretation of the usual sample-based version of R2, while the

latter allows for more direct asymptotic analysis of the posterior, as the design is inte-

grated out. We will show that both versions lead to priors having different, but desirable

properties.

3 Conditional R2 Prior

3.1 R2 as Elliptical Contours

We now introduce the conditional version, which, conditioning on the design points, yields

R2(β) =βTXTXβ

βTXTXβ + nσ2. (3)

We specifically write R2(β) to reflect the fact that R2 depends on the unknown vector β

(as well as σ2). Notice that (3) will reduce to the familiar sample statistic, R2, if the least-

squares estimates were substituted for β and σ2. Conditional on σ2 and X, a distribution

on R2 induces a distribution on the quadratic form, βTXTXβ.

We choose a Beta(a, b) prior for R2, where the choices of shape parameters a and b

will determine the posterior behavior and will be discussed in more detail in the theoretical

results and the implementation. An Inverse-Gamma(a1, b1) prior is used for σ2, but we note

that other choices may also be applied. A prior for β given (R2, σ2) then must be defined

on the surface of the ellipsoid:{β : βTXTXβ = nR2σ2/(1−R2)

}. When XTX is full

rank we may choose β to be uniformly distributed on this ellipsoid; that is, the distribution

of β is constant given the quadratic form. We call this choice the “uniform-on-ellipsoid”

prior for β, and show a connection to a variation on a mixture of g-priors. The following

proposition shows that β given σ2 has an elliptical distribution after integrating out R2.

Proposition 1. If R2 | σ2 has a Beta(a,b) distribution and β | R2 has a uniform prior on

the ellipsoid, then β | σ2 has the probability density function:

p(β | σ2) =Γ (p/2) |ΣX |1/2

B(a, b) πp/2(σ2)−a (

βTΣXβ)a−p/2 (

1 + βTΣXβ/σ2)−(a+b)

, (4)

where β ∈ Rp,ΣX = XTX/n, and B(·, ·) denotes the Beta function.

5

As a special case, if a = p/2 and b = 1/2, then β has a multivariate Cauchy distribution

with spread parameter ΣX/σ2. Zellner & Siow (1980) recommended these Cauchy priors for

model selection problems. The next proposition shows that for a ≤ p/2, the distribution in

(4) is equivalent to a mixture of normals g-prior, with a hyperprior on g that is the product

of Beta and Inverse-Gamma distributions.

Proposition 2. If β | σ2, z, w ∼ Np

(0, zwσ2(XTX)−1

), z ∼ Inverse-Gamma (b, n/2),

w ∼ Beta (a, p/2 − a), and a ≤ p/2, then β | σ2 has the distribution given by the density

in (4).

This representation eases the posterior computations for a Gibbs sampler discussed in

Section 5.

3.2 Sparse Regression and Local Shrinkage

The prior on R2 regulates β through the quadratic form βTΣXβ, which can shrink the

regression coefficients globally, but lacks the flexibility to handle different forms of sparsity.

In addition, the posterior is not a proper distribution when X is not full rank (e.g. when

n < p). Rather than letting β | (R2, σ2) be uniformly distributed on the ellipsoid, we put

a Normal-Gamma prior on β (Griffin & Brown 2010), but restrict its support to lie on the

surface of the ellipsoid. Specifically, we let

β | (R2, σ2,Λ) ∼ Np

(0,

σ2R2

1−R2Λ

)1

{β′ΣXβ =

σ2R2

1−R2

}(5)

λj ∼ Gamma (ν, µ) , for j = 1, . . . , p, (6)

where Gamma(ν, µ) represents Gamma distribution with shape parameter ν and rate pa-

rameter µ, Λ = diag {λ1, . . . , λp} and 1 {·} is the indicator function. Note that this prior

no longer requires ΣX to be full rank for it to be proper. Proposition 3 shows that the

induced model described in the previous section is a special case of the hierarchical model

proposed here with fixed Λ.

Proposition 3. If β | (R2, σ2,Λ) has the distribution in (5), and Λ = (XTX)−1, then

β | σ2 has the distribution in (4).

6

That is, if the contours of the Normal distribution align with the ellipsoid, then we

recover the uniform-on-ellipsoid prior.

In general, the conditional distribution of β is similar to a Bingham distribution, which

is a multivariate Normal distribution conditioned to lie on the unit sphere. This Bingham

distribution has density f(w | A) = C−1A exp{−wTAw} with respect to the uniform mea-

sure on the p− 1 dimensional unit sphere, where CA is the normalizing constant (Bingham

1974). Here, β | (R2, σ2,Λ) is a Bingham distributed random vector that has been rotated

and scaled to lie on the ellipsoid rather than the unit sphere. The matrix X determines the

rotation of the ellipsoid, R2 and σ2 determine the size of the ellipsoid, and the conditional

prior on β determines the direction to the surface. If the local variance components (λj)

are small, then regions of the ellipsoid near the axes will be favored, encouraging sparser

estimates. Like the Normal-Gamma priors, this is primarily controlled by the shape pa-

rameter, ν. Figure 1 illustrates the local shrinkage properties of the prior, showing 10,000

samples of β | (R2, σ2). Default choices of the hyper-parameters ν and µ follow from the

recommendations of Griffin & Brown (2010) and are discussed in the implementation in

Section 6.

Figure 1: 10,000 samples from the prior of β | (R2, σ2) for different choices of ν.

7

4 Marginal R2 Prior

4.1 The R2-D2 Global-Local Shrinkage Prior

Rather than conditioning on the design X, we now instead show how to construct a prior

while marginalizing out both β and the design. While the conditional version retains the

interpretation of R2 as elliptical contours in the design space, the marginal version allows

for an in depth study of the asymptotic properties of both the prior and the resulting

posterior.

Consider a prior for β satisfying E(β) = 0 and cov(β) = σ2Λ, where Λ is a diagonal

matrix with diagonal elements λ1, · · · , λp. Then

var(xTβ) = Ex{varβ(xTβ | x)}+ varx{Eβ(xTβ | x)} = Ex(σ2xTΛx) + varx(0)

= σ2Ex{tr(xTΛx)} = σ2tr{ΛEx(xxT )} = σ2tr(ΛΣ) = σ2

p∑j=1

λj.

Then R2 is represented as

R2 =var(xTβ)

var(xTβ) + σ2=

σ2p∑j=1

λj

σ2p∑j=1

λj + σ2

=

p∑j=1

λj

p∑j=1

λj + 1

≡ W

W + 1, (7)

where W ≡∑p

j=1 λj is the sum of the prior variances scaled by σ2.

Similarly as conditional R2 prior, we also assume R2 ∼ Beta(a, b), a Beta distribution

with shape parameters a and b. Then in this case, the induced prior density for W =

R2/(1 − R2) is a Beta Prime distribution (Johnson et al. 1995) denoted as BP(a, b), with

probability density function

πW (x) =Γ(a+ b)

Γ(a)Γ(b)

xa−1

(1 + x)a+b, (x > 0).

Therefore W ∼ BP(a, b) is equivalent to R2 ∼ Beta(a, b). The following section will induce

a prior on β based on the distribution of the sum of prior variances, W .

Any prior of the form E(β) = 0, cov(β) = σ2Λ and W =∑p

j=1 λj ∼ BP(a, b) induces a

Beta(a, b) prior on R2. To construct a prior with such properties, we follow the global-local

8

prior framework and express λj = φjω with∑p

j=1 φj = 1. Then W =∑p

j=1 φjω = ω is

the total prior variability, and φj is the proportion of total variance allocated to the j-th

covariate. It is natural to assume that ω ∼ BP(a, b) and the variances across covariates

have a Dirichlet prior with concentration parameter (aπ, · · · , aπ), i.e., φ = (φ1, · · · , φp) ∼

Dir(aπ, · · · , aπ). Since∑p

j=1 φj = 1, E(φj) = 1/p, and var(φj) = (p−1)/{p2(paπ+1)}, then

smaller aπ would lead to larger variance of φj, j = 1, · · · , p, thus more φj would be close to

zero with only a small proportion of larger components; while larger aπ would lead to smaller

variance of φj, j = 1, · · · , p, thus producing a more uniform φ, i.e., φ ≈ (1/p, · · · , 1/p).

So aπ controls the sparsity of the model.

To fully define the global-local prior, we further need to assign a kernel distribution on

each dimension of β. Since the Laplace distribution ensures more mass around zero and

heavier tails than the normal kernel, we consider a Laplace prior on βj for j = 1, · · · , p.

The prior is then summarized as

βj | σ2, φj, ω ∼ DE(σ(φjω/2)1/2), φ ∼ Dir(aπ, · · · , aπ), ω ∼ BP(a, b), (8)

where DE(δ) denotes a double-exponential distribution (i.e., Laplace distribution) with

mean 0 and variance 2δ2. Such prior is induced by a prior on R2 and the total prior

variance of β is decomposed through a Dirichlet prior, therefore we refer to the prior as the

R2-induced Dirichlet Decomposition (R2-D2) prior. Here ω controls the global shrinkage

degree through a and b, while φj controls the local shrinkage through aπ. Assume the

variance σ2 ∼ Inverse-Gamma(a1, b1), an inverse Gamma distribution with shape and scale

parameters a1 and b1 respectively.

Proposition 4. If ω | ξ ∼ Ga(a, ξ) and ξ ∼ Ga(b, 1), then ω ∼ BP(a, b), where Ga(µ, ν)

is the Gamma random variable with shape µ and rate ν.

By applying above Proposition 4, the prior in (8) can also be written as

βj | σ2, φj, ω ∼ DE(σ(φjω/2)1/2), φ ∼ Dir(aπ, · · · , aπ), ω | ξ ∼ Ga(a, ξ), ξ ∼ Ga(b, 1).

Proposition 5. If ω | ξ ∼ Ga(a, ξ), (φ1, · · · , φp) ∼ Dir(aπ, · · · , aπ), and a = paπ, then it

follows φjω | ξ ∼ Ga(aπ, ξ), j = 1, . . . , p independently.

9

Thus, by Proposition 5, when a = paπ, prior in (8) can also be written as

βj | σ2, λj ∼ DE(σ(λj/2)1/2), λj ∼ BP(aπ, b). (9)

or equivalently

βj | σ2, λj ∼ DE(σ(λj/2)1/2), λj | ξ ∼ Ga(aπ, ξ), ξ ∼ Ga(b, 1). (10)

We set a = paπ in the rest of the paper for the R2-D2 prior.

4.2 Properties of the R2-D2 Prior

In this section, the marginal density as well as its theoretical properties of the proposed

R2-D2 prior are established. The properties of the Horseshoe (Carvalho et al. 2009, 2010),

Horseshoe+ (Bhadra et al. 2016), generalized double Pareto prior (Armagan, Dunson & Lee

2013) and Dirichlet-Laplace prior (Bhattacharya et al. 2015) are provided as a comparison.

Proofs and technical details are given in the Appendix.

For simplicity of comparison across different priors, the variance term σ2 is fixed at 1.

Proposition 6. Given the R2-D2 prior (9), the marginal density of βj for any j = 1, · · · , p

is

πR2-D2(βj) =G3,1

1,3

(β2j

2

∣∣∣ 12−baπ− 1

2,0, 1

2

)(2π)1/2Γ(aπ)Γ(b)

=G1,3

3,1

(2β2j

∣∣∣ 32−aπ ,1, 1212

+b

)(2π)1/2Γ(aπ)Γ(b)

where Γ(·) denotes the Gamma function and Gm,np,q

(z∣∣∣b1,...,bqa1,...,ap

)denotes the Meijer G-function

(see Appendix for the detailed definition).

Now we would like to compare the theoretical properties with our proposed R2-D2 prior

with a couple of common global-local shrinkage priors. We first listed these priors.

The Horseshoe prior proposed in Carvalho et al. (2009, 2010) is

βj | λj ∼ N(0, λ2j), λj | τ ∼ C+(0, τ),

where C+(0, τ) denotes a half-Cauchy distribution with scale parameter τ , with density

π(y | τ) = 2/{πτ(1 + (y/τ)2)}.

10

The Horseshoe+ prior proposed in Bhadra et al. (2016) is

βj | λj ∼ N(0, λ2j), λj | τ, ηj ∼ C+(0, τηj), ηj ∼ C+(0, 1).

The Dirichlet-Laplace prior proposed in Bhattacharya et al. (2015) is

βj | ψj ∼ DE(ψj), ψj ∼ Ga(a∗, 1/2). (11)

The normal-beta prime prior proposed in Bai & Ghosh (2019) is as follows:

βj | τj ∼ N(0, τj), τj ∼ BP(a#, b#). (12)

−0.4 −0.2 0.0 0.2 0.4

0.0

0.5

1.0

1.5

2.0

2.5

3.0

β

Mar

gina

l den

sity

R2−D2NBPCauchyHorseshoeHorseshoe+DL

5 10 15 20

0.00

00.

002

0.00

40.

006

0.00

80.

010

β

Mar

gina

l den

sity

R2−D2NBPCauchyHorseshoeHorseshoe+DL

Figure 2: Marginal density of the R2-D2 (R2-D2), normal-beta prime (NBP), Dirichlet-

Laplace (DL), Horseshoe, Horseshoe+ prior and Cauchy distribution. In all cases, the

hyper-parameters are selected to ensure the interquartile range is 1 for visual comparison.

Figure 2 plots the marginal density function of the R2-D2 density along with the normal-

beta prime, Horseshoe, Horseshoe+, Dirichlet-Laplace, and Cauchy distributions. In the

figure, for visual comparison, the hyper-parameter in the priors, i.e., τ in Horseshoe and

Horseshoe+ prior, a∗ in Dirichlet-Laplace prior, (aπ, b) in the R2-D2 prior, are selected to

ensure the interquartile range is 1. Note that to make the plots comparable, aπ in the

11

proposed R2-D2 prior is set as a∗/2, which is half of the hyper-parameter in the Dirichlet-

Laplace prior. It will be shown later that this results in the same behavior around the

origin for the two priors. The other hyper-parameter b in the R2-D2 prior is then tuned to

ensure the interquartile range of 1 to match the others. For the normal-beta prime prior,

we follow the values in Bai & Ghosh (2019), i.e., a# = 0.48 and b# = 0.52 which also

ensures an interquartile range of 1.

From the plot, it appears that the R2-D2 prior can obtain both the highest concentration

around zero and heaviest tail simultaneously. We will quantify these rates exactly in the

next subsection, in Table 1. In particular, we will see that the R2-D2 prior is the only one

obtaining polynomial behavior in both regions.

In the normal means model, van der Pas et al. (2014) and van der Pas et al. (2017a)

investigate the Horseshoe posterior contraction rate, Bhattacharya et al. (2015) shows the

optimal posterior concentration results for Dirichlet-Laplace prior, Bhadra et al. (2016)

proves that the Horseshoe+ posterior concentrates at a faster rate than Horseshoe in the

Kullback-Leibler sense, and Bai & Ghosh (2019) shows that normal-beta prime prior leads

to a near minimax posterior concentration rate.

In this paper, we examine the concentration around zero and tail behaviors of the

marginal densities of a number of priors, and show that our proposed approach simulta-

neously achieves high concentration at the origin and heavy tails. We will also study the

posterior consistency and contraction properties in the high-dimensional regression model

setup. As shown in Figure 2, all five global-local shrinkage priors have a marginal density

with a singularity at zero while with different concentration rate. Except for the Dirichlet-

Laplace prior, all other priors’ marginal density have a heavier tail than the Cauchy dis-

tribution. We formally investigate their marginal densities’ properties in the following

sections.

4.2.1 Asymptotic tail behaviors

We examine the behavior of the tails of the proposed R2-D2 prior in this section. A prior

with heavy tails is desirable in high-dimensional regression to allow the posterior to estimate

12

large values for important predictors.

Theorem 1. Given |β| → ∞, for any aπ > 0 and b > 0, the marginal density of the

R2-D2 prior (9) satisfies πR2-D2(β) = O(1/|β|2b+1). Furthermore, when 0 < b < 1/2,

lim|β|→∞ πR2-D2(β)/β−2 = ∞, i.e., the R2-D2 prior has heavier tails than the Cauchy dis-

tribution.

With a polynomial tail heavier than Cauchy distribution, the proposed R2-D2 prior

attains a substantial improvement over a large class of global-local shrinkage priors.

As a comparison, we study the tail behavior of the Dirichlet-Laplace and double Pareto

prior. The density of generalized double Pareto prior proposed in Armagan, Dunson & Lee

(2013) is

πGDP(βj | η, α) = (1 + |βj|/η)−(α+1)/(2η/α), (α, η > 0).

Theorem 2. Given |β| → ∞, for any α > 0, the marginal density of the general-

ized double Pareto prior satisfies πGDP(β) = O(1/|β|α+1). Furthermore, when α < 1,

lim|β|→∞ πGDP(β)/β−2 =∞, i.e., the double Pareto prior has heavier tails than the Cauchy

distribution.

Theorem 3. Given |β| → ∞, for any a∗ > 0, the marginal density of the Dirichlet-

Laplace prior as shown in (11) satisfies πDL(β) = O(|β|a∗/2−3/4/exp{√

2|β|}). Further-

more, lim|β|→∞ πDL(β)/β−2 = 0, i.e., the Dirichlet-Laplace prior has lighter tails than the

Cauchy distribution.

As noted in Carvalho et al. (2010), the Horseshoe prior has exact Cauchy-like tails that

decay like β−2, and the Horseshoe+ prior has a tail of O(log(|β|)/β2) as illustrated in the

proof of Theorem 4.6 in Bhadra et al. (2016). Therefore, the double Pareto prior and the

proposed R2-D2 prior lead to the heaviest tail, followed by Horseshoe+, then Horseshoe,

and finally Dirichlet-Laplace prior.

4.2.2 Concentration properties

In this section, we study the concentration properties of the R2-D2 prior around the origin.

The concentration properties of Dirichlet-Laplace, Horseshoe, and Horseshoe+ priors are

13

also given. We favor priors with high concentration near zero to reflect the prior that most

of the covariates do not have a substantial effect on the response. We now show that R2-D2

prior has higher concentration at zero to go along with heavier tails than other global-local

priors.

Theorem 4. As |β| → 0, if 0 < aπ < 1/2 and b > 0, the marginal density of the R2-D2

prior as shown in (9) satisfies πR2-D2(β) = O(1/|β|1−2aπ).

Theorem 5. As |β| → 0, if 0 < a∗ < 1, the marginal density of the Dirichlet-Laplace prior

as shown in (11) satisfies πDL(β) = O(1/|β|1−a∗).

For the Horseshoe prior, as summarized in Carvalho et al. (2010), the marginal density

πHS(β) = (2π3)−1/2 exp(β2/2)E1(β2/2), where E1(z) =∫∞

1e−tz/t dt is the exponential

integral function. As |β| → 0,

1

2(2π3)1/2log(1 +

4

β2) ≤ πHS(β) ≤ 1

(2π3)1/2log(1 +

2

β2).

Therefore around the origin, πHS(β) = O(log(1/|β|)). Also by the proof of Theorem 4.6

in Bhadra et al. (2016), as |β| → 0, the marginal density of Horseshoe+ prior satisfies

πHS+(β) = O(log2(1/|β|)).

It is clear that 2aπ in the R2-D2 prior plays the same role around the origin as aD

in the Dirichlet-Laplace prior. Accordingly, when a∗ = 2aπ ∈ (0, 1), all these four priors

possess unbounded density near the origin. However, the R2-D2 prior and Dirichlet-Laplace

prior diverge to infinity with a polynomial order, much faster than the Horseshoe+ (with

a squared logarithm order) and the Horseshoe prior (with a logarithm order). Although

the double Pareto prior also has a polynomial order tail similar as our proposed R2-D2

prior, the double Pareto prior differs around the origin, as it remains bounded, while our

proposed R2-D2 prior is unbounded at the origin. As for the normal-beta prime prior, we

show that its concentration rate is slower than the R2-D2 prior.

The results for all priors in both tail behavior and concentration around zero are sum-

marized in Table 1. The proposed R2-D2 prior is the only one that can achieve polynomial

rates both in the tails as well as around zero. This is the limiting case in that it can then

be arbitrarily close to the boundary case of 1/|β| in each region.

14

Tail Decay Concentration at zero

Horseshoe O(

1β2

)O(

log( 1|β|))

Horseshoe+ O(

log |β|β2

)O(

log2( 1|β|))

Dirichlet-Laplace O

(|β|a∗/2−3/4

exp{√

2|β|}

)O(

1|β|1−a∗

)Generalized Double Pareto O

(1

|β|1+α

)O(1)

R2-D2 O(

1|β|1+2b

)O(

1|β|1−2aπ

)Table 1: Tail behavior and concentration around zero for Horseshoe, Horseshoe+, Dirichlet-

Laplace, Generalized Double Pareto, and R2-D2 priors.

4.2.3 Consistency and contraction of R2-D2 Posterior

In what follows, we rewrite p as pn to indicate the dimension of βn can increase with the

sample size n. We denote the true regression parameter as β0n. Let qn be the number of

nonzero components in β0n. We use || · || as the L2 norm, and || · ||1 as L1 norm for vectors,

respectively. Denote Yn = (Y1, · · · , Yn)T . Denote Xn = (xT1 , · · · ,xTn )T . Let ξn denote a set

of indices where ξn ⊂ {1, · · · , pn}, and let Xξn denote the sub-matrix of Xn that contains

the columns with indices in ξn. Denote ξ0n as the set containing the nonzero indices of β0

n.

For two positive sequences an and bn, an � bn means 0 < lim inf an/bn ≤ lim sup an/bn <∞;

an ≺ bn means an = o(bn); an � bn means bn = o(an); an � bn means either an ≺ bn or

an � bn; and an � bn means either an � bn or an � bn.

We now show that the proposed R2-D2 prior yields strong posterior consistency in the

case that pn ≺ n, and further that it attains the optimal near-minimax contraction rate in

general, including with pn � n, as given by Castillo et al. (2015), Song & Liang (2017) and

Rockova & George (2018).

Theorem 6 shows strong posterior consistency under pn = o(n), and assumes the fol-

lowing regularity conditions:

(A1) pn ≺ n;

(A2) Let dpn and d1 be the smallest and the largest singular values ofXTnXn/n respectively.

15

Assume 0 < dmin < lim infn→∞ dpn ≤ lim supn→∞ d1 < dmax < ∞, where dmin and

dmax are fixed;

(A3) maxj=1,··· ,pn{|β0nj|} ≤ En for some nondecreasing sequence, En, with log(En) =

O(log n);

(A4) qn = o(n/ log n).

Theorem 6. Under assumptions (A1)–(A4), for any b > 0, given the linear regression

model (1) with known σ2, if aπ = C/(pb/2n nrb/2 log n) for finite r > 0 and C > 0, then the

R2-D2 prior (9) yields a strongly consistent posterior, i.e., for any ε > 0,

Prβ0n

{πn(βn : ||βn − β0

n|| ≥ ε | Yn)→ 0}

= 1 as n→∞.

Theorem 6 shows the posterior strong consistency of the R2-D2 prior under pn ≺ n.

Furthermore, in the high dimensional case, with pn � n, we place an inverse-Gamma prior

on σ2, and denote σ0 as the true parameter value which is unknown but fixed. Theorem 7

shows that the R2-D2 prior contracts at the near-minimax rate in this regime, under the

following regularity conditions:

(B1) All the covariates are uniformly bounded. For simplicity, we assume they are all

bounded by 1;

(B2) pn � n;

(B3) There exists some integer pn and fixed constant d0 such that pn � qn and the smallest

singular value of matrix XTξnXξn/n is no smaller than d0 for any subset model of size

|ξn| ≤ pn;

(B4) maxj=1,...,pn{|β0nj/σ

0|} ≤ En for some nondecreasing sequence, En, with log(En) =

O(log pn).

(B5) qn = o(n/ log pn);

Theorem 7. Assume that (B1)-(B5) hold. Denote εn = M√qn(log pn)/n where M > 0 is

sufficiently large, and let kn �√qn(log pn)/n/pn. Given the linear regression model (1),

16

suppose that we place an inverse-Gamma prior on σ2 and place the R2-D2 prior (9) on β.

For any b > 0, if aπ ≤ log (1− p−1+un )/(2 log kn) where u > 0, then the following hold:

Prβ0n

{πn(βn : ||βn − β0

n|| ≥ c1σ0εn | Yn) ≥ e−c2nε

2n

}≤ e−c3nε

2n ,

Prβ0n

{πn(βn : ||βn − β0

n||1 ≥ c1σ0√qnεn | Yn) ≥ e−c2nε

2n

}≤ e−c3nε

2n ,

Prβ0n

{πn(βn : ||Xnβn −Xnβ

0n|| ≥ c1σ

0√nεn | Yn) ≥ e−c2nε

2n

}≤ e−c3nε

2n ,

for some positive constants c1, c2, and c3.

According to Raskutti et al. (2011), the minimax L2 rate is O(√qn log(pn/qn)/n). For

the R2-D2 prior, the L2 and L1 contraction rates for the posterior of βn areO(√qn(log pn)/n)

and O(qn√

(log pn)/n), respectively. So the contraction rates of R2-D2 prior are near-

minimax. Note that these rates are the same as the rates achieved by spike-and-slab

approaches as in Castillo et al. (2015), Song & Liang (2017) and Rockova & George (2018).

We note that Theorem 6 shows strong consistency in the pn ≺ n regime, while Theorem

7 shows the contraction rate for pn � n. While, the result in Theorem 7 is stronger,

including stronger conditions on the hyperparameters, we do conjecture that the near-

minimax contraction rate will also hold in the case of pn ≺ n, with a condition on the

hyperparameters that is weaker than that of Theorem 7. As also pointed out in Song &

Liang (2017), in this case it is not necessary to require a strong prior concentration which

is ensured by the conditions of Theorem 7, and one only need to impose conditions on the

local shape of the prior around the true β0.

4.2.4 Choice of Hyperparameters

Based on these properties, we now discuss a default choice of hyperparameters that can be

implemented in practice. We set aπ = a/p throughout, which is not as a choice, but it is an

important step in the definition of the prior. The reason for this very specific relationship

is to ensure that the R2-D2 prior in (8) can be re-written as (9) and (10). The theoretical

properties are derived based on (9) (or (10)), hence they assume this specific form of aπ, as

any other choice would no longer yield the same theoretical results. In addition, it ensures

17

that the MCMC algorithm is fully Gibbs sampling. With any other choice of aπ, this would

not be the case, and a Metropolis step would be needed in the algorithm.

Hence there are then two parameters to set, a and b. Based on the consistency results,

we suggest to set a as a function of b based on the condition of Theorem 6. This leads to just

one tuning parameter b. This determines the tail behavior and then all other parameters

are fixed from that. For a fully default method, we set b = 0.5 to yield Cauchy-like tails,

but other choices of tail behavior are possible if desired.

To determine a and aπ from a choice of b, note that the condition for consistency in

Theorem 6 requires that aπ = C/(pb/2n nrb/2 log n), where C and r are arbitrary constants.

For a default approach, given the choice of b, we set aπ exactly at C/(pb/2n nrb/2 log n)

choosing the arbitrary constants C and r each to be 1. This is now the default choice and

has been implemented in all of the examples to follow.

5 Posterior Computation

In this section we develop novel Markov chain Monte Carlo (MCMC) samplers for both

the conditional and marginal R2-D2 approaches. The development of these samplers are

of interest directly on their own, as after some transformation and reparametrization, we

are able to obtain efficient feasible methods. In particular, the marginal version and the

conditional uniform on ellipsoid version allow for fully Gibbs samplers. The conditional

version with local shrinkage requires a Metropolis-Hastings sampler and we show how to

sample from this posterior even in the case where p > n and hence XTX is not full rank.

5.1 Gibbs Sampler for the Conditional Uniform-on-Ellipsoid Prior

In Proposition 2, we showed that β | σ2 has a mixture of normals representation for a ≤ p/2.

In practice, we recommend choosing a = p/n as a default, and hence this applies as long

as n > 1. If σ2 is Inverse-Gamma(a1, b1) distributed, then β, σ2, z and w are drawn from

their full conditional distributions in the Gibbs sampler described as follows:

(a) Set initial values for β, σ2, z and w.

18

(b) Sample w | β, σ2, z, by first sampling u ∼ Gamma(p/2− a, βTXTXβ/ (2gσ2)

), and

setting w = 1/(1 + u).

(c) Sample z | β, σ2, w ∼ Inverse-Gamma(p/2 + b, βTXTXβ/ (2wσ2) + 1/2

).

(d) Sample σ2 | β, z, w ∼ Inverse-Gamma((n+ p)/2 + a1, SSE/2 + βTXTXβ/ (2wz) + b1

),

where SSE = (y −Xβ)T (y −Xβ).

(e) Sample β | σ2, z, w ∼ Normal(cβLS, cσ

2(XTX)−1)

, where c = zwzw+1

is the shrinkage

factor, and βLS is the least-squares estimate of β.

(f) Repeat steps b-e until convergence.

5.2 MCMC for the Conditional Local Shrinkage Model

5.2.1 Full-rank case

We now develop a novel MCMC algorithm for the local shrinkage model, sampling from the

full conditionals using a Metropolis-Hastings sampler. First, we take the eigen-decomposition

of ΣX = XTX/n = V DV T , where Vp×p is an orthogonal matrix of eigenvectors, andDp×p

is a diagonal matrix of eigenvalues. R2 is transformed such that θ = R2/(1−R2) has a

Beta-Prime (or Inverted-Beta distribution), with density p(θ) = θa−1 (1 + θ)−a−b /B(a, b).

We also transform β to lie on the unit sphere conditional on the other variables; that is,

γ = D1/2V Tβ/√θσ2. Then γ|Λ has a Bingham distribution, and we can write the full

model as follows.

y | γ, σ2, θ ∼ Nn

(√θσ2XVD−1/2γ, σ2I

)γ | Λ ∼ Bingham

(D−1/2V TΛ−1V D−1/2

)σ2 ∼ Inverse-Gamma(a1, b1)

θ ∼ Beta-Prime(a, b)

λj ∼ Gamma(ν, µ), for j = 1, . . . , .p

19

This parametrization of the prior models the direction of β independently of σ2 and R2.

However, the Bingham distribution of the direction, γ | Λ, contains an intractable normaliz-

ing constant, CX,Λ, depending onX and Λ. Specifically, CX,Λ is a confluent hypergeometric

function with matrix argument A = D−1/2V TΛ−1V D−1/2/2.

The full conditional posterior distribution of γ is a Fisher-Bingham(µ,A) distribution

(Kent 1982), where µ = yTXVD−1/2√θ/σ2. This is equivalent to a Np(A

−1µ,A−1) dis-

tribution conditioned to lie on the p−1 dimensional unit sphere. We sample from this using

the rejection sampler proposed by Kent et al. (2013) with an Angular Central Gaussian

(ACG) envelope distribution. Sampling efficiently from the ACG distribution is possible

because it is just the marginal unit direction of a multivariate Normal distribution with

mean 0, and thus only requires draws from a Normal distribution. Any standard MCMC

algorithm can sample θ and σ2, but the Adaptive Metropolis algorithm (Haario et al. 2001)

automatically accounts for the strong negative correlation between the parameters without

the need for manual tuning. A bivariate Normal proposal distribution is used for (σ2, θ)

with covariance proportional to the running covariance of the samples during the burn-in

phase.

The density function of the full conditional posteriors for variance parameters, λj, al-

most have Generalized Inverse Gaussian distributions (GIG) if it were not for the intractable

term CX,Λ. A Metropolis-Hastings algorithm would require computing this quantity. Our

solution is to propose each candidate λ∗j from a GIG distribution, and introduce auxil-

iary variables, γ∗|Λ∗, from a Bingham distribution in which the constant, CX,Λ∗ , appears

in the density. We calculate the Metropolis-Hastings acceptance probability for (Λ∗,γ∗),

and since CX,Λ∗ appears in the posterior and the proposal distribution, we avoid comput-

ing it. This is the so-called “Exchange Algorithm” proposed by Murray et al. (2006) for

doubly-intractable distributions, and also used by Fallaize & Kypraios (2016) for Bayesian

inference of the Bingham distribution.

The entire sampler for the local shrinkage model is described as follows:

(a) Set initial values for γ, σ2, θ, and Λ.

(b) Sample γ | Λ from a Fisher-Bingham(µ,A) distribution, where µ = yTXVD−1/2√θ/σ2

20

and A = D−1/2V TΛ−1V D−1/2/2.

(c) Sample (σ2, θ) jointly using an Adaptive Metropolis algorithm from a bivariate Nor-

mal distribution.

(d) Exchange algorithm to sample Λ:

(i) Sample λ∗j | uj ∼ GIG(ν, 2µ, u2j), for j = 1, . . . , p, where u = V D−1/2γ.

(ii) Sample u∗ from a Bingham(A∗) distribution, whereA∗ = D−1/2V TΛ∗−1V D−1/2/2

(iii) Accept (Λ∗,u∗) with probability:

p(Λ∗|u)

p(Λ|u)× q(Λ|u)

q(Λ∗|u)× q(u∗|Λ)

q(u∗|Λ∗)= exp

{p∑j=1

u∗2j(1/λ∗2j − 1/λ2

j

)}

(e) Repeat steps (b)-(d) until convergence, and calculate β =√θσ2V D−1/2γ for each

sample.

In step (d), p(Λ | u) is the conditional posterior distribution of Λ; q(Λ | u) is the GIG

proposal distribution with density: p(x; d, a, b) ∝ xd−1exp {−(ax+ b/x)/2},

for −∞ < d < ∞, and a, b > 0; and q(u | Λ) is the Bingham proposal distribution which

has the same constant as in p(Λ | u). We only need to keep Λ∗ at each step, and can

discard u∗. We can efficiently sample from the Bingham distribution because it is a special

case of the Fisher-Bingham with µ = 0 (Kent et al. 2013). All modeling is done in terms

of γ, σ2, θ and Λ, and we calculate β outside the sampler.

5.2.2 Non full-rank case

Next we address how to fit these models with high-dimensional data, where p > n and

XTX is not full rank. The restriction on β : βTΣXβ = σ2θ, is no longer an ellipsoid,

but an unbounded subspace in p-dimensions (e.g. parallel lines for p = 2, and an infinite

cylinder for p = 3). We assume that the rank(X) = n, and partition D =(D1 00 0

)and

V = (V1,V2). Note that D1 is the n× n diagonal matrix of positive eigenvalues, V1 is the

matrix of corresponding eigenvectors, and V2 is the matrix of the p−n eigenvectors spanning

the null space of X. We define γ = (γ1,γ2)T = (D1/21 V T

1 β,VT

2 β)T/√θσ2, so that γ is

21

multivariate Normal with the constraint that γT1 γ1 = 1. Marginally, γ1 = D1/21 V T

1 β/√θσ2

is defined on the n − 1 dimensional unit sphere, and has a Fisher-Bingham distribution

just like the full rank case. However, the reverse transformation β =√θσ2V1D

−1/21 γ1

is defined on the lower n − 1 dimensional ellipsoid within the entire constrained space

{β : βTΣXβ = σ2θ}. For example, if p = 3 and n = 2, this is the slice of the 3-dimensional

ellipsoid with the minimum L2-norm. The problem is that this lower dimensional ellipsoid

may not be able to favor the sparsity or local shrinkage encouraged by Λ. That would

be equivalent to principal components regression using the top n principal components.

To allow for shrinkage of the original coefficients and not the principal components, we

sample γ2 | γ1, which is multivariate Normal, and make the reverse transformation β =√θσ2

(V1D

−1/21 γ1 + V2γ2

). Since V2 spans the null space ofX, β is still in the constrained

region, but offers more flexibility in shrinking βj.

5.3 Gibbs Sampler for Marginal R2-D2

For posterior computation in the marginal case, the following equivalent representation is

useful. The R2-D2 prior (8) is equivalent to

βj | σ2, ψj, φj, ω ∼ N(0, ψjφjωσ2/2), ψj ∼ Exp(1/2),

φ ∼ Dir(aπ, · · · , aπ), ω | ξ ∼ Ga(a, ξ), ξ ∼ Ga(b, 1), (13)

where Exp(δ) denotes the exponential distribution with mean δ−1. The Gibbs sampling

procedure is based on (13) with a = paπ. Denote Z ∼ InvGaussian(µ, λ), if π(z) =

λ1/2(2πz3)−1/2

exp{−λ(z − µ)2/(2µ2z)}. Denote Z ∼ giG(χ, ρ, λ0), the generalized inverse

Gaussian distribution (Seshadri 1997), if π(z) ∝ zλ0−1exp{−(ρz + χ/z)/2}.

The Gibbs sampling procedure is as follows:

(a) Set initial values for β, σ2,ψ,φ, and ω.

(b) Sample β | ψ,φ, ω, σ2,Y ∼ N(µ, σ2V ), where µ = V XTY = (XTX+S−1)−1(XTY ),

V = (XTX + S−1)−1, S = diag{ψ1φ1ω/2, · · · , ψpφpω/2}, X = (x1, · · · ,xn)T , and

Y = (Y1, · · · , Yn)T .

22

(c) Sample σ2 | β,ψ,φ, ω,Y ∼ IG(a1 + (n + p)/2, b1 + (βTS−1β + (Y −Xβ)T (Y −

Xβ))/2).

(d) Sample ψ | β,φ, ω, σ2. Draw ψj−1 ∼ InvGaussian(µj =

√σ2φjω/2/|βj|, λ = 1), then

take the reciprocal to get ψj.

(e) Sample ω | β,ψ,φ, ξ, σ2 ∼ giG(χ =p∑j=1

2β2j /(σ

2ψjφj), ρ = 2ξ, λ0 = a− p/2).

(f) Sample ξ | ω ∼ Ga(a+ b, 1 + ω).

(g) Sample φ | β,ψ, ξ, σ2. Motivated by Bhattacharya et al. (2015), if a = paπ, one can

draw T1, · · · , Tp independently with Tj ∼ giG(χ = 2β2j /(σ

2ψj), ρ = 2ξ, λ0 = aπ−1/2).

Then set φj = Tj/T with T =p∑j=1

Tj.

(h) Repeat steps (b)-(g) until convergence.

6 Simulation Study

We conduct a simulation study to compare the proposed approach with other Bayesian

regression models. In each setting, 200 datasets are simulated from the homoscedastic

linear model (1), with sample size n = 60, and the number of predictors p varying in

p ∈ {50, 100, 500, 2000}. Larger sample sizes were also investigated with the compar-

isons remaining similar. The covariates xi, i = 1, · · · , n, are generated from multivari-

ate normal distribution with mean zero, and correlation matrix of autoregressive struc-

ture AR(1) with correlation ρ =0.5 or 0.9. For the regression coefficients β, we set

β = (0T10,B1T ,0T30,B2

T ,0Tp−50)T with 0k representing the zero vector of length k, and B1

and B2 each of length 5 nonzero elements. The fractions of true coefficients with exactly

zero values are 80%, 90%, 98%, and 99.5% for p ∈ {50, 100, 500, 2000}, respectively, and

the remaining 20%, 10%, 2%, and 0.5% nonzero elements B1 and B2 were independently

generated from two scenarios: (i) a Student t distribution with 3 degrees of freedom to give

heavy tails; (ii) a Uniform(0,1) distribution to give weaker signals. For scenario (i), we set

the error variance to produce a Signal-to-Noise Ratio (SNR) of 9, yielding σ2 = 10/3. Note

23

that since the generated coefficients have expectation of zero, the SNR does not depend

on the correlation structure of the covariates. For scenario (ii), the nonzero expectation

of the Uniform(0,1) then entails a different SNR for the 2 different correlation setups. For

this scenario, we set σ2 = 6 to yield SNR ≈ 0.5 and 0.7 in the ρ = 0.5 and 0.9 cases,

respectively.

To implement both the marginal and conditional R2-D2 priors, as discussed as a default

choice in Section 4.2.4, we set b = 0.5 to yield Cauchy-like tails, and aπ = C/(pb/2n nrb/2 log n)

with choosing the arbitrary constants C and r to be 1. We also set b = 0.1 to give

heavier tails (then setting aπ in the same manner). The results for b = 0.1 were similar

to b = 0.5 and thus not shown, and we recommend as a default to use b = 0.5 as a fully

automatic approach. For the conditional R2-D2, the choices of ν and µ in the Gamma prior

are based on the recommendations for implementation of Normal-Gamma priors given in

Griffin & Brown (2010). The choice is based on the degree of sparsity, which we choose as

min{n, 0.1p} so that we set the expected number of non-zero coefficients to be 10% of the

total, or the sample size, whichever is smaller. Details of how ν and µ relate to the sparsity

level are given in the Appendix.

The comparisons are made to some current state-of-the-art global-local priors: Horse-

shoe, Horseshoe+, Normal-Beta Prime and Dirichlet-Laplace.

For the Horseshoe, Dirichlet-Laplace, and Normal-Beta Prime, the implementation is

done via the R packages horseshoe, dlbayes, and NormalBetaPrime, respectively. The

Horseshoe+ is implemented through Stan in R using the code provided by the author of

Bhadra et al. (2016). The proposed R2-D2 approaches are implemented in R, based on the

discussed MCMC sampling.

In all cases, 10, 000 samples are collected with the first 5, 000 samples discarded as

burn-in.

Estimation error and AUC. The average sum of squared error corresponding to the

posterior mean across the 200 replicates is provided in Table 2 for simulation setting i for

p =100 and 500 and in Table 3 for simulation setting ii. In addition, the averaged area

under the Receiver-Operating Characteristic (ROC) curve (AUC) based on the posterior

24

t-statistic, i.e., the ratio of the posterior mean and posterior standard deviation, is also

given to offer further evaluation of the reliability of the posterior inference on the coeffi-

cients. Larger AUC signifies that the method tends to give posterior intervals further away

from zero for the true non-zero coefficients and intervals that are concentrated closer to

zero for the irrelevant variables. To perform variable selection, thresholding the posteriors

either marginally or jointly (Bondell & Reich 2012) is typical, and thus higher AUC would

represent more accurate variable selection.

Scenario 1 [non-zero coefficients from t3]

ρ = 0.5 ρ = 0.9

p=100 p=500 p=100 p=500

SSE AUC SSE AUC SSE AUC SSE AUC

Horseshoe 148 (6.4) 64 236 (13.1) 61 186 (9.2) 67 205 (11.2) 68

Horseshoe+ 146 (5.8) 64 215 (11.4) 62 190 (9.7) 67 212 (11.4) 69

Normal-BetaPrime 155 (5.7) 64 578 (20.1) 59 183 (8.1) 67 1399 (62.1) 57

Dirichlet-Laplace 166 (7.8) 67 187 (8.1) 54 197 (11.6) 74 209 (10.9) 59

R2-D2 - Conditional 146 (5.6) 66 179 (7.9) 62 174 (8.1) 73 192 (11.2) 72

R2-D2 - Marginal 142 (5.5) 65 166 (6.5) 64 179 (8.7) 72 188 (10.1) 73

Table 2: Average sum of squared error (SSE) and average area under the Receiver-Operating

Characteristic curve (AUC), based on 200 simulated datasets. All values multiplied by 10 for

readability. For SSE, standard errors are included in parentheses. For AUC, all standard errors

are in the range of 0.5− 0.7, so omitted to save space. Note that smaller is better for SSE, while

larger is better for AUC. Best performance in each column is highlighted in bold for reference.

It is clear from Table 2 and 3 that the proposed approaches, both conditional and

marginal R2-D2 prior, outperform the existing methods in nearly every case, both in terms

of the estimation error, and the area under the ROC curve. This is particularly apparent

in the cases of p = 500. The Dirichlet-Laplace performs well in some of the cases with

p = 100, but then performs significantly worse with p = 500, particularly in AUC. Both

the conditional and marginal versions of R2-D2 prior perform similarly, with slightly better

performance of the conditional version in the lower dimensional case, but slightly better

performance of the marginal version in the case of p = 500. Overall, the proposed approach

25

Scenario 2 [non-zero coefficients from U(0,1)]

ρ = 0.5 ρ = 0.9

p=100 p=500 p=100 p=500

SSE AUC SSE AUC SSE AUC SSE AUC

Horseshoe 44.1 (1.9) 62 59.9 (4.8) 58 41.7 (2.2) 73 39.8 (3.2) 58

Horseshoe+ 47.0 (2.2) 62 79.2 (8.1) 61 47.6 (4.2) 73 44.9 (4.4) 63

Normal-BetaPrime 62.4 (2.4) 62 437.1 (14.2) 57 42.5 (1.7) 70 1203 (55.1) 59

Dirichlet-Laplace 32.4 (0.7) 64 56.3 (5.7) 51 27.3 (0.6) 79 41.3 (1.2) 64

R2-D2 - Conditional 34.5 (1.2) 66 48.2 (2.9) 59 26.3 (0.6) 82 38.1 (1.7) 79

R2-D2 - Marginal 33.7 (1.0) 64 41.0 (1.3) 65 31.8 (1.2) 76 34.5 (1.0) 82

Table 3: Average sum of squared error (SSE) and average area under the Receiver-Operating

Characteristic curve (AUC), based on 200 simulated datasets. All values multiplied by 10 for

readability. For SSE, standard errors are included in parentheses. For AUC, all standard errors

are in the range of 0.5− 0.7, so omitted to save space. Note that smaller is better for SSE, while

larger is better for AUC. Best performance in each column is highlighted in bold for reference.

gives a much improved result over the existing approaches. This is to be anticipated based

on the theoretical results on the concentration and tail behaviors, which we now examine.

To better understand the relative performance of the methods, Table 4 shows the av-

erage SSE partitioned according to the value of the true β at βj = 0, |βj| ∈ (0, 0.5], and

|βj| > 0.5, j = 1, · · · , p. This allows us to view a more detailed performance of the ap-

proaches in their behavior on the zero, small non-zero, and larger non-zero coefficients,

respectively. The table shows the results for ρ = 0.5 and p = 100 for setting i, but the

decomposed SSE is similar in the other cases as well.

From the breakdown provided in Table 4, we see the differences between the approaches,

and the theoretical results on the concentration at zero and tail behavior (Table 1) really

show up strongly here. Since the Dirichlet-Laplace has high concentration around zero,

but it has tails that are much lighter than the others, we see this translate into really small

error on the zero and small coefficients, but extremely large error on the large coefficients.

Meanwhile, the Horseshoe and Horseshoe+ having less concentration around zero leads

to poorer performance at estimation of the actual zeros. Note that this also explains the

26

β = 0 |β| ∈ (0, 0.5] |β| > 0.5 Total

Horseshoe 21 24 103 148

Horseshoe+ 18 22 106 146

Normal-BetaPrime 35 38 82 155

Dirichlet-Laplace 4 6 156 166

R2D2 - Conditional 14 18 114 146

R2D2 - Marginal 10 13 119 142

Table 4: Average sum of squared error broken down by zero coefficients (β = 0), small

coefficients (|β| ∈ (0,0.5]), large coefficients (|β| >0.5]), and Total. Results are based on

the 200 datasets from Table 2 with p = 100 and ρ = 0.5.

reason for the Horseshoe and Horseshoe+ having worse AUC, as it does not push the zeros

close enough to zero in order to distinguish them from the small coefficients. Note that

the R2-D2 was set with b = 0.5 giving it the Cauchy-like tails as is the case with the

Horseshoe. Thus we see similar estimation ability for the larger coefficients. Overall, the

proposed R2-D2 approach achieves a strong performance in both regions simultaneously as

anticipated by the theory.

Credible Interval Coverage. We also examine the coverage properties of using 95%

marginal credible intervals for each approach. For this same scenario, table 5 reports

the average width of the intervals, the proportion of coverage, the Specificity, and the

Sensitivity. This gives a broader picture of the posterior inference properties. Although,

there is no promise of 95% Frequentist coverage by the 95% intervals, we see that in this

case of n = 60 and p = 100, the coverage of most of the approaches are close to 95%, with

the proposed approaches and the Horseshoe+ being almost right on the value, while the

Horseshoe and DL are a bit lower than the target, and the NBP being a bit large. For

the n = 60 case, this is a very difficult problem even when p = 100, and we see that the

sensitivity (power) for all approaches are quite low due to the intervals containing zero

even for the majority of the true non-zeros. The sensitivity for the R2-D2 approaches

are highest, along with the Horseshoe+ and Normal-Beta Prime. However, in order to do

so, the Horseshoe+ and Normal-Beta Prime intervals are quite wide in comparison, thus

27

showing that our proposed approaches look to have good posterior inference properties in

addition to their outstanding performance in estimation and variable importance ordering.

Avg Width Coverage Sensitivity Specificity

Horseshoe 0.81 0.936 0.190 0.999

Horseshoe+ 1.23 0.953 0.310 0.999

Normal-BetaPrime 1.83 0.972 0.301 0.999

Dirichlet-Laplace 0.91 0.939 0.150 1

R2D2 - Conditional 0.98 0.947 0.350 0.999

R2D2 - Marginal 1.11 0.948 0.290 1

Table 5: Coverage properties of 95% marginal posterior credible intervals. Results are

based on the 200 datasets from Table 2 with p = 100 and ρ = 0.5.

Higher dimensional setting. Table 6 shows the results for p = 2000 and n = 60 for

scenario i with ρ = 0.5. Other settings are similar and hence not shown. Table 6 shows the

estimation error and AUC as before. We also include coverage of the 95% intervals, and

also the coverage on only the non-zero coefficients. For estimation and AUC, we again see

that the performance of the proposed R2-D2 approaches perform very well in this higher

dimensional case.

In terms of coverage of the 95% intervals, as shown in van der Pas et al. (2017b)

(Theorems 1 and 2), for the HS credible intervals in the special case of the Normal means

model, the coverage will either go to zero or one depending on the size of the true coefficient.

We see that in this case of p = 2000 and n = 60, all methods have overall coverage of

nearly 100%. However, stark differences appear in looking at the coverage on the non-zero

coefficients only (of which there are only 10 out of 2000 here). We see that the HS, in

particular, almost never covers these non-zero values, with coverage of 1.5% (HS) or 5%

(HS+). This is anticipated by the results of van der Pas et al. (2017b) given that it attempts

to adapt to the sparsity, thus covering the zeros, at the expense of the non-zeros. We see

that the other methods do better at coverage of the non-zeros here (with the marginal

R2-D2 way up at 47%). We see in this higher dimensional case, the proposed approaches

exhibiting more stability, both in estimation error, and in the coverage.

28

p = 2000, n = 60 [non-zero coefficients from t3]

SSE AUC Coverage Coverage on non-zeros

Horseshoe 249 (9.6) 44 0.995 0.015

Horseshoe+ 233 (8.9) 51 0.996 0.050

Normal-BetaPrime 468 (15.7) 61 0.995 0.240

Dirichlet-Laplace 240 (8.7) 56 0.996 0.165

R2D2 - Conditional 201 (7.1) 65 0.996 0.271

R2D2 - Marginal 196 (6.9) 64 0.997 0.466

Table 6: Simulation setting 1 with p = 2000, n = 60, and ρ = 0.5. Average sum of

squared error (with standard error), area under the ROC curve, overall coverage of 95%

credible intervals on all 2000 coefficients, and coverage of the 95% credible intervals on the

10 non-zero coefficients only. Results are based on 200 datasets.

7 Data Examples

We study the predictive performance of the posterior generated by the R2-D2 prior through

a variety of real examples which exhibit varying structures. These three data sets have

many more parameters than observations, and have very different correlation structures.

The Cereal data consists of starch content measurements from 15 observations with 145

infrared spectra measurements as predictors. The data is provided with the chemometrics

R package. The Cookie data arises from an experiment testing the near-infrared (NIR)

spectroscopy of biscuit dough in which the fat content is measured on 72 samples, with 700

NIR spectra measurements as predictors. The data was generated in the experiment by

Osborne et al. (1984), and is available in the ppls R package. The Multidrug data are from

a pharmacogenomic study investigating the relationship between the drug concentration

(at which 50% growth is inhibited for a human cell line) and expression of the adenosine

triphosphate binding cassette transporter (Szakacs et al. 2004). The data consists of 853

drugs as predictors, 60 samples of human cell lines using the ABCA3 transporter as the

response, and is available in the mixOmics R package. In the statistics literature, the

Cereal and Multidrug data were both studied by Polson & Scott (2012) and Griffin &

Brown (2013); and the Cookie data was studied by Brown et al. (2001) and Ghosh &

29

Ghattas (2015).

The three datasets nicely represent 3 different correlation structures among the predictor

variables. The Multidrug covariates have low to moderate pairwise correlations, the Cookie

covariates are highly positively correlated, and the Cereal covariates have a wide range that

are both positively and negatively correlated. Figure 3 shows histograms of all pairwise

correlations for each of the data sets.

Figure 3: Histograms of pairwise correlations among the predictor variables for the three

datasets.

Cereal Cookie Multidrug

n 15 72 60

p 145 700 853

Horseshoe 14.1 (1.0) 8.3 (0.2) 15.6 (0.5)

Horseshoe+ 14.2 (0.9) 9.1 (0.3) 15.0 (0.5)

Normal-BetaPrime 25.4 (1.5) 11.9 (0.4) 18.4 (0.6)

Dirichlet-Laplace 15.1 (1.2) 12.1 (0.5) 12.2 (0.3)

R2-D2 - Conditional 12.2 (0.5) 9.8 (0.3) 12.7 (0.3)

R2-D2 - Marginal 12.1 (0.5) 8.1 (0.2) 12.6 (0.3)

Table 7: Average mean square prediction error (and standard errors) for each of the data

examples.

We randomly split each data set into a training and testing sets to evaluate the out-

of-sample predictive performance. For each data set, 75% of the observations were used

for training, and the remaining 25% were used for estimating the mean squared prediction

30

error (MSPE) between the test sample and predictions. This process was repeated to create

200 data sets for each example. The same 5 approaches as in the simulations were used on

each of the datasets. Due to the various correlation structures and dimensions, this gives

a range of potential data structures for comparison of the approaches.

The average MSPE results are given in Table 7. We see that the R2-D2 approaches

consistently outperform the existing methods across the datasets. Note that the Horseshoe

and Horseshoe+ perform well on the Cookie data, but are significantly worse than the

others on the other 2 datasets.

8 Discussion

In this paper, we propose a shrinkage prior motivated by assuming a prior on R2. The

prior exhibits polynomial behavior both around the origin and in the tails and compares

favorably with other global-local shrinkage priors. Although the motivation of our R2-D2

prior is via starting with a prior on R2, the resultant prior is simply a member of the

class of global-local shrinkage priors, which can then be applied directly to other models,

as with other priors. The prior is represented by a hierarchical scale mixture of normals,

which can then be implemented in a generalized linear model or other regression setting.

The hyperparameters would no longer have the interpretation as parameters of a Beta

prior on the R2 of the model. But the form and properties of the resulting prior, such as

tail behavior and concentration around zero remain directly useable as is with the other

global-local priors.

There is scope for further advances in algorithms for MCMC sampling for these poste-

riors, just as there has been for sampling from other prior proposals, as for example, key

sampling approaches for the high-dimensional case as in Bhattacharya et al. (2016).

References

Armagan, A., Clyde, M. & Dunson, D. B. (2011), Generalized beta mixtures of Gaussians,

in ‘Advances in neural information processing systems’, pp. 523–531.

31

Armagan, A., Dunson, D. B. & Lee, J. (2013), ‘Generalized double Pareto shrinkage’,

Statistica Sinica 23(1), 119.

Armagan, A., Dunson, D. B., Lee, J., Bajwa, W. U. & Strawn, N. (2013), ‘Posterior

consistency in linear models under shrinkage priors’, Biometrika 100(4), 1011–1018.

Bai, R. & Ghosh, M. (2019), ‘Large-scale multiple hypothesis testing with the normal-beta

prime prior’, Statistics 53(6), 1210–1233.

Bateman, H. (1953), Higher Transcendental Functions [Volumes I-III], Vol. 1, McGraw-Hill

Book Company.

Bhadra, A., Datta, J., Polson, N. G. & Willard, B. (2016), ‘The horseshoe+ estimator of

ultra-sparse signals’, Bayesian Analysis .

Bhattacharya, A., Chakraborty, A. & Mallick, B. (2016), ‘Fast sampling with gaussian

scale-mixture priors in high-dimensional regression’, Biometrika 103(4), 985–991.

Bhattacharya, A., Pati, D., Pillai, N. S. & Dunson, D. B. (2015), ‘Dirichlet–laplace priors

for optimal shrinkage’, Journal of the American Statistical Association 110(512), 1479–

1490.

Bingham, C. (1974), ‘An antipodally symmetric distribution on the sphere’, The Annals of

Statistics pp. 1201–1225.

Bondell, H. D. & Reich, B. J. (2012), ‘Consistent high-dimensional bayesian variable se-

lection via penalized credible regions’, Journal of the American Statistical Association

107(500), 1610–1624.

Brown, P. J., Fearn, T. & Vannucci, M. (2001), ‘Bayesian wavelet regression on curves with

application to a spectroscopic calibration problem’, Journal of the American Statistical

Association 96(454), 398–408.

Carvalho, C. M., Polson, N. G. & Scott, J. G. (2009), Handling sparsity via the horseshoe,

in ‘International Conference on Artificial Intelligence and Statistics’, pp. 73–80.

32

Carvalho, C. M., Polson, N. G. & Scott, J. G. (2010), ‘The Horseshoe estimator for sparse

signals’, Biometrika 97, 465–480.

Castillo, I., Schmidt-Hieber, J. & van der Vaart, A. (2015), ‘Bayesian linear regression with

sparse priors’, The Annals of Statistics 43(5), 1986–2018.

DLMF (2015), ‘NIST Digital Library of Mathematical Functions’, http://dlmf.nist.gov/,

Release 1.0.10 of 2015-08-07. Online companion to Olver et al. (2010).

URL: http://dlmf.nist.gov/

Fallaize, C. J. & Kypraios, T. (2016), ‘Exact bayesian inference for the bingham distribu-

tion’, Statistics and Computing 26(1-2), 349–360.

Fields, J. L. (1972), ‘The asymptotic expansion of the Meijer G-function’, Mathematics of

Computation pp. 757–765.

George, E. I. & McCulloch, R. E. (1993), ‘Variable selection via Gibbs sampling’, Journal

of the American Statistical Association 88(423), 881–889.

Ghosh, J. & Ghattas, A. E. (2015), ‘Bayesian variable selection under collinearity’, The

American Statistician 69(3), 165–173.

Griffin, J. E. & Brown, P. J. (2010), ‘Inference with normal-gamma prior distributions in

regression problems’, Bayesian Analysis 5(1), 171–188.

Griffin, J. E. & Brown, P. J. (2013), ‘Some priors for sparse regression modelling’, Bayesian

Analysis 8(3), 691–702.

Haario, H., Saksman, E. & Tamminen, J. (2001), ‘An adaptive metropolis algorithm’,

Bernoulli pp. 223–242.

Hans, C., Dobra, A. & West, M. (2007), ‘Shotgun stochastic search for large p regression’,

Journal of the American Statistical Association 102(478), 507–516.

Ishwaran, H. & Rao, J. S. (2005), ‘Spike and slab variable selection: Frequentist and

Bayesian strategies’, Annals of Statistics pp. 730–773.

33

Johnson, N., Kotz, S. & Balakrishnan, N. (1995), ‘Continuous univariate distributions,

volume 2. john wiley&sons’, Inc., 75.

Kent, J. T. (1982), ‘The fisher-bingham distribution on the sphere’, Journal of the Royal

Statistical Society. Series B (Methodological) pp. 71–80.

Kent, J. T., Ganeiber, A. M. & Mardia, K. V. (2013), ‘A new method to simulate the

bingham and related distributions in directional data analysis with applications’, arXiv

preprint arXiv:1310.8110 .

Miller, P. D. (2006), Applied asymptotic analysis, Vol. 75, American Mathematical Soc.

Mitchell, T. J. & Beauchamp, J. J. (1988), ‘Bayesian variable selection in linear regression’,

Journal of the American Statistical Association 83(404), 1023–1032.

Murray, I., Ghahramani, Z. & MacKay, D. (2006), Mcmc for doubly-intractable distribu-

tions, in ‘Proceedings of the 22nd annual conference on uncertainty in artificial intelli-

gence’, AUAI Press, pp. 359–366.

Narisetty, N. N. & He, X. (2014), ‘Bayesian variable selection with shrinking and diffusing

priors’, The Annals of Statistics 42(2), 789–817.

Olver, F. W. J., Lozier, D. W., Boisvert, R. F. & Clark, C. W., eds (2010), NIST Hand-

book of Mathematical Functions, Cambridge University Press, New York, NY. Print

companion to DLMF (2015).

Ormerod, J. T., You, C. & Muller, S. (2017), ‘A variational bayes approach to variable

selection’, Electronic Journal of Statistics 11(2), 3549–3594.

Osborne, B. G., Fearn, T., Miller, A. R. & Douglas, S. (1984), ‘Application of near infrared

reflectance spectroscopy to the compositional analysis of biscuits and biscuit doughs’,

Journal of the Science of Food and Agriculture 35(1), 99–105.

Polson, N. G. & Scott, J. G. (2010), ‘Shrink globally, act locally: Sparse bayesian regular-

ization and prediction’, Bayesian Statistics 9, 501–538.

34

Polson, N. G. & Scott, J. G. (2012), ‘Local shrinkage rules, levy processes and regularized

regression’, Journal of the Royal Statistical Society: Series B (Statistical Methodology)

74(2), 287–311.

Raskutti, G., Wainwright, M. J. & Yu, B. (2011), ‘Minimax rates of estimation for high-

dimensional linear regression over `q-balls’, IEEE transactions on information theory

57(10), 6976–6994.

Rockova, V. & George, E. I. (2014), ‘Emvs: The em approach to bayesian variable selection’,

Journal of the American Statistical Association 109(506), 828–846.

Rockova, V. & George, E. I. (2018), ‘The spike-and-slab lasso’, Journal of the American

Statistical Association 113(521), 431–444.

Scott, S. L. & Varian, H. R. (2014), ‘Predicting the present with bayesian structural time

series’, International Journal of Mathematical Modelling and Numerical Optimisation

5(1-2), 4–23.

Seshadri, V. (1997), ‘Halphen’s laws’, Encyclopedia of statistical sciences .

Song, Q. & Liang, F. (2017), ‘Nearly optimal bayesian shrinkage for high dimensional

regression’, arXiv preprint arXiv:1712.08964 .

Szakacs, G., Annereau, J.-P., Lababidi, S., Shankavaram, U., Arciello, A., Bussey, K. J.,

Reinhold, W., Guo, Y., Kruh, G. D. & Reimers, M. (2004), ‘Predicting drug sensitivity

and resistance: profiling abc transporter genes in cancer cells’, Cancer cell 6(2), 129–137.

van der Pas, S. L., Kleijn, B. J. & van der Vaart, A. W. (2014), ‘The horsveshoe estimator:

Posterior concentration around nearly black vectors’, Electronic Journal of Statistics

8(2), 2585–2618.

van der Pas, S., Szabo, B. & van der Vaart, A. (2017a), ‘Adaptive posterior contraction

rates for the horseshoe’, Electronic Journal of Statistics 11(2), 3196–3225.

35

van der Pas, S., Szabo, B. & van der Vaart, A. (2017b), ‘Uncertainty quantification for the

horseshoe (with discussion)’, Bayesian Analysis 12(4), 1221–1274.

Zellner, A. & Siow, A. (1980), ‘Posterior odds ratios for selected regression hypotheses’,

Trabajos de estadıstica y de investigacion operativa 31(1), 585–603.

Zhang, Y. & Bondell, H. D. (2018), ‘Variable selection via penalized credible regions with

dirichlet–laplace global-local shrinkage priors’, Bayesian Analysis 13(3), 823–844.

Zhou, M. & Carin, L. (2015), ‘Negative binomial process count and mixture modeling’,

Pattern Analysis and Machine Intelligence, IEEE Transactions on 37(2), 307–320.

Zwillinger, D. (2014), Table of integrals, series, and products, Elsevier.

36

A Appendix: Technical Details

Definition of the Meijer G-function.

A general definition of the Meijer G-function is given by the following line integral in

the complex plane (Bateman 1953):

Gm,np,q

(z∣∣∣a1,...,apb1,...,bq

)=

1

2πi

∫L

∏mj=1 Γ(bj − s)

∏nj=1 Γ(1− aj + s)∏q

j=m+1 Γ(1− bj + s)∏p

j=n+1 Γ(aj − s)zs ds

where Γ(·) denotes the gamma function and L in the integral represents the path to be

followed while integrating. The definition holds under the following assumptions:

• 0 ≤ m ≤ q and 0 ≤ n ≤ p, where m,n, p and q are integer numbers

• ak − bj 6= 1, 2, 3, . . . for k = 1, 2, . . . , n and j = 1, 2 . . . ,m

• z 6= 0.

Proof of Proposition 1. Derivation of Equation (4): Let γ be uniformly distributed on

the p− 1 dimensional unit sphere. That is,

p(γ) =Γ(p/2)

2πp/21{γTγ = 1

}.

Define θ = R2

1−R2 ∼ BP(a, b). Make the transformation r =√θ:

p(γ, θ) = p(γ)p(θ) =Γ(p/2)

2πp/2B(a, b)θa−1 (1 + θ)−a−b 1

{γTγ = 1

},

and p(γ, r) =Γ(p/2)

2πp/2B(a, b)r2a−2

(1 + r2

)−a−b1{γTγ = 1

}|2r|

=Γ(p/2)

πp/2B(a, b)r2a−1

(1 + r2

)−a−b1{γTγ = 1

}.

Make the transformation z = rγ. The Jacobian is rp−1 when decomposing a vector (sup-

ported on Rp) to a radius (supported in R+) and a unit direction (supported on the p− 1

unit sphere, Sp−1), so the reciprocal Jacobian is |(zTz)p−12 |−1 = (zTz)

1−p2 . Thus,

p(z) =Γ(p/2)

πp/2B(a, b)(zTz)a−1/2

(1 + zTz

)−a−b × (zTz)1−p2

=Γ(p/2)

πp/2B(a, b)(zTz)a−p/2

(1 + zTz

)−a−b37

Finally, make the transformation β = V D−1/2zσ, so that zTz = βTXTXβ/(σ2n), where

V DV T is the eigendecomposition of XTX/n = ΣX :

p(β | σ2) =Γ(p/2)

πp/2B(a, b)

(βTXTXβ/(σ2n)

)a−p/2 (1 + βTXTXβ/(σ2n)

)−a−b |V D−1/2σ|−1

=Γ (p/2) |ΣX |1/2

B(a, b) πp/2(σ2)−a (

βTΣXβ)a−p/2 (

1 + βTΣXβ/σ2)−(a+b)

.

Proof of Proposition 2. Mixture of normals representation for a ≤ p/2.

Let β | z, w,Σ ∼ N(

0, zwσ2(XTX

)−1),

z ∼ Inverse-Gamma (b, n/2) , and

w ∼ beta (a, p/2− a) .

Define θ = 1−ww

, so θ ∼ BP(p/2 − a, a). To simplify notation let Σ = XTX/σ2 and

ΣX = XTX/n. Then we have

p(β, θ, z | Σ) = (2π)−p/2z−p/2(1 + θ)p/2|Σ|1/2 exp

{−1 + θ

2zβTΣβ

}× nb2−b

Γ(b)z−b−1 exp {−n/2z} Γ(p/2)

Γ(a)Γ(p/2− a)θp/2−a−1(1 + θ)−p/2

=2−p/2−bnb|Σ|1/2Γ(p/2)

πp/2Γ(a)Γ(b)Γ(p/2− a)z−p/2−b−1θp/2−a−1 exp

{−n+ βTΣβ + θβTΣβ

2z

},

and

p(β, z | Σ) =2−

p2−bnb|Σ|1/2Γ(p

2)

πp2 Γ(a)Γ(b)Γ(p

2− a)

z−p2−b−1 exp

{−n+ βTΣβ

2z

}∫θp2−a−1 exp

{−θβ

TΣβ

2z

}dθ

=2−a−bnb|Σ|1/2Γ(p

2)

πp2 Γ(a)Γ(b)

z−a−b−1 exp

{−n+ βTΣβ

2z

}(βTΣβ

)a− p2 .

Hence then

p(β | Σ) =2−a−bnb|Σ|1/2Γ(p/2)

πp/2Γ(a)Γ(b)

(βTΣβ

)a−p/2 ∫z−a−b−1 exp

{−n+ βTΣβ

2z

}dz

=2−a−bnb|Σ|1/2Γ(p/2)

πp/2Γ(a)Γ(b)

(βTΣβ

)a−p/2 (n+ βTΣβ

)−a−b2a+bΓ(a+ b)

=|nΣ|1/2Γ(p/2)

πp/2B(a, b)

(βTΣβ/n

)a−p/2 (1 + βTΣβ/n

)−(a+b)

=Γ (p/2) |ΣX |1/2

B(a, b)πp/2(σ2)−a (

βTΣXβ)a−p/2 (

1 + βTΣXβ/σ2)−(a+b).

38

Proof of Proposition 3. If we let Λ−1 = XTX = nV DV T , then

p(γ) = CX,Λ exp{−γTD−1/2V TV DV TV D−1/2γ/2

}1{γTγ = 1

}=

1

C1{γTγ = 1

}where the constant C =

∫1{γTγ = 1

}dγ = 2πp/2

Γ(p/2), the surface area of the a p− 1 dimen-

sional unit sphere. The rest of the proof is identical to that of Proposition 1 deriving the

distribution in (4). It is clear that β is uniform given the ellipsoid, because it’s an elliptical

distribution, but it also comes from putting a uniform distribution on γ.

Proof of Proposition 4. The proposition follows from below derivations. For ω > 0,

π(ω) =

∫ ∞0

π(ω | ξ)π(ξ)dξ =

∫ ∞0

ξa

Γ(a)ωa−1e−ξω

1

Γ(b)ξb−1e−ξdξ

=1

Γ(a)Γ(b)ωa−1

∫ ∞0

ξa+b−1e−(1+ω)ξdξ

=1

Γ(a)Γ(b)ωa−1 Γ(a+ b)

(1 + ω)a+b

=Γ(a+ b)

Γ(a)Γ(b)

ωa−1

(1 + ω)a+b.

Proof of Proposition 5 . The proposition follows from Lemma IV.3 of Zhou & Carin

(2015): Suppose y and (y1, · · · , yK) are independent with y ∼ Ga(φ, ξ), and (y1, · · · , yK) ∼

Dir(φp1, · · · , φpK), where∑K

k=1 pk = 1. Let xk = yyk, then xk ∼ Ga(φpk, ξ) independently

for k = 1, · · · , K.

Proof of Proposition 6 . The marginal density of β for the R2-D2 prior is

πR2-D2(β)

=Γ(aπ + b)

Γ(aπ)Γ(b)

∫ ∞0

1

2(λ/2)1/2exp{− |β|

(λ/2)1/2} λaπ−1

(1 + λ)aπ+bdλ

=2aπΓ(aπ + b)

Γ(aπ)Γ(b)

∫ ∞0

exp(−|β|x)x2b

(x2 + 2)aπ+bdx. (14)

39

Let µ = |β|, ν = b + 1/2, u2 = 2, and ρ = 1 − aπ − b, since |arg u| < π/2, Reµ > 0, and

Reν > 0, so we have

πR2-D2(β) =2aπΓ(aπ + b)

Γ(aπ)Γ(b)

∫ ∞0

exp(−µx)x2ν−1(x2 + u2)ρ−1 dx

=2aπΓ(aπ + b)

Γ(aπ)Γ(b)

u2ν+2ρ−2

2π1/2Γ(1− ρ)G3,1

1,3

(µ2u2

4

∣∣∣1−ν1−ρ−ν,0, 1

2

)=

2aπΓ(aπ + b)

Γ(aπ)Γ(b)

21/2−aπ

2π1/2Γ(aπ + b)G3,1

1,3

(β2

2

∣∣∣ 12−baπ− 1

2,0, 1

2

)=

1

(2π)1/2Γ(aπ)Γ(b)G3,1

1,3

(β2

2

∣∣∣ 12−baπ− 1

2,0, 1

2

)=

1

(2π)1/2Γ(aπ)Γ(b)G1,3

3,1

(2

β2

∣∣∣ 32−aπ ,1, 1212

+b

)whereG(.) denotes the Meijer G-Function, the second equality follows from equation 3.389.2

in Zwillinger (2014), and the last equality follows from 16.19.1 in DLMF (2015). Proposition

6 follows.

Proof of Theorem 1. For the proof of Theorem 1, we will use the following lemma found

in Miller (2006).

Lemma 1. (Watson’s Lemma) Suppose F (s) =∫∞

0e−stf(t) dt, f(t) = tαg(t) where g(t)

has an infinite number of derivatives in the neighborhood of t = 0, with g(0) 6= 0, and

α > −1. Suppose |f(t)| < Kect for any t ∈ (0,∞), where K and c are independent of t.

Then, for s > 0 and s→∞,

F (s) =n∑k=0

g(k)(0)

k!

Γ(α + k + 1)

sα+k+1+O(

1

sα+n+2).

According to equation (14) in the proof of Proposition 6, we denote F (|β|) ≡ πR2-D2(β),

as follows,

F (|β|) =2aπΓ(aπ + b)

Γ(aπ)Γ(b)

∫ ∞0

exp(−|β|x)x2b

(x2 + 2)aπ+bdx =

∫ ∞0

e−|β|xf(x) dx,

where f(t) = C∗t2b/(t2 + 2)aπ+b ≡ t2bg(t), C∗ = 2aπΓ(aπ + b)/{Γ(aπ)Γ(b)}, and g(t) =

C∗(t2 + 2)−aπ−b with g(t) has an infinite number of derivatives in the neighborhood of

t = 0, with g(0) 6= 0. So the marginal density of R2-D2 prior is the Laplace transforms

40

of f(·). By Watson’s Lemma, since |f(t)| < Kect for any t ∈ (0,∞), where K and c are

independent of t, then as |β| → ∞,

F (|β|) =n∑k=0

g(k)(0)

k!

Γ(2b+ k + 1)

|β|2b+k+1+O(

1

|β|2b+n+2),

and setting n = 2 gives

F (|β|) = C∗{

Γ(2b+ 1)

2aπ+b|β|2b+1+ 0

Γ(2b+ 2)

|β|2b+2− aπ + b)Γ(2b+ 3)

2aπ+b|β|2b+3

}+O(

1

|β|2b+4)

= C∗2−aπ−b{

Γ(2b+ 1)

|β|2b+1− (aπ + b)

Γ(2b+ 3)

|β|2b+3

}+O(

1

|β|2b+4) (15)

= O(1

|β|2b+1).

Hence, when b < 1/2, as |β| → ∞, we have

πR2-D2(β)1β2

=C∗

2aπ+b

{Γ(2b+ 1)

|β|2b−1− (aπ + b)

Γ(2b+ 3)

|β|2b+1+O(

1

|β|2b+2)

}→∞.

Proof of Theorem 2. It is obvious based on the marginal density of the generalized

double Pareto prior.

Proof of Theorem 3. According to 10.25.3 in DLMF (2015), when both ν and z are

real, if z → ∞, then Kν(z) ≈ π1/2(2z)−1/2e−z. Then as |β| → ∞, the marginal density of

the Dirichlet-Laplace prior given in Bhattacharya et al. (2015) satisfies

πDL(β) =1

2(1+a∗)/2Γ(a∗)|β|(a∗−1)/2K1−a∗((2|β|)1/2)

≈ 1

2(1+a∗)/2Γ(a∗)|β|(a∗−1)/2π1/22−3/4|β|−1/4 exp{−

√2|β|}

= C0|β|a∗/2−3/4 exp{−

√2|β|} = O(

|β|a∗/2−3/4

exp{√

2|β|}),

where C0 = π1/22−3/4/{2(1+a∗)/2Γ(a∗)} is a constant value. Furthermore, as |β| → ∞,

πDL(β)

1/β2≈ C0|β|a

∗/2+5/4 exp{−√

2|β|} → 0.

41

Proof of Theorem 4. For the proof of Theorem 4, we use the following lemma from

Fields (1972). Some useful notations used in the below proof: Denote aP = (a1, · · · , ap),

as a vector, similarly, bQ = (b1, · · · , bq), cM = (c1, · · · , cm), and so on. Let Γn(cP − t) =∏pk=n+1 Γ(ck−t), with Γn(cP −t) = 1 when n = p, Γ(cM−t) = Γ0(cM−t) =

∏mk=1 Γ(ck−t),

Γ∗(ai − aN) =∏n

k=1;k 6=i Γ(ai − ak), and

pFq

(aPbQ| w)

=∞∑k=0

Γ(aP + k)Γ(bQ)

Γ(bQ + k)Γ(aP )

wk

k!=∞∑k=0

p∏j=1

Γ(aj + k)q∏j=1

Γ(bj)

q∏j=1

Γ(bj + k)p∏j=1

Γ(aj)

wk

k!.

Lemma 2. (Theorem 1 in Fields (1972)) Given (i) 0 ≤ m ≤ q, 0 ≤ n ≤ p; (ii) ai − bkis not a positive integer for j = 1, · · · , p and k = 1, .s, q; (iii) ai − ak is not an integer for

i, k = 1, · · · , p, and i 6= k; and (iv) q < p or q = p and |z| > 1, we have

Gm,np,q

(z∣∣∣a1,··· ,apb1,··· ,bq

)=

n∑i=1

Γ∗(ai − aN)Γ(1 + bM − ai)z−1+ai

Γn(1 + aP − ai)Γm(ai − bQ)q+1Fp

(1,1+bQ−ai1+aP−ai

∣∣∣∣(−1)q−m−n

z

).

Now to prove Theorem 4, we have from Proposition 6 that, the marginal density of the

R2-D2 prior has πR2-D2(βj) = (2π)−1/2{Γ(aπ)Γ(b)}−1Gm,np,q (z|.) with m = 1, n = 3, p =

3, q = 1, a1 = 3/2− aπ, a2 = 1, a3 = 1/2, b1 = 1/2 + b, and z = 2/β2. Conditions (i)-(iv)

in Lemma 2 are satisfied for |β| near 0, since 0 < aπ < 1/2. Denote

C∗0 = (2π)−1/2(Γ(aπ)Γ(b))−1,

C∗1 = C∗0Γ(1

2− aπ)Γ(1− aπ)Γ(aπ)Γ(

1

2+ aπ) > 0,

C∗2 = C∗0Γ(aπ −1

2)Γ(

1

2)Γ(

1

2)Γ(

3

2− aπ) < 0,

C∗3 = C∗0Γ(aπ − 1)Γ(−1

2)Γ(

3

2)Γ(2− aπ) > 0,

U1(β2) =∞∑k=0

(−1)ku1(k, β2), u1(k, β2) =Γ(aπ + b+ k)

Γ(12

+ aπ + k)Γ(aπ + k)

(β2

2)k+aπ−1/2

k!,

U2(β2) =∞∑k=0

(−1)ku2(k, β2), u2(k, β2) =Γ(1

2+ b+ k)

Γ(32− aπ + k)Γ(1

2+ k)

(β2

2)k

k!,

and U3(β2) =∞∑k=0

(−1)ku3(k,β2), u3(k,β2) =Γ(1 + b+ k)

Γ(2− aπ + k)Γ(32

+ k)

(β2

2)k+1/2

k!.

42

Then

πR2-D2(β) = C∗0G1,33,1

(2

β2

∣∣∣ 32−aπ ,1, 1212

+b

)= C∗0

3∑i=1

Γ∗(ai − aN)Γ(1 + bM − ai)Γ3(1 + aP − ai)Γ1(ai − bQ)

(2

β2)−1+ai

2F3

(1,1+bQ−ai1+aP−ai

∣∣∣∣−β2

2

)

= C∗0

3∑i=1

3∏k=1;k 6=i

Γ(ai − ak)Γ(1 + b1 − ai)

3∏k=3+1

Γ(1 + ak − ai)1∏

k=1+1

Γ(ai − bk)(

2

β2)−1+ai

2F3

(1,1+b1−ai1+aP−ai

∣∣∣∣−β2

2

)

= C∗0

3∑i=1

3∏

k=1;k 6=iΓ(ai − ak)Γ(1 + b1 − ai)

1(

2

β2)−1+ai×

∞∑k=0

Γ(1 + k)Γ(1 + b1 − ai + k)3∏j=1

Γ(1 + aj − ai)

3∏j=1

Γ(1 + aj − ai + k)Γ(1)Γ(1 + b1 − ai)

(−β2

2)k

k!

= C∗0 ×{

Γ(1

2− aπ)Γ(1− aπ)

∞∑k=0

Γ(aπ + b+ k)Γ(12

+ aπ)Γ(aπ)

Γ(12

+ aπ + k)Γ(aπ + k)

(−1)k(β2

2)k+aπ−1/2

k!

+Γ(aπ −1

2)Γ(

1

2)Γ(

1

2)∞∑k=0

Γ(12

+ b+ k)Γ(32− aπ)

Γ(32− aπ + k)Γ(1

2+ k)

(−1)k(β2

2)k

k!

+Γ(aπ − 1)Γ(−1

2)Γ(

3

2)∞∑k=0

Γ(1 + b+ k)Γ(2− aπ)

Γ(2− aπ + k)Γ(32

+ k)

(−1)k(β2

2)k+1/2

k!

}

≡ C∗0 ×{

Γ(1

2− aπ)Γ(1− aπ)Γ(aπ)Γ(

1

2+ aπ)U1(β2)+

Γ(aπ −1

2)Γ(

1

2)Γ(

1

2)Γ(

3

2− aπ)U2(β2) + Γ(aπ − 1)Γ(−1

2)Γ(

3

2)Γ(2− aπ)U3(β2)

}≡ C∗1U1(β2) + C∗2U2(β2) + C∗3U3(β2).

For fixed β near the neighborhood of zero, u1(k, β2), u2(k, β2), and u3(k, β2) are all mono-

tone decreasing, and converge to zero as k →∞. Thus, by alternating series test, U1(β2),

U2(β2), and U3(β2) all converge. Also, we have

C0|β|2aπ−1 − C1|β|2aπ+1 = u1(0, β2)− u1(1, β2) ≤ U1(β2) ≤ u1(0, β2) = C0|β|2aπ−1

43

C2 − C3|β|2 = u2(0, β2)− u2(1, β2) ≤ U2(β2) ≤ u2(0, β2) = C2

C4|β| − C5|β|3 = u3(0, β2)− u3(1, β2) ≤ U3(β2) ≤ u3(0, β2) = C4|β|

where C0, C1, C2, C3, and C4 are all positive constants. So given that |β| in the neigh-

borhood of zero and aπ ∈ (0, 12), C∗1(C0|β|2aπ−1−C1|β|2aπ+1)+C∗2C2 +C∗3(C4|β|−C5|β|3) ≤

πR2-D2(β) ≤ C∗1C0|β|2aπ−1 + C∗2(C2 − C3|β|2) + C∗3C4|β|, then πR2-D2(β) = O(|β|2aπ−1).

Proof of Theorem 5. According to 10.30.2 in DLMF (2015), when ν > 0, z → 0 and z

is real, Kν(z) ≈ Γ(ν)(z/2)−ν/2. So given 0 < a∗ < 1 and |β| → 0,

πDL(β) =|β|(a∗−1)/2K1−a∗((2|β|)1/2)

2(1+a∗)/2Γ(a∗)

≈|β|(a∗−1)/2 1

2Γ(1− a∗)( (2|β|)1/2

2)a

∗−1

2(1+a∗)/2Γ(a∗)= C|β|a∗−1,

where C = Γ(1− a∗)/21+a∗Γ(a∗) is a constant value. Theorem 5 follows then.

Proof of Theorem 6. The proof of this theorem depends on Theorem 1 in Armagan,

Dunson, Lee, Bajwa & Strawn (2013). We will restate this theorem in the following Lemma.

Lemma 3. Under Assumptions (A1) and (A2), the posterior of βn under prior πn(βn) is

strongly consistent, that is, for any ε > 0, as n→∞,

Prβ0n

{πn(βn : ||βn − β0

n|| > ε | Yn)→ 0}

= 1

if

πn(βn : ||βn − β0n|| <

nr/2) > exp(−dn)

for all 0 < ∆ < ε2dmin/(48dmax) and 0 < d < ε2dmin/(32σ2) − 3∆dmax/(2σ2) and some

r > 0.

Denote the estimated set of non-zero coefficients as An = {j : βnj 6= 0, j = 1, · · · , pn}.

Given the R2-D2 prior (9), we need to calculate the probability assigned to the region

{β : ||β − β0|| < tn} where tn = ∆/nr/2 with 0 < ∆ < ε2dmin/(48dmax).

44

πn(βn : ||βn − β0n|| < tn) = πn

βn :∑j∈An

(βnj − β0nj)

2 +∑j 6∈An

β2nj < t2n

≥ πn

βj∈Annj :∑j∈An

(βnj − β0nj)

2 <qnt

2n

pn

× πnβj 6∈Annj :

∑j 6∈An

β2nj <

(pn − qn)t2npn

∏j∈An

{πn

(βnj : |βnj − β0

nj | <tn√pn

)}πn(βj 6∈Annj : β2nj <

t2npn

at least for one j

)

=

∏j∈An

{πn

(β0nj −

tn√pn

< βnj < β0nj +

tn√pn

)}[1−{πn

(βj 6∈Annj : β2

nj ≥t2npn

)}pn−qn]

∏j∈An

2tn√pnπR2-D2

(|β0nj |+

tn√pn

)[

1−

{πn

(βj 6∈Annj : |βnj |b ≥

tbn

pb/2n

)}pn−qn]

∏j∈An

2tn√pnπR2-D2

(supj∈An

|β0nj |+

tn√pn

)[

1−

{pb/2n E(|βnj |b)

tbn

}pn−qn]

{2tn√pnπR2-D2

(supj∈An

|β0nj |+

tn√pn

)}qn×

[1−

{pb/2n E(|βnj |b)

tbn

}pn−qn],

where πR2-D2(·) is the marginal density function of βnj, symmetric and decreasing when

the support is positive, and the last but one “≥” is directly got from Markov’s inequality.

Using the hierarchical form of the R2-D2 prior in (10), for any b > 0, conditional

expectations yield

E(|βnj|b) = E[E{E(|βnj|b | λj) | ξ}] = Eξ

[Eλj |ξ

{Γ(b+ 1)

(2/λj)b/2| ξ}]

=bΓ( b

2)Γ(aπ + b

2)

2b/2Γ(aπ).

For the R2-D2 prior, from equation (14), it follows that the marginal density is a

decreasing function on the positive support. Then assumptions (A1) – (A4), together with

the tail approximation of the marginal density as in the proof of Theorem 1, i.e., equation

(15), we have

πR2-D2( supj∈An|β0nj|+

tn√pn

) ≥ πR2-D2(En +tn√pn

) ≥ Γ(aπ + b)

Γ(aπ)Γ(b)2−b

Γ(2b+ 1)

(En + ∆nr/2√pn

)2b+1.

Considering the fact that Γ(a) = a−1 − γ0 + O(a) for a being near zero with γ0 the

45

Euler-Mascheroni constant, we have

πn(βn : ||βn − β0n|| <

nr/2)

2∆

nr/2√pn

Γ(aπ + b)

Γ(aπ)Γ(b)2−b

Γ(2b+ 1)

(En + ∆nr/2√pn

)2b+1

qn 1−

{pb/2n nrb/2bΓ( b2)Γ(aπ + b

2)

∆b2b/2Γ(aπ)

}pn−qn≥

2∆

nr/2√pn

Γ(aπ + b)aπΓ(b)

2−bΓ(2b+ 1)

(En + ∆nr/2√pn

)2b+1

qn 1−

{pb/2n nrb/2bΓ( b2)Γ(aπ + b

2)aπ

∆b2b/2

}pn−qn .Taking the negative logarithm of both sides of the above formula, and letting aπ =

C/(pb/2n nrb/2 log n), we have

− log πn(βn : ||βn − β0n|| <

nr/2)

≤ −qn log

{2∆CΓ(aπ + b)2−bΓ(2b+ 1)

nr/2√pnp

b/2n nrb/2 log nΓ(b)

}+ qn(2b+ 1) log(En +

nr/2√pn

)

−qn log

[1−

{pb/2n nrb/2bΓ( b

2)Γ(aπ + b

2)C

∆b2b/2pb/2n nrb/2 log n

}pn−qn]

= −qn log

{2∆CΓ(aπ + b)2−bΓ(2b+ 1)

Γ(b)

}+ qn(2b+ 1) log(En +

nr/2√pn

)

−qn log

[1−

{bΓ( b

2)Γ(aπ + b

2)C

∆b2b/2 log n

}pn−qn]+ qn log log n

+b+ 1

2qn log pn +

b+ 1

2qnr log n

Since logEn = O(log n), the dominating term is O(qn log n). Hence, if qn = o(n/ log n),

then we have that − log πn(βn : ||βn−β0n|| < ∆/nr/2) < dn for all 0 < d < ε2dmin/(32σ2)−

3∆dmax/(2σ2), so πn(βn : ||βn − β0

n|| < ∆/nr/2) > exp(−dn). The posterior consistency is

then completed by applying Lemma 3.

Proof of Theorem 7. Now that we have established the properties of marginal prior for

the R2-D2 hierarchical formulation, the proof will now be based on the results similar to

Theorems 2.1, 2.2, A.1 and A.2 in Song & Liang (2017). We will restate these theorems in

the following lemma.

46

Lemma 4. Consider the linear regression model (1) and suppose the regularity conditions

(B2)-(B5) hold. Suppose that the prior for π(β, σ2) is of the form

π(β | σ2) =

p∏i=1

[g(βi/σ)/σ], σ2 ∼ IG(a1, b1).

Denote εn = M√qn(log pn)/n where M > 0 is sufficiently large. If the density g(·) in the

above formula satisfies

1−∫ kn

−kng(β) dβ ≤ p−(1+u)

n , − log

(inf

β∈[−En,En]g(β)

)= O(log pn), (16)

where u > 0 is a constant and kn �√qn(log pn)/n/pn, then the following results hold:

Prβ0

{π(βn : ||βn − β0

n|| ≥ c1σ0εn | Yn) ≥ e−c2nε

2n

}≤ e−c3nε

2n ,

Prβ0

{π(βn : ||βn − β0

n||1 ≥ c1σ0√qnεn | Yn) ≥ e−c2nε

2n

}≤ e−c3nε

2n ,

Prβ0

{π(βn : ||Xnβn −Xnβ

0n|| ≥ c1σ

0√nεn | Yn) ≥ e−c2nε

2n

}≤ e−c3nε

2n ,

for some constants c1, c2, c3 > 0.

For the R2-D2 prior in (9), according to (14) the corresponding g(·) function is

g(β) =2aπΓ(aπ + b)

Γ(aπ)Γ(b)

∫ ∞0

exp(−|β|x)x2b

(x2 + 2)aπ+bdx.

By the symmetry of g(β) and Fubini’s Theorem, we have

1−∫ kn

−kng(β) dβ = 2

∫ ∞kn

g(β) dβ

= 22aπΓ(aπ + b)

Γ(aπ)Γ(b)

∫ ∞0

∫ ∞kn

exp(−|β|x) dβx2b

(x2 + 2)aπ+bdx

= 22aπΓ(aπ + b)

Γ(aπ)Γ(b)

∫ ∞0

exp(−knx)

x

x2b

(x2 + 2)aπ+bdx

=2Γ(aπ + b)

Γ(aπ)Γ(b)

∫ ∞0

exp(−kn√

2x)x2b−1

(x2 + 1)aπ+bdx

=2Γ(aπ + b)

Γ(aπ)Γ(b)

1

2√πΓ(aπ + b)

G3,11,3

(k2n

2

∣∣∣1−baπ ,0,

12

)=

1√πΓ(aπ)Γ(b)

G3,11,3

(k2n

2

∣∣∣1−baπ ,0,

12

)47

where the second to last “=” follows from equation 3.389.2 in Zwillinger (2014). The right

side of the above equation looks similar to the marginal density of R2-D2 prior, so we can

apply exactly the same technique used in proof of Theorem 4. So in the proof

1−∫ kn

−kng(β) dβ =

1√πΓ(aπ)Γ(b)

G3,11,3

(k2n

2

∣∣∣1−baπ ,0,

12

)= C∗1U1(k2

n) + C∗2U2(k2n) + C∗3U3(k2

n),

where

C∗1 =1√

πΓ(aπ)Γ(b)Γ(−aπ)Γ(

1

2− aπ)Γ(aπ +

1

2)Γ(1 + aπ) < 0,

C∗2 =1√

πΓ(aπ)Γ(b)Γ(aπ)Γ(

1

2)Γ(

1

2)Γ(1− aπ) > 0,

C∗3 =1√

πΓ(aπ)Γ(b)Γ(aπ −

1

2)Γ(−1

2)Γ(

3

2)Γ(

3

2− aπ) > 0,

U1(k2n) =

∞∑j=0

(−1)ju1(j, k2n), u1(j, k2

n) =Γ(aπ + b+ j)

Γ(1 + aπ + j)Γ(12

+ aπ + j)

(k2n

2)j+aπ

j!,

U2(k2n) =

∞∑j=0

(−1)ju2(j, k2n), u2(j, k2

n) =Γ(b+ j)

Γ(1− aπ + j)Γ(12

+ j)

(k2n

2)j

j!,

and U3(k2n) =

∞∑j=0

(−1)ju3(j, k2n), u3(j, k2

n) =Γ(1

2+ b+ j)

Γ(32− aπ + j)Γ(3

2+ j)

(k2n

2)j+1/2

j!.

Then we have

u1(0, k2n)− u1(1, k2

n) ≤ U1(k2n) ≤ u1(0, k2

n),

U2(k2n) ≤ u2(0, k2

n),

U3(k2n) ≤ u3(0, k2

n).

Hence, based on the fact Γ(1−z) = −zΓ(−z), Γ(z) ≈ 1/z as z → 0, kn �√

(qn log pn)/n/pn →

0, and aπ ≤ log(1−p−(1+u)n )

2 log kn→ 0, it follows

1−∫ kn

kn

g(β) dβ ≤ C∗1

{u1(0, k2

n)− u1(1, k2n)}

+ C∗2u2(0, k2n) + C∗3u3(0, k2

n)

= 1− k2aπn

{−

Γ(−aπ)Γ(12− aπ)Γ(aπ + b)

√πΓ(aπ)Γ(b)2aπ

− C∗4k2n − C∗5k1−2aπ

n

}= 1− k2aπ

n

{Γ(1− aπ)Γ(12− aπ)Γ(aπ + b)

√πΓ(b)2aπ

− C∗4k2n − C∗5k1−2aπ

n

}→ 1− k2aπ

n ≤ p−(1+u)n ,

48

where C∗4 , C∗5 ≥ 0. Hence for aπ ≤ log(1−p−(1+u)

n )2 log kn

→ 0, we proved the first condition in (16)

holds.

Now let’s prove that the second condition in (16) also holds. By Theorem 1, g(β) =

O(|β|−2b−1) as |β| → ∞. Since log(En) = O(log pn),

infβ∈[−En,En]

g(β) = g(En) ≈ O(E−2b−1n ).

So

− log(

infβ∈[−En,En]

g(β))

= (2b+ 1) log(En) = O(log pn),

i.e., the second condition in (16) holds. Hence, all conditions in Lemma 4 are satisfied.

Theorem 7 is proven.

B Appendix: Choosing Hyper-parameters ν and µ

Assume that q of the p variance components (λj) account for (1 − ε) proportion of the

variability. That is,∑qj=1 λ(j)∑pj=1 λ(j)

= 1 − ε, where λ(j) is the j-th largest value. Suppose that

λj ∼ Gamma (ν, µ). The degree of sparsity, i.e. the number of relevant components, q can

then be viewed as finding q such that the proportion achieves 1− ε. Since the proportion

is a random quantity, we seek q such that the median of the distribution of the proportion

reaches equality. Note that since, µ is a scale parameter, the distribution of the proportion

is unaffected by µ. Hence a grid search can be achieved to find ν such that the equation

is satisfied by the median of the distribution, this just requires sampling from a Gamma

(ν, 1) distribution to approximate the distribution for a given ν.

After choosing ν, we then choose the scale parameter, µ, we seek to ensure that the

prior distribution for β is near zero at p−q directions. To do so, recall that the distribution

must lie on the ellipsoid given by βTΣXβ = θσ2. Hence, we wish to push p− q directions

to be inside the ellipsoid, i.e. near zero. So, we set µ such that the probability of βj under

the unrestricted Normal distribution, being inside the ellipsoid is 1 − q/p. Note that this

49

probability is just P (β2j < θσ2). Now

P (β2j < θσ2) = P

(wj <

1

λj

)=

∫ ∞0

∫ 1/nλj

0

fw(wj)fλ(λj)dwjdλj

=

∫ ∞0

1

Γ(1/2)γ

(1

2,

1

2λj

)fλ(λj)dλj

=1

Γ(1/2)Eλ

(1

2,

1

2λj

)],

where wj is a χ21 random variable, and γ is the lower incomplete gamma function. This

can again be evaluated on a grid. This time a grid of µ.

50