
Posterior consistency in linear models under shrinkage priors


Artin Armagan
Department of Statistical Science, Duke University, Durham, NC 27708
[email protected]

David B. Dunson
Department of Statistical Science, Duke University, Durham, NC 27708
[email protected]

Jaeyong Lee
Department of Statistics, Seoul National University, Seoul, 151-747, Korea
[email protected]

Waheed U. Bajwa
Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708
[email protected]

April 22, 2011

Abstract

We investigate posterior consistency in linear models with a diverging number of parameters. We first propose a parameter-free multivariate generalized double Pareto distribution as a default prior choice that preserves some of the desired characteristics of a joint double exponential distribution while having multivariate Cauchy-like tails. We give sufficient conditions for consistency when p/n → 0 and then investigate the behavior of the posterior under normal, double exponential and multivariate generalized double Pareto priors.

Key words: Heavy tails; high-dimensional data; Bayesian lasso; posterior consistency; robust prior; shrinkage estimation.

1 Introduction

Shrinkage estimation through continuous priors (Griffin & Brown, 2007; Park & Casella, 2008; Hans, 2009; Carvalho et al., 2010; Griffin & Brown, 2010) has received much attention in recent years, along with frequentist analogues in the regularization framework (Knight & Fu, 2000; Fan & Li, 2001; Yuan & Lin, 2005; Zhao & Yu, 2006; Zou, 2006; Zou & Li, 2008). The lasso of Tibshirani (1996) and its Bayesian analogues relying on double exponential priors (Park & Casella, 2008; Hans, 2009) have drawn particular attention, with many variations being proposed. These priors yield undeniable computational advantages in regression models over Bayesian variable selection approaches that require a search over a huge discrete model space (George & McCulloch, 1993; Raftery et al., 1997; Chipman et al., 2001; Liang et al., 2008; Clyde et al., 2010). The advantages are particularly apparent for scale mixtures of Gaussian priors, which allow conjugate block updating of the regression coefficients and hence lead to substantial improvements in Markov chain Monte Carlo efficiency through more rapid mixing and convergence. Many of these priors also yield sparse estimates, if desired, via maximum a posteriori (MAP) estimation, with approximate inferences available through variational approaches (Tipping, 2001; Bishop & Tipping, 2000; Figueiredo, 2003; Armagan, 2009).

In the Bayesian framework, to justify use in high-dimensional settings it is important to establish posterior consistency in cases where the number of parameters p increases with the sample size n. This is of practical importance, as one often uses as many variables in a regression model as the sample size allows, so the sample size can influence the number of candidate predictors. Existing relevant asymptotic results include Ghosal (1999) and Jiang (2007). Ghosal (1999) considers the asymptotic normality of posteriors in linear models, while


Jiang (2007) investigates the behavior of Bayesian variable selection in high-dimensional settings for generalized linear models. To our knowledge, results along these lines have not been established for some very commonly used shrinkage priors such as the double exponential and its variants. We first present our main result, giving sufficient conditions for posterior consistency under linear models with diverging p. We then provide conditions on the true parameter values under which consistency holds when p/n → 0 for the normal, double exponential and multivariate generalized double Pareto priors. The latter prior is a novel specification we propose, which leads to consistency for a substantially larger class of true parameter values than the normal or double exponential priors.

Henceforth, $N(0, s^2)$ denotes a normal distribution with variance $s^2$ and density
$$\pi(\beta_j \mid s) = \frac{1}{(2\pi)^{1/2}s}\exp\{-\beta_j^2/(2s^2)\},$$
and $DE(0, s)$ denotes a double exponential distribution with scale parameter $s$ and density
$$\pi(\beta_j \mid s) = \frac{1}{2s}\exp(-|\beta_j|/s).$$

We consider the multivariate generalized double Pareto distribution, which also appears in Cevher (2009), as a useful alternative to these priors.

Definition 1. Let $\beta_j \sim DE(0, s)$ independently for $j = 1, \ldots, p$, conditionally on $s$. If we mix the joint density of $\beta_1, \ldots, \beta_p$ by letting $s^{-1} \sim G(\alpha, \eta)$, where $G(\alpha, \eta)$ denotes a gamma distribution with shape parameter $\alpha$ and rate parameter $\eta$, the resulting marginal joint distribution of $\beta$ is a $p$-dimensional multivariate generalized double Pareto with mean zero, shape parameter $\alpha$ and scale parameter $\eta$, $MGDP(0, \alpha, \eta)$, with density
$$\pi(\beta \mid \alpha, \eta) = \frac{\Gamma(\alpha + p)}{(2\eta)^p\,\Gamma(\alpha)}\left(1 + \frac{\sum_j |\beta_j|}{\eta}\right)^{-(\alpha + p)}.$$
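As an aside that is not part of the original text, the mixture construction in Definition 1 gives an immediate two-step sampling recipe along with a closed-form joint log-density. The sketch below (our own, with our own function names) implements both:

```python
import numpy as np
from scipy.special import gammaln

def rmgdp(p, alpha=1.0, eta=1.0, rng=None):
    """One draw from MGDP(0, alpha, eta) via the mixture in Definition 1:
    s^{-1} ~ Gamma(alpha, rate = eta), then beta_j | s ~ DE(0, s)."""
    rng = np.random.default_rng() if rng is None else rng
    s = 1.0 / rng.gamma(shape=alpha, scale=1.0 / eta)  # rate eta <=> scale 1/eta
    return rng.laplace(loc=0.0, scale=s, size=p)       # Laplace = double exponential

def log_mgdp_pdf(beta, alpha=1.0, eta=1.0):
    """Joint log-density of MGDP(0, alpha, eta) in p = len(beta) dimensions."""
    p = beta.size
    return (gammaln(alpha + p) - gammaln(alpha) - p * np.log(2.0 * eta)
            - (alpha + p) * np.log1p(np.abs(beta).sum() / eta))
```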

Remark 1. Park & Casella (2008) instead place an inverse-gamma prior on $s^2$ in the double exponential density, leading to a joint density of $\beta$ that lacks a simple analytic form.

Remark 2. The multivariate generalized double Pareto density is heavier-tailed than a corresponding joint double exponential density while preserving its linear contours and singularity at $\beta = 0$ (see Figure 1).

We set α = η = 1 as a default choice.

Remark 3. The $MGDP(0, 1, 1)$ density becomes increasingly concentrated near zero as $p$ increases, providing an automatic penalty for increasing dimension.
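Remark 3 can be made concrete with a small calculation of our own: under $MGDP(0, 1, 1)$ the log-density drop between the origin and a point $\beta$ equals $(1 + p)\log(1 + \|\beta\|_1)$, so for a fixed per-coordinate magnitude the penalty for leaving zero grows with $p$:

```python
import numpy as np

# log pi(0) - log pi(beta) under MGDP(0, 1, 1) is (1 + p) * log(1 + ||beta||_1)
for p in (1, 5, 20, 100):
    beta = np.full(p, 0.1)                     # fixed per-coordinate magnitude
    gap = (1 + p) * np.log1p(np.abs(beta).sum())
    print(p, gap)                              # grows with the dimension p
```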

2 Posterior Consistency

Consider the linear regression model $y_n = X_n\beta_n + \varepsilon_n$, where $y_n$ is an $n$-dimensional vector of responses, $X_n$ is the $n \times p$ design matrix and $\varepsilon_n \sim N(0, \sigma^2 I_n)$ with known $\sigma^2 > 0$.

Throughout the paper we assume that $\lim_{n\to\infty} \log p/\log n = \delta$ for some $0 \le \delta < 1$ (A1). We also let $X_n = U_n\Sigma_nV_n^T$ be a singular value decomposition, where $U_n$ is an $n \times p$ matrix with orthonormal columns, $V_n$ is a $p \times p$ orthogonal matrix and $\Sigma_n = \mathrm{diag}(\sigma_{n1}, \ldots, \sigma_{np})$ with $0 < \Lambda_1 \le \sigma_{np}^2/n \le \cdots \le \sigma_{n1}^2/n \le \Lambda_2$ (A2).
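As a quick illustration that is ours rather than the paper's, assumption A2 can be probed numerically for a concrete design matrix via its scaled squared singular values:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.standard_normal((n, p))            # hypothetical design with i.i.d. N(0, 1) entries
sv = np.linalg.svd(X, compute_uv=False)    # singular values sigma_n1 >= ... >= sigma_np
scaled = sv**2 / n                         # A2 requires these to lie in [Lambda_1, Lambda_2]
# for i.i.d. Gaussian designs these fall roughly between (1 - sqrt(p/n))^2 and (1 + sqrt(p/n))^2
print(scaled.min(), scaled.max())
```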

Theorem 1. Under assumptions A1 and A2, and letting $\beta_n^0 = (\beta_{n1}^0, \ldots, \beta_{np}^0)^T$ denote the true vector of regression coefficients, the posterior of $\beta_n$ under the prior $\Pi_n(\beta_n)$ is strongly consistent, that is, $\Pi_n(\beta_n : \|\beta_n - \beta_n^0\| > \epsilon \mid y_n) \to 0$ almost surely as $n \to \infty$ for any $\epsilon > 0$, if
$$\Pi_n\left(\beta_n : \|\beta_n - \beta_n^0\| < \frac{\Delta}{n^{\rho/2}}\right) > D\exp(-dn)$$
for some $\Delta > 0$, $d > 0$, $D > 0$ and $\rho > 0$.


Figure 1: Double exponential with s = 1 (red) and multivariate generalized double Pareto with α = η = 1 (blue).
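A univariate version of Figure 1 is easy to re-create; the sketch below (ours, not from the paper) overlays the $DE(0, 1)$ density with the $p = 1$ generalized double Pareto density, which reduces to $(1/2)(1 + |\beta|)^{-2}$ when $\alpha = \eta = 1$:

```python
import numpy as np
import matplotlib.pyplot as plt

beta = np.linspace(-5, 5, 1001)
de = 0.5 * np.exp(-np.abs(beta))           # DE(0, 1) density
gdp = 0.5 * (1 + np.abs(beta)) ** (-2)     # MGDP(0, 1, 1) density with p = 1

plt.plot(beta, de, "r", label="double exponential, s = 1")
plt.plot(beta, gdp, "b", label="generalized double Pareto, alpha = eta = 1")
plt.xlabel(r"$\beta$"); plt.ylabel("density"); plt.legend()
plt.show()   # heavier tails and the shared singularity at the origin are visible
```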

Proof. Let $A_n := \{\beta_n : \|\beta_n - \beta_n^0\| > \epsilon\}$ for $\epsilon > 0$. We introduce a test function $\Phi_n$ for testing $H_0 : \beta_n = \beta_n^0$ versus $H_1 : \beta_n \in A_n$ such that

1. $E_{\beta_n^0}(\Phi_n) \le \exp(-bn)$,
2. $\sup_{\beta_n \in A_n} E_{\beta_n}(1 - \Phi_n) \le \exp(-bn)$,

for some constant $b > 0$ that depends only on $\epsilon$. Define $\Phi_n(y_n) = I(y_n \in C_n)$, where the critical region is $C_n := \{y_n : \|\hat\beta_n - \beta_n^0\| > c\epsilon\}$ for $c \in (0, 1)$ and $\hat\beta_n = (X_n^TX_n)^{-1}X_n^Ty_n$.

Computing the type I error rate for $\Phi_n$ we obtain
$$
\begin{aligned}
E_{\beta_n^0}(\Phi_n) &= \mathrm{pr}_{\beta_n^0}(\|\hat\beta_n - \beta_n^0\| > c\epsilon) = \mathrm{pr}_{\beta_n^0}\{\|(X_n^TX_n)^{-1}X_n^T\varepsilon_n\| > c\epsilon\} \\
&= \mathrm{pr}_{\beta_n^0}(\varepsilon_n^TU_n\Sigma_n^{-2}U_n^T\varepsilon_n > c^2\epsilon^2) \le \mathrm{pr}_{\beta_n^0}(\chi_p^2 > c^2\epsilon^2 n\Lambda_1/\sigma^2) \\
&\le \exp\left\{-\frac{c^2\epsilon^2 n\Lambda_1}{2\sigma^2} + \left(\frac{c^2\epsilon^2 n\Lambda_1 p}{2\sigma^2} - \frac{p^2}{4}\right)^{1/2}\right\} \qquad (1)
\end{aligned}
$$
by using the fact that $U_n^T\varepsilon_n \sim N(0, \sigma^2 I_p)$ and the inequality $\mathrm{pr}\{\chi_p^2 - p \ge 2(px)^{1/2} + 2x\} \le \exp(-x)$ of Laurent & Massart (2000), where $\chi_p^2$ denotes a chi-squared random variable with $p$ degrees of freedom.


The bound in (1) is not an asymptotic one. The quantity under the square root in (1) is positive under assumption A1 as $n \to \infty$. It can also be shown that the exponent in (1) satisfies
$$-\frac{c^2\epsilon^2 n\Lambda_1}{2\sigma^2} + \left(\frac{c^2\epsilon^2 n\Lambda_1 p}{2\sigma^2} - \frac{p^2}{4}\right)^{1/2} \le -\frac{c^2\epsilon^2 n\Lambda_1}{4\sigma^2}$$
if $p \le (2 - \sqrt{3})c^2\epsilon^2 n\Lambda_1/(2\sigma^2)$, which again is satisfied under assumption A1 as $n \to \infty$. Then we can state that, as $n \to \infty$,
$$E_{\beta_n^0}(\Phi_n) \le \exp\left(-\frac{c^2\epsilon^2 n\Lambda_1}{4\sigma^2}\right). \qquad (2)$$

In a similar way,
$$
\begin{aligned}
\sup_{\beta_n \in A_n} E_{\beta_n}(1 - \Phi_n) &= \sup_{\beta_n \in A_n} \mathrm{pr}_{\beta_n}(\|\hat\beta_n - \beta_n^0\| \le c\epsilon) \\
&\le \sup_{\beta_n \in A_n} \mathrm{pr}_{\beta_n}\left(\left|\|\hat\beta_n - \beta_n\| - \|\beta_n^0 - \beta_n\|\right| \le c\epsilon\right) \\
&= \sup_{\beta_n \in A_n} \mathrm{pr}_{\beta_n}(-c\epsilon + \|\beta_n^0 - \beta_n\| \le \|\hat\beta_n - \beta_n\| \le c\epsilon + \|\beta_n^0 - \beta_n\|) \\
&\le \sup_{\beta_n \in A_n} \mathrm{pr}_{\beta_n}(\|\hat\beta_n - \beta_n\| \ge -c\epsilon + \|\beta_n^0 - \beta_n\|) \\
&= \mathrm{pr}_{\beta_n}\{\|\hat\beta_n - \beta_n\| \ge (1 - c)\epsilon\} \\
&= \mathrm{pr}_{\beta_n}\{\varepsilon_n^TU_n\Sigma_n^{-2}U_n^T\varepsilon_n \ge (1 - c)^2\epsilon^2\} \\
&\le \mathrm{pr}_{\beta_n}\{\chi_p^2 \ge (1 - c)^2\epsilon^2 n\Lambda_1/\sigma^2\} \\
&\le \exp\left[-\frac{(1 - c)^2\epsilon^2 n\Lambda_1}{2\sigma^2} + \left\{\frac{(1 - c)^2\epsilon^2 n\Lambda_1 p}{2\sigma^2} - \frac{p^2}{4}\right\}^{1/2}\right].
\end{aligned}
$$
Similarly to (2), as $n \to \infty$,
$$\sup_{\beta_n \in A_n} E_{\beta_n}(1 - \Phi_n) \le \exp\left\{-\frac{(1 - c)^2\epsilon^2 n\Lambda_1}{4\sigma^2}\right\}. \qquad (3)$$

Given (2) and (3), we set $c = 1/2$; hence $b = \epsilon^2\Lambda_1/(16\sigma^2)$ and we have the desired consistent tests. The posterior probability of $A_n$ is given by
$$\Pi_n(A_n \mid y_n) = \frac{\int_{A_n}\{f(y_n \mid \beta_n)/f(y_n \mid \beta_n^0)\}\,\Pi_n(d\beta_n)}{\int\{f(y_n \mid \beta_n)/f(y_n \mid \beta_n^0)\}\,\Pi_n(d\beta_n)} = \frac{(\Phi_n + 1 - \Phi_n)J_{A_n}}{J_n} \le \Phi_n + \frac{(1 - \Phi_n)J_{A_n}}{J_n} = I_1 + I_2/J_n,$$
where $J_A = \int_A\{f(y_n \mid \beta_n)/f(y_n \mid \beta_n^0)\}\,\Pi_n(d\beta_n)$ and $J_n = J_{\mathbb{R}^p}$. We need to show that $I_1 + I_2/J_n \to 0$.

By Markov's inequality and (2),
$$\mathrm{pr}_{\beta_n^0}\{I_1 \ge \exp(-bn/2)\} \le \exp(bn/2)\,E_{\beta_n^0}(I_1) \le \exp(-bn/2).$$
Since $\sum_{n=1}^\infty \mathrm{pr}_{\beta_n^0}\{I_1 \ge \exp(-bn/2)\} < \infty$, the Borel-Cantelli lemma gives $\mathrm{pr}_{\beta_n^0}\{I_1 \ge \exp(-bn/2) \text{ infinitely often}\} = 0$.


We next look at the behavior of $I_2$:
$$
\begin{aligned}
E_{\beta_n^0}(I_2) &= E_{\beta_n^0}\{(1 - \Phi_n)J_{A_n}\} = E_{\beta_n^0}\left\{(1 - \Phi_n)\int_{A_n}\frac{f(y_n \mid \beta_n)}{f(y_n \mid \beta_n^0)}\,\Pi_n(d\beta_n)\right\} \\
&= \int_{A_n}\int(1 - \Phi_n)f(y_n \mid \beta_n)\,dy_n\,\Pi_n(d\beta_n) \le \Pi_n(A_n)\sup_{\beta_n \in A_n}E_{\beta_n}(1 - \Phi_n) \le \exp(-bn).
\end{aligned}
$$

Then
$$\mathrm{pr}_{\beta_n^0}\{I_2 \ge \exp(-bn/2)\} \le \exp(bn/2)\,E_{\beta_n^0}(I_2) \le \exp(-bn/2),$$
and since $\sum_{n=1}^\infty \mathrm{pr}_{\beta_n^0}\{I_2 \ge \exp(-bn/2)\} < \infty$, the Borel-Cantelli lemma again gives $\mathrm{pr}_{\beta_n^0}\{I_2 \ge \exp(-bn/2) \text{ infinitely often}\} = 0$. We have shown that both $I_1$ and $I_2$ tend to zero. Now we analyze the behavior of $J_n$. For $\Pi_n(A_n \mid y_n) \to 0$, we need to show that $\exp(\gamma n)J_n \to \infty$ for any $\gamma > 0$.

$$
\begin{aligned}
\exp(\gamma n)J_n &= \exp(\gamma n)\int\exp\left\{-n\,\frac{1}{n}\log\frac{f(y_n \mid \beta_n^0)}{f(y_n \mid \beta_n)}\right\}\Pi_n(d\beta_n) \\
&= \int\exp\left[n\left\{\gamma - \frac{1}{n}\log\frac{f(y_n \mid \beta_n^0)}{f(y_n \mid \beta_n)}\right\}\right]\Pi_n(d\beta_n) \\
&\ge \int_{B_{n,\nu}}\exp\{n(\gamma - \nu)\}\,\Pi_n(d\beta_n) = \exp\{n(\gamma - \nu)\}\,\Pi_n(B_{n,\nu}), \qquad (4)
\end{aligned}
$$
where $B_{n,\nu} = \{\beta_n : n^{-1}\log\{f(y_n \mid \beta_n^0)/f(y_n \mid \beta_n)\} < \nu\} = \{\beta_n : n^{-1}(\|y_n - X_n\beta_n\|^2 - \|y_n - X_n\beta_n^0\|^2) < 2\sigma^2\nu\}$ for $0 < \nu < \gamma$. Then
$$\Pi_n(B_{n,\nu}) \ge \Pi_n\left\{\beta_n : n^{-1}\left|\|y_n - X_n\beta_n\|^2 - \|y_n - X_n\beta_n^0\|^2\right| < 2\sigma^2\nu\right\}. \qquad (5)$$

Using the exact identity $f(x) = f(x_0) + f'(x_0)(x - x_0) + f''(x_0)(x - x_0)^2/2$ for a quadratic function in (5), with $x = \|y_n - X_n\beta_n\|$ and $x_0 = \|y_n - X_n\beta_n^0\|$, and then the triangle inequality $|\|y_n - X_n\beta_n\| - \|y_n - X_n\beta_n^0\|| \le \|X_n\beta_n - X_n\beta_n^0\|$,
$$
\begin{aligned}
\Pi_n(B_{n,\nu}) &\ge \Pi_n\big\{\beta_n : n^{-1}\big|2\|y_n - X_n\beta_n^0\|(\|y_n - X_n\beta_n\| - \|y_n - X_n\beta_n^0\|) \\
&\qquad\qquad + (\|y_n - X_n\beta_n\| - \|y_n - X_n\beta_n^0\|)^2\big| < 2\sigma^2\nu\big\} \\
&\ge \Pi_n\left\{\beta_n : n^{-1}(2\|y_n - X_n\beta_n^0\|\,\|X_n\beta_n - X_n\beta_n^0\| + \|X_n\beta_n - X_n\beta_n^0\|^2) < 2\sigma^2\nu\right\} \\
&\ge \Pi_n\left(\beta_n : 3n^{-1}\kappa_n\|X_n\beta_n - X_n\beta_n^0\| < 2\sigma^2\nu,\ \|X_n\beta_n - X_n\beta_n^0\| < \kappa_n\right) \\
&\ge \Pi_n\left(\beta_n : n^{-1}\|X_n\beta_n - X_n\beta_n^0\| < \frac{2\sigma^2\nu}{3\kappa_n},\ n^{-1}\|X_n\beta_n - X_n\beta_n^0\| < n^{-1}\kappa_n\right) \qquad (6)
\end{aligned}
$$
given that $\|y_n - X_n\beta_n^0\| < \kappa_n$. Now we show that $\mathrm{pr}_{\beta_n^0}(y_n : \|y_n - X_n\beta_n^0\| \ge \kappa_n \text{ infinitely often}) = 0$:
$$
\begin{aligned}
\mathrm{pr}_{\beta_n^0}(y_n : \|y_n - X_n\beta_n^0\|^2 \ge \kappa_n^2) &= \mathrm{pr}_{\beta_n^0}(y_n : \varepsilon_n^T\varepsilon_n \ge \kappa_n^2) = \mathrm{pr}_{\beta_n^0}(y_n : \chi_n^2 \ge \kappa_n^2/\sigma^2) \\
&\le \exp\left\{-\frac{\kappa_n^2}{2\sigma^2} + \left(\frac{\kappa_n^2 n}{2\sigma^2} - \frac{n^2}{4}\right)^{1/2}\right\} \le \exp\{-\kappa_n^2/(4\sigma^2)\}
\end{aligned}
$$


given that $\kappa_n^2 = n^{1+\rho}$ for any $\rho > 0$ and sufficiently large $n$. Since $\sum_{n=1}^\infty \mathrm{pr}_{\beta_n^0}(y_n : \|y_n - X_n\beta_n^0\| \ge \kappa_n) < \infty$, by the Borel-Cantelli lemma $\mathrm{pr}_{\beta_n^0}(y_n : \|y_n - X_n\beta_n^0\| \ge \kappa_n \text{ infinitely often}) = 0$. Then, following from (6) and the fact that $\kappa_n \to \infty$, for sufficiently large $n$
$$
\begin{aligned}
\Pi_n(B_{n,\nu}) &\ge \Pi_n\left(\beta_n : n^{-1}\|X_n\beta_n - X_n\beta_n^0\| < \frac{2\sigma^2\nu}{3\kappa_n}\right) \\
&\ge \Pi_n\left(\beta_n : n^{-1/2}\|X_n\beta_n - X_n\beta_n^0\| < \frac{2\sigma^2\nu}{3n^{\rho/2}}\right) \\
&= \Pi_n\left(\beta_n : n^{-1}\|X_n\beta_n - X_n\beta_n^0\|^2 < \frac{4\sigma^4\nu^2}{9n^{\rho}}\right) \\
&\ge \Pi_n\left(\beta_n : \|\beta_n - \beta_n^0\|^2 < \frac{4\sigma^4\nu^2}{9\Lambda_2 n^{\rho}}\right) = \Pi_n\left(\beta_n : \|\beta_n - \beta_n^0\| < \frac{\Delta}{n^{\rho/2}}\right),
\end{aligned}
$$
where $\Delta = 2\sigma^2\nu/(3\Lambda_2^{1/2})$. Hence, following (4), $\Pi_n(A_n \mid y_n) \to 0$ exponentially fast if
$$\Pi_n\left(\beta_n : \|\beta_n - \beta_n^0\| < \frac{\Delta}{n^{\rho/2}}\right) > \exp(-dn) \qquad (7)$$
for $0 < d < \gamma - \nu$.
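Before moving on, the chi-squared tail inequality of Laurent & Massart (2000) used twice in the proof is easy to sanity-check numerically. The sketch below (ours, not from the paper) compares the exact upper tail with the bound $\exp(-x)$:

```python
import numpy as np
from scipy.stats import chi2

p = 50
for x in (1.0, 5.0, 20.0, 100.0):
    t = p + 2 * np.sqrt(p * x) + 2 * x                 # threshold in the inequality
    exact = chi2.sf(t, df=p)                           # pr{chi2_p >= t}
    print(x, exact, np.exp(-x), exact <= np.exp(-x))   # the bound should always hold
```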

3 The Priors

Theorem 1 provides a sufficient condition, (7), on the prior under which strong posterior consistency holds under assumptions A1 and A2. Condition (7) is a type of prior support condition, and in this section we provide conditions on the true parameter vector $\beta_n^0$ under which (7) is satisfied for normal, double exponential and multivariate generalized double Pareto priors. In formalizing the conditions on $\beta_n^0$, we avoid making sparsity assumptions and instead place constraints on norms. We also use $f(n) = \omega\{g(n)\}$ to denote that $g(n)/f(n) \to 0$ as $n \to \infty$.

3.1 The Normal Prior

Theorem 2. Let $\beta_{nj} \sim N(0, s_n^2)$ independently for $j = 1, \ldots, p$. Following Theorem 1, the resulting posterior is consistent if $\log(s_n n^{\rho/2}) = o(n^{1-\delta})$, $\|\beta_n^0\|^2 = o(s_n^2 n)$ and $s_n = \omega(n^{-\rho/2})$ for any $\delta < 1$ and $\rho > 0$.

Proof. We start by taking the negative logarithm of both sides of the inequality in (7). The left-hand side satisfies
$$-\log\Pi_n\left(\beta_n : \|\beta_n - \beta_n^0\| < \frac{\Delta}{n^{\rho/2}}\right) \le p\log(2\pi s_n n^{\rho/2}/\Delta) + \frac{\|\beta_n^0\|^2 + \Delta^2/n^{\rho}}{2s_n^2}, \qquad (8)$$
so if $\log(s_n n^{\rho/2}) = o(n^{1-\delta})$, $\|\beta_n^0\|^2 = o(s_n^2 n)$ and $s_n = \omega(n^{-\rho/2})$ for any $\delta < 1$ and $\rho > 0$, then (7) holds. Here $s_n = \omega(n^{-\rho/2})$ guarantees that the right-hand side of (8) remains positive for all $\|\beta_n^0\|^2$.
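To get a feel for the rates in Theorem 2, the sketch below (ours, with illustrative hypothetical choices $\delta = 0.5$, $\rho = 0.2$, $\psi = 0.05$ and a true vector whose norm sits just inside the allowed range) evaluates the right-hand side of (8) divided by $n$; it tends to zero, so (7) holds for any $d > 0$ under these settings:

```python
import numpy as np

delta, rho, psi, Delta = 0.5, 0.2, 0.05, 1.0
for n in (10**3, 10**5, 10**7):
    p = int(n**delta)                    # A1: log p / log n -> delta
    s_n = n**(-rho / 2 + psi)            # satisfies s_n = omega(n^{-rho/2})
    beta0_sq = n**(1 - rho + psi)        # just inside o(s_n^2 n) = o(n^{1 - rho + 2 psi})
    rhs = p * np.log(2 * np.pi * s_n * n**(rho / 2) / Delta) \
          + (beta0_sq + Delta**2 / n**rho) / (2 * s_n**2)
    print(n, rhs / n)                    # decreases toward zero, roughly like n^{-psi}/2
```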

3.2 The Double Exponential Prior

Theorem 3. Let $\beta_{nj} \sim DE(0, s_n)$ independently for $j = 1, \ldots, p$. Following Theorem 1, the resulting posterior is consistent if $\log(s_n n^{\rho/2}) = o(n^{1-\delta})$, $\sum_j|\beta_{nj}^0| = o(s_n n)$ and $s_n = \omega(n^{-\rho/2})$ for any $\delta < 1$ and $\rho > 0$.

Proof.
$$-\log\Pi_n\left(\beta_n : \|\beta_n - \beta_n^0\| < \frac{\Delta}{n^{\rho/2}}\right) \le p\log(2s_n n^{\rho/2}/\Delta) + \frac{\sum_j|\beta_{nj}^0| + \Delta/n^{\rho/2}}{s_n}, \qquad (9)$$
so if $\log(s_n n^{\rho/2}) = o(n^{1-\delta})$, $\sum_j|\beta_{nj}^0| = o(s_n n)$ and $s_n = \omega(n^{-\rho/2})$ for any $\delta < 1$ and $\rho > 0$, then (7) holds. Here $s_n = \omega(n^{-\rho/2})$ guarantees that the right-hand side of (9) remains positive for all $\sum_j|\beta_{nj}^0|$.

3.3 The Multivariate Generalized Double Pareto Prior

Theorem 4. Let $\beta_n \sim MGDP(0, 1, 1)$. Following Theorem 1, the resulting posterior is consistent if $\log\{n^{\rho/2-\delta}(1 + \sum_j|\beta_{nj}^0|)\} = o(n^{1-\delta})$ and $\sum_j|\beta_{nj}^0| = \omega(n^{\delta-\rho/2})$ for any $\delta < 1$ and $\rho > 0$.

Proof.
$$
\begin{aligned}
-\log\Pi_n\left(\beta_n : \|\beta_n - \beta_n^0\| < \frac{\Delta}{n^{\rho/2}}\right) &\le p\log\left(2n^{\rho/2}\right) - \log\Gamma(p + 1) + (p + 1)\log\left(1 + \sum_j|\beta_{nj}^0| + \Delta/n^{\rho/2}\right) \\
&< p\log\left\{\frac{2n^{\rho/2}(1 + \sum_j|\beta_{nj}^0| + \Delta/n^{\rho/2})}{p + 1}\right\} + \log\left(1 + \sum_j|\beta_{nj}^0| + \Delta/n^{\rho/2}\right) + p + 1, \qquad (10)
\end{aligned}
$$
where we use Stirling's formula for the gamma function, $\Gamma(z) = (2\pi/z)^{1/2}(z/e)^z\{1 + O(1/z)\}$ for $z > 0$. If $\log\{n^{\rho/2-\delta}(1 + \sum_j|\beta_{nj}^0|)\} = o(n^{1-\delta})$ and $\sum_j|\beta_{nj}^0| = \omega(n^{\delta-\rho/2})$ for any $\delta < 1$ and $\rho > 0$, then (7) holds. Here $\sum_j|\beta_{nj}^0| = \omega(n^{\delta-\rho/2})$ guarantees that the right-hand side of (10) remains positive.

4 Concluding Remarks

Theorems 2-4 show that the conditions on the norms of the true parameter vector needed for posterior consistency with diverging $p$ are significantly more stringent for the normal and double exponential priors than for the multivariate generalized double Pareto prior. Suppose that $\beta_n^0$ is a very sparse vector. To strongly shrink the small coefficients towards zero, we need to make $s_n$ small under the normal and double exponential priors. As stated in Theorems 2 and 3, $s_n \ge kn^{-\rho/2}$ for all $k > 0$ in both the normal and double exponential cases. Letting $s_n = n^{-\rho/2+\psi}$ for some arbitrarily small $\psi > 0$, we require $\|\beta_n^0\|^2 = o(n^{1-\rho+2\psi})$ under the normal prior and $\sum_j|\beta_{nj}^0| = o(n^{1-\rho/2+\psi})$ under the double exponential prior. For increasing values of $\rho$, making $s_n$ smaller, the portion of the parameter space where posterior consistency holds shrinks further. This makes sense, as the large coefficients are over-shrunk in order to shrink the small coefficients appropriately. For the multivariate generalized double Pareto prior, the conditions on the norm of the underlying truth are much more relaxed: $kn^{\delta-\rho/2} \le \sum_j|\beta_{nj}^0| \le k\exp(n^{1-\delta})$ for all $k > 0$ as $n \to \infty$. Theorem 1 may also be used to investigate the behavior of many other priors in the linear regression setting. A numerical comparison of the bounds (8)-(10) illustrating this contrast is sketched below.
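The sketch below is our own illustration, with hypothetical settings ($\delta = 0.5$, $\rho = 0.5$, $\psi = 0.05$) and a growing truth $\sum_j|\beta_{nj}^0| = n^{0.9}$, concentrated in one large coordinate. It evaluates the upper bounds on $-\log\Pi_n(\|\beta_n - \beta_n^0\| < \Delta/n^{\rho/2})$ from (8), (9) and the first line of (10), each divided by $n$; the normal and double exponential bounds grow with $n$ while the generalized double Pareto bound vanishes, matching the discussion above:

```python
import numpy as np
from scipy.special import gammaln

delta, rho, psi, Delta = 0.5, 0.5, 0.05, 1.0
for n in (10**3, 10**5, 10**7):
    p = int(n**delta)
    s_n = n**(-rho / 2 + psi)               # small scale, to shrink noise coordinates
    l1 = n**0.9                             # sum_j |beta0_nj|: one large coordinate
    l2sq = n**1.8                           # ||beta0_n||^2 for the same truth
    r = Delta / n**(rho / 2)                # radius of the small ball in (7)
    normal = p * np.log(2 * np.pi * s_n / r) + (l2sq + r**2) / (2 * s_n**2)   # (8)
    de = p * np.log(2 * s_n / r) + (l1 + r) / s_n                             # (9)
    mgdp = p * np.log(2 * n**(rho / 2)) - gammaln(p + 1) \
           + (p + 1) * np.log(1 + l1 + r)                                     # (10), line 1
    print(n, normal / n, de / n, mgdp / n)  # normal and DE grow; MGDP tends to zero
```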

Acknowledgements

This work was supported by the National Institute of Environmental Health Sciences. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of Environmental Health Sciences or the National Institutes of Health. Jaeyong Lee was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology.


References

ARMAGAN, A. (2009). Variational bridge regression. JMLR: W&CP 5, 17-24.

BISHOP, C. M. & TIPPING, M. E. (2000). Variational relevance vector machines. In UAI '00: Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence. San Francisco, CA: Morgan Kaufmann Publishers Inc.

CARVALHO, C. M., POLSON, N. G. & SCOTT, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika 97, 465-480.

CEVHER, V. (2009). Learning with compressible priors. In Advances in Neural Information Processing Systems, Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams & A. Culotta, eds., vol. 22.

CHIPMAN, H., GEORGE, E. I. & MCCULLOCH, R. E. (2001). The practical implementation of Bayesian model selection. IMS Lecture Notes - Monograph Series 38.

CLYDE, M., GHOSH, J. & LITTMAN, M. L. (2010). Bayesian adaptive sampling for variable selection and model averaging. Journal of Computational and Graphical Statistics.

FAN, J. & LI, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96, 1348-1360.

FIGUEIREDO, M. A. T. (2003). Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 1150-1159.

GEORGE, E. I. & MCCULLOCH, R. E. (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association 88.

GHOSAL, S. (1999). Asymptotic normality of posterior distributions in high-dimensional linear models. Bernoulli 5, 315-331.

GRIFFIN, J. E. & BROWN, P. J. (2007). Bayesian adaptive lassos with non-convex penalization. Technical report.

GRIFFIN, J. E. & BROWN, P. J. (2010). Inference with normal-gamma prior distributions in regression problems. Bayesian Analysis 5, 171-188.

HANS, C. (2009). Bayesian lasso regression. Biometrika 96, 835-845.

JIANG, W. (2007). Bayesian variable selection for high dimensional generalized linear models: Convergence rates of the fitted densities. The Annals of Statistics 35, 1487-1511.

KNIGHT, K. & FU, W. (2000). Asymptotics for lasso-type estimators. The Annals of Statistics 28, 1356-1378.

LAURENT, B. & MASSART, P. (2000). Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics 28, 1302-1338.

LIANG, F., PAULO, R., MOLINA, G., CLYDE, M. & BERGER, J. (2008). Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association 103, 410-423.

PARK, T. & CASELLA, G. (2008). The Bayesian lasso. Journal of the American Statistical Association 103, 681-686.

RAFTERY, A. E., MADIGAN, D. & HOETING, J. A. (1997). Bayesian model averaging for linear regression models. Journal of the American Statistical Association 92, 179-191.

TIBSHIRANI, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267-288.

TIPPING, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research 1.

YUAN, M. & LIN, Y. (2005). Efficient empirical Bayes variable selection and estimation in linear models. Journal of the American Statistical Association 100, 1215-1225.

ZHAO, P. & YU, B. (2006). On model selection consistency of lasso. Journal of Machine Learning Research 7.

ZOU, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101, 1418-1429.

ZOU, H. & LI, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics 36, 1509-1533.
