
On a Class of Shrinkage Priors for Covariance Matrix Estimation

Hao Wang

Department of Statistics, University of South Carolina, Columbia, SC 29208, U.S.A.

[email protected]

Natesh S. Pillai

Department of Statistics, Harvard University, Cambridge, MA 02138, U.S.A.

[email protected]

This version: October 5, 2011

Abstract

We propose a flexible class of models based on scale mixtures of uniform distributions to construct shrinkage priors for covariance matrix estimation. This new class of priors enjoys a number of advantages over the traditional scale mixture of normal priors, including its simplicity and flexibility in characterizing the prior density. We also exhibit a simple, easy to implement Gibbs sampler for posterior simulation which leads to efficient estimation in high dimensional problems. We first discuss the theory and computational details of this new approach and then extend the basic model to a new class of multivariate conditional autoregressive models for analyzing multivariate areal data. The proposed spatial model flexibly characterizes both the spatial and the outcome correlation structures at an appealing computational cost. Examples consisting of both synthetic and real-world data show the utility of this new framework in terms of robust estimation as well as improved predictive performance.

Key words: Areal data; Covariance matrix; Data augmentation Gibbs sampler; Multivariate conditional autoregressive model; Scale mixture of uniform; Shrinkage; Sparsity.

1 Introduction

Estimation of the covariance matrix Σ of a multivariate random vector y is ubiquitous in modern statistics and is particularly challenging when the dimension of the covariance matrix, p, is comparable to, or even larger than, the sample size n. For efficient inference, it is thus paramount to take advantage of the parsimonious structure often inherent in these high dimensional problems. Many Bayesian approaches have been proposed for covariance matrix estimation by placing shrinkage priors on various parameterizations of the covariance matrix Σ. Yang & Berger (1994) proposed reference priors for Σ based on the spectral decomposition of Σ. Barnard et al. (2000) and Liechty et al. (2004)


considered shrinkage priors in terms of the correlation matrix and standard deviations. Daniels & Kass (1999, 2001) proposed flexible hierarchical priors based on a number of parameterizations of Σ. All of these methods use non-conjugate priors; they typically rely on Markov chain algorithms that explore the state space locally, such as Metropolis-Hastings methods, or on asymptotic approximations for posterior simulation and model fitting, and are restricted to low-dimensional problems.

A large class of sparsity models for the covariance matrix involves the identification of zeros in the inverse Ω = Σ^{−1}. This corresponds to Gaussian graphical models, in which zeros in the inverse covariance matrix uniquely determine an undirected graph that represents strict conditional independencies. The Gaussian graphical model approach for covariance matrix estimation is attractive and has gained substantive attention because its implied conditional dependence structure provides a natural platform for modeling dependence of random quantities in areas such as biology, finance, environmental health and the social sciences. The standard Bayesian approach to inference in Gaussian graphical models is the conjugate G-Wishart prior (Roverato, 2002; Atay-Kayis & Massam, 2005), which places positive probability mass at zero on the zero elements of Ω. A zero constrained random matrix Ω has the G-Wishart distribution W_G(b, D) if its density is

p(Ω | G) = C_G(b, D)^{−1} |Ω|^{(b−2)/2} exp{−tr(DΩ)/2} 1_{Ω∈M+(G)},   (1)

where b > 2 is the degrees of freedom parameter, D is a symmetric positive definite matrix, C_G(b, D) is the normalizing constant, M+(G) is the cone of symmetric positive definite matrices with entries corresponding to the missing edges of G constrained to be equal to zero, and 1_{·} is the indicator function. Although the G-Wishart prior has been used quite successfully in many applications, it has a few important limitations. First, the G-Wishart prior is sometimes not very flexible because of its restrictive form; for example, the degrees of freedom parameter is the same for all the elements of Ω. Second, unrestricted graphical model determination and covariance matrix estimation is computationally challenging. Recent advances for unrestricted graphical models (Jones et al., 2005; Wang & Carvalho, 2010; Mitsakakis et al., 2010; Dobra et al., 2011) all rely on the theoretical framework of Atay-Kayis & Massam (2005) for sparse matrix completion, which is computationally very intensive. Indeed, for non-decomposable graphical models, we do not have a closed form expression for the normalizing constant C_G(b, D) and thus have to resort to tedious and often unstable Monte Carlo integration to estimate it for both graphical model determination and covariance matrix estimation.

An alternative method for Bayesian graphical model determination and estimation was proposed by Wong et al. (2003). They placed point mass priors at zero on the zero elements of the partial correlation matrix and constant priors on the non-zero elements. Their methodology applies to both decomposable and non-decomposable models and is fitted by a reversible jump Metropolis-Hastings algorithm. However, it is unclear how to incorporate prior information about individual entries of Σ in their framework, as the


mathematical convenience of constant priors is essential for their algorithm.

Absolutely continuous priors, or equivalently penalty functions, can also induce shrinkage to zero of subsets of elements of Ω and represent an important and flexible alternative to the point mass priors. In the classical formulation, there is a rich literature on methods for developing shrinkage estimators via different penalty functions, including the graphical lasso models (Yuan & Lin, 2007; Friedman et al., 2008; Rothman et al., 2008) and the graphical adaptive lasso models (Fan et al., 2009), among many others. The recent Bayesian literature has focused on posterior mode estimation, with little attention to the key problem of efficient inference on the covariance matrix based on full posterior computation; the only exception is Wang (2011), which gives a fully Bayesian treatment of the graphical lasso models. One likely reason is the difficulty of efficiently generating posterior samples of covariance matrices under shrinkage priors. A fully Bayesian analysis is quite desirable because it not only produces valid standard errors and Bayes estimators based on a decision-theoretic framework but, perhaps more importantly, can be applied in many classes of multivariate models that involve unknown covariance matrices as key components, such as the multivariate conditional autoregressive models developed in Section 5.

This paper proposes a class of priors, and the implied Bayesian hierarchical modeling and computation, for shrinkage estimation of covariance matrices. A key but well known observation is that any symmetric, unimodal density may be written as a scale mixture of uniform distributions. Our main strategy is to use this mixture representation to construct shrinkage priors, rather than the traditional scale mixture of normal representation. The scale mixture of uniform distribution is not new to Bayesian inference: early uses of this representation include Bayesian robustness and sensitivity analysis (Berger, 1985; Berger & Berliner, 1986) and robust regression with heavy-tailed errors (Walker et al., 1997).

However, our motivations are different; we seek an approach for constructing tractable shrinkage priors that are both flexible and computationally efficient. We argue that the class of scale mixture of uniform priors provides an appealing framework for modeling a wide class of shrinkage estimation problems and also has the potential to be extended to a large class of high dimensional problems involving multivariate dependencies. We also highlight that a salient feature of our approach is its computational simplicity. We construct a simple, easy to implement Gibbs sampler based on data augmentation for obtaining posterior draws under a large class of shrinkage priors. To the best of our knowledge, none of the existing Bayesian algorithms for sparse permutation invariant covariance estimation can be carried out solely with a Gibbs sampler; they have to rely on Metropolis-Hastings methods. Since Gibbs samplers involve global proposal moves, as compared to the local proposals of Metropolis-Hastings methods, in high dimensions this makes a difference in both the efficiency of the sampler and the running time of the algorithm. Through simulation experiments, we illustrate the robust performance of the scale mixture of uniform priors for covariance matrices, and highlight the strengths and weaknesses of this approach compared to those based on


point mass priors. Through an extension to a class of multivariate conditional autoregressive models, we further illustrate that the scale mixture of uniform framework naturally allows and encourages the integration of data and expert knowledge in model fitting and assessment, and consequently improves prediction.

The rest of the paper is organized as follows. In Section 2 we outline our framework for constructing shrinkage priors for covariance matrices using the scale mixture of uniforms. In Section 3 we construct a Gibbs sampler based on a data augmentation scheme for sampling from the posterior distribution. In Section 4 we conduct a simulation study and compare and contrast our models with existing methods. In Section 5 we extend our model to build shrinkage priors for multivariate conditional autoregressive models. In Section 6 we briefly discuss the application of our methods to shrinkage estimation for regression models.

2 Shrinkage priors for precision matrices

2.1 Precision matrix modeling

Let y = (y^(1), y^(2), . . . , y^(p))ᵀ be a p-dimensional random vector having a multivariate normal distribution N(0, Σ) with mean zero and covariance matrix Σ. Let Ω = (ω_ij)_{p×p} = Σ^{−1} denote the precision matrix, i.e., the inverse of the covariance matrix Σ. Given a set of independent random samples Y = (y_1, . . . , y_n)_{p×n} of y, we wish to estimate the matrix Ω.

We consider the following prior distribution for the precision matrix:

p(Ω | τ) ∝ ∏_{i≤j} g_ij{(ω_ij − m_ij)/τ_ij} 1_{Ω∈M+},   (2)

where g_ij(·) is a continuous, unimodal and symmetric probability density function on R with mode zero, M+ is the space of real valued symmetric, positive definite p × p matrices, τ_ij > 0 is a scale parameter controlling the strength of the shrinkage, and 1_A denotes the indicator function of the set A. Our primary motivation for constructing prior distributions of the form (2) is that, often in real applications, the amount of prior information the modeler has can vary across individual elements of Ω. For instance, one might incorporate the information that the variances of certain entries of Ω are close to 0, or constrain some entries to be exactly 0. In this setting, shrinking different elements of Ω at different rates clearly provides a flexible framework for conducting Bayesian inference. In addition to obtaining a flexible class of prior distributions, by using a mixture representation for the density g_ij, we can construct a simple, efficient and easy to implement Gibbs sampler to draw from the posterior distribution of the precision matrix.
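For illustration, a minimal Python sketch of ours (the function and argument names are hypothetical; the paper gives no code) of evaluating the unnormalized log prior (2) for a candidate Ω:

```python
import numpy as np

def log_prior_unnorm(Omega, m, tau, log_g):
    """Unnormalized log of prior (2): sum over i <= j of
    log g{(omega_ij - m_ij)/tau_ij}, times the indicator 1{Omega in M+}."""
    # indicator 1{Omega in M+}: all eigenvalues strictly positive
    if np.any(np.linalg.eigvalsh(Omega) <= 0):
        return -np.inf
    iu = np.triu_indices(Omega.shape[0])   # index pairs with i <= j
    return np.sum(log_g((Omega[iu] - m[iu]) / tau[iu]))

# example g: double-exponential, g(x) proportional to exp(-|x|)
log_g = lambda x: -np.abs(x)
```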


2.2 Scale mixture of uniform distributions

Our main tool is the following theorem, which says that all unimodal, symmetric densities may be expressed as scale mixtures of uniform distributions.

Theorem 1 (Walker et al., 1997; Feller, 1971). Suppose that θ is a real-valued random quantity with a continuous, unimodal and symmetric distribution with mode zero having density π(θ) (−∞ < θ < ∞). Suppose π′(θ) exists for all θ. Then π(θ) has the form

π(θ) = ∫_0^∞ {1/(2t)} 1_{|θ|<t} h(t) dt,   (3)

where h(t) ∝ −2t π′(t) is a density function on [0, ∞). Therefore we may write

θ | t ∼ U(−t, t),   t ∼ h(t) ∝ −2t π′(t).

The generality and simplicity of Theorem 1 allow us to characterize various shrinkage priors by using the special structure of (3). Indeed, as noted in Walker et al. (1997), a Gaussian random variable x ∼ N(µ, σ²) can be expressed as x | v ∼ U(µ − σ√v, µ + σ√v), v ∼ Ga(3/2, 1/2), which shows that all distributions expressible as scale mixtures of Gaussian distributions can be expressed as scale mixtures of uniform distributions as well. Let us discuss a few more examples of popular shrinkage priors where Theorem 1 is applicable.
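As a quick numerical check of this representation (a sketch of ours, not from the paper), one can mix uniforms over a Ga(3/2, 1/2) scale and compare the draws against the standard normal:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# v ~ Ga(3/2, 1/2): shape 1.5, rate 1/2 (scale 2); then x | v ~ U(-sqrt(v), sqrt(v))
v = rng.gamma(shape=1.5, scale=2.0, size=100_000)
x = rng.uniform(-np.sqrt(v), np.sqrt(v))
print(stats.kstest(x, "norm"))  # a large p-value is consistent with x ~ N(0, 1)
```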

A popular class of distributions for constructing shrinkage priors is the exponential power family, given by π(θ) ∝ exp(−|θ|^q/τ^q), where the exponent q > 0 controls the decay of the tails. The mixing variable t in (3) can be thought of as a “scale” parameter. In this case we have h(t) ∝ t^q exp(−t^q/τ^q), which corresponds to the generalized gamma distribution. Two important special cases are the Gaussian distribution (q = 2) and the double-exponential distribution (q = 1), which have been studied extensively in the context of the Bayesian lasso regression (Park & Casella, 2008; Hans, 2009) and the Bayesian graphical lasso (Wang, 2011). For general q > 0, one may write the exponential power distribution as a scale mixture of Gaussian distributions (Andrews & Mallows, 1974; West, 1987). However, a fully Bayesian, computationally efficient analysis based on Gaussian mixtures is not available, especially in the context of covariance estimation and graphical models. A few approximate methods exist for doing inference under the exponential power prior distribution, such as the variational method proposed by Armagan (2009). Our use of the uniform mixture representation has the advantage of posterior simulation via an efficient Gibbs sampler for any q > 0, as is shown in Section 2.3 and further exemplified in Sections 4 and 6.

Another natural candidate for shrinkage priors is the Student-t distribution, given by π(θ) ∝ (1 + θ²/τ²)^{−(ν+1)/2}, for which it is easy to show that h(t) ∝ t²(1 + t²/τ²)^{−(ν+3)/2}. Hence, t²/τ² follows an inverted beta distribution IB(3/2, ν/2). Recall that the inverted beta distribution IB(a, b) has density p(x) ∝ x^{a−1}(1 + x)^{−a−b} 1_{x>0}.

The generalized double Pareto distribution is given by π(θ) ∝ (1 + |θ|/τ)^{−(1+α)}, which corresponds to h(t) ∝ t(1 + t/τ)^{−(2+α)}; i.e., the scale t/τ follows an inverted beta distribution IB(2, α). Armagan et al. (2011) investigated the properties of this class of shrinkage priors.

The classes of shrinkage priors discussed above are well known and documented. We now introduce a distribution, which we call the “logarithmic” prior, that seems to be new in the context of shrinkage priors. Consider the density given by

π(θ) ∝ log(1 + τ²/θ²).   (4)

It is easy to show that the corresponding mixing distribution has the half-Cauchy density

h(t) ∝ (1 + t²/τ²)^{−1} 1_{t>0}.

This prior has two desirable properties for shrinkage estimation: an infinite spike at zero and heavy tails. These are precisely the characteristics argued for convincingly in the case of the “horseshoe” prior by Carvalho et al. (2010). The horseshoe prior is constructed as a scale mixture of normals, namely θ ∼ N(0, σ²), σ ∼ C+(0, 1), where C+(0, 1) is a standard half-Cauchy distribution on the positive reals with scale 1. The horseshoe prior does not have a closed form density but satisfies

(K/2) log(1 + 4/θ²) < π(θ) < K log(1 + 2/θ²),

for a constant K > 0. Clearly, our new prior (4) has the same behavior at the origin and in the tails as the horseshoe prior, with the added advantage of having an explicit density function.

2.3 Posterior sampling

Let y denote the observed data. The scale mixture of uniform representation provides a simple way of sampling from the posterior distribution p(θ | y) ∝ f(y | θ)π(θ), where f(y | θ) is the likelihood function and π(θ) is the shrinkage prior density. The representation (3) leads to the following full conditional distributions of θ and t (conditional on y):

p(θ | y, t) ∝ f(y | θ) 1_{|θ|<t},   p(t | y, θ) ∝ −π′(t) 1_{|θ|<t}.   (5)

Thus the data augmented Gibbs sampler for obtaining posterior draws from p(θ, t | y) involves iteratively simulating from the above two conditional distributions. Simulation of the former involves sampling from a truncated distribution, which is often achieved by breaking it down further into several Gibbs steps, while sampling the latter is achieved by the following theorem.


Table 1: Density of θ and t for some common shrinkage prior distributions, along with the conditional posterior inverse cumulative probability function for sampling t. Densities are given up to normalizing constants.

Density name               Density for θ              Density for t               Inverse CDF: F^{−1}(u | θ)
Exponential power          exp(−|θ|^q/τ^q)            t^q exp(−t^q/τ^q)           {|θ|^q − τ^q log u}^{1/q}
Student-t                  (1 + θ²/τ²)^{−(ν+1)/2}     t²(1 + t²/τ²)^{−(ν+3)/2}    {u^{−2/(ν+1)}(τ² + θ²) − τ²}^{1/2}
Generalized double Pareto  (1 + |θ|/τ)^{−(1+α)}       t(1 + t/τ)^{−(2+α)}         u^{−1/(1+α)}(|θ| + τ) − τ
Logarithmic                log(1 + τ²/θ²)             (1 + t²/τ²)^{−1}            τ{(1 + τ²/θ²)^u − 1}^{−1/2}

CDF, cumulative distribution function.

Theorem 2. Suppose the shrinkage prior density π(θ) can be represented by a scale mixture of uniforms as in equation (3). Then the (posterior) conditional probability density function of the latent scale parameter t is given by

p(t | y, θ) ∝ −π′(t) 1_{t>|θ|},

and the corresponding (conditional) cumulative distribution function is

u = F(t | y, θ) = pr(T < t | y, θ) = {π(|θ|) − π(t)}/π(|θ|),   |θ| < t.   (6)

The advantage of the above theorem is that it gives an explicit expression of the conditional cumulative distribution function in terms of the prior density π(·). This provides a simple way to sample from p(t | y, θ) using the inverse cumulative distribution function method whenever π(·) can be easily inverted. Table 1 summarizes the density functions π(θ) and h(t), and the inverse conditional cumulative distribution function F^{−1}(u | y, θ), for several shrinkage priors introduced in Section 2.2. We note that scale mixtures of uniform distributions have already been used for inference in regression models via the Gibbs sampler outlined above; see, for instance, Qin et al. (1998).
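For illustration, a minimal sketch (ours, not the authors' code) of the inverse cumulative distribution function draws of t from Table 1, for two of the priors:

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_t_exponential_power(theta, tau, q):
    # Table 1, exponential power pi(theta) prop. to exp(-|theta|^q/tau^q):
    # t = {|theta|^q - tau^q log(u)}^{1/q}, u ~ U(0, 1).
    u = rng.uniform()
    return (np.abs(theta) ** q - tau ** q * np.log(u)) ** (1.0 / q)

def draw_t_logarithmic(theta, tau):
    # Table 1, logarithmic pi(theta) prop. to log(1 + tau^2/theta^2):
    # t = tau {(1 + tau^2/theta^2)^u - 1}^{-1/2}, u ~ U(0, 1).
    u = rng.uniform()
    return tau * ((1.0 + tau ** 2 / theta ** 2) ** u - 1.0) ** -0.5
```

In both cases the draw always exceeds |θ|, as required by the support of p(t | y, θ).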

3 Posterior computation for precision matrices

3.1 Gibbs sampling for a given global shrinkage parameter τ

Recall that given a set of independent random samples Y = (y_1, . . . , y_n)_{p×n} from a multivariate normal distribution N(0, Ω^{−1}), we wish to estimate the matrix Ω using the prior distribution given by (2). Let T = {t_ij}_{i≥j} be the vector of latent scale parameters. For simplicity, in this section we first consider the simple case where g_ij(·) = g(·), m_ij = 0 and τ_ij = τ, and discuss strategies for choosing τ in Section 3.2; our algorithms can be easily extended to the general case of unequal shrinkage parameters τ_ij. Theorem 1 suggests that the prior (2) can be represented as follows:

p(Ω | τ) = ∫_T p(Ω, T | τ) dT ∝ ∫_T ∏_{i≥j} [{1/(2t_ij)} 1_{|ω_ij|<τt_ij} h(t_ij)] dT,


where p(Ω, T | τ) ∝ ∏_{i≥j} [{1/(2t_ij)} 1_{|ω_ij|<τt_ij} h(t_ij)] is the joint prior and h(t_ij) ∝ −t_ij g′(t_ij). The joint posterior distribution of (Ω, T) is then

p(Ω, T | Y, τ) ∝ |Ω|^{n/2} exp{−tr(SΩ)/2} ∏_{i≥j} [−1_{|ω_ij|<τt_ij} g′(t_ij)],   (7)

where S = Y Yᵀ.

The most direct approach for sampling from (7) is to update each ω_ij one at a time given the data, T, and all of the entries in Ω except ω_ij, in a way similar to that proposed in Wong et al. (2003). However, this direct approach requires a separate Cholesky factorization for updating each ω_ij, to find its allowable range and conditional distribution. It also relies on a Metropolis-Hastings step to correct the sample. We describe an efficient Gibbs sampler for sampling (Ω, T) from (7) that involves one step for sampling Ω and another for sampling T.

Given T, the first step of our Gibbs sampler systematically scans the set of 2 × 2 sub-matrices {Ω_{e,e} : e = (i, j), 1 ≤ j < i ≤ p} to generate Ω. For any e = (i, j), let V = {1, . . . , p} be the set of vertices and note that

|Ω| = |A| |Ω_{V∖e,V∖e}|,

where A, the Schur complement of Ω_{V∖e,V∖e}, is defined by A = Ω_{e,e} − B with B = Ω_{e,V∖e}(Ω_{V∖e,V∖e})^{−1}Ω_{V∖e,e}. The full conditional density of Ω_{e,e} from (7) is given by

p(Ω_{e,e} | −) ∝ |A|^{n/2} exp{−tr(S_{e,e}A)/2} 1_{Ω_{e,e}∈T},

where T = {|ω_ij| < τt_ij} ∩ {|ω_ii| < τt_ii} ∩ {|ω_jj| < τt_jj}. Thus, A is a truncated Wishart variate. To sample A, we write

A = ( 1    0 ) ( d1  0  ) ( 1  l21 )
    ( l21  1 ) ( 0   d2 ) ( 0  1   ),        S_{e,e} = ( s11  s12 )
                                                       ( s21  s22 ),

with d1 > 0 and d2 > 0. The joint distribution of (l21, d1, d2) is then given by

p(d1, d2, l21 | −) ∝ d1^{n/2+1} d2^{n/2} exp[−{s11 d1 + s22(l21² d1 + d2) + 2 s21 d1 l21}/2] 1_{Ω_{e,e}∈T},

which implies that the univariate full conditional distributions of d1 and d2 are truncated gamma distributions, and that of l21 is a truncated normal distribution. Details of the truncation region and sampling strategies are given in the Appendix. Given Ω, the second step of our Gibbs sampler generates T in a block using the inverse cumulative distribution function method described in equation (6). These two steps complete a Gibbs sampler for model fitting under a broad class of shrinkage priors for Ω.

One attractive feature of the above sampler is that it is also suitable for sampling Ω ∈ M+(G), that is, Ω constrained by an undirected graph G = (V, E), where V is


the set of vertices, E is a set of edges, and ω_ij = 0 if and only if (i, j) ∉ E. The ability to sample Ω ∈ M+(G) is useful when substantive prior information indicates that a certain subset of elements in Ω is indeed zero. Section 5 provides such an example involving a class of multivariate spatial models. To sample Ω ∈ M+(G), the only modification required is to replace the set of all 2 × 2 sub-matrices {Ω_{e,e} : e = (i, j), 1 ≤ j < i ≤ p} with the set {Ω_{e,e} : e ∈ E} ∪ {Ω_{v,v} : v ∈ V_I}, where V_I is the set of isolated nodes in G.

3.2 Choosing the shrinkage parameters

We start with the scenario when τij = τ and mij = 0 for all i ≥ j. In this case we have

p(Ω | τ) = C_τ^{−1} ∏_{i≥j} g(ω_ij/τ),

where C_τ is a normalizing term involving τ. This normalizing constant is a necessary quantity for choosing hyperparameters for τ. Since p(Ω | τ) is a scale family, the substitution Ω̃ = Ω/τ yields

C_τ = ∫_{Ω∈M+} ∏_{i≥j} g(ω_ij/τ) dΩ = τ^{p(p+1)/2} ∫_{Ω̃∈M+} ∏_{i≥j} g(ω̃_ij) dΩ̃,   (8)

where the integral on the right hand side does not involve τ because the cone M+ is invariant under scaling, that is, {Ω/τ : Ω ∈ M+} = {Ω : Ω ∈ M+}. Hence, under a hyperprior p(τ), the conditional posterior distribution of τ is

p(τ | Y, Ω) ∝ τ^{−p(p+1)/2} ∏_{i≥j} g(ω_ij/τ) p(τ).   (9)

Now the sampling scheme in Section 3.1 can be extended to include a component to sample τ at each iteration.
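For example, under the exponential power prior g(x) ∝ exp(−|x|^q) with the conjugate hyperprior τ^{−q} ∼ Ga(a, b) used in Section 4, (9) reduces to a gamma update for φ = τ^{−q}. The following sketch is our own derivation from (9), assuming the product in (2) runs over all i ≥ j including the diagonal:

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_tau_exponential_power(Omega, q, a=1.0, b=0.1):
    # With phi = tau^{-q}, (9) gives
    # phi | Omega ~ Ga{a + p(p+1)/(2q), b + sum_{i>=j} |omega_ij|^q}.
    p = Omega.shape[0]
    s = np.sum(np.abs(Omega[np.tril_indices(p)]) ** q)
    phi = rng.gamma(a + p * (p + 1) / (2.0 * q), 1.0 / (b + s))
    return phi ** (-1.0 / q)
```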

Now suppose m_ij = 0 and, instead of having a single global shrinkage parameter, we wish to control separately the rates at which the individual elements of Ω are shrunk towards 0. A natural shrinkage prior for this problem is

p(Ω | τ) = C_τ^{−1} ∏_{i≥j} g_ij(ω_ij/τ),

where the g_ij may all be different. The idea is that by choosing a different density g_ij for each edge, we can incorporate prior information about the rate at which different entries of Ω are shrunk towards 0. For a hyperprior p(τ), using an identical calculation as in (8) and (9), we deduce that the conditional posterior of τ is given by

p(τ | Y, Ω) ∝ τ^{−p(p+1)/2} ∏_{i≥j} g_ij(ω_ij/τ) p(τ).   (10)


Notice that the Gibbs sampler presented in Section 3.1 applies to this case as well; we just need to use the cumulative distribution function of the density g_ij when sampling from the conditional distribution of t_ij. Alternatively, one can fix a density g and write p(Ω | τ) = C_τ^{−1} ∏_{i≥j} g{ω_ij/(v_ij τ)} for fixed positive constants v_ij, and then make inference about the common τ.

We conclude this section with the remark that our approach can be adapted for hierarchical models. For example, in Section 5 we consider a shrinkage prior that shrinks Ω towards a given matrix M = (m_ij) under the constraint that Ω ∈ M+(G) for a given graph G:

p(Ω) = C_{τ,M}^{−1} ∏_{(i,j)∈E} g{(ω_ij − m_ij)/τ} 1_{Ω∈M+(G)},

where E denotes the set of edges of the graph G and C_{τ,M} = ∫_{Ω∈M+(G)} ∏_{(i,j)∈E} g{(ω_ij − m_ij)/τ} dΩ is the normalizing constant. In this case C_{τ,M} is analytically intractable as a function of τ. In the example of Section 5, we fix τ at a value that represents prior knowledge of the distribution of Ω, to avoid modeling τ. In some applications, it may be desirable to add another level of hierarchy for modeling τ so that we can estimate it from the data. Several approaches have been proposed for dealing with the intractable normalizing constant; see Liechty et al. (2004), Liechty et al. (2009) and the references therein.

4 Simulation experiments

To assess the utility of the scale mixture of uniform priors, we compared a range of priors in this family against three alternatives: the frequentist graphical lasso method of Friedman et al. (2008), the Bayesian G-Wishart prior, and the method of Wong et al. (2003). The latter two place positive prior mass on zeros. We considered four covariance matrices from Rothman et al. (2008):

Model 1. An AR(1) model with σ_ij = 0·7^{|i−j|}.

Model 2. An AR(4) model with ω_ii = 1, ω_{i,i−1} = ω_{i−1,i} = 0·2, ω_{i,i−2} = ω_{i−2,i} = ω_{i,i−3} = ω_{i−3,i} = 0·2, ω_{i,i−4} = ω_{i−4,i} = 0·1.

Model 3. A sparse model with Ω = B + δI_p, where each off-diagonal entry in B is generated independently and assigned the value 0·5 with probability α = 0·1 and 0 otherwise. The diagonal elements B_ii are set to 0, and δ is chosen so that the condition number of Ω is p. Here the condition number is defined as max(λ)/min(λ), where max(λ) and min(λ) respectively denote the maximum and minimum eigenvalues of the matrix Ω.

Model 4. A dense model with the same Ω as in model 3 except for α =0·5.

For each of the above four models, we generated samples of size n = 30, 100 and dimension p = 30, yielding proportions of non-zero elements of 6%, 25%, 10% and 50%, respectively. We computed the risk under two standard loss functions: Stein's loss function, L1(Σ̂, Σ) = tr(Σ̂Σ^{−1}) − log|Σ̂Σ^{−1}| − p, and the squared-error loss function, L2(Σ̂, Σ) = tr{(Σ̂ − Σ)²}. The corresponding Bayes estimators are Σ̂_{L1} = {E(Ω | Y)}^{−1} and Σ̂_{L2} = E(Σ | Y), respectively. We used the posterior sample mean from the Gibbs sampler to estimate the risk for the Bayesian methods, and the maximum likelihood estimate for the graphical lasso method.

When fitting graphical lasso models, we used 10-fold cross-validation to choose the shrinkage parameter. When fitting the G-Wishart priors, we followed the conventional prior specification Ω ∼ W_G(3, I_p) and used the reversible jump algorithm of Dobra et al. (2011) for model fitting. For both the G-Wishart priors and the method of Wong et al. (2003), we used the default graphical model space prior (Carvalho & Scott, 2009)

p(G) = (1 + m)^{−1} C(m, |G|)^{−1},

where m = p(p − 1)/2, |G| is the total number of edges in graph G, and C(m, |G|) denotes the binomial coefficient. For the scale mixtures of uniforms, we considered the exponential power prior p(Ω | τ) ∝ exp{−∑_{i≤j} |ω_ij|^q/τ^q} with q ∈ {0·2, 1}, the generalized double Pareto prior p(Ω | τ) ∝ ∏_{i≤j} (1 + |ω_ij|/τ)^{−1−α}, and the new logarithmic prior p(Ω | τ) ∝ ∏_{i≤j} log(1 + τ²/ω_ij²). For the choice of the global shrinkage parameters, we assumed the conjugate distribution τ^{−q} ∼ Ga(1, 0·1) for the exponential power prior; α = 1 and 1/(1 + τ) ∼ U(0, 1) for the generalized double Pareto prior, as suggested by Armagan et al. (2011); and τ ∼ C+(0, 1) for the logarithmic prior, as was done for the horseshoe prior in Carvalho et al. (2010).

Twenty datasets were generated for each case. The Bayesian procedures used 15000 iterations, with the first 5000 discarded as burn-in. In all cases, convergence was rapid and mixing was good; the autocorrelation of each element of Ω typically died out after 10 lags. As for computational cost, the scale mixture of uniforms and the method of Wong et al. (2003) were significantly faster than the G-Wishart method. For example, for model 4, the G-Wishart method took about 11 hours for one dataset under a Matlab implementation on a six-core 3·3 GHz computer running CentOS 5·0 Unix, while the scale mixture of uniforms and the method of Wong et al. (2003) took only about 20 and 6 minutes, respectively. The graphical lasso method is used only as a benchmark for calibrating the Bayesian procedures. For each dataset, all Bayesian methods were compared to the graphical lasso method by computing the relative loss; for example, for the L1 loss, we computed the relative loss as L1(Σ̂, Σ) − L1(Σ̂_glasso, Σ), where Σ̂ is any Bayes estimator of Σ and Σ̂_glasso is the graphical lasso estimator. Thus, a negative value indicates that the method performs better than the graphical lasso procedure, and a smaller relative loss indicates a better relative performance.

Table 2 reports the simulation results. The two approaches based on point mass priors outperform the continuous shrinkage methods in sparser models such as model 1; however, they are outperformed in less sparse configurations such as models 2 and 4. One possible explanation is that the point mass priors tend to favor sparse models because they encourage sparsity through a positive prior mass at zero. Finally, the exponential power prior with q = 0·2, the generalized double Pareto prior and the logarithmic prior perform very similarly, ranking among the top methods in all cases.


Table 2: Summary of the relative L1 and L2 losses for different models and different methods. Medians are reported, with standard errors in parentheses.

           Model 1             Model 2              Model 3             Model 4
           L1        L2        L1        L2         L1        L2        L1        L2

n = 30
WG        -4·4(1·3) -5·9(1·4) -0·3(0·7) -12·7(4·6) -0·9(0·7)  1·4(2·5) -2·3(1·9) -0·0(0·9)
WCK       -4·4(1·0) -5·1(2·3) -0·7(0·6) -11·3(3·8) -1·2(0·6)  1·6(1·5) -2·2(1·0)  0·3(0·5)
EPq=1     -2·1(1·1)  2·1(1·0) -1·0(0·8) -14·0(4·7) -1·6(0·7) -1·0(2·2) -4·2(1·2) -1·1(0·5)
EPq=0·2   -3·8(1·1) -2·9(2·1) -0·9(0·8) -13·7(4·9) -1·4(0·7) -0·5(2·5) -3·1(1·1) -0·5(1·3)
GDP       -3·8(1·1) -3·2(2·2) -1·3(0·7) -13·2(4·3) -1·4(0·7) -0·4(2·3) -2·5(1·7) -0·4(0·9)
Log       -3·7(1·1) -2·3(1·4) -0·6(0·8) -13·3(4·9) -1·3(0·6) -0·2(2·5) -3·2(1·1) -0·8(0·9)

n = 100
WG        -1·7(0·2) -3·9(0·7) -0·3(0·4)  -0·4(1·5) -0·8(0·3) -1·5(1·5)  0·4(0·3)  0·7(0·3)
WCK       -1·3(0·2) -2·7(0·6) -0·7(0·3)  -0·8(1·1) -0·5(0·2)  0·3(1·4)  0·2(0·3)  0·5(0·3)
EPq=1     -0·2(0·2)  0·6(0·3) -0·6(0·3)   0·0(0·8) -0·2(0·3)  0·5(0·5) -1·1(0·3) -0·2(0·1)
EPq=0·2   -1·3(0·2) -1·8(0·3) -0·8(0·3)  -1·4(1·2) -0·6(0·2) -0·8(0·7) -0·3(0·3)  0·2(0·2)
GDP       -1·4(0·2) -2·1(0·4) -0·8(0·4)  -1·3(1·1) -0·6(0·2) -0·6(0·6) -1·0(0·3) -0·1(0·1)
Log       -1·4(0·2) -1·9(0·4) -0·8(0·3)  -1·3(1·1) -0·6(0·2) -0·6(0·6) -0·6(0·3)  0·0(0·2)

WG, G-Wishart; WCK, Wong et al. (2003); GDP, generalized double Pareto; EP, exponential power; Log, logarithmic.

In summary, the experiment illustrates that these three heavy-tailed priors in the scale mixture of uniform family are generally good for high dimensional covariance matrix estimation.

5 Application to multivariate conditional autoregressive models

5.1 Multivariate conditional autoregressive models based on scale mixture of uniform priors

Multivariate conditional autoregressive models (Banerjee et al., 2004) constitute a diverse set of powerful tools for modeling multivariate spatial random variables at the areal unit level. Let W = (w_ij)_{p_r×p_r} be the symmetric proximity matrix of p_r areal units, with w_ij ∈ {0, 1} and w_ii customarily set to 0. Then W defines an undirected graph G_r = (V_r, E_r), where an edge (i, j) ∈ E_r if and only if w_ij = 1. Let w_{i+} = ∑_j w_ij, E_W = diag(w_{1+}, . . . , w_{p_r+}) and M = (m_ij) = E_W − ρW. Let X = (x_1, . . . , x_{p_r})ᵀ denote a p_r × p_c random matrix, where each x_i is a p_c-dimensional vector corresponding to region i. Following Gelfand & Vounatsou (2003), one popular version of the multivariate conditional autoregressive models sets the joint distribution of X as

a pr × pc random matrix where each xi is a pc-dimensional vector corresponding to re-gion i. Following Gelfand & Vounatsou (2003), one popular version of the multivariateconditional autoregressive models sets the joint distribution of X as

vec(X) ∼ N{0, (Ω_c ⊗ Ω_r)^{−1}},   Ω_r | ρ = E_W − ρW,   Ω_c ∼ W(b_c, D_c),   (11)

where Ω_r is the p_r × p_r column covariance matrix, Ω_c is the p_c × p_c row covariance matrix, ρ is the coefficient measuring spatial association and is constrained to lie between the reciprocals of the minimum and maximum eigenvalues of W to ensure that Ω_r is nonsingular, and b_c and D_c respectively denote the degrees of freedom and location parameters of a Wishart prior distribution for Ω_c. The joint distribution in (11) implies the following conditional distribution:

x_i | x_{−i}, ρ, Ω_c ∼ N(∑_{j∈ne(i)} ρ w_{i+}^{−1} x_j, w_{i+}^{−1} Ω_c),


where ne(i) denotes the neighbors of region i, that is, the set of regions j with w_ij = 1. Evidently, the two covariance structures (Ω_r, Ω_c) are crucial in determining the effects of spatial smoothing. For the matrix Ω_c, direct application of shrinkage priors can reduce estimation uncertainty compared to the conjugate Wishart prior in (11). For Ω_r, one common value of ρ for all x_i may limit the flexibility of the model because it assumes the same spatial association for all regions. The recent work of Dobra et al. (2011) uses the G-Wishart framework to provide alternative models. Specifically, the authors recommend the following extensions for modeling (Ω_r, Ω_c):

Ω_r | M ∼ W_{G_r}(b_r, M),   M | ρ = E_W − ρW,   Ω_c ∼ W_{G_c}(b_c, D_c),   (12)

where the row graph G_r is fixed and obtained from the proximity matrix W, and the column graph G_c is unknown. For both models (11) and (12), a prior for ρ is chosen to give higher probability mass to values close to 1, to encourage sufficient spatial dependence. In particular, Dobra et al. (2011) put equal mass on the following 31 values: {0, 0·05, 0·1, . . . , 0·8, 0·82, . . . , 0·90, 0·91, . . . , 0·99}. Notice that Ω_r and Ω_c are not uniquely identified since, for any c > 0, Ω_c ⊗ Ω_r = (cΩ_c) ⊗ (Ω_r/c) (Wang & West, 2009). We address this by fixing Ω_{r,11} = 1.
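For concreteness, a small sketch (with an assumed 4-region chain as the proximity structure; not data from the paper) of the construction M = E_W − ρW entering (11) and (12):

```python
import numpy as np

# Assumed proximity matrix of a chain of 4 regions: w_ij = 1 for neighbors.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rho = 0.9                      # spatial association coefficient
E_W = np.diag(W.sum(axis=1))   # E_W = diag(w_{1+}, ..., w_{p_r+})
M = E_W - rho * W              # strictly diagonally dominant, hence positive
                               # definite, for 0 < rho < 1
```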

Using the theory and methods for covariance matrices developed in Section 3, we now extend the multivariate conditional autoregressive model (11) using the scale mixture of uniform distributions. We consider the following two extensions for modeling Ω_r ∈ M+(G_r) and Ω_c ∈ M+:

Ω_r | ρ = E_W − ρW,   p(Ω_c | τ_c) ∝ ∏_{i≥j} g_c(ω_{c,ij}/τ_c),   (13)

and

p(Ω_r) ∝ ∏_{(i,j)∈E_r ∪ {i=j∈V_r}} g_r(|ω_{r,ij} − m_ij|/τ_r) 1_{ω_{r,ij}<0, i≠j},   p(Ω_c | τ_c) ∝ ∏_{i≥j} g_c(ω_{c,ij}/τ_c).   (14)

The first extension (13) places shrinkage priors on Ω_c while leaving the model for Ω_r unchanged. The second extension (14) further shrinks Ω_r towards the matrix M = E_W − ρW while allowing adaptive spatial smoothing, by not constraining Ω_r to be controlled by a common parameter ρ.

One practical advantage of our model (14) over the model (12) of Dobra et al. (2011) is its flexibility in incorporating prior knowledge. For example, the similarity of spatial neighbors implies that the off-diagonal elements of Ω_r should be constrained to be negative (Banerjee et al., 2004). This point is not addressed by Dobra et al. (2011), as their method is only applicable when the free elements of Ω_r are not truncated. In the scale mixture of uniform framework, this important constraint is easily achieved by truncating each free off-diagonal element of Ω_r to be negative when sampling Ω_r. The functional form of g_r(·) and the shrinkage parameter τ_r can be pre-specified through careful prior elicitation as follows. Using the Gibbs sampler in Section 3.1, we are able to simulate from the prior distribution of Ω_r for fixed g_r(·) and τ_r. These prior draws


allow us to choose g_r(·) and τ_r to represent plausible ranges of spatial associations. To specify these ranges, one guideline can be based on the model (11), for which Gelfand & Vounatsou (2003) recommended a prior for ρ that favors the upper range of ρ ∈ (0, 1). In light of this recommendation, we prefer those g_r and τ_r that increasingly favor values of ω_{r,ij} close to −1 for any (i, j) ∈ E_r and ω_{r,ii} close to w_{i+} for i ∈ V_r. Such choices of priors integrate prior information about spatial associations and allow for varying spatial smoothing parameters across different regions.

5.2 US cancer data

Using our model, we analyze the same real data example studied by Dobra et al. (2011), concerning the application of multivariate spatial models to US cancer mortality rates. The data we analyzed consist of mortality counts for 10 types of tumors, recorded for the 48 mainland states plus the District of Columbia for the year 2000. The data were collected by the National Center for Health Statistics. Mortality counts below 25 were treated as missing because they are regarded as unreliable records in the cancer surveillance community. Let Y_ij be the number of deaths in state i = 1, . . . , p_r = 49 for tumor type j = 1, . . . , p_c = 10. Following Dobra et al. (2011), we considered Poisson multivariate loglinear models with spatial random effects:

Y_ij | η_ij ∼ Poi(η_ij),   log(η_ij) = log(q_i) + µ_j + X_ij,

where q_i is the population of state i, µ_j is the intercept of tumor type j, and X_ij is the zero-mean spatial random effect associated with state i and tumor j, with joint distribution vec(X) ∼ N{0, (Ω_c ⊗ Ω_r)^{−1}}.

We compared the out-of-sample predictive performance of models (13) and (14) against the model (11) of Gelfand & Vounatsou (2003) and model (12) of Dobra et al. (2011). For (11) and (12), we used the same hyper-parameter settings as in Dobra et al. (2011). For (13), we set g_c(·) to be the logarithmic density in (4) and placed a standard half-Cauchy prior on τ_c, in order to expect robust performance for shrinkage estimation of Ω_c, as suggested by the simulation study in Section 4. For (14), we let g_r(ω_{r,ij}) ∝ exp{−|ω_{r,ij} − m_ij|/τ_r} 1_{ω_{r,ij}<0} for i = j or (i, j) ∈ E_r, so that Ω_r is centered around M = E_W − W and the similarity of spatial neighbors is ensured. We did not choose heavy-tailed distributions for g_r(·) because the sample size p_c = 10 is relatively small for the dimension p_r = 49, and a heavy-tailed prior can lead to posterior distributions in which ω_{r,ij} is unrealistically small and ω_{r,ii} unrealistically large. We considered τ_r ∈ {0·1, 1, 10} to assess prior sensitivity. Finally, we modeled g_c(·) as in model (13).

In order to assess the out-of-sample predictive performance, we replicated the 10-fold cross-validation experiment of Dobra et al. (2011). Specifically, we divided the nonmissing counts of Y into 10 bins. For each bin i, we used the samples from the other 9 bins as observed data and imputed the samples from bin i as missing. To compare different models, we then computed the predictive mean squared error and


mean variance as follows:

MSE = |{(i, j) : Y_ij ≥ 25}|^{−1} ∑_{(i,j): Y_ij≥25} {Ê(Y_ij) − Y_ij}²,

and

VAR = |{(i, j) : Y_ij ≥ 25}|^{−1} ∑_{(i,j): Y_ij≥25} V̂ar(Y_ij),

where Ê(Y_ij) and V̂ar(Y_ij) are estimated using the posterior sample mean and variance based on the output of the analysis of the cross-validation dataset in which Y_ij is treated as missing. All results were obtained using a Monte Carlo sample of size 80000 after an initial, discarded burn-in of 80000 iterations.

Figure 1 shows the raw and predicted mortality rates of colon cancer. Table 3 reports the predictive performance as measured by the mean squared error and mean variance. All methods with shrinkage priors on Ω_c improve the prediction over the standard method using the Wishart prior. Among the shrinkage methods, the logarithmic prior outperforms the G-Wishart prior. Allowing Ω_r to be adaptive by setting τ_r = 1 and 10 can further reduce the mean squared error while maintaining about the same predictive variance as the common ρ model. Overall, our results suggest that the models (13) and (14) provide more accurate prediction and narrower credible intervals than the competing methods for this dataset.

To further study the sensitivity to the choice of τ_r, we plotted the marginal prior and posterior densities for a free off-diagonal element in Ω_r using samples from the analysis of the first cross-validation dataset. Figure 2 displays the inference for one element under τ_r ∈ {0·1, 1, 10}. Clearly, the marginal posterior distribution depends on the choice of τ_r. This is not surprising because the sample size is small compared to the dimension of Ω_r. The cases τ_r = 1 and 10 seem to perform well in this example because the marginal posterior distribution is influenced by the data. The case τ_r = 0·1 appears to be too tight and thus is not largely influenced by the data.

Regarding computing time, the Matlab implementation of model (14) took about 4 hours to complete the analysis of one of the ten cross-validation datasets, while model (12) of Dobra et al. (2011) took about 4 days. Additionally, Dobra et al. (2011) reported a runtime of about 22 hours on a dual-core 2·8 GHz computer under a C++ implementation for a similar dataset of size p_r = 49 and p_c = 11. As mentioned above, our models based on the scale mixture of uniforms are not only more flexible but also more computationally efficient.

6 Shrinkage priors for linear regression models

In this section we briefly investigate the properties of shrinkage priors constructed from scale mixtures of uniforms for linear regression models. Recently, shrinkage estimation for linear models has received a lot of attention (Park & Casella, 2008; Griffin & Brown, 2010; Armagan et al., 2011), all of which proceeds via the scale mixture of normals.


[Two-panel map omitted: (a) raw mortality rate; (b) predicted mortality rate; color scale from 0·5 to 2·5.]

Figure 1: US cancer mortality map of colon cancer (per 10,000 inhabitants). (a) The raw mortality rate; (b) the predicted mortality rate under model TDE+Log with τ_r = 1.

Table 3: Predictive mean squared error and variance in the 10-fold cross-validation predictive performance comparison in the cancer mortality example.

        GV     DLR   Common ρ+Log           TDE+Log
                                    τr=10    τr=1   τr=0·1
MSE    3126    2728       2340       2238    2187    2359
VAR    9177    6493       3814       3850    3810    3694

GV: the non-shrinkage model (11) of Gelfand & Vounatsou (2003); DLR: model (12) of Dobra et al. (2011); Common ρ+Log: model (13) under common ρ for Ω_r and the logarithmic prior for Ω_c; TDE+Log: model (14) under a truncated double-exponential prior for Ω_r with fixed but different τ_r and the logarithmic prior for Ω_c.

[Three-panel plot omitted: densities of ω_{r,ij} under (a) τ_r = 0·1, (b) τ_r = 1, (c) τ_r = 10.]

Figure 2: Marginal prior (dashed lines) and posterior (solid lines) densities of one free off-diagonal element in Ω_r from the analysis under model (14) with three different values of τ_r: (a) τ_r = 0·1, (b) τ_r = 1, (c) τ_r = 10.


Walker et al. (1997) and Qin et al. (1998) were among the first to use scale mixture of uniform priors for regression models. However, they used this family only for modeling the measurement errors and deriving the corresponding Gibbs sampler. To the best of our knowledge, we are the first to investigate the scale mixture of uniforms as a class of shrinkage priors for regression coefficients. When this paper was nearing completion, we were notified of a similar approach in the very recent work of Polson & Scott (2011), in which the authors independently propose a similar construction based on mixtures of Bartlett-Fejer kernels for bridge regression and proceed via a result similar to Theorem 1.

Consider the following version of a regularized Bayesian linear model, where the goal is to sample from the posterior distribution

p(β | σ, τ, Y) ∝ exp{−(Y − Xβ)ᵀ(Y − Xβ)/(2σ²)} ∏_{j=1}^p g{β_j/(στ)},

where g(·) is the shrinkage prior and τ is the global shrinkage parameter. Theorem 1 suggests we can introduce latent variables t = (t_1, . . . , t_p) such that the joint posterior of (β, t) is given by

p(β, t | σ, τ, Y) ∝ exp{−(Y − Xβ)ᵀ(Y − Xβ)/(2σ²)} ∏_{j=1}^p {−g′(t_j)} 1_{στt_j>|β_j|}.

The Gibbs sampler is then implemented by (a) simulating each β_j from a truncated normal distribution, and (b) simulating t_1, . . . , t_p in a block using the conditional cumulative distribution function in Theorem 2.
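A compact sketch of this two-step sampler for the exponential power prior (our illustration, with σ and τ held fixed for brevity; not the authors' code):

```python
import numpy as np
from scipy import stats

def gibbs_shrinkage_regression(Y, X, q=0.2, tau=1.0, sigma=1.0,
                               n_iter=2000, seed=0):
    """(a) beta_j | rest: normal truncated to |beta_j| < sigma*tau*t_j;
    (b) t_j | beta_j: Table 1 inverse CDF with theta = beta_j/(sigma*tau)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta, t = np.zeros(p), np.ones(p)
    xtx = np.sum(X ** 2, axis=0)
    draws = np.empty((n_iter, p))
    for it in range(n_iter):
        for j in range(p):                       # (a) truncated normal updates
            r = Y - X @ beta + X[:, j] * beta[j]     # partial residual
            m, s = X[:, j] @ r / xtx[j], sigma / np.sqrt(xtx[j])
            bound = sigma * tau * t[j]
            a, b = (-bound - m) / s, (bound - m) / s
            beta[j] = stats.truncnorm.rvs(a, b, loc=m, scale=s, random_state=rng)
        u = rng.uniform(size=p)                  # (b) block inverse CDF update of t
        t = ((np.abs(beta) / (sigma * tau)) ** q - np.log(u)) ** (1.0 / q)
        draws[it] = beta
    return draws
```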

We compare the posterior mean estimators under the exponential power prior with q = 0·2 and the logarithmic prior to the posterior means corresponding to several other existing priors. These two shrinkage priors are interesting because the exponential power prior is the Bayesian analog of bridge regression (Park & Casella, 2008), is challenging for full posterior analysis using the scale mixture of normals, and is relatively unexplored; the logarithmic prior is a new prior that resembles the horseshoe prior, which has been shown to have advantages over many existing approaches (Carvalho et al., 2010).

We use the simulation setting considered in Armagan et al. (2011). Specifically, we generate n = 50 observations from y = xᵀβ + ε, ε ∼ N(0, 3²), where β has one of the following five configurations: (i) β = (1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)ᵀ, (ii) β = (3,3,3,3,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)ᵀ, (iii) β = (1,1,1,1,1,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0)ᵀ, (iv) β = (3,3,3,3,3,0,0,0,0,0,3,3,3,3,3,0,0,0,0,0)ᵀ, (v) β = (0·85, . . . , 0·85)ᵀ; and x = (x_1, . . . , x_p)ᵀ follows one of two scenarios: (a) the x_j are independent and identically distributed standard normals, (b) x is multivariate normal with E(x) = 0 and cov(x_j, x_j′) = 0·5^{|j−j′|}. The variance is assigned the Jeffreys prior p(σ²) ∝ 1/σ². The global shrinkage parameter is assumed to have the conjugate distribution τ^{−q} ∼ Ga(1, 1) for the exponential power prior with q = 0·2, and τ ∼ C+(0, 1) for the logarithmic prior.


Table 4: Summary of model errors for the simulation study in the regression analysis of Section 6. Median model errors are reported; bootstrap standard errors are in parentheses.

          (a) x_j independent                             (b) x_j correlated
          (i)      (ii)     (iii)    (iv)     (v)         (i)      (ii)     (iii)    (iv)     (v)
GDPa     2·7(0·1) 2·2(0·2) 4·0(0·2) 3·8(0·2) 5·7(0·3)    2·1(0·1) 2·1(0·1) 3·2(0·1) 4·2(0·3) 4·4(0·1)
GDPb     2·8(0·2) 2·1(0·2) 4·6(0·2) 3·8(0·2) 7·0(0·2)    1·9(0·1) 2·0(0·1) 3·3(0·2) 4·2(0·2) 4·7(0·1)
GDPc     2·6(0·1) 2·4(0·2) 4·4(0·2) 4·0(0·2) 6·5(0·2)    1·9(0·1) 2·2(0·1) 3·1(0·2) 4·3(0·2) 4·3(0·1)
HS       2·7(0·1) 2·1(0·2) 4·8(0·2) 3·8(0·2) 7·3(0·2)    2·0(0·1) 2·0(0·1) 3·3(0·2) 4·3(0·2) 4·6(0·1)
EPq=1    3·2(0·1) 4·0(0·3) 5·1(0·3) 4·9(0·3) 7·3(0·5)    2·1(0·1) 2·8(0·2) 2·8(0·1) 4·2(0·3) 3·5(0·2)
EPq=0·2  2·5(0·1) 2·0(0·1) 4·7(0·1) 3·9(0·3) 7·3(0·3)    2·0(0·1) 2·1(0·1) 3·2(0·1) 3·9(0·1) 5·4(0·2)
Log      2·5(0·1) 2·5(0·2) 4·5(0·2) 4·5(0·2) 6·4(0·4)    2·0(0·1) 2·4(0·1) 3·0(0·1) 4·3(0·1) 4·6(0·2)

GDPa,b,c, three recommended generalized double Pareto priors in Armagan et al. (2011); HS, horseshoe; EP, exponential power; Log, logarithmic.

Model error is calculated using the Mahalanobis distance (β̂ − β)ᵀ Σ_X (β̂ − β), where Σ_X is the covariance matrix used to generate X.

Table 4 reports the median model errors and the bootstrap standard errors based on 100 datasets for each case. Results for methods other than the exponential power prior with q = 0·2 and the logarithmic prior are based on the values reported by Armagan et al. (2011). Except for models (iii) and (v) in the correlated predictor scenario, the exponential power prior with q = 1 is outperformed by the other methods. The performance of the exponential power prior with q = 0·2 and the logarithmic prior is comparable with that of the generalized double Pareto and horseshoe priors.

7 Conclusion

The scale mixture of uniform priors provide a unified framework for shrinkage estimation of covariance matrices for a wide class of prior distributions. Further research on the scale mixture of uniform distributions is of interest for developing theoretical insights as well as computational advances in shrinkage estimation for Bayesian analysis of covariance matrices and other related models. One obvious next step is to investigate covariance selection models that encourage exact zeros for a subset of elements of Ω under the scale mixture of uniform priors. Such extensions can potentially combine the flexibility of the scale mixture of uniform priors with the interpretability of the graphs implied by exact zero elements. Another interesting research direction is the generalization of the basic random sampling models to dynamic settings that allow the covariance structure to be time-varying. Such models are useful for analyzing high-dimensional time series data encountered in areas such as finance and the environmental sciences. We are currently investigating these extensions, and we expect the Gibbs sampler developed in Section 3.1 to play a key role in model fitting in these settings.


Acknowledgements

The authors thank Abel Rodriguez and James G. Scott for very useful conversations and references. NSP gratefully acknowledges NSF grant DMS 1107070.

Appendix

Details of the sampling algorithm in Section 3.1

The joint distribution of (l21, d1, d2) is

p(d1, d2, l21 | −) ∝ d1^{n/2+1} d2^{n/2} exp[−{s11 d1 + s22(l21² d1 + d2) + 2 s21 d1 l21}/2] 1_{Ω_{e,e}∈T}.

Clearly, the full conditional distributions of d1, d2 and l21 are given by

d1 ∼ Ga{n/2 + 2, (s11 + s22 l21² + 2 s21 l21)/2} 1_{Ω_{e,e}∈T},
d2 ∼ Ga(n/2 + 1, s22/2) 1_{Ω_{e,e}∈T},
l21 ∼ N{−s21/s22, 1/(s22 d1)} 1_{Ω_{e,e}∈T},

respectively. To identify the truncation region T, recall

Ω_{e,e} = A + B,   A = ( d1       d1 l21       )
                       ( d1 l21   d1 l21² + d2 ),   B = ( b11  b12 )
                                                        ( b21  b22 ).

The set T = {|ω_ij| < τt_ij} ∩ {|ω_ii| < τt_ii} ∩ {|ω_jj| < τt_jj} can be written as

{|d1 + b11| < τt_ii} ∩ {|d1 l21 + b21| < τt_ij} ∩ {|d1 l21² + d2 + b22| < τt_jj}.   (15)

Given B, t_ii, t_ij and t_jj, (15) gives straightforward expressions for the truncation region of each variable in (d1, d2, l21) conditional on the other two.

Sampling a univariate truncated normal can be carried out efficiently using the method of Robert (1995), while sampling a truncated gamma is based on the inverse cumulative distribution function method.
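A minimal sketch of one constrained block update built from these conditionals (our illustration; it uses naive rejection for the indicator 1_{Ω_{e,e}∈T} in place of the inverse cumulative distribution function and Robert (1995) samplers used above):

```python
import numpy as np

rng = np.random.default_rng(0)

def in_T(d1, d2, l21, B, bnd):
    # Truncation region (15); bnd = (tau*t_ii, tau*t_ij, tau*t_jj).
    return (abs(d1 + B[0, 0]) < bnd[0]
            and abs(d1 * l21 + B[1, 0]) < bnd[1]
            and abs(d1 * l21 ** 2 + d2 + B[1, 1]) < bnd[2])

def update_block(n, S_ee, B, bnd, d1, d2, l21, max_tries=1000):
    # One scan over (d1, d2, l21): draw each untruncated full conditional
    # repeatedly until Omega_ee = A + B lands in T (rejection sampling).
    s11, s21, s22 = S_ee[0, 0], S_ee[1, 0], S_ee[1, 1]
    for _ in range(max_tries):  # d1 ~ Ga{n/2+2, rate=(s11 + s22 l21^2 + 2 s21 l21)/2}
        prop = rng.gamma(n / 2 + 2, 2.0 / (s11 + s22 * l21 ** 2 + 2 * s21 * l21))
        if in_T(prop, d2, l21, B, bnd):
            d1 = prop
            break
    for _ in range(max_tries):  # d2 ~ Ga(n/2+1, rate=s22/2)
        prop = rng.gamma(n / 2 + 1, 2.0 / s22)
        if in_T(d1, prop, l21, B, bnd):
            d2 = prop
            break
    for _ in range(max_tries):  # l21 ~ N{-s21/s22, 1/(s22 d1)}
        prop = rng.normal(-s21 / s22, 1.0 / np.sqrt(s22 * d1))
        if in_T(d1, d2, prop, B, bnd):
            l21 = prop
            break
    return d1, d2, l21
```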

References

Andrews, D. F. & Mallows, C. L. (1974). Scale mixtures of normal distributions. Journal of the Royal Statistical Society, Series B 36, 99–102.

Armagan, A. (2009). Variational bridge regression. Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS) 5.

Armagan, A., Dunson, D. & Lee, J. (2011). Generalized double Pareto shrinkage. ArXiv e-prints.

Atay-Kayis, A. & Massam, H. (2005). The marginal likelihood for decomposable and non-decomposable graphical Gaussian models. Biometrika 92, 317–35.

Banerjee, S., Carlin, B. P. & Gelfand, A. E. (2004). Hierarchical Modeling and Analysis for Spatial Data. Boca Raton: Chapman & Hall.

Barnard, J., McCulloch, R. & Meng, X.-L. (2000). Modeling covariance matrices in terms of standard deviations and correlations, with application to shrinkage. Statistica Sinica 10, 1281–1311.

Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer Series in Statistics. New York: Springer, 2nd ed.

Berger, J. O. & Berliner, L. M. (1986). Robust Bayes and empirical Bayes analysis with ε-contaminated priors. The Annals of Statistics 14, 461–486.

Carvalho, C. M., Polson, N. G. & Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika 97, 465–480.

Carvalho, C. M. & Scott, J. G. (2009). Objective Bayesian model selection in Gaussian graphical models. Biometrika 96, 497–512.

Daniels, M. J. & Kass, R. E. (1999). Nonconjugate Bayesian estimation of covariance matrices and its use in hierarchical models. Journal of the American Statistical Association 94, 1254–1263.

Daniels, M. J. & Kass, R. E. (2001). Shrinkage estimators for covariance matrices. Biometrics 57, 1173–1184.

Dobra, A., Lenkoski, A. & Rodriguez, A. (2011). Bayesian inference for general Gaussian graphical models with application to multivariate lattice data. Journal of the American Statistical Association (to appear).

Fan, J., Feng, Y. & Wu, Y. (2009). Network exploration via the adaptive lasso and SCAD penalties. Annals of Applied Statistics 3, 521–541.

Feller, W. (1971). An Introduction to Probability Theory and its Applications, vol. II. New York: John Wiley & Sons, 2nd ed.

Friedman, J., Hastie, T. & Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 432–441.

Gelfand, A. E. & Vounatsou, P. (2003). Proper multivariate conditional autoregressive models for spatial data analysis. Biostatistics 4, 11–15.

Griffin, J. E. & Brown, P. J. (2010). Inference with normal-gamma prior distributions in regression problems. Bayesian Analysis 5, 171–188.

Hans, C. (2009). Bayesian lasso regression. Biometrika 96, 835–845.

Jones, B., Carvalho, C., Dobra, A., Hans, C., Carter, C. & West, M. (2005). Experiments in stochastic computation for high-dimensional graphical models. Statistical Science 20, 388–400.

Liechty, J. C., Liechty, M. W. & Muller, P. (2004). Bayesian correlation estimation. Biometrika 91, 1–14.

Liechty, M. W., Liechty, J. C. & Muller, P. (2009). The shadow prior. Journal of Computational and Graphical Statistics 18, 368–383.

Mitsakakis, N., Massam, H. & Escobar, M. (2010). A Metropolis-Hastings based method for sampling from the G-Wishart distribution in Gaussian graphical models. Tech. rep., University of Toronto.

Park, T. & Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association 103, 681–686.

Polson, N. G. & Scott, J. G. (2011). The Bayesian bridge. ArXiv e-prints.

Qin, Z., Walker, S. & Damien, P. (1998). Uniform scale mixture models with application to Bayesian inference. Working paper, University of Michigan Ross School of Business.

Robert, C. P. (1995). Simulation of truncated normal variables. Statistics and Computing 5, 121–125.

Rothman, A. J., Bickel, P. J., Levina, E. & Zhu, J. (2008). Sparse permutation invariant covariance estimation. Electronic Journal of Statistics 2, 494–515.

Roverato, A. (2002). Hyper-inverse Wishart distribution for non-decomposable graphs and its application to Bayesian inference for Gaussian graphical models. Scandinavian Journal of Statistics 29, 391–411.

Walker, S., Damien, P. & Meyer, M. (1997). On scale mixtures of uniform distributions and the latent weighted least squares method. Working paper, University of Michigan Ross School of Business.

Wang, H. (2011). The Bayesian graphical lasso and efficient posterior computation. Working paper, University of South Carolina.

Wang, H. & Carvalho, C. M. (2010). Simulation of hyper-inverse Wishart distributions for non-decomposable graphs. Electronic Journal of Statistics 4, 1470–1475.

Wang, H. & West, M. (2009). Bayesian analysis of matrix normal graphical models. Biometrika 96, 821–834.

West, M. (1987). On scale mixtures of normal distributions. Biometrika 74, 646–648.

Wong, F., Carter, C. & Kohn, R. (2003). Efficient estimation of covariance selection models. Biometrika 90, 809–30.

Yang, R. & Berger, J. O. (1994). Estimation of a covariance matrix using the reference prior. The Annals of Statistics 22, 1195–1211.

Yuan, M. & Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika 94, 19–35.