

Hierarchical Models: A Current Computational Perspective

James P. HOBERT

James P. Hobert is Associate Professor, Department of Statistics, University of Florida, Gainesville, FL 32611 (E-mail: [email protected]). The author is grateful to George Casella for constructive comments and suggestions and to Ranjini Natarajan for a couple of helpful conversations.

1. INTRODUCTION

Hierarchical models (HMs) have many applications in statistics, ranging from accounting for overdispersion (Cox 1983) to constructing minimax estimators (Strawderman 1971). Perhaps the most common use is to induce correlations among a group of random variables that model observations with something in common; for example, the set of measurements made on a particular individual in a longitudinal study. My starting point is the following general, two-stage HM.

Let $(y_1^T, u_1^T)^T, \ldots, (y_k^T, u_k^T)^T$ be $k$ independent random vectors, where $T$ denotes matrix transpose, and let $\lambda = (\lambda_1^T, \lambda_2^T)^T$ be a vector of unknown parameters. The data vectors, $y_1, \ldots, y_k$, are modeled conditionally on the unobservable random-effects vectors, $u_1, \ldots, u_k$. Specifically, $f_i(y_i \mid u_i; \lambda_1)$, the conditional density function of $y_i$ given $u_i$, is assumed to be some parametric density in $y_i$ whose parameters are written as functions of $u_i$ and $\lambda_1$. At the second level of the hierarchy, it is assumed that $g_i(u_i; \lambda_2)$, the marginal density function of $u_i$, is some parametric density in $u_i$ whose parameters are written as functions of $\lambda_2$. (Of course, any of these random variables could be discrete, in which case we would simply replace the phrase "density function" with "mass function.") The Bayesian version of this model, in which $\lambda$ is considered a random variable with prior $\pi(\lambda)$, is essentially the conditionally independent hierarchical model of Kass and Steffey (1989).

The fully specified generalized linear mixed model (Breslow and Clayton 1993; McCulloch 1997), which is currently receiving a great deal of attention, is a special case. Indeed, many (if not most) of the parametric HMs found in the statistical literature are special cases of this general HM. Section 2 presents three specific examples, including a model for McCullagh and Nelder's (1989) salamander data.

The likelihood function for this model is the marginal density of the observed data viewed as a function of the parameters; that is,

$$L(\lambda; y) = \prod_{i=1}^{k} \int f_i(y_i \mid u_i; \lambda_1)\, g_i(u_i; \lambda_2)\, du_i, \qquad (1)$$

where $y = (y_1^T, \ldots, y_k^T)^T$ denotes the observed data. As usual, Bayesian inference about $\lambda$ is made through the posterior density $\pi(\lambda \mid y) \propto L(\lambda; y)\pi(\lambda)$. In some applications, inferences about the $u_i$'s may be of interest, but in this article I restrict attention to inference about $\lambda$.
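To make (1) concrete, the short sketch below approximates the marginal log-likelihood for a toy two-stage specification that is not one of the models discussed in this article: counts that are conditionally Poisson given a normally distributed random effect. The integral over each $u_i$ has no closed form, so it is approximated by averaging the conditional density over draws from $g_i$. All data, parameter values, and function names here are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def marginal_loglik(beta, sigma2, y, n_draws=20_000):
    """Monte Carlo approximation of log L(lambda; y) in (1) for a toy
    Poisson-lognormal HM:  y_i | u_i ~ Poisson(exp(beta + u_i)),
    u_i ~ N(0, sigma2).  Each intractable integral over u_i is replaced
    by an average of f_i(y_i | u_i) over draws u_i^(l) from g_i."""
    u = rng.normal(0.0, np.sqrt(sigma2), size=n_draws)       # draws from g_i
    loglik = 0.0
    for yi in y:
        cond_dens = stats.poisson.pmf(yi, np.exp(beta + u))  # f_i(y_i | u_i; lambda_1)
        loglik += np.log(cond_dens.mean())                   # one integral in (1)
    return loglik

# hypothetical data and parameter values
y = np.array([2, 0, 5, 1, 3])
print(marginal_loglik(beta=0.5, sigma2=1.0, y=y))
```

Averaging over draws from the second-stage density is the simplest possible estimator; it becomes unreliable as the dimension of the random-effects vector grows, which is exactly the difficulty discussed in Section 3 for the salamander likelihood.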

Unfortunately, a HM is simply not useful from a statistical standpoint unless there exists a computational technique that allows for reliable approximation of the quantities necessary for inference. Indeed, it has been the ongoing development of techniques like the EM algorithm (Dempster, Laird, and Rubin 1977; McLachlan and Krishnan 1997) and Markov chain Monte Carlo (Besag, Green, Higdon, and Mengersen 1995; Gelfand and Smith 1990; Tierney 1994) that has enabled statisticians to make use of increasingly complex HMs over the last few decades. Section 3 contains a discussion of some key aspects of the computational techniques currently used in conjunction with Bayesian and frequentist HMs. The context of this discussion is the model for the salamander data developed in Section 2. Several areas that are ripe for new research are identified.

2. THREE EXAMPLES

An exhaustive list of all the types of data that can be modeled with the general HM of Section 1 obviously is beyond the scope of this article. However, I hope that the three examples outlined here provide the reader with a sense of how extensive such a list might be. Please note that the notation used in these specific examples is standard, but not necessarily consistent with that of Section 1.

2.1 Rat Growth Data

Gelfand, Hills, Racine-Poon, and Smith (1990) analyzed data from an experiment in which the weights of 60 young rats (30 in a control group and 30 in a treatment group) were measured weekly for 5 weeks starting on day 8. Let $y_{ij}$ denote the weight of the $i$th rat on day $x_j = 7j + 1$, where $i = 1, \ldots, 60$ and $j = 1, \ldots, 5$. Suppose that $1 \le i \le 30$ corresponds to the control group.

The following random coefficient regression model was used by Gelfand et al. (1990). Conditional on a pair of rat-specific regression parameters, $\alpha_i$ and $\beta_i$, it is assumed that $y_{i1}, \ldots, y_{i5}$ are independent with

$$y_{ij} \mid \alpha_i, \beta_i \sim N\bigl(\alpha_i + \beta_i x_j,\; \sigma_c^2 I(1 \le i \le 30) + \sigma_t^2 I(31 \le i \le 60)\bigr),$$

where $I(\cdot)$ is an indicator function. Thus the model allows for a difference between the variances of the error terms for the control rats and the treatment rats.

Presumably, the experimenters were not interested in the effect of treatment and control on this particular set of 60 rats, but rather on some population of rats, of which these 60 can be considered a random sample. Thus instead of taking the $(\alpha_i, \beta_i)$ pairs as fixed unknown parameters, it is assumed that $(\alpha_i, \beta_i)^T$, $i = 1, \ldots, 30$, are iid bivariate normal with mean $(\alpha_c, \beta_c)^T$ and covariance matrix $\Sigma_c$, and that $(\alpha_i, \beta_i)^T$, $i = 31, \ldots, 60$, are iid bivariate normal with mean $(\alpha_t, \beta_t)^T$ and covariance matrix $\Sigma_t$.

A comparison of the "c" and "t" parameters allows one to make inferences regarding treatment versus control across the relevant population of rats. Gelfand et al. (1990) performed a Bayesian analysis with conjugate priors on all of the parameters. The Gibbs sampler (see the Gibbs sampler vignette by Gelfand) was used to make inferences.
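As a concrete illustration of the two-stage structure, here is a minimal sketch that simulates data from the random coefficient model above. The population-level parameter values are invented for illustration, not taken from Gelfand et al. (1990).

```python
import numpy as np

rng = np.random.default_rng(1)
x = 7 * np.arange(1, 6) + 1                      # measurement days x_j = 7j + 1

# hypothetical population-level parameters (second stage)
mean_c, mean_t = np.array([100.0, 6.0]), np.array([110.0, 7.0])   # (alpha, beta)
Sigma_c = np.array([[40.0, 2.0], [2.0, 0.5]])
Sigma_t = np.array([[50.0, 3.0], [3.0, 0.8]])
sigma2_c, sigma2_t = 25.0, 36.0                  # error variances (first stage)

y = np.empty((60, 5))
for i in range(60):
    control = i < 30
    mean, Sigma = (mean_c, Sigma_c) if control else (mean_t, Sigma_t)
    alpha_i, beta_i = rng.multivariate_normal(mean, Sigma)        # rat-specific coefficients
    sd = np.sqrt(sigma2_c if control else sigma2_t)
    y[i] = alpha_i + beta_i * x + rng.normal(0.0, sd, size=5)     # weekly weights

print(y[:2].round(1))   # first two simulated rats
```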

2.2 Seizure Data

Thall and Vail (1990) described an experiment in which 59 epileptics were randomly assigned to one of two treatment groups (active drug and placebo). The number of seizures experienced by each patient during four consecutive 2-week periods following treatment was recorded. Let $y_{ij}$ denote the number of seizures experienced by the $i$th patient over the $j$th 2-week period, where $i = 1, \ldots, 59$ and $j = 1, 2, 3, 4$. Other available covariates are age and a pretreatment baseline seizure count.

Consider a naive generalized linear model (McCullagh and Nelder 1989) for these data in which the $y_{ij}$'s are assumed to be independent Poisson random variables whose means are linked to linear predictors involving the covariates. I refer to this model as naive for two reasons. First, it is probably reasonable to assume that observations on two different patients are independent, but it is surely unreasonable to assume that observations made on the same patient are independent. Second, it appears that these data are overdispersed with respect to the Poisson distribution; that is, there is more variability than would be expected under the Poisson assumption.

Booth, Casella, and Hobert (2000) suggested the following HM, which allows for both dependence among an individual's observations and overdispersion. Let $u_1, \ldots, u_{59}$ be iid $N(0, \sigma^2)$ and think of these as random patient effects. Conditional on $u_i$, $y_{i1}, \ldots, y_{i4}$ are assumed to be iid negative binomial random variables with index $\alpha$ and mean $\mu_i = \exp\{x_i^T \beta + u_i\}$, where $x_i^T$ is an appropriately chosen vector of covariates. More specifically,

$$P(y_{ij} = y \mid u_i; \beta, \alpha) = \frac{\Gamma(y + \alpha)}{\Gamma(\alpha)\, y!} \left(\frac{\mu_i}{\mu_i + \alpha}\right)^{y} \left(\frac{\alpha}{\mu_i + \alpha}\right)^{\alpha}, \qquad (2)$$

where $\alpha > 0$ and $y \in \{0, 1, 2, \ldots\}$.

The random patient effects induce a positive correlation between counts from the same patient. Regarding potential overdispersion, note that $E(y_{ij} \mid u_i) = \mu_i$ and $\mathrm{var}(y_{ij} \mid u_i) = \mu_i + \mu_i^2/\alpha$, and that (2) becomes a Poisson probability as $\alpha \to \infty$. Thus the parameter $\alpha$ allows for overdispersion.
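The following sketch checks these two facts numerically, using SciPy's negative binomial, which with $n = \alpha$ and $p = \alpha/(\mu + \alpha)$ has exactly the mass function in (2); the parameter values are arbitrary.

```python
import numpy as np
from scipy import stats

mu, alpha, y = 3.0, 5.0, np.arange(0, 15)

# Negative binomial (2): index alpha, mean mu; scipy's (n, p) parameterization
# with n = alpha and p = alpha / (mu + alpha) gives the same pmf.
nb = stats.nbinom(n=alpha, p=alpha / (mu + alpha))
print(nb.mean(), nb.var())               # mu and mu + mu**2 / alpha  ->  3.0, 4.8

# As alpha -> infinity, the pmf approaches the Poisson(mu) pmf.
for a in (5.0, 50.0, 5000.0):
    nb_a = stats.nbinom(n=a, p=a / (mu + a))
    gap = np.max(np.abs(nb_a.pmf(y) - stats.poisson.pmf(y, mu)))
    print(f"alpha = {a:7.0f}   max pmf difference = {gap:.5f}")
```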

2.3 The Salamander Data

The salamander data consist of three separate experiments, each performed according to the design given in table 14.3 of McCullagh and Nelder (1989) and each involving matings among salamanders in two closed groups. Thus there are a total of six closed groups of salamanders. Each group contained 20 salamanders: 5 species R females, 5 species W females, 5 species R males, and 5 species W males. (Actually, the same 40 salamanders were used in two of the experiments, but we ignore this and act as if three different sets of 40 salamanders were used because of the long time delay between these two experiments.) Within each group, only 60 of the possible 100 heterosexual crosses were observed. Let $y_{hij}$ be the indicator of a successful mating between the $i$th female and $j$th male in group $h$, where $i, j = 1, 2, \ldots, 10$ and $h = 1, \ldots, 6$. Note that for fixed $h$, only 60 of the possible 100 $(i, j)$ pairs are relevant. There are four possible sex-species combinations, and the experimenters collected the data hoping to answer the question: Are there differences in the four mating success probabilities?

Any two observations involving a common salamander (male or female) should obviously be modeled as correlated random variables. One way to induce such correlations is through the following HM, which is one of several models introduced by Karim and Zeger (1992). Let $v_{hi}$ denote the random effect that the $i$th female in group $h$ has across matings in which she is involved, and define $w_{hj}$ similarly for the $j$th male in group $h$. Let $v_h = (v_{h1}, \ldots, v_{h10})^T$ and $v = (v_1^T, \ldots, v_6^T)^T$, and define $w_h$ and $w$ similarly. Likewise, let $y_1, \ldots, y_6$ be $60 \times 1$ vectors containing the binary outcomes from the six closed groups, and put $y = (y_1^T, \ldots, y_6^T)^T$.

Conditional on $v_h$ and $w_h$, the elements of $y_h$ are assumed independent and such that $y_{hij} \mid v_{hi}, w_{hj} \sim \mathrm{Bernoulli}(\pi_{hij})$ with

$$\mathrm{logit}(\pi_{hij}) = x_{hij}^T \beta + v_{hi} + w_{hj},$$

where $x_{hij}$ is a $1 \times 4$ vector indicating the type of mating (i.e., it contains three 0's and a single 1) and $\beta$ is an unknown regression parameter. Finally, it is assumed that the elements of $v$ are iid $N(0, \sigma_v^2)$ and independent of $w$, whose elements are assumed to be iid $N(0, \sigma_w^2)$. In terms of the general HM of Section 1, $\lambda = (\beta^T, \sigma_v^2, \sigma_w^2)^T$.
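The crossed structure is easy to see in simulation. The sketch below generates one group's worth of binary outcomes from the model above, using a full 10 x 10 cross rather than the 60 observed pairs and an arbitrary assignment of sex-species combinations; the parameter values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

beta = np.array([1.0, 0.5, -1.5, 0.9])        # hypothetical effects, one per cross type
sigma2_v, sigma2_w = 1.2, 1.0                 # female and male variance components

v = rng.normal(0.0, np.sqrt(sigma2_v), size=10)   # female random effects v_hi
w = rng.normal(0.0, np.sqrt(sigma2_w), size=10)   # male random effects w_hj

# Arbitrary labels: females 0-4 are species R, 5-9 species W, and likewise for males.
# Cross-type index: 0 = RxR, 1 = RxW, 2 = WxR, 3 = WxW (female species listed first).
female_sp = np.repeat([0, 1], 5)
male_sp = np.repeat([0, 1], 5)

y = np.empty((10, 10), dtype=int)
for i in range(10):
    for j in range(10):
        cross = 2 * female_sp[i] + male_sp[j]
        eta = beta[cross] + v[i] + w[j]            # logit(pi_hij)
        pi = 1.0 / (1.0 + np.exp(-eta))
        y[i, j] = rng.binomial(1, pi)

print(y)
```

Because every outcome in row $i$ shares $v_{hi}$ and every outcome in column $j$ shares $w_{hj}$, the likelihood contribution of a group cannot be factored below the 20-dimensional integral over $(v_h, w_h)$ discussed in Section 3.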

In the next section I look at Bayesian and frequentist versions of this model, and consider the computational techniques required for making inferences. There are two main reasons for using this particular dataset: (a) the models give a reasonable indication of how complex a HM statisticians are currently able to handle, and (b) I wish to take advantage of the fact that many readers will be familiar with this dataset, as it has been discussed by so many authors (e.g., Booth and Hobert 1999; Chan and Kuk 1997; Karim and Zeger 1992; Lee and Nelder 1996; McCullagh and Nelder 1989; McCulloch 1994; Vaida and Meng 1998).

3. COMPUTATION FOR THE SALAMANDER MODEL

The fact that only 60 of the possible 100 heterosexual crosses were observed in each of the six groups does not complicate the analysis of the data, but it does complicate notation. Specifically, for $h = 1, \ldots, 6$, we require

$$S_h = \{(i, j) : \text{the } i\text{th female and } j\text{th male in group } h \text{ were coupled}\}.$$

The likelihood function associated with Karim and Zeger's (1992) model for the salamander data is

$$L(\beta, \sigma_v^2, \sigma_w^2; y) = \prod_{h=1}^{6} L_h(\beta, \sigma_v^2, \sigma_w^2; y_h), \qquad (3)$$

where, up to a multiplicative constant,

$$L_h(\beta, \sigma_v^2, \sigma_w^2; y_h) = \int\!\!\int \left[ \prod_{(i,j) \in S_h} \frac{\exp\{y_{hij}(x_{hij}^T \beta + v_{hi} + w_{hj})\}}{1 + \exp\{x_{hij}^T \beta + v_{hi} + w_{hj}\}} \right] \exp\left\{ -\frac{1}{2\sigma_v^2} \sum_{i=1}^{10} v_{hi}^2 - \frac{1}{2\sigma_w^2} \sum_{j=1}^{10} w_{hj}^2 \right\} dv_h\, dw_h.$$

Thus $L$ contains six intractable 20-dimensional integrals, which are due to the crossed nature of the design. Bayesian and frequentist computations are considered in the following two sections.
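To convey what "intractable" means in practice, the sketch below approximates one $L_h$ by naive Monte Carlo: the bracketed conditional likelihood is averaged over draws of $(v_h, w_h)$ from their normal distributions, which estimates the integral up to the same multiplicative constant. The design, data, and parameter values are simulated stand-ins, not the real salamander data, and this crude estimator has very high variance in realistic settings.

```python
import numpy as np

rng = np.random.default_rng(3)

def loglik_group(beta, s2v, s2w, pairs, x, y, n_draws=100_000):
    """Naive Monte Carlo estimate of log L_h in (3), up to an additive constant:
    average the conditional Bernoulli likelihood over (v_h, w_h) drawn from
    N(0, s2v I) and N(0, s2w I)."""
    v = rng.normal(0.0, np.sqrt(s2v), size=(n_draws, 10))    # v_h draws
    w = rng.normal(0.0, np.sqrt(s2w), size=(n_draws, 10))    # w_h draws
    log_cond = np.zeros(n_draws)
    for (i, j), x_ij, y_ij in zip(pairs, x, y):
        eta = x_ij @ beta + v[:, i] + w[:, j]                # logit(pi_hij)
        log_cond += y_ij * eta - np.log1p(np.exp(eta))       # log Bernoulli term
    m = log_cond.max()                                       # stable log of the MC average
    return m + np.log(np.mean(np.exp(log_cond - m)))

# hypothetical design: 60 of the 100 (i, j) pairs, random cross-type indicators
pairs = [(i, j) for i in range(10) for j in range(10)][:60]
x = np.eye(4)[rng.integers(0, 4, size=60)]                   # one-hot cross type
y = rng.integers(0, 2, size=60)                              # fake binary outcomes
print(loglik_group(np.array([1.0, 0.5, -1.5, 0.9]), 1.2, 1.0, pairs, x, y))
```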

3.1 Bayesian Analysis

A consequence of the complexity of the salamander likelihood function is that, no matter what (nontrivial) prior is chosen for $\lambda = (\beta^T, \sigma_v^2, \sigma_w^2)^T$, the integrals defining any posterior quantities of interest will not be analytically tractable. Furthermore, the high dimension of these integrals probably rules out numerical integration. Thus, to make inferences, one is forced to use either analytical approximations (Kass and Steffey 1989; Tierney, Kass, and Kadane 1989) or Markov chain Monte Carlo (MCMC) techniques.

Suppose that the Gibbs sampler is to be used to sample from $\pi(\lambda \mid y)$. In an attempt to be "objective" and simultaneously make the full conditionals needed for Gibbs as simple as possible, one might choose an improper conjugate prior such as $\pi(\lambda) \propto (\sigma_v^2)^a (\sigma_w^2)^b$ for some $a$ and $b$ (Karim and Zeger 1992). Such priors can, however, lead to serious problems. For example, depending on the value of $(a, b)$, it may be the case that all of the full conditionals needed to apply the Gibbs sampler are proper, whereas the posterior distribution itself is improper (Natarajan and McCulloch 1995). Compounding this potential problem is the fact that an improper posterior may not be apparent from the Gibbs output (Hobert and Casella 1996). Moreover, even if $\pi(\lambda) \propto (\sigma_v^2)^a (\sigma_w^2)^b$ does result in a proper posterior, it is not a "noninformative" prior in any formal sense for this HM. Thus it may be the case that the prior is actually driving the inferences, which is the opposite of what was intended. Unfortunately, even if one is willing to suffer the (sampling) consequences of using a nonconjugate prior, standard "default" priors such as Jeffreys's prior are not available in closed form.

To guard against improper posteriors, some authors have suggested using "diffuse" proper priors, which typically means proper conjugate priors that are nearly improper. Indeed, the author is guilty of making such suggestions: as a prior for a variance component in a normal mixed model, Hobert and Casella (1996, p. 1469) suggested using $\pi(\sigma^2) \propto (\sigma^2)^{-(r+1)} \exp\{-1/(s\sigma^2)\}$ with small positive values for $r$ and $s$. The problem is that these diffuse priors are still not noninformative in any formal sense and, furthermore, there is empirical evidence that such priors may lead to very slowly converging Gibbs samplers (Natarajan and McCulloch 1998).

We conclude that choosing a prior for $\lambda$ is currently a real dilemma for a Bayesian with no prior information. Consequently, development of reasonable default priors for $\lambda$ (of the HM in Sec. 1) seems to be a potentially rich area for future research. Natarajan and Kass (2000) discuss some recent results along these lines.

Suppose now that $\pi(\lambda)$ is a satisfactory prior for the salamander model, and that the Gibbs sampler is run to produce $(v^{(t)}, w^{(t)}, \beta^{(t)}, \sigma_v^{2(t)}, \sigma_w^{2(t)})$, $t = 0, 1, 2, \ldots$, where $t = 0$ corresponds to some intelligently chosen starting value. (See C. J. Geyer's web page at http://www.stat.umn.edu/~charlie/ for an interesting discussion of burn-in and the use of parallel chains in MCMC.) An obvious estimate of $E(\sigma_v^2 \mid y)$ is

$$\frac{1}{M} \sum_{t=1}^{M} \sigma_v^{2(t)}. \qquad (4)$$

Most statisticians would agree that without an associated standard error, this Monte Carlo estimate is not very useful. Despite this, estimates based on MCMC are routinely presented without reliable standard errors. The reason for this is presumably that calculating the standard error of an estimate based on MCMC is often not trivial. Indeed, establishing the existence of a central limit theorem (CLT) for a Monte Carlo estimate based on a Markov chain can be difficult (Chan and Geyer 1994; Meyn and Tweedie 1993; Tierney 1994). (Readers who are skeptical about the importance of a CLT should look at the example in sec. 4 of Roberts and Rosenthal 1998.)

The most common method of establishing the existence of CLTs is to show that the Markov chain itself is geometrically ergodic. Although many Markov chains that are the basis of MCMC algorithms have been shown to be geometrically ergodic (e.g., Chan 1993; Hobert and Geyer 1998; Mengersen and Tweedie 1996; Roberts and Rosenthal 1999; Roberts and Tweedie 1996), myriad complex MCMC algorithms are currently in use to which these results do not apply. An example is the Gibbs sampler for the salamander model (with a proper conjugate prior on $\lambda$). Thus another area of research that is full of possibilities is establishing geometric ergodicity (or lack thereof) of the Markov chains used in popular MCMC algorithms.

Unfortunately, even if it is known that a CLT exists, estimating the asymptotic variance may not be easy (Geyer 1992; Mykland, Tierney, and Yu 1995). For example, to apply the regenerative method of Mykland et al. (1995), a minorization condition (Rosenthal 1995) must be established. Depending on the complexity of the Markov chain, this can be a challenging exercise. Thus important work remains to be done on this topic as well.
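One simple, widely used way to attach a standard error to an ergodic average such as (4) is the method of batch means, which is valid when a CLT holds and the batches are long enough to be roughly independent. The sketch below applies it to a toy autoregressive chain standing in for the Gibbs output; the chain and its parameters are illustrative only, not the salamander sampler.

```python
import numpy as np

rng = np.random.default_rng(4)

def batch_means_se(chain, n_batches=50):
    """Batch means standard error for the mean of a stationary Markov chain."""
    m = len(chain) // n_batches
    batch_avgs = chain[: n_batches * m].reshape(n_batches, m).mean(axis=1)
    # The variance of the batch averages estimates (asymptotic variance) / m.
    return batch_avgs.std(ddof=1) / np.sqrt(n_batches)

# Toy stand-in for Gibbs output: a stationary AR(1) chain with known mean 0.
M, rho = 100_000, 0.9
chain = np.empty(M)
chain[0] = rng.normal(0.0, 1.0 / np.sqrt(1 - rho**2))
for t in range(1, M):
    chain[t] = rho * chain[t - 1] + rng.normal()

naive_se = chain.std(ddof=1) / np.sqrt(M)      # ignores autocorrelation; too small
print(chain.mean(), naive_se, batch_means_se(chain))
```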

3.2 Frequentist Analysis

Now the problem shifts from sampling from the posterior distribution to maximizing the likelihood function. As mentioned earlier, the likelihood function for the salamander model involves six intractable 20-dimensional integrals. Evidently, computational techniques that can be used to maximize such complex likelihood functions must make use of the Monte Carlo method. McCulloch (1997) described and compared three methods for maximizing likelihood functions based on generalized linear mixed models: Monte Carlo EM (MCEM), Monte Carlo Newton-Raphson (MCNR), and simulated maximum likelihood. He concluded that MCEM and MCNR are both generally effective (and better than simulated maximum likelihood), but noted a drawback concerning "the complications of deciding whether the stochastic versions of EM or NR have converged." Since the appearance of McCulloch's 1997 article, MCEM seems to have received more attention than MCNR. Consequently, here I focus on MCEM.

Most of the problems associated with MCEM, including the one pointed out by McCulloch, stem from an inability to quantify the Monte Carlo error introduced at each step of the algorithm. An important area for future research is the development of methods that will enable one to get a handle on this Monte Carlo error. To explain this problem in more detail, I consider using MCEM to maximize (3).

Viewing the random effects, $v$ and $w$, as missing data leads to the following implementation of the EM algorithm (Dempster et al. 1977). The $(r+1)$th E step entails calculation of

$$Q(\lambda \mid \lambda^{(r)}) = \sum_{h=1}^{6} E_h\bigl[\log\{f_h(y_h \mid v_h, w_h; \lambda_1)\, g_h(v_h, w_h; \lambda_2)\} \mid y_h; \lambda^{(r)}\bigr],$$

where $\lambda^{(r)}$ is the result of the $r$th iteration of EM, and $E_h$ is with respect to the conditional distribution of $(v_h, w_h)$ given $y_h$ with parameter value $\lambda^{(r)}$, whose density is denoted by $g_h(v_h, w_h \mid y_h; \lambda^{(r)})$. Once $Q$ has been calculated, one simply maximizes it over $\lambda$ to get $\lambda^{(r+1)}$; that is, perform the $(r+1)$th M step.

There is simply no hope of computing $Q$ in closed form, because $g_h(v_h, w_h \mid y_h; \lambda^{(r)})$ involves the same intractable integrals as those found in the likelihood function. The MCEM algorithm (Wei and Tanner 1990) avoids an intractable E step by replacing $Q$ with a Monte Carlo approximation. For example, instead of maximizing $Q$, one maximizes

$$\tilde{Q}(\lambda \mid \lambda^{(r)}) = \sum_{h=1}^{6} \frac{1}{M} \sum_{l=1}^{M} \log\{f_h(y_h \mid v_h^{(l)}, w_h^{(l)}; \lambda_1)\, g_h(v_h^{(l)}, w_h^{(l)}; \lambda_2)\},$$

where $(v_h^{(l)}, w_h^{(l)})$, $l = 1, 2, \ldots, M$, could be an iid sample from $g_h(v_h, w_h \mid y_h; \lambda^{(r)})$ or from an ergodic Markov chain with stationary density $g_h(v_h, w_h \mid y_h; \lambda^{(r)})$. (Importance sampling is another possibility.)

Of course, there is no free lunch. Although using MCEM circumvents a complicated expectation at each E step, it necessitates a method for choosing $M$ at each MCE step. If $M$ is too small, then the EM step will be swamped by Monte Carlo error, whereas too large an $M$ is wasteful. Booth and Hobert (1999) argued that the ability to choose an appropriate value for $M$ hinges on the existence of a CLT for the maximizer of $\tilde{Q}$ and one's ability to estimate the corresponding asymptotic variance. These authors provided methods for choosing $M$ and diagnosing convergence when $\tilde{Q}$ is based on independent samples, and showed that their MCEM algorithm can be used to maximize complex likelihoods, such as the salamander likelihood described herein. However, it is clear that when the dimension of the intractable integrals in the likelihood is very large, one will be forced to construct $\tilde{Q}$ using MCMC techniques such as the Metropolis algorithm developed by McCulloch (1997). The discussion in the previous section suggests that when $\tilde{Q}$ is based on a Markov chain, verifying the existence of a CLT for the maximizer of $\tilde{Q}$ and estimating the corresponding asymptotic variance can be difficult problems (Levine and Casella 1998). This is another important and potentially rich area for new research.
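To make the MCEM recipe concrete, here is a minimal sketch for a much simpler HM than the salamander model: one count per subject, $y_i \mid u_i \sim \mathrm{Poisson}(e^{\beta + u_i})$ with $u_i \sim N(0, \sigma^2)$. The E step is approximated by importance sampling (the third option mentioned above), drawing the $u_i^{(l)}$ from their current $N(0, \sigma^{2(r)})$ distribution and weighting by the conditional likelihood; for this toy model both M-step updates are available in closed form. Everything here, including the data, is illustrative, and $M$ is held fixed rather than chosen adaptively as Booth and Hobert (1999) recommend.

```python
import numpy as np

rng = np.random.default_rng(5)

# hypothetical data: one count per subject
y = np.array([0, 2, 1, 7, 3, 0, 4, 1, 2, 9])
n, M = len(y), 5_000

beta, sigma2 = 0.0, 1.0                                   # starting value lambda^(0)
for r in range(50):                                       # MCEM iterations
    # E step (Monte Carlo): for each i, draw u^(l) ~ N(0, sigma2) and weight by
    # f(y_i | u^(l); beta), so the weighted draws target g(u_i | y_i; lambda^(r)).
    u = rng.normal(0.0, np.sqrt(sigma2), size=(n, M))
    logw = y[:, None] * (beta + u) - np.exp(beta + u)     # log f(y_i | u; beta), up to a constant
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                     # self-normalized weights per subject

    # M step: closed-form maximizers of Q-tilde for this toy model.
    beta = np.log(y.sum() / (w * np.exp(u)).sum())
    sigma2 = (w * u**2).sum() / n

print(beta, sigma2)
```

Because fresh draws are used at every iteration, the parameter path never settles exactly; quantifying that residual Monte Carlo error, and increasing $M$ accordingly, is precisely the difficulty discussed above.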

REFERENCES

Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian Computation and Stochastic Simulation" (with discussion), Statistical Science, 10, 3-66.
Booth, J. G., and Hobert, J. P. (1999), "Maximizing Generalized Linear Mixed Model Likelihoods With an Automated Monte Carlo EM Algorithm," Journal of the Royal Statistical Society, Ser. B, 61, 265-285.
Booth, J. G., Casella, G., and Hobert, J. P. (2000), "Negative Binomial Log-Linear Mixed Models," technical report, University of Florida, Dept. of Statistics.
Breslow, N. E., and Clayton, D. G. (1993), "Approximate Inference in Generalized Linear Mixed Models," Journal of the American Statistical Association, 88, 9-25.
Chan, J. S. K., and Kuk, A. Y. C. (1997), "Maximum Likelihood Estimation for Probit-Linear Mixed Models With Correlated Random Effects," Biometrics, 53, 86-97.
Chan, K. S. (1993), "Asymptotic Behavior of the Gibbs Sampler," Journal of the American Statistical Association, 88, 320-326.
Chan, K. S., and Geyer, C. J. (1994), Comment on "Markov Chains for Exploring Posterior Distributions," by L. Tierney, The Annals of Statistics, 22, 1747-1757.
Cox, D. R. (1983), "Some Remarks on Overdispersion," Biometrika, 70, 269-274.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), "Maximum Likelihood From Incomplete Data via the EM Algorithm" (with discussion), Journal of the Royal Statistical Society, Ser. B, 39, 1-38.
Gelfand, A. E., Hills, S. E., Racine-Poon, A., and Smith, A. F. M. (1990), "Illustration of Bayesian Inference in Normal Data Models Using Gibbs Sampling," Journal of the American Statistical Association, 85, 972-985.
Gelfand, A. E., and Smith, A. F. M. (1990), "Sampling-Based Approaches to Calculating Marginal Densities," Journal of the American Statistical Association, 85, 398-409.
Geyer, C. J. (1992), "Practical Markov Chain Monte Carlo" (with discussion), Statistical Science, 7, 473-511.
Hobert, J. P., and Casella, G. (1996), "The Effect of Improper Priors on Gibbs Sampling in Hierarchical Linear Mixed Models," Journal of the American Statistical Association, 91, 1461-1473.
Hobert, J. P., and Geyer, C. J. (1998), "Geometric Ergodicity of Gibbs and Block Gibbs Samplers for a Hierarchical Random-Effects Model," Journal of Multivariate Analysis, 67, 414-430.
Karim, M. R., and Zeger, S. L. (1992), "Generalized Linear Models With Random Effects; Salamander Mating Revisited," Biometrics, 48, 631-644.
Kass, R. E., and Steffey, D. (1989), "Approximate Bayesian Inference in Conditionally Independent Hierarchical Models (Parametric Empirical Bayes Models)," Journal of the American Statistical Association, 84, 717-726.
Lee, Y., and Nelder, J. A. (1996), "Hierarchical Generalized Linear Models" (with discussion), Journal of the Royal Statistical Society, Ser. B, 58, 619-678.
Levine, R. A., and Casella, G. (1998), "Implementations of the Monte Carlo EM Algorithm," technical report, University of California, Davis, Intercollege Division of Statistics.
McCullagh, P., and Nelder, J. A. (1989), Generalized Linear Models (2nd ed.), London: Chapman and Hall.
McCulloch, C. E. (1994), "Maximum Likelihood Variance Components Estimation for Binary Data," Journal of the American Statistical Association, 89, 330-335.
McCulloch, C. E. (1997), "Maximum Likelihood Algorithms for Generalized Linear Mixed Models," Journal of the American Statistical Association, 92, 162-170.
McLachlan, G. J., and Krishnan, T. (1997), The EM Algorithm and Extensions, New York: Wiley.
Mengersen, K. L., and Tweedie, R. L. (1996), "Rates of Convergence of the Hastings and Metropolis Algorithms," The Annals of Statistics, 24, 101-121.
Meyn, S. P., and Tweedie, R. L. (1993), Markov Chains and Stochastic Stability, New York: Springer-Verlag.
Mykland, P., Tierney, L., and Yu, B. (1995), "Regeneration in Markov Chain Samplers," Journal of the American Statistical Association, 90, 233-241.
Natarajan, R., and Kass, R. E. (2000), "Reference Bayesian Methods for Generalized Linear Mixed Models," Journal of the American Statistical Association, 95, 227-237.
Natarajan, R., and McCulloch, C. E. (1995), "A Note on the Existence of the Posterior Distribution for a Class of Mixed Models for Binomial Responses," Biometrika, 82, 639-643.
Natarajan, R., and McCulloch, C. E. (1998), "Gibbs Sampling With Diffuse Proper Priors: A Valid Approach to Data-Driven Inference?," Journal of Computational and Graphical Statistics, 7, 267-277.
Roberts, G. O., and Rosenthal, J. S. (1998), "Markov Chain Monte Carlo: Some Practical Implications of Theoretical Results" (with discussion), Canadian Journal of Statistics, 26, 5-31.
Roberts, G. O., and Rosenthal, J. S. (1999), "Convergence of Slice Sampler Markov Chains," Journal of the Royal Statistical Society, Ser. B, 61, 643-660.
Roberts, G. O., and Tweedie, R. L. (1996), "Geometric Convergence and Central Limit Theorems for Multidimensional Hastings and Metropolis Algorithms," Biometrika, 83, 95-110.
Rosenthal, J. S. (1995), "Minorization Conditions and Convergence Rates for Markov Chain Monte Carlo," Journal of the American Statistical Association, 90, 558-566.
Strawderman, W. E. (1971), "Proper Bayes Minimax Estimators of the Multivariate Normal Mean," The Annals of Mathematical Statistics, 42, 385-388.
Thall, P. F., and Vail, S. C. (1990), "Some Covariance Models for Longitudinal Count Data With Overdispersion," Biometrics, 46, 657-671.
Tierney, L. (1994), "Markov Chains for Exploring Posterior Distributions" (with discussion), The Annals of Statistics, 22, 1701-1762.
Tierney, L., Kass, R. E., and Kadane, J. B. (1989), "Fully Exponential Laplace Approximations to Expectations and Variances of Nonpositive Functions," Journal of the American Statistical Association, 84, 710-716.
Vaida, F., and Meng, X.-L. (1998), "A Flexible Gibbs-EM Algorithm for Generalized Linear Mixed Models With Binary Response," in Proceedings of the 13th International Workshop on Statistical Modeling, eds. B. Marx and H. Friedl, pp. 374-379.
Wei, G. C. G., and Tanner, M. A. (1990), "A Monte Carlo Implementation of the EM Algorithm and the Poor Man's Data Augmentation Algorithms," Journal of the American Statistical Association, 85, 699-704.
