
Mathematical Statistics

Old School

John I. Marden
Department of Statistics
University of Illinois at Urbana-Champaign


© 2017 by John I. Marden
Email: [email protected]
Web: http://stat.istics.net/MathStat

Typeset using the memoir package (Madsen and Wilson, 2015) with LaTeX (Lamport, 1994).


Preface

My idea of mathematical statistics encompasses three main areas: The mathematics needed as a basis for work in statistics; the mathematical methods for carrying out statistical inference; and the theoretical approaches for analyzing the efficacy of various procedures. This book is conveniently divided into three parts roughly corresponding to those areas.

Part I introduces distribution theory, covering the basic probability distributions and their properties. Here we see distribution functions, densities, moment generating functions, transformations, the multivariate normal distribution, joint, marginal, and conditional distributions, Bayes theorem, and convergence in probability and distribution.

Part II is the core of the book, focussing on inference, mostly estimation and hypothesis testing, but also confidence intervals and model selection. The emphasis is on frequentist procedures, partly because they take more explanation, but Bayesian inference is fairly well represented as well. Topics include exponential family and linear regression models; likelihood methods in estimation, testing, and model selection; and bootstrap and randomization techniques.

Part III considers statistical decision theory, which evaluates the efficacy of procedures. In earlier years this material would have been considered the essence of mathematical statistics: UMVUEs, the CRLB, UMP tests, invariance, admissibility, minimaxity. Since much of this material deals with small sample sizes, few parameters, very specific models, and super-precise comparisons of procedures, it may now seem somewhat quaint. Certainly, as statistical investigations become increasingly complex, there being a single optimal procedure, or a simply described set of admissible procedures, is highly unlikely. But the discipline of stating clearly what the statistical goals are, what types of procedures are under consideration, and how one evaluates the procedures is key to preserving statistics as a coherent intellectual area, rather than just a handy collection of computational techniques.

This material was developed over the last thirty years teaching various configurations of mathematical statistics and decision theory courses. It is currently used as a main text in a one-semester course aimed at master's students in statistics and in a two-semester course aimed at Ph.D. students in statistics. Both courses assume a prerequisite of a rigorous mathematical statistics course at the level of Hogg, McKean, and Craig (2013), though the Ph.D. students are generally expected to have learned the material at a higher level of mathematical sophistication.

Much of Part I constitutes a review of the material in Hogg et al., hence does not need to be covered in detail, though the material on conditional distributions, the multivariate normal, and mapping and the ∆-method in asymptotics (Chapters 6, 7, and 9) may need extra emphasis. The masters-level course covers a good chunk of Part II, particularly Chapters 10 through 16. It would leave out the more technical sections on likelihood asymptotics (Sections 14.4 through 14.7), and possibly the material on regularization and least absolute deviations in linear regression (Sections 12.5 and 12.6). It would also not touch Part III. The Ph.D.-level course can proceed more quickly through Part I, then cover Part II reasonably comprehensively. The most typical topics in Part III to cover are the optimality results in testing and estimation (Chapters 19 and 21), and general statistical decision theory up through the James-Stein estimator and randomized procedures (Sections 20.1 through 20.7). The last section of Chapter 20 and the whole of Chapter 22 deal with necessary conditions for admissibility, which would be covered only if wishing to go deeper into statistical decision theory.

The mathematical level of the course is a bit higher than that of Hogg et al. (2013) and in the same ballpark as texts like the mathematical statistics books Bickel and Doksum (2007), Casella and Berger (2002), and Knight (1999), the testing/estimation duo Lehmann and Romano (2005) and Lehmann and Casella (2003), and the more decision-theoretic treatments Ferguson (1967) and Berger (1993). A solid background in calculus and linear algebra is necessary, and real analysis is a plus. The later decision-theoretic material needs some set theory and topology. By restricting primarily to densities with respect to either Lebesgue measure or counting measure, I have managed to avoid too much explicit measure theory, though there are places where "with probability one" statements are unavoidable. Billingsley (1995) is a good resource for further study in measure-theoretic probability.

Notation for variables and parameters mostly follows the conventions that capital letters represent random quantities, and lowercase represent specific values and constants; bold letters indicate vectors or matrices, while non-bolded ones are scalars; and Latin letters represent observed variables and constants, with Greek letters representing parameters. There are exceptions, such as using the Latin "p" as a parameter, and functions will usually be non-bold, even when the output is multidimensional.

This book would not exist if I didn't think I understood the material well enough to teach it. To the extent I do, thanks go to my professors at the University of Chicago, especially Raj Bahadur, Michael Perlman, and Michael Wichura.


Contents

Preface iii

Contents v

I Distribution Theory 1

1 Distributions and Densities 3
   1.1 Introduction 3
   1.2 Probability 3
   1.3 Distribution functions 5
   1.4 PDFs: Probability density functions 6
      1.4.1 A bivariate pdf 7
   1.5 PMFs: Probability mass functions 8
   1.6 Distributions without pdfs or pmfs 9
      1.6.1 Late start 9
      1.6.2 Spinner 11
      1.6.3 Mixed-type densities 12
   1.7 Exercises 13

2 Expected Values, Moments, and Quantiles 17
   2.1 Definition of expected value 17
      2.1.1 Indicator functions 19
   2.2 Means, variances, and covariances 19
      2.2.1 Uniform on a triangle 21
      2.2.2 Variance of linear combinations & affine transformations 22
   2.3 Vectors and matrices 23
   2.4 Moments 25
   2.5 Moment and cumulant generating functions 26
      2.5.1 Normal distribution 28
      2.5.2 Gamma distribution 28
      2.5.3 Binomial and multinomial distributions 29
      2.5.4 Proof of the moment generating lemma 32
   2.6 Quantiles 33
   2.7 Exercises 34

3 Marginal Distributions and Independence 39
   3.1 Marginal distributions 39
      3.1.1 Multinomial distribution 40
   3.2 Marginal densities 40
      3.2.1 Ranks 41
      3.2.2 PDFs 42
   3.3 Independence 42
      3.3.1 Independent exponentials 43
      3.3.2 Spaces and densities 44
      3.3.3 IID 46
   3.4 Exercises 46

4 Transformations: DFs and MGFs 49
   4.1 Adding up the possibilities 49
      4.1.1 Sum of discrete uniforms 50
      4.1.2 Convolutions for discrete variables 50
      4.1.3 Sum of two Poissons 51
   4.2 Distribution functions 52
      4.2.1 Convolutions for continuous random variables 52
      4.2.2 Uniform → Cauchy 53
      4.2.3 Probability transform 54
      4.2.4 Location-scale families 55
   4.3 Moment generating functions 56
      4.3.1 Uniform → Exponential 56
      4.3.2 Sum of independent gammas 57
      4.3.3 Linear combinations of independent normals 57
      4.3.4 Normalized means 58
      4.3.5 Bernoulli and binomial 59
   4.4 Exercises 60

5 Transformations: Jacobians 65
   5.1 One dimension 65
   5.2 General case 66
   5.3 Gamma, beta, and Dirichlet distributions 67
      5.3.1 Dirichlet distribution 68
   5.4 Affine transformations 70
      5.4.1 Bivariate normal distribution 70
      5.4.2 Orthogonal transformations and polar coordinates 71
      5.4.3 Spherically symmetric pdfs 73
      5.4.4 Box-Muller transformation 74
   5.5 Order statistics 75
   5.6 Exercises 77

6 Conditional Distributions 81
   6.1 Introduction 81
   6.2 Examples of conditional distributions 82
      6.2.1 Simple linear regression 82
      6.2.2 Mixture models 83
      6.2.3 Hierarchical models 83
      6.2.4 Bayesian models 83
   6.3 Conditional & marginal → Joint 84
      6.3.1 Joint densities 85
   6.4 Marginal distributions 85
      6.4.1 Coins and the beta-binomial distribution 86
      6.4.2 Simple normal linear model 86
      6.4.3 Marginal mean and variance 87
      6.4.4 Fruit flies 89
   6.5 Conditional from the joint 91
      6.5.1 Coins 91
      6.5.2 Bivariate normal 92
   6.6 Bayes theorem: Reversing the conditionals 93
      6.6.1 AIDS virus 94
      6.6.2 Beta posterior for the binomial 95
   6.7 Conditionals and independence 95
      6.7.1 Independence of residuals and X 95
   6.8 Exercises 96

7 The Multivariate Normal Distribution 103
   7.1 Definition 103
      7.1.1 Spectral decomposition 105
   7.2 Some properties of the multivariate normal 106
      7.2.1 Affine transformations 106
      7.2.2 Marginals 106
      7.2.3 Independence 107
   7.3 PDF 108
   7.4 Sample mean and variance 108
   7.5 Chi-square distribution 110
      7.5.1 Noninvertible covariance matrix 111
      7.5.2 Idempotent covariance matrix 112
      7.5.3 Noncentral chi-square distribution 113
   7.6 Student's t distribution 114
   7.7 Linear models and the conditional distribution 115
   7.8 Exercises 116

8 Asymptotics: Convergence in Probability and Distribution 125
   8.1 Set-up 125
   8.2 Convergence in probability to a constant 125
   8.3 Chebyshev's inequality and the law of large numbers 126
      8.3.1 Regression through the origin 128
   8.4 Convergence in distribution 129
      8.4.1 Points of discontinuity of F 131
      8.4.2 Converging to a constant random variable 132
   8.5 Moment generating functions 132
   8.6 Central limit theorem 133
      8.6.1 Supersizing 134
   8.7 Exercises 136

9 Asymptotics: Mapping and the ∆-Method 139
   9.1 Mapping 139
      9.1.1 Regression through the origin 141
   9.2 ∆-method 141
      9.2.1 Median 142
   9.3 Variance stabilizing transformations 143
   9.4 Multivariate ∆-method 145
      9.4.1 Mean, variance, and coefficient of variation 145
      9.4.2 Correlation coefficient 147
      9.4.3 Affine transformations 148
   9.5 Exercises 148

II Statistical Inference 153

10 Statistical Models and Inference 155
   10.1 Statistical models 155
   10.2 Interpreting probability 156
   10.3 Approaches to inference 158
   10.4 Exercises 159

11 Estimation 161
   11.1 Definition of estimator 161
   11.2 Bias, standard errors, and confidence intervals 162
   11.3 Plug-in methods: Parametric 163
      11.3.1 Coefficient of variation 164
   11.4 Plug-in methods: Nonparametric 164
   11.5 Plug-in methods: Bootstrap 165
      11.5.1 Sample mean and median 166
      11.5.2 Using R 168
   11.6 Posterior distribution 169
      11.6.1 Normal mean 169
      11.6.2 Improper priors 172
   11.7 Exercises 173

12 Linear Regression 179
   12.1 Regression 179
   12.2 Matrix notation 180
   12.3 Least squares 181
      12.3.1 Standard errors and confidence intervals 183
   12.4 Bayesian estimation 184
   12.5 Regularization 185
      12.5.1 Ridge regression 185
      12.5.2 Hurricanes 187
      12.5.3 Subset selection: Mallows' Cp 189
      12.5.4 Lasso 190
   12.6 Least absolute deviations 191
   12.7 Exercises 193

13 Likelihood, Sufficiency, and MLEs 199
   13.1 Likelihood function 199
   13.2 Likelihood principle 200
      13.2.1 Binomial and negative binomial 201
   13.3 Sufficiency 202
      13.3.1 IID 203
      13.3.2 Normal distribution 203
      13.3.3 Uniform distribution 204
      13.3.4 Laplace distribution 204
      13.3.5 Exponential families 204
   13.4 Conditioning on a sufficient statistic 205
      13.4.1 IID 208
      13.4.2 Normal mean 208
      13.4.3 Sufficiency in Bayesian analysis 209
   13.5 Rao-Blackwell: Improving an estimator 209
      13.5.1 Normal probability 210
      13.5.2 IID 211
   13.6 Maximum likelihood estimates 212
   13.7 Functions of estimators 214
      13.7.1 Poisson distribution 214
   13.8 Exercises 215

14 More on Maximum Likelihood Estimation 221
   14.1 Score function 221
      14.1.1 Fruit flies 222
   14.2 Fisher information 222
   14.3 Asymptotic normality 224
      14.3.1 Sketch of the proof 224
   14.4 Cramér's conditions 225
   14.5 Consistency 226
      14.5.1 Convexity and Jensen's inequality 226
      14.5.2 A consistent sequence of roots 228
   14.6 Proof of asymptotic normality 229
   14.7 Asymptotic efficiency 231
      14.7.1 Mean and median 232
   14.8 Multivariate parameters 234
      14.8.1 Non-IID models 235
      14.8.2 Common mean 235
      14.8.3 Logistic regression 236
   14.9 Exercises 240

15 Hypothesis Testing 245
   15.1 Accept/Reject 246
      15.1.1 Interpretation 248
   15.2 Tests based on estimators 249
      15.2.1 Linear regression 250
   15.3 Likelihood ratio test 250
   15.4 Bayesian testing 251
   15.5 P-values 256
   15.6 Confidence intervals from tests 257
   15.7 Exercises 259

16 Likelihood Testing and Model Selection 263
   16.1 Likelihood ratio test 263
      16.1.1 Normal mean 263
      16.1.2 Linear regression 265
      16.1.3 Independence in a 2 × 2 table 266
      16.1.4 Checking the dimension 267
   16.2 Asymptotic null distribution of the LRT statistic 267
      16.2.1 Composite null 268
   16.3 Score tests 269
      16.3.1 Many-sided 270
   16.4 Model selection: AIC and BIC 272
   16.5 BIC: Motivation 273
   16.6 AIC: Motivation 275
      16.6.1 Multiple regression 276
   16.7 Exercises 277

17 Randomization Testing 285
   17.1 Randomization model: Two treatments 286
   17.2 Fisher's exact test 288
      17.2.1 Tasting tea 290
   17.3 Testing randomness 291
   17.4 Randomization tests for sampling models 293
      17.4.1 Paired comparisons 294
      17.4.2 Regression 295
   17.5 Large sample approximations 296
      17.5.1 Technical conditions 298
      17.5.2 Sign changes 299
   17.6 Exercises 300

18 Nonparametric Tests Based on Signs and Ranks 303
   18.1 Sign test 303
   18.2 Rank transform tests 304
      18.2.1 Signed-rank test 304
      18.2.2 Mann-Whitney/Wilcoxon two-sample test 305
      18.2.3 Spearman's ρ independence test 307
   18.3 Kendall's τ independence test 308
      18.3.1 Ties 309
      18.3.2 Jonckheere-Terpstra test for trend among groups 311
   18.4 Confidence intervals 313
      18.4.1 Kendall's τ and the slope 314
   18.5 Exercises 315

III Optimality 323

19 Optimal Estimators 325
   19.1 Unbiased estimators 326
   19.2 Completeness and sufficiency 327
      19.2.1 Poisson distribution 328
      19.2.2 Uniform distribution 329
   19.3 Uniformly minimum variance estimators 329
      19.3.1 Poisson distribution 330
   19.4 Completeness for exponential families 330
      19.4.1 Examples 331
   19.5 Cramér-Rao lower bound 332
      19.5.1 Laplace distribution 333
      19.5.2 Normal µ² 334
   19.6 Shift-equivariant estimators 335
   19.7 The Pitman estimator 336
      19.7.1 Shifted exponential distribution 339
      19.7.2 Laplace distribution 339
   19.8 Exercises 340

20 The Decision-Theoretic Approach 345
   20.1 Binomial estimators 345
   20.2 Basic setup 346
   20.3 Bayes procedures 347
   20.4 Admissibility 348
   20.5 Estimating a normal mean 351
      20.5.1 Stein's surprising result 352
   20.6 Minimax procedures 355
   20.7 Game theory and randomized procedures 356
   20.8 Minimaxity and admissibility when T is finite 358
   20.9 Exercises 361

21 Optimal Hypothesis Tests 369
   21.1 Randomized tests 369
   21.2 Simple versus simple 370
   21.3 Neyman-Pearson lemma 372
      21.3.1 Examples 373
   21.4 Uniformly most powerful tests 376
      21.4.1 One-sided exponential family testing problems 379
      21.4.2 Monotone likelihood ratio 379
   21.5 Locally most powerful tests 381
   21.6 Unbiased tests 383
      21.6.1 Examples 384
   21.7 Nuisance parameters 386
   21.8 Exercises 388

22 Decision Theory in Hypothesis Testing 395
   22.1 A decision-theoretic framework 395
   22.2 Bayes tests 397
      22.2.1 Admissibility of Bayes tests 397
      22.2.2 Level α Bayes tests 399
   22.3 Necessary conditions for admissibility 400
   22.4 Compact parameter spaces 402
   22.5 Convex acceptance regions 405
      22.5.1 Admissible tests 407
      22.5.2 Monotone acceptance regions 408
   22.6 Invariance 409
      22.6.1 Formal definition 410
      22.6.2 Reducing by invariance 411
   22.7 UMP invariant tests 412
      22.7.1 Multivariate normal mean 412
      22.7.2 Two-sided t test 413
      22.7.3 Linear regression 413
   22.8 Exercises 414

Bibliography 421

Author Index 427

Subject Index 431


Part I

Distribution Theory



Chapter 1

Distributions and Densities

1.1 Introduction

This chapter kicks off Part I, in which we present the basic probability concepts needed for studying and developing statistical procedures. We introduce probability distributions, transformations, and asymptotics. Part II covers the core ideas and methods of statistical inference, including frequentist and Bayesian approaches to estimation, testing, and model selection. It is the main focus of the book. Part III tackles the more esoteric part of mathematical statistics: decision theory. The main goal is to evaluate inference procedures, to determine which do a good job. Optimality, admissibility, and minimaxity are the main topics.

1.2 Probability

We quickly review the basic definition of a probability distribution. Starting with the very general, suppose X is a random object. It could be a single variable, a vector, a matrix, or something more complicated, e.g., a function, infinite sequence, or image. The space of X is X, the set of possible values X can take on. A probability distribution on X, or on X, is a function P that assigns a value in [0, 1] to subsets of X. For "any" subset A ⊂ X, P[A] is the probability X ∈ A. It can also be written P[X ∈ A]. (The quotes on "any" are to point out that technically, only subsets in a "sigma field" of subsets of X are allowed. We will gloss over that restriction, not because it is unimportant, but because for our purposes we do not get into too much trouble doing so.)

In order for P to be a probability distribution, it has to satisfy two axioms:

1. P[X ] = 1;

2. If A1, A2, . . . are disjoint (Ai ∩ Aj = ∅ for i ≠ j), then

P[∪_{i=1}^∞ Ai] = ∑_{i=1}^∞ P[Ai]. (1.1)

The second axiom is meant to cover finite unions as well as infinite ones. Using these axioms, along with the restriction that 0 ≤ P[A] ≤ 1, all the usual properties of probabilities can be derived. Some such follow.


Complement. The complement of a set A is A^c = X − A, that is, everything that is not in A (but in X). Clearly, A and A^c are disjoint, and their union is everything:

A ∩ A^c = ∅, A ∪ A^c = X, (1.2)

so

1 = P[X] = P[A ∪ A^c] = P[A] + P[A^c], (1.3)

which means

P[A^c] = 1 − P[A]. (1.4)

That is, the probability the object does not land in A is 1 minus the probability that it does land in A.

Empty set. P[∅] = 0, because the empty set is the complement of X, which has probability 1.

Union of two (nondisjoint) sets. If A and B are not disjoint, then it is not necessarily true that P[A ∪ B] = P[A] + P[B]. But A ∪ B can be separated into two disjoint sets: the set A and the part of B not in A, which is B ∩ A^c. Then

P[A ∪ B] = P[A] + P[B ∩ A^c]. (1.5)

Now B = (B ∩ A) ∪ (B ∩ A^c), and (B ∩ A) and (B ∩ A^c) are disjoint, so

P[B] = P[B ∩ A] + P[B ∩ A^c] ⇒ P[B ∩ A^c] = P[B] − P[A ∩ B]. (1.6)

Then stick that formula into (1.5), so that

P[A ∪ B] = P[A] + P[B] − P[A ∩ B]. (1.7)

The above definition doesn't help much in specifying a probability distribution. In principle, one would have to give the probability of every possible subset, but luckily there are simplifications.

We will deal primarily with random variables and finite collections of random variables. A random variable has space X ⊂ R, the real line. A collection of p random variables has space X ⊂ Rp, the p-dimensional Euclidean space. The elements are usually arranged in some convenient way, such as in a vector (row or column), matrix, multidimensional array, or triangular array. Mostly, we will have them arranged as a row vector X = (X1, . . . , Xn) or a column vector.

Some common ways to specify the probabilities of a collection of p random variables include

1. Distribution functions (Section 1.3);

2. Densities (Sections 1.4 and 1.5);

3. Moment generating functions (or characteristic functions) (Section 2.5);

4. Representations.

Distribution functions and characteristic functions always exist; moment generating functions do not. The densities we will deal with are only those with respect to Lebesgue measure or counting measure, or combinations of the two, which means that, for us, densities do not always exist. By "representation" we mean the random variables are expressed as a function of some other random variables. Section 1.6.2 contains a simple example.


1.3 Distribution functions

The distribution function for X is the function F : Rp → [0, 1] given by

F(x) = F(x1, . . . , xp) = P[X1 ≤ x1, . . . , Xp ≤ xp]. (1.8)

Note that F is defined on all of Rp, not just the space X. In principle, given F, one can figure out the probability of all subsets A ⊂ X (although no one would try), which means F uniquely identifies P, and vice versa. If F is a continuous function, then X is termed continuous. Generally, we will indicate random variables with capital letters, and the values they can take on with lowercase letters.

For a single random variable X, the distribution function is F(x) = P[X ≤ x] for x ∈ R. This function satisfies the following properties:

1. F(x) is nondecreasing in x;

2. lim_{x→−∞} F(x) = 0;

3. lim_{x→∞} F(x) = 1;

4. For any x, lim_{y↓x} F(y) = F(x).

The fourth property is that F is continuous from the right. It need not be continuous. For example, suppose X is the number of heads (i.e., 0 or 1) in one flip of a fair coin, so that X = {0, 1}, and P[X = 0] = P[X = 1] = 1/2. Then

F(x) = 0 if x < 0;  1/2 if 0 ≤ x < 1;  1 if x ≥ 1. (1.9)

Now F(1) = 1, and if y ↑ 1, say y = 1 − 1/m, then for m = 1, 2, 3, . . ., F(1 − 1/m) = 1/2, which approaches 1/2, not 1. On the other hand, if y ↓ 1, say y = 1 + 1/m, then F(1 + 1/m) = 1, which does approach 1.
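As a quick aside, the step function (1.9) is easy to play with in R, the language used later in Section 11.5.2; the helper name F_coin below is ours, not the book's:

    # Distribution function for the number of heads in one fair-coin flip
    F_coin <- function(x) ifelse(x < 0, 0, ifelse(x < 1, 1/2, 1))
    F_coin(c(-1, 0, 0.5, 1, 2))   # 0.0 0.5 0.5 1.0 1.0
    F_coin(1 - 1/(1:5))           # stays at 1/2 as y increases to 1 from below
    F_coin(1 + 1/(1:5))           # already 1 as y decreases to 1 from above

The last two lines illustrate the left and right limits at x = 1 just discussed.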

A jump in F at a point x means that P[X = x] > 0; in fact, the probability is the height of the jump. If F is continuous at x, then P[X = x] = 0. Figure 1.1 shows a distribution function with jumps at 1 and 6, which means that the probability X equals either of those points is positive, the probability being the height of the gaps (which are 1/4 in this plot). Otherwise, the function is continuous, hence no other single value has positive probability. Note also the flat part between 1 and 4, which means that P[1 < X ≤ 4] = 0.

Not only do all distribution functions for random variables satisfy those four properties, but any function F that satisfies those four is a legitimate distribution function.

Similar results hold for finite collections of random variables:

1. F(x1, . . . , xp) is nondecreasing in each xi, holding the others fixed;

2. lim_{xi→−∞} F(x1, x2, . . . , xp) = 0 for any of the xi's;

3. lim_{x→∞} F(x, x, . . . , x) = 1;

4. For any (x1, . . . , xp), lim_{y1↓x1, ..., yp↓xp} F(y1, . . . , yp) = F(x1, . . . , xp).


Figure 1.1: A distribution function.

1.4 PDFs: Probability density functions

A density with respect to Lebesgue measure on Rp, which we simplify to "pdf" for "probability density function," is a function f : X → [0, ∞) such that for any subset A ⊂ X,

P[A] = ∫···∫_A f(x1, x2, . . . , xp) dx1 dx2 . . . dxp. (1.10)

If X has a pdf, then it is continuous. In fact, its distribution function is differentiable, f being the derivative of F:

f(x1, . . . , xp) = (∂/∂x1) ··· (∂/∂xp) F(x1, . . . , xp). (1.11)

There are continuous distributions that do not have pdfs, as in Section 1.6.2. Any pdf has to satisfy the following two properties:

1. f (x1, . . . , xp) ≥ 0 for all (x1, . . . , xp) ∈ X ;

2. ∫···∫_X f(x1, x2, . . . , xp) dx1 dx2 . . . dxp = 1.

It is also true that any function f satisfying those two conditions is a pdf of a legitimate probability distribution. Table 1.1 contains some famous univariate (so that p = 1 and X ⊂ R) distributions with their pdfs. For later convenience, the means and variances (see Section 2.2) are included. The Γ in the table is the gamma function, defined by

Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx for α > 0. (1.12)
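For readers who want a numerical check of (1.12), base R's gamma() and integrate() functions can be used; this small sketch is ours and anticipates Exercises 1.7.10 and 1.7.13:

    gamma(5)                     # 24 = 4!, as in Exercise 1.7.10(c)
    gamma(1/2)^2                 # essentially pi (see Exercise 1.7.13(e))
    integrate(function(x) x^(3 - 1) * exp(-x), 0, Inf)$value   # approximately Gamma(3) = 2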

There are many more important univariate densities, such as the F and noncentral versions of the t, χ2, and F. The most famous multivariate distribution is the multivariate normal. We will look at that one in Chapter 7. The next section presents a simple bivariate distribution.


Name (parameters): space; pdf f(x); mean; variance

Normal N(µ, σ²) (µ ∈ R, σ² > 0): R; (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)}; µ; σ²
Uniform(a, b) (a < b): (a, b); 1/(b − a); (a + b)/2; (b − a)²/12
Exponential(λ) (λ > 0): (0, ∞); λ e^{−λx}; 1/λ; 1/λ²
Gamma(α, λ) (α > 0, λ > 0): (0, ∞); (λ^α/Γ(α)) e^{−λx} x^{α−1}; α/λ; α/λ²
Beta(α, β) (α > 0, β > 0): (0, 1); (Γ(α+β)/(Γ(α)Γ(β))) x^{α−1} (1 − x)^{β−1}; α/(α+β); αβ/((α+β)²(α+β+1))
Cauchy: R; (1/π) 1/(1 + x²); *; *
Laplace: R; (1/2) e^{−|x|}; 0; 2
Logistic: R; e^x/(1 + e^x)²; 0; π²/3
Chi-square χ²_ν (ν = 1, 2, . . .): (0, ∞); (1/(Γ(ν/2) 2^{ν/2})) x^{ν/2−1} e^{−x/2}; ν; 2ν
Student's t_ν (ν = 1, 2, . . .): R; (Γ((ν+1)/2)/(Γ(ν/2) √(νπ))) (1 + t²/ν)^{−(ν+1)/2}; 0 (if ν ≥ 2); ν/(ν − 2) (if ν ≥ 3)

* = Doesn't exist

Table 1.1: Some common probability density functions.
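Most of the pdfs in Table 1.1 are available in R under the usual d-prefix functions (the standard Laplace density is not in base R, so a one-line version is written by hand below). A brief sketch, ours rather than the book's, evaluating a few of them:

    dnorm(1, mean = 0, sd = 1)           # N(0, 1) pdf at x = 1
    dgamma(1, shape = 2, rate = 3)       # Gamma(alpha = 2, lambda = 3)
    dbeta(0.5, shape1 = 2, shape2 = 3)   # Beta(2, 3)
    dcauchy(1); dlogis(1); dchisq(1, df = 4); dt(1, df = 5)
    dlaplace <- function(x) exp(-abs(x)) / 2   # standard Laplace from the table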

1.4.1 A bivariate pdf

Suppose (X, Y) has space

W = {(x, y) | 0 < x < 1, 0 < y < 1} (1.13)

and pdf

f(x, y) = c(x + y). (1.14)

The constant c is whatever it needs to be so that the pdf integrates to 1, i.e.,

1 = c ∫_0^1 ∫_0^1 (x + y) dy dx = c ∫_0^1 (x + 1/2) dx = c(1/2 + 1/2) = c. (1.15)


So the pdf is simply f(x, y) = x + y. Some values of the distribution function are

F(0, 0) = 0;
F(1/2, 1/4) = ∫_0^{1/2} ∫_0^{1/4} (x + y) dy dx = ∫_0^{1/2} ((1/4)x + 1/32) dx = 1/32 + 1/64 = 3/64;
F(1/2, 2) = ∫_0^{1/2} ∫_0^1 (x + y) dy dx = ∫_0^{1/2} (x + 1/2) dx = 1/8 + 1/4 = 3/8;
F(2, 1) = 1. (1.16)

Other probabilities:

P[X + Y ≤ 1/2] = ∫_0^{1/2} ∫_0^{1/2−x} (x + y) dy dx
   = ∫_0^{1/2} (x(1/2 − x) + (1/2)(1/2 − x)²) dx
   = ∫_0^{1/2} (1/8 − x²/2) dx
   = 1/24, (1.17)

and for 0 < y < 1,

P[Y ≤ y] = ∫_0^1 ∫_0^y (x + w) dw dx = ∫_0^1 (xy + y²/2) dx = y/2 + y²/2 = (1/2) y(1 + y), (1.18)

which is the distribution function of Y, at least for 0 < y < 1. The pdf for Y is then found by differentiating:

f_Y(y) = F_Y′(y) = y + 1/2 for 0 < y < 1. (1.19)
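These calculations can be checked by simulation. In the following R sketch (ours, not the book's), points drawn uniformly on the unit square are weighted by the pdf x + y, so weighted averages of indicators approximate the probabilities above; set.seed just makes the run reproducible:

    set.seed(1)
    n <- 1e6
    x <- runif(n); y <- runif(n)       # uniform on the unit square
    w <- x + y                         # the pdf f(x, y) = x + y as a weight
    mean(w * (x <= 1/2 & y <= 1/4))    # about 3/64, i.e., F(1/2, 1/4)
    mean(w * (x + y <= 1/2))           # about 1/24, as in (1.17)
    mean(w * (y <= 0.3))               # about 0.3 * 1.3 / 2 = 0.195, as in (1.18)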

1.5 PMFs: Probability mass functions

A discrete random variable is one for which X is a countable (which includes finite) set. Its probability can be given by its probability mass function, which we will call "pmf," f : X → [0, 1] given by

P[{(x1, . . . , xp)}] = P[X = x] = f(x) = f(x1, . . . , xp), (1.20)

where x = (x1, . . . , xp). The pmf gives the probabilities of the individual points. (Measure-theoretically, the pmf is the density with respect to counting measure on X.) The probability of any subset A is the sum of the probabilities of the individual points in A. Table 1.2 contains some popular univariate discrete distributions.

The distribution function of a discrete random variable is a pure jump function, that is, it is flat except for jumps of height f(x) at x for each x ∈ X. See Figure 2.4 on page 33 for an example. The most famous multivariate discrete distribution is the multinomial, which we look at in Section 2.5.3.


Name (parameters): space; pmf f(x); mean; variance

Bernoulli(p) (0 < p < 1): {0, 1}; p^x (1 − p)^{1−x}; p; p(1 − p)
Binomial(n, p) (n = 1, 2, . . .; 0 < p < 1): {0, 1, . . . , n}; C(n, x) p^x (1 − p)^{n−x}; np; np(1 − p)
Poisson(λ) (λ > 0): {0, 1, 2, . . .}; e^{−λ} λ^x/x!; λ; λ
Discrete Uniform(a, b) (a, b integers, a < b): {a, a + 1, . . . , b}; 1/(b − a + 1); (a + b)/2; ((b − a + 1)² − 1)/12
Geometric(p) (0 < p < 1): {0, 1, 2, . . .}; p(1 − p)^x; (1 − p)/p; (1 − p)/p²
Negative Binomial(K, p) (K = 1, 2, . . .; 0 < p < 1): {0, 1, 2, . . .}; C(x + K − 1, K − 1) p^K (1 − p)^x; K(1 − p)/p; K(1 − p)/p²

Here C(n, x) denotes the binomial coefficient "n choose x."

Table 1.2: Some common probability mass functions.
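As with Table 1.1, these pmfs are built into R; the sketch below (ours) uses parameterizations matching the table, with the geometric and negative binomial counting failures before the required successes:

    dbinom(3, size = 10, prob = 0.4)     # Binomial(10, 0.4) at x = 3
    dpois(2, lambda = 1.5)               # Poisson(1.5) at x = 2
    dgeom(4, prob = 0.3)                 # Geometric(0.3): 0.3 * 0.7^4
    dnbinom(4, size = 2, prob = 0.3)     # Negative Binomial(K = 2, p = 0.3)
    sum(dbinom(0:10, 10, 0.4))           # a pmf sums to 1 over its space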

1.6 Distributions without pdfs or pmfs

Distributions need not be either discrete or have a density with respect to Lebesgue measure. We present here some simple examples.

1.6.1 Late start

Consider waiting for a train to leave. It will not leave early, but very well may leave late. There is a positive probability, say 10%, it will leave exactly on time. If it does not leave on time, there is a continuous distribution for how late it leaves. Thus it is not totally discrete, but not continuous, either, so it has neither a pdf nor pmf. It does have a distribution function, because everything does. A possible one is

F(x) = 0 if x < 0;  0.1 if x = 0;  1 − 0.9 exp(−x/100) if x > 0, (1.21)

where x is the number of minutes late. Figure 1.2 sketches this F.

Is this a legitimate distribution function? It is easy to see it is nondecreasing, once one notes that 1 − 0.9 exp(−x/100) > 0.1 if x > 0. The limits are ok as x → ±∞. It is also continuous from the right, where the only tricky spot is lim_{x↓0} F(x), which goes to 1 − 0.9 exp(0) = 0.1 = F(0), so it checks.

One can then find the probabilities of various late times, e.g., it has no chance of leaving early, 10% chance of leaving exactly on time, F(60) = 1 − 0.9 exp(−60/100) ≈ 0.506 chance of being at most one hour late, F(300) = 1 − 0.9 exp(−300/100) ≈ 0.955 chance of being at most five hours late, etc. (Sort of like Amtrak.)
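A small R sketch of this distribution function (the name F_late is ours) reproduces the numbers above:

    F_late <- function(x) ifelse(x < 0, 0, ifelse(x == 0, 0.1, 1 - 0.9 * exp(-x / 100)))
    F_late(c(-5, 0, 60, 300))    # 0.000 0.100 0.506 0.955 (approximately)
    F_late(60) - F_late(0)       # P[0 < X <= 60]: late, but by at most an hour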



Figure 1.2: The distribution function for a late start.

Figure 1.3: Illustration of the spinner: the pointer at angle θ ends at (x, y) = (cos(θ), sin(θ)).


Figure 1.4: The sketch of the distribution function in (1.22) for the spinner example.

1.6.2 Spinner

Imagine a spinner whose pointer is one unit in length. It is spun so that it is equally likely to be pointing in any direction. The random quantity is the (x, y) location of the end of the pointer, so that X = {(x, y) ∈ R2 | x2 + y2 = 1}, the circle with radius 1. The distribution of (X, Y) is not discrete because the point can land anywhere on the circle. On the other hand, it does not have a density with respect to Lebesgue measure on R2 because the integral over the circle is the volume above the circle under the pdf, that volume being 0.

But there is a distribution function. The F(x, y) is the arc length of the part(s) of the circle that has x-coordinate less than or equal to x and y-coordinate less than or equal to y, divided by total arc length (which is 2π):

F(x, y) = arc length({(u, v) | u2 + v2 = 1, u ≤ x, v ≤ y}) / (2π). (1.22)

Figure 1.4 has a sketch of F.

Fortunately, there is an easier way to describe the distribution. For any point (x, y) on the circle, one can find the angle with the x-axis of the line connecting (0, 0) and (x, y), θ = Angle(x, y), so that x = cos(θ) and y = sin(θ). See Figure 1.3. For uniqueness' sake, take θ ∈ [0, 2π). Then (x, y) being uniform on the circle implies that θ is uniform from 0 to 2π. Then the distribution of (X, Y) can be described via

(X, Y) = (cos(Θ), sin(Θ)), where Θ ∼ Uniform[0, 2π). (1.23)


Figure 1.5: The sketch of the space W in (1.24).

Such a description is called a representation, in that we are representing one set of random variables as a function of another set (which in this case is just the one Θ).
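The representation also makes the spinner easy to simulate, which is one practical payoff of describing a distribution this way. A minimal R sketch (ours):

    set.seed(1)
    theta <- runif(1e5, 0, 2 * pi)      # Theta ~ Uniform[0, 2*pi)
    x <- cos(theta); y <- sin(theta)
    range(x^2 + y^2)                    # all numerically equal to 1: the points sit on the unit circle
    mean(x <= 0 & y <= 0)               # about 1/4, the value F(0, 0) in (1.22)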

1.6.3 Mixed-type densities

Imagine now a two-stage process, where one first chooses a coin out of an infinite collection of coins, then flips the coin n = 10 times. The coins have different probabilities of heads x, so that over the population of coins, X ∼ Uniform(0, 1). Let Y be the number of heads among the 10 flips. Then what is random is the pair (X, Y). This vector is neither discrete nor continuous: X is continuous and Y is discrete. The space is a union of 11 (= n + 1) line segments,

W = {(x, 0) | 0 < x < 1} ∪ {(x, 1) | 0 < x < 1} ∪ · · · ∪ {(x, 10) | 0 < x < 1}. (1.24)

See Figure 1.5.

The density can still be given as f(x, y), but now the x part is continuous, and the y part is discrete. Then the probability of any set involves integrating over the x and summing over the y. E.g.,

P[X < 1/2 & Y ≥ 5] = ∫_0^{1/2} ∑_{y=5}^{10} f(x, y) dx, (1.25)

where

f(x, y) = C(10, y) x^y (1 − x)^{10−y}. (1.26)

This idea can be extended to any number of random variables, some discrete and some continuous.
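The probability (1.25) can also be evaluated numerically in R, integrating the binomial pmf (1.26) over x for each y and then summing; the helper name p_y in this sketch is ours:

    p_y <- function(y) integrate(function(x) dbinom(y, size = 10, prob = x), 0, 1/2)$value
    sum(sapply(5:10, p_y))    # P[X < 1/2 and Y >= 5], roughly 0.084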


1.7 Exercises

Exercise 1.7.1. Suppose the random variable X is always equal to the constant c. That is, the space is {c}, and P[X = c] = 1. (a) Let F(x) be the distribution function of X. What are the values of F(x) for (i) x < c, (ii) x = c, and (iii) x > c? (b) Now let f be the pmf of X. What are the values of f(x) for (i) x < c, (ii) x = c, and (iii) x > c?

Exercise 1.7.2. Suppose (X, Y) is a random vector with space {(1, 2), (2, 1)}, where P[(X, Y) = (1, 2)] = P[(X, Y) = (2, 1)] = 1/2. Fill in the table with the values of F(x, y):

y ↓ ; x →    0    1    2    3
        3
        2
        1
        0

Exercise 1.7.3. Suppose (X, Y) is a continuous two-dimensional random vector with space {(x, y) | 0 < x < 1, 0 < y < 1, x + y < 1}. (a) Which of the following is the best sketch of the space?

A        B        C        D

(b) The density is f(x, y) = c for (x, y) in the space. What is c? (c) Find the following values: (i) F(0.1, 0.2), (ii) F(0.8, 1), (iii) F(0.8, 1.5), (iv) F(0.7, 0.8).

Exercise 1.7.4. Continue with the distribution in Exercise 1.7.3, but focus on just X. (a) What is the space of X? (b) For x in that space, what is the distribution function FX(x)? (c) For x in that space, FX(x) = F(x, y) for what values of y in the range [0, 1]? (d) For x in that space, what is the pdf fX(x)?

Exercise 1.7.5. Now take (X, Y) with space {(x, y) | 0 < x < y < 1}. (a) Of the spaces depicted in Exercise 1.7.3, which is the best sketch of the space in this case? (b) Suppose (X, Y) has pdf f(x, y) = 2, for (x, y) in the space. Let W = Y/X. What is the space of W? (c) Find the distribution function of W, FW(w), for w in the space. [Hint: Note that FW(w) = P[W ≤ w] = P[Y ≤ wX]. The set in the probability is then a triangle, for which the area can be found.] (d) Find the pdf of W, fW(w), for w in the space.

Exercise 1.7.6. Suppose (X1, X2) is uniformly distributed over the unit square, that is, the space is {(x1, x2) | 0 < x1 < 1, 0 < x2 < 1}, and the pdf is f(x1, x2) = 1 for (x1, x2) in the space. Let Y = X1 + X2. (a) What is the space of Y? (b) Find the distribution function FY(y) of Y. [Hint: Draw the picture of the space of (X1, X2), and sketch the region for which x1 + x2 ≤ y, as in the figures:

(Figures: the unit square in the (x1, x2)-plane, with the region where x1 + x2 ≤ y indicated, for y = 0.8 and for y = 1.3.)

Then find the area of that region. Do it separately for y < 1 and y ≥ 1.] (c) Show that the pdf of Y is fY(y) = y if y ∈ (0, 1) and fY(y) = 2 − y if y ∈ [1, 2). Sketch the pdf. It has a tent distribution.

Exercise 1.7.7. Suppose X ∼ Uniform(0, 1), and let Y = |X − 1/4|. (a) What is the space of Y? (b) Find the distribution function of Y. [Specify it in pieces: y < 0, 0 < y < a, a < y < b, y > b. What are a and b?] (c) Find the pdf of Y.

Exercise 1.7.8. Set X = cos(Θ) and Y = sin(Θ), where Θ ∼ Uniform(0, 2π). (a) What is the space X of X? (b) For x ∈ X, find F(x) = P[X ≤ x]. [Hint: Figure out which θ's correspond to X ≤ x. The answer should have a cos^{−1} in it.] (c) Find the pdf of X. (d) Is the pdf of Y the same as that of X?

Exercise 1.7.9. Suppose U ∼ Uniform(0, 1), and (X, Y) = (U, 1 − U). Let F(x, y) be the distribution function of (X, Y). (a) Find and sketch the space of (X, Y). (b) For which values of (x, y) is F(x, y) = 1? (c) For which values of (x, y) is F(x, y) = 0? (d) Find F(3/4, 3/4), F(3/2, 3/4), and F(3/4, 7/8).

Exercise 1.7.10. (a) Use the definition of the gamma function in (1.12) to help show that

∫_0^∞ x^{α−1} e^{−λx} dx = λ^{−α} Γ(α) (1.27)

for α > 0 and λ > 0, thus justifying the constant in the gamma pdf in Table 1.1. (b) Use integration by parts to show that Γ(α + 1) = αΓ(α) for α > 0. (c) Show that Γ(1) = 1, hence with part (b), Γ(n) = (n − 1)! for positive integer n.

Exercise 1.7.11. The gamma distribution given in Table 1.1 has two parameters: α is the shape and λ is the rate. (Alternatively, the second parameter may be given by β = 1/λ, which is called the scale parameter.) (a) Sketch the pdfs for shape parameters α = .5, .8, 1, 2, and 5, with λ = 1. What do you notice? What is qualitatively different about the behavior of the pdfs near x = 0 depending on whether α < 1, α = 1, or α > 1? (b) Now fix α = 1 and sketch the pdfs for λ = .5, 1, and 5. What do you notice about the shapes? (c) Fix α = 5, and explore the pdfs for different rates.

Exercise 1.7.12. (a) The Exponential(λ) distribution is a special case of the gamma. What are the corresponding parameters? (b) The χ²_ν is a special case of the gamma. What are the corresponding parameters? (c) The Uniform(0,1) is a special case of the beta. What are the corresponding parameters? (d) The Cauchy is a special case of Student's t_ν. For which ν?

Exercise 1.7.13. Let Z ∼ N(0, 1), and let W = Z². What is the space of W? (a) Write down the distribution function of W as an integral over the pdf of Z. (b) Show that the pdf of W is

g(w) = (1/√(2πw)) e^{−w/2}. (1.28)

[Hint: Differentiate the distribution function from part (a). Recall that

(d/dw) ∫_{a(w)}^{b(w)} f(z) dz = f(b(w)) b′(w) − f(a(w)) a′(w).] (1.29)

(c) The distribution of W is χ²_ν (see Table 1.1) for which ν? (d) The distribution of W is a special case of a gamma. What are the parameters? (e) Show that by matching (1.28) with the gamma or chi-square density, we have that Γ(1/2) = √π.

Exercise 1.7.14. Now suppose Z ∼ N(µ, 1), and let W = Z², which is called noncentral chi-square on one degree of freedom. (Section 7.5.3 treats noncentral chi-squares more generally.) (a) What is the space of W? (b) Show that the pdf of W is

g_µ(w) = g(w) e^{−µ²/2} (e^{µ√w} + e^{−µ√w})/2, (1.30)

where g is the pdf in (1.28). Note that the last fraction is cosh(µ√w).

Exercise 1.7.15. The logistic distribution has space R and pdf f(x) = e^x (1 + e^x)^{−2}, as in Table 1.1. (a) Show that the pdf is symmetric about 0, i.e., f(x) = f(−x) for all x. (b) Show that the distribution function is F(x) = e^x/(1 + e^x). (c) Let U ∼ Uniform(0, 1), and think of U as the probability of some event. The odds of that event are U/(1 − U), and the log odds or logit is logit(U) = log(U/(1 − U)). Show that X = logit(U) ∼ Logistic, which may explain where the name came from. [Hint: Find FX(x) = P[log(U/(1 − U)) ≤ x], and show that it equals the distribution function in part (b).]


Chapter 2

Expected Values, Moments, and Quantiles

2.1 Definition of expected value

The distribution function F contains all there is to know about the distribution of a random vector, but it is often difficult to take in all at once. Quantities that summarize aspects of the distribution are often helpful, including moments (means and variances, e.g.) and quantiles, which are discussed in this chapter. Moments are special cases of expected values.

We start by defining expected value in the pdf and pmf cases. There are many X's that have neither a pmf nor a pdf, but even in those cases we can often find the expected value.

Definition 2.1. Expected value. Suppose X has pdf f, and g : X → R. If

∫···∫_X |g(x1, . . . , xp)| f(x1, . . . , xp) dx1 · · · dxp < ∞, (2.1)

then the expected value of g(X), E[g(X)], exists and

E[g(X)] = ∫···∫_X g(x1, . . . , xp) f(x1, . . . , xp) dx1 · · · dxp. (2.2)

If X has pmf f, and

∑···∑_{(x1,...,xp)∈X} |g(x1, . . . , xp)| f(x1, . . . , xp) < ∞, (2.3)

then the expected value of g(X), E[g(X)], exists and

E[g(X)] = ∑···∑_{(x1,...,xp)∈X} g(x1, . . . , xp) f(x1, . . . , xp). (2.4)

The requirement (2.1) or (2.3) that the absolute value of the function must have a finite integral/sum is there to eliminate ambiguous situations. For example, consider the Cauchy distribution with pdf f(x) = 1/(π(1 + x²)) and space R, and take g(x) = x, so we wish to find E[X]. Consider

∫_{−∞}^∞ |x| f(x) dx = ∫_{−∞}^∞ |x|/(π(1 + x²)) dx = (2/π) ∫_0^∞ x/(1 + x²) dx. (2.5)


For large |x|, the integrand is on the order of 1/|x|, which does not have a finite integral. More precisely, it is not hard to show that

x/(1 + x²) > 1/(2x) for x > 1. (2.6)

Thus

∫_{−∞}^∞ |x| f(x) dx > (1/π) ∫_1^∞ (1/x) dx = (1/π) log(x) |_1^∞ = (1/π) log(∞) = ∞. (2.7)

In this case we say that “the expected value of the Cauchy does not exist.” By the symmetry of the density, it would be natural to expect the expected value to be 0. But what we have is

E[X] = ∫_{−∞}^0 x f(x) dx + ∫_0^∞ x f(x) dx = −∞ + ∞ = Undefined. (2.8)

That is, we cannot do the integral, so the expected value is not defined.

One could allow +∞ and −∞ to be legitimate values of the expected value, e.g., say that E[X²] = +∞ for the Cauchy, as long as the value is unambiguous. We are not allowing that possibility formally, but informally will on occasion act as though we do.
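As a quick numerical illustration (a sketch, not from the text), the nonexistence of E[X] shows up in simulation: running averages of Cauchy draws never settle down, while those of, say, Uniform(0, 1) draws do. The sample sizes below are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
cauchy = rng.standard_cauchy(n)        # pdf 1/(pi(1+x^2)); mean undefined
unif = rng.uniform(0, 1, n)            # mean exists and equals 1/2

# Running (cumulative) averages
run_cauchy = np.cumsum(cauchy) / np.arange(1, n + 1)
run_unif = np.cumsum(unif) / np.arange(1, n + 1)

for m in (1_000, 10_000, 100_000):
    print(m, run_cauchy[m - 1], run_unif[m - 1])
# The uniform averages approach 0.5; the Cauchy averages keep jumping around.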

Expected values cohere in the proper way, that is, if Y is a random vector that is a function of X, say Y = h(X), then for a function g of Y,

E[g(Y)] = E[g(h(X))], (2.9)

if the latter exists. This property helps in finding the expected values when representations are used. For example, in the spinner case (1.23),

E[X] = E[cos(Θ)] = (1/(2π)) ∫_0^{2π} cos(θ) dθ = 0, (2.10)

where the first expected value has X as the random variable, for which we do not have a pdf, and the second expected value has Θ as the random variable, for which we do have a pdf (the Uniform(0, 2π)).
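The representation idea in (2.10) is easy to check numerically. A small sketch (not from the text) that estimates E[cos(Θ)] both by simulation and by a Riemann sum of cos(θ) against the Uniform(0, 2π) density; the grid and sample sizes are arbitrary:

import numpy as np

rng = np.random.default_rng(1)

# Monte Carlo: draw Theta ~ Uniform(0, 2*pi) and average cos(Theta)
theta = rng.uniform(0, 2 * np.pi, 1_000_000)
print(np.mean(np.cos(theta)))                    # close to 0

# Riemann sum of cos(t) * (1/(2*pi)) over a fine grid on (0, 2*pi)
grid = np.linspace(0, 2 * np.pi, 100_001)
dt = grid[1] - grid[0]
print(np.sum(np.cos(grid) * dt) / (2 * np.pi))   # essentially 0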

One important feature of expected values is their linearity, which follows by the linearity of integrals and sums:

Lemma 2.2. For any random variables X, Y, and constant c,

E[cX] = cE[X] and E[X + Y] = E[X] + E[Y], (2.11)

if the expected values exist.

The lemma can be used to show more involved linearities, e.g.,

E[aX + bY + cZ + d] = aE[X] + bE[Y] + cE[Z] + d (2.12)

(since E[d] = d for a constant d), and

E[g(X) + h(X)] = E[g(X)] + E[h(X)]. (2.13)


Warning. Be aware that for non-linear functions, the expected value of a function is NOT the function of the expected value, i.e.,

E[g(X)] ≠ g(E[X]) (2.14)

unless g(x) is linear, or you are lucky. For example,

E[X²] ≠ E[X]², (2.15)

unless X is a constant. (Which is fortunate, because otherwise all variances would be 0. See (2.20) below.)

2.1.1 Indicator functions

An indicator function is one that takes on only the values 0 and 1. It is usually given as IA(x) or I[x ∈ A], or simply I[A], for a subset A ⊂ X , where A contains the values for which the function is 1:

IA(x) = I[x ∈ A] = I[A] = { 1 if x ∈ A; 0 if x ∉ A }. (2.16)

These functions give alternative expressions for probabilities in terms of expected values, as in

E[IA(X)] = 1 × P[X ∈ A] + 0 × P[X ∉ A] = P[X ∈ A]. (2.17)

2.2 Means, variances, and covariances

Means, variances, and covariances are particular expected values. For a random variable, the mean is just its expected value:

The mean of X = E[X] (often denoted µ). (2.18)

(From now on, we will usually suppress the phrase “if it exists” when writing expected values, but think of it to yourself when reading “E.”) The variance is the expected value of the deviation from the mean, squared:

The variance of X = Var[X] = E[(X − E[X])²] (often denoted σ²). (2.19)

The standard deviation is the square root of the variance. It is often a nicer quantity because it is in the same units as X, and measures the “typical” size of the deviation of X from its mean.

A very useful formula for finding variances is

Var[X] = E[X2]− E[X]2, (2.20)

which can be seen, letting µ = E[X], as follows:

E[(X− µ)2] = E[X2 − 2Xµ + µ2] = E[X2]− 2E[X]µ + µ2 = E[X2]− µ2. (2.21)

With two random variables, (X, Y), say, there is in addition the covariance:

The covariance of X and Y = Cov[X, Y] = E[(X− E[X])(Y− E[Y])]. (2.22)


The covariance measures a type of relationship between X and Y. Notice that the expectand is positive when X and Y are both greater than or both less than their respective means, and negative when one is greater and one less. Thus if X and Y tend to go up or down together, the covariance will be positive, while if one tends to go up when the other goes down, the covariance will be negative. Note also that it is symmetric, Cov[X, Y] = Cov[Y, X], and Cov[X, X] = Var[X].

As for the variance in (2.20), we have the formula

Cov[X, Y] = E[XY]− E[X]E[Y]. (2.23)

The correlation coefficient is a normalization of the covariance, which is generally easier to interpret:

The correlation coefficient of X and Y = Corr[X, Y] = Cov[X, Y] / √(Var[X] Var[Y]) (2.24)

if Var[X] > 0 and Var[Y] > 0. This is a unitless quantity that measures the linear relationship of X and Y. It is bounded by −1 and +1. To verify this fact, we first need the following.

Lemma 2.3. Cauchy-Schwarz. For random variables (U, V),

E[UV]2 ≤ E[U2]E[V2], (2.25)

with equality if and only if

U = 0 or V = βU with probability 1, (2.26)

for β = E[UV]/E[U2].

Here, the phrase “with probability 1” means P[U = 0] = 1 or P[V = βU] = 1.

Proof. The lemma is easy to see if U is always 0, because then E[UV] = E[U²] = 0. Suppose it is not, so that E[U²] > 0. Consider

E[(V − bU)²] = E[V² − 2bUV + b²U²] = E[V²] − 2bE[UV] + b²E[U²]. (2.27)

Because the expectand on the left is nonnegative for any b, so is its expected value. In particular, it is nonnegative for the b that minimizes the expected value, which is easy to find:

(∂/∂b) E[(V − bU)²] = −2E[UV] + 2bE[U²], (2.28)

and setting that to 0 yields b = β where β = E[UV]/E[U²]. Then

E[V²] − 2βE[UV] + β²E[U²] = E[V²] − 2 (E[UV]/E[U²]) E[UV] + (E[UV]/E[U²])² E[U²]
 = E[V²] − E[UV]²/E[U²] ≥ 0, (2.29)

from which (2.25) follows.


There is equality in (2.25) if and only if there is equality in (2.29), which means that E[(V − βU)²] = 0. Because the expectand is nonnegative, its expected value can be 0 if and only if it is 0, i.e.,

(V − βU)² = 0 with probability 1. (2.30)

But that equation implies the second part of (2.26), proving the lemma.

For variables (X, Y), apply the lemma with U = X− E[X] and V = Y− E[Y]:

E[(X− E[X])(Y− E[Y])]2 ≤ E[(X− E[X])2]E[(Y− E[Y])2]

⇐⇒ Cov[X, Y]2 ≤ Var[X]Var[Y]. (2.31)

Thus from (2.24), if the variances are positive and finite,

−1 ≤ Corr[X, Y] ≤ 1. (2.32)

Furthermore, if there is an equality in (2.31), then either X is a constant, or

Y − E[Y] = β(X − E[X]) ⇔ Y = α + βX, (2.33)

where

β = Cov[X, Y]/Var[X] and α = E[Y] − βE[X]. (2.34)

In this case,

Corr[X, Y] = { 1 if β > 0; −1 if β < 0 }. (2.35)

Thus the correlation coefficient measures the linearity of the relationship between X and Y, +1 meaning perfectly positively linearly related, −1 meaning perfectly negatively linearly related.

2.2.1 Uniform on a triangle

Suppose (X, Y) has pdf f(x, y) = 2 for (x, y) ∈ W = {(x, y) | 0 < x < y < 1}, which is the upper-left triangle of the unit square, as in Figure 2.1.

One would expect the correlation to be positive, since larger y's tend to go with larger x's, but the correlation would not be +1, because the space is not contained in a straight line. To find the correlation, we need to perform some integrals:

E[X] = ∫_0^1 ∫_0^y x · 2 dx dy = ∫_0^1 y² dy = 1/3,    E[Y] = ∫_0^1 ∫_0^y y · 2 dx dy = 2 ∫_0^1 y² dy = 2/3,

E[X²] = ∫_0^1 ∫_0^y x² · 2 dx dy = ∫_0^1 (2y³/3) dy = 1/6,    E[Y²] = ∫_0^1 ∫_0^y y² · 2 dx dy = 2 ∫_0^1 y³ dy = 1/2,

and E[XY] = ∫_0^1 ∫_0^y xy · 2 dx dy = ∫_0^1 y³ dy = 1/4. (2.36)

Then

Var[X] = 1/6 − (1/3)² = 1/18,    Var[Y] = 1/2 − (2/3)² = 1/18,    Cov[X, Y] = 1/4 − (1/3)(2/3) = 1/36, (2.37)

and, finally,

Corr[X, Y] = (1/36)/√((1/18)(1/18)) = 1/2. (2.38)

Figure 2.1: The space W = {(x, y) | 0 < x < y < 1}.

This value does seem plausible: positive but not too close to 1.
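These calculations are easy to spot-check by simulation (a sketch, not from the text). One convenient way to draw from this density, used here as an assumption you can verify, is to take two independent Uniform(0, 1)'s and let X be the smaller and Y the larger; that pair has pdf 2 on {0 < x < y < 1}.

import numpy as np

rng = np.random.default_rng(2)
u = rng.uniform(0, 1, (1_000_000, 2))
x = u.min(axis=1)   # smaller of the two uniforms
y = u.max(axis=1)   # larger of the two uniforms; (x, y) has pdf 2 on 0 < x < y < 1

print(x.mean(), y.mean())            # about 1/3 and 2/3
print(x.var(), y.var())              # about 1/18 = 0.0556 each
print(np.cov(x, y)[0, 1])            # about 1/36 = 0.0278
print(np.corrcoef(x, y)[0, 1])       # about 0.5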

2.2.2 Variance of linear combinations & affine transformations

A linear combination of the variables, X1, . . . , Xp, is a function of the form

b1X1 + · · ·+ bpXp, (2.39)

for constants b1, . . . , bp. An affine transformation just adds a constant:

a + b1X1 + · · ·+ bpXp. (2.40)

Thus they are almost the same, and if you want to add the (constant) variable X0 ≡ 1, you can think of an affine transformation as a linear combination, as one does when setting up a linear regression model with intercept. Here we find formulas for the variance of an affine transformation.

Start with a + bX:

Var[a + bX] = E[(a + bX − E[a + bX])²]
 = E[(a + bX − a − bE[X])²]
 = E[b²(X − E[X])²]
 = b² E[(X − E[X])²]
 = b² Var[X]. (2.41)

The constant a goes away (it does not contribute to the variability), and the constant b is squared. For a linear combination of two variables, the variance involves the two variances, as well as the covariance:

Var[a + b1X1 + b2X2] = E[(a + b1X1 + b2X2 − E[a + b1X1 + b2X2])²]
 = E[(b1(X1 − E[X1]) + b2(X2 − E[X2]))²]
 = b1² E[(X1 − E[X1])²] + 2b1b2 E[(X1 − E[X1])(X2 − E[X2])] + b2² E[(X2 − E[X2])²]
 = b1² Var[X1] + b2² Var[X2] + 2b1b2 Cov[X1, X2]. (2.42)

With p variables, we have

Var[a + ∑_{i=1}^p bi Xi] = ∑_{i=1}^p bi² Var[Xi] + 2 ∑∑_{1≤i<j≤p} bi bj Cov[Xi, Xj]. (2.43)

Covariances between two linear combinations work similarly. That is,

Cov[a + ∑_{i=1}^p bi Xi, c + ∑_{i=1}^q di Yi] = ∑_{i=1}^p ∑_{j=1}^q bi dj Cov[Xi, Yj]. (2.44)

These formulas can be made simpler using matrix and vector notation, which we do in the next section.

2.3 Vectors and matrices

The mean of a vector or matrix of random variables is the corresponding vector or matrix of means. That is, if X is an n × 1 column vector, X = (X1, . . . , Xn)′ (the prime means transpose), then

E[X] = (E[X1], . . . , E[Xn])′. (2.45)

If X is a row vector, 1 × p, then E[X] = (E[X1], . . . , E[Xp]). More generally, if X is an n × p matrix, then so is its mean: E[X] is the n × p matrix whose (i, j)th element is E[Xij], i.e.,

E[X] = E [ X11    X12    · · ·  X1p
           X21    X22    · · ·  X2p
           ⋮      ⋮      ⋱      ⋮
           Xn1    Xn2    · · ·  Xnp ]
     =   [ E[X11] E[X12] · · ·  E[X1p]
           E[X21] E[X22] · · ·  E[X2p]
           ⋮      ⋮      ⋱      ⋮
           E[Xn1] E[Xn2] · · ·  E[Xnp] ].   (2.46)

The linearity in Lemma 2.2 holds for linear/affine transformations of vectors and matrices as well. If X is n × 1, then for fixed m × n matrix B and m × 1 vector a,

E[a + BX] = a + BE[X], (2.47)

and if X is n × p, for matrices A (m × q), B (m × n), and C (p × q),

E[A + BXC] = A + BE[X]C. (2.48)


These formulas can be proved by writing out the individual elements, and noting that each is a linear combination of the random variables.

A 1 × p vector X yields p variances, the Var[Xi]'s, but also the (p choose 2) covariances, the Cov[Xi, Xj]'s. These are usually conveniently arranged in a p × p matrix, the covariance matrix:

Σ = Cov[X] = [ Var[X1]      Cov[X1, X2]  · · ·  Cov[X1, Xp]
               Cov[X2, X1]  Var[X2]      · · ·  Cov[X2, Xp]
               ⋮            ⋮            ⋱      ⋮
               Cov[Xp, X1]  Cov[Xp, X2]  · · ·  Var[Xp]     ].   (2.49)

The same matrix will work for X and X′, that is, a column vector or a row vector. This matrix is symmetric, i.e., Σ′ = Σ. (The covariance matrix of a matrix X of random variables is typically defined by first changing the matrix X into a long vector, then defining the Cov[X] to be the covariance of that vector.) A compact way to define the covariance is

Cov[X] = { E[(X − E[X])(X − E[X])′]  if X is a column vector
           E[(X − E[X])′(X − E[X])]  if X is a row vector.      (2.50)

A convenient, and important to remember, formula for the covariance of an affine transformation follows.

Lemma 2.4. For fixed a and B, where X is a column vector,

Cov[a + BX] = BCov[X]B′. (2.51)

This equation is an example of a “sandwich” formula, with the B's as the bread. It is not hard to show that similarly, for X being a row vector,

Cov[a + XB′] = BCov[X]B′. (2.52)

Note that this lemma is a matrix version of (2.41).

Proof.

Cov[a + BX] = E[(a + BX − E[a + BX])(a + BX − E[a + BX])′]   by (2.50)
 = E[(B(X − E[X]))(B(X − E[X]))′]
 = E[B(X − E[X])(X − E[X])′B′]
 = B E[(X − E[X])(X − E[X])′] B′   by (2.48)
 = B Cov[X] B′   again by (2.50). (2.53)

This lemma leads to a simple formula for the variance of a + b1X1 + · · · + bpXp:

Var[a + b1X1 + · · · + bpXp] = b Cov[X] b′, (2.54)

because we can write a + b1X1 + · · · + bpXp = a + Xb′ for b = (b1, . . . , bp). Thus a = a and B = b in (2.52). Compare this formula to (2.43).
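As a numerical sanity check (a sketch, not from the text), the sandwich formula can be verified by simulating a random vector with a known covariance and comparing the sample covariance of a + BX with B Cov[X] B′. The particular a, B, and Σ below are arbitrary choices.

import numpy as np

rng = np.random.default_rng(3)

Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])          # Cov[X], chosen arbitrarily
B = np.array([[1.0, -1.0, 0.0],
              [2.0,  1.0, 3.0]])             # fixed 2 x 3 matrix
a = np.array([5.0, -2.0])                    # fixed 2 x 1 vector

# Draw many X's (rows of the array) with covariance Sigma
X = rng.multivariate_normal(mean=np.zeros(3), cov=Sigma, size=500_000)
Y = a + X @ B.T                              # Y = a + BX for each draw

print(np.cov(Y, rowvar=False))               # sample Cov[a + BX]
print(B @ Sigma @ B.T)                       # B Cov[X] B', the sandwich formula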


Figure 2.2: Some pdfs illustrating skewness: Beta(5, 1.5) with κ3 = −0.8235, Normal with κ3 = 0, and Gamma(3, 1) with κ3 = 3.4641.

2.4 Moments

The mean, variance, and covariance are special cases of what are called moments. Moments of a random variable provide summaries of its distribution. The kth raw moment of a random variable is the expected value of its kth power, where k = 1, 2, . . .. The kth central moment is the expected value of the kth power of its deviation from the mean µ, at least for k > 1:

kth raw moment = µ′k = E[X^k], k = 1, 2, . . . ;
kth central moment = µk = E[(X − µ)^k], k = 2, 3, . . . . (2.55)

Thus µ′1 = µ = E[X], µ′2 = E[X²], and µ2 = σ² = Var[X] = µ′2 − µ². It is not hard, but a bit tedious, to figure out the kth central moment from the first k raw moments, and vice versa. It is not uncommon for given moments not to exist. In particular, if the kth moment does not exist, then neither does any higher moment.

The first two moments measure the center and spread of the distribution. The third central moment is generally a measure of skewness, where symmetric distributions have 0 skewness, a heavier tail to the right than to the left would have a positive skewness, and a heavier tail to the left would have a negative skewness. Usually it is normalized so that it is not dependent on the variance:

Skewness = κ3 = µ3/σ³. (2.56)

See Figure 2.2, where the plots show negative, zero, and positive skewness, respectively.

The fourth central moment is a measure of kurtosis. It, too, is normalized:

Kurtosis = κ4 = µ4/σ⁴ − 3. (2.57)

The normal distribution has µ4/σ⁴ = 3, so the subtracted “3” in (2.57) means the kurtosis of a normal is 0. It is not particularly easy to figure out what kurtosis means in general, but for nice unimodal densities, it measures “boxiness.” A negative kurtosis indicates a density more boxy than the normal, such as the uniform. A positive kurtosis indicates a pointy middle and heavy tails, such as the Laplace. Figure 2.3 compares some symmetric pdfs, going from boxy to normal to pointy.

Figure 2.3: Some symmetric pdfs illustrating kurtosis: Beta(1.2, 1.2) with κ4 = −1.1111, Normal with κ4 = 0, and Laplace with κ4 = 3.

The first several moments of a random variable do not characterize it. That is, two different distributions could have the same first, second, and third moments. Even if they agree on all moments, and all moments are finite, the two distributions might not be the same, though that's rare. See Exercise 2.7.20. The next section (Section 2.5) presents the moment generating function, which does determine the distribution under conditions.

Multivariate distributions have the regular moments for the individual component random variables, but also have mixed moments. For a p-variate random variable (X1, . . . , Xp), mixed moments are expected values of products of powers of the Xi's. So for k = (k1, . . . , kp), the kth raw mixed moment is E[∏ Xi^{ki}], and the kth central moment is E[∏ (Xi − µi)^{ki}], assuming these expected values exist. Thus for two variables, the (1, 1)th central moment is the covariance.
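To make the skewness and kurtosis summaries concrete, here is a small sketch (not from the text) that estimates κ3 and κ4 from large samples: the uniform shows negative kurtosis (boxy), the Laplace positive kurtosis (pointy), and the exponential positive skewness. Sample sizes and the use of plain sample moments are my choices.

import numpy as np

def skew_kurt(x):
    """Sample versions of kappa3 = mu3/sigma^3 and kappa4 = mu4/sigma^4 - 3."""
    d = x - x.mean()
    s2 = np.mean(d**2)
    return np.mean(d**3) / s2**1.5, np.mean(d**4) / s2**2 - 3

rng = np.random.default_rng(4)
n = 1_000_000
print(skew_kurt(rng.normal(size=n)))        # about (0, 0)
print(skew_kurt(rng.uniform(size=n)))       # about (0, -1.2): boxy
print(skew_kurt(rng.laplace(size=n)))       # about (0, 3): pointy, heavy tails
print(skew_kurt(rng.exponential(size=n)))   # about (2, 6): skewed right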

2.5 Moment and cumulant generating functions

The moment generating function (mgf for short) is a meta-moment in a way, since it can be used to find all the moments of X. If X is p × 1, it is a function from R^p → [0, ∞] given by

MX(t) = E[e^{t1X1 + ··· + tpXp}] = E[e^{t·X}] (2.58)

for t = (t1, . . . , tp). (For p-dimensional vectors a and b, a · b = a1b1 + · · · + apbp is called their dot product. Its definition does not depend on the type of vectors, row or column, just that they have the same number of elements.) The mgf does not always exist, that is, often the integral or sum defining the expected value diverges. An infinite mgf for some values of t is ok, as long as it is finite for t in a neighborhood of 0p, in which case the mgf uniquely determines the distribution of X.

Theorem 2.5. Uniqueness of mgf. If for some ε > 0,

MX(t) < ∞ and MX(t) = MY(t) for all t such that ‖t‖ ≤ ε, (2.59)

then X and Y have the same distribution.


If one knows complex variables, the characteristic function is superior because it always exists. It is defined as φX(t) = E[exp(i t · X)], and also uniquely defines the distribution. In fact, most proofs of Theorem 2.5 first show the uniqueness of characteristic functions, then argue that the conditions of the theorem guarantee that the mgf M(t) can be extended to an analytic function of complex t, which for imaginary t yields the characteristic function. Billingsley (1995) is a good reference for the proofs of the uniquenesses of mgfs (his Section 30) and characteristic functions (his Theorem 26.2).

The uniqueness in Theorem 2.5 is the most useful property of mgfs, but they can also be handy for generating (mixed) moments.

Lemma 2.6. Suppose X has mgf such that for some ε > 0,

MX(t) < ∞ for all t such that ‖t‖ ≤ ε. (2.60)

Then for any nonnegative integers k1, . . . , kp,

E[X1^{k1} X2^{k2} · · · Xp^{kp}] = ∂^{k1+···+kp}/(∂t1^{k1} · · · ∂tp^{kp}) MX(t) |_{t=0p}, (2.61)

which is finite.

Notice that this lemma implies that all mixed moments are finite under the condition (2.60). The basic idea is straightforward. Assuming the derivatives and expectation can be interchanged,

∂^{k1+···+kp}/(∂t1^{k1} · · · ∂tp^{kp}) E[e^{t·X}] |_{t=0p} = E[ ∂^{k1+···+kp}/(∂t1^{k1} · · · ∂tp^{kp}) e^{t·X} |_{t=0p} ] = E[X1^{k1} X2^{k2} · · · Xp^{kp}]. (2.62)

But justifying that interchange requires some careful analysis. If interested, Section 2.5.4 provides the details when p = 1.

Specializing to a random variable X, the mgf is

MX(t) = E[e^{tX}]. (2.63)

If it exists for t in a neighborhood of 0, then all moments of X exist, and

(∂^k/∂t^k) MX(t) |_{t=0} = E[X^k]. (2.64)

The cumulant generating function is the log of the moment generating function,

cX(t) = log(MX(t)). (2.65)

It generates the cumulants, which are defined by what the cumulant generating function generates, i.e., for a random variable, the kth cumulant is

γk = (∂^k/∂t^k) cX(t) |_{t=0}. (2.66)


Mixed cumulants for multivariate X are found by taking mixed partial derivatives, analogous to (2.61).

Cumulants are often easier to work with than moments. The first four are

γ1 = E[X] = µ′1 = µ,
γ2 = Var[X] = µ2 = σ²,
γ3 = E[(X − E[X])³] = µ3, and
γ4 = E[(X − E[X])⁴] − 3 Var[X]² = µ4 − 3µ2² = µ4 − 3σ⁴. (2.67)

The skewness (2.56) and kurtosis (2.57) are then simple functions of the cumulants:

Skewness[X] = κ3 = γ3/σ³ and Kurtosis[X] = κ4 = γ4/σ⁴. (2.68)

2.5.1 Normal distribution

A Z ∼ N(0, 1) is called a standard normal. Its mgf is

MZ(t) = E[e^{tZ}] = (1/√(2π)) ∫_{−∞}^∞ e^{tz} e^{−z²/2} dz = (1/√(2π)) ∫_{−∞}^∞ e^{−(z² − 2tz)/2} dz. (2.69)

In the exponent, complete the square with respect to the z: z² − 2tz = (z − t)² − t². Then

MZ(t) = e^{t²/2} ∫_{−∞}^∞ (1/√(2π)) e^{−(z−t)²/2} dz = e^{t²/2}. (2.70)

The second equality holds because the integrand in the middle expression is the pdf of a N(t, 1), which means the integral is 1.

The cumulant generating function is then a simple quadratic:

cZ(t) = t²/2, (2.71)

and it is easy to see that

c′Z(0) = 0, c″Z(0) = 1, c‴Z(t) = 0. (2.72)

Thus the mean is 0 and variance is 1 (not surprisingly), and all other cumulants are 0. In particular, the skewness and kurtosis are both 0.

It is a little messier, but the same technique shows that if X ∼ N(µ, σ²),

MX(t) = e^{µt + σ²t²/2}. (2.73)
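A quick check of (2.73) by simulation (a sketch, not from the text): estimate E[e^{tX}] by averaging over draws of X ∼ N(µ, σ²) and compare with exp(µt + σ²t²/2). The values of µ, σ, and t are arbitrary.

import numpy as np

rng = np.random.default_rng(5)
mu, sigma = 1.5, 0.8                       # arbitrary parameter choices
x = rng.normal(mu, sigma, 2_000_000)

for t in (-0.5, 0.25, 1.0):
    mc = np.mean(np.exp(t * x))            # Monte Carlo estimate of E[exp(tX)]
    exact = np.exp(mu * t + sigma**2 * t**2 / 2)   # formula (2.73)
    print(t, mc, exact)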

2.5.2 Gamma distribution

The gamma distribution has two parameters: α > 0 is the shape parameter, and λ > 0 is the rate parameter. Its space is X = (0, ∞), and as in Table 1.1 on page 7 its pdf is

f(x | α, λ) = (λ^α/Γ(α)) x^{α−1} e^{−λx}, x ∈ (0, ∞). (2.74)


If α = 1, then this distribution is the Exponential(λ) in Table 1.1. The mgf is

MX(t) = E[e^{tX}] = (λ^α/Γ(α)) ∫_0^∞ e^{tx} x^{α−1} e^{−λx} dx = (λ^α/Γ(α)) ∫_0^∞ x^{α−1} e^{−(λ−t)x} dx. (2.75)

That integral needs λ − t > 0 to be finite, so we need t < λ, which means the mgf is finite for a neighborhood of zero, since λ > 0. Now the integral at the end of (2.75) looks like the gamma density but with λ − t in place of λ. Thus that integral equals the inverse of the constant in the Gamma(α, λ − t), so that

E[e^{tX}] = (λ^α/Γ(α)) · Γ(α)/(λ − t)^α = (λ/(λ − t))^α, t < λ. (2.76)

We will use the cumulant generating function cX(t) = log(MX(t)) to obtain the mean and variance, because it is slightly easier. Thus

c′X(t) = (∂/∂t) α(log(λ) − log(λ − t)) = α/(λ − t) ⇒ E[X] = c′X(0) = α/λ, (2.77)

and

c″X(t) = (∂²/∂t²) α(log(λ) − log(λ − t)) = α/(λ − t)² ⇒ Var[X] = c″X(0) = α/λ². (2.78)

In general, the kth cumulant (2.66) is

γk = (k − 1)! α/λ^k, (2.79)

and in particular

Skewness[X] = (2α/λ³)/(α^{3/2}/λ³) = 2/√α and Kurtosis[X] = (6α/λ⁴)/(α²/λ⁴) = 6/α. (2.80)

Thus the skewness and kurtosis depend on just the shape parameter α. Also, they are positive, but tend to 0 as α increases.
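These cumulant formulas are easy to confirm by simulation (a sketch, not from the text), reusing the sample skewness/kurtosis idea from Section 2.4; the shape values below are arbitrary, with rate λ = 1.

import numpy as np

rng = np.random.default_rng(6)
n = 2_000_000

for alpha in (0.5, 2.0, 10.0):              # arbitrary shape parameters
    x = rng.gamma(shape=alpha, scale=1.0, size=n)
    d = x - x.mean()
    s2 = np.mean(d**2)
    skew = np.mean(d**3) / s2**1.5
    kurt = np.mean(d**4) / s2**2 - 3
    print(alpha, skew, 2 / np.sqrt(alpha), kurt, 6 / alpha)
# Sample skewness tracks 2/sqrt(alpha) and sample kurtosis tracks 6/alpha.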

2.5.3 Binomial and multinomial distributions

A Bernoulli trial is an event that has just two possible outcomes, often called “success” and “failure.” For example, flipping a coin once is a trial, and one might declare that heads is a success. In many medical studies, a single person's outcome is often a success or failure. Such a random variable Z has space {0, 1}, where 1 denotes success and 0 failure. The distribution is completely specified by the probability of a success, denoted p: p = P[Z = 1].

The binomial is a model for counting the number of successes in n trials, e.g., the number of heads in ten flips of a coin, where the trials are independent (formally defined in Section 3.3) and have the same probability p of success. As in Table 1.2 on page 9,

X ∼ Binomial(n, p) ⇒ fX(x) = (n choose x) p^x (1 − p)^{n−x}, x ∈ X = {0, 1, . . . , n}. (2.81)

The fact that this pmf sums to 1 relies on the binomial theorem:

(a + b)^n = ∑_{x=0}^n (n choose x) a^x b^{n−x}, (2.82)

with a = p and b = 1 − p. This theorem also helps in finding the mgf:

MX(t) = E[e^{tX}] = ∑_{x=0}^n e^{tx} fX(x)
 = ∑_{x=0}^n e^{tx} (n choose x) p^x (1 − p)^{n−x}
 = ∑_{x=0}^n (n choose x) (pe^t)^x (1 − p)^{n−x}
 = (pe^t + 1 − p)^n. (2.83)

It is finite for all t ∈ R, as is the case for any bounded random variable.

Now cX(t) = log(MX(t)) = n log(pe^t + 1 − p) is the cumulant generating function. The first two cumulants are

E[X] = c′X(0) = n pe^t/(pe^t + 1 − p) |_{t=0} = np, (2.84)

and

Var[X] = c″X(0) = n ( pe^t/(pe^t + 1 − p) − (pe^t)²/(pe^t + 1 − p)² ) |_{t=0} = n(p − p²) = np(1 − p). (2.85)
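The “cumulants come from derivatives of cX at 0” idea can be illustrated numerically (a sketch, not from the text) by finite-differencing cX(t) = n log(pe^t + 1 − p) at t = 0; n, p, and the step size h are arbitrary choices.

import numpy as np

n, p = 10, 0.3                          # arbitrary binomial parameters

def c(t):
    # cumulant generating function of Binomial(n, p)
    return n * np.log(p * np.exp(t) + 1 - p)

h = 1e-4
first = (c(h) - c(-h)) / (2 * h)                 # central difference for c'(0)
second = (c(h) - 2 * c(0.0) + c(-h)) / h**2      # central difference for c''(0)

print(first, n * p)                     # both about np = 3
print(second, n * p * (1 - p))          # both about np(1-p) = 2.1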

(In Section 4.3.5 we will exhibit an easier approach.)

The multinomial distribution also models the results of n trials, but here there are K possible categories for each trial. E.g., one may roll a die n times, and see whether it is a one, two, . . ., or six (so K = 6); or one may randomly choose n people, each of whom is then classified as short, medium, or tall (so K = 3). As for the binomial, the trials are assumed independent, and the probability of an individual trial coming up in category k is pk, so that p1 + · · · + pK = 1. The random vector is X = (X1, . . . , XK), where Xk is the number of observations from category k. Letting p = (p1, . . . , pK), we have

X ∼ Multinomial(n, p) ⇒ fX(x) = (n choose x) p1^{x1} · · · pK^{xK}, x ∈ X , (2.86)

where the space consists of all possible ways K nonnegative integers can sum to n:

X = {x ∈ R^K | xk ∈ {0, . . . , n} for each k, and x1 + · · · + xK = n}, (2.87)

and for x ∈ X ,

(n choose x) = n!/(x1! · · · xK!). (2.88)

This pmf is related to the multinomial theorem:

(a1 + · · · + aK)^n = ∑_{x∈X} (n choose x) a1^{x1} · · · aK^{xK}. (2.89)

Note that the binomial is a special case of the multinomial with K = 2:

X ∼ Binomial(n, p) ⇒ (X, n − X) ∼ Multinomial(n, (p, 1 − p)). (2.90)

Now for the mgf. It is a function of t = (t1, . . . , tK):

MX(t) = E[e^{t·X}] = ∑_{x∈X} e^{t·x} fX(x)
 = ∑_{x∈X} e^{t1x1} · · · e^{tKxK} (n choose x) p1^{x1} · · · pK^{xK}
 = ∑_{x∈X} (n choose x) (p1e^{t1})^{x1} · · · (pKe^{tK})^{xK}
 = (p1e^{t1} + · · · + pKe^{tK})^n < ∞ for all t ∈ R^K. (2.91)

The mean and variance of each Xk can be found much as for the binomial. We find that

E[Xk] = npk and Var[Xk] = npk(1 − pk). (2.92)

(In fact, these results are not surprising since the individual Xk are binomial.) For the covariance between X1 and X2, we first find

E[X1X2] = ∂²/(∂t1∂t2) MX(t) |_{t=0K} = n(n − 1)(p1e^{t1} + · · · + pKe^{tK})^{n−2} p1e^{t1} p2e^{t2} |_{t=0K} = n(n − 1)p1p2. (2.93)

(The cumulant generating function works as well.) Thus

Cov[X1, X2] = n(n − 1)p1p2 − (np1)(np2) = −np1p2. (2.94)

Similarly, Cov[Xk, Xl] = −npkpl if k ≠ l. It does make sense for the covariance to be negative, since the more there are in category 1, the fewer are available for category 2.
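A short simulation check of (2.92) and (2.94) (a sketch, not from the text); n and p below are arbitrary choices.

import numpy as np

rng = np.random.default_rng(7)
n = 20
p = np.array([0.5, 0.3, 0.2])              # arbitrary K = 3 category probabilities

X = rng.multinomial(n, p, size=500_000)    # each row is one Multinomial(n, p) draw

print(X.mean(axis=0), n * p)               # means vs n*p_k
print(X.var(axis=0), n * p * (1 - p))      # variances vs n*p_k*(1-p_k)
print(np.cov(X[:, 0], X[:, 1])[0, 1], -n * p[0] * p[1])   # covariance vs -n*p1*p2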


2.5.4 Proof of the moment generating lemma

Here we prove Lemma 2.6 when p = 1. The main mathematical challenge is proving that we can interchange derivatives and expected values. We will use the dominated convergence theorem from real analysis and measure theory. See, e.g., Theorem 16.4 in Billingsley (1995). Suppose gn(x), n = 0, 1, 2, . . ., and g(x) are functions such that lim_{n→∞} gn(x) = g(x) for each x. The theorem states that if there is a function h(x) such that |gn(x)| ≤ h(x) for all n, and E[h(X)] < ∞, then

lim_{n→∞} E[gn(X)] = E[g(X)]. (2.95)

The assumption in Lemma 2.6 is that for some ε > 0, the random variable X has M(t) < ∞ for |t| ≤ ε. We show that for |t| < ε,

M^{(k)}(t) ≡ (∂^k/∂t^k) M(t) = E[X^k e^{tX}], and E[|X|^k e^{tX}] < ∞, k = 0, 1, 2, . . . . (2.96)

The lemma follows by setting t = 0. Exercise 2.7.22(a) in fact proves a somewhat stronger result than the above inequality:

E[|X|^k e^{|sX|}] < ∞, |s| < ε. (2.97)

The k = 0th derivative is just the function itself, so that (2.96) for k = 0 is M(t) = E[exp(tX)] < ∞, which is what we have assumed. Now assume (2.96) holds for k = 0, . . . , m, and consider k = m + 1. Since |t| < ε, we can take ε′ = (ε − |t|)/2 > 0, so that |t| + ε′ < ε. Then by (2.96),

(M^{(m)}(t + δ) − M^{(m)}(t))/δ = E[(X^m e^{(t+δ)X} − X^m e^{tX})/δ] = E[X^m e^{tX} (e^{δX} − 1)/δ] for 0 < |δ| ≤ ε′. (2.98)

Here we apply the dominated convergence theorem to the term in the last expectation, where gn(x) = x^m exp(tx)(exp(δn x) − 1)/δn, with δn = ε′/n → 0. Exercise 2.7.22(b) helps to show that

|gn(x)| ≤ |x|^m e^{|tx|} (e^{ε′|x|} − 1)/ε′ ≡ h(x). (2.99)

Now (2.97) applied with k = m, and s = |t| and s = |t| + ε′, shows that E[h(X)] < ∞. Hence the dominated convergence theorem implies that (2.95) holds, meaning we can take δ → 0 on both sides of (2.98). The left-hand side is the (m + 1)st derivative of M, and in the expected value (exp(δx) − 1)/δ → x. That is,

M^{(m+1)}(t) = E[X^{m+1} e^{tX}]. (2.100)

Then induction, along with (2.97), proves (2.96).

The proof for general p runs along the same lines. The induction step is performed on multiple indices, one for each ki in the mixed moment in (2.61).


Figure 2.4: The distribution function for a Binomial(2, 1/2). The dotted line is where F(x) = 0.15.

2.6 Quantiles

A positional measure for a random variable is one that gives the value that is in a certain relation to the rest of the values. For example, the 0.25th quantile is the value such that the random variable is below the value 25% of the time, and above it 75% of the time. The median is the (1/2)th quantile. Ideally, for q ∈ [0, 1], the qth quantile is the value ηq such that F(ηq) = q, where F is the distribution function. That is, ηq = F^{−1}(q). Unfortunately, F does not have an inverse for all q unless it is strictly increasing, which leaves out all discrete random variables. Even in the continuous case, the inverse might not be unique, e.g., there may be a flat spot in F. For example, consider the pdf f(x) = 1/2 for x ∈ (0, 1) ∪ (2, 3). Then any number x between 1 and 2 has F(x) = 1/2, so that there is no unique median. Thus the definition is a bit more involved.

Definition 2.7. For q ∈ (0, 1), a qth quantile of the random variable X is any value ηq such that

P[X ≤ ηq] ≥ q and P[X ≥ ηq] ≥ 1 − q. (2.101)

With this definition, there is at least one quantile for each q for any distribution, but there is no guarantee of uniqueness without some additional assumptions.

As mentioned above, if the distribution function is strictly increasing in x for all x ∈ X , where the space X is a (possibly infinite) interval, then ηq = F^{−1}(q) uniquely. For example, if X is Exponential(1), then F(x) = 1 − e^{−x} for x > 0, so that ηq = −log(1 − q) for q ∈ (0, 1).

By contrast, consider X ∼ Binomial(2, 1/2), whose distribution function is given in Figure 2.4. At x = 0, P[X ≤ 0] = 0.25 and P[X ≥ 0] = 1. Thus 0 is a quantile for any q ∈ (0, 0.25]. The horizontal dotted line in the graph is where F(x) = 0.15. It never hits the distribution function, but it passes through the gap at x = 0, hence its quantile is 0. But q = 0.25 = F(x) hits an entire interval of points between 0 and 1. Thus any of those values is its quantile, i.e., η0.25. The complete set of quantiles for q ∈ (0, 1) is

ηq = { 0       if q ∈ (0, 0.25)
       [0, 1]  if q = 0.25
       1       if q ∈ (0.25, 0.75)
       [1, 2]  if q = 0.75
       2       if q ∈ (0.75, 1) }.   (2.102)
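Definition 2.7 can be checked mechanically for this example (a sketch, not from the text): for a grid of candidate values η and a given q, test whether P[X ≤ η] ≥ q and P[X ≥ η] ≥ 1 − q under the Binomial(2, 1/2) pmf (1/4, 1/2, 1/4) on {0, 1, 2}. The grid spacing is an arbitrary choice.

import numpy as np

support = np.array([0, 1, 2])
pmf = np.array([0.25, 0.5, 0.25])          # Binomial(2, 1/2)

def is_quantile(eta, q):
    # Definition 2.7: P[X <= eta] >= q and P[X >= eta] >= 1 - q
    left = pmf[support <= eta].sum()
    right = pmf[support >= eta].sum()
    return left >= q and right >= 1 - q

candidates = [i / 100 for i in range(201)]  # grid 0.00, 0.01, ..., 2.00
for q in (0.15, 0.25, 0.5, 0.75, 0.9):
    good = [eta for eta in candidates if is_quantile(eta, q)]
    print(q, min(good), max(good))
# q = 0.15 gives only 0; q = 0.25 gives the whole interval [0, 1]; q = 0.5 gives 1; etc.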

2.7 Exercises

Exercise 2.7.1. (a) Let X ∼ Beta(α, β). Find E[X(1 − X)]. (Give the answer in terms of a rational polynomial in α, β.) (b) Find E[X^a (1 − X)^b] for nonnegative integers a and b.

Exercise 2.7.2. The Geometric(p) distribution is a discrete distribution with space being the nonnegative integers. It has pmf f(x) = p(1 − p)^x, for parameter p ∈ (0, 1). If one is flipping a coin with p = P[Heads], then X is the number of tails before the first head, assuming independent flips. (a) Find the moment generating function, M(t), of X. For what t is it finite? (b) Find E[X] and Var[X].

Exercise 2.7.3. Prove (2.23), i.e., that Cov[X, Y] = E[XY] − E[X]E[Y] if the expected values exist.

Exercise 2.7.4. Suppose Y1, . . . , Yn are uncorrelated random variables with the same mean µ and same variance σ². Let Y = (Y1, . . . , Yn)′. (a) Write down E[Y] and Cov[Y]. (b) For an n × 1 vector a, show that a′Y has mean µ ∑ ai and variance σ²‖a‖², where “‖a‖” is the norm of the vector a:

‖a‖ = √(a1² + · · · + an²). (2.103)

Exercise 2.7.5. Suppose X is a 3 × 1 vector with covariance matrix σ²I3, and let X̄ denote the average of its elements. (a) Find the matrix A so that AX is the vector of deviations, i.e.,

D = AX = (X1 − X̄, X2 − X̄, X3 − X̄)′. (2.104)

(b) Find the B for which Cov[D] = σ²B. How does it compare to A? (c) What is the correlation between two elements of D? (d) Let c be the 1 × 3 vector such that cX = X̄. What is c? Find cc′ = ‖c‖², Var[cX], and cA. (e) Find

Cov[ [c; A] X ], (2.105)

where [c; A] is the 4 × 3 matrix with c as its first row and the rows of A below. From that matrix (it should be 4 × 4), read off the covariance of X̄ with the deviations.


Exercise 2.7.6. Here, Y is an n × 1 vector with Cov[Y] = σ²In. Also, E[Yi] = βxi, i = 1, . . . , n, where x = (x1, . . . , xn)′ is a fixed set of constants (not all zero), and β is a parameter. This model is simple linear regression with an intercept of zero. Let U = x′Y. (a) Find E[U] and Var[U]. (b) Find the constant c so that E[U/c] = β. (Then U/c is an unbiased estimator of β.) What is Var[U/c]?

Exercise 2.7.7. Suppose X ∼ Multinomial(n, p), where X and p are 1 × K. Show that

Cov[X] = n(diag(p) − p′p), (2.106)

where diag(p) is the K × K diagonal matrix

diag(p) = [ p1  0   0   · · ·  0
            0   p2  0   · · ·  0
            0   0   p3  · · ·  0
            ⋮   ⋮   ⋮   ⋱      ⋮
            0   0   0   · · ·  pK ].   (2.107)

[Hint: You can use the results in (2.92) and (2.94).]

Exercise 2.7.8. Suppose X and Y are random variables with Var[X] = σX², Var[Y] = σY², and Cov[X, Y] = η. (a) Find Cov[aX + bY, cX + dY] directly (i.e., using (2.44)). (b) Now using the matrix manipulations, find the covariance matrix for

[ a  b
  c  d ] (X, Y)′. (2.108)

Does the covariance term in the resulting matrix equal the answer in part (a)?

Exercise 2.7.9. Suppose

Cov[(X, Y)] = [ 1  2
                2  5 ]. (2.109)

(a) Find the constant b so that X and Y − bX are uncorrelated. (b) For that b, what is Var[Y − bX]?

Exercise 2.7.10. Suppose X is p × 1 and Y is q × 1. Then Cov[X, Y] is defined to be the p × q matrix with elements Cov[Xi, Yj]:

Cov[X, Y] = [ Cov[X1, Y1]  Cov[X1, Y2]  · · ·  Cov[X1, Yq]
              Cov[X2, Y1]  Cov[X2, Y2]  · · ·  Cov[X2, Yq]
              ⋮            ⋮            ⋱      ⋮
              Cov[Xp, Y1]  Cov[Xp, Y2]  · · ·  Cov[Xp, Yq] ].   (2.110)

(a) Show that Cov[X, Y] = E[(X − E[X])(Y − E[Y])′] = E[XY′] − E[X]E[Y]′. (b) Suppose A is r × p and B is s × q. Show that Cov[AX, BY] = A Cov[X, Y] B′.

Exercise 2.7.11. Let Z ∼ N(0, 1), and W = Z², so W ∼ χ²_1 as in Exercise 1.7.13. (a) Find the moment generating function MW(t) of W by integrating over the pdf of Z, i.e., find ∫ e^{tz²} fZ(z) dz. For which values of t is MW(t) finite? (b) From (2.76), the moment generating function of a Gamma(α, λ) random variable is λ^α/(λ − t)^α when t < λ. For what values of α and λ does this mgf equal that of W? Is that result as it should be, i.e., the mgf of χ²_1?


Exercise 2.7.12. As in Table 1.1 (page 7), the Laplace distribution (also known as the double exponential) has space (−∞, ∞) and pdf f(x) = (1/2)e^{−|x|}. (a) Show that the Laplace has mgf M(t) = 1/(1 − t²). [Break the integral into two parts, according to the sign of x.] (b) For which t is the mgf finite?

Exercise 2.7.13. Continue with X ∼ Laplace as in Exercise 2.7.12. (a) Show that for k even, E[X^k] = Γ(k + 1) = k!. [Hint: It is easiest to do the integral directly, noting that by symmetry it is twice the integral over (0, ∞).] (b) Use part (a) to show that Var[X] = 2 and Kurtosis[X] = 3.

Exercise 2.7.14. Suppose (X, Y) = (cos(Θ), sin(Θ)), where Θ ∼ Uniform(0, 2π). (a) Find E[X], E[X²], E[X³], E[X⁴], and Var[X], Skewness[X], and Kurtosis[X]. (b) Find Cov[X, Y] and Corr[X, Y]. (c) Find E[X² + Y²] and Var[X² + Y²].

Exercise 2.7.15. (a) Show that the mgf of the Poisson(λ) is e^{λ(e^t − 1)}. (b) Find the kth cumulant of the Poisson(λ) as a function of λ and k.

Exercise 2.7.16. (a) Fill in the skewness and kurtosis for the indicated distributions (if they exist). The "cos(Θ)" is the X from Exercise 2.7.14.

Distribution      Skewness   Kurtosis
Normal(0,1)
Uniform(0,1)
Exponential(1)
Laplace
Cauchy
cos(Θ)
Poisson(1/2)
Poisson(20)

(b) Which of the given distributions with zero skewness is most “boxy,” according to the above table? (c) Which of the given distributions with zero skewness is the most “pointy-middled/fat-tailed,” according to the above table? (d) Which of the given distributions is most like the normal (other than the normal), according to the above table? Which is least like the normal? (Ignore the distributions whose skewness and/or kurtosis does not exist.)

Exercise 2.7.17. The logistic distribution has space R and pdf f(x) = e^x(1 + e^x)^{−2} as in Table 1.1. Show that the qth quantile is ηq = log(q/(1 − q)), which is logit(q).

Exercise 2.7.18. This exercise uses X ∼ Logistic. (a) Exercise 1.7.15 shows that X can be represented as X = log(U/(1 − U)) where U ∼ Uniform(0, 1). Show that the mgf of X is

MX(t) = E[e^{tX}] = E[e^{t log(U/(1−U))}] = Γ(1 + t)Γ(1 − t). (2.111)

For which values of t is that equation valid? [Hint: Write the integrand as a product of powers of u and 1 − u, and notice that it looks like the beta pdf without the constant.] (b) The digamma function is defined to be ψ(α) = d log(Γ(α))/dα. The trigamma function is its derivative, ψ′(α). Show that the variance of the logistic is π²/3. You can use the fact that ψ′(1) = π²/6. (c) Show that

Var[X] = 2 ∫_0^∞ x² e^{−x}/(1 + e^{−x})² dx = 4η(2), where η(s) = ∑_{k=1}^∞ (−1)^{k−1}/k^s. (2.112)

The function η is the Dirichlet eta function, and η(2) = π²/12. [Hint: For the first equality in (2.112), use the fact that the pdf of the logistic is symmetric about 0, i.e., f(x) = f(−x). For the second equality, use the expansion (1 − z)^{−2} = ∑_{k=1}^∞ k z^{k−1} for |z| < 1, then integrate each term over x, noting that each term has something like a gamma pdf.]

Exercise 2.7.19. If X ∼ N(µ, σ²), then Y = exp(X) has a lognormal distribution. (a) Show that the kth raw moment of Y is exp(kµ + k²σ²/2). [Hint: Note that E[Y^k] = MX(k), where MX is the mgf of X.] (b) Show that for t > 0, the mgf of Y is infinite. Thus the conditions for Lemma 2.6 do not hold, but the moments are finite anyway. [Hint: Write MY(t) = E[exp(t exp(X))] = c ∫ exp(t exp(x) − (x − µ)²/(2σ²)) dx. Then show that for t > 0, t exp(x)/((x − µ)²/(2σ²)) → ∞ as x → ∞, which means there is some x0 such that the exponent in the integral is greater than 0 for x > x0. Thus MY(t) > c ∫_{x0}^∞ 1 dx = ∞.]

Exercise 2.7.20. Suppose Z has pmf pZ(z) = c exp(−z²/2) for z = 0, ±1, ±2, . . .. That is, the space of Z is Z, the set of all integers. Here, c = 1/∑_{z∈Z} exp(−z²/2). Let W = exp(Z). (a) Show that E[W^k] = exp(k²/2). [Hint: Write E[W^k] = E[exp(kZ)] = c ∑_{z∈Z} exp(kz − z²/2). Then complete the square in the exponent wrt z, and change the summation to that over z − k.] (b) Show that W has the same raw moments as the lognormal Y in Exercise 2.7.19 when µ = 0 and σ² = 1. Do W and Y have the same distribution? (See Durrett (2010) for this W and an extension.)

Exercise 2.7.21. Suppose the random variable X has mgf M(t) that is finite for |t| ≤ ε for some ε > 0. This exercise shows that all moments of X are finite. (a) Show that E[exp(t|X|)] < ∞ for |t| ≤ ε. [Hint: Note that for such t, M(t) and M(−t) are both finite, and exp(t|X|) < exp(tX) + exp(−tX) for any t and X. Then take expected values of both sides of that inequality.] (b) Write exp(t|X|) in its series expansion (exp(a) = ∑_{k=0}^∞ a^k/k!), and show that if t > 0, for any integer k, |X|^k ≤ exp(t|X|) k!/t^k. Argue that then E[|X|^k] < ∞.

Exercise 2.7.22. Continue with the setup in Exercise 2.7.21. Here we prove some facts needed for the proof of Lemma 2.6 in Section 2.5.4. (a) Fix |t| < ε, and show that there exists a δ ∈ (0, ε) such that |t + δ| < ε. Thus M(t + δ) < ∞. Write M(t + δ) = E[exp(t|X|) exp(δ|X|)]. Expand exp(δ|x|) as in Exercise 2.7.21(b) to show that for any integer k, |X|^k exp(t|X|) ≤ exp((t + δ)|X|) k!/δ^k. Argue that therefore E[|X|^k exp(t|X|)] < ∞. (b) Suppose δ ∈ (0, ε′). Show that

|(e^{δx} − 1)/δ| ≤ (e^{ε′|x|} − 1)/ε′. (2.113)

[Hint: Expand the exponential again to obtain (exp(δx) − 1)/δ = ∑_{k=1}^∞ δ^{k−1} x^k/k!. Then take absolute values, noting that in the sum, all the terms satisfy δ^{k−1}|x|^k ≤ ε′^{k−1}|x|^k. Finally, reverse the expansion step.]

Exercise 2.7.23. Verify the quantiles of the Binomial(2, 1/2) given in (2.102) for q ∈ (0.25, 1).

Exercise 2.7.24. Suppose X has the “late start” distribution function as in (1.21) and Figure 1.2 on page 10, where F(x) = 0 if x < 0, F(0) = 1/10, and F(x) = 1 − (9/10)e^{−x/100} if x > 0. Find the quantiles for all q ∈ (0, 1).


Exercise 2.7.25. Imagine wishing to guess the value of a random variable X before you see it. (a) If you guess m and the value of X turns out to be x, you lose (x − m)² dollars. What value of m will minimize your expected loss? Show that m = E[X] minimizes E[(X − m)²] over m, assuming that Var[X] < ∞. [Hint: Write the expected loss as E[X²] − 2mE[X] + m², then differentiate wrt m and set to 0.] (b) What is the minimum value? (c) Suppose instead you lose |x − m|, which has relatively smaller penalties for large errors than does squared error loss. Assume that X has a continuous distribution with pdf f and finite mean. Show that E[|X − m|] is minimized by m being any median of X. [Hint: Write the expected value as

E[|X − m|] = ∫_{−∞}^m |x − m| f(x) dx + ∫_m^∞ |x − m| f(x) dx = −∫_{−∞}^m (x − m) f(x) dx + ∫_m^∞ (x − m) f(x) dx, (2.114)

then differentiate and set to 0. Use the fact that P[X = m] = 0.] The minimum value here is called the mean absolute deviation from the median. (d) Now suppose the penalty is different depending on whether your guess is too small or too large. That is, for some q ∈ (0, 1), you lose q|x − m| if x > m, and (1 − q)|x − m| if x < m. Show that the expected value of this loss is minimized by m being any qth quantile of X.

Exercise 2.7.26. The interquartile range of a distribution is defined to be the difference between the two quartiles, that is, it is IQR = η0.75 − η0.25 (at least if the quartiles are unique). Find the interquartile range for a N(µ, σ²) random variable.


Chapter 3

Marginal Distributions and Independence

3.1 Marginal distributions

Given the distribution of a vector of random variables, it is possible in principle to find the distribution of any individual component of the vector, or any subset of components. To illustrate, consider the distribution of the scores (Assignments, Exams) for a statistics class, where each variable has values “Lo” and “Hi”:

                                Exams
Assignments            Lo        Hi       Marginal of Assignments
Lo                   0.3178    0.2336     0.5514
Hi                   0.1028    0.3458     0.4486
Marginal of Exams    0.4206    0.5794     1
                                                                   (3.1)

Thus about 32% of the students did low on both assignments and exams, and about 35% did high on both. But notice it is also easy to figure out the percentages of people who did low or high on the individual scores, e.g.,

P[Assignments = Lo] = 0.5514 and (hence) P[Assignments = Hi] = 0.4486. (3.2)

These numbers are in the margins of the table (3.1), hence the distribution of assignments alone, and of exams alone, are called marginal distributions. The distribution of (Assignments, Exams) together is called the joint distribution.

More generally, given the joint distribution of (the big vector) (X, Y), one can find the marginal distribution of the vector X, and the marginal distribution of the vector Y. (We don't have to take consecutive components of the vector, e.g., given (X1, X2, . . . , X5), we could be interested in the marginal distribution of (X1, X3, X4), say.)

Actually, the words joint and marginal can be dropped. The joint distribution of (X, Y) is just the distribution of (X, Y); the marginal distribution of X is just the distribution of X, and the same for Y. The extra verbiage can be helpful, though, when dealing with different types of distributions in the same breath.

Before showing how to find the marginal distributions from the joint, we should deal with the spaces. Let W be the joint space of (X, Y), and X and Y be the marginal spaces of X and Y, respectively. Then

X = {x | (x, y) ∈ W for some y} and Y = {y | (x, y) ∈ W for some x}. (3.3)


For example, consider the joint space W = {(x, y) | 0 < x < y < 1}, sketched in Figure 2.1 on page 22. The marginal spaces X and Y are then both (0, 1).

There are various approaches to finding the marginal distributions from the joint. First, suppose F(x, y) is the distribution function for (X, Y) jointly, and FX(x) is that for X marginally. Then (assuming x is p × 1 and y is q × 1),

FX(x) = P[X1 ≤ x1, . . . , Xp ≤ xp] = P[X1 ≤ x1, . . . , Xp ≤ xp, Y1 ≤ ∞, . . . , Yq ≤ ∞] = F(x1, . . . , xp, ∞, . . . , ∞). (3.4)

That is, you put ∞ in for the variables you are not interested in, because they are certainly less than infinity.

The mgf is equally easy. Suppose M(t, s) is the mgf for (X, Y) jointly, so that

M(t, s) = E[et·X+s·Y]. (3.5)

To eliminate the dependence on Y, we now set s to zero, that is, the mgf of X alone is

MX(t) = E[et·X] = E[et·X+0q ·Y] = M(t, 0q). (3.6)

3.1.1 Multinomial distribution

Given X ∼ Multinomial(n, p) as in (2.86), one may wish to find the marginal distribution of a single component, e.g., X1. It should be binomial, because now for each trial a success is that the observation is in the first category. To show this fact, we find the mgf of X1 by setting t2 = · · · = tK = 0 in (2.91):

MX1(t) = MX((t, 0, . . . , 0)) = (p1e^t + p2 + · · · + pK)^n = (p1e^t + 1 − p1)^n, (3.7)

which is indeed the mgf of a binomial as in (2.83). Specifically, X1 ∼ Binomial(n, p1).

3.2 Marginal densities

More challenging, but also more useful, is to find the marginal density from the joint density, assuming it exists. Suppose the joint distribution of the two random variables, (X, Y), has pmf f(x, y), and space W. Then X has a pmf, fX(x), as well. To find it in terms of f, write

fX(x) = P[X = x (and Y can be anything)] = ∑_{y | (x,y)∈W} P[X = x, Y = y] = ∑_{y | (x,y)∈W} f(x, y). (3.8)

That is, you add up all the f(x, y) for that value of x, as in the table (3.1). The same procedure works if X and Y are vectors. The set of y's we are summing over we will call the conditional space of Y given X = x, and denote it by Yx:

Yx = {y ∈ Y | (x, y) ∈ W}. (3.9)


With W = {(x, y) | 0 < x < y < 1}, for any x ∈ (0, 1), y ranges from x to 1, hence

Yx = (x, 1). (3.10)

In the coin example in Section 1.6.3, for any probability of heads x, the range of Y is the same (see Figure 1.5 on page 12), so that Yx = {0, 1, . . . , n} for any x ∈ (0, 1).

To summarize, in the general discrete case, we have

fX(x) = ∑_{y∈Yx} f(x, y), x ∈ X . (3.11)

3.2.1 Ranks

The National Opinion Research Center Amalgam Survey of 1972 asked people to rank three types of areas in which to live: City over 50,000, Suburb (within 30 miles of a City), and Country (everywhere else). The table (3.12) shows the results (Duncan and Brody, 1982), with respondents categorized by their current residence.

Ranking                                Residence
(City, Suburb, Country)     City    Suburb    Country    Total
(1, 2, 3)                    210       22        10        242
(1, 3, 2)                     23        4         1         28
(2, 1, 3)                    111       45        14        170
(2, 3, 1)                      8        4         0         12
(3, 1, 2)                    204      299       125        628
(3, 2, 1)                     81      126       152        359
Total                        637      500       302       1439
                                                                  (3.12)

That is, a ranking of (1, 2, 3) means that person ranks living in the city best, suburbs next, and country last. There were 242 people in the sample with that ranking, 210 of whom live in the city (so they should be happy), 22 of whom live in the suburbs, and just 10 of whom live in the country.

The random vector here is (X, Y, Z), say, where X represents the rank of city, Y that of suburb, and Z that of country. The space consists of the six permutations of 1, 2, and 3:

W = {(1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1)}, (3.13)

as in the first column of the table. Suppose the total column is our population, so that there are 1439 people all together, and we randomly choose a person from this population. Then the (joint) distribution of the person's ranking (X, Y, Z) is given by

f(x, y, z) = P[(X, Y, Z) = (x, y, z)] = { 242/1439 if (x, y, z) = (1, 2, 3)
                                           28/1439 if (x, y, z) = (1, 3, 2)
                                          170/1439 if (x, y, z) = (2, 1, 3)
                                           12/1439 if (x, y, z) = (2, 3, 1)
                                          628/1439 if (x, y, z) = (3, 1, 2)
                                          359/1439 if (x, y, z) = (3, 2, 1) }.   (3.14)

This distribution could use some summarizing, e.g., what are the marginal distributions of X, Y, and Z? For each ranking x = 1, 2, 3, we have to add over the possible rankings of Y and Z, so that

fX(1) = f(1, 2, 3) + f(1, 3, 2) = (242 + 28)/1439 = 0.1876;
fX(2) = f(2, 1, 3) + f(2, 3, 1) = (170 + 12)/1439 = 0.1265;
fX(3) = f(3, 1, 2) + f(3, 2, 1) = (628 + 359)/1439 = 0.6859. (3.15)

Thus city is ranked third over 2/3 of the time. The marginal rankings of suburb and country can be obtained similarly.
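Marginal pmfs like (3.15) are just sums of the joint pmf over the components being dropped, which is easy to mirror in code (a sketch, not from the text) using the Total column of table (3.12):

import numpy as np

rankings = [(1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1)]
totals = np.array([242, 28, 170, 12, 628, 359])      # Total column of table (3.12)
f = totals / totals.sum()                            # joint pmf of (X, Y, Z), as in (3.14)

# Marginal pmf of X (rank of city): add f over rankings with the given city rank
for rank in (1, 2, 3):
    print(rank, sum(fi for (x, y, z), fi in zip(rankings, f) if x == rank))
# Prints approximately 0.1876, 0.1265, 0.6859, matching (3.15).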

3.2.2 PDFs

Again for two variables, suppose now that the pdf is f(x, y). We know that the distribution function is related to the pdf via

F(x, y) = ∫_{(−∞,x]∩X} ∫_{(−∞,y]∩Yu} f(u, v) dv du. (3.16)

From (3.4), to obtain the distribution function, we set y = ∞, which means in the inside integral, we can remove the “(−∞, y]” part:

FX(x) = ∫_{(−∞,x]∩X} ∫_{Yu} f(u, v) dv du. (3.17)

Then the pdf of X is found by taking the derivative with respect to x for x ∈ X , which here just means stripping away the outer integral and setting u = x (and v = y, if we wish):

fX(x) = (∂/∂x) FX(x) = ∫_{Yx} f(x, y) dy. (3.18)

Thus instead of summing over the y as in (3.11), we integrate. This procedure is often called “integrating out y.”

Consider the example in Section 2.2.1, where f(x, y) = 2 for 0 < x < y < 1. From (3.10), for x ∈ (0, 1), we have Yx = (x, 1), hence

fX(x) = ∫_{Yx} f(x, y) dy = ∫_x^1 2 dy = 2(1 − x). (3.19)

With vectors, the process is the same, just embolden the variables:

fX(x) = ∫_{Yx} f(x, y) dy. (3.20)
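As a check on (3.19) (a sketch, not from the text), draw from the triangle density as in the earlier simulation (smaller and larger of two independent uniforms) and compare a histogram of X with the line 2(1 − x). Bin choices are arbitrary.

import numpy as np

rng = np.random.default_rng(8)
u = rng.uniform(0, 1, (1_000_000, 2))
x = u.min(axis=1)                      # X from the pdf f(x, y) = 2 on 0 < x < y < 1

# Compare histogram heights (density scale) with the marginal pdf 2(1 - x)
counts, edges = np.histogram(x, bins=20, range=(0, 1), density=True)
mids = (edges[:-1] + edges[1:]) / 2
for m, c in zip(mids[:5], counts[:5]):
    print(round(m, 3), round(c, 3), round(2 * (1 - m), 3))
# The estimated density tracks 2(1 - x) across the bins.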

3.3 Independence

Much of statistics is geared towards evaluation of relationships between variables: Does smoking cause cancer? Do cell phones? What factors explain the rise in asthma? The absence of a relationship, independence, is also important. Two events A and B are independent if P[A ∩ B] = P[A] × P[B]. The definition for random variables is similar:


Definition 3.1. Suppose (X, Y) has joint distribution P, and marginal spaces X and Y ,respectively. Then X and Y are independent if

P[X ∈ A and Y ∈ B] = P[X ∈ A]× P[Y ∈ B] for all A ⊂ X and B ⊂ Y . (3.21)

Also, if (X(1), . . . , X(K)) has distribution P, and the vector X(k) has space Xk, then X(1), . . .,X(K) are (mutually) independent if

P[X(1) ∈ A1, . . . , X(K) ∈ AK ] = P[X(1) ∈ A1]× · · · × P[X(K) ∈ AK ]

for all A1 ⊂ X1, . . . , AK ⊂ XK . (3.22)

The basic idea in independence is that what happens with one variable does not affect what happens with another. There are a number of useful equivalences for independence of X and Y. (Those for mutual independence of K vectors hold similarly.)

• Distribution functions: X and Y are independent if and only if

F(x, y) = FX(x)× FY(y) for all x ∈ Rp, y ∈ Rq. (3.23)

• Expected values of products of functions: X and Y are independent if and only if

E[g(X)h(Y)] = E[g(X)]× E[h(Y)] (3.24)

for all functions g : X → R and h : Y → R whose expected values exist.

• MGFs: Suppose the marginal mgfs of X and Y are finite for t and s in neighborhoods of zero (respectively in Rp and Rq). Then X and Y are independent if and only if

M(t, s) = MX(t)MY(s) (3.25)

for all (t, s) in a neighborhood of zero in Rp+q.

The second item can be used to show that independent random variables are uncorrelated, because as in (2.23), Cov[X, Y] = E[XY] − E[X]E[Y], and (3.24) shows that E[XY] = E[X]E[Y] if X and Y are independent. Be aware that the implication does not go the other way, that is, X and Y can have correlation 0 and still not be independent. For example, suppose W = {(0, 1), (0, −1), (1, 0), (−1, 0)}, and P[(X, Y) = (x, y)] = 1/4 for each (x, y) ∈ W. Then it is not hard to show that E[X] = E[Y] = 0, and that E[XY] = 0 (in fact, XY = 0 always), hence Cov[X, Y] = 0. But X and Y are not independent, e.g., take A = {0} and B = {0}. Then

P[X = 0 and Y = 0] = 0 ≠ P[X = 0] P[Y = 0] = (1/2) × (1/2).   (3.26)

3.3.1 Independent exponentials

Suppose U and V are independent Exponential(1)'s. The mgf of an Exponential(1) is 1/(1 − t) for t < 1. See (2.76), which gives the mgf of Gamma(α, λ) as (λ/(λ − t))^α for t < λ. Thus the mgf of (U, V) is

M(U,V)(t1, t2) = MU(t1) MV(t2) = [1/(1 − t1)] [1/(1 − t2)],   t1 < 1, t2 < 1.   (3.27)


Now let

X = U + V and Y = U − V.   (3.28)

Are X and Y independent? What are their marginal distributions? We can start by looking at the mgf:

M(X,Y)(s1, s2) = E[e^{s1 X + s2 Y}]
              = E[e^{s1(U+V) + s2(U−V)}]
              = E[e^{(s1+s2)U + (s1−s2)V}]
              = M(U,V)(s1 + s2, s1 − s2)
              = [1/(1 − s1 − s2)] [1/(1 − s1 + s2)].        (3.29)

This mgf is finite if s1 + s2 < 1 and s1 − s2 < 1, which is a neighborhood of (0, 0). If X and Y are independent, then this mgf must factor into the two individual mgfs. It does not appear to factor. More formally, the marginal mgfs are

MX(s1) = M(X,Y)(s1, 0) = 1/(1 − s1)²,   (3.30)

and

MY(s2) = M(X,Y)(0, s2) = 1/[(1 − s2)(1 + s2)] = 1/(1 − s2²).   (3.31)

Note that the first one is that of a Gamma(2, 1), so that X ∼ Gamma(2, 1). The one for Y may not be recognizable, but it turns out to be the mgf of a Laplace, as in Exercise 2.7.12. Notice that

M(X,Y)(s1, s2) ≠ MX(s1) MY(s2),   (3.32)

hence X and Y are not independent. They are uncorrelated, however:

Cov[X, Y] = Cov[U + V, U − V] = Var[U] − Var[V] − Cov[U, V] + Cov[V, U] = 1 − 1 = 0.   (3.33)
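A small simulation (my own sketch, not from the text) makes the point concrete: the sample correlation of X and Y is near zero, yet an event comparison in the spirit of Definition 3.1 shows they are not independent, since |Y| = |U − V| ≤ U + V = X forces small X to go with small |Y|.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.exponential(1.0, size=200_000)
v = rng.exponential(1.0, size=200_000)
x, y = u + v, u - v

print(np.corrcoef(x, y)[0, 1])           # near 0: uncorrelated

# Not independent: the event {X < 0.5, |Y| > 0.5} is impossible, but has positive product probability.
a = x < 0.5
b = np.abs(y) > 0.5
print((a & b).mean(), a.mean() * b.mean())
```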

3.3.2 Spaces and densities

Suppose (X, Y) is discrete, with pmf f(x, y) > 0 for (x, y) ∈ W, and fX and fY are the marginal pmfs of X and Y, respectively. Then applying (3.21) to the singleton sets {x} and {y} for x ∈ X and y ∈ Y shows that

P[X = x, Y = y] = P[X = x]× P[Y = y], (3.34)

which translates to

f(x, y) = fX(x) fY(y).   (3.35)

In particular, this equation shows that if x ∈ X and y ∈ Y, then f(x, y) > 0, hence (x, y) ∈ W. That is, if X and Y are independent, then

W = X ×Y , (3.36)

the “rectangle” created from the marginal spaces.


[Figure 3.1: Some Cartesian products, i.e., rectangles. The four panels show (0, 1) × (0, 2); [(0, 1) ∪ (2, 3)] × [(0, 1) ∪ (2, 3)]; {0, 1, 2} × {1, 2, 3, 4}; and {0, 1, 2} × [2, 4].]

Formally, given sets A ⊂ Rp and B ⊂ Rq, the Cartesian product or rectangle A × B is defined to be

A × B = {(x, y) | x ∈ A and y ∈ B} ⊂ Rp+q.   (3.37)

The set may not be a rectangle in the usual sense, although it will be if p = q = 1 and A and B are both intervals. Figure 3.1 has some examples. Of course, Rp+q is a rectangle itself, being Rp × Rq. The result (3.36) holds in general.

Lemma 3.2. If X and Y are independent, then the spaces can be taken so that (3.36) holds.

This lemma implies that if the joint space cannot be a rectangle, then X and Y are not independent. Consider the example in Section 2.2.1, where W = {(x, y) | 0 < x < y < 1}, a triangle. If we take a square below that triangle, such as (0.8, 1) × (0, 0.2), then

P[(X, Y) ∈ (0.8, 1) × (0, 0.2)] = 0 but P[X ∈ (0.8, 1)] P[Y ∈ (0, 0.2)] > 0,   (3.38)

so that X and Y are not independent. The result extends to more variables. In Section 3.2.1 on ranks, the marginal spaces are

X = Y = Z = {1, 2, 3},   (3.39)

but

W ≠ X × Y × Z = {(1, 1, 1), (1, 1, 2), (1, 1, 3), . . . , (3, 3, 3)},   (3.40)


in particular because W has only 6 elements, while the product space has 3³ = 27.

The factorization in (3.34) is necessary and sufficient for independence in the discrete and continuous cases, or mixed-type densities as in Section 1.6.3.

Lemma 3.3. Suppose X has marginal density fX(x), and Y has marginal density fY(y). Then X and Y are independent if and only if the distribution of (X, Y) can be given by density

f (x, y) = fX(x) fY(y) (3.41)

and space

W = X × Y.   (3.42)

In the lemma, we say "can be given" since in the continuous case, we can change the densities or spaces on sets of probability zero without changing the distribution. We can simplify a little, that is, as long as the space and joint pdf factor, we have independence.

Lemma 3.4. Suppose (X, Y) has joint density f(x, y). Then X and Y are independent if and only if the density can be written as

f (x, y) = g(x)h(y) (3.43)

for some functions g and h, and

W = X × Y.   (3.44)

This lemma is not presuming the g and h are actual densities, although they certainly could be.

3.3.3 IID

A special case of independence has the vectors with the exact same distribution, as well as being independent. That is, X(1), . . . , X(K) are independent, and all have the same marginal distribution. We say the vectors are iid, meaning "independent and identically distributed." This type of distribution often is used to model random samples, where n individuals are chosen from a (virtually) infinite population, and p variables are recorded on each. Then K = n and the X(i)'s are p × 1 vectors. If the marginal density is fX for each X(i), with marginal space X, then the joint density of the entire sample is

f (x(1), . . . , x(n)) = fX(x(1)) · · · fX(x(n)) (3.45)

with space

W = X × · · · × X = X^n.   (3.46)

3.4 Exercises

Exercise 3.4.1. Let (X, Y, Z) be the ranking variables as in (3.14). (a) Find the marginal distributions of Y and Z. What is the most popular rank of suburb? Of country? (b) Find the marginal space and marginal pmf of (X, Y). How does the space differ from that of (X, Y, Z)?


Exercise 3.4.2. Suppose U and V are iid with finite variance, and let X = U + V and Y = U − V, as in (3.28). (a) Show that X and Y are uncorrelated. (b) Suppose U and V both have space (0, ∞). Without knowing the pdfs, what can you say about the independence of X and Y? [Hint: What is the space of (X, Y)?]

Exercise 3.4.3. Suppose U and V are iid N(0, 1), and let X = U + V and Y = U − V again. (a) Find the mgf of the joint (U, V), M(U,V)(t1, t2). (b) Find the mgf of (X, Y), M(X,Y)(s1, s2). Show that it factors into the mgfs of X and Y. (c) What are the marginal distributions of X and Y? [Hint: See (2.73).]

Exercise 3.4.4. Suppose (X, Y) is uniform over the unit disk, so that the space is W = {(x, y) | x² + y² < 1} and f(x, y) = 1/π for (x, y) ∈ W. (a) What are the (marginal) spaces of X and Y? Are X and Y independent? (b) For x ∈ X (the marginal space of X), what is Yx (the conditional space of Y given X = x)? (c) Find the (marginal) pdf of X.

Exercise 3.4.5. Let Z1, Z2, and Z3 be independent, each with space {−1, +1}, and P[Zi = −1] = P[Zi = +1] = 1/2. Set

X1 = Z1Z3 and X2 = Z2Z3.   (3.47)

(a) What is the space of (X1, X2)? (b) Are X1 and X2 independent? (c) Now let X3 = Z1Z2. Are X1 and X3 independent? (d) Are X2 and X3 independent? (e) What is the space of (X1, X2, X3)? (f) Are X1, X2, and X3 mutually independent? (g) Now let U = X1X2X3. What is the space of U?

Exercise 3.4.6. For each given pdf f(x, y) and space W for (X, Y), answer true or false to the three statements: (i) X and Y are independent; (ii) The space of (X, Y) is a rectangle; (iii) Cov[X, Y] = 0.

      f(x, y)       W
(a)   c1 xy         {(x, y) | 0 < x < 1, 0 < y < 1}
(b)   c2 xy         {(x, y) | 0 < x < 1, 0 < y < 1, x + y < 1}
(c)   c3 (x + y)    {(x, y) | 0 < x < 1, 0 < y < 1}
                                                                  (3.48)

The ci’s are constants.

Exercise 3.4.7. Suppose Θ ∼ Uniform(0, 2π), and define X = cos(Θ), Y = sin(Θ). Also, set R = √(X² + Y²). True or false? (a) X and Y are independent. (b) The space of (X, Y) is a rectangle. (c) Cov[X, Y] = 0. (d) R and Θ are independent. (e) The space of (R, Θ) is a rectangle. (f) Cov[R, Θ] = 0.


Chapter 4

Transformations: DFs and MGFs

A major task of mathematical statistics is finding, or approximating, the distributions of random variables that are functions of other random variables. Important examples are estimators, hypothesis tests, and predictors. This chapter and the next will address finding exact distributions. There are many approaches, and which one to use may not always be obvious. Chapters 8 and 9 consider large-sample approximations. The following sections run through a number of possibilities, though the granddaddy of them all, using Jacobians, has its own Chapter 5.

4.1 Adding up the possibilities

If X is discrete, then any function Y = g(X) will also be discrete, hence its pmf can be found by adding up all the probabilities that correspond to a given y:

fY(y) = P[Y = y] = P[g(X) = y] = Σ_{x | g(x) = y} fX(x),   y ∈ Y.   (4.1)

Of course, that final summation may or may not be easy to find. One situation in which it is easy is when g is a one-to-one and onto function from X to Y, so that there exists an inverse function,

g−1 : Y → X ; g(g−1(y)) = y and g−1(g(x)) = x. (4.2)

Then, with fX being the pmf of X,

fY(y) = P[g(X) = y] = P[X = g−1(y)] = fX(g−1(y)). (4.3)

For example, if X ∼ Poisson(λ), and Y = X², then g(x) = x², hence g−1(y) = √y for y ∈ Y = {0, 1, 4, 9, . . .}. The pmf of Y is then

fY(y) = fX(√y) = e^{−λ} λ^{√y} / (√y)!,   y ∈ Y.   (4.4)

Notice that it is important to have the spaces correct. E.g., this g is not one-to-one if the space is R, and the "√y" makes sense only if y is the square of a nonnegative integer.

We consider some more examples.


4.1.1 Sum of discrete uniforms

Suppose X = (X1, X2), where X1 and X2 are independent, and

X1 ∼ Discrete Uniform(0, 1) and X2 ∼ Discrete Uniform(0, 2). (4.5)

Note that X1 ∼ Bernoulli(1/2). We are after the distribution of Y = X1 + X2. The space of Y can be seen to be Y = {0, 1, 2, 3}. This function is not one-to-one, e.g., there are two x's that sum to 1: (0, 1) and (1, 0). This is a small enough example that we can just write out all the possibilities:

fY(0) = P[X1 + X2 = 0] = P[X = (0, 0)] = (1/2) × (1/3) = 1/6;
fY(1) = P[X1 + X2 = 1] = P[X = (0, 1) or X = (1, 0)] = (1/2) × (1/3) + (1/2) × (1/3) = 2/6;
fY(2) = P[X1 + X2 = 2] = P[X = (0, 2) or X = (1, 1)] = (1/2) × (1/3) + (1/2) × (1/3) = 2/6;
fY(3) = P[X1 + X2 = 3] = P[X = (1, 2)] = (1/2) × (1/3) = 1/6.                    (4.6)
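When the spaces are small, the same bookkeeping can be done by brute force. The sketch below is my own (the names f1, f2 are just labels, and exact arithmetic with fractions is a convenience, not the book's notation); it enumerates all pairs and reproduces (4.6).

```python
from itertools import product
from fractions import Fraction
from collections import defaultdict

# pmfs of X1 ~ Discrete Uniform(0, 1) and X2 ~ Discrete Uniform(0, 2)
f1 = {0: Fraction(1, 2), 1: Fraction(1, 2)}
f2 = {0: Fraction(1, 3), 1: Fraction(1, 3), 2: Fraction(1, 3)}

fY = defaultdict(Fraction)
for (x1, p1), (x2, p2) in product(f1.items(), f2.items()):
    fY[x1 + x2] += p1 * p2          # add P[X = (x1, x2)] into the bin for that sum

print(dict(fY))   # {0: 1/6, 1: 1/3, 2: 1/3, 3: 1/6}, as in (4.6)
```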

4.1.2 Convolutions for discrete variables

Here we generalize the previous example a bit by assuming X1 has pmf f1 and space X1 = {0, 1, . . . , a}, and X2 has pmf f2 and space X2 = {0, 1, . . . , b}. Both a and b are positive integers, or either could be +∞. Then Y has space

Y = {0, 1, . . . , a + b}.   (4.7)

To find fY(y), we need to sum up the probabilities of all (x1, x2)'s for which x1 + x2 = y. These pairs can be written (x1, y − x1), and require x1 ∈ X1 as well as y − x1 ∈ X2. That is, for fixed y ∈ Y,

0 ≤ x1 ≤ a and 0 ≤ y − x1 ≤ b   =⇒   max{0, y − b} ≤ x1 ≤ min{a, y}.   (4.8)

For example, with a = 3 and b = 3, the following table shows which x1's correspond to each y:

x1 ↓ ; x2 →    0   1   2   3
    0          0   1   2   3
    1          1   2   3   4
    2          2   3   4   5
    3          3   4   5   6
                                      (4.9)

Each value of y appears along a diagonal, so that

y = 0 ⇒ x1 = 0;
y = 1 ⇒ x1 = 0, 1;
y = 2 ⇒ x1 = 0, 1, 2;
y = 3 ⇒ x1 = 0, 1, 2, 3;
y = 4 ⇒ x1 = 1, 2, 3;
y = 5 ⇒ x1 = 2, 3;
y = 6 ⇒ x1 = 3.                    (4.10)


Thus in general, for y ∈ Y ,

fY(y) = P[X1 + X2 = y]
      = Σ_{x1 = max{0, y−b}}^{min{a, y}} P[X1 = x1, X2 = y − x1]
      = Σ_{x1 = max{0, y−b}}^{min{a, y}} f1(x1) f2(y − x1).                    (4.11)

This formula is called the convolution of f1 and f2. In general, the distribution of the sum of two independent random variables is the convolution of their distributions.

To illustrate, suppose X1 has the pmf of the Y in (4.6), and X2 is Discrete Uniform(0, 3), so that a = b = 3 and

f1(0) = f1(3) = 1/6,   f1(1) = f1(2) = 1/3,   and   f2(x2) = 1/4,   x2 = 0, 1, 2, 3.   (4.12)

Then Y = {0, 1, . . . , 6}, and

fY(0) = Σ_{x1=0}^{0} f1(x1) f2(0 − x1) = 1/24;
fY(1) = Σ_{x1=0}^{1} f1(x1) f2(1 − x1) = 1/24 + 1/12 = 3/24;
fY(2) = Σ_{x1=0}^{2} f1(x1) f2(2 − x1) = 1/24 + 1/12 + 1/12 = 5/24;
fY(3) = Σ_{x1=0}^{3} f1(x1) f2(3 − x1) = 1/24 + 1/12 + 1/12 + 1/24 = 6/24;
fY(4) = Σ_{x1=1}^{3} f1(x1) f2(4 − x1) = 1/12 + 1/12 + 1/24 = 5/24;
fY(5) = Σ_{x1=2}^{3} f1(x1) f2(5 − x1) = 1/12 + 1/24 = 3/24;
fY(6) = Σ_{x1=3}^{3} f1(x1) f2(6 − x1) = 1/24.                    (4.13)

Check that the fY(y)’s do sum to 1.
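Since (4.11) is exactly a discrete convolution, the whole table (4.13) can also be checked with NumPy; this is a sketch of my own, not part of the text.

```python
import numpy as np

f1 = np.array([1, 2, 2, 1]) / 6     # pmf of X1 on 0, 1, 2, 3, from (4.12)
f2 = np.full(4, 1 / 4)              # Discrete Uniform(0, 3)

fY = np.convolve(f1, f2)            # pmf of Y = X1 + X2 on 0, 1, ..., 6
print(fY * 24)                      # [1, 3, 5, 6, 5, 3, 1] / 24, matching (4.13)
print(fY.sum())                     # 1.0
```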

4.1.3 Sum of two Poissons

An example for which a = b = ∞ has X1 and X2 independent Poissons, with parameters λ1 and λ2, respectively. Then Y = X1 + X2 has space Y = {0, 1, . . .}, the same as the spaces of the Xi's. In this case, for fixed y, x1 = 0, . . . , y, hence

fY(y) = Σ_{x1=0}^{y} f1(x1) f2(y − x1)
      = Σ_{x1=0}^{y} [e^{−λ1} λ1^{x1}/x1!] [e^{−λ2} λ2^{y−x1}/(y − x1)!]
      = e^{−λ1−λ2} Σ_{x1=0}^{y} λ1^{x1} λ2^{y−x1} / [x1! (y − x1)!]
      = e^{−λ1−λ2} (1/y!) Σ_{x1=0}^{y} [y!/(x1! (y − x1)!)] λ1^{x1} λ2^{y−x1}
      = e^{−λ1−λ2} (1/y!) (λ1 + λ2)^y,                    (4.14)

the last step using the binomial theorem in (2.82). But that last expression is the Poisson pmf, i.e.,

Y ∼ Poisson(λ1 + λ2). (4.15)

This fact can be proven also using mgfs. See Section 4.3.
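The identity (4.15) is easy to check numerically as well. The following sketch (mine, using scipy.stats; the truncation at 40 terms is an arbitrary choice that leaves a negligible tail) convolves two Poisson pmfs term by term as in (4.14) and compares to the Poisson(λ1 + λ2) pmf.

```python
import numpy as np
from scipy.stats import poisson

lam1, lam2 = 2.0, 3.5
y = np.arange(40)                      # truncate the infinite space; the tail beyond 40 is negligible here

# Convolution sum (4.14), term by term
conv = np.array([sum(poisson.pmf(x1, lam1) * poisson.pmf(k - x1, lam2)
                     for x1 in range(k + 1)) for k in y])

direct = poisson.pmf(y, lam1 + lam2)   # Poisson(lam1 + lam2) pmf
print(np.max(np.abs(conv - direct)))   # essentially 0
```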

4.2 Distribution functions

Suppose the p × 1 vector X has distribution function FX, and Y = g(X) for some function

g : X −→ Y ⊂ R. (4.16)

Then the distribution function FY of Y is

FY(y) = P[Y ≤ y] = P[g(X) ≤ y], y ∈ R. (4.17)

The final probability is in principle obtainable from the distribution of X, which solves the problem. If Y has a pdf, we can then find that by differentiating, as we did in (1.17). Exercises 1.7.13 and 1.7.14 already gave previews of this approach for the χ²₁ distribution, as did Exercise 1.7.15 for the logistic.

4.2.1 Convolutions for continuous random variables

Suppose (X1, X2) has pdf f(x1, x2). For initial simplicity, we will take the space to be the entire plane, X = R², noting that f could be 0 over wide swaths of the space. Let Y = X1 + X2, so it has space Y = R. Its distribution function is

FY(y) = P[X1 + X2 ≤ y] = P[X2 ≤ y − X1] = ∫_{−∞}^{∞} ∫_{−∞}^{y−x1} f(x1, x2) dx2 dx1.   (4.18)

The pdf is found by differentiating with respect to y, which replaces the inner integral with the integrand evaluated at x2 = y − x1:

fY(y) = F′Y(y) = ∫_{−∞}^{∞} f(x1, y − x1) dx1.   (4.19)


[Figure 4.1: An arrow is shot at an angle of θ, which is chosen randomly between 0 and π. Where it hits the line one unit high is the value x.]

This convolution formula is the analog of (4.11) in the discrete case. When evaluating that integral, we must be careful of when f is 0. For example, if X1 and X2 are iid Uniform(0,1)'s, we would integrate f(x1, w − x1) = 1 over just x1 ∈ (0, w) if w ∈ (0, 1), or over just x1 ∈ (w − 1, 1) if w ∈ [1, 2). In fact, Exercise 1.7.6 did basically this procedure to find the tent distribution.

As another illustration, suppose X1 and X2 are iid Exponential(λ), so that Y = (0, ∞). Then for fixed y > 0, the x1 runs from 0 to y. Thus

fY(y) = ∫_0^y λ² e^{−λx1} e^{−λ(y−x1)} dx1 = λ² y e^{−λy},   (4.20)

which is a Gamma(2, λ). This is a special case of the sum of independent gammas, as we will see in Section 5.3.
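The same calculation can be checked numerically. Here is a sketch of mine (the value λ = 1.5 and the evaluation points are arbitrary choices) that evaluates the convolution integral (4.19) with scipy's quadrature and compares it to the Gamma(2, λ) density.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import gamma

lam = 1.5

def f_sum(y):
    # Convolution integral (4.19) for iid Exponential(lam): the integrand is zero outside (0, y).
    integrand = lambda x1: lam * np.exp(-lam * x1) * lam * np.exp(-lam * (y - x1))
    value, _ = quad(integrand, 0, y)
    return value

for y in (0.3, 1.0, 2.5):
    print(f_sum(y), gamma.pdf(y, a=2, scale=1 / lam))   # the two columns agree
```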

4.2.2 Uniform → Cauchy

Imagine an arrow shot into the air from the origin (the point (0, 0)) at an angle θ to the ground (x-axis). It goes straight (no gravity or friction) to hit the ceiling one unit high (y = 1). The horizontal distance between the origin and where it hits the ceiling is x. See Figure 4.1. If θ is chosen uniformly from 0 to π, what is the density of X?

In the figure, look at the right triangle formed by the origin, the point where the arrow hits the ceiling, and the drop from that point to the x-axis. That is, the triangle that connects the points (0, 0), (x, 1), and (x, 0). The cotangent of θ is the base over the height of that triangle, which is x over one (the length of the dotted line segment): x = cot(θ). Note that smaller values of θ correspond to larger values of x. Thus the distribution function of X is

FX(x) = P[X ≤ x] = P[cot(Θ) ≤ x] = P[Θ ≥ arccot(x)] = 1 − (1/π) arccot(x),   (4.21)

the final equality following since Θ has pdf 1/π for θ ∈ (0, π). The pdf of X is found by differentiating, so all we need is to remember or derive or Google the derivative of the inverse cotangent, which is −1/(1 + x²). Thus

fX(x) = (1/π) × 1/(1 + x²),   (4.22)


the Cauchy pdf from Table 1.1 on page 7.
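A quick Monte Carlo check of the arrow argument (my own sketch, not from the text): generate Θ ~ Uniform(0, π), take X = cot(Θ), and compare the sample to the standard Cauchy with a Kolmogorov-Smirnov test from scipy.

```python
import numpy as np
from scipy.stats import cauchy, kstest

rng = np.random.default_rng(1)
theta = rng.uniform(0, np.pi, size=100_000)
x = np.cos(theta) / np.sin(theta)     # cot(theta); sin(theta) > 0 on (0, pi)

print(kstest(x, cauchy.cdf))          # large p-value: consistent with the standard Cauchy
```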

4.2.3 Probability transform

Generating random numbers on a computer is a common activity in statistics, e.g., for inference using techniques such as the bootstrap and Markov chain Monte Carlo, for randomizing subjects to treatments, and for assessing the performance of various procedures. It is easy to generate independent Uniform(0, 1) random variables — easy in the sense that many people have worked very hard for many years to develop good methods that are now available in most statistical software. Actually, the numbers are not truly random, but rather pseudo-random, because there is a deterministic algorithm producing them. But you could also question whether randomness exists in the real world anyway. Even flipping a coin is deterministic, if you know all the physics in the flip. (At least classical physics. Quantum physics is beyond me.)

One usually is not satisfied with uniforms, but rather has normals or gammas, or something more complex, in mind. There are many clever ways to create the desired random variables from uniforms (see for example the Box-Muller transformation in Section 5.4.4), but the most basic uses the inverse distribution function. We suppose U ∼ Uniform(0, 1), and wish to generate an X that has given distribution function F. Assume that F is continuous, and strictly increasing for x ∈ X, so that the quantiles F−1(u) are well-defined for every u ∈ (0, 1). Consider the random variable

W = F−1(U). (4.23)

Then its distribution function is

FW(w) = P[W ≤ w] = P[F−1(U) ≤ w] = P[U ≤ F(w)] = F(w), (4.24)

where the last step follows because U ∼ Uniform(0, 1) and 0 ≤ F(w) ≤ 1. But that equation means that W has the desired distribution function, hence to generate an X, we generate a U and take X = F−1(U).

To illustrate, suppose we wish X ∼ Exponential(λ). The distribution function F of X is zero for x ≤ 0, and

F(x) = ∫_0^x λ e^{−λw} dw = 1 − e^{−λx}   for x > 0.   (4.25)

Thus for u ∈ (0, 1),

u = F(x) =⇒ x = F−1(u) = − log(1− u)/λ. (4.26)
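In code, the inverse-cdf recipe for the exponential is one line. The sketch below is mine (the rate λ = 0.5 and the checks at the end are arbitrary illustrations); it generates a sample via (4.26) and compares simple summaries to the exact values.

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 0.5
u = rng.uniform(size=100_000)
x = -np.log(1 - u) / lam                 # X = F^{-1}(U) as in (4.26)

print(x.mean(), 1 / lam)                 # both near 2.0
print(np.mean(x <= 1.0), 1 - np.exp(-lam * 1.0))   # empirical vs exact F(1)
```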

One limitation of this method is that F−1 is not always computationally simple. Even in the normal case there is no closed form expression. In addition, this method does not work directly for generating multivariate X, because then F is not invertible.

The approach works for non-continuous distributions as well, but care must be taken because u may fall in a gap. For example, suppose X is Bernoulli(1/2). Then for u = 1/2, we have x = 0, since F(0) = 1/2, but no other value of u ∈ (0, 1) has an x with F(x) = u. The fix is to define a substitute for the inverse, F−, where F−(u) = 0 for 0 < u < 1/2, and F−(u) = 1 for 1/2 < u < 1. Thus half the time F−(U) is 0, and


half the time it is 1. More generally, if u is in a gap, set F−(u) to the value of x for that gap. Mathematically, we define for 0 < u < 1,

F−(u) = min{x | F(x) ≥ u},   (4.27)

which will yield X =D F−(U) for continuous and noncontinuous F.

The process can be reversed, which is useful in hypothesis testing, to obtain p-values. That is, suppose X has continuous strictly increasing (on X) distribution function F, and let

U = F(X). (4.28)

Then the distribution function of U is

FU(u) = P[F(X) ≤ u] = P[X ≤ F−1(u)] = F(F−1(u)) = u, (4.29)

which is the distribution function of a Uniform(0, 1), i.e.,

F(X) ∼ Uniform(0, 1). (4.30)

When F is not continuous, F(X) is stochastically larger than a uniform:

P[F(X) ≤ u] ≤ u, u ∈ (0, 1). (4.31)

That is, it is at least as likely as a uniform to be larger than u. See Exercise 4.4.16. (We look at stochastic ordering again in Definition 18.1 on page 306.)

4.2.4 Location-scale families

One approach to modeling univariate data is to assume a particular shape of the distribution, then let the mean and variance vary, as in the normal. More generally, since the mean or variance may not exist, we use the terms "location" and "scale." We define such families next.

Definition 4.1. Let Z have distribution function F on R. Then for µ ∈ R and σ > 0, X = µ + σZ has the location-scale family distribution based on F with location parameter µ and scale parameter σ.

The distribution function for X in the definition is then

F(x | µ, σ) = F((x − µ)/σ).   (4.32)

If Z has pdf f, then by differentiation we have that the pdf of X is

f(x | µ, σ) = (1/σ) f((x − µ)/σ).   (4.33)

If Z has the moment generating function M(t), then that for X is

M(t | µ, σ) = e^{tµ} M(tσ).   (4.34)


The normal distribution is the most famous location-scale family. Let Z ∼ N(0, 1), the standard normal distribution, and set X = µ + σZ. It is immediate that E[X] = µ and Var[X] = σ². The pdf of X is then

φ(x | µ, σ²) = (1/σ) φ((x − µ)/σ),   where φ(z) = (1/√(2π)) e^{−z²/2},
             = (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)},                    (4.35)

which is the N(µ, σ²) density given in Table 1.1. The mgf of Z is M(t) = exp(t²/2) from (2.70), hence X has mgf

M(t | µ, σ²) = e^{tµ + t²σ²/2},   (4.36)

as we saw in (2.73).

Other popular location-scale families are based on the uniform, Laplace, Cauchy, and logistic. You primarily see location-scale families for continuous distributions, but they can be defined for discrete distributions. Also, if σ is fixed (at σ = 1, say), then the family is a location family, and if µ is fixed (at µ = 0, say), it is a scale family. There are multivariate versions of location-scale families as well, such as the multivariate normal in Chapter 7.

4.3 Moment generating functions

Instead of finding the distribution function of Y, one could try to find the mgf. By uniqueness, Theorem 2.5 on page 26, if you recognize the mgf of Y as being for a particular distribution, then you know Y has that distribution.

4.3.1 Uniform → Exponential

Suppose X ∼ Uniform(0, 1), and Y = − log(X). Then the mgf of Y is

MY(t) = E[e^{tY}]
      = E[e^{−t log(X)}]
      = ∫_0^1 e^{−t log(x)} dx
      = ∫_0^1 x^{−t} dx
      = [1/(−t + 1)] x^{−t+1} |_{x=0}^{1}
      = 1/(1 − t)   if t < 1 (and +∞ if not).                    (4.37)

This mgf is that of the gamma in (2.76), where α = λ = 1, which means it is also Exponential(1). Thus the pdf of Y is

fY(y) = e^{−y}   for y ∈ Y = (0, ∞).   (4.38)


4.3.2 Sum of independent gammas

The mgf approach is especially useful for the sums of independent random variables (convolutions). For example, suppose X1, . . . , XK are independent, with Xk ∼ Gamma(αk, λ). (They all have the same rate, but are allowed different shapes.) Let Y = X1 + · · · + XK. Then its mgf is

MY(t) = E[e^{tY}]
      = E[e^{t(X1+···+XK)}]
      = E[e^{tX1} · · · e^{tXK}]
      = E[e^{tX1}] × · · · × E[e^{tXK}]   by independence of the Xk's
      = M1(t) × · · · × MK(t)   where Mk(t) is the mgf of Xk
      = (λ/(λ − t))^{α1} · · · (λ/(λ − t))^{αK}   for t < λ, by (2.76)
      = (λ/(λ − t))^{α1 + ··· + αK}.                    (4.39)

But then this mgf is that of Gamma(α1 + · · ·+ αK , λ). Thus

X1 + · · ·+ XK ∼ Gamma(α1 + · · ·+ αK , λ). (4.40)

What if the λ's are not equal? Then it is still easy to find the mgf, but it would not be the Gamma mgf, or anything else we have seen so far.

4.3.3 Linear combinations of independent normals

Suppose X1, . . . , XK are independent, Xk ∼ N(µk, σk²). Consider the affine transformation

Y = a + b1X1 + · · · + bKXK.   (4.41)

It is straightforward, from Section 2.2.2, to see that

E[Y] = a + b1µ1 + · · · + bKµK   and   Var[Y] = b1²σ1² + · · · + bK²σK²,   (4.42)

since independence implies that all the covariances are 0. But those equations do not give the entire distribution of Y. We need the mgf:

MY(t) = E[e^{tY}]
      = E[e^{t(a + b1X1 + ··· + bKXK)}]
      = e^{at} E[e^{(tb1)X1}] × · · · × E[e^{(tbK)XK}]   by independence
      = e^{at} M(tb1 | µ1, σ1²) × · · · × M(tbK | µK, σK²)
      = e^{at} e^{tb1µ1 + t²b1²σ1²/2} · · · e^{tbKµK + t²bK²σK²/2}
      = e^{t(a + b1µ1 + ··· + bKµK) + t²(b1²σ1² + ··· + bK²σK²)/2}
      = M(t | a + b1µ1 + · · · + bKµK, b1²σ1² + · · · + bK²σK²),                    (4.43)


by (4.36). Notice that in going from the third to fourth step, we have changed the variable for the mgfs from t to the bkt's, which is legitimate. The final mgf in (4.43) is indeed the mgf of a normal, with the appropriate mean and variance, i.e.,

Y ∼ N(a + b1µ1 + · · · + bKµK, b1²σ1² + · · · + bK²σK²).   (4.44)

If the Xk's are iid N(µ, σ²), then (4.44) can be used to show that

Σ Xk ∼ N(Kµ, Kσ²)   and   X̄ ∼ N(µ, σ²/K).   (4.45)

4.3.4 Normalized means

The central limit theorem, which we present in Section 8.6, is central to statistics as it justifies using the normal distribution in certain non-normal situations. Here we assume we have X1, . . . , Xn iid with any distribution, as long as the mgf M(t) of Xi exists for t in a neighborhood of 0. The normalized mean is

Wn = √n (X̄ − µ)/σ,   (4.46)

where µ = E[Xi] and σ² = Var[Xi], so that E[Wn] = 0 and Var[Wn] = 1. If the Xi are normal, then Wn ∼ N(0, 1). The central limit theorem implies that even if the Xi's are not normal, Wn is "approximately" normal if n is "large." We will look at the cumulant generating function of Wn, and compare it to the normal's.

First, the mgf of Wn:

Mn(t) = E[e^{tWn}]
      = E[e^{t√n(X̄ − µ)/σ}]
      = e^{−t√nµ/σ} E[e^{(t/(√nσ)) Σ Xi}]
      = e^{−t√nµ/σ} Π E[e^{(t/(√nσ)) Xi}]
      = e^{−t√nµ/σ} M(t/(√nσ))^n.                    (4.47)

Letting c(t) be the cumulant generating function of Xi, and cn(t) be that of Wn, we have that

cn(t) = −t√n µ/σ + n c(t/(√n σ)).   (4.48)

We know that c′n(0) is the mean, which is 0 in this case, and c′′n(0) is the variance, which here is 1. For higher derivatives, the first term in (4.48) vanishes, and each derivative brings out another 1/(√n σ) from c(t), hence the kth cumulant of Wn is

c_n^{(k)}(0) = n c^{(k)}(0) / (√n σ)^k.   (4.49)

Letting γk be the kth cumulant of Xi, and γn,k be that of Wn,

γn,k = γk / (n^{k/2−1} σ^k),   k = 3, 4, . . . .   (4.50)


For the normal, all cumulants are 0 after the first two. Thus the closer the γn,k's are to 0, the closer the distribution of Wn is to N(0, 1) in some sense. There are then two factors: The larger n, the closer these cumulants are to 0 for k > 2. Also, the smaller γk/σ^k in absolute value, the closer to 0.

For example, if the Xi's are Exponential(1), or Gamma(1,1), then from (2.79) the cumulants are γk = (k − 1)!, and γn,k = (k − 1)!/n^{k/2−1}. The Poisson(λ) has cumulant generating function c(t) = λ(e^t − 1), so all its derivatives are λe^t, hence γk = λ, and

γn,k = (1/n^{k/2−1}) (1/σ^k) γk = (1/n^{k/2−1}) (1/λ^{k/2}) λ = 1/(nλ)^{k/2−1}.   (4.51)

The Laplace has pdf (1/2)e^{−|x|} for x ∈ R. Its cumulants are

γk = 0 if k is odd;   γk = 2(k − 1)! if k is even.   (4.52)

Thus the variance is 2, and

γn,k = 0 if k is odd;   γn,k = (k − 1)!/(2n)^{k/2−1} if k is even.   (4.53)

Here is a small table with some values of γn,k:

                                                   Poisson
  k      n    Normal   Exponential   λ = 1/10    λ = 1    λ = 10    Laplace
  3      1       0        2.000        3.162     1.000     0.316      0.000
        10       0        0.632        1.000     0.316     0.100      0.000
       100       0        0.200        0.316     0.100     0.032      0.000
  4      1       0        6.000       10.000     1.000     0.100      3.000
        10       0        0.600        1.000     0.100     0.010      0.300
       100       0        0.060        0.100     0.010     0.001      0.030
  5      1       0       24.000       31.623     1.000     0.032      0.000
        10       0        0.759        1.000     0.032     0.001      0.000
       100       0        0.024        0.032     0.001     0.000      0.000
                                                                      (4.54)

For each distribution, as n increases, the cumulants do decrease. Also, the exponential is closer to normal than the Poisson(1/10), but the Poisson(1) is closer than the exponential, and the Poisson(10) is even closer. The Laplace is symmetric, so its odd cumulants are 0, automatically making it relatively close to normal. Its kurtosis is a bit worse than the Poisson(1), however.
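The entries of (4.54) come straight from (4.50) and the stated cumulants. Here is a sketch of mine (the helper name gamma_nk is my own) that regenerates a few columns of the table.

```python
from math import factorial

def gamma_nk(k, n, gamma_k, sigma2):
    """Normalized cumulant (4.50): gamma_k / (n^(k/2 - 1) * sigma^k), with sigma^k = (sigma^2)^(k/2)."""
    return gamma_k / (n ** (k / 2 - 1) * sigma2 ** (k / 2))

for k in (3, 4, 5):
    for n in (1, 10, 100):
        expo = gamma_nk(k, n, factorial(k - 1), 1.0)                         # Exponential(1): gamma_k = (k-1)!, var = 1
        pois = gamma_nk(k, n, 1.0, 1.0)                                      # Poisson(1): gamma_k = lambda = 1, var = 1
        lapl = gamma_nk(k, n, 0 if k % 2 else 2 * factorial(k - 1), 2.0)     # Laplace: (4.52), var = 2
        print(k, n, round(expo, 3), round(pois, 3), round(lapl, 3))
```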

4.3.5 Bernoulli and binomial

The distribution of a random variable that takes on just the values 0 and 1 is completely specified by giving the probability it is 1. Such a variable is called Bernoulli.

Definition 4.2. If Z has space {0, 1}, then it is Bernoulli with parameter p = P[Z = 1], written

Z ∼ Bernoulli(p). (4.55)


The pmf can then be written

f(z) = p^z (1 − p)^{1−z}.   (4.56)

Note that Bernoulli(p) = Binomial(1, p). Rather than defining the binomial through its pmf, as in Table 1.2 on page 9, a better alternative is to base the definition on Bernoullis.

Definition 4.3. If Z1, . . . , Zn are iid Bernoulli(p), then X = Z1 + · · · + Zn is binomial with parameters n and p, written

X ∼ Binomial(n, p). (4.57)

The binomial counts the number of successes in n independent trials, where the Bernoullis represent the individual trials, with a "1" indicating success. The moments, and mgf, of a Bernoulli are easy to find. In fact, because Z^k = Z for k = 1, 2, . . .,

E[Z] = E[Z²] = 0 × (1 − p) + 1 × p = p   ⇒   Var[Z] = p − p² = p(1 − p),   (4.58)

and

MZ(t) = E[e^{tZ}] = e^0 (1 − p) + e^t p = p e^t + 1 − p.   (4.59)

Thus for X ∼ Binomial(n, p),

E[X] = nE[Zi] = np, Var[X] = nVar[Zi] = np(1− p), (4.60)

and

MX(t) = E[e^{t(Z1+···+Zn)}] = E[e^{tZ1}] · · · E[e^{tZn}] = MZ(t)^n = (p e^t + 1 − p)^n.   (4.61)

This mgf is the same as we found in (2.83), meaning we really are defining the same binomial.

4.4 Exercises

Exercise 4.4.1. Suppose (X, Y) has space {1, 2, 3} × {1, 2, 3} and pmf f(x, y) = (x + y)/c for some constant c. (a) Are X and Y independent? (b) What is the constant c? (c) Let W = X + Y. Find fW(w), the pmf of W.

Exercise 4.4.2. If x and y are vectors of length n, then Kendall's distance between x and y measures the extent to which they have a positive relationship in the sense that larger values of x go with larger values of y. It is defined by

d(x, y) = ΣΣ_{1 ≤ i < j ≤ n} I[(xi − xj)(yi − yj) < 0].   (4.62)

The idea is to plot the points, and draw a line segment between each pair of points (xi, yi) and (xj, yj). If any segment has a negative slope, then the x's and y's for that pair go in the wrong direction. (Look ahead to Figure 18.1 on page 308 for an illustration.) Kendall's distance then counts the number of such pairs. If d(x, y) = 0, then the plot shows a nondecreasing pattern. Worst is if d(x, y) = n(n − 1)/2, since then all pairs go in the wrong direction.


Suppose Y is uniformly distributed over the permutations of the integers 1, 2, 3, that is, Y has space

Y = {(1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1)},   (4.63)

and P(Y = y) = 1/6 for each y ∈ Y. (a) Find the pmf of U = d((1, 2, 3), Y). [Write out the d((1, 2, 3), y)'s for the y's in Y.] (b) Kendall's τ is defined to be T = 1 − 4U/(n(n − 1)). It normalizes Kendall's distance so that it acts like a correlation coefficient, i.e., T = −1 if x and y have an exact negative relationship, and T = 1 if they have an exact positive relationship. In this problem, n = 3. Find the space of T, and its pmf.

Exercise 4.4.3. Continue with the Y and U from Exercise 4.4.2, where n = 3. We can write

U = d((1, 2, 3), y) = Σ_{i=1}^{2} Σ_{j=i+1}^{3} I[−(yi − yj) < 0] = Σ_{i=1}^{2} Σ_{j=i+1}^{3} I[yi > yj].   (4.64)

Then

U = U1 + U2,   (4.65)

where Ui = Σ_{j=i+1}^{3} I[yi > yj]. (a) Find the space and pmf of (U1, U2). Are U1 and U2 independent? (b) What are the marginal distributions of U1 and U2?

Exercise 4.4.4. Suppose U and V are independent with the same Geometric(p) distribution. (So they have space the nonnegative integers and pmf f(u) = p(1 − p)^u.) Let T = U + V. (a) What is the space of T? (b) Find the pmf of T using convolutions as in (4.11). [The answer should be (1 + t)p²(1 − p)^t. This is a negative binomial distribution, as we will see in Exercise 4.4.15.]

Exercise 4.4.5. Suppose U1 is Discrete Uniform(0, 1) and U2 is Discrete Uniform(0, 2), where U1 and U2 are independent. (a) Find the moment generating function, MU1(t). (b) Find the moment generating function, MU2(t). (c) Now let U = U1 + U2. Find the moment generating function of U, MU(t), by multiplying the two individual mgfs. (d) Use the mgf to find the pmf of U. (It is given in (4.6).) Does this U have the same distribution as that in Exercises 4.4.2 and 4.4.3?

Exercise 4.4.6. Define the 3 × 1 random vector Z to be Trinoulli(p1, p2, p3) if it has space Z = {(1, 0, 0), (0, 1, 0), (0, 0, 1)} and pmf

fZ(1, 0, 0) = p1,   fZ(0, 1, 0) = p2,   fZ(0, 0, 1) = p3,   (4.66)

where the pi > 0 and p1 + p2 + p3 = 1. (a) Find the mgf of Z, MZ(t1, t2, t3). (b) If Z1, . . . , Zn are iid Trinoulli(p1, p2, p3), then X = Z1 + · · · + Zn is Trinomial(n, (p1, p2, p3)). What is the space X of X? (c) What is the mgf of X? This mgf is that of which distribution we have seen before?

Exercise 4.4.7. Let X = (X1, X2, X3, X4) ∼ Multinomial(n, (p1, p2, p3, p4)). Find the mgf of Y = (X1, X2, X3 + X4). What is the distribution of Y?

Exercise 4.4.8. Suppose (X1, X2) has space (a, b) × (c, d) and pdf f(x1, x2). Show that the limits of integration for the convolution (4.19) ∫ f(x1, y − x1) dx1 are max{a, y − d} < x1 < min{b, y − c}.


Exercise 4.4.9. Let X1 and X2 be iid N(0,1)'s, and set Y = X1 + X2. Using the convolution formula (4.19), show that Y ∼ N(0, 2). [Hint: In the exponent of the integrand, complete the square with respect to x1, then note you are integrating over what looks like a N(0, 1/2) pdf.]

Exercise 4.4.10. Suppose Z has the distribution function F(z), pdf f(z), and mgf M(t). Let X = µ + σZ, where µ ∈ R and σ > 0. Thus we have a location-scale family, as in Definition 4.1 on page 55. (a) Show that the distribution function of X is F((x − µ)/σ), as in (4.32). (b) Show that the pdf of X is (1/σ) f((x − µ)/σ), as in (4.33). (c) Show that the mgf of X is exp(tµ)M(tσ), as in (4.34).

Exercise 4.4.11. Show that the mgf of a N(µ, σ²) is exp(tµ + (1/2)t²σ²), as in (4.36). For what values of t is the mgf finite?

Exercise 4.4.12. Consider the location-scale family based on Z. Suppose Z has finite skewness κ3 in (2.56) and kurtosis κ4 in (2.57). Show that X = µ + σZ has the same skewness and kurtosis as Z as long as σ > 0. [Hint: Show that E[(X − µ)^k] = σ^k E[Z^k].]

Exercise 4.4.13. In Exercise 2.7.12, we found that the Laplace has mgf 1/(1 − t²) for |t| < 1. Suppose U and V are independent Exponential(1), and let Y = U − V. Find the mgf of Y. What is the distribution of Y? [Hint: Look back at Section 3.3.1.]

Exercise 4.4.14. Suppose X1, . . . , XK are iid N(µ, σ²). Use (4.44) to show that (4.45) holds, i.e., that Σ Xk ∼ N(Kµ, Kσ²) and X̄ ∼ N(µ, σ²/K).

Exercise 4.4.15. Consider a coin with probability of heads being p, and flip it a number of times independently, until you see a heads. Let Z be the number of tails flipped before the first head. Then Z ∼ Geometric(p). Exercise 2.7.2 showed that the mgf of Z is MZ(t) = p(1 − e^t(1 − p))^{−1} for t's such that it is finite. Suppose Z1, . . . , ZK are iid Geometric(p), and Y = Z1 + · · · + ZK. Then Y is the number of tails before the Kth head. It is called Negative Binomial(K, p). (a) What is the mgf of Y, MY(t)? (b) For α > 0 and 0 < x < 1, define

1F0(α ; − ; x) = Σ_{y=0}^{∞} [Γ(y + α)/Γ(α)] x^y/y!.   (4.67)

It can be shown using a Taylor series that 1F0(α ; − ; x) = (1 − x)^{−α}. (Such functions arise again in Exercise 7.8.21. Exercise 2.7.18(c) used 1F0(2 ; − ; x).) We can then write MY(t) = p^c 1F0(α ; − ; x) for what c, α, and x? (c) Table 1.2 gives the pmf of Y to be

fY(y) = (y + K − 1 choose K − 1) p^K (1 − p)^y.   (4.68)

Find the mgf of this fY to verify that it is the pmf for the negative binomial as defined in this exercise. [Hint: Write the binomial coefficient out, and replace two of the factorials with Γ functions.] (d) Show that for K > 1,

δ(y) ≡ (K − 1)/(y + K − 1)   (4.69)

has E[δ(Y)] = p, so is an unbiased estimator of p. [Hint: Write down the Σ δ(y) fY(y), then factor out a p and note that you have the pmf of another negative binomial.]


Exercise 4.4.16. Let F be the distribution function of the random variable X, which may not be continuous. Take u ∈ (0, 1). The goal is to show (4.31), that P[F(X) ≤ u] ≤ u. Let x∗ = F−(u) = min{x | F(x) ≥ u} as in (4.27). (a) Suppose F(x) is continuous at x = x∗. Show that F(x∗) = u, and F(x) ≤ u if and only if x ≤ x∗. [It helps to draw a picture of the distribution function.] Thus P[F(X) ≤ u] = P[X ≤ x∗] = F(x∗) = u. (b) Suppose F(x) is not continuous at x = x∗. Show that P[X ≥ x∗] ≥ 1 − u, and F(x) ≤ u if and only if x < x∗. Thus P[F(X) ≤ u] = P[X < x∗] = 1 − P[X ≥ x∗] ≤ 1 − (1 − u) = u.

Exercise 4.4.17. Let (X, Y) be uniform over the unit disk: The space is W = {(x, y) | x² + y² < 1}, and f(x, y) = 1/π for (x, y) ∈ W. Let U = X². (a) Show that the distribution function of U can be written as

FU(u) = (4/π) ∫_0^{√u} √(1 − x²) dx   (4.70)

for u ∈ (0, 1). (b) Find the pdf of U. It should be a Beta(α, β). What are α and β?

Exercise 4.4.18. Suppose X1, X2, . . . , Xn are independent, all with the same continuous distribution function F(x) and space X. Let Y be their maximum:

Y = max{X1, X2, . . . , Xn},   (4.71)

and FY(y) be the distribution function of Y. (a) What is the space of Y? (b) Explain why Y ≤ y if and only if X1 ≤ y & X2 ≤ y & · · · & Xn ≤ y. (c) Explain why P[Y ≤ y] = P[X1 ≤ y] × P[X2 ≤ y] × · · · × P[Xn ≤ y]. (d) Thus we can write FY(y) = u^a for some u and a. What are u and a?

For the rest of this exercise, suppose the Xi's above are Uniform(0, 1) (and still independent). (e) For 0 < x < 1, what is F(x)? (f) What is FY(y), in terms of y and n? (g) What is fY(y), the pdf of Y? This is the pdf of what distribution? (Give the name and the parameters.)

Exercise 4.4.19. In the following questions, X and Y are independent. Yes or no? (a) If X ∼ Gamma(α, λ) and Y ∼ Gamma(β, λ), where α ≠ β, is X + Y gamma? (b) If X ∼ Gamma(α, λ) and Y ∼ Gamma(α, δ), where λ ≠ δ, is X + Y gamma? (c) If X ∼ Poisson(λ) and Y ∼ Poisson(δ), is X + Y Poisson? (d) If X and Y are Exponential(λ), is X + Y exponential? (e) If X ∼ Binomial(n, p) and Y ∼ Binomial(n, q), where p ≠ q, is X + Y binomial? (f) If X ∼ Binomial(n, p) and Y ∼ Binomial(m, p), where n ≠ m, is X + Y binomial? (g) If X and Y are Laplace, is X + Y Laplace?


Chapter 5

Transformations: Jacobians

In Section 4.1, we saw that finding the pmfs of transformed variables when we start with discrete variables is possible by summing the appropriate probabilities. In the one-to-one case, it is even easier. That is, if g(x) is one-to-one, then the pmf of Y = g(X) is fY(y) = fX(g−1(y)). In the continuous case, it is not as straightforward. The problem is that for a pdf, fX(x) is not P[X = x], which is 0. Rather, for a small area A ⊂ X that contains x,

P[X ∈ A] ≈ fX(x)×Area(A). (5.1)

Now suppose g : X → Y is one-to-one and onto. Then for y ∈ B ⊂ Y ,

P[Y ∈ B] ≈ fY(y)×Area(B). (5.2)

Since

P[Y ∈ B] = P[X ∈ g−1(B)] ≈ fX(g−1(y))×Area(g−1(B)), (5.3)

where

g−1(B) = {x ∈ X | g(x) ∈ B},   (5.4)

we find that

fY(y) ≈ fX(g−1(y)) × Area(g−1(B))/Area(B).   (5.5)

Compare this equation to (4.3). For continuous distributions, we need to take care of the transformation of areas as well as the transformation of Y itself. The actual pdf of Y is found by shrinking the B in (5.5) down to the point y.

5.1 One dimension

When X and Y are random variables, the ratio of areas in (5.5) is easy to find. Because g and g−1 are one-to-one, they must be either strictly increasing or strictly decreasing. For y ∈ Y, take ε small enough that Bε ≡ [y, y + ε) ⊂ Y. Then, since the area of an


interval is just its length,

Area(g−1(Bε))/Area(Bε) = |g−1(y + ε) − g−1(y)|/ε → |(∂/∂y) g−1(y)|   as ε → 0.   (5.6)

(The absolute value is there in case g is decreasing.) That derivative is called the Jacobian of the transformation g−1. That is,

fY(y) = fX(g−1(y)) |Jg−1(y)|,   where   Jg−1(y) = (∂/∂y) g−1(y).   (5.7)

This approach needs a couple of assumptions: y is in the interior of Y, and the derivative of g−1(y) exists.

Reprising Example 4.3.1, where X ∼ Uniform(0, 1) and Y = − log(X), we have Y = (0, ∞), and g−1(y) = e^{−y}. Then

fX(e^{−y}) = 1 and Jg−1(y) = (∂/∂y) e^{−y} = −e^{−y}   =⇒   fY(y) = 1 × |−e^{−y}| = e^{−y},   (5.8)

which is indeed the answer from (4.38).

5.2 General case

For the general case (for vectors), we need to figure out the ratio of the volumes, which is again given by the Jacobian, but for vectors.

Definition 5.1. Suppose g : X → Y is one-to-one and onto, where both X and Y are open subsets of Rp, and all the first partial derivatives of g−1(y) exist. Then the Jacobian of the transformation g−1 is defined to be

Jg−1 : Y → R,

            | ∂g−1_1(y)/∂y1   ∂g−1_1(y)/∂y2   · · ·   ∂g−1_1(y)/∂yp |
            | ∂g−1_2(y)/∂y1   ∂g−1_2(y)/∂y2   · · ·   ∂g−1_2(y)/∂yp |
Jg−1(y) =   |       ·                ·          ·            ·      |        (5.9)
            | ∂g−1_p(y)/∂y1   ∂g−1_p(y)/∂y2   · · ·   ∂g−1_p(y)/∂yp |

where g−1(y) = (g−1_1(y), . . . , g−1_p(y)), and here the "| · |" represents the determinant.

The next theorem is from advanced calculus.

Theorem 5.2. Suppose the conditions in Definition 5.1 hold, and Jg−1(y) is continuous and non-zero for y ∈ Y. Then

fY(y) = fX(g−1(y)) × |Jg−1(y)|.   (5.10)

Here the "| · |" represents the absolute value.


If you think of g−1 as x, you can remember the formula as

fY(y) = fX(x) × |dx/dy|.   (5.11)

5.3 Gamma, beta, and Dirichlet distributions

Suppose X1 and X2 are independent, with

X1 ∼ Gamma(α, λ) and X2 ∼ Gamma(β, λ), (5.12)

so that they have the same rate but possibly different shapes. We are interested in

Y1 = X1/(X1 + X2).   (5.13)

This variable arises, e.g., in linear regression, where R² has that distribution under certain conditions. The function taking (x1, x2) to y1 is not one-to-one. To fix that up, we introduce another variable, Y2, so that the function to the pair (y1, y2) is one-to-one. Then to find the pdf of Y1, we integrate out Y2. We will take

Y2 = X1 + X2. (5.14)

Then

X = (0, ∞) × (0, ∞) and Y = (0, 1) × (0, ∞).   (5.15)

To find g−1, solve the equation y = g(x) for x:

y1 = x1/(x1 + x2),  y2 = x1 + x2   =⇒   x1 = y1y2,  x2 = y2 − x1   =⇒   x1 = y1y2,  x2 = y2(1 − y1),   (5.16)

hence

g−1(y1, y2) = (y1y2, y2(1 − y1)).   (5.17)

Using this inverse, we can see that indeed the function is onto the Y in (5.15), since any y1 ∈ (0, 1) and y2 ∈ (0, ∞) will yield, via (5.17), x1 and x2 in (0, ∞). For the Jacobian:

Jg−1(y) = | ∂(y1y2)/∂y1          ∂(y1y2)/∂y2        |
          | ∂(y2(1 − y1))/∂y1    ∂(y2(1 − y1))/∂y2  |

        = |  y2      y1     |
          | −y2    1 − y1   |

        = y2(1 − y1) + y1y2
        = y2.                    (5.18)

Now because the Xi's are independent gammas, their pdf is

fX(x1, x2) = [λ^α/Γ(α)] x1^{α−1} e^{−λx1} × [λ^β/Γ(β)] x2^{β−1} e^{−λx2}.   (5.19)


Then the pdf of Y is

fY(y) = fX(g−1(y)) × |Jg−1(y)|
      = fX(y1y2, y2(1 − y1)) × |y2|
      = [λ^α/Γ(α)] (y1y2)^{α−1} e^{−λ y1 y2} × [λ^β/Γ(β)] (y2(1 − y1))^{β−1} e^{−λ y2 (1 − y1)} × |y2|
      = [λ^{α+β}/(Γ(α)Γ(β))] y1^{α−1} (1 − y1)^{β−1} y2^{α+β−1} e^{−λ y2}.                    (5.20)

To find the pdf of Y1, we can integrate out y2:

fY1(y1) = ∫_0^∞ fY(y1, y2) dy2.   (5.21)

That certainly is a fine approach. But in this case, note that the joint pdf in (5.20) can be factored into a function of just y1, and a function of just y2. That fact, coupled with the fact that the space is a rectangle, means that by Lemma 3.4 on page 46 we automatically have that Y1 and Y2 are independent.

If we look at the y2 part of fY(y) in (5.20), we see that it looks like a Gamma(α + β, λ) pdf. We just need to multiply & divide by the appropriate constant:

fY(y) = [ (λ^{α+β}/(Γ(α)Γ(β))) y1^{α−1} (1 − y1)^{β−1} (Γ(α + β)/λ^{α+β}) ] [ (λ^{α+β}/Γ(α + β)) y2^{α+β−1} e^{−λ y2} ]

      = [ (Γ(α + β)/(Γ(α)Γ(β))) y1^{α−1} (1 − y1)^{β−1} ] [ (λ^{α+β}/Γ(α + β)) y2^{α+β−1} e^{−λ y2} ].                    (5.22)

So in the first line, we have separated the gamma pdf from the rest, and in the second line simplified the y1 part of the density. Thus that part is the pdf of Y1, which from Table 1.1 on page 7 is the Beta(α, β) pdf. Note that we have surreptitiously also proven that

∫_0^1 y1^{α−1} (1 − y1)^{β−1} dy1 = Γ(α)Γ(β)/Γ(α + β) ≡ β(α, β),   (5.23)

which is the beta function.

To summarize, we have shown that X1/(X1 + X2) and X1 + X2 are independent (even though they do not really look independent), and that

X1/(X1 + X2) ∼ Beta(α, β)   and   X1 + X2 ∼ Gamma(α + β, λ).   (5.24)

That last fact we already knew, from Example 4.3.2. Also, notice that the beta variable does not depend on the rate λ.
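A simulation sketch of (5.24), my own and not from the text (the parameter values and seed are arbitrary): draw independent gammas with a common rate, check that the ratio looks Beta(α, β), and check that it is essentially uncorrelated with the sum. Full independence of course takes more than a correlation check, but it is a start.

```python
import numpy as np
from scipy.stats import beta, kstest

rng = np.random.default_rng(3)
alpha, bta, lam = 2.0, 5.0, 1.7
x1 = rng.gamma(shape=alpha, scale=1 / lam, size=100_000)
x2 = rng.gamma(shape=bta, scale=1 / lam, size=100_000)

ratio, total = x1 / (x1 + x2), x1 + x2
print(kstest(ratio, beta(alpha, bta).cdf))   # consistent with Beta(alpha, beta)
print(np.corrcoef(ratio, total)[0, 1])       # near 0, as independence requires
```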

5.3.1 Dirichlet distribution

The Dirichlet distribution is a multivariate version of the beta. We start with the independent random variables X1, . . . , XK, K > 1, where

Xk ∼ Gamma(αk, 1). (5.25)


Then the K − 1 vector Y defined via

Yk = Xk/(X1 + · · · + XK),   k = 1, . . . , K − 1,   (5.26)

has a Dirichlet distribution, written

Y ∼ Dirichlet(α1, . . . , αK). (5.27)

There is also a YK, which may come in handy, but because Y1 + · · · + YK = 1, it is redundant. Also, as in the beta, the definition is the same if the Xk's are Gamma(αk, λ).

The representation (5.26) makes it easy to find the marginals, i.e.,

Yk ∼ Beta(αk, α1 + · · ·+ αk−1 + αk+1 + · · ·+ αK), (5.28)

hence the marginal means and variances. Sums of the Yk's are also beta, e.g., if K = 4, then

Y1 + Y3 = (X1 + X3)/(X1 + X2 + X3 + X4) ∼ Beta(α1 + α3, α2 + α4).   (5.29)

The space of Y is

Y = {y ∈ R^{K−1} | 0 < yk < 1, k = 1, . . . , K − 1, and y1 + · · · + yK−1 < 1}.   (5.30)

To find the pdf of Y, we need a one-to-one transformation from X, so we need to append another function of X to the Y. The easiest choice is

W = X1 + . . . + XK , (5.31)

so that g(x) = (y, w). Then g−1 is given by

x1 = wy1;
x2 = wy2;
  ...
xK−1 = wyK−1;
xK = w(1 − y1 − · · · − yK−1).                    (5.32)

It can be shown that the determinant of the Jacobian is w^{K−1}. Exercise 5.6.3 illustrates the calculations for K = 4. The joint pdf of (Y, W) is

f(Y,W)(y, w) = [ ∏_{k=1}^{K} 1/Γ(αk) ] [ ∏_{k=1}^{K−1} yk^{αk−1} ] (1 − y1 − · · · − yK−1)^{αK−1} w^{α1+···+αK−1} e^{−w}.   (5.33)

The joint space of (Y, W) is Y × (0, ∞); this, together with the factorization of the density in (5.33), means that Lemma 3.4 can be used to show that Y and W are independent. In addition, W ∼ Gamma(α1 + · · · + αK, 1), which can be seen either by looking at the w-part in (5.33), or by noting that it is the sum of independent gammas with the same rate parameter. Either way, we then have that the constant that goes with the w-part is 1/Γ(α1 + · · · + αK), hence the pdf of Y must be

fY(y) = [Γ(α1 + · · · + αK)/(Γ(α1) · · · Γ(αK))] y1^{α1−1} · · · yK−1^{αK−1−1} (1 − y1 − · · · − yK−1)^{αK−1}.   (5.34)

If K = 2, then this is a beta pdf. Thus the Dirichlet is indeed an extension of the beta, and

Beta(α, β) = Dirichlet(α, β).   (5.35)
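The gamma representation (5.26) is also how Dirichlet variables are typically simulated. The sketch below is mine (the parameters (2, 3, 4) and the seed are arbitrary); it builds a Dirichlet(2, 3, 4) sample from gammas and checks the first marginal against Beta(2, 7), as (5.28) says it should be.

```python
import numpy as np
from scipy.stats import beta, kstest

rng = np.random.default_rng(4)
alphas = np.array([2.0, 3.0, 4.0])                         # K = 3
g = rng.gamma(shape=alphas, scale=1.0, size=(100_000, 3))  # independent Gamma(alpha_k, 1)'s
y = g / g.sum(axis=1, keepdims=True)                       # rows are Dirichlet(2, 3, 4)

print(kstest(y[:, 0], beta(2.0, 7.0).cdf))                 # Y1 ~ Beta(2, 3 + 4), as in (5.28)
print(y.sum(axis=1)[:3])                                   # each row sums to 1
```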


5.4 Affine transformations

Suppose X is 1× p, and for given 1× p vector a and p× p matrix B, let

Y = g(X) = a + XB′. (5.36)

In order for this transformation to be one-to-one, we need that B is invertible, which we will assume, so that

x = g−1(y) = (y− a)(B′)−1. (5.37)

For a matrix C and vector z, with w = zC′, it is not hard to see that

∂wi/∂zj = cij,   (5.38)

the ijth element of C. Thus from (5.37), since the "a" part is a constant, the Jacobian is

Jg−1 (y) = |B−1| = |B|−1. (5.39)

The invertibility of B ensures that the absolute value of this Jacobian is between 0 and ∞.

5.4.1 Bivariate normal distribution

We apply this last result to the normal, where X1 and X2 are iid N(0, 1), a is a 1 × 2 vector, and B is a 2 × 2 invertible matrix. Then Y is given as in (5.36), with p = 2. The mean and covariance matrix for X are

E[X] = 02 and Cov[X] = I2, (5.40)

so that from (2.47) and (2.53),

µ ≡ E[Y] = a and Σ ≡ Cov[Y] = BCov[X]B′ = BB′. (5.41)

The space of X, hence of Y, is R². To find the pdf of Y, we start with that of X:

fX(x) = ∏_{i=1}^{2} (1/√(2π)) e^{−xi²/2} = (1/(2π)) e^{−xx′/2}.   (5.42)

Then

fY(y) = fX((y − a)(B′)^{−1}) abs(|B|^{−1})
      = (1/(2π)) e^{−(1/2)(y−a)(B′)^{−1}((y−a)(B′)^{−1})′} abs(|B|)^{−1}.   (5.43)

Using (5.41), we can write

(y − a)(B′)^{−1}((y − a)(B′)^{−1})′ = (y − µ)(BB′)^{−1}(y − µ)′ = (y − µ)Σ^{−1}(y − µ)′,   (5.44)

and using properties of determinants,

abs(|B|) = √(|B| |B′|) = √|BB′| = √|Σ|,   (5.45)


hence the pdf of Y can be given as a function of the mean and covariance matrix:

fY(y) = (1/(2π)) (1/√|Σ|) e^{−(1/2)(y−µ)Σ^{−1}(y−µ)′}.   (5.46)

This Y is bivariate normal, with mean µ and covariance matrix Σ, written very much in the same way as the regular normal,

Y ∼ N(µ, Σ).   (5.47)

In particular, the X here is

X ∼ N(02, I2).   (5.48)

The mgf of a bivariate normal is not hard to find given that of the X. Because X1 and X2 are iid N(0, 1), their mgf is

MX(t) = e^{t1²/2} e^{t2²/2} = e^{tt′/2},   (5.49)

where here t is 1 × 2. Then with Y = a + XB′,

MY(t) = E[e^{Yt′}]
      = E[e^{(a + XB′)t′}]
      = e^{at′} E[e^{X(tB)′}]
      = e^{at′} MX(tB)
      = e^{at′} e^{tB(tB)′/2}
      = e^{at′} e^{tBB′t′/2}
      = e^{µt′ + tΣt′/2},                    (5.50)

because a = µ and BB′ = Σ by (5.41). Compare this mgf to that of the regular normal, in (4.36).

In Chapter 7 we will deal with the general p-dimensional multivariate normal, proceeding exactly the same way as above, except the matrices and vectors have more elements.
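The construction Y = a + XB′ is exactly how one simulates a bivariate normal with a given Σ: factor Σ = BB′, for instance with a Cholesky decomposition, and transform iid standard normals. A sketch of mine (the particular µ and Σ are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(5)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

B = np.linalg.cholesky(Sigma)              # Sigma = B B'
X = rng.standard_normal(size=(100_000, 2)) # rows are iid N(0, I2)
Y = mu + X @ B.T                           # rows are N(mu, Sigma), as in (5.47)

print(Y.mean(axis=0))                      # close to mu
print(np.cov(Y, rowvar=False))             # close to Sigma
```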

5.4.2 Orthogonal transformations and polar coordinates

A p× p orthogonal matrix is a matrix Γ such that

Γ′Γ = ΓΓ′ = Ip. (5.51)

An orthogonal matrix has orthonormal columns (and orthonormal rows), that is, with Γ = (γ1, . . . , γp),

‖γi‖ = 1 and γi′γj = 0 if i ≠ j.   (5.52)

An orthogonal transformation of X is then

Y = ΓX (5.53)


[Figure 5.1: The vector x = 1.5(cos(θ), sin(θ))′ is rotated φ radians by the orthogonal matrix Γ.]

for some orthogonal matrix Γ. This transformation rotates Y about zero, that is, the length of Y is the same as that of X, because

‖Γz‖2 = z′Γ′Γz = z′z = ‖z‖2, (5.54)

but its orientation is different. The Jacobian is ±1:

|Γ′Γ| = |Ip| = 1 =⇒ |Γ|² = 1.   (5.55)

When p = 2, the orthogonal matrices can be parametrized by the angle. More specifically, the set of all 2 × 2 orthogonal matrices equals

{ ( cos(φ)   sin(φ) )                 }       { ( cos(φ)  −sin(φ) )                 }
{ ( sin(φ)  −cos(φ) ) | φ ∈ [0, 2π)   }   ∪   { ( sin(φ)   cos(φ) ) | φ ∈ [0, 2π)   }.        (5.56)

Note that the first set of matrices in (5.56) has determinant −1, and the second set has determinant 1.

To see the effect on the 2 × 1 vector x = (x1, x2)′, first find the polar coordinates for x1 and x2:

x1 = r cos(θ) and x2 = r sin(θ), where r ≥ 0 and θ ∈ [0, 2π). (5.57)

Then taking an orthogonal matrix Γ from the second set in (5.56), which are called rotations, we have

Γx = ( cos(φ)  −sin(φ) )  r ( cos(θ) )  =  r ( cos(θ + φ) )
     ( sin(φ)   cos(φ) )    ( sin(θ) )       ( sin(θ + φ) ).        (5.58)

Thus if x is at an angle of θ to the x-axis, then Γ rotates the vector by the angle φ, keeping the length of the vector constant. Figure 5.1 illustrates the rotation of x with polar coordinates r = 1.5 and θ = π/6. The orthogonal matrix in (5.58) uses φ = π/4. As φ goes from 0 to 2π, the vector Γx makes a complete rotation about the origin.

The first set of matrices in (5.56) are called reflections. They first flip the sign ofx2, then rotate by the angle φ.


5.4.3 Spherically symmetric pdfs

A p × 1 random vector X has a spherically symmetric distribution if for any orthogonal Γ,

X =D ΓX, (5.59)

meaning X and ΓX have the same distribution. Suppose X is spherically symmetric and has a pdf f_X(x). This density can be taken to be spherically symmetric as well, that is,

fX(x) = fX(Γx) for all x ∈ X and orthogonal Γ. (5.60)

We will look more at the p = 2 case. Exercise 5.6.10 shows that (5.60) implies that there is a function h(r) for r ≥ 0 such that

fX(x) = h(‖x‖). (5.61)

For example, suppose X_1 and X_2 are independent N(0, 1)'s. Then from Table 1.1, the pdf of X = (X_1, X_2)′ is

f_X(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x_1^2} \times \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x_2^2} = \frac{1}{2\pi} e^{-\frac{1}{2}(x_1^2 + x_2^2)}, (5.62)

hence we can take

h(r) = \frac{1}{2\pi} e^{-\frac{1}{2}r^2} (5.63)

in (5.61).

Consider the distribution of the polar coordinates (R, Θ) = g(X_1, X_2) where

g(x_1, x_2) = (‖x‖, Angle(x_1, x_2)). (5.64)

The Angle(x_1, x_2) is taken to be in [0, 2π), and is basically arctan(x_2/x_1), except that that is not uniquely defined, e.g., (1, 1) and (−1, −1) both have x_2/x_1 = 1, but their angles are π/4 and 5π/4. What we really mean is the unique value θ in [0, 2π) for which (5.57) holds. A glitch is that θ is not uniquely defined when (x_1, x_2) = (0, 0), since then r = 0 and any θ would work. So we assume that X does not contain (0, 0), hence r ∈ (0, ∞). This requirement does not hurt anything because we are dealing with continuous random variables.
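In practice the angle can be computed with the two-argument arctangent, which resolves exactly the ambiguity just described. A small R sketch (my own illustration; the function name polar is made up):

# Polar coordinates (r, theta) of x = (x1, x2), with theta in [0, 2*pi)
polar <- function(x1, x2) {
  r <- sqrt(x1^2 + x2^2)
  theta <- atan2(x2, x1) %% (2 * pi)  # atan2 resolves the quadrant; %% maps into [0, 2*pi)
  c(r = r, theta = theta)
}
polar(1, 1)     # theta = pi/4
polar(-1, -1)   # theta = 5*pi/4, even though x2/x1 is the same as above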

The g−1 is already given in (5.57), hence the Jacobian is

J_{g^{-1}}(r, θ) = \begin{vmatrix} \frac{∂}{∂r} r\cos(θ) & \frac{∂}{∂θ} r\cos(θ) \\ \frac{∂}{∂r} r\sin(θ) & \frac{∂}{∂θ} r\sin(θ) \end{vmatrix} = \begin{vmatrix} \cos(θ) & -r\sin(θ) \\ \sin(θ) & r\cos(θ) \end{vmatrix} = r(\cos(θ)^2 + \sin(θ)^2) = r. (5.65)

You may recall this result from calculus, i.e.,

dx_1\,dx_2 = r\,dr\,dθ. (5.66)


Then by (5.61), the pdf of (R, Θ) is

f(R,Θ)(r, θ) = h(r) r. (5.67)

Exercise 5.6.11 shows that the space is a rectangle, (0, ∞) × [0, 2π). This pdf can be written as a product of a function of just r, i.e., h(r)r, and a function of just θ, i.e., the function “1”. Thus R and Θ must be independent, by Lemma 3.4. Since the pdf of Θ is constant, it must be uniform, i.e.,

Θ ∼ Uniform[0, 2π), (5.68)

which has pdf 1/(2π). Then from (5.67),

f_{(R,Θ)}(r, θ) = h(r)\,r = [2π h(r)\,r]\left[\frac{1}{2π}\right]. (5.69)

Applying this formula to the normal example in (5.63), we have that R has pdf

f_R(r) = 2π h(r)\,r = r\,e^{-\frac{1}{2}r^2}. (5.70)

5.4.4 Box-Muller transformation

The Box-Muller transformation (Box and Muller, 1958) is an approach to generating two random normals from two random uniforms that reverses the above polar coordinate procedure. That is, suppose U_1 and U_2 are independent Uniform(0,1). Then we can generate Θ by setting

Θ = 2πU1, (5.71)

and R by setting

R = F_R^{-1}(U_2), (5.72)

where FR is the distribution function for the pdf in (5.70):

F_R(r) = \int_0^r w\,e^{-\frac{1}{2}w^2}\,dw = -e^{-\frac{1}{2}w^2}\Big|_{w=0}^{r} = 1 - e^{-\frac{1}{2}r^2}. (5.73)

Inverting u2 = FR(r) yields

r = F_R^{-1}(u_2) = \sqrt{-2\log(1-u_2)}. (5.74)

Thus, as in (5.57), we set

X_1 = \sqrt{-2\log(1-U_2)}\,\cos(2\pi U_1) and X_2 = \sqrt{-2\log(1-U_2)}\,\sin(2\pi U_1), (5.75)

which are then independent N(0, 1)'s. (Usually one sees U_2 in place of the 1 − U_2 in the logs, but either way is fine because both are Uniform(0,1).)
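A direct R implementation of (5.71)–(5.75) is below (a sketch of my own; R's rnorm does not use this method, so this is only for illustration), with a quick check of the output's moments:

# Box-Muller: two Uniform(0,1) samples in, two independent N(0,1) samples out
box_muller <- function(n) {
  u1 <- runif(n)
  u2 <- runif(n)
  r <- sqrt(-2 * log(1 - u2))   # R = F_R^{-1}(U_2), as in (5.74)
  theta <- 2 * pi * u1          # Theta = 2*pi*U_1, as in (5.71)
  cbind(x1 = r * cos(theta), x2 = r * sin(theta))
}
set.seed(2)
z <- box_muller(100000)
colMeans(z); apply(z, 2, sd)    # means near 0, standard deviations near 1
cor(z[, 1], z[, 2])             # near 0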


5.5 Order statistics

The order statistics for a sample x_1, . . . , x_n are the observations placed in order from smallest to largest. They are usually designated with indices “(i),” so that the order statistics are

x(1), x(2), . . . , x(n), (5.76)

where

x_{(1)} = smallest of x_1, . . . , x_n = min{x_1, . . . , x_n};
x_{(2)} = second smallest of x_1, . . . , x_n;
  ⋮
x_{(n)} = largest of x_1, . . . , x_n = max{x_1, . . . , x_n}. (5.77)

For example, if the sample is 3.4, 2.5, 1.7, 5.2, then the order statistics are 1.7, 2.5, 3.4, 5.2. If two observations have the same value, then that value appears twice, i.e., the order statistics for 3.4, 1.7, 1.7, 5.2 are 1.7, 1.7, 3.4, 5.2.
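In R, the order statistics of a sample are just the sorted values (my own two-line illustration):

sort(c(3.4, 2.5, 1.7, 5.2))   # 1.7 2.5 3.4 5.2
sort(c(3.4, 1.7, 1.7, 5.2))   # ties are repeated: 1.7 1.7 3.4 5.2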

These statistics are useful as descriptive statistics, and in nonparametric inference. For example, estimates of the median of a distribution are often based on order statistics, such as the sample median, or the trimean, which is a linear combination of the two quartiles and the median.

We will deal with X_1, . . . , X_n being iid, each with distribution function F, pdf f, and space X, so that the space of X = (X_1, . . . , X_n) is X^n. Then let

Y = (X(1), . . . , X(n)). (5.78)

We will assume that the X_i's are distinct, that is, no two have the same value. This assumption is fine in the continuous case, because the probability that two observations are equal is zero. In the discrete case, there may indeed be ties, and the analysis becomes more difficult. The space of Y is

Y = {(y_1, . . . , y_n) ∈ X^n | y_1 < y_2 < · · · < y_n}. (5.79)

To find the pdf, start with y ∈ Y, and let δ > 0 be small enough so that the intervals (y_1, y_1 + δ), (y_2, y_2 + δ), . . . , (y_n, y_n + δ) are disjoint. (So take δ less than all the gaps y_{i+1} − y_i.) Then

P[y1 < Y1 < y1 + δ, . . . , yn < Yn < yn + δ]

= P[y1 < X(1) < y1 + δ, . . . , yn < X(n) < yn + δ]. (5.80)

Now the event in the latter probability occurs when any permutation of the X_i's has one component in the first interval, one in the second, etc. E.g., if n = 3,

P[y1 < X(1) < y1 + δ, y2 < X(2) < y2 + δ, y3 < X(3) < y3 + δ]

= P[y1 < X1 < y1 + δ, y2 < X2 < y2 + δ, y3 < X3 < y3 + δ]

+ P[y1 < X1 < y1 + δ, y2 < X3 < y2 + δ, y3 < X2 < y3 + δ]

+ P[y1 < X2 < y1 + δ, y2 < X1 < y2 + δ, y3 < X3 < y3 + δ]

+ P[y1 < X2 < y1 + δ, y2 < X3 < y2 + δ, y3 < X1 < y3 + δ]

+ P[y1 < X3 < y1 + δ, y2 < X1 < y2 + δ, y3 < X2 < y3 + δ]

+ P[y1 < X3 < y1 + δ, y2 < X2 < y2 + δ, y3 < X1 < y3 + δ]

= 6 P[y1 < X1 < y1 + δ, y2 < X2 < y2 + δ, y3 < X3 < y3 + δ]. (5.81)


The last equation follows because the X_i's are iid, hence the six individual probabilities are equal. In general, the number of permutations is n!. Thus, we can write

P[y_1 < Y_1 < y_1 + δ, . . . , y_n < Y_n < y_n + δ] = n!\,P[y_1 < X_1 < y_1 + δ, . . . , y_n < X_n < y_n + δ]
= n! \prod_{i=1}^{n} [F(y_i + δ) − F(y_i)]. (5.82)

Dividing by δ^n then letting δ → 0 yields the joint density, which is

f_Y(y) = n! \prod_{i=1}^{n} f(y_i), \quad y ∈ Y. (5.83)

Marginal distributions of individual order statistics, or sets of them, can be obtained by integrating out the ones that are not desired. The process can be a bit tricky, and one must be careful with the spaces. Instead, we will present a representation that leads to the marginals as well as other quantities.

We start with U_1, . . . , U_n being iid Uniform(0, 1), so that the pdf (5.83) of the order statistics Y = (U_{(1)}, . . . , U_{(n)}) is simply n!. Consider the first order statistic together with the gaps between consecutive order statistics:

G_1 = U_{(1)}, G_2 = U_{(2)} − U_{(1)}, . . . , G_n = U_{(n)} − U_{(n−1)}. (5.84)

These G_i's are all positive, and they sum to U_{(n)}, which has range (0, 1). Thus the space of G = (G_1, . . . , G_n) is

G = {g ∈ R^n | 0 < g_i < 1, i = 1, . . . , n, & g_1 + · · · + g_n < 1}. (5.85)

The inverse function to (5.84) is

u(1) = g1, u(2) = g1 + g2, . . . , u(n) = g1 + · · ·+ gn. (5.86)

Note that this is a linear function of G:

Y = GA′, \qquad A′ = \begin{pmatrix} 1 & 1 & 1 & \cdots & 1 \\ 0 & 1 & 1 & \cdots & 1 \\ 0 & 0 & 1 & \cdots & 1 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{pmatrix}, (5.87)

and |A| = 1. Thus the Jacobian is 1, and the pdf of G is also n!:

fG(g) = n!, g ∈ G. (5.88)

This pdf is quite simple on its own, but note that it is a special case of the Dirichlet in (5.34) with K = n + 1 and all α_k = 1. Thus any order statistic is a beta, because it is the sum of the first few gaps, i.e.,

U_{(k)} = G_1 + · · · + G_k ∼ Beta(k, n − k + 1), (5.89)


analogous to (5.29). In particular, if n is odd, then the median of the observations is U_{((n+1)/2)}, hence

Median{U_1, . . . , U_n} ∼ Beta\left(\frac{n+1}{2}, \frac{n+1}{2}\right), (5.90)

and using Table 1.1 to find the mean and variance of a beta,

E[Median{U_1, . . . , U_n}] = \frac{1}{2} (5.91)

and

Var[Median{U_1, . . . , U_n}] = \frac{((n+1)/2)((n+1)/2)}{(n+1)^2(n+2)} = \frac{1}{4(n+2)}. (5.92)
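These beta facts are easy to check by simulation. The R sketch below (mine, with n = 11 chosen arbitrarily) compares the simulated variance of the median to 1/(4(n + 2)) and the simulated third order statistic to Beta(3, n − 2):

set.seed(3)
n <- 11; nsim <- 50000
u <- matrix(runif(nsim * n), nrow = nsim)        # each row is a sample of n uniforms
meds <- apply(u, 1, median)
c(var(meds), 1 / (4 * (n + 2)))                  # both about 0.019
u3 <- apply(u, 1, function(r) sort(r)[3])        # U_(3) for each sample
ks.test(u3, pbeta, 3, n - 3 + 1)                 # compare with Beta(3, n-2); large p-value expected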

To obtain the pdf of the order statistics of non-uniforms, we can use the probability transform approach as in Section 4.2.3. That is, suppose the X_i's are iid with (strictly increasing) distribution function F and pdf f. We then have that X has the same distribution as (F^{-1}(U_1), . . . , F^{-1}(U_n)), where the U_i's are iid Uniform(0, 1). Because F, hence F^{-1}, is increasing, the order statistics for the X_i's match those of the U_i's, that is,

(X(1), . . . , X(n)) =D (F−1(U(1)), . . . , F−1(U(n))). (5.93)

Thus for any particular k, X(k) =D F−1(U(k)). We know that U(k) ∼ Beta(k, n− k+ 1),

hence we can find the distribution of X_{(k)} using the transformation with h(u) = F^{-1}. Thus h^{-1} = F, i.e.,

U(k) = h−1(X(k)) = F(X(k))⇒ Jh−1 (x) = F′(x) = f (x). (5.94)

The pdf of X(k) is then

f_{X_{(k)}}(x) = f_{U_{(k)}}(F(x))\,f(x) = \frac{Γ(n+1)}{Γ(k)Γ(n-k+1)} F(x)^{k-1}(1-F(x))^{n-k} f(x). (5.95)

For most F and k, the pdf in (5.95) is not particularly easy to deal with analytically. In Section 9.2, we introduce the ∆-method, which can be used to approximate the distribution of order statistics for large n.
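Numerically, however, (5.95) is easy to evaluate, since f_{U_{(k)}} is just a Beta(k, n − k + 1) density. A sketch of my own (the helper name dorder is made up), here for the maximum of n = 5 standard exponentials, checked against simulation:

# pdf of the k-th order statistic: dbeta(F(x), k, n-k+1) * f(x)
dorder <- function(x, k, n, F, f) dbeta(F(x), k, n - k + 1) * f(x)
n <- 5; k <- 5                        # the maximum of 5 Exponential(1)'s
curve(dorder(x, k, n, pexp, dexp), 0, 6, ylab = "density")
set.seed(4)
xmax <- apply(matrix(rexp(n * 20000), ncol = n), 1, max)
lines(density(xmax), lty = 2)         # simulated density; should track the exact curve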

5.6 Exercises

Exercise 5.6.1. Show that if X ∼ Gamma(α, λ), then cX ∼ Gamma(α, λ/c).

Exercise 5.6.2. Suppose X ∼ Beta(α, β), and let Y = X/(1 − X) (so g(x) = x/(1 − x)). (a) What is the space of Y? (b) Find g^{-1}(y). (c) Find the Jacobian of g^{-1}(y). (d) Find the pdf of Y.

Exercise 5.6.3. Suppose X_1, X_2, X_3, X_4 are random variables, and define the function (Y_1, Y_2, Y_3, W) = g(X_1, X_2, X_3, X_4) by

Y_1 = \frac{X_1}{X_1 + X_2 + X_3 + X_4}, \quad Y_2 = \frac{X_2}{X_1 + X_2 + X_3 + X_4}, \quad Y_3 = \frac{X_3}{X_1 + X_2 + X_3 + X_4}, (5.96)


and W = X_1 + X_2 + X_3 + X_4, as in (5.26) and (5.31) with K = 4. (a) Find the inverse function g^{-1}(y_1, y_2, y_3, w). (b) Find the Jacobian of g^{-1}(y_1, y_2, y_3, w). (c) Show that the determinant of the Jacobian of g^{-1}(y_1, y_2, y_3, w) is w^3. [Hint: The determinant of a matrix does not change if one of the rows is added to another row. In the matrix of derivatives, adding each of the first three rows to the last row can simplify the determinant calculation.]

Exercise 5.6.4. Suppose X = (X_1, X_2, . . . , X_K), where the X_k's are independent and X_k ∼ Gamma(α_k, 1), as in (5.25), for K > 1. Define Y = (Y_1, . . . , Y_{K−1}) by Y_k = X_k/(X_1 + · · · + X_K) for k = 1, . . . , K − 1, and W = X_1 + · · · + X_K, so that (Y, W) = g(X) as in (5.26) and (5.31). Thus Y ∼ Dirichlet(α_1, . . . , α_K). (a) Write down the pdf of X. (b) Show that the joint space of (Y, W) is Y × (0, ∞), where Y is given in (5.30). (You can take as given that the space of Y is Y and the space of W is (0, ∞).) (c) Show that the pdf of (Y_1, . . . , Y_{K−1}, W) is given in (5.33).

Exercise 5.6.5. Suppose U_1 and U_2 are independent Uniform(0,1)'s, and let (Y_1, Y_2) = g(U_1, U_2) be defined by Y_1 = U_1 + U_2, Y_2 = U_1 − U_2. (a) Find g^{-1}(y_1, y_2) and the absolute value of the Jacobian, |J_{g^{-1}}(y_1, y_2)|. (b) What is the pdf of (Y_1, Y_2)? (c) Sketch the joint space of (Y_1, Y_2). Is it a rectangle? (d) Find the marginal spaces of Y_1 and Y_2. (e) Find the conditional space of Y_2 given Y_1 = y_1 for y_1 in the marginal space. [Hint: Do it separately for y_1 < 1 and y_1 ≥ 1.] (f) What is the marginal pdf of Y_1? (We found this tent distribution in Exercise 1.7.6 using distribution functions.)

Exercise 5.6.6. Suppose Z ∼ N(0, 1) and U ∼ Uniform(0, 1), and Z and U are independent. Let (X, Y) = g(Z, U), given by

X = \frac{Z}{U} \quad and \quad Y = U. (5.97)

The X has the slash distribution. (a) Is the space of (X, Y) a rectangle? (b) What is the space of X? (c) What is the space of Y? (d) What is the expected value of X? (e) Find the inverse function g^{-1}(x, y). (f) Find the Jacobian of g^{-1}. (g) Find the pdf of (X, Y). (h) What is the conditional space of Y given X = x, Y_x? (i) Find the pdf of X.

Exercise 5.6.7. Suppose X = (X_1, X_2) is bivariate normal with mean µ and covariance matrix Σ, where

µ = (µ_1, µ_2) \quad and \quad Σ = \begin{pmatrix} σ_{11} & σ_{12} \\ σ_{21} & σ_{22} \end{pmatrix}. (5.98)

Assume that Σ is invertible. (a) Write down the pdf. Find the a, b, c for which

(x − µ)Σ^{-1}(x − µ)′ = a(x_1 − µ_1)^2 + b(x_1 − µ_1)(x_2 − µ_2) + c(x_2 − µ_2)^2. (5.99)

(b) If σ_{12} ≠ 0, are X_1 and X_2 independent? (c) Suppose σ_{12} = 0 (so σ_{21} = 0, too). Show explicitly that the pdf factors into a function of x_1 times a function of x_2. Are X_1 and X_2 independent? (d) Still with σ_{12} = 0, what are the marginal distributions of X_1 and X_2?

Exercise 5.6.8. Suppose that (Y_1, . . . , Y_{K−1}) ∼ Dirichlet(α_1, . . . , α_K), where K ≥ 5. What is the distribution of (W_1, W_2), where

W_1 = \frac{Y_1}{Y_1 + Y_2 + Y_3 + Y_4} \quad and \quad W_2 = \frac{Y_2 + Y_3}{Y_1 + Y_2 + Y_3 + Y_4}? (5.100)

Justify your answer. [Hint: It is easier to work directly with the gammas defining the Y_i's, rather than using pdfs.]


Exercise 5.6.9. Let G ∼ Dirichlet(α_1, . . . , α_K) (so that G = (G_1, . . . , G_{K−1})). Set α_+ = α_1 + · · · + α_K. (a) Show that E[G_k] = α_k/α_+ and Var[G_k] = α_k(α_+ − α_k)/(α_+^2(α_+ + 1)). [Hint: Use the beta representation in (5.28), and get the mean and variance from Table 1.1 on page 7.] (b) Find E[G_k G_l] for k ≠ l. (c) Show that Cov[G_k, G_l] = −α_k α_l/(α_+^2(α_+ + 1)) for k ≠ l. (d) Show that

Cov[G] = \frac{1}{α_+(α_+ + 1)} \left( D(α) − \frac{1}{α_+} α′α \right), (5.101)

where α = (α_1, . . . , α_{K−1}) and D(α) is the (K − 1) × (K − 1) diagonal matrix with the α_i's on the diagonal:

D(α) = \begin{pmatrix} α_1 & 0 & \cdots & 0 \\ 0 & α_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & α_{K-1} \end{pmatrix}. (5.102)

Exercise 5.6.10. Suppose x is 2 × 1 and f_X(x) = f_X(Γx) for any x ∈ R^2 and orthogonal 2 × 2 matrix Γ. (a) Write x in terms of its polar coordinates as in (5.57), i.e., x_1 = ‖x‖ cos(θ) and x_2 = ‖x‖ sin(θ). Define Γ using an angle φ as in (5.58). For what φ is Γx = (‖x‖, 0)′? (b) Show that there exists a function h(r) for r ≥ 0 such that f_X(x) = h(‖x‖). [Hint: Let h(r) = f_X((r, 0)′).]

Exercise 5.6.11. Suppose x ∈ R^2 − {0_2} (the two-dimensional plane without the origin), and let (r, θ) = g(x) be the polar coordinate transformation, as in (5.64). Show that g defines a one-to-one correspondence between R^2 − {0_2} and the rectangle (0, ∞) × [0, 2π).

Exercise 5.6.12. Suppose Y_1, . . . , Y_n are independent Exponential(1)'s, and Y_{(1)}, . . . , Y_{(n)} are their order statistics. (a) Write down the space and pdf of the single order statistic Y_{(k)}. (b) What is the pdf of Y_{(1)}? What is the name of the distribution?

Exercise 5.6.13. The Gumbel(µ) distribution has space (−∞, ∞) and distribution function

F_µ(x) = e^{-e^{-(x-µ)}}. (5.103)

(a) Is this a legitimate distribution function? Why or why not? (b) Suppose X_1, . . . , X_n are independent and identically distributed Gumbel(µ) random variables, and Y = X_{(n)}, their maximum. What is the distribution function of Y? (Use the method in Exercise 4.4.18.) (c) This Y is also distributed as a Gumbel. What is the value of the parameter?

The remaining exercises are based on U_1, . . . , U_n independent Uniform(0,1)'s, where Y = (U_{(1)}, . . . , U_{(n)}) is the vector of their order statistics. Following (5.86), the order statistics can be represented by U_{(i)} = G_1 + · · · + G_i, where G = (G_1, . . . , G_n) ∼ Dirichlet(α_1, α_2, . . . , α_{n+1}), and all the α_i's equal 1.

Exercise 5.6.14. This exercise finds the covariance matrix of Y. (a) Show that

Cov[G] = \frac{1}{(n+1)(n+2)} \left( I_n − \frac{1}{n+1} \mathbf{1}_n \mathbf{1}_n′ \right), (5.104)


where I_n is the n × n identity matrix, and 1_n is the n × 1 vector of 1's. [Hint: This covariance is a special case of that in Exercise 5.6.9(d).] (b) Show that

Cov[Y] = \frac{1}{(n+1)(n+2)} \left( B − \frac{1}{n+1} cc′ \right), (5.105)

where B is the n × n matrix with ijth element b_{ij} = min{i, j}, and c = (1, 2, . . . , n)′. [Hint: Use Y = GA′ for the A in (5.87), and show that B = AA′ and c = A1_n.] (c) From (5.105), obtain

Var[U_{(i)}] = \frac{i(n+1−i)}{(n+1)^2(n+2)} \quad and \quad Cov[U_{(i)}, U_{(j)}] = \frac{i(n+1−j)}{(n+1)^2(n+2)} \ \ if\ i < j. (5.106)

Exercise 5.6.15. Suppose n is odd, so that the sample median is well-defined, and consider the three statistics: the sample mean U̅, the sample median U_med, and the sample midrange U_mr, defined to be the midpoint of the minimum and maximum, U_mr = (U_{(1)} + U_{(n)})/2. The variance of U_med is given in (5.92). (a) Find Var[U_i] and Var[U̅]. (b) Find Var[U_{(1)}], Var[U_{(n)}], and Cov[U_{(1)}, U_{(n)}]. [Use (5.106).] What is Var[U_mr]? (c) When n = 1, the three statistics above are all the same. For n > 1 (but still odd), which has the lowest variance, and which has the highest variance? (d) As n → ∞, what is the limit of Var[U_med]/Var[U̅]? (e) As n → ∞, what is the limit of Var[U_mr]/Var[U̅]?

Exercise 5.6.16. This exercise finds the joint pdf of (U_{(1)}, U_{(n)}), the minimum and maximum of the U_i's. Let V_1 = G_1 and V_2 = G_2 + · · · + G_n, so that (U_{(1)}, U_{(n)}) = g(V_1, V_2) = (V_1, V_1 + V_2). (a) What is the space of (U_{(1)}, U_{(n)})? (b) The distribution of (V_1, V_2) is Dirichlet. What are the parameters? Write down the pdf of (V_1, V_2). (c) Find the inverse function g^{-1} and the absolute value of the determinant of the Jacobian of g^{-1}. (d) Find the pdf of (U_{(1)}, U_{(n)}).

Exercise 5.6.17. Suppose X ∼ Binomial(n, p). This exercise relates the distribution function of the binomial to the beta distribution. (a) Show that for p ∈ (0, 1), P[U_{(k+1)} > p] = P[X ≤ k]. [Hint: The (k + 1)st order statistic is greater than p if and only if how many of the U_i's are less than p? What is that probability?] (b) Conclude that if F is the distribution function of X, then F(k) = P[Beta(k + 1, n − k) > p]. This formula is used in the R function pbinom.


Chapter 6

Conditional Distributions

6.1 Introduction

A two-stage process is described in Section 1.6.3, which will appear again in Section 6.4.1, where one first randomly chooses a coin from a population of coins, then flips it independently n = 10 times. There are two random variables in this experiment: X, the probability of heads for the chosen coin, and Y, the total number of heads among the n flips. It is given that X is equally likely to be any number between 0 and 1, i.e.,

X ∼ Uniform(0, 1). (6.1)

Also, once the coin is chosen, Y is binomial. If the chosen coin has X = x, then we say that the conditional distribution of Y given X = x is Binomial(n, x), written

Y |X = x ∼ Binomial(n, x). (6.2)

Together the equations (6.1) and (6.2) describe the distribution of (X, Y).

A couple of other distributions may be of interest. First, what is the marginal, sometimes referred to in this context as unconditional, distribution of Y? It is not binomial. It is the distribution arising from the entire two-stage procedure, not that arising given a particular coin. The space is Y = {0, 1, . . . , n}, as for the binomial, but the pmf is different:

f_Y(y) = P[Y = y] ≠ P[Y = y | X = x]. (6.3)

(That last expression is pronounced the probability that Y = y given X = x.)

Also, one might wish to interchange the roles of X and Y, and ask for the conditional distribution of X given Y = y for some y. This distribution is of particular interest in Bayesian inference, as follows. One chooses a coin as before, and then wishes to know its x. It is flipped ten times, and the number of heads observed, y, is used to guess what the x is. More precisely, one then finds the conditional distribution of X given Y = y:

X |Y = y ∼ ?? (6.4)

In Bayesian parlance, the marginal distribution of X in (6.1) is the prior distribution of X, because it is your best guess before seeing the data, and the conditional distribution in (6.4) is the posterior distribution, determined after you have seen the data.


These ideas extend to random vectors (X, Y). There are five distributions we consider, three of which we have seen before:

• Joint. The joint distribution of (X, Y) is the distribution of X and Y taken together.

• Marginal. The two marginal distributions: that of X alone and that of Y alone.

• Conditional. The two conditional distributions: that of Y given X = x, and that of X given Y = y.

The next section shows how to find the joint distribution from a conditional and marginal. Further sections look at finding the marginals and reverse conditional, the latter using Bayes theorem. We end with independence, Y being independent of X if the conditional distribution of Y given X = x does not depend on x.

6.2 Examples of conditional distributions

When considering the conditional distribution of Y given X = x, it may or may not be that the randomness of X is of interest, depending on the situation. In addition, there is no need for Y and X to be of the same type, e.g., in the coin example, X is continuous and Y is discrete. Next we look at some additional examples.

6.2.1 Simple linear regression

The relationship of one variable to another is central to many statistical investigations. The simplest is a linear relationship,

Y = α + βX + E, (6.5)

Here, α and β are fixed, Y is the “dependent” variable, and X is the “explanatory” or “independent” variable. The E is error, needed because one does not expect the variables to be exactly linearly related. Examples include X = Height and Y = Weight, or X = Dosage of a drug and Y some measure of health (cholesterol level, e.g.). The X could be a continuous variable, or an indicator function, e.g., be 0 or 1 according to the sex of the subject.

The normal linear regression model specifies that

Y |X = x ∼ N(α + βx, σ2e ). (6.6)

In particular,

E[Y | X = x] = α + βx \quad and \quad Var[Y | X = x] = σ_e^2. (6.7)

(Other models take (6.7) but do not assume normality, or allow Var[Y | X = x] to depend on x.) It may be that X is fixed by the experimenter, for example, the dosage x might be preset; or it may be that the X is truly random, e.g., the height of a randomly chosen person would be random. Often, this randomness of X is ignored, and analysis proceeds conditional on X = x. Other times, the randomness of X is also incorporated into the analysis, e.g., one might have the marginal

X ∼ N(µX , σ2X). (6.8)

Chapter 12 goes into much more detail on linear regression models.


6.2.2 Mixture models

The population may consist of a finite or countable number of distinct subpopulations, e.g., in assessing consumer ratings of cookies, there may be a subpopulation of people who like sweetness, and one of people who do not. With K subpopulations, X takes on the values 1, . . . , K. Note that we could have K = ∞. These values are indices, not necessarily meaning to convey any ordering. For a normal mixture, the model is

Y |X = k ∼ N(µk, σ2k ). (6.9)

Generally there are no restrictions on the µ_k's, but the σ_k^2's may be assumed equal. Also, K may or may not be known. The marginal distribution for X may be unrestricted, i.e.,

f_X(k) = p_k, \quad k = 1, . . . , K, (6.10)

where the p_k's are positive and sum to 1, or it may have a specific pmf.

6.2.3 Hierarchical models

Many experiments involve first randomly choosing a number of subjects from a population, then measuring a number of random variables on the chosen subjects. For example, one might randomly choose n third-grade classes from a city, then within each class administer a test to m randomly chosen students. Let X_i be the overall ability of class i, and Y_i the average performance on the test of the students chosen from class i. Then a possible hierarchical model is

X_1, . . . , X_n are iid ∼ N(µ, σ^2), and
Y_1, . . . , Y_n | X_1 = x_1, . . . , X_n = x_n are independent ∼ (N(x_1, τ^2), . . . , N(x_n, τ^2)). (6.11)

Here, µ and σ^2 are the mean and variance for the entire population of class means, while x_i is the mean for class i. Interest may center on the overall mean, so the city can obtain funding from the state, as well as on the individual classes chosen, so these classes can get special treats from the local school board.

6.2.4 Bayesian models

A statistical model typically depends on an unknown parameter vector θ, and the objective is to estimate the parameters, or some function of them, or test hypotheses about them. The Bayesian approach treats the data X and the parameter Θ as both being random, hence having a joint distribution. The frequentist approach considers the parameters to be fixed but unknown. Both approaches use a model for X, which is a set of distributions indexed by the parameter, e.g., the X_i's are iid N(µ, σ^2), where θ = (µ, σ^2). The Bayesian approach considers that model conditional on Θ = θ, and would write

X_1, . . . , X_n | Θ = θ (= (µ, σ^2)) ∼ iid N(µ, σ^2). (6.12)

Here, the capital Θ is the random vector, and the lower case θ is the particular value. Then to fully specify the model, the distribution of Θ must be given,

Θ ∼ π, (6.13)

for some prior distribution π. Once the data x is obtained, inference is based on the posterior distribution of Θ | X = x.


6.3 Conditional & marginal → Joint

We start with the conditional distribution of Y given X = x, and the marginal distribution of X, and find the joint distribution of (X, Y). Let X be the (marginal) space of X, and for each x ∈ X, let Y_x be the conditional space of Y given X = x, that is, the space for the distribution of Y | X = x. Then the (joint) space of (X, Y) is

W = {(x, y) | x ∈ X & y ∈ Y_x}. (6.14)

(In the coin example, X = (0, 1) and Y_x = {0, 1, . . . , n}, so that in this case the conditional space of Y does not depend on x.)

Now for a function g(x, y), the conditional expectation of g(X, Y) given X = x is denoted

eg(x) = E[g(X, Y) |X = x], (6.15)

and is defined to be the expected value of the function g(X, Y) where X is fixed at x and Y has the conditional distribution Y | X = x. If this conditional distribution has a pdf, say f_{Y|X}(y | x), then

e_g(x) = \int_{Y_x} g(x, y) f_{Y|X}(y | x)\,dy. (6.16)

If f_{Y|X} is a pmf, then we have the summation instead of the integral. It is important to realize that this conditional expectation is a function of x. In the coin example (6.2), with g(x, y) = y, the conditional expected number of heads given the chosen coin has parameter x is

eg(x) = E[Y |X = x] = E[Binomial(n, x)] = nx. (6.17)

The key to describing the joint distribution is to define the unconditional expected value of g.

Definition 6.1. Given the conditional distribution of Y given X, and the marginal distribution of X, the joint distribution of (X, Y) is that distribution for which

E[g(X, Y)] = E[eg(X)] (6.18)

for any function g with finite expected value.

So, continuing the coin example, with g(x, y) = y, since marginally X ∼ Uniform(0,1),

E[Y] = E[e_g(X)] = E[nX] = n\,E[Uniform(0, 1)] = \frac{n}{2}. (6.19)

Notice that this unconditional expected value does not depend on x, while the conditional expected value does, which is as it should be.

This definition yields the joint distribution P on W by looking at indicator functions g, as in Section 2.1.1. That is, take A ⊂ W, so that with P being the joint probability distribution for (X, Y),

P[A] = E[IA(X, Y)], (6.20)

where I_A(x, y) is the indicator of the set A as in (2.16), so equals 1 if (x, y) ∈ A and 0 if not. Then the definition says that

P[A] = E[eIA (X)] where eIA (x) = E[IA(X, Y) |X = x]. (6.21)


We should check that this P is in fact a legitimate probability distribution, and in turn yields the correct expected values. The latter result is proven in measure theory. That it is a probability measure as in (1.1) is not hard to show using that I_W ≡ 1, and if the A_i's are disjoint,

I∪Ai (x, y) = ∑ IAi (x, y). (6.22)

The eIA (x) is the conditional probability of A given X = x, written

P[A |X = x] = P[(X, Y) ∈ A |X = x] = E[IA(X, Y) |X = x]. (6.23)

6.3.1 Joint densities

If the conditional and marginal distributions have densities, then it is easy to find the joint density. Suppose X has marginal pdf f_X(x) and the conditional pdf of Y | X = x is f_{Y|X}(y | x). Then for any function g with finite expectation, we can take e_g(x) as in (6.16), and write

E[g(X, Y)] = E[e_g(X)]
= \int_X e_g(x) f_X(x)\,dx
= \int_X \int_{Y_x} g(x, y) f_{Y|X}(y | x)\,dy\, f_X(x)\,dx
= \int_W g(x, y) f_{Y|X}(y | x) f_X(x)\,dx\,dy. (6.24)

The last step is to emphasize that the double integral is indeed integrating over the whole space of (X, Y). Looking at that last expression, we see that by taking

f(x, y) = f_{Y|X}(y | x) f_X(x), (6.25)

we have that

E[g(X, Y)] = \int_W g(x, y) f(x, y)\,dx\,dy. (6.26)

Since that equation works for any g with finite expectation, Definition 6.1 implies that f(x, y) is the joint pdf of (X, Y). If either or both of the original densities are pmfs, the analysis goes through the same way, with summations in place of integrals.

Equation (6.25) should not be especially surprising. It is analogous to the general definition of conditional probability for sets A and B:

P[A ∩ B] = P[A | B]× P[B]. (6.27)

6.4 Marginal distributions

There is no special trick in obtaining the marginal of Y given the conditional Y | X = x and the marginal of X; just find the joint and integrate out x. Thus

f_Y(y) = \int_{X_y} f(x, y)\,dx = \int_{X_y} f_{Y|X}(y | x) f_X(x)\,dx. (6.28)


6.4.1 Coins and the beta-binomial distribution

In the coin example of (6.1) and (6.2),

f_Y(y) = \int_0^1 \binom{n}{y} x^y (1-x)^{n-y}\,dx
= \binom{n}{y} \frac{Γ(y+1)Γ(n-y+1)}{Γ(n+2)}
= \frac{n!}{y!(n-y)!} \frac{y!(n-y)!}{(n+1)!}
= \frac{1}{n+1}, \quad y = 0, 1, . . . , n, (6.29)

which is the Discrete Uniform(0, n). Note that this distribution is not a binomial, and does not depend on x (it better not!). In fact, this Y is a special case of the following.

Definition 6.2. Suppose

Y |X = x ∼ Binomial(n, x) and X ∼ Beta(α, β). (6.30)

Then the marginal distribution of Y is beta-binomial with parameters α, β and n, written

Y ∼ Beta-Binomial(α, β, n). (6.31)

When α = β = 1, the X is uniform. Otherwise, as above, we can find the marginal pmf to be

f_Y(y) = \frac{Γ(α+β)}{Γ(α)Γ(β)} \binom{n}{y} \frac{Γ(y+α)Γ(n-y+β)}{Γ(n+α+β)}, \quad y = 0, 1, . . . , n. (6.32)
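The pmf (6.32) is straightforward to compute on the log scale, and the two-stage description in Definition 6.2 gives an easy simulation check. An R sketch of my own (dbetabinom is a made-up name; some packages provide their own versions):

dbetabinom <- function(y, alpha, beta, n) {
  exp(lgamma(alpha + beta) - lgamma(alpha) - lgamma(beta) + lchoose(n, y) +
      lgamma(y + alpha) + lgamma(n - y + beta) - lgamma(n + alpha + beta))
}
alpha <- 2; beta <- 3; n <- 10          # illustrative parameter values
sum(dbetabinom(0:n, alpha, beta, n))    # should be 1
set.seed(5)
x <- rbeta(100000, alpha, beta)         # X ~ Beta(alpha, beta)
y <- rbinom(100000, n, x)               # Y | X = x ~ Binomial(n, x)
round(cbind(simulated = as.vector(table(factor(y, levels = 0:n))) / 100000,
            exact     = dbetabinom(0:n, alpha, beta, n)), 4)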

6.4.2 Simple normal linear model

For another example, take the linear model in (6.6) and (6.8),

Y | X = x ∼ N(α + βx, σ_e^2) and X ∼ N(µ_X, σ_X^2). (6.33)

We could write out the joint pdf and integrate, but instead we will find the mgf, which we can do in steps because it is an expected value. That is, with g(y) = e^{ty},

M_Y(t) = E[e^{tY}] = E[e_g(X)], where e_g(x) = E[e^{tY} | X = x]. (6.34)

We know the mgf of a normal from (4.36), which Y |X = x is, hence

e_g(x) = M_{N(α+βx, σ_e^2)}(t) = e^{(α+βx)t + σ_e^2 \frac{t^2}{2}}. (6.35)


The expected value of eg(X) can also be written as a normal mgf:

M_Y(t) = E[e_g(X)] = E[e^{(α+βX)t + σ_e^2 \frac{t^2}{2}}]
= e^{αt + σ_e^2 \frac{t^2}{2}} E[e^{(βt)X}]
= e^{αt + σ_e^2 \frac{t^2}{2}} M_{N(µ_X, σ_X^2)}(βt)
= e^{αt + σ_e^2 \frac{t^2}{2}} e^{βtµ_X + σ_X^2 \frac{(βt)^2}{2}}
= e^{(α+βµ_X)t + (σ_e^2 + σ_X^2 β^2)\frac{t^2}{2}}
= mgf of N(α + βµ_X, σ_e^2 + σ_X^2 β^2). (6.36)

That is, marginally,

Y ∼ N(α + βµ_X, σ_e^2 + σ_X^2 β^2). (6.37)
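A quick simulation check of (6.37), under made-up parameter values (my own sketch):

# Two-stage simulation of (6.33); compare Y's sample moments with (6.37)
set.seed(6)
alpha <- 1; beta <- 2; sig_e <- 1.5; mu_x <- 3; sig_x <- 0.5   # illustrative values
x <- rnorm(100000, mu_x, sig_x)
y <- rnorm(100000, alpha + beta * x, sig_e)
c(mean(y), alpha + beta * mu_x)              # both near 7
c(var(y), sig_e^2 + sig_x^2 * beta^2)        # both near 3.25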

6.4.3 Marginal mean and variance

We know that the marginal expected value of g(X, Y) is just the expected value of the conditional expected value:

E[g(X, Y)] = E[e_g(X)], where e_g(x) = E[g(X, Y) | X = x]. (6.38)

The marginal variance is not quite the expected value of the conditional variance. First, we will write down the marginal and conditional variances, using the formula Var[W] = E[W^2] − E[W]^2 on both:

σ_g^2 ≡ Var[g(X, Y)] = E[g(X, Y)^2] − E[g(X, Y)]^2 and
v_g(x) ≡ Var[g(X, Y) | X = x] = E[g(X, Y)^2 | X = x] − E[g(X, Y) | X = x]^2. (6.39)

Next, use the conditional expected value result on g^2:

E[g(X, Y)^2] = E[e_{g^2}(X)], where e_{g^2}(x) = E[g(X, Y)^2 | X = x]. (6.40)

Now use (6.38) and (6.40) in (6.39):

σ_g^2 = E[e_{g^2}(X)] − E[e_g(X)]^2 and
v_g(x) = e_{g^2}(x) − e_g(x)^2. (6.41)

Taking expected value over both sides of the second equation and rearranging shows that

E[e_{g^2}(X)] = E[v_g(X)] + E[e_g(X)^2], (6.42)

hence

σ_g^2 = E[v_g(X)] + E[e_g(X)^2] − E[e_g(X)]^2
      = E[v_g(X)] + Var[e_g(X)]. (6.43)

To summarize:


1. The unconditional expected value is the expected value of the conditional expected value.

2. The unconditional variance is the expected value of the conditional variance plus the variance of the conditional expected value.

The second sentence is very analogous to what happens in regression, where the total sum-of-squares equals the regression sum-of-squares plus the residual sum-of-squares.

These are handy results. For example, in the beta-binomial (Definition 6.2), finding the mean and variance using the pdf (6.32) can be challenging. But using the conditional approach is much easier. Because conditionally Y is Binomial(n, x),

eY(x) = nx and vY(x) = nx(1− x). (6.44)

Then because X ∼ Beta(α, β),

E[Y] = n\,E[X] = n\frac{α}{α + β}, (6.45)

and

Var[Y] = E[v_Y(X)] + Var[e_Y(X)]
= E[nX(1−X)] + Var[nX]
= n\frac{αβ}{(α+β)(α+β+1)} + n^2\frac{αβ}{(α+β)^2(α+β+1)}
= \frac{nαβ(α+β+n)}{(α+β)^2(α+β+1)}. (6.46)

These expressions give some insight into the beta-binomial. Like the binomial, the beta-binomial counts the number of successes in n trials, and has expected value np for p = α/(α + β). Consider the variances of a binomial and beta-binomial:

Var[Binomial] = np(1−p) \quad and \quad Var[Beta-Binomial] = np(1−p)\frac{α+β+n}{α+β+1}. (6.47)

Thus the beta-binomial has a larger variance, so it can be used to model situations in which the data are more dispersed than the binomial, e.g., if the n trials are n offspring in the same litter, and success is survival. The larger α + β, the closer the beta-binomial is to the binomial.
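To see the variance decomposition at work numerically, the R sketch below (mine, with illustrative α, β, n) estimates E[v_Y(X)] and Var[e_Y(X)] by simulation and compares their sum with (6.46):

set.seed(7)
alpha <- 2; beta <- 3; n <- 10
x <- rbeta(1000000, alpha, beta)
E_cond_var    <- mean(n * x * (1 - x))   # estimate of E[ n X (1 - X) ]
Var_cond_mean <- var(n * x)              # estimate of Var[ n X ]
E_cond_var + Var_cond_mean               # close to 6
n * alpha * beta * (alpha + beta + n) /
  ((alpha + beta)^2 * (alpha + beta + 1))  # exact value from (6.46): 6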

You might wish to check the mean and variance in the normal example of Section 6.4.2.

The same procedure works for vectors:

E[Y] = E[eY(X)] where eY(x) = E[Y |X = x], (6.48)

and

Cov[Y] = E[vY(X)] + Cov[eY(X)], where vY(x) = Cov[Y |X = x]. (6.49)

In particular, considering a single covariance,

Cov[Y1, Y2] = E[cY1,Y2 (X)] + Cov[eY1 (X), eY2 (X)], (6.50)

where

c_{Y_1,Y_2}(x) = Cov[Y_1, Y_2 | X = x]. (6.51)


6.4.4 Fruit flies

Arnold (1981) presents an experiment concerning the genetics of Drosophila pseudoobscura, a type of fruit fly. We are looking at a particular locus (place) on a pair of chromosomes. The locus has two possible alleles (values): TL ≡ TreeLine and CU ≡ Cuernavaca. Each individual has two of these, one on each chromosome. The individual's genotype is the pair of alleles it has. Thus the genotype could be (TL,TL), (TL,CU), or (CU,CU). (There is no distinction made between (CU,TL) and (TL,CU).)

The objective is to estimate θ ∈ (0, 1), the proportion of CU in the population. In this experiment, the researchers randomly collected 10 adult males. Unfortunately, one cannot determine the genotype of an adult fly just by looking at him. One can determine the genotype of young flies, though. So the researchers bred each of these ten flies with a (different) female known to be (TL,TL), and analyzed two of the offspring from each mating. Each offspring receives one allele from each parent. Thus if the mother's alleles are (A1, A2) and the father's are (B1, B2), each offspring has four (maybe not distinct) possibilities:

                 Father
Mother      B1          B2
A1          (A1, B1)    (A1, B2)
A2          (A2, B1)    (A2, B2)
                                     (6.52)

In this case, there are three relevant, fairly simple, tables:

                 Father
Mother      TL          TL
TL          (TL, TL)    (TL, TL)
TL          (TL, TL)    (TL, TL)

                 Father
Mother      CU          TL
TL          (TL, CU)    (TL, TL)
TL          (TL, CU)    (TL, TL)

                 Father
Mother      CU          CU
TL          (TL, CU)    (TL, CU)
TL          (TL, CU)    (TL, CU)
                                     (6.53)

The actual genotypes of the sampled offspring are next:

Father    Offspring's genotypes
1         (TL, TL) & (TL, TL)
2         (TL, TL) & (TL, CU)
3         (TL, TL) & (TL, TL)
4         (TL, TL) & (TL, TL)
5         (TL, CU) & (TL, CU)
6         (TL, TL) & (TL, CU)
7         (TL, CU) & (TL, CU)
8         (TL, TL) & (TL, TL)
9         (TL, CU) & (TL, CU)
10        (TL, TL) & (TL, TL)
                                     (6.54)

The probability distribution of these outcomes is governed by the population proportion θ of CUs under the following assumptions:

1. The ten chosen fathers are a simple random sample from the population.


2. The chance that a given father has 0, 1, or 2 CUs in his genotype follows the Hardy-Weinberg laws, which means that the number of CUs for each father is like flipping a coin twice independently, with probability of heads being θ.

3. For a given mating, the two offspring are each equally likely to get either of the father's two alleles (as well as a TL from the mother), and what the two offspring get are independent.

Since each genotype is uniquely determined by the number of CUs it has (0, 1, or 2), we can represent the ith father by X_i, the number of CUs in his genotype. The mothers are all (TL, TL), so each offspring receives a TL from the mother, and randomly one of the alleles from the father. Let Y_{ij} be the indicator of whether the jth offspring of father i receives a CU from the father. That is,

Y_{ij} = \begin{cases} 0 & \text{if offspring } j \text{ from father } i \text{ is (TL, TL)} \\ 1 & \text{if offspring } j \text{ from father } i \text{ is (TL, CU)} \end{cases}. (6.55)

Then each “family” has three random variables, (X_i, Y_{i1}, Y_{i2}). We will assume that these triples are independent, in fact,

(X1, Y11, Y12), . . . , (Xn, Yn1, Yn2) are iid. (6.56)

Assumption #2, the Hardy-Weinberg law, implies that

Xi ∼ Binomial(2, θ), (6.57)

because each father in effect randomly chooses two alleles from the population. Next, we specify the conditional distribution of the offspring given the father, i.e., (Y_{i1}, Y_{i2}) | X_i = x_i. If x_i = 0, then the father is (TL, TL), so the offspring will all receive a TL from the father, as in the first table in (6.53):

P[(Yi1, Yi2) = (0, 0) |Xi = 0] = 1. (6.58)

Similarly, if x_i = 2, the father is (CU,CU), so the offspring will all receive a CU from the father, as in the third table in (6.53):

P[(Yi1, Yi2) = (1, 1) |Xi = 2] = 1. (6.59)

Finally, if x_i = 1, the father is (TL,CU), which means each offspring has a 50-50 chance of receiving a CU from the father, as in the second table in (6.53):

P[(Y_{i1}, Y_{i2}) = (y_1, y_2) | X_i = 1] = \frac{1}{4} \quad for (y_1, y_2) ∈ {(0, 0), (0, 1), (1, 0), (1, 1)}. (6.60)

This conditional distribution can be written more compactly by noting that x_i/2 is the chance that an offspring receives a CU, so that

(Y_{i1}, Y_{i2}) | X_i = x_i ∼ iid Bernoulli\left(\frac{x_i}{2}\right), (6.61)

which using (4.56) yields the conditional pmf

f_{Y|X}(y_{i1}, y_{i2} | x_i) = \left(\frac{x_i}{2}\right)^{y_{i1}+y_{i2}} \left(1 − \frac{x_i}{2}\right)^{2−y_{i1}−y_{i2}}. (6.62)


The goal of the experiment is to estimate θ, but without knowing the X_i's. Thus the estimation has to be based on just the Y_{ij}'s. The marginal means are easy to find:

E[Y_{ij}] = E[e_{Y_{ij}}(X_i)], where e_{Y_{ij}}(x_i) = E[Y_{ij} | X_i = x_i] = \frac{x_i}{2}, (6.63)

because conditionally Yij is Bernoulli. Then E[Xi] = 2θ, hence

E[Y_{ij}] = \frac{2θ}{2} = θ. (6.64)

Nice! Then an obvious estimator of θ is the sample mean of all the Y_{ij}'s, of which there are 2n = 20:

\hat{θ} = \frac{\sum_{i=1}^{n} \sum_{j=1}^{2} Y_{ij}}{2n}. (6.65)

This estimator is called the Dobzhansky estimator. To find the estimate, we just count the number of CUs in (6.54), which is 8, hence the estimate of θ is 0.4.
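Under the stated assumptions, the whole experiment is easy to simulate, which gives a quick check that the estimator is unbiased; the R sketch below is my own (the variance it reports is the subject of the exercises mentioned next):

# Simulate: X_i ~ Binomial(2, theta) fathers, two offspring each with Y_ij ~ Bernoulli(X_i/2)
sim_dobzhansky <- function(theta, n = 10) {
  x <- rbinom(n, 2, theta)                      # CU count for each father
  y <- rbinom(2 * n, 1, rep(x / 2, each = 2))   # offspring indicators
  mean(y)
}
set.seed(8)
est <- replicate(20000, sim_dobzhansky(0.4))
mean(est)    # close to 0.4: the estimator is unbiased
var(est)     # its sampling variance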

What is the variance of the estimate? Are Y_{i1} and Y_{i2} unconditionally independent? What is Var[Y_{ij}]? What is the marginal pmf of (Y_{i1}, Y_{i2})? See Exercises 6.8.15 and 6.8.16.

6.5 Conditional from the joint

Often one has a joint distribution, but is primarily interested in the conditional, e.g., many experiments involve collecting health data from a population, and interest centers on the conditional distribution of certain outcomes, such as longevity, conditional on other variables, such as sex, age, cholesterol level, activity level, etc. In Bayesian inference, one can find the joint distribution of the data and the parameter, and from that find the conditional distribution of the parameter given the data. Measure theory guarantees that for any joint distribution of (X, Y), there exists a conditional distribution of Y | X = x for each x ∈ X. It may not be unique, but any conditional distribution that combines with the marginal of X to yield the original joint distribution is a valid conditional distribution. If densities exist, then f_{Y|X}(y | x) is a valid conditional density if

f (x, y) = fY|X(y | x) fX(x) (6.66)

for all (x, y) ∈ W. Thus, given a joint density f, one can integrate out y to obtain the marginal f_X, then define the conditional by

f_{Y|X}(y | x) = \frac{f(x, y)}{f_X(x)} = \frac{\text{Joint}}{\text{Marginal}} \quad if f_X(x) > 0. (6.67)

It does not matter how the conditional density is defined when f_X(x) = 0, as long as it is a density on Y_x, because in reconstructing the joint f in (6.66), it is multiplied by 0. Also, the conditional of X given Y = y is the ratio of the joint to the marginal of Y.

This formula works for pdfs, pmfs, and the mixed kind.

6.5.1 Coins

In Example 6.4.1 on the coins, the joint density of (X, Y) is

f(x, y) = \binom{n}{y} x^y (1−x)^{n−y}, (6.68)


and the marginal distribution of Y, the number of heads, is, as in (6.29),

f_Y(y) = \frac{1}{n+1}, \quad y = 0, . . . , n. (6.69)

Thus the conditional posterior distribution of X, the chance of heads, given Y, the number of heads, is

f_{X|Y}(x | y) = \frac{f(x, y)}{f_Y(y)}
= (n + 1)\binom{n}{y} x^y(1−x)^{n−y}
= \frac{(n+1)!}{y!(n−y)!} x^y(1−x)^{n−y}
= \frac{Γ(n+2)}{Γ(y+1)Γ(n−y+1)} x^y(1−x)^{n−y}
= Beta(y + 1, n − y + 1) pdf. (6.70)

For example, if the experiment yields Y = 3 heads, then one's guess of what the probability of heads is for this particular coin is described by the Beta(4, 8) distribution. A reasonable guess could be the posterior mean,

E[X | Y = 3] = E[Beta(4, 8)] = \frac{4}{4 + 8} = \frac{1}{3}. (6.71)

Note that this is not the sample proportion of heads, 0.3, although it is close. The posterior mode (the x that maximizes the pdf) and posterior median are also reasonable point estimates. A more informative quantity might be a probability interval:

P[0.1093 < X < 0.6097 |Y = 3] = 95%. (6.72)

So there is a 95% chance that the chance of heads is somewhere between 0.11 and 0.61. (These numbers were found using the qbeta function in R.) It is not a very tight interval, but there is not much information in just ten flips.
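The posterior summaries can be reproduced with R's beta functions; the sketch below (mine) assumes the interval in (6.72) is the equal-tailed one:

# Posterior for the coin example: X | Y = 3 ~ Beta(4, 8) when n = 10
n <- 10; y <- 3
a <- y + 1; b <- n - y + 1
a / (a + b)                                  # posterior mean, 1/3
qbeta(c(0.025, 0.975), a, b)                 # 95% equal-tailed interval, roughly (0.109, 0.610)
pbeta(0.6097, a, b) - pbeta(0.1093, a, b)    # about 0.95, as in (6.72)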

6.5.2 Bivariate normal

If one can recognize the form of a particular density for Y within a joint density, then it becomes unnecessary to explicitly find the marginal of X and divide the joint by the marginal. For example, the N(µ, σ^2) density can be written

φ(z | µ, σ^2) = \frac{1}{\sqrt{2π}\,σ} e^{-\frac{1}{2}\frac{(z−µ)^2}{σ^2}} = c(µ, σ^2)\, e^{z\frac{µ}{σ^2} − z^2\frac{1}{2σ^2}}. (6.73)

That is, we factor the pdf into a constant we do not care about at the moment, that depends on the fixed parameters, and the important component containing all the z-action.

Now consider the bivariate normal,

(X, Y) ∼ N(µ, Σ), \quad where \quad µ = (µ_X, µ_Y) \quad and \quad Σ = \begin{pmatrix} σ_X^2 & σ_{XY} \\ σ_{XY} & σ_Y^2 \end{pmatrix}. (6.74)


Assuming Σ is invertible, the joint pdf is (as in (5.46))

f(x, y) = \frac{1}{2π}\frac{1}{\sqrt{|Σ|}} e^{-\frac{1}{2}((x,y)−µ)Σ^{-1}((x,y)−µ)′}. (6.75)

We try to factor the conditional pdf of Y | X = x as

f_{Y|X}(y | x) = \frac{f(x, y)}{f_X(x)} = c(x, µ, Σ)\,g(y, x, µ, Σ), (6.76)

where c has as many factors as possible that are free of y (including f_X(x)), and g has everything else. Exercise 6.8.5 shows that c can be chosen so that

g(y, x, µ, Σ) = e^{yγ_1 + y^2γ_2}, (6.77)

where

γ_1 = \frac{σ_{XY}(x − µ_X) + µ_Y σ_X^2}{|Σ|} \quad and \quad γ_2 = −\frac{1}{2}\frac{σ_X^2}{|Σ|}. (6.78)

Now compare the g in (6.77) to the exponential term in (6.73). Since the latter is a normal pdf, and the space of Z in (6.73) is the same as that of Y in (6.77), the pdf in (6.76) must be normal, where the parameters (µ, σ^2) are found by matching γ_1 to µ/σ^2 and γ_2 to −1/(2σ^2). Doing so, we obtain

σ^2 = σ_Y^2 − \frac{σ_{XY}^2}{σ_X^2} \quad and \quad µ = µ_Y + \frac{σ_{XY}}{σ_X^2}(x − µ_X). (6.79)

That is,

Y | X = x ∼ N\left(µ_Y + \frac{σ_{XY}}{σ_X^2}(x − µ_X),\ σ_Y^2 − \frac{σ_{XY}^2}{σ_X^2}\right). (6.80)

Because we know the normal pdf, we could work backwards to find the c in (6.76), but there is no need to do that.

This is a normal linear model, as in (6.6), where

Y |X = x ∼ N(α + βx, σ2e ) (6.81)

with

β = \frac{σ_{XY}}{σ_X^2}, \quad α = µ_Y − βµ_X, \quad and \quad σ_e^2 = σ_Y^2 − \frac{σ_{XY}^2}{σ_X^2}. (6.82)

These equations should be familiar from linear regression. An alternative method for deriving conditional distributions in the multivariate normal is given in Section 7.7.
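A simulation check of (6.80)–(6.82), with made-up parameter values (my own sketch): regressing simulated Y on X should recover α, β, and σ_e^2.

set.seed(9)
mu_x <- 1; mu_y <- 2; s_xx <- 4; s_yy <- 3; s_xy <- 1.5   # illustrative parameters
x <- rnorm(100000, mu_x, sqrt(s_xx))
y <- rnorm(100000, mu_y + (s_xy / s_xx) * (x - mu_x), sqrt(s_yy - s_xy^2 / s_xx))
fit <- lm(y ~ x)
coef(fit)              # intercept near alpha = mu_y - beta*mu_x = 1.625, slope near beta = 0.375
summary(fit)$sigma^2   # near sigma_e^2 = s_yy - s_xy^2/s_xx = 2.4375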

6.6 Bayes theorem: Reversing the conditionals

We have already essentially derived Bayes theorem, but it is important enough to deserve its own section. The theorem takes the conditional density of Y given X = x and the marginal distribution of X, and produces the conditional density of X given Y = y. It uses the formula conditional = joint/marginal, where the marginal is found by integrating out the y from the joint.


Theorem 6.3. Bayes. Suppose Y | X = x has density f_{Y|X}(y | x) and X has marginal density f_X(x). Then,

f_{X|Y}(x | y) = \frac{f_{Y|X}(y | x) f_X(x)}{\int_{X_y} f_{Y|X}(y | z) f_X(z)\,dz}. (6.83)

The integral will be a summation if X is discrete.

Proof. With f (x, y) being the joint density,

f_{X|Y}(x | y) = \frac{f(x, y)}{f_Y(y)} = \frac{f_{Y|X}(y | x) f_X(x)}{\int_{X_y} f(z, y)\,dz} = \frac{f_{Y|X}(y | x) f_X(x)}{\int_{X_y} f_{Y|X}(y | z) f_X(z)\,dz}. (6.84)

Bayes theorem is often used with sets. Let A ⊂ X and B_1, . . . , B_K be a partition of X, i.e.,

B_i ∩ B_j = ∅ for i ≠ j, and \cup_{k=1}^{K} B_k = X. (6.85)

Then

P[B_k | A] = \frac{P[A | B_k] P[B_k]}{\sum_{l=1}^{K} P[A | B_l] P[B_l]}. (6.86)

6.6.1 AIDS virus

A common illustration of Bayes theorem involves testing for some medical condition, e.g., a blood test for the AIDS virus. Suppose the test is 99% accurate. If a random person's test is positive, does that mean the person is 99% sure of having the virus?

Let A+ = “test is positive”, A− = “test is negative”, B+ = “person has the virus”, and B− = “person does not have the virus.” Then we know the conditionals

P[A+ | B+] = 0.99 and P[A− | B−] = 0.99, (6.87)

but they are not of interest. We want to know the reverse conditional, P[B+ | A+], the chance of having the virus given the test is positive. There is no way to figure this probability out without the marginal of B, that is, the marginal chance a random person has the virus. Let us say that P[B+] = 1/10,000. Now we can use Bayes theorem (6.86):

P[B+ | A+] = \frac{P[A+ | B+] P[B+]}{P[A+ | B+] P[B+] + P[A+ | B−] P[B−]} = \frac{0.99 × \frac{1}{10000}}{0.99 × \frac{1}{10000} + 0.01 × \frac{9999}{10000}} ≈ 0.0098. (6.88)

Thus the chance of having the virus, given the test is positive, is only about 1/100. That is lower than one might expect, but it is substantially higher than the overall chance of 1/10000. (This example is a bit simplistic in that random people do not take the test, but more likely people who think they may be at risk.)
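The arithmetic in (6.88), and how the answer depends on the prevalence, is a one-liner in R (my own sketch; post_pos is a made-up name):

# P[B+ | A+] as a function of the prevalence P[B+], with a 99% accurate test
post_pos <- function(prev, acc = 0.99) acc * prev / (acc * prev + (1 - acc) * (1 - prev))
post_pos(1 / 10000)                    # about 0.0098, as in (6.88)
post_pos(c(1e-5, 1e-3, 0.1))           # the posterior is driven largely by the prevalence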


6.6.2 Beta posterior for the binomial

The coin example in Section 6.5.1 can be generalized by using a beta in place of the uniform. In a Bayesian framework, we suppose the probability of success, θ, has a prior distribution Beta(α, β), and Y | Θ = θ is Binomial(n, θ). (So now x has become θ.) The prior is supposed to represent knowledge or belief about what the θ is before seeing the data Y. To find the posterior, or what we are to think after seeing the data Y = y, we need the conditional distribution of Θ given Y = y.

The joint density is

f(θ, y) = f_{Y|Θ}(y | θ) f_Θ(θ)
= \binom{n}{y} θ^y(1−θ)^{n−y} \frac{Γ(α+β)}{Γ(α)Γ(β)} θ^{α−1}(1−θ)^{β−1}
= c(y, α, β)\, θ^{y+α−1}(1−θ)^{n−y+β−1}. (6.89)

Because we are interested in the pdf of Θ, we put everything not depending on θ in the constant. But the part that does depend on θ is the meat of a Beta(α + y, β + n − y) density, hence that is the posterior:

Θ |Y = y ∼ Beta(α + y, β + n− y). (6.90)

That is, we do not explicitly have to find the marginal pmf of Y, as we did do in deriving (6.70).

6.7 Conditionals and independence

If X and Y are independent, then it makes sense that the distribution of one does not depend on the value of the other, which is true.

Lemma 6.4. The random vectors X and Y are independent if and only if (a version of) the conditional distribution of Y given X = x does not depend on x.

The parenthetical “a version of” is there to make the statement precise, since it is possible (e.g., if X is continuous) for the conditional distribution to depend on x for a few x without changing the joint distribution.

When there are densities, this result follows directly:

Independence =⇒ f(x, y) = f_X(x) f_Y(y) and W = X × Y
=⇒ f_{Y|X}(y | x) = \frac{f(x, y)}{f_X(x)} = f_Y(y) and Y_x = Y, (6.91)

which does not depend on x. The other way, if the conditional distribution of Y given X = x does not depend on x, then f_{Y|X}(y | x) = f_Y(y) and Y_x = Y, hence

f (x, y) = fY |X(y | x) fX(x) = fY(y) fX(x) and W = X ×Y . (6.92)

6.7.1 Independence of residuals and X

Suppose (X, Y) is bivariate normal as in (6.74),

(X, Y) ∼ N\left((µ_X, µ_Y),\ \begin{pmatrix} σ_X^2 & σ_{XY} \\ σ_{XY} & σ_Y^2 \end{pmatrix}\right). (6.93)


We then have that

Y | X = x ∼ N(α + βx, σ_e^2), (6.94)

where the parameters are given in (6.82). The residual is Y − α − βX. What is its conditional distribution? First, for fixed x,

[Y− α− βX |X = x] =D [Y− α− βx |X = x]. (6.95)

This equation means that when we are conditioning on X = x, the conditional distribution stays the same if we fix X = x, which follows from the original definition of conditional expected value in (6.15). We know that subtracting the mean from a normal leaves a normal with mean 0 and the same variance, hence

Y− α− βx |X = x ∼ N(0, σ2e ). (6.96)

But the right-hand side has no x, hence Y − α − βX is independent of X, and has marginal distribution N(0, σ_e^2).

6.8 Exercises

Exercise 6.8.1. Suppose (X, Y) has pdf f(x, y), and that the conditional pdf of Y | X = x does not depend on x. That is, there is a function g(y) such that f(x, y)/f_X(x) = g(y) for all x ∈ X. Show that g(y) is the marginal pdf of Y. [Hint: Find the joint pdf in terms of the conditional and marginal of X, then integrate out x.]

Exercise 6.8.2. A study was conducted on people near Newcastle on Tyne in 1972–74 (Appleton, French, and Vanderpump, 1996), and followed up twenty years later. We will focus on 1314 women in the study. The three variables we will consider are Z: age group (three values); X: whether they smoked or not (in 1974); and Y: whether they were still alive in 1994. Here are the frequencies:

Age group    Young (18–34)     Middle (35–64)    Old (65+)
Smoker?      Yes     No        Yes     No        Yes     No
Died         5       6         92      59        42      165
Lived        174     213       262     261       7       28
                                                            (6.97)

(a) Treating proportions in the table as probabilities, find

P[Y = Lived |X = Smoker] and P[Y = Lived |X = Nonsmoker]. (6.98)

Who were more likely to live, smokers or nonsmokers? (b) Find P[X = Smoker | Z = z] for z = Young, Middle, and Old. What do you notice? (c) Find

P[Y = Lived | X = Smoker & Z = z] (6.99)

and

P[Y = Lived | X = Nonsmoker & Z = z] (6.100)

for z = Young, Middle, and Old. Adjusting for age group, who were more likely to live, smokers or nonsmokers? (d) Conditionally on age, the relationship between smoking and living is negative for each age group. Is it true that marginally (not conditioning on age), the relationship between smoking and living is negative? What is the explanation? (Simpson's paradox.)


Exercise 6.8.3. Suppose in a large population, the proportion of people who are infected with the HIV virus is ε = 1/100,000. (In the example in Section 6.6.1, this proportion was 1/10,000.) People can take a blood test to see whether they have the virus. The test is 99% accurate: The chance the test is positive given the person has the virus is 99%, and the chance the test is negative given the person does not have the virus is also 99%. Suppose a randomly chosen person takes the test. (a) What is the chance that this person does have the virus given that the test is positive? Is this close to 99%? (b) What is the chance that this person does have the virus given that the test is negative? Is this close to 1%? (c) Do the probabilities in (a) and (b) sum to 1?

Exercise 6.8.4. (a) Find the mode for the Beta(α, β) distribution. (b) What is the value for the Beta(4, 8)? How does it compare to the posterior mean in (6.71)?

Exercise 6.8.5. Consider (X, Y) being bivariate normal, N(µ, Σ), as in (6.74). (a) Show that the exponent in (6.75) can be written

−\frac{1}{2|Σ|}\left(σ_Y^2(x − µ_X)^2 − 2σ_{XY}(x − µ_X)(y − µ_Y) + σ_X^2(y − µ_Y)^2\right). (6.101)

(b) Consider (6.101) as a quadratic in y, so that x and the parameters are constants. Show that y has coefficient γ_1 = (σ_{XY}(x − µ_X) + µ_Y σ_X^2)/|Σ| and y^2 has coefficient γ_2 = −σ_X^2/(2|Σ|), as in (6.78). (c) Argue that the conditional pdf of Y | X = x can be written as in (6.76) and (6.77), i.e.,

f_{Y|X}(y | x) = c(x, µ, Σ)\,e^{yγ_1 + y^2γ_2}. (6.102)

(You do not have to explicitly find the function c, though you are welcome to do so.)

Exercise 6.8.6. Set

\frac{µ}{σ^2} = \frac{σ_{XY}(x − µ_X) + µ_Y σ_X^2}{|Σ|} \quad and \quad −\frac{1}{2σ^2} = −\frac{1}{2}\frac{σ_X^2}{|Σ|}. (6.103)

Solve for µ and σ^2. The answers should be as in (6.79).

Exercise 6.8.7. Suppose X ∼ Gamma(α, λ). Then E[X] = α/λ and Var[X] = α/λ^2. (a) Find E[X^2]. (b) Is E[1/X] = 1/E[X]? (c) Find E[1/X]. For which values of α is it finite? (d) Find E[1/X^2]. For which values of α is it finite? (e) Now suppose Y ∼ Gamma(β, δ), and it is independent of the X above. Let R = X/Y. Find E[R], E[R^2], and Var[R].

Exercise 6.8.8. Suppose X |Θ = θ ∼ Poisson(c θ), where c is a fixed constant. Also,marginally, Θ ∼ Gamma(α, λ). (a) Find the joint density f (x, θ) of (X, Θ). It can bewritten as dθα∗−1e−λ∗θ for some α∗ and λ∗, where the d may depend on x, c, λ, α, butnot on θ. What are α∗ and λ∗? (b) Find the conditional distribution of Θ |X = x . (c)Find E[Θ |X = x].

Exercise 6.8.9. In 1954, a large experiment was conducted to test the effectivenessof the Salk vaccine for preventing polio. A number of children were randomly as-signed to two groups, one group receiving the vaccine, and a control group receivinga placebo. The number of children contracting polio (denoted x) in each group was

Page 110: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

98 Chapter 6. Conditional Distributions

then recorded. For the vaccine group, nV = 200745 and xV = 57; for the controlgroup, nC = 201229 and xC = 142. That is, 57 of the 200745 children getting the vac-cine contracted polio, and 142 of the 201229 children getting the placebo contractedpolio. Let ΘV and ΘC be the polio rates per 100,000 children for the populationvaccine and control groups, and suppose that (XV , ΘV) is independent of (XC, ΘC).Furthermore, suppose

XV |ΘV = θV ∼ Poisson(cV θV) and XC |ΘC = θC ∼ Poisson(cC θC), (6.104)

where the c’s are the n’s divided by 100,000, and marginally, ΘV ∼ Gamma(α, λ)and ΘC ∼ Gamma(α, λ) (i.e., they have the same prior). It may be reasonable totake priors with mean about 25, and standard deviation also about 25. (a) What doα and λ need to be so that E[ΘV ] = E[ΘC] = 25 and Var[ΘV ] = Var[ΘC] = 252? (b)What are the posterior distributions for the Θ’s based on the above numbers? (Thatis, find ΘV |XV = 57 and ΘC |XC = 142.) (c) Find the posterior means and standarddeviations of the Θ’s. (d) We are hoping that the vaccine has a lower rate than thecontrol. What is P[ΘV < ΘC |XV = 57, XC = 142]? (Sketch the two posterior pdfs.)(e) Consider the ratio of rates, R = ΘV/ΘC. Find E[R |XV = 57, XC = 142] and√

Var[R |XV = 57, XC = 142]. (f) True or false: The vaccine probably cuts the rate ofpolio by at least half (i.e., P[R < 0.5 |XV = 57, XC = 142] > 0.5). (g) What do youconclude about the effectiveness of the vaccine?

Exercise 6.8.10. Suppose (Y1, Y2, . . . , YK−1) ∼ Dirichlet(α1, . . . , αK), where K > 4. (a)Argue that

Y1Y4Y2Y3

=DX1X4X2X3

, (6.105)

where the Xi’s are independent gammas. What are the parameters of the Xi’s? (b)Find the following expected values, if they exist:

E[

Y1Y4Y2Y3

]and E

[(Y1Y4Y2Y3

)2]

. (6.106)

For which values of the αi’s does the first expected value exist? For which does thesecond exist?

Exercise 6.8.11. Suppose

(Z1, Z2, Z3, Z4) | (P1, P2, P3, P4) = (p1, p2, p3, p4) ∼ Multinomial(n, (p1, p2, p3, p4)),(6.107)

and(P1, P2, P3) ∼ Dirichlet(α1, α2, α3, α4). (6.108)

Note that P4 = 1− P1 − P2 − P3. (a) Find the conditional distribution of

(P1, P2, P3) | (Z1, Z2, Z3, Z4) = (z1, z2, z3, z4). (6.109)

(b) Data from the General Social Survey of 1991 included a comparison of men’s andwomen’s belief in the afterlife. See Agresti (2013). Assume the data is multinomialwith four categories, arranged as follows:

Belief in afterlifeGender Yes No or UndecidedFemales Z1 Z2Males Z3 Z4

(6.110)

Page 111: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

6.8. Exercises 99

The odds that a female believes in the afterlife is then P1/P2, and the odds that a malebelieves in the afterlife is P3/P4. The ratio of these odds is called the odds ratio:

Odds Ratio =P1P4P2P3

. (6.111)

Then using Exercise 6.8.10, it can be shown that

[Odds Ratio | (Z1, Z2, Z3, Z4) = (z1, z2, z3, z4)] =D X1X4

X2X3, (6.112)

where the Xi’s are independent, Xi ∼ Gamma(βi, 1). Give the βi’s (which shouldbe functions of the αi’s and zi’s). (c) Show that gender and belief in afterlife areindependent if and only if Odds Ratio = 1. (d) The actual data from the survey are

Belief in afterlifeGender Yes No or UndecidedFemales 435 147Males 375 134

(6.113)

Take the prior with all αi = 1/2. Find the posterior expected value and posteriorstandard deviation of the odds ratio. What do you conclude about the differencebetween men and women here?

Exercise 6.8.12. Suppose X1 and X2 are independent, X1 ∼ Poisson(λ1) and X2 ∼Poisson(λ2). Let T = X1 + X2, which is Poisson(λ1 +λ2). (a) Find the joint space andjoint pmf of (X1, T). (b) Find the conditional space and conditional pmf of X1 | T = t.(c) Letting p = λ1/(λ1 + λ2), write the conditional pmf from part (b) in terms of p(and t). What is the conditional distribution?

Exercise 6.8.13. Now suppose X1, X2, X3 are iid Poisson(λ), and T = X1 + X2 + X3.What is the distribution of (X1, X2, X3) | T = t? What are the conditional mean vectorand conditional covariance matrix of (X1, X2, X3) | T = t?

Exercise 6.8.14. Suppose Z1, Z2, Z3, Z4 are iid Bernoulli(p) variables, and let Y betheir sum. (a) Find the conditional space and pmf of (Z1, Z2, Z3, Z4) given Y = 2. (b)Find E[Z1 |Y = 2] and Var[Z1 |Y = 2]. (c) Find Cov[Z1, Z2 |Y = 2]. (d) Let Z be thesample mean of the Zi’s. Find E[Z |Y = 2] and Var[Z |Y = 2].

Exercise 6.8.15. This problem is based on the fruit fly example, Section 6.4.4. Sup-pose (Y1, Y2) |X = x are (conditionally) independent Bernoulli(x/2)’s, and X ∼Binomial(2, θ). (a) Show that the marginal distribution of (Y1, Y2) is given by

fY1,Y2 (0, 0) =12(1− θ)(2− θ);

fY1,Y2 (0, 1) = fY1,Y2 (1, 0) =12

θ(1− θ);

fY1,Y2 (1, 1) =12

θ(1 + θ). (6.114)

(b) Find Cov[Y1, Y2], their marginal covariance. Are Y1 and Y2 marginally indepen-dent? (c) The offsprings’ genotypes are observed, but not the fathers’. But given

Page 112: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

100 Chapter 6. Conditional Distributions

the offspring, one can guess what the father is by obtaining the conditional distri-bution of X given (Y1, Y2). In the following, take θ = 0.4. (i) If the offspring has(y1, y2) = (0, 0), what would be your best guess for the father’s x? (Best in the senseof having the highest conditional probability.) Are you sure of your guess? (ii) If theoffspring has (y1, y2) = (0, 1), what would be your best guess for the father’s x? Areyou sure of your guess? (iii) If the offspring has (y1, y2) = (1, 1), what would be yourbest guess for the father’s x? Are you sure of your guess?

Exercise 6.8.16. Continue the fruit flies example in Exercise 6.8.15. Let W = Y1 + Y2,so that W |X = x ∼ Binomial(2, x/2) and X ∼ Binomial(2, θ). (a) Find E[W |X = x]and Var[W |X = x]. (b) Now find E[W], E[X2], and Var[W]. (c) Which is larger, themarginal variance of W or the marginal variance of X, or are they the same? (d) Notethat the Dobzhansky estimator θ in (6.65) is W/2 based on W1, . . . , Wn iid versions ofW. What is Var[θ] as a function of θ?

Exercise 6.8.17. (Student’s t).

Definition 6.5. Let Z and U be independent, Z ∼ N(0, 1) and U ∼ Gamma(ν/2, 1/2)(= χ2

ν if ν is an integer), and T = Z/√

U/ν. Then T is called Student’s t on ν degrees offreedom, written T ∼ tν.

(a) Show that E[T] = 0 (if ν > 1) and Var[T] = ν/(ν− 2) (if ν > 2). [Hint: Note thatsince Z and U are independent, E[T] = E[Z]E[

√ν/U] if both expectations are finite.

Show that E[1/√

U] is finite if ν > 1. Similarly for E[T2], where you can use the factfrom Exercise 6.8.7 that E[1/U] = 1/(ν− 2) for ν > 2.] (b) The joint distribution of(T, U) can be represented by

T |U = u ∼ N(0, ν/u) and U ∼ Gamma(ν/2, 1/2). (6.115)

Write down the joint pdf. (c) Integrate out the u from the joint pdf to show that themarginal pdf of T is

fν(t) =Γ((ν + 1)/2)Γ(ν/2)

√νπ

1(1 + t2/ν)(ν+1)/2

. (6.116)

For what values of ν is this density valid?

Exercise 6.8.18. Suppose that X ∼ N(µ, 1), and Y = X2, so that Y ∼ χ21(µ

2), anoncentral χ2 on one degree of freedom. (The pdf of Y was derived in Exercise1.7.14.) Show that the moment generating function of Y is

MY(y) =1√

1− 2teµ2t/(1−2t). (6.117)

[Hint: Find the mgf of Y by finding E[etX2] directly. Write out the integral with the pdf

of X, then complete the square in the exponent, where you will see another normalpdf with a different variance.]

Exercise 6.8.19. Suppose that conditionally,

W |K = k ∼ χ2ν+2k, (6.118)

Page 113: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

6.8. Exercises 101

and marginally, K ∼ Poisson(λ). (a) Find the unconditional mean and variance ofW. (You can take the means and variances of the χ2 and Poisson from Tables 1.1and 1.2.) (b) What is the conditional mgf of W given K = k? (You don’t need toderive it, just write it down based on what you know about the mgf of the χ2

m =Gamma(m/2, 1/2).) For what values of t is it finite? (c) Show that the unconditionalmgf of W is

MW(t) =1

(1− 2t)ν/2 e2λt/(1−2t). (6.119)

(d) Suppose Y ∼ χ21(µ

2) as in Exercise 6.8.18. The mgf for Y is of the form in (6.119).What are the corresponding ν and λ? What are the mean and variance of the Y?[Hint: Use part (a).]

Exercise 6.8.20. In this problem you should use your intuition, rather than try to formallyconstruct conditional densities. Suppose U1, U2, U3 are iid Uniform(0, 1), and let U(3) bethe maximum of the three. Consider the conditional distribution of U1 |U(3) = .9. (a)Is the conditional distribution continuous? (b) Is the conditional distribution discrete?(c) What is the conditional space? (d) What is P[U1 = .9 |U(3) = .9]? (e) What isE[U1 |U(3) = .9]?

Exercise 6.8.21. Imagine a box containing three cards. One is red on both sides, oneis green on both sides, and one is red on one side and green on the other. You closeyour eyes and randomly pick a card and lay it on the table. You open your eyes andnotice that the side facing up is red. What is the chance the side facing down is red,too? [Hint: Let Y1 be the color of the side facing up, and Y2 the color of the side facingdown. Then find the joint pmf of (Y1, Y2), and from that P[Y2 = red |Y1 = red].]

Page 114: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work
Page 115: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

Chapter 7

The Multivariate Normal Distribution

7.1 Definition

Almost all data are multivariate, that is, entail more than one variable. There are twogeneral-purpose multivariate models: the multivariate normal for continuous data,and the multinomial for categorical data. There are many specialized multivariatedistributions, but these two are the only ones that are used in all areas of statistics.We have seen the bivariate normal in Section 5.4.1, and the multinomial is introducedin Section 2.5.3.

The multivariate normal is by far the most commonly used distribution for con-tinuous multivariate data. Which is not to say that all data are distributed normally,nor that all techniques assume such. Rather, one usually either assumes normality,or makes few assumptions at all and relies on asymptotic results. Some of the niceproperties of the multivariate normal:

• It is completely determined by the means, variances, and covariances.

• Elements are independent if and only if they are uncorrelated.

• Marginals of multivariate normals are multivariate normal.

• An affine transformation of a multivariate normal is multivariate normal.

• Conditionals of a multivariate normal are multivariate normal.

• The sample mean is independent of the sample covariance matrix in an iidnormal sample.

The multivariate normal arises from iid standard normals, that is, iid N(0, 1)’s.Suppose Z = (Z1, . . . , ZM) is an 1 × M vector of iid N(0, 1)’s. Because they areindependent, all the covariances are zero, so that

E[Z] = 0M and Cov[Z] = IM. (7.1)

A general multivariate normal distribution can have any (legitimate) mean and co-variance, achieved through the use of affine transformations. Here is the definition.

103

Page 116: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

104 Chapter 7. The Multivariate Normal Distribution

Definition 7.1. Let the 1× p vector Y be defined by

Y = µ + ZB′, (7.2)

where B is p×M, µ is 1× p, and Z is a 1×M vector of iid standard normals. Then Y ismultivariate normal with mean µ and covariance matrix Σ ≡ BB′, written

Y ∼ Np(µ, Σ). (7.3)

We often drop the subscript “p” from the notation. The definition is for a row vec-tor Y. A multivariate normal column vector is defined the same way, then transposed.That is, if Z is a K × 1 vector of iid N(0, 1)’s, then Y = µ + BZ is N(µ, Σ) as well,where now µ is p× 1. In this book, I’ll generally use a row vector if the elements aremeasurements of different variables on the same observation, and a column vectorif they are measurements of the same variable on different observations, but there isno strict demarcation. The thinking is that a typical n× p data matrix has n observa-tions, represented by the rows, and p variables, represented by the columns. In fact, amultivariate normal matrix is simply a long multivariate normal vector chopped upand arranged into a matrix.

From (7.1) and (7.2), the usual affine transformation results from Section 2.3 showthat E[Y] = µ and Cov[Y] = BB′ = Σ. The definition goes further, implying thatthe distribution depends on B only through the BB′. For example, consider the twomatrices

B1 =

(1 1 00 1 2

)and B2 =

3√5

1√5

0√

5

, (7.4)

so that

B1B′1 = B2B′2 =

(2 11 5

)≡ Σ. (7.5)

Thus the definition says that both

Y1 = (Z1 + Z2, Z2 + 2Z3) and Y2 =

(3√5

Z1 +1√5

Z2,√

5Z2

)(7.6)

are N(02, Σ). They clearly have the same mean and covariance matrix, but it is notobvious they have the exact same distribution, especially as they depend on differentnumbers of Zi’s. To see the distributions are the same, we have to look at the mgf.We have already found the mgf for the bivariate (p = 2) normal in (5.50), and theproof here is the same. The answer is

MY(t) = eµt′+ 12 tΣt′ , t ∈ Rp. (7.7)

Thus the distribution of Y does depend on just µ and Σ, so that because Y1 and Y2have the same BiB′i (and the same mean), they have the same distribution.

Can µ and Σ be anything, or are there restrictions? Any µ is possible since thereare no restrictions on it in the definition. The covariance matrix Σ can be BB′ for anyp×M matrix B. Note that M is arbitrary, too. Clearly, Σ must be symmetric, but wealready knew that. It must also be nonnegative definite, which we define now.

Page 117: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

7.1. Definition 105

Definition 7.2. A symmetric p× p matrix Ω is nonnegative definite if

cΩc′ ≥ 0 for all 1× p vectors c. (7.8)

The Ω is positive definite if

cΩc′ > 0 for all 1× p vectors c 6= 0. (7.9)

Note that cBB′c′ = ‖cB‖2 ≥ 0, which means that Σ must be nonnegative definite.But from (2.54),

cΣc′ = Cov(Yc′) = Var(Yc′) ≥ 0, (7.10)

because all variances are nonnegative. That is, any covariance matrix has to be non-negative definite, not just multivariate normal ones.

So we know that Σ must be symmetric and nonnegative definite. Are there anyother restrictions, or for any symmetric nonnegative definite matrix is there a corre-sponding B? Yes. In fact, there are potentially many such square roots B. See Exercise7.8.5. A nice one is the symmetric square root which will be seen in (7.15).

We conclude that the multivariate normal distribution is defined for any (µ, Σ),where µ ∈ Rp and Σ is symmetric and positive definite, i.e., it can be any validcovariance matrix.

7.1.1 Spectral decomposition

To derive a symmetric square root, as well as perform other useful tasks, we need thefollowing decomposition. Recall from (5.51) that a p× p matrix Γ is orthogonal if

Γ′Γ = ΓΓ′ = Ip. (7.11)

Theorem 7.3. Spectral decomposition for symmetric matrices. If Ω is a symmetricp × p matrix, then there exists a p × p orthogonal matrix Γ and a unique p × p diagonalmatrix Λ with diagonals λ1 ≥ λ2 ≥ · · · ≥ λp such that

Ω = ΓΛΓ′. (7.12)

Exercise 7.8.1 shows that the columns of Γ are eigenvectors of Ω, with correspond-ing eigenvalues λi’s. Here are some more handy facts about symmetric Ω and itsspectral decomposition.

• Ω is positive definite if and only if all λi’s are positive, and nonnegative definiteif and only if all λi’s are nonnegative. (Exercises 7.8.2 and 7.8.3.)

• The trace and determinant are, respectively,

trace(Ω) = ∑ λi and |Ω| = ∏ λi. (7.13)

The trace of a square matrix is the sum of its diagonals. (Exercise 7.8.4.)

• Ω is invertible if and only if its eigenvalues are nonzero, in which case itsinverse is

Ω−1 = ΓΛ−1Γ′. (7.14)

Thus the inverse has the same eigenvectors, and eigenvalues 1/λi. (Exercise7.8.5.)

Page 118: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

106 Chapter 7. The Multivariate Normal Distribution

• If Ω is nonnegative definite, then with Λ1/2 being the diagonal matrix withdiagonal elements

√λi,

Ω1/2 = ΓΛ1/2Γ′ (7.15)

is a symmetric square root of Ω, that is, it is symmetric and Ω1/2Ω1/2 = Ω.(Follows from Exercise 7.8.5.)

The last item was used in the previous section to guarantee that any covariancematrix has a square root.

7.2 Some properties of the multivariate normal

We will prove the next few properties using the representation (7.2). They can alsobe easily shown using the mgf.

7.2.1 Affine transformations

Affine transformations of multivariate normals are also multivariate normal, becauseany affine transformation of a multivariate normal vector is an affine transformationof an affine transformation of a standard normal vector, and an affine transformationof an affine transformation is also an affine transformation. That is, suppose Y ∼Np(µ, Σ), and W = c + YD′ for q× p matrix D and q× 1 vector c. Then we know thatfor some B with BB′ = Σ, Y = µ + ZB′, where Z is a vector of iid standard normals.Hence

W = c + YD′ = c + (µ + ZB′)D′ = c + µD′ + Z(DB)′. (7.16)

Then by Definition 7.1,

W ∼ N(c + µD′, DBB′D′) = N(c + µD′, DΣD′). (7.17)

Of course, the mean and covariance result we already knew.As a simple but important special case, suppose X1, . . . , Xn are iid N(µ, σ2), so

that X = (X1, . . . , Xn)′ ∼ N(µ1n, In), where 1n is the n × 1 vector of all 1’s. ThenX = cX where c = (1/n)1′n. Thus

X ∼ N(

1n

1′nµ1n,1n

1′nσ2In1n

1n

)= N

(µ,

1n

σ2)

, (7.18)

since 1′n1n = n. This result checks with what we found in (4.45) using mgfs.

7.2.2 Marginals

Because marginals are special cases of affine transformations, marginals of multivari-ate normals are also multivariate normal. One needs just to pick off the appropriatemeans and covariances. So If Y = (Y1, . . . , Y5) is N5(µ, Σ), and W = (Y2, Y5), then

W ∼ N2

((µ2, µ5),

(σ22 σ25σ52 σ55

)). (7.19)

Here, σij is the ijth element of Σ, so that σii = σ2i . See Exercise 7.8.6.

Page 119: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

7.2. Some properties of the multivariate normal 107

7.2.3 Independence

In Section 3.3, we showed that independence of two random variables means thattheir covariance is 0, but that a covariance of 0 does not imply independence. But,with multivariate normals, it does: if (X, Y) is bivariate normal, and Cov(X, Y) = 0,then X and Y are independent. The next theorem proves a generalization of this inde-pendence to sets of variables. We will use (2.110), where for vectors X = (X1, . . . , Xp)and Y = (Y1, . . . , Yq),

Cov[X, Y] =

Cov[X1, Y1] Cov[X1, Y2] · · · Cov[X1, Yq]Cov[X2, Y1] Cov[X2, Y2] · · · Cov[X2, Yq]

......

. . ....

Cov[Xp, Y1] Cov[Xp, Y2] · · · Cov[Xp, Yq]

, (7.20)

the matrix of all possible covariances of one element from X and one from Y.

Theorem 7.4. SupposeW = (X, Y) (7.21)

is multivariate normal, where X is 1× p and Y is 1× q. If Cov[X, Y] = 0 (i.e., Cov[Xi, Yj] =0 for all i, j), then X and Y are independent.

Proof. For simplicitly, we will assume the mean of W is 0. Because covariances be-tween the Xi’s and Yj’s are zero,

Cov(W) =

(Cov(X) 0

0 Cov(Y)

). (7.22)

(The 0’s denote matrices of the appropriate size with all elements zero.) Both of thoseindividual covariance matrices have square roots, hence there are matrices B, p× p,and C, q× q, such that

Cov[X] = BB′ and Cov[Y] = CC′. (7.23)

Thus

Cov[W] = AA′ where A =

(B 00 C

). (7.24)

Then by definition, we know that we can represent the distribution of W with Z beinga 1× (p + q) vector of iid standard normals,

(X, Y) = W = ZA′ = (Z1, Z2)

(B 00 C

)′= (Z1B′, Z2C′), (7.25)

where Z1 and Z2 are 1× p and 1× q, respectively. But that means that

X = Z1B′ and Y = Z2C′. (7.26)

Because Z1 is independent of Z2, we have that X is independent of Y.

Note that this result can be extended to partitions of W into more than two groups.That is, if W = (X(1), . . . , X(K)), where each X(k) is a vector, then

W multivariate normal and Cov[X(k), X(l)] = 0 for all k 6= l

=⇒ X(1), . . . , X(K) are multually independent. (7.27)

Especially, if the covariance matrix of a multivariate normal vector is diagonal, thenall of the elements are mutually independent.

Page 120: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

108 Chapter 7. The Multivariate Normal Distribution

7.3 PDF

The multivariate normal has a pdf only if the covariance is invertible (i.e., positivedefinite). In that case, its pdf is easy to find using the same procedure used to findthe pdf of the bivariate normal in Section 5.4.1. Suppose Y ∼ Np(µ, Σ) where Σ is ap× p positive definite matrix. Let B be the symmetric square root of Σ as in (7.15),which is also positive definite. (Why?) Then, if Y is a row vector, Y = µ + ZB′, whereZ ∼ N(0p, Ip). Follow the steps as in (5.42) to (5.46). The only real difference is thatbecause we have p Zi’s, the power of the 2π is p/2 instead of 1. Thus

fY(y) =1

(2π)p/21√|Σ|

e−12 (y−µ)Σ−1

(y−µ)′ . (7.28)

If Y is a column vector, then the (y− µ) and (y− µ)′ are switched.

7.4 Sample mean and variance

Often one desires a confidence interval for the population mean. Specifically, supposeX1, . . . , Xn are iid N(µ, σ2). By (7.18), X ∼ N(µ, σ2/n), so that

Z =X− µ

σ/√

n∼ N(0, 1). (7.29)

This Z is called a pivotal quantity, meaning it has a known distribution even thoughits definition includes unknown parameters. Then

P[−1.96 <

X− µ

σ/√

n< 1.96

]= 0.95, (7.30)

or, untangling the equations to get µ in the middle,

P[

X− 1.96σ√n< µ < X + 1.96

σ√n

]= 0.95. (7.31)

Thus, (X− 1.96

σ√n

, X + 1.96σ√n

)is a 95% confidence interval for µ, (7.32)

at least if σ is known.But what if σ is not known? Then you estimate it, which will change the distribu-

tion, that is,X− µ

σ/√

n∼ ?? (7.33)

The sample variance for a sample x1, . . . , xn is

s2 =∑(xi − x)2

n, or is it s2

∗ =∑(xi − x)2

n− 1? (7.34)

Rather than worry about that question now, we will find the joint distribution of themean with the numerator:

(X, U), where U = ∑(Xi − X)2. (7.35)

Page 121: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

7.4. Sample mean and variance 109

To start, note that the deviations xi − x are linear functions of the xi’s, as of courseis x, and we know how to deal with linear combinations of normals. That is, lettingX = (X1, . . . , Xn)′ be the column vector of observations, because the elements are iid,

X ∼ N(µ1n, σ2In), (7.36)

where 1n is the n× 1 vector of all 1’s. Then as in Exercise 2.7.5 and equation (7.18),the mean can be written X = (1/n)1′nX, and the deviations

X1 − XX2 − X

...Xn − X

= X− 1nX = InX− 1n

1n1′nX = HnX, (7.37)

where

Hn = In −1n

1n1′n (7.38)

is the n× n centering matrix. It is called the centering matrix because for any n× 1vector a, Hna subtracts the mean of the elements from each element, centering thevalues at 0. Note that if all the elements are the same, centering will set everythingto 0, i.e.,

Hn1n = 0n. (7.39)

Also, if the mean of the elements already is 0, centering does nothing, which inparticular means that Hn(Hna) = Hna, or

HnHn = Hn. (7.40)

Such a matrix is called idempotent. It is not difficult to verify (7.40) directly usingthe definition (7.38) and multiplying things out. In fact, In and (1/n)1n1′n are alsoidempotent.

Back to the task. To analyze the mean and deviations together, we stack them:X

X1 − X...

Xn − X

=

1n 1′n

Hn

X. (7.41)

Equation (7.41) gives explicitly that the vector containing the mean and deviationsis a linear transformation of a multivariate normal, hence the vector is multivariatenormal. The mean and covariance are

E

XX1 − X

...Xn − X

=

1n 1′n

Hn

µ1n =

1n 1′n1n

Hn1n

µ =

(µ0n

)(7.42)

Page 122: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

110 Chapter 7. The Multivariate Normal Distribution

and

Cov

XX1 − X

...Xn − X

=

1n 1′n

Hn

σ2In

1n 1′n

Hn

= σ2

1n2 1′n1n

1n 1′nHn

1n Hn1n HnHn

= σ2

1n 0′n

0n Hn

. (7.43)

Look at the 0n’s: The covariances between X and the deviations HnX are zero, hencewith the multivariate normality means they are independent. Further, we can readoff the distributions:

X ∼ N(

µ,1n

σ2)

and HnX ∼ N(0n, σ2Hn). (7.44)

The first we already knew. But because X and HnX are independent, and U =‖HnX‖2 is a function of just HnX,

X and U = ‖HnX‖2 = ∑(Xi − X)2 are independent. (7.45)

The next section goes through development of the χ2 distribution, which eventually(Lemma 7.6) shows that U/σ2 is χ2

n−1.

7.5 Chi-square distribution

In Exercises 1.7.13, 1.7.14, and 2.7.11, we defined the central and noncentral chi-square distributions on one degree of freedom. Here we look at the more generalchi-squares.

Definition 7.5. Suppose the ν× 1 vector Z ∼ N(0, Iν). Then

W = Z21 + · · ·+ Z2

ν = Z′Z (7.46)

has the central chi-square distribution on ν degrees of freedom, written

W ∼ χ2ν. (7.47)

Often one drops the “central” when referring to this distribution, unless trying todistinguish it from the noncentral chi-square coming in Section 7.5.3. (It is also oftencalled “chi-squared.”)

The expected value and variance of a central chi-square are easy to find since weknow that for Z ∼ N(0, 1), E[Z2] = Var[Z] = 1, and E[Z4] = 3 since the kurtosis κ4is 0. Thus Var[Z2] = E[Z4]− E[Z2]2 = 2. For W ∼ χ2

ν, by (7.46),

E[W] = νE[Z2] = ν and Var[W] = νVar[Z2] = 2ν. (7.48)

Page 123: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

7.5. Chi-square distribution 111

Also, if W1, . . . , Wk are independent, with Wk ∼ χ2νk

, then

W1 + · · ·+ WK ∼ χ2ν1+···+νK

, (7.49)

because each Wk is a sum of νk independent standard normal squares, so the sum ofthe Wk’s is the sum of all ν1 + · · ·+ νK independent standard normal squares.

For the pdf, recall that in Exercise 1.7.13, we showed that a χ21 is a Gamma(1/2,

1/2). Thus a χ2ν is a sum of ν independent such gammas. These gammas all have the

same rate λ = 1/2, hence we just add up the α’s, which are all 1/2, hence

χ2ν = Gamma

2,

12

), (7.50)

as can be ascertained from Table 1.1 on page 7. This representation is another way toverify (7.48) and (7.49).

Now suppose Y ∼ N(µ, Σ), where Y is p× 1. We can do a multivariate standard-ization analogous to the univariate one in (7.29) if Σ is invertible:

Z = Σ−12 (Y− µ). (7.51)

Here, we will take Σ−1/2 to be inverse of the symmetric square root of Σ as in (7.15),though any square root will do. Since Z is a linear transformation of X, it is multi-variate normal. It is easy to see that E[Z] = 0. For the covariance:

Cov[Z] = Σ−12 Cov[Y]Σ−

12 = Σ−

12 ΣΣ−

12 = Ip. (7.52)

ThenZ′Z = (Y− µ)′Σ−

12 Σ−

12 (Y− µ) = (Y− µ)′Σ−1(Y− µ) ∼ χ2

p (7.53)

by (7.46) and (7.47). Random variables of the form (y − a)′C(y − a) are calledquadratic forms. We can use this random variable as a pivotal quantity, so that ifΣ is known, then

µ | (y− µ)′Σ−1(y− µ) ≤ χ2p,α (7.54)

is a 100× (1− α)% confidence region for µ, where χ2p,α is the (1− α)th quantile of the

χ2q. This region is an ellipsoid.

7.5.1 Noninvertible covariance matrix

We would like to apply this result to the HnX vector in (7.44), but we cannot use (7.54)directly because the covariance matrix of HnX, σ2Hn, is not invertible. In general, ifΣ is not invertible, then instead of its regular inverse, we use the Moore-Penroseinverse, which is a pseudoinverse, meaning it is not a real inverse but in some situ-ations acts like one. To define it for nonnegative definite symmetric matrices, firstlet Σ = ΓΛΓ′ be the spectral decomposition (7.12) of Σ. If Σ is not invertible, thensome of the diagonals (eigenvalues) of Λ will be zero. Suppose there are ν positiveeigenvalues. Since the λi’s are in order from largest to smallest, we have that

λ1 ≥ λ2 ≥ · · · ≥ λν > 0 = λν+1 = · · · = λp. (7.55)

The Moore-Penrose inverse uses a formula similar to that in (7.14), but in the innermatrix, takes reciprocal of just the positive λi’s. That is, let Λ1 be the ν× ν diagonal

Page 124: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

112 Chapter 7. The Multivariate Normal Distribution

matrix with diagonals λ1, . . . , λν. Then the Moore-Penrose inverse of Σ and its squareroot are defined to be

Σ+ = Γ

(Λ−1

1 00 0

)Γ′ and (Σ+)

12 = Γ

(Λ− 1

21 00 0

)Γ′, (7.56)

respectively. See Exercise 12.7.10 for general matrices.Now if Y ∼ N(µ, Σ), we let

Z+ = (Σ+)12 (Y− µ). (7.57)

Again, E[Z+] = 0, but the covariance is

Cov[Z+] = (Σ+)12 Σ(Σ+)

12

= Γ

(Λ− 1

21 00 0

)Γ′Γ

(Λ1 00 0

)Γ′Γ

(Λ− 1

21 00 0

)Γ′

= Γ

(Λ− 1

21 00 0

)(Λ1 00 0

)(Λ− 1

21 00 0

)Γ′

= Γ

(Iν 00 0

)Γ′. (7.58)

To get rid of the final two orthogonal matrices, we set

Z = Γ′Z+ ∼ N(

0,(

Iν 00 0

)). (7.59)

Note that the elements of Z are independent, the first ν of them are N(0, 1), and thelast p− ν are N(0, 0), which means they are identically 0. Hence we have

(Z+)′Z+ = Z′Z = Z21 + . . . + Z2

ν ∼ χ2ν. (7.60)

To summarize,

(Y− µ)′Σ+(Y− µ) ∼ χ2ν, ν = #Positive eigenvalues of Σ. (7.61)

Note that this formula is still valid if Σ is invertible, because then Σ+ = Σ−1.

7.5.2 Idempotent covariance matrix

Recall that a matrix H is idempotent if HH = H. It turns out that the Moore-Penroseinverse of a symmetric idempotent matrix H is H itself. To see this fact, let H = ΓΛΓ′

be the spectral decomposition, and write

HH = H =⇒ ΓΛΓ′ΓΛΓ′ = ΓΛΓ′

=⇒ ΓΛ2Γ′ = ΓΛΓ′

=⇒ Λ2 = Λ. (7.62)

Page 125: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

7.5. Chi-square distribution 113

Since Λ is diagonal, we must have that λ2i = λi for each i. But then each λi must be 0

or 1. Letting ν be the number that are 1, since the eigenvalues have the positive onesfirst,

H = Γ

(Iν 00 0

)Γ′, (7.63)

hence (7.56) with Λ1 = Iν shows that the Moore-Penrose inverse of H is H. Also, by(7.13), the trace of a matrix is the sum of its eigenvalues, hence in this case

trace(H) = ν. (7.64)

Thus (7.61) shows that

Y ∼ N(µ, H) =⇒ (Y− µ)′(Y− µ) ∼ χ2trace(H), (7.65)

where we use the fact that (Y− µ) and H(Y− µ) have the same distribution.Finally turn to (7.44), where we started with X ∼ N(µ1n, σ2In), and derived in

(7.44) that HnX ∼ N(0, σ2Hn). Then by (7.65) with Y = (1/σ2)HnX, we have

HnX ∼ N(0, Hn) =⇒ 1σ2 X′HnX ∼ χ2

trace(Hn). (7.66)

Exercise 7.8.11 shows that

X′HnX =n

∑i=1

(Xi − X)2 and trace(Hn) = n− 1. (7.67)

Together with (7.44) and (7.45), we have the following.

Lemma 7.6. If X1, . . . , Xn are iid N(µ, σ2), then X and ∑(Xi − X)2 are independent, with

X ∼ N(

µ,1n

σ2)

and ∑(Xi − X)2 ∼ σ2χ2n−1. (7.68)

(U ∼ σ2χ2ν means that U/σ2 ∼ χ2

ν.)

Since E[χ2ν] = ν, E[∑(Xi − X)2] = (n − 1)σ2. Thus of the two sample variance

formulas in (7.34), only the second is unbiased, meaning it has expected value σ2:

E[S2∗] = E

[∑(Xi − X)2

n− 1

]=

(n− 1)σ2

n− 1= σ2. (7.69)

(Which doesn’t mean it is better than S2.)

7.5.3 Noncentral chi-square distribution

Definition 7.5 of the central chi-square assumed that the mean of the normal vectorwas zero. The noncentral chi-square allows arbitrary means:

Definition 7.7. Suppose Z ∼ N(γ, Iν). Then

W = Z′Z = ‖Z‖2 (7.70)

has the noncentral chi-squared distribution on ν degrees of freedom with noncentralityparameter ∆ = ‖γ‖2, written

W ∼ χ2ν(∆). (7.71)

Page 126: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

114 Chapter 7. The Multivariate Normal Distribution

Note that the central χ2 is the noncentral chi-square with ∆ = 0. See Exercise7.8.20 for the mgf, and Exercise 7.8.22 for the pdf, of the noncentral chi-square.

This definition implies that the distribution depends on the parameter γ throughjust the noncentrality parameter. That is, if Z and Z∗ are multivariate normal with thesame covariance matrix Iν but different means γ and γ∗, repectively, that ‖Z‖2 and‖Z∗‖2 have the same distribution as long as ‖γ‖2 = ‖γ∗‖2. Is that claim plausible?The key is that if Γ is an orthogonal matrix, then ‖ΓZ‖2 = ‖Z‖2. Thus Z and ΓZwould lead to the same chi-squared. Take Z ∼ N(γ, Iν), and let Γ be the orthogonalmatrix such that

Γγ =

‖γ‖

0...0

=

∆0...0

. (7.72)

Any orthogonal matrix whose first row is γ/‖γ‖ will work. Then

‖Z‖2 =D ‖ΓZ‖2, (7.73)

and the latter clearly depends on just ∆, which shows that the definition is fine. (Ofcourse, inspecting the mgf or pdf will also prove the claim.)

Analogous to (7.49), it can be shown that if W1 and W2 are independent, then

W1 ∼ χ2ν1(∆1) and W2 ∼ χ2

ν2(∆2) =⇒ W1 + W2 ∼ χν1+ν2 (∆1 + ∆2). (7.74)

For the mean and variance, we start with Zi ∼ N(γi, 1), so that Z2i ∼ χ2

1(γ2i ).

Exercise 6.8.19 finds the mean and variance of such:

E[χ21(γ

2i )] = E[Z2

i ] = 1 + γ2i and Var[χ2

1(γ2i )] = Var[Z2

i ] = 2 + 4γ2i . (7.75)

Thus for Z ∼ N(γ, Iν), W ∼ χ2ν(‖γ‖2), hence

E[W] = E[‖Z‖2] =ν

∑i=1

E[Z2i ] = ν +

ν

∑i=1

γ2i = ν + ‖γ‖2, and

Var[W] = Var[‖Z‖2] =ν

∑i=1

Var[Z2i ] = 2ν + 4

ν

∑i=1

γ2i = 2ν + 4‖γ‖2. (7.76)

7.6 Student’s t distribution

Here we answer the question of how to find a confidence interval for µ as in (7.32)when σ is unknown. The pivotal quantity we use is

T =X− µ

S∗/√

n, S2∗ =

∑(Xi − X)2

n− 1. (7.77)

Exercise 6.8.17 introduced the Student’s t, finding its mean, variance, and pdf. Forour purposes here, if Z ∼ N(0, 1) and U ∼ χ2

ν, where Z and U are independent, then

T =Z√U/ν

∼ tν, (7.78)

Page 127: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

7.7. Linear models and the conditional distribution 115

Student’s t on ν degrees of freedom.From Lemma 7.6, we have the condition for (7.78) satisfied for ν = n− 1 by setting

Z =X− µ

σ/√

nand U =

∑(Xi − X)2

σ2 . (7.79)

Then

T =(X− µ)/(σ/

√n)√

∑(Xi−X)2/σ2

n−1

=X− µ

S∗/√

n∼ tn−1. (7.80)

A 95% confidence interval for µ is

X± tn−1,0.025s∗√

n, (7.81)

where tν,α/2 is the cutoff point that satisfies

P[−tν,α/2 < tν < tν,α/2] = 1− α. (7.82)

7.7 Linear models and the conditional distribution

Expanding on the simple linear model in (6.33), we consider the conditional model

Y |X = x ∼ N(α + xβ, Σe) and X ∼ N(µX , ΣXX), (7.83)

where Y is 1× q, X is 1× p, and β is a p× q matrix. As in Section 6.7.1,

E = Y− α− Xβ and X (7.84)

are independent and multivariate normal, and their joint distribution is also multi-variate normal:

(X, E) ∼ N((µX , 0q),

(ΣXX 0

0 Σe

)). (7.85)

(Here, 0q is a row vector.)To find the joint distribution of X and Y, we note that (X, Y) is a affine transfor-

mation of (X, E), hence is multivariate normal. Specifically,

(X, Y) = (0p, α) + (X, E)(

Ip β0 Iq

)∼ N

((µX , µY),

(ΣXX ΣXYΣYX ΣYY

)), (7.86)

where

(µX , µY) = (0p, α) + (µX , 0q)

(Ip β0 Iq

)= (µX , α + µX β) (7.87)

and (ΣXX ΣXYΣYX ΣYY

)=

(Ip 0β′ Iq

)(ΣXX 0

0 Σe

)(Ip β0 Iq

)=

(ΣXX ΣXX β

β′ΣXX Σe + β′ΣXX β

). (7.88)

Page 128: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

116 Chapter 7. The Multivariate Normal Distribution

We invert the above process to find the conditional distribution of Y given X fromthe joint. First, we solve for α, β, and Σe in terms of the µ’s and Σ’s in (7.88):

β = Σ−1XXΣXY , α = µY − µX β, and Σe = ΣYY − ΣYXΣ−1

XXΣXY . (7.89)

Lemma 7.8. Suppose

(X, Y) ∼ N((µX , µY),

(ΣXX ΣXYΣYX ΣYY

)), (7.90)

where ΣXX is invertible. Then

Y |X = x ∼ N(α + xβ, Σe), (7.91)

where α, β and Σe are given in (7.89).

The lemma deals with row vectors X and Y. For the record, here is the result forcolumn vectors:(

XY

)∼ N

((µXµY

),(

ΣXX ΣXYΣYX ΣYY

))=⇒ Y |X = x ∼ N(α + βx, Σe), (7.92)

where Σe is as in (7.89), but now

β = ΣYXΣ−1XX and α = µY − βµX . (7.93)

Chapter 12 goes more deeply into linear regression.

7.8 Exercises

Exercises 7.8.1 to 7.8.5 are based on the p × p symmetric matrix Ω with spectraldecomposition ΓΛΓ′ as in Theorem 7.3 on page 105.

Exercise 7.8.1. Let γ1, . . . ,γp be the columns of Γ. The p× 1 vector v is a eigenvectorof Ω with corresponding eigenvalue a if Ωv = av. Show that for each i, γi is aneigenvector of Ω. What is the eigenvalue corresponding to γi? [Hint: Show that Γ′γihas one element equal to one, and the rest zero.]

Exercise 7.8.2. (a) For 1× p vector c, show that cΩc′ = ∑pi=1 b2

i λi for some vectorb = (b1, . . . , bp). [Hint: Let b = cΓ.] (b) Suppose λi > 0 for all i. Argue that Ω ispositive definite. (c) Suppose λi ≤ 0 for some i. Find a c 6= 0 such that cΩc′ ≤ 0.[Hint: You can use one of the columns of Ω, transposed.] (d) Do parts (b) and (c)show that Ω is positive definite if and only if all λi > 0?

Exercise 7.8.3. (a) Suppose λi ≥ 0 for all i. Argue that Ω is nonnegative definite. (b)Suppose λi < 0 for some i. Find a c such that cΩc′ < 0. (c) Do parts (a) and (b) showthat Ω is nonnegative definite if and only if all λi ≥ 0?

Exercise 7.8.4. (a) Show that |Ω| = ∏pi=1 λi. [Use can use that fact that |AB| = |A||B|

for square matrices. Also, recall from (5.55) that the determinant of an orthogonalmatrix is ±1.] (b) Show that trace(Ω) = ∑

pi=1 λi. [Hint: Use the fact that trace(AB′) =

trace(B′A), if the matrices are the same dimensions.]

Page 129: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

7.8. Exercises 117

Exercise 7.8.5. (a) Show that if all the λi’s are nonzero, then Ω−1 = ΓΛ−1Γ′. (b)Suppose Ω is nonnegative definite, and Ψ is any p× p orthogonal matrix. Show thatB = ΓΛ1/2Ψ′ is a square root of Ω, that is, Ω = BB′.

Exercise 7.8.6. Suppose Y is 1× 5, Y ∼ N(µ, Σ). Find the matrix B so that YB′ =(Y2, Y5) ≡ W. Show that the distribution of W is given in (7.19).

Exercise 7.8.7. Let

Σ =

(5 22 4

). (7.94)

(a) Find the upper triangular matrix A,

A =

(a b0 c

), (7.95)

such that Σ = AA′, where both a and c are positive. (b) Find A−1 (which is alsoupper triangular). (c) Now suppose X = (X1, X2)

′ (a column vector) is multivariatenormal with mean (0, 0)′ and covariance matrix Σ from above. Let Y = A−1X. Whatis Cov[Y]? (d) Are Y1 and Y2 independent?

Exercise 7.8.8. Suppose

X =

X1X2X3

∼ N3

µµµ

, σ2

1 ρ ρρ 1 ρρ ρ 1

. (7.96)

(So all the means are equal, all the variances are equal, and all the covariance areequal.) Let Y = AX, where

A =

1 1 11 −1 01 1 −2

. (7.97)

(a) Find E[Y] and Cov[Y]. (b) True or false: (i) Y is multivariate normal; (ii) Y1, Y2 andY3 are identically distributed; (iii) The Yi’s are pairwise independent; (iv) The Yi’s aremutually independent.

Exercise 7.8.9. True or false: (a) If X ∼ N(0, 1) and Y ∼ N(0, 1), and Cov[X, Y] = 0,then X and Y are independent. (b) If Y | X = x ∼ N(0, 4) and X ∼ Uniform(0, 1), thenX and Y are independent. (c) Suppose X ∼ N(0, 1) and Z ∼ N(0, 1) are independent,and Y = Sign(Z)|X| (where Sign(x) is +1 if x > 0, and −1 if x < 0, and Sign(0) = 0).True or false: (i) Y ∼ N(0, 1); (ii) (X, Y) is bivariate normal; (iii) Cov[X, Y] = 0; (iv)(X, Z) is bivariate normal. (d) If (X, Y) is bivariate normal, and Cov[X, Y] = 0.5, thenX and Y are independent. (e) If X ∼ N(0, 1) and Y ∼ N(0, 1), and Cov[X, Y] = 0.5,then (X, Y) is bivariate normal. (f) Suppose (X, Y, Z) is multivariate normal, andCov[X, Y] = Cov[X, Z] = Cov[Y, Z] = 0. True or false: (i) X, Y and Z are pairwiseindependent; (ii) X, Y and Z are mutually independent.

Exercise 7.8.10. Suppose (XY

)∼ N

((00

),(

1 ρρ 1

)), (7.98)

Page 130: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

118 Chapter 7. The Multivariate Normal Distribution

and let

W =

XYX2

Y2

. (7.99)

The goal of the exercise is to find Cov[W]. (a) What are E[XY], E[X2] , and E[Y2]?(b) Both X2 and Y2 are distributed χ2

1. What are Var[X2] and Var[Y2]? (c) What isthe conditional distribution Y |X = x? (d) To find Var[XY], first condition on X = x.Find E[XY |X = x] and Var[XY |X = x]. Then find Var[XY]. [Hint: Use (6.43).] (e)Now Cov[X2, XY] can be written Cov[E[X2|X], E[XY|X]] + E[Cov[X2, XY|X]]. Whatis E[X2|X = x]? Find Cov[X2, XY], which is then same as Cov[Y2, XY]. (f) Finally,for Cov[X2, Y2], first find Cov[X2, Y2 | X = x] and E[Y2 | X = x]. Thus what isCov[X2, Y2]?

Exercise 7.8.11. Suppose X is n× 1 and Hn is the n× n centering matrix, Hn = In −(1/n)1n1′n. (a) Show that X′HnX = ∑n

i=1(Xi − X)2. (b) Show that trace(Hn) = n− 1.

Exercise 7.8.12. Suppose X1, . . . , Xn are iid N(µ, σ2), and Y1, . . . , Ym are iid N(γ, σ2),and the Xi’s and Yi’s are independent. Let U = ∑(Xi − X)2, V = ∑(Yi −Y)2. (a) AreX, Y, U and V mutually independent? (b) Let D = X−Y. What is the distribution ofD? (c) The distribution of U + V is σ2 times what distribution? What are the degreesof freedom? (d) What is an unbiased estimate of σ2? (It should depend on both Uand V.) (e) Let W be that unbiased estimator of σ2. Find the function of D, W, n, m, µ,and γ that is distributed as a Student’s t. (f) Now take n = m = 5. A 95% confidenceinterval for µ− γ is then D ± c× se(µ− γ). What are c and se(µ− γ)?

Exercise 7.8.13. Suppose Y1, . . . , Yn | B = β are independent, where

Yi | B = β ∼ N(βxi, σ2). (7.100)

The xi’s are assumed to be known fixed quantities. Also, marginally, B ∼ N(0, σ20 ).

The σ2 and σ20 are assumed known. (a) The conditional pdf of (Y1, . . . , Yn) can be

writtenfY | B(y1, . . . , yn | B = β) = ae−

12 β2 Ceβ D (7.101)

for some C and D, where the a does not depend on β at all. What are C and D? (Theyshould be functions of the xi’s, yi’s, and σ2.) (b) Similarly, the marginal pdf of B canbe written

fB(β) = a∗e−12 β2 Leβ M, (7.102)

where a∗ does not depend on β. What are L and M? (c) The joint pdf of (Y1, . . . , Yn, B)is thus

f (y1, . . . , yn, β) = aa∗e−12 β2 Reβ S, (7.103)

What are R and S? (d) What is the posterior distribution of B, B | (Y1, . . . , Yn) =(y1, . . . , yn)? (It should be normal.) What are the posterior mean and variance of B?(e) Let β = ∑ xiYi/ ∑ x2

i . Show that

β | B = β ∼ N

(β,

σ2

∑ x2i

). (7.104)

Page 131: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

7.8. Exercises 119

(The mean and variance were found in Exercise 2.7.6.) Is the posterior mean of Bequal to β? Is the posterior variance of B equal to the conditional variance of β? (f)Find the limits of the posterior mean and variance of B as the prior variance, σ2

0 , goesto ∞. Is the limit of the posterior mean of B equal to β? Is the limit of the posteriorvariance of B equal to the conditional variance of β?

Exercise 7.8.14. Consider the Bayesian model with

Y |M = µ ∼ N(µ, σ2) and M ∼ N(µ0, σ20 ), (7.105)

where µ0, σ2 > 0, and σ20 > are known. (a) Making the appropriate identifications in

(7.83), show that

(M, Y) ∼ N((µ0, µ0),

(σ2

0 σ20

σ20 σ2 + σ2

0

)). (7.106)

(b) Next, show that

M |Y = y ∼ N

(σ2µ0 + σ2

0 yσ2 + σ2

0,

σ2σ20

σ2 + σ20

). (7.107)

(c) The precision of a random variable is the inverse of the variance. Let ω2 = 1/σ2

and ω20 = 1/σ2

0 be the precisions for the distributions in (7.105). Show that forthe conditional distribution in (7.107), E[M |Y = y] = (ω2

0µ0 + ω2y)/(ω20 + ω2), a

weighted average of the prior mean and observation, weighted by their respectiveprecisions. Also show that the conditional precision of M |Y = y is the sum of thetwo precisions, ω2

0 + ω2. (d) Now suppose Y1, . . . , Yn given M = µ are iid N(µ, σ2),and M is distributed as above. What is the conditional distribution of Y |M = µ?Show that

M |Y = y ∼ N

(σ2µ0 + nσ2

0 yσ2 + nσ2

0,

σ2σ20

σ2 + nσ20

). (7.108)

(e) Find ly and uy such that

P[ly < M < uy |Y = y] = 0.95. (7.109)

That interval is a 95% probability interval for µ. [See (6.72).]

Exercise 7.8.15. Consider a multivariate analog of the posterior mean in Exercise7.8.14. Here,

Y |M = µ ∼ Np(µ, Σ) and M ∼ Np(µ0, Σ0), (7.110)

where Σ, Σ0, and µ0 are known, and the two covariance matrices are invertible. (a)Show that the joint distribution of (Y, M) is multivariate normal. What are the pa-rameters? (They should be multivariate analogs of those in (7.106).) (b) Show that theconditional distribution of M |Y = y is multivariate normal with

E[M |Y = y] = µ0 + (y− µ0)(Σ0 + Σ)−1Σ0 (7.111)

andCov[M |Y = y] = Σ0 − Σ0(Σ0 + Σ)−1Σ0. (7.112)

Page 132: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

120 Chapter 7. The Multivariate Normal Distribution

(c) Let the precision matrices be defined by Ω = Σ−1 and Ω0 = Σ−10 . Show that

Σ0 − Σ0(Σ0 + Σ)−1Σ0 = (Ω0 + Ω)−1. (7.113)

[Hint: Start by setting (Σ0 + Σ)−1 = Ω(Ω0 + Ω)−1Ω0, then try to simplify. At somepoint you may want to note that Ip − Ω(Ω0 + Ω)−1 = Ω0(Ω0 + Ω)−1.] (d) Usesimilar calculations on E[M |Y = y] to finally obtain that

M |Y = y ∼ N(µ∗, (Ω0 + Ω)−1), where µ∗ = (µ0Ω0 + yΩ)(Ω0 + Ω)−1. (7.114)

Note the similarity to the univariate case in (7.108). (e) Show that marginally, Y ∼N(µ0, Σ0 + Σ).

Exercise 7.8.16. The distributional result in Exercise 7.8.15 leads to a simple answerto a particular complete-the-square problem. The joint distribution of (Y, M) in thatexercise can be expressed two ways, depending on which variable is conditionedupon first. That is, using simplified notation,

f (y, µ) = f (y | µ) f (µ) = f (µ | y) f (y). (7.115)

Focussing on just the terms in the exponents in the last two expression in (7.115),show that

(y− µ)Ω(y− µ)′ + (µ− µ0)Ω0(µ− µ0)

= (µ− µ∗)(Ω0 + Ω)(µ− µ∗) + (y− µ0)(Ω−10 + Ω−1)−1(y− µ0)

′ (7.116)

for µ∗ in (7.114). Thus the right-hand-side completes the square in terms of µ.

Exercise 7.8.17. Exercise 6.8.17 introduced Student’s t distribution. This exercisetreats a multivariate version. Suppose

Z ∼ N(0p, Ip) and U ∼ Gamma(ν/2, 1/2), and set T =1

(U/ν)Z. (7.117)

Then T has the standard multivariate Student’s t distribution on ν degrees of freedom,written T ∼ tp,ν. Note that T can be either a row vector or column vector. (a) Showthat the joint distribution of (T, U) can be represented by

T |U = u ∼ N(0, (ν/u)Ip) and U ∼ Gamma(ν/2, 1/2). (7.118)

Write down the joint pdf. (b) Show that E[T] = 0 (if ν > 1) and Cov[T] = (ν/(ν−2))Ip (if ν > 2). Are the elements of T uncorrelated? Are they independent? (c) Showthat the marginal pdf of T is

fν,p(t) =Γ((ν + p)/2)

Γ(ν/2)(√

νπ)p1

(1 + ‖t‖2/ν)(ν+p)/2. (7.119)

Exercise 7.8.18. If U and V are independent, with U ∼ χ2ν and V ∼ χ2

µ, then

W =U/ν

V/µ(7.120)

Page 133: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

7.8. Exercises 121

has the Fν,µ distribution. (a) Let X = νW/(µ + νW). Argue that X is Beta(α, β),and give the parameters in terms of µ and ν. (b) From Exercise 5.6.2, we know thatY = X/(1− X) has pdf fY(y) = cyα−1/(1 + y)α+β, where c is the constant for thebeta. Give W as a function of Y. (c) Show that the pdf of W is

h(w | ν, µ) =Γ((ν + µ)/2)Γ(ν/2)Γ(µ/2)

(ν/µ)ν/2 wν/2−1

(1− (ν/µ)w)(ν+µ)/2. (7.121)

[Hint: Use the Jacobian technique to find the pdf of W from that of Y in part (b).] (d)Suppose T ∼ tk. Argue that T2 is F, and give the degrees of freedom for the F. [Whatis the definition of Student’s t?]

Exercise 7.8.19. Suppose X1, . . . , Xn are independent N(µX , σ2X)’s, and Y1, . . . , Ym are

independent N(µY , σ2Y)’s, where the Xi’s are independent of the Yi’s. Also, let

S2X =

∑ni=1(Xi − X)2

n− 1and S2

Y =∑m

i=1(Yi −Y)2

m− 1. (7.122)

(a) For what constant τ, depending on σ2X and σ2

Y , is

F = τS2

XS2

Y(7.123)

distributed as an Fν,µ? Give ν, µ. This F is a pivotal quantity. (b) Find l and u, asfunctions of S2

X , S2Y , such that

P[l <σ2

Xσ2

Y< u] = 95%. (7.124)

Exercise 7.8.20. Suppose Z1, . . . , Zν are independent, with Zi ∼ N(µi, 1), so that

W = Z21 + · · ·+ Z2

ν ∼ χ2ν(∆), ∆ = ‖µ‖2, (7.125)

as in Definition 7.5. From Exercise 6.8.19, we know that the mgf of the Z2i is

(1− 2t)−1/2 eµ2i t/(1−2t). (7.126)

(a) What is the mgf of W? Does it depend on the µi’s through just ∆? (b) Considerthe distribution on (U, X) given by

U |X = k ∼ χ2ν+2k and X ∼ Poisson(λ), (7.127)

so that U is a Poisson(λ) mixture of χ2ν+2k’s. By Exercise 6.8.19, we have that the

marginal mgf of U is

MU(t) = (1− 2t)−ν/2 eλ 2t/(1−2t). (7.128)

By matching the mgf in (7.128) with that of W in part (a), show that W ∼ χ2ν(∆) is a

Poisson(∆/2) mixture of χ2ν+2k’s.

Page 134: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

122 Chapter 7. The Multivariate Normal Distribution

Exercise 7.8.21. This and the next two exercises use generalized hypergeometricfunctions. For nonnegative integers p and q, the function pFq is a function of real (orcomplex) y, with parameters α = (α1, . . . , αp) and β = (β1, . . . , βq), given by

pFq(α ; β ; y) =∞

∑k=0

( p

∏i=1

Γ(αi + k)Γ(αi)

) q

∏j=1

Γ(β j)

Γ(β j + k)

yk

k!

. (7.129)

If either p or q is zero, then the corresponding product of gammas is just 1. Dependingon the values of the y and the parameters, the function may or may not converge. (a)Show that 0F0(− ; − ; y) = ey, where the “−” is a placeholder for the nonexistent αor β. (b) Show that for |y| < 1 and α > 0, 1F0(α ; − ; y) = (1− y)−α. [Hint: Expand(1− z)−α in a Taylor series about z = 0 (so a Maclaurin series), and use the fact thatΓ(a + 1) = aΓ(a) as in Exercise 1.7.10(b).] (c) Show that the mgf of Z ∼ Beta(α, β) isMZ(t) = 1F1(α ; α + β ; y). The 1F1 is called the confluent hypergeometric function.[Hint: In the integral for E[eZt], expand the exponential in its Mclaurin series.]

Exercise 7.8.22. From Exercise 7.8.20, we have that W ∼ χ2ν(∆) has the same distri-

bution as U in (7.127), where λ = ∆/2. (a) Show that the marginal pdf of W is

f (w | ν, ∆) = g(w | ν)e−∆/2∞

∑k=0

Γ(ν/2)Γ(ν/2 + k)

1k!

(∆w4

)k, (7.130)

where g(w | ν) is the pdf of the central χ2ν. (b) Show that the pdf in part (a) can be

writtenf (w | ν, ∆) = g(w | ν)e−∆/2

0F1(− ; ν/2 ; ∆w/4). (7.131)

Exercise 7.8.23. If U and V are independent, with U ∼ χ2ν(∆) and V ∼ χ2

µ, then

Y =U/ν

V/µ∼ Fν,µ(∆), (7.132)

the noncentral F with degrees of freedom (ν, µ) and noncentrality parameter ∆. (Sothat if ∆ = 0, Y is central F as in (7.120).) The goal of this exercise is to derive thepdf of the noncentral F. (a) From Exercise 7.8.20, we know that the distribution of Ucan be represented as in (7.127) with λ = ∆/2. Let Z = U/(U + V). The conditionaldistribution of Z |X = k is then a beta. What are its parameters? (b) Write down themarginal pdf of Z. Show that it can be written as

fZ(z) = c(z ; a, b) e−∆/21F1((ν + µ)/2 ; ν/2 ; w), (7.133)

where c(z ; a, b) is the Beta(a, b) pdf, and 1F1 is defined in (7.129). Give the a, b, andw in terms of ν, µ, ∆, and z. (c) Since Z = νY/(µ + νY), we can find the pdf of Y from(7.133) using the same Jacobian as in Exercise 7.8.18(b). Show that the pdf of Y canbe written as

fY(y) = h(y | ν, µ)e−∆/21F1((ν + µ)/2 ; ν/2 ; w), (7.134)

where h is the pdf of Fν,µ in (7.121) and w is the same as in part (b) but written interms of y.

Page 135: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

7.8. Exercises 123

Exercise 7.8.24. The material in Section 7.7 can be used to find a useful matrix iden-tity. Suppose Σ is a (p+ q)× (p+ q) symmetric matrix whose inverse C ≡ Σ−1 exists.Partition these matrices into blocks as in (7.88), so that(

ΣXX ΣXYΣYX ΣYY

)& C =

(CXX CXYCYX CYY

), (7.135)

where ΣXX and CXX are p× p, ΣYY and CYY are q× q, etc. With β = Σ−1XXΣXY , let

A =

(Ip 0−β′ Iq

). (7.136)

(a) Find A−1. [Hint: Just change the sign on the β.] (b) Show that

AΣA′ =(

ΣXX 00 Σe

)where Σe = ΣYY − ΣYXΣ−1

XXΣXY (7.137)

from (7.89). (c) Take inverses on the two sides of the first equation in (7.137) to showthat

C = A′(

Σ−1XX 00 Σ−1

e

)A =

(Σ−1

XX + Σ−1XXΣXYΣ−1

e ΣYXΣ−1XX −Σ−1

XXΣXYΣ−1e

−Σ−1e ΣYXΣ−1

XX Σ−1e

).

(7.138)In particular, CYY = Σ−1

e .

Page 136: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work
Page 137: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

Chapter 8

Asymptotics: Convergence in Probability andDistribution

So far we have been concerned with finding the exact distribution of random variablesand functions of random variables. Especially in estimation or hypothesis testing,functions of data can become quite complicated, so it is necessary to find approx-imations to their distributions. One way to address such difficulties is to look atwhat happens when the sample size is large, or actually, as it approaches infinity. Inmany cases, nice asymptotic results are available, and they give surprisingly goodapproximations even when the sample size is nowhere near infinity.

8.1 Set-up

We assume that we have a sequence of random variables, or random vectors. That is,for each n, we have a random p× 1 vector Wn with spaceWn(⊂ Rp) and probabilitydistribution Pn. There need not be any particular relationship between the Wn’sfor different n’s, but in the most common situation we will deal with, Wn is somefunction of iid X1, . . . , Xn, so as n → ∞, the function is based on more and moreobservations.

The two types of convergence we will consider are convergence in probability toa constant (Section 8.2) and convergence in distribution to a random vector (Section8.4).

8.2 Convergence in probability to a constant

A sequence of constants an approaching the constant c means that as n → ∞, angets arbitrarily close to c; technically, for any ε > 0, eventually |an − c| < ε. Thatdefinition does not immediately transfer to random variables. For example, supposeXn is the mean of n iid N(µ, σ2)’s. We will see that the law of large numbers saysthat as n → ∞, Xn → µ. But that cannot be always true, since no matter how large nis, the space of Xn is R. On the other hand, the probability is high that Xn is close toµ. That is, for any ε > 0,

Pn[|Xn − µ| < ε] = P[|N(0, 1)| <√

nε/σ] = Φ(√

nε/σ)−Φ(−√

nε/σ), (8.1)

125

Page 138: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

126 Chapter 8. Asymptotics: Convergence in Probability and Distribution

where Φ is the distribution function of Z ∼ N(0, 1). Now let n → ∞. Because Φ is adistribution function, the first Φ on the right in (8.1) goes to 1, and the second goesto 0, so that

Pn[|Xn − µ| < ε] −→ 1. (8.2)

Thus Xn isn’t for sure close to µ, but is with probability 0.9999999999 (assuming n islarge enough). Now for the definition.

Definition 8.1. The sequence of random variables Wn converges in probability to theconstant c, written

Wn −→P c, (8.3)

if for every ε > 0,Pn[|Wn − c| < ε] −→ 1. (8.4)

If Wn is a sequence of random p× 1 vectors, and c is a p× 1 constant vector, then Wn →P cif for every ε > 0,

Pn[‖Wn − c‖ < ε] −→ 1. (8.5)

It turns out that Wn →P c if and only if each component Wni →P ci, whereWn = (Wn1, . . . , Wnp)′.

As an example, suppose X1, . . . , Xn are iid Beta(2,1), which has space (0,1), pdffX(x) = 2x, and distribution function F(x) = x2 for 0 < x < 1. Denote the minimumof the Xi’s by X(1). You would expect that as the number of observations between0 and 1 increase, the minimum would get pushed down to 0. So the question iswhether X(1) →P 0. To prove it, take any 1 > ε > 0 (Why is it ok for us to ignoreε ≥ 1?), and look at

Pn[|X(1) − 0| < ε] = Pn[X(1) < ε], (8.6)

because X(1) is positive. That final probability is F(1)(ε), where F(1) is the distributionfunction of X(1). Now the minimum is larger than ε if and only if all the observationsare larger than ε, and since the observations are independent, we can write

F(1)(ε) = 1− Pn[X(1) ≥ ε]

= 1− P[X1 ≥ ε]n

= 1− (1− FX(ε))n

= 1− (1− ε2)n −→ 1 as n→ ∞. (8.7)

(Alternatively, we could use the formula for the pdf of order statistics in (5.95).) Thus

minX1, . . . , Xn −→P 0. (8.8)

The examples in (8.1) and (8.7) are unusual in that we can calculate the probabili-ties exactly. It is more common that some inequalities are used, such as Chebyshev’sin the next section.

8.3 Chebyshev’s inequality and the law of large numbers

The most basic result for convergence in probability is the following.

Page 139: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

8.3. Chebyshev’s inequality and the law of large numbers 127

Lemma 8.2. Weak law of large numbers (WLLN). If X1, . . . , Xn are iid with (finite) meanµ, then

Xn −→P µ. (8.9)

See Theorem 2.2.9 in Durrett (2010) for a proof. We will prove a slightly weakerversion, where we assume that the variance of the Xi’s is finite as well. First we needan inequality.

Lemma 8.3. Chebyshev’s inequality. For random variable W and ε > 0,

P[|W| ≥ ε] ≤ E[W2]

ε2 . (8.10)

Proof.

E[W2] = E[W2 I[|W| < ε]] + E[W2 I[|W| ≥ ε]]

≥ E[W2 I[|W| ≥ ε]]

≥ ε2 E[I[|W| ≥ ε]]

= ε2 P[|W| ≥ ε]. (8.11)

Then (8.10) follows.

A similar proof can be applied to any nondecreasing function φ(w) : [0, ∞) → Rto show that

P[|W| ≥ ε] ≤ E[φ(|W|)]φ(ε)

. (8.12)

Chebyshev’s inequality uses φ(w) = w2. The general form is called Markov’s in-equality.

Suppose X1, . . . , Xn are iid with mean µ and variance σ2 < ∞. Then using Cheby-shev’s inequality with W = Xn − µ, we have that for any ε > 0,

P[|Xn − µ| ≥ ε] ≤ Var[Xn]

ε2 =σ2

nε2 −→ 0. (8.13)

Thus P[|Xn − µ| ≤ ε]→ 1, and Xn →P µ.The weak law of large numbers can be applied to means of functions of the Xi’s.

For example, if E[X2i ] < ∞, then

1n

n

∑i=1

X2i −→

P E[X2i ] = µ2 + σ2, (8.14)

because the X21 , . . . , X2

n are iid with mean µ2 + σ2.Not only means of functions, but functions of the mean are of interest. For exam-

ple, if the Xi’s are iid Exponential(λ), then the mean is 1/λ, so that

Xn −→P1λ

. (8.15)

But we really want to estimate λ. Does

1Xn−→P λ ? (8.16)

Page 140: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

128 Chapter 8. Asymptotics: Convergence in Probability and Distribution

We could find the mean and variance of 1/Xn, but more simply we note that if Xnis close to 1/λ, 1/Xn must be close to λ, because the function 1/w is continuous.Formally, we have the following mapping result.

Lemma 8.4. If Wn →P c, and g(w) is a function continuous at w = c, then

g(Wn) −→P g(c). (8.17)

Proof. By definition of continuity, for every ε > 0, there exists a δ > 0 such that

|w− c| < δ =⇒ |g(w)− g(c)| < ε. (8.18)

Thus the event on the right happens at least as often as that on the left, i.e.,

Pn[|Wn − c| < δ] ≤ Pn[|g(Wn)− g(c)| < ε]. (8.19)

The definition of→P means that Pn[|Wn − c| < δ]→ 1 for any δ > 0, butPn[|g(Wn)− g(c)| < ε] is larger, hence

Pn[|g(Wn)− g(c)| < ε] −→ 1, (8.20)

proving (8.17).

Thus the answer to (8.16) is “Yes.” Such an estimator is said to be consistent. Thislemma also works for vector Wn, that is, if g(w) is continuous at c, then

Wn −→P c =⇒ g(Wn) −→P g(c). (8.21)

For example, suppose X1, . . . , Xn are iid with mean µ and variance σ2 < ∞. Thenby (8.9) and (8.14),

Wn =

(Xn

1n ∑ X2

i

)−→P

σ2 + µ2

). (8.22)

Letting g(w1, w2) = w2 − w21, we have g(xn, ∑ x2

i /n) = s2n, the sample variance (with

denominator n), henceS2

n −→P (σ2 + µ2)− µ2 = σ2. (8.23)

Also, Sn →P σ.

8.3.1 Regression through the origin

Consider regression through the origin, that is, (X1, Y2), . . . , (Xn, Yn) are iid,

E[Yi |Xi = xi] = βxi, Var[Yi |Xi = xi] = σ2e , E[Xi] = µX , Var[Xi] = σ2

X > 0. (8.24)

We will see later (Exercise 12.7.18) that the least squares estimate of β is

βn =∑n

i=1 xiyi

∑ni=1 x2

i. (8.25)

Page 141: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

8.4. Convergence in distribution 129

Is this a consistent estimator? We know by (8.14) that

1n

n

∑i=1

X2i −→

P µ2X + σ2

X . (8.26)

Also, X1Y1, . . . , XnYn are iid, and

E[XiYi |Xi = xi] = xi E[Yi |Xi = xi] = βx2i , (8.27)

henceE[XiYi] = E[βX2

i ] = β(µ2X + σ2

X). (8.28)

Thus the WLLN shows that

1n

n

∑i=1

XiYi −→P β(µ2X + σ2

X). (8.29)

Now consider

Wn =

(1n

n

∑i=1

XiYi,1n

n

∑i=1

X2i

)and c = (β(µ2

X + σ2X), µ2

X + σ2X), (8.30)

so that Wn →P c. The function g(w1, w2) = w1/w2 is continuous at w = c, hence(8.21) shows that

g(Wn) =1n ∑n

i=1 XiYi1n ∑n

i=1 X2i−→P g(c) =

β(µ2X + σ2

X)

µ2X + σ2

X, (8.31)

that is,

βn =∑n

i=1 XiYi

∑ni=1 X2

i−→P β. (8.32)

So, yes, the least squares estimator is consistent.

8.4 Convergence in distribution

Convergence to a constant is helpful, but generally more information is needed, asfor confidence intervals. E.g, if we can say that

θ − θ

SE(θ)≈ N(0, 1), (8.33)

then an approximate 95% confidence interval for θ would be

θ ± 2× SE(θ), (8.34)

where the “2" is approximately 1.96. Thus we need to find the approximate distri-bution of a random variable. In the asymptotic setup, we need the notion of Wnconverging to a random variable. It is formalized by looking at the respective distri-bution function for each possible value, almost.

Page 142: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

130 Chapter 8. Asymptotics: Convergence in Probability and Distribution

Definition 8.5. Suppose Wn is a sequence of random variables, and W is a random variable.Let Fn be the distribution function of Wn, and F be the distribution function of W. Then Wnconverges in distribution to W if

Fn(w) −→ F(w) (8.35)

for every w ∈ R at which F is continuous. This convergence is written

Wn −→D W. (8.36)

For example, go back to X1, . . . , Xn iid Beta(2,1), but now let

Wn = n X(1), (8.37)

where again X(1) is the minimum of the n observations. The minimum itself goes to0, but by multiplying by n it may not. The distribution function of Wn is FWn (w) = 0if w ≤ 0 and if w > 0, and using calculations as in (8.7) with ε = w/n,

FWn (w) = P[Wn ≤ w] = P[X(1) ≤ w/n] = 1− (1− (w/n)2)n. (8.38)

Now let n→ ∞. We will use the fact that for a sequence cn,

limn→∞

(1− cn

n

)n= e− limn→∞ cn (8.39)

if the limit exists. Applying this equation to (8.38), we have cn = w2/n, which goesto 0. Thus FWn (w) → 1− 1 = 0. This limit is not distribution function, hence nX(1)does not have a limit in distribution. What happens is that nX(1) is going to ∞. Somultiplying by n is too strong. What about Vn =

√nX(1)? Then we can show that for

v > 0,FVn (v) = 1− (1− v2/n)n −→ 1− e−v2

, (8.40)

since cn = v2 in (8.39). For v ≤ 0, FVn (v) = 0, since X(1) > 0. Thus the limit is 0, too.Hence

FVn (v) −→

1− e−v2if v > 0

0 if v ≤ 0. (8.41)

Is the right-hand side a distribution function of some random variable? Yes, indeed.Thus

√nX(1) does have a limit in distribution, the distribution function being given

in (8.41).For another example, suppose Xn ∼ Binomial(n, λ/n) for some fixed λ > 0. The

distribution function of Xn is

Fn(x) =

0 if x < 0

∑floor(x)i=0 f (i | n, λ/n) if 0 ≤ x ≤ n

1 if n < x, (8.42)

where floor(x) is the largest integer less than or equal to x, and f (i | n, λ/n) is theBinomial(n, λ/n) pmf. Taking the limit as n → ∞ of Fn requires taking the limit ofthe f ’s, so we will do that first. For a positive integer i ≤ n,

f(

i | n,λ

n

)=

n!i!(n− i)!

n

)i (1− λ

n

)n−i

=λi

i!n(n− 1) · · · (n− i + 1)

ni

(1− λ

n

)n (1− λ

n

)−i. (8.43)

Page 143: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

8.4. Convergence in distribution 131

Now i is fixed and n→ ∞. Consider the various factors on the right. The first has non’s. The second has i terms on the top, and ni on the bottom, so can be written

n(n− 1) · · · (n− i + 1)ni =

nn

n− 1n· · · n− i + 1

n

= 1(

1− 1n

)· · ·

(1− i− 1

n

)→ 1. (8.44)

The third term goes to e−λ, and the fourth goes to 1. Thus for any positive integer i,

f(

i | n,λ

n

)−→ e−λ λi

i!, (8.45)

which is the pmf of the Poisson(λ). Going back to the Fn in (8.42), note that no matterhow large x is, as n → ∞, eventually x < n, so that the third line never comes intoplay. Thus

Fn(x) −→

0 if x < 0

∑floor(x)i=0 e−λ λi

i! if 0 ≤ x. (8.46)

But that is the distribution function of the Poisson(λ), i.e.,

Binomial(

n,λ

n

)−→D Poisson(λ). (8.47)

8.4.1 Points of discontinuity of F

In the definition, the convergence (8.35) does not need to hold at w’s for which F is notcontinuous. This relaxation exists because sometimes the limit of the Fn’s will havepoints at which the function is continuous from the left but not the right, whereasF’s need to be continuous from the right. For example, take Wn = Z + 1/n, whereZ ∼ Bernoulli(1/2). It seems reasonable that Wn −→D Bernoulli(1/2). Let Fn be thedistribution function for Wn:

Fn(w) =

0 if w < 1/n1/2 if 1/n ≤ w < 1 + 1/n

1 if 1 + 1/n ≤ w.. (8.48)

Now let n→ ∞, so that

Fn(w) −→

0 if w ≤ 01/2 if 0 < w ≤ 1

1 if 1 < w. (8.49)

That limit is not a distribution function, though it would be if the ≤’s and <’s wereswitched, in which case it would be the F for a Bernoulli(1/2). Luckily, the definitionallows the limit to be wrong at points of discontinuity, which are 0 and 1 in thisexample, so we can say that Wn −→D Bernoulli(1/2).

Page 144: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

132 Chapter 8. Asymptotics: Convergence in Probability and Distribution

8.4.2 Converging to a constant random variable

It could be that Wn −→D W, where P[W = c] = 1, that is, W is a constant randomvariable. (Sounds like an oxymoron.) But that looks like Wn is converging to aconstant. It is:

Wn −→D W where P[W = c] = 1 if and only if Wn −→P c. (8.50)

Let Fn(w) be the distribution function of Wn, and F the distribution function of W,so that

F(w) =

0 if w < c1 if c ≤ w . (8.51)

For any ε > 0,

P[|Wn − c| ≤ ε] = P[Wn ≤ c + ε]− P[Wn < c− ε]. (8.52)

Also,P[Wn ≤ c− 3ε/2] ≤ P[Wn < c− ε] ≤ P[Wn ≤ c− ε/2], (8.53)

hence, because Fn(w) = P[Wn ≤ w],

Fn(c + ε)− Fn(c− ε/2) ≤ P[|Wn − c| ≤ ε] ≤ Fn(c + ε)− Fn(c− 3ε/2). (8.54)

We use this equation to show (8.50).

1. First suppose that Wn →D W, so that Fn(w) → F(w) if w 6= c. Then applyingthe convergence to w = c + ε and c− ε/2,

Fn(c + ε)→ F(c + ε) = 1 andFn(c− ε/2)→ F(c− ε/2) = 0. (8.55)

Then (8.54) shows that P[|Wn − c| ≤ ε]→ 1, proving that Wn →P c.

2. Next, suppose Wn −→P c. Then P[|Wn − c| ≤ ε]→ 1, hence from (8.54),

Fn(c + ε)− Fn(c− 3ε/2) −→ 1. (8.56)

The only way that can happen is for

Fn(c + ε)→ 1 and Fn(c− 3ε/2)→ 0. (8.57)

But since that holds for any ε > 0, for any w < c, Fn(w)→ 0, and for any w > c,Fn(w)→ 1. That is, Fn(w)→ F(w) for w 6= c, proving that Wn →D W.

8.5 Moment generating functions

It may be difficult to find distribution functions and their limits. Often momentgenerating functions are easier to work with. Recall that if two random variableshave the same moment generating function that is finite in a neighborhood of 0, thenthey have the same distribution. It is true also of limits, that is, if the mgfs of asequence of random variables converge to a mgf, then the random variable converge.See Section 30 of Billingsley (1995) for a proof.

Page 145: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

8.6. Central limit theorem 133

Lemma 8.6. Suppose W1, W2, . . . is a sequence of random variables, where Mn(t) is the mgfof Wn, and suppose W is a random variable with mgf M(t). If for some ε > 0, Mn(t) < ∞and M(t) < ∞ for all n and all |t| < ε,

Wn −→D W if and only if Mn(t) −→ M(t) for all |t| < ε. (8.58)

Looking again at the binomial example in (8.42), if Xn ∼ Binomial(n, λ/n), thenusing (2.83) we see that its mgf is

Mn(t) =((

1− λ

n

)+

λ

net)n

=

(1 +−λ + λet

n

)n

. (8.59)

Letting n→ ∞,

Mn(t) −→ e−λ+λet, (8.60)

which is the mgf of a Poisson(λ), showing again that (8.47) holds.

8.6 Central limit theorem

We know that sample means tend to the population mean, if the latter exists. Butwe can obtain more information with a distributional limit. In the normal case, weknow that the sample mean is normal, and with appropriate normalization, it isstandard normal, i.e., Normal(0,1). A central limit theorem is one that says a properlynormalized sample mean approaches normality even if the original variables are notnormal.

Start with X1, . . . , Xn iid with mean 0 and variance 1, and mgf MX(t), which isfinite for |t| < ε for some ε > 0. The variance of Xn is 1/n, so to normalize it wemultiply by

√n:

Wn =√

n Xn. (8.61)

To find the asymptotic distribution of Wn, we first find its mgf, Mn(t):

Mn(t) = E[etWn ] = E[et√

n Xn ]

= E[e(t/√

n) ∑ Xi ]

= E[e(t/√

n) Xi ]n

= MX(t/√

n)n. (8.62)

Now Mn(t) is finite if MX(t/√

n) is, and MX(t/√

n) < ∞ if |t/√

n| < ε, which iscertainly true if |t| < ε. That is, Mn(t) < ∞ if |t| < ε.

To find the limit of Mn, first take logs:

log(Mn(t)) = n log(MX(t/√

n)) = ncX(t/√

n), (8.63)

where cX(t) = log(MX(t)) is the cumulant generating function for a single Xi. Ex-pand cX in a Taylor series about t = 0:

cX(t) = cX(0) + t c′X(0) +t2

2c′′X(t

∗), t∗ between 0 and t. (8.64)

Page 146: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

134 Chapter 8. Asymptotics: Convergence in Probability and Distribution

But cX(0) = 0, and c′X(0) = E[X] = 0, by assumption. Thus substituting t/√

n for tin (8.64) yields

cX(t/√

n) =t2

2nc′′X(t

∗n), t∗n between 0 and t/

√n, (8.65)

hence by (8.63),

log(Mn(t)) =t2

2c′′X(t

∗n). (8.66)

The mgf MX(t) has all its derivatives as long as |t| < ε, which means so does cX . Inparticular, c′′X(t) is continuous at t = 0. As n → ∞, t∗n gets squeezed between 0 andt/√

n, hence t∗n → 0, and

log(Mn(t)) =t2

2c′′X(t

∗n)→

t2

2c′′X(0) =

t2

2Var[Xi] =

t2

2, (8.67)

because we have assumed that Var[Xi] = 1. Finally,

Mn(t) −→ et2/2, (8.68)

which is the mgf of a N(0,1), i.e.,√

n Xn −→D N(0, 1). (8.69)

There are many central limit theorems, depending on various assumptions, butthe most basic is the following.

Theorem 8.7. Central limit theorem. Suppose X1, X2, . . . are iid with mean 0 and variance1. Then (8.69) holds.

What we proved using (8.68) required the mgf be finite in a neighborhood of 0.This theorem does not need mgfs, only that the variance is finite. A slight general-ization of the theorem has X1, X2, . . . iid with mean µ and variance σ2, 0 < σ2 < ∞,and concludes that √

n (Xn − µ) −→D N(0, σ2). (8.70)

E.g., see Theorem 27.1 of Billingsley (1995).

8.6.1 Supersizing

Convergence in distribution immediately translates to multivariate random variables.That is, suppose Wn is a p× 1 random vector with distribution function Fn. Then

Wn −→D W (8.71)

for some p× 1 random vector W with distribution function F if

Fn(w) −→ F(w) (8.72)

for all points w ∈ R at which F(w) is continuous.If Mn(t) is the mgf of Wn, and M(t) is the mgf of W, and these mgfs are all finite

for ‖t‖ < ε for some ε > 0, then

Wn −→D W iff Mn(t) −→ M(t) for all ‖t‖ < ε. (8.73)

Page 147: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

8.6. Central limit theorem 135

Now for the central limit theorem. Suppose X1, X2, . . . are iid random vectors withmean µ, finite covariance matrix Σ, and mgf MX(t) < ∞ for ‖t‖ < ε. Set

Wn =√

n(Xn − µ). (8.74)

Let a be any p× 1 vector, and write

a′Wn =√

n(a′Xn − a′µ) =√

n

(1n

n

∑i=1

a′Xi − a′µ

). (8.75)

Thus a′Wn is the normalized sample mean of the a′Xi’s, and the regular central limittheorem (actually, equation 8.70) can be applied, where σ2 = Var[a′Xi] = a′Σa:

a′Wn −→D N(0, a′Σa). (8.76)

But then that means the mgfs converge in (8.76): Letting t = ta,

E[et(a′Wn)] −→ et22 a′Σa, (8.77)

for t‖a‖ < ε. Now switch notation so that a = t and t = 1, and we have that

Mn(t) = E[et′Wn ] −→ e12 t′Σt, (8.78)

which holds for any ‖t‖ < ε. The right hand side is the mgf of a N(0, Σ), so

Wn =√

n(Xn − µ) −→D N(0, Σ). (8.79)

Example. Suppose X1, X2, . . . are iid with mean µ and variance σ2. One might beinterested in the joint distribution of the sample mean and variance, after some nor-malization. When the data are normal, we know the answer exactly, of course, butwhat about otherwise? We won’t answer that question quite yet, but take a step bylooking at the joint distribution of the sample means of the Xi’s and the X2

i ’s. We willassume that Var[X2

i ] < ∞. We start with

Wn =√

n

(1n

n

∑i=1

(XiX2

i

)−(

µµ2 + σ2

)). (8.80)

Then the central limit theorem says that

Wn −→D N(

02, Cov(

XiX2

i

)). (8.81)

Look at that covariance. We know Var[Xi] = σ2. Also,

Var[X2i ] = E[X4

i ]− E[X2i ]

2 = µ′4 − (µ2 + σ2)2, (8.82)

andCov[Xi, X2

i ] = E[X3i ]− E[Xi]E[X2

i ] = µ′3 − µ(µ2 + σ2), (8.83)

Page 148: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

136 Chapter 8. Asymptotics: Convergence in Probability and Distribution

where µ′k = E[Xki ], the raw kth moment from (2.55).

It’s not pretty, but the final answer is

√n

(1n

n

∑i=1

(XiX2

i

)−(

µµ2 + σ2

))

−→D N(

02,(

σ2 µ′3 − µ(µ2 + σ2)µ′3 − µ(µ2 + σ2) µ′4 − (µ2 + σ2)2

)). (8.84)

8.7 Exercises

Exercise 8.7.1. Suppose X1, X2, . . . all have E[Xi] = µ and Var[Xi] = σ2 < ∞, andthey are uncorrelated. Show that Xn → µ in probability.

Exercise 8.7.2. Consider the random variable Wn with

P[Wn = 0] = 1− 1n

and P[Wn = an] =1n

(8.85)

for some constants an, n = 1, 2, . . .. (a) For each given sequence an, find the limits asn → ∞, when existing, for Wn, E[Wn], and Var[Wn]. (i) an = 1/n. (ii) an = 1. (iii)an =

√n. (iv) an = n. (v) an = n2. (b) For which sequences an in part (a) can one

use Chebyshev’s inequality to find the limit of Wn in probability? (c) Does Wn →P cimply that E[Wn]→ c?

Exercise 8.7.3. Suppose Xn ∼ Binomial(n, 1/2), and let Wn = Xn/n. (a) What isthe limit of Wn in probability? (b) Suppose n is even. Find P[Wn = 1/2]. Does thisprobability approach 1 as n→ ∞?

Exercise 8.7.4. Let f be a function on (0, 1) with∫ 1

0 f (u)du < ∞. Let U1, U2, . . . be iid

Uniform(0,1)’s, and let Xn = ( f (U1) + · · ·+ f (Un))/n. Show that Xn →∫ 1

0 f (u)duin probability.

Exercise 8.7.5. Suppose X1, . . . , Xn are iid N(µ, 1). Find the exact probability P[|Xn −µ| > ε], and the bound given by Chebyshev’s inequality, for ε = 0.1 for various valuesof n. Is the bound very close to the exact probability?

Exercise 8.7.6. Suppose(X1Y1

), · · · ,

(XnYn

)are iid N

((µXµY

),(

σ2X ρσXσY

ρσXσY σ2Y

)). (8.86)

Assume σ2X > 0 and σ2

Y > 0, and let S2X = ∑n

i=1(Xi − Xn)2/n and S2Y = ∑n

i=1(Yi −Yn)2/n. Find the limits in probability of the following. (Actually, the answers do notdepend on the normality assumption.) (a) ∑n

i=1 XiYi/n. (b) SXY = ∑ni=1(Xi −X)(Yi −

Y)/n. (c) Rn = SXY/(SXSY). (d) What if instead of dividing by n for S2X , S2

Y , andSXY , we divide by n− 1?

Page 149: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

8.7. Exercises 137

Exercise 8.7.7. The distribution of a random variable Y is called a mixture of twodistributions if its distribution function is FY(y) = (1 − ε)F1(y) + εF2(y), y ∈ R,where F1 and F2 are distribution functions, and 0 < ε < 1. The idea is that withprobability 1− ε, Y has distribution F1, and with probability ε, it has distribution F2.Now let Yn have the following mixture distribution: N(µ, 1) with probability 1− εn,and N(n, 1) with probability εn, where εn ∈ (0, 1). (a) Write down the distributionfunction of Yn in terms of Φ, the distribution function of a N(0, 1). (b) Let εn → 0as n → ∞. What is the limit of the distribution function in (a)? What does that sayabout the distribution of Y, the limit of the Yn’s? (c) What is E[Y]? Find E[Yn] and itslimit when (i) εn = 1/

√n. (ii) εn = 1/n. (iii) εn = 1/n2. (d) Does Yn →D Y imply

that E[Yn]→ E[Y]?

Exercise 8.7.8. Suppose Xn is Geometric(1/n). What is the limit in distribution ofXn/n? [Hint: First find the distribution function of Yn = Xn/n.]

Exercise 8.7.9. Suppose U1, U2, . . . , Un are iid Uniform(0,1), and U(1) is their mini-mum. Then U(1) has distribution function Fn(u) = 1− (1− u)n, u ∈ (0, 1), as in(8.7). (What is Fn(u) for u ≤ 0 or u ≥ 1?) (a) What is the limit of Fn(u) as n → ∞for u ∈ (0, 1)? (b) What is the limit for u ≤ 0? (c) What is the limit for u ≥ 1? (d)Thus the limit of Fn(u) is the distribution function (at least for u 6= 0) of what randomvariable? Choose among (i) a constant random variable, with value 0; (ii) a constantrandom variable, with value 1; (iii) a Uniform(0,1); (iv) an Exponential(1); (v) none ofthe above.

Exercise 8.7.10. Continue with the setup in Exercise 8.7.9. Let Vn = nU(1), andlet Gn(v) be its distribution function. (a) For v ∈ (0, n), Gn(v) = P[Vn ≤ v] =P[U(1) ≤ c] = Fn(c) for some c. What is c (as a function of v, n)? (b) Find Gn(v).(c) What is the limit of Gn(v) as n → ∞ for v > 0? (d) That limit is the distributionfunction of what distribution? Choose among (i) a constant random variable, withvalue 0; (ii) a constant random variable, with value 1; (iii) a Uniform(0,1); (iv) anExponential(1); (v) none of the above.

Exercise 8.7.11. Continue with the setup in Exercises 8.7.9 and 8.7.10. (a) Find thedistribution function of

√n U(1). What is its limit for y > 0? What is the limit in

distribution of√

n U(1), if it exists? (b) Same question, but for n2U(1).

Exercise 8.7.12. Suppose X1, X2, . . . , Xn are iid Exponential(1). Let Wn = bnX(1),where X(1) is the minimum of the Xi’s and bn > 0. (a) Find the distribution functionF(1) of X(1), and show that FWn (w) = F(1)(w/bn) is the distribution function of Wn.(b) For each of the following sequences, decide whether Wn goes in probability to aconstant, goes in distribution to a non-constant random variable, or does not have alimit in distribution: (i) bn = 1; (ii) bn = log(n); (iii) bn = n; (iv) bn = n2. If Wngoes to a constant, give the constant, and if it goes to a random variable, specify thedistribution function. [Hint: Find the limit of FWn (w) for each fixed w. Note thatsince Wn is always positive, we automatically have that FWn (w) = 0 for w < 0.]

Exercise 8.7.13. Again, X1, X2, . . . , Xn are iid Exponential(1). Let Un = X(n) − an. (a)Find the distribution function of Un, Fn(u). (b) For each of the following sequences,decide whether Un goes in probability to a constant, goes in distribution to a non-constant random variable, or does not have a limit in distribution: (i) an = 1; (ii)

Page 150: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

138 Chapter 8. Asymptotics: Convergence in Probability and Distribution

an = log(n); (iii) an = n; (iv) an = n2. If Un goes to a constant, give the constant,and if it goes to a random variable, specify the distribution function. [Hint: Thedistribution function of Un in each case will be of the form (1− cn/n)n, for whichyou can use (8.39). Also, recall the Gumbel distribution, from (5.103). Note that youmust deal with any u unless an = 1, since for any x > 0, eventually x− an < u.]

Exercise 8.7.14. This question considers the asymptotic distribution of the samplemaximum, X(n), based on X1, . . . , Xn iid. Specifically, what do constants an and bn

need to be so that Wn = bnX(n) − an →D W, where W is a non-constant random vari-able? The constants, and W, will depend on the distribution of the Xi’s. In the partsbelow, find the appropriate an, bn, and W for the Xi’s having the given distribution.[Find the distribution function of the Wn, and see what an and bn have to be for thatto go to a non-trivial distribution function.] (a) Exponential(1). (b) Uniform(0,1). (c)Beta(α, 1). (This uses the same an and bn as in part (b).) (d) Gumbel(0), from (5.103).(e) Logistic.

Exercise 8.7.15. Here, X1, . . . , Xn, are iid Laplace(0, 1), so they have mgf M(s) =1/(1− s2). Let Zn =

√n X = ∑ Xi/

√n. (a) What is the mgf M∑ Xi (s) of ∑ Xi? (b)

What is the mgf MZn (t) of Zn? (c) What is the limit of MZn (t) as n → ∞? (d) Whatrandom variable is that the mgf of?

Page 151: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

Chapter 9

Asymptotics: Mapping and the ∆-Method

The law of large numbers and central limit theorem are useful on their own, but theycan be combined in order to find convergence for many more interesting situations.In this chapter we look at mapping and the ∆-method. Lemma 8.4 on page 128 andequation (8.21) deal with mapping, where Wn →P c implies that g(Wn)→P g(c) if gis continuous at c. General mapping results allow mixing convergences in probabilityand in distribution.

The ∆-method extends the central limit theorem to normalized functions of thesample mean.

9.1 Mapping

The next lemma lists a number of mapping results, all of which relate to one another.

Lemma 9.1. Mapping.

1. If Wn →D W, and g : R→ R is continuous at all points inW , the space of W, then

g(Wn) −→D g(W). (9.1)

2. Similarly, for multivariate Wn (p× 1) and g (q× 1): If Wn →D W, and g : Rp → Rq

is continuous at all points inW , then

g(Wn) −→D g(W). (9.2)

3. The next results constitute what is usually called Slutsky’s theorem, or sometimesCramér’s theorem. Suppose that

Zn −→P c and Wn −→D W. (9.3)

ThenZn + Wn −→D c + W, ZnWn −→D cW, (9.4)

andif c 6= 0,

Wn

Zn−→D W

c. (9.5)

139

Page 152: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

140 Chapter 9. Asymptotics: Mapping and the ∆-Method

4. Generalizing Slutsky, if Zn −→P c and Wn −→D W, and g : R × R → R iscontinuous at c ×W , then

g(Zn, Wn) −→D g(c, W). (9.6)

5. Finally, the multivariate version of #4. If Zn −→P c (p1 × 1) and Wn −→D W(p2 × 1), and g : Rp1 ×Rp2 → Rq is continuous at c ×W , then

g(Zn, Wn) −→D g(c, W). (9.7)

All the other four follow from #2, because convergence in probability is the sameas convergence in distribution to a constant random variable. Also, #2 is just themultivariate version of #1, so basically they are all the same. They appear variousplaces in various forms, though. The idea is that as long as the function is continuous,the limit of the function is the function of the limit. See Theorem 29.2 of Billingsley(1995) for an even more general result.

Together with the law of large numbers and central limit theorem, these mappingresults can prove a huge number of useful approximations. The t statistic is oneexample. Let X1, . . . , Xn be iid with mean µ and variance σ2 ∈ (0, ∞). Student’s tstatistic is defined as

Tn =√

nXn − µ

S∗n, where S2

∗n =∑n

i=1(Xi − Xn)2

n− 1. (9.8)

We know that if the data are normal, this Tn ∼ tn−1 exactly, but if the data are notnormal, who knows what the distribution is. We can find the limit, though, usingSlutsky. Take

Zn = S∗n and Wn =√

n (Xn − µ). (9.9)

ThenZn −→P σ and Wn −→D N(0, σ2) (9.10)

from (8.22) and the central limit theorem, respectively. Because σ2 > 0, the finalcomponent of (9.5) shows that

Tn =Wn

Zn−→D N(0, σ2)

σ= N(0, 1). (9.11)

Thus for large n, Tn is approximately N(0, 1) even if the data are not normal, and

Xn ± 2S∗n√

n(9.12)

is an approximate 95% confidence interval for µ. Notice that this result doesn’t sayanything about small n, especially it doesn’t say that the t is better than the z whenthe data are not normal. Other studies have shown that the t is fairly robust, so it canbe used at least when the data are approximately normal. Actually, heavy tails forthe Xi’s means light tails for the Tn, so the z might be better than t in that case.

Page 153: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

9.2. ∆-method 141

9.1.1 Regression through the origin

Recall the example in Section 8.3.1. We know βn is a consistent estimator of β, butwhat about its asymptotic distribution? That is,

√n (βn − β) −→D ?? (9.13)

We need to do some manipulation to get it into a form where we can use the centrallimit theorem, etc. To that end,

√n (βn − β) =

√n

(∑n

i=1 XiYi

∑ni=1 X2

i− β

)

=√

n∑n

i=1 XiYi − β ∑ni=1 X2

i

∑ni=1 X2

i

=√

n∑n

i=1(XiYi − βX2i )

∑ni=1 X2

i

=

√n ∑n

i=1(XiYi − βX2i )/n

∑ni=1 X2

i /n. (9.14)

The numerator in the last expression contains the sample mean of the (XiYi −βX2

i )’s. Conditionally, from (8.24),

E[XiYi− βX2i | Xi = xi] = xiβxi− βx2

i = 0, Var[XiYi− βX2i | Xi = xi] = x2

i σ2e , (9.15)

so that unconditionally,

E[XiYi − βX2i ] = 0, Var[XiYi − βX2

i ] = E[X2i σ2] + Var[0] = σ2

e (σ2X + µ2

X). (9.16)

Thus the central limit theorem shows that

√n

n

∑i=1

(XiYi − βX2i )/n −→D N(0, σ2

e (σ2X + µ2

X)). (9.17)

We already know from (8.29) that ∑ X2i /n→P σ2

X + µ2X , hence by Slutsky (9.5),

√n (βn − β) −→D

N(0, σ2e (σ

2X + µ2

X))

σ2X + µ2

X= N

(0,

σ2e

σ2X + µ2

X

). (9.18)

9.2 ∆-method

The central limit theorem deals with sample means, but often we are interested insome function of the mean, such as the g(Xn) = 1/Xn in (8.16). One way to linearizea function is to use a one-step Taylor series. Thus if Xn is close to its mean µ, theng(Xn) ≈ g(µ) + (Xn − µ)g′(µ), and the central limit theorem can be applied to theright-hand side. This method is called the ∆-method, which we formally define nextin more generality (i.e., we do not need to base it on the sample mean).

Page 154: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

142 Chapter 9. Asymptotics: Mapping and the ∆-Method

Lemma 9.2. ∆-method. Suppose√

n (Yn − µ) −→D W, (9.19)

and the function g : R→ R has a continuous derivative at µ. Then√

n (g(Yn)− g(µ)) −→D g′(µ) W. (9.20)

Proof. Taylor series yields

g(Xn) = g(µ) + (Yn − µ)g′(µ∗n), µ∗n is between Yn and µ, (9.21)

hence √n (g(Yn)− g(µ)) =

√n (Yn − µ) g′(µ∗n). (9.22)

We wish to show that g′(µ∗n)→P g′(µ), but first need to show that Yn →P µ. Now

Yn − µ = [√

n (Yn − µ)]× 1√n

−→D W × 0 (because1√n−→P 0)

= 0. (9.23)

That is, Yn − µ →P 0, hence Yn →P µ. Because µ∗n is trapped between Yn and µ,µ∗n →P µ, which by continuity of g′ means that

g′(µ∗n) −→P g′(µ). (9.24)

Applying Slutsky (9.5) to (9.22), by (9.19) and (9.24),√

n (Yn − µ) g′(µ∗n) −→D g′(µ) W, (9.25)

which via (9.22) proves (9.20).

Usually, the limiting W is normal, so that under the conditions on g, we have that√

n (Yn − µ) −→D N(0, σ2) =⇒√

n (g(Yn)− g(µ)) −→D N(0, g′(µ)2σ2). (9.26)

9.2.1 Median

Here we apply Lemma 9.2 to the sample median. We have X1, . . . , Xn iid with contin-uous distribution function F and pdf f . Let η be the median, so that F(η) = 1/2, andassume that the pdf f is positive and continuous at η. For simplicity we take n oddand set kn = (n + 1)/2, so that X(kn), the kth

n order statistic, is the median. Exercise9.5.4 shows that for U1, . . . , Un iid Uniform(0,1),

√n (U(kn) −

12 ) −→ N(0, 1

4 ). (9.27)

Thus for a function g with continuous derivative at 1/2, the ∆-method (9.26) showsthat √

n (g(U(kn))− g( 12 )) −→ N(0, 1

4 g′( 12 )

2). (9.28)

Page 155: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

9.3. Variance stabilizing transformations 143

Let g(u) = F−1(u). Then

g(U(kn)) = F−1(U(kn)) =D X(kn) and g( 1

2 ) = F−1( 12 ) = η. (9.29)

We also use the fact that

g′(u) =1

F′(F−1(u))=⇒ g′( 1

2 ) =1

f (η). (9.30)

Thus making the substitutions in (9.28), we obtain

√n (X(kn) − η) −→ N

(0,

14 f (η)2

). (9.31)

If n is even and one takes the average of the two middle values as the median, thenone can show that the asymptotic results are the same.

Recall location-scale families of distributions in Section 4.2.4. Consider just lo-cation families, so that for given pdf f (x), the family of pdfs in the model consistof the fµ(x) = f (x − µ) for µ ∈ R. We restrict to f ’s that are symmetric about 0,hence the median of fµ(x) is µ, as is the mean if the mean exists. In these cases,both the sample mean and sample median are reasonable estimates of µ. Which isbetter? The exact answer may be difficult (though if n = 1 they are both the same),but we can use asymptotics to approximately compare the variances when n is large.That is, we know

√n(Xn − µ) is asymptotically N(0, σ2) if the σ2 = Var[Xi] < ∞,

and (9.31) provides that the asymptotic distribution of√

n(Mediann −µ) is N(0, τ2),where τ2 = 1/(4 f (0)2) since fµ(µ) = f (0). If σ2 > τ2, then asymptotically themedian is better, and vice versa. The ratio σ2/τ2 is called the asymptotic relativeefficiency of the median to the mean. Table (9.32) gives these values for variouschoices of f .

Base distribution σ2 τ2 σ2/τ2

Normal(0, 1) 1 π/2 2/π ≈ 0.6366Cauchy ∞ π2/4 ∞Laplace 2 1 2Uniform(−1, 1) 1/3 1 1/3Logistic π2/3 4 π2/12 ≈ 0.8225

(9.32)

The mean is better for the normal, uniform, and logistic, but the median is betterfor the Laplace and, especially, for the Cauchy. Generally, the thinner the tails of thedistribution, the relatively better the mean performs.

9.3 Variance stabilizing transformations

Often, the variance of an estimator depends on the value of the parameter beingestimated. For example, if Xn ∼ Binomial(n, p), then with pn = Xn/n,

Var[ pn] =p(1− p)

n. (9.33)

In regression situations, one usually desires the dependent Yi’s to have the samevariance for each i, but if these Yi’s are binomial, or Poisson, the variance will not be

Page 156: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

144 Chapter 9. Asymptotics: Mapping and the ∆-Method

constant. Also, confidence intervals are easier if the standard error does not need tobe estimated.

By taking an appropriate function of the estimator, we may be able to achieveapproximately constant variance. Such a function is called a variance stabilizing trans-formation. Formally, if θn is an estimator of θ, then we wish to find a g such that

√n (g(θn)− g(θ)) −→D N(0, 1). (9.34)

The “1” for the variance is arbitrary. The important thing is that it does not dependon θ.

In the binomial example, the variance stabilizing g would satisfy√

n (g( pn)− g(p)) −→D N(0, 1). (9.35)

We know that √n ( pn − p) −→D N(0, p(1− p)), (9.36)

and by the ∆-method,√

n (g( pn)− g(p)) −→D N(0, g′(p)2 p(1− p)). (9.37)

What should g be so that that variance is 1? We need to solve

g′(p) =1√

p(1− p), (9.38)

so thatg(p) =

∫ p

0

1√y(1− y)

dy. (9.39)

First, let u =√

y, so that y = u2 and dy = 2udu, and

g(p) =∫ √p

0

1

u√

1− u22udu = 2

∫ √p

0

1√1− u2

du. (9.40)

The integral is arcsin(u), which means the variance stabilizing transformation is

g(p) = 2 arcsin(√

p). (9.41)

Note that adding a constant to g won’t change the derivative. The approximationsuggested by (9.35) is then

2 arcsin(√

pn

)≈ N

(2 arcsin(

√p),

1n

). (9.42)

An approximate 95% confidence interval for 2 arcsin(√

p) is

2 arcsin(√

pn

)± 2√

n. (9.43)

That interval can be inverted to obtain the interval for p, that is, apply g−1(u) =sin(u/2)2 to both ends:

p ∈(

sin(

arcsin(√

pn

)− 1√

n

)2, sin

(arcsin

(√pn

)+

1√n

)2)

. (9.44)

Page 157: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

9.4. Multivariate ∆-method 145

Brown, Cai, and DasGupta (2001) show that this interval, but with pn replaced by(x + 3/8)/(n + 3/4) as proposed by Anscombe (1948), is quite a bit better than theusual approximate interval,

pn ± 2

√pn(1− pn)

n. (9.45)

9.4 Multivariate ∆-method

It might be that we have a function of several random variables to deal with, or moregenerally we have several functions of several variables. For example, we might beinterested in the mean and variance simultaneously. So we start with a sequence ofp× 1 random vectors, whose asymptotic distribution is multivariate normal:

√n (Yn − µ) −→D N(0n, Σ), (9.46)

and a function g : Rp → Rq. We cannot just take the derivative of g, since there arepq of them. What we need is the entire matrix of derivatives, just as for finding theJacobian. Letting

g(y) =

g1(y)g2(y)

...gq(y)

and y =

y1y2...

yp

, (9.47)

define the q× p matrix

D(y) =

∂w1g1(y) ∂

∂w2g1(y) · · · ∂

∂wpg1(y)

∂∂w1

g2(y) ∂∂w2

g2(y) · · · ∂∂wp

g2(y)...

.... . .

...∂

∂w1gq(y) ∂

∂w2gq(y) · · · ∂

∂wpgq(y)

. (9.48)

Lemma 9.3. Multivariate ∆-method. Suppose (9.46) holds, and D(y) in (9.48) is contin-uous at y = µ. Then

√n (g(Yn)− g(µ)) −→D N(0n, D(µ)ΣD(µ)′). (9.49)

The Σ is p× p and D is q× p, so that the covariance in (9.49) is q× q, as it shouldbe. Some examples follow.

9.4.1 Mean, variance, and coefficient of variation

Go back to the example that ended with (8.84): X1, . . . , Xn are iid with mean µ,variance σ2, E[X3

i ] = µ′3 and E[X4i ] = µ′4. Ultimately, we wish to find the asymptotic

distribution of Xn and S2n, so we start with that of (∑ Xi, ∑ X2

i )′:

Yn =1n

n

∑i=1

(XiX2

i

), µ =

µ2 + σ2

), (9.50)

Page 158: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

146 Chapter 9. Asymptotics: Mapping and the ∆-Method

and

Σ =

(σ2 µ′3 − µ(µ2 + σ2)

µ′3 − µ(µ2 + σ2) µ′4 − (µ2 + σ2)2

). (9.51)

Since S2n = ∑ X2

i /n− Xn2,(

XnS2

n

)= g(Xn, ∑ X2

i /n) =(

g1(Xn, ∑ X2i /n)

g2(Xn, ∑ X2i /n)

), (9.52)

whereg1(y1, y2) = y1 and g2(y1, y2) = y2 − y2

1. (9.53)

Then

D(y) =

∂y1∂y1

∂y1∂y2

∂(y2−y21)

∂y1

∂(y2−y21)

∂y2

=

(1 0−2y1 1

). (9.54)

Also,

g(µ) =(

µσ2 + µ2 − µ2

)=

(µσ2

), D(µ) =

(1 0−2µ 1

), (9.55)

and

D(µ)ΣD(µ)′ =

(1 0−2µ 1

)(σ2 µ′3 − µ(µ2 + σ2)

µ′3 − µ(µ2 + σ2) µ′4 − (µ2 + σ2)2

)(1 −2µ0 1

)=

(σ2 µ′3 − µ3 − 3µσ2

µ′3 − µ3 − 3µσ2 µ′4 − σ4 − 4µµ′3 + 3µ4 + 6µ2σ2

).

Yikes! Notice in particular that the sample mean and variance are not asymptoticallyindependent necessarily.

Before we go on, let’s assume the data are normal. In that case,

µ′3 = µ3 + 3µσ2 and µ′4 = 3σ4 + 6σ2µ2 + µ4, (9.56)

(left to the reader), and, magically,

D(µ)ΣD(µ)′ =

(σ2 00 2σ4

), (9.57)

hence√

n((

XnS2

n

)−(

µσ2

))−→D N

(02,(

σ2 00 2σ4

)). (9.58)

Actually, that is not surprising, since we know the variance of Xn is σ2/n, and that ofS2

n is the variance of a χ2n−1 times (n− 1)σ2/n, which is 2(n− 1)3σ4/n2. Multiplying

those by n and letting n → ∞ yields the diagonals σ2 and 2σ4. Also the mean andvariance are independent, so their covariance is 0.

From these, we can find the coefficient of variance, or the noise-to-signal ratio,

cv =σ

µand sample version cv =

Sn

Xn. (9.59)

Page 159: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

9.4. Multivariate ∆-method 147

Then cv = h(Xn, S2n) where h(w1, w2) =

√w2/w1. The derivatives here are

Dh(w1, w2) =

(−√

w2w2

1,

12√

w2 w1

)=⇒ Dh(µ, σ2) =

(− σ

µ2 ,1

2σµ

). (9.60)

Then, assuming that µ 6= 0,

DhΣD′h =

(− σ

µ2 ,1

2σµ

)(σ2 00 2σ4

)( − σµ2

12σµ

)

=σ4

µ4 +12

σ2

µ2 . (9.61)

The asymptotic distribution is then

√n (cv− cv) −→D N

(0,

σ4

µ4 +12

σ2

µ2

)= N(0, cv2(cv2 + 1

2 )). (9.62)

For example, data on n = 102 female students’ heights had a mean of 65.56 and astandard deviation of 2.75, so the cv = 2.75/65.56 = 0.0419. We can find a confidenceinterval by estimating the variance in (9.62) in the obvious way:(

cv± 2|cv|

√0.5 + cv2√

n

)= (0.0419± 2× 0.0029) = (0.0361, 0.0477). (9.63)

For the men, the mean is 71.25 and the sd is 2.94, so their cv is 0.0413. That’s prac-tically the same as for the women. The men’s standard error of cv is 0.0037 (then = 64), so a confidence interval for the difference between the women and men is

(0.0419− 0.0413± 2√

0.00292 + 0.00372) = (0.0006± 0.0094). (9.64)

Clearly 0 is in that interval, so there does not appear to be any difference between thecoefficients of variation.

9.4.2 Correlation coefficient

Consider the bivariate normal sample,(X1Y1

), . . . ,

(XnYn

)are iid ∼ N

(02,(

1 ρρ 1

)). (9.65)

The sample correlation coefficient in this case (we don’t need to subtract the means)is

Rn =∑n

i=1 XiYi√∑n

i=1 X2i ∑n

i=1 Y2i

. (9.66)

From Exercise 8.7.6(c), we know that Rn →P ρ. What about the asymptotic distribu-tion? Notice that Rn can be written as a function of three sample means,

Rn = g

(1n

n

∑i=1

XiYi,1n

n

∑i=1

X2i ,

1n

n

∑i=1

Y2i

)where g(w1, w2, w3) =

w1√w1w2

. (9.67)

Page 160: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

148 Chapter 9. Asymptotics: Mapping and the ∆-Method

First, apply the central limit theorem to the three means:

√n

1n ∑n

i=1 XiYi1n ∑n

i=1 X2i

1n ∑n

i=1 Y2i

− µ

−→D N(03, Σ). (9.68)

Now E[XiYi] = ρ and E[X2i ] = E[Y2

i ] = 1, hence

µ =

ρ11

. (9.69)

The covariance is a little more involved. Exercise 7.8.10 shows that

Σ = Cov

XiYiX2

iY2

i

=

1 + ρ2 2ρ 2ρ2ρ 2 2ρ2

2ρ 2ρ2 2

. (9.70)

Exercise 9.5.10 applies the ∆-method to (9.68) to obtain√

n (Rn − ρ) −→D N(0, (1− ρ2)2). (9.71)

9.4.3 Affine transformations

If A is a q× p matrix, then it is easy to get the asymptotic distribution of AXn + b:√

n (Xn − µ) −→D N(0n, Σ) =⇒√

n (AXn + b− (Aµ + b)) −→D N(0n, AΣA′),(9.72)

because for the function g(w) = Aw, D(w) = A.

9.5 Exercises

Exercise 9.5.1. Suppose X1, X2, . . . are iid with E[Xi] = 0 and Var[Xi] = σ2 < ∞.Show that

∑ni=1 Xi√

∑ni=1 X2

i

−→D N(0, 1). (9.73)

Exercise 9.5.2. Suppose(X1Y1

), · · · ,

(XnYn

)are iid N

((00

),(

σ2X σXY

σXY σ2Y

)). (9.74)

Then Yi |Xi = xi ∼ N(α + βxi, σ2e ). (What is α?) As in Section 8.3.1 let

βn =∑n

i=1 XiYi

∑ni=1 X2

i, (9.75)

and set Wn =√

n(βn − β). (a) Give the Vi for which

Wn =√

n

(∑n

i=1 Vi

∑ni=1 X2

i

), (9.76)

Page 161: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

9.5. Exercises 149

where Vi depends on Xi, Yi, and β. (b) Find E[Vi |Xi = xi] and Var[Vi |Xi = xi]. (c)Find E[Vi] and σ2

V = Var[Vi]. (d) Are the Vi’s iid? (e) For what a does

∑ni=1 Vina −→D N(0, σ2

V)? (9.77)

(f) Find the b and c > 0 for which

∑ni=1 X2

inb −→P c. (9.78)

(g) Using parts (e) and (f), Wn −→D N(0, σ2B) for what σ2

B? What theorem is neededto prove the result?

Exercise 9.5.3. Here, Yk ∼ Beta(k, k). We are interested in the limit of√

k(Yk− 1/2) ask→ ∞. Start by representing Yk with gammas, or more particularly, Exponential(1)’s.So let X1, . . . , X2k be iid Exponential(1). Then

Yk =X1 + · · ·+ Xk

X1 + · · ·+ Xk + Xk+1 + · · ·+ X2k. (9.79)

(a) Is this representation correct? (b) Now write

Yk − 12 = c

U1 + · · ·+ UkV1 + · · ·+ Vk

= cUk

Vk, (9.80)

where Ui = Xi − Xk+i and Vi = Xi + Xk+i. What is c? (c) Find E[Ui], Var[Ui], andE[Vi]. (d) So

√k (Yk − 1

2 ) = c

√k Uk

Vk. (9.81)

What is the asymptotic distribution of√

k Uk as k → ∞? What theorem do you use?(e) What is the limit of Vk in probablity as k → ∞? What theorem do you use? (f)Finally,

√k (Yk − 1/2) −→D N(0, v). What is v? (It is a number.) What theorem do

you use?

Exercise 9.5.4. Suppose U1, . . . , Un are iid Uniform(0,1), and n is odd. Then thesample median is U(kn) for kn = (n + 1)/2, hence U(kn) ∼ Beta(kn, kn). Letting k = knand Yk = U(kn) in Exercise 9.5.3, we have that√

kn(U(kn) −12 ) −→ N(0, v). (9.82)

(a) What is the limit of n/kn? (b) Show that√

n(U(kn) −12 ) −→ N(0, 1

4 ). (9.83)

What theorem did you use?

Exercise 9.5.5. Suppose T ∼ tν, Student’s t. Exercise 6.8.17 shows that E[T] = 0if ν > 1 and Var[T] = ν/(ν − 2) if ν > 2, as well as gives the pdf. This exerciseis based on X1, X2, . . . , Xn iid with pdf fν(x − µ), where fν is the tν pdf. (a) Findthe asymptotic efficiency of the median relative to the mean for ν = 1, 2, . . . , 7. [Youmight want to use the function dt in R to find the density of the tν.] (b) When ν issmall, which is better, the mean or the median? (c) When ν is large, which is better,the mean or the median? (d) For which ν is the asymptotic relative efficiency closestto 1?

Page 162: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

150 Chapter 9. Asymptotics: Mapping and the ∆-Method

Exercise 9.5.6. Suppose U1, . . . , Un are iid from the location family with fα being theBeta(α, α) pdf, so that the pdf of each Ui is fα(u − η) for η ∈ R. (a) What mustbe added to the sample mean or sample median in order to obtain an unbiasedestimator of η? (b) Find the asymptotic variances of the two estimators in part (a) forα = 1/10, 1/2, 1, 2, 10. Also, find the asymptotic relative efficiency of the median tothe mean for each α. What do you see?

Exercise 9.5.7. Suppose X1, . . . , Xn are iid Poisson(λ). (a) What is the asymptoticdistribution of

√n(Xn − λ) as n → ∞? (b) Consider the function g, such that√

n(g(Xn)− g(λ))→D N(0, 1). What is its derivative, g′(w)? (c) What is g(w)?

Exercise 9.5.8. (a) Suppose X1, . . . , Xn are iid Exponential(θ). Since E[Xi] = 1/θ, wemight consider using 1/Xn an estimator of θ. (a) What is the asymptotic distributionof√

n(1/Xn − θ)? (b) Find g(w) so that√

n(g(1/Xn)− g(θ))→D N(0, 1).

Exercise 9.5.9. Suppose X1, . . . , Xn are iid Gamma(p, 1) for p > 0. (a) Find the asymp-totic distribution of

√n(Xn − p). (b) Find a statistic Bn (depending on the data alone)

such that√

n(Xn − p)/Bn →D N(0, 1). (c) Based on the asymptotic distribution inpart (b), find an approximate 95% confidence interval for p. (d) Next, find a functiong such that

√n(g(Xn)− g(p)) →D N(0, 1). (e) Based on the asymptotic distribution

in part (d), find an approximate 95% confidence interval for p. (f) Let xn = 100 withn = 25. Compute the confidence intervals using parts (c) and (e). Are they reasonablysimilar?

Exercise 9.5.10. Suppose(X1Y1

), · · · ,

(XnYn

)are iid N

((00

),(

1 ρρ 1

)). (9.84)

Also, let

Wi =

XiYiX2

iY2

i

. (9.85)

Equations (9.68) to (9.70) exhibit the asymptotic distribution of

√n

Wn −

ρ11

(9.86)

as n→ ∞, where Wn is the sample mean of the Wi’s. Take g(w1, w2, w3) = w1/√

w2w3as in (9.67) so that

Rn = g(Wn) =∑ XiYi√

∑ X2i

√∑ Y2

i

. (9.87)

(a) Find the vector of derivatives of g, Dg, evaluated at w = (ρ, 1, 1)′. (b) Use the∆-method to show that

√n(Rn − ρ) −→ N(0, (1− ρ2)2). (9.88)

as n→ ∞.

Page 163: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

9.5. Exercises 151

Exercise 9.5.11. Suppose Rn is the sample correlation coefficient from an iid sample ofbivariate normals as in Exercise 9.5.10. Consider the function h, such that

√n(h(Rn)−

h(ρ)) −→ N(0, 1). (a) Find h′(w). (b) Show that

h(w) =12

log(

1 + w1− w

). (9.89)

[Hint: You might want to use partial fractions, that is, write h′(w) as A/(1− w) +B/(1 + w). What are A and B?] The statistic h(Rn) is called Fisher’s z. (c) In asample of n = 712 students, the sample correlation coefficient between x = shoe sizeand y = log(# of pairs of shoes owned) is r = −0.500. Find an approximate 95%confidence interval for h(ρ) (using "±2"). (Assuming these data are a simple randomsample from a large normal population.) (d) What is the corresponding approximate95% confidence interval for ρ? Is 0 in the interval? What do you conclude? (e) Forjust men, the sample size is 227 and the correlation between x and y is 0.0238. Forjust women, the sample size is 485 and the correlation between x and y is −0.0669.Find the confidence intervals for the population correlations for the men and womenusing the method in part (d). Is 0 in either or both of those intervals? What do youconclude?

Exercise 9.5.12. Suppose Xn ∼ Multinomial(n, p) where p = (p1, p2, p3, p4)′ (so K =

4). Then E[Xn] = np and, from Exercise 2.7.7,

Cov[Xn] = nΣ where Σ =

p1 0 0 00 p2 0 00 0 p3 00 0 0 p4

− pp′. (9.90)

(a) Suppose Z1, . . . , Zn are iid Multinomial (1, p). Show that Xn = Z1 + · · ·+ Zn isMultinomial(n, p). [Hint: Use mgfs from Section 2.5.3.] (b) Argue that by part (a),the central limit theorem can be applied to show that

√n(pn − p) −→D N(0, Σ). (9.91)

(c) Arrange the pi’s in a 2× 2 table:

p1 p2p3 p4

(9.92)

In Exercise 6.8.11 we saw the odds ratio. Here we look at the log odds ratio, givenby log((p1/p2)/(p3/p4)). The higher it is, the more positively associated being isrow 1 is with being in column 1. For w = (w1, . . . , w4)

′ with each wi ∈ (0, 1), letg(w) = log(w1w4/(w2w3)), so that g(p) is the log odds ratio for the 2 × 2 table.Show that in the independence case as in Exercise 6.8.11(c), g(p) = 0. (d) FindDg(w), the vector of derivatives of g, and and show that

√n (g(pn)− g(p)) −→D N(0, σ2

g), (9.93)

where σ2g = 1/p1 + 1/p2 + 1/p3 + 1/p4.

Exercise 9.5.13. In a statistics class, people were classified on how well they did onthe combined homework, labs and inclass assignments (hi or lo), and how well they

Page 164: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

152 Chapter 9. Asymptotics: Mapping and the ∆-Method

did on the exams (hi or lo). Thus each person was classified into one of four groups.Letting p be the population probabilities, arrange the vector in a two-by-two table asabove. The table of observed counts is

Exams→ Lo HiHomework ↓

Lo 36 18Hi 18 35

(9.94)

(a) Find the observed log odds ratio for these data and its estimated standard error.Find an approximate 95% confidence interval for g(p). (b) What is the correspondingconfidence interval for the odds ratio? What do you conclude?

The next two tables split the data by gender:

Women MenExams→ Lo Hi

Homework ↓Lo 26 6Hi 10 28

Exams→ Lo HiHomework ↓

Lo 10 12Hi 8 7

(9.95)

Assume the men and women are independent, each with their own multinomialdistribution. (c) Find the difference between the women’s and men’s log odds ratios,and the standard error of that difference. What do you conclude about the differencebetween the women and men? (d) Looking at the women’s and men’s odds ratiosseparately, what do you conclude?

Page 165: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

Part II

Statistical Inference

153

Page 166: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work
Page 167: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

Chapter 10

Statistical Models and Inference

Most, although not all, of the material so far has been straight probability calculations,that is, we are given a probability distribution, and try to figure out the implications(what X is likely to be, marginals, conditionals, moments, what happens asymp-totically, etc.). Statistics generally concerns itself with the reverse problem, that is,observing the data X = x, and then having to guess aspects of the probability distri-bution that generated x. This “guessing” goes under the general rubric of inference.Four major aspects of inference are

• Estimation: What is the best guess of a particular parameter (vector), or func-tion of the parameter? The estimate may be a point estimate, or a point estimateand measure of accuracy, or an interval or region, e.g., “The mean is in the in-terval (10.44,19.77).”

• Hypothesis testing: The question is whether a specific hypothesis, the nullhypothesis, about the distribution is true, so that the inference is basically either“yes” or “no”, along with an idea of how reliable the conclusion is.

• Prediction: One is interested in predicting a new observation, possibly depend-ing on a covariate. For example, the data may consist of a number of (Xi, Yi)pairs, and a new observation comes along, where we know the x but not they, and wish to guess that y. We may be predicting a numerical variable, e.g.,return on an investment, or a categorical variable, e.g., the species of a plant.

• Model selection: There may be several models under consideration, e.g., inmultiple regression each subset of potential regressors defines a model. Thegoal would be to choose the best one, or a set of good ones.

The boundaries between these notions are not firm. One can consider predictionto be estimation, or model selection as an extension of hypothesis testing with morethan two hypotheses. Whatever the goal, the first task is to specify the statisticalmodel.

10.1 Statistical models

A probability model consists of a random X in space X and a probability distribution P. A statistical model also has X and space X, but an entire family P of probability distributions on X. By family we mean a set of distributions; the only restriction being that they are all distributions for the same X. Such families can be quite general, e.g.,

X = Rn, P = {P | X1, . . . , Xn are iid with finite mean and variance}. (10.1)

This family includes all kinds of distributions (iid normal, gamma, beta, binomial), but not ones with the Xi’s correlated, or distributed Cauchy (which has no mean or variance). Another possibility is the family with the Xi’s iid with a continuous distribution.

Often, the families are parametrized by a finite-dimensional parameter θ, i.e.,

P = {Pθ | θ ∈ T }, where T ⊂ RK. (10.2)

The T is called the parameter space. We are quite familiar with parameters, but for statistical models we must be careful to specify the parameter space as well. For example, suppose X and Y are independent, X ∼ N(µX, σX²) and Y ∼ N(µY, σY²). Then the following parameter spaces lead to distinctly different models:

T1 = {(µX, σX², µY, σY²) ∈ R × (0, ∞) × R × (0, ∞)};
T2 = {(µX, σX², µY, σY²) | µX ∈ R, µY ∈ R, σX² = σY² ∈ (0, ∞)};
T3 = {(µX, σX², µY, σY²) | µX ∈ R, µY ∈ R, σX² = σY² = 1};
T4 = {(µX, σX², µY, σY²) | µX ∈ R, µY ∈ R, µX > µY, σX² = σY² ∈ (0, ∞)}. (10.3)

The first model places no restrictions on the parameters, other than the variances are positive. The second one demands the two variances be equal. The third sets the variances to 1, which is equivalent to saying that the variances are known to be 1. The last one equates the variances, as well as specifying that the mean of X is larger than that of Y.

A Bayesian model includes a (prior) distribution on P, which in the case of a parametrized model means a distribution on T. In fact, the model could include a family of prior distributions, although we will not deal with that case explicitly.

Before we introduce inference, we take a brief look at how probability is interpreted.

10.2 Interpreting probability

In Section 1.2, we defined probability distributions mathematically, starting with some axioms. Everything else flowed from those axioms. But as in all mathematical objects, they do not in themselves have physical reality. In order to make practical use of the results, we must somehow connect the mathematical objects to the physical world. That is, how is one to interpret P[A]? In games of chance, people generally feel confident that they know what “the chance of heads” or “the chance of a full house” mean. But other probabilities may be less obvious, e.g., “the chance that it rains next Tuesday” or “the chance ____ and ____ get married” (fill in the blanks with any two people). Two popular interpretations are frequency and subjective. Both have many versions, and there are also many other interpretations, but much of this material is beyond the scope of the author. Here are sketches of the two.


Frequency. An experiment is presumed to be repeatable, so that one could conceivably repeat the experiment under the exact same conditions over and over again (i.e., infinitely often). Then the probability of a particular event A, P[A], is the long-run proportion of times it occurs, as the experiment is repeated forever. That is, it is the long-run frequency A occurs. This interpretation implies that probability is objective in the sense that it is inherent in the experiment, not a product of one’s beliefs. This interpretation works well for games of chance. One can imagine rolling a die or spinning a roulette wheel an “infinite” number of times. Population sampling also fits in well, as one could imagine repeatedly taking a random sample of 100 subjects from a given population. The frequentist interpretation can not be applied to situations that are not in principle repeatable, such as whether two people will get married, or whether a particular candidate will win an election. One would have to imagine redoing the world over and over Groundhog Day-like.

Subjective. The subjective approach allows each person to have a different probability, so that for a given person, P[A] is that person’s opinion of the probability of A. The only assumption is that each person’s probabilities cohere, that is, satisfy the probability axioms. Subjective probability can be applied to any situation. For a repeatable experiment, people’s subjective probabilities would tend to agree, whereas in other cases, such as the probability a certain team will win a particular game, their probabilities could differ widely.

Some subjectivists make the assumption that any given person’s subjective probabilities can be elicited using a betting paradigm. For example, suppose the event in question is “Pat and Leslie will get married,” the choices being “Yes” and “No,” and we wish to elicit your probability of the event “Yes.” We give you $10, and ask you for a number w, which will be used in two possible bets:

Bet 1 → Win $w if “Yes”, Lose $10 if “No”;
Bet 2 → Lose $10 if “Yes”, Win $(100/w) if “No”. (10.4)

Some dastardly being will decide which of the bets you will take, so the w should be an amount for which you are willing to take either of those two bets. For example, if you choose $w = $5, then you are willing to accept a bet that pays only $5 if they do get married, and loses $10 if they don’t; and you are willing to take a bet that wins $20 if they do not get married, and loses $10 if they do. These numbers suggest you expect they will get married. Suppose p is your subjective probability of “Yes.” Then your willingness to take Bet 1 means you expect to not lose money:

Bet 1 → E[Winnings] = p($w) − (1 − p)($10) ≥ 0. (10.5)

Same with Bet 2:

Bet 2 → E[Winnings] = −p($10) + (1 − p)$(100/w) ≥ 0. (10.6)

A little algebra translates those two inequalities into

p ≥ 10/(10 + w) and p ≤ (100/w)/(10 + 100/w) = 10/(10 + w), (10.7)

which of course means that

p = 10/(10 + w). (10.8)


With $w = $5, your p = 2/3 that they will get married.

The betting approach is then an alternative to the frequency approach. Whether it is practical to elicit an entire probability distribution (i.e., P[A] for all A ⊂ X), and whether the result will satisfy the axioms, is questionable, but the main point is that there is in principle a grounding to a subjective probability.
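To make the arithmetic in (10.5) through (10.8) concrete, here is a one-line R sketch (not from the text) that converts a stated wager w into the implied subjective probability; the function name and the example value are only illustrative.

elicited.p <- function(w) 10 / (10 + w)   # implied probability of "Yes" from the wager w, as in (10.8)
elicited.p(5)                             # 0.667, matching p = 2/3 above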

10.3 Approaches to inference

Paralleling the interpretations of probability are the two main approaches to statistical inference: frequentist and Bayesian. Both aim to make inferences about θ based on observing the data X = x, but take different tacks.

Frequentist. The frequentist approach assumes that the parameter θ is fixed but unknown (that is, we know only that θ ∈ T ). An inference is an action, which is a function

δ : X −→ A, (10.9)

for some action space A. The action space depends on the type of inference desired. For example, if one wishes to estimate θ, then δ(x) would be the estimate, and A = T. Or δ may be a vector containing the estimate as well as an estimate of its variance, or it may be a two-dimensional vector representing a confidence interval, as in (7.32). In hypothesis testing, we often take A = {0, 1}, where 0 means accept the null hypothesis, and 1 means reject it. The properties of a procedure δ, which would describe how good it is, are based on the behavior if the experiment were repeated over and over, with θ fixed. Thus an estimator δ of θ is unbiased if

Eθ [δ(X)] = θ for all θ ∈ T . (10.10)

Or a confidence interval procedure δ(x) = (l(x), u(x)) has 95% coverage if

Pθ [l(X) < θ < u(X)] ≥ 0.95 for all θ ∈ T . (10.11)

Understand that the 95% does not refer to your particular interval, but rather to the infinite number of intervals that you imagine arising from repeating the experiment over and over. That is, without a prior, for fixed x,

P[l(x) < θ < u(x)] ≠ 0.95, (10.12)

because there is nothing random in the probability statement. The actual probability is either 0 or 1, depending on whether θ is indeed between l(x) and u(x), but we typically do not know which value it is.

Bayesian. The frequentist approach does not tell you what to think of θ. It just produces a number or numbers, then reassures you by telling you what would happen if you repeated the experiment an infinite number of times. The Bayesian approach, by contrast, tells you what to think. More precisely, given your prior distribution on T, which may be your subjective distribution, the Bayes approach tells you how to update your opinion upon observing X = x. The update is of course the posterior, which we know how to find using Bayes theorem (Theorem 6.3 on page 94). The posterior fΘ|X(θ | x) is the inference, or at least all inferences are derived from it. For example, an estimate could be the posterior mean, median, or mode. A 95% probability interval is any interval (lx, ux) such that

P[lx < Θ < ux |X = x] = 0.95. (10.13)


A hypothesis test would calculate

P[Null hypothesis is true |X = x], (10.14)

or, if an accept/reject decision is desired, reject the null hypothesis if the posterior probability of the null is less than some cutoff, say 0.50 or 0.01.

A drawback to the frequentist approach is that we cannot say what we wish to say, such as the probability a null hypothesis is true, or the probability µ is between two numbers. Bayesians can make such statements, but at the cost of having to come up with a (usually subjective) prior. The subjectivity means that different people can come to different conclusions from the same data. (Imagine a tobacco company and a consumer advocate analyzing the same smoking data.) Fortunately, there are more or less well-accepted “objective” priors, and especially when the data is strong, different reasonable priors will lead to practically the same posteriors. From an implementation point of view, sometimes frequentist procedures are computationally easier, and sometimes Bayesian procedures are. It may not be philosophically pleasing, but it is not a bad idea to take an opportunistic view and use whichever approach best moves your understanding along.

There are other approaches to inference, such as the likelihood approach, the structural approach, the fiducial approach, and the fuzzy approach. These are all interesting and valuable, but seem a bit iffy to me.

The rest of the course goes more deeply into inference.

10.4 Exercises

Exercise 10.4.1. Suppose X1, . . . , Xn are independent N(µ, σ0²), where µ ∈ R, n = 25, and σ0² = 9. Also, suppose that U ∼ Uniform(0, 1), and U is independent of the Xi’s. The µ does not have a prior in this problem. Consider the following two confidence interval procedures for µ:

Procedure 1: CI1(x, u) = (x̄ − 1.96 σ0/√n, x̄ + 1.96 σ0/√n);
Procedure 2: CI2(x, u) = R if u ≤ .95, and ∅ if u > .95. (10.15)

The ∅ is the empty set. (a) Find P[µ ∈ CI1(X, U)]. (b) Find P[µ ∈ CI2(X, U)]. (c) Suppose x̄ = 70 and u = 0.5. Using Procedure 1, is (68.824, 71.176) a 95% confidence interval for µ? (d) Given the data in part (c), using Procedure 2, is (−∞, ∞) a 95% confidence interval for µ? (e) Given the data in part (c), using Procedure 1, does P[68.824 < µ < 71.176] = 0.95? (f) Given the data in part (c), using Procedure 2, does P[µ ∈ CI2(x, u)] = 0.95? If not, what is the probability? (g) Suppose x̄ = 70 and u = 0.978. Using Procedure 2, does P[µ ∈ CI2(x, u)] = 0.95? If not, what is the probability?

Exercise 10.4.2. Continue with the situation in Exercise 10.4.1, but now suppose there is a prior on µ, so that

X | M = µ ∼ N(µ, (0.6)²),  M ∼ N(66, 10²), (10.16)


and U is independent of (X, M). (a) Find the posterior distribution M | X = 70, U = 0.978. Does it depend on U? (b) Find P[M ∈ CI1(X, U) | X = 70, U = 0.978]. (c) Find P[M ∈ CI2(X, U) | X = 70, U = 0.978]. (d) Which 95% confidence interval procedure seems to give closest to 95% confidence that M is in the interval?

Exercise 10.4.3. Imagine you have to weigh a substance whose true weight is µ, in milligrams. There are two scales, an old mechanical one and a new electronic one. Both are unbiased, but the mechanical one has a measurement error of 3 milligrams, while the electronic one has a measurement error of only 1 milligram. Letting Y be the measurement, and S be the scale, we have that

Y | S = mech ∼ N(µ, 3²),  Y | S = elec ∼ N(µ, 1). (10.17)

There is a fifty-fifty chance you get to use the good scale, so marginally, P[S = mech] = P[S = elec] = 1/2. Consider the confidence interval procedure, CI(y) = (y − 3, y + 3). (a) Find P[µ ∈ CI(Y) | S = mech]. (b) Find P[µ ∈ CI(Y) | S = elec]. (c) Find the unconditional probability, P[µ ∈ CI(Y)]. (d) The interval (y − 3, y + 3) is a Q% confidence interval for µ. What is Q? (e) Suppose the data are (y, s) = (14, mech). Using the CI above, what is the Q% confidence interval for µ? (f) Suppose the data are (y, s) = (14, elec). Using the CI above, what is the Q% confidence interval for µ? (g) What is the difference between the two intervals from (e) and (f)? (h) Are you equally confident in them?

Exercise 10.4.4. Continue with the situation in Exercise 10.4.3, but now suppose there is a prior on µ, so that

Y | M = µ, S = mech ∼ N(µ, 3²),  Y | M = µ, S = elec ∼ N(µ, 1),  M ∼ N(16, 15²), (10.18)

and P[S = mech] = P[S = elec] = 1/2, where M and S are independent. (a) Find the posterior M | Y = 14, S = mech. (b) Find the posterior M | Y = 14, S = elec. (c) Find P[Y − 3 < M < Y + 3 | Y = 14, S = mech]. (d) Find P[Y − 3 < M < Y + 3 | Y = 14, S = elec]. (e) Is Q% (Q is from Exercise 10.4.3 (d)) a good measure of confidence for the interval (y − 3, y + 3) for the data (y, s) = (14, mech)? For the data (y, s) = (14, elec)?


Chapter 11

Estimation

11.1 Definition of estimator

We assume a model with parameter space T, and suppose we wish to estimate some function g of θ,

g : T −→ Rq. (11.1)

This function could be θ itself, or just part of θ. For example, if X1, . . . , Xn are iid N(µ, σ²), where θ = (µ, σ²) ∈ R × (0, ∞), some possible one-dimensional g’s are

g(µ, σ²) = µ;
g(µ, σ²) = σ;
g(µ, σ²) = σ/µ = coefficient of variation;
g(µ, σ²) = P[Xi ≤ 10] = Φ((10 − µ)/σ), (11.2)

where Φ is the distribution function for N(0, 1).

Formally, an estimator is a function δ(x),

δ : X −→ A, (11.3)

where A is some space, presumably the space of g(θ), but not always. The estimator can be any function of x, but cannot depend on an unknown parameter. Thus with g(µ, σ²) = σ/µ in the above example,

δ(x1, . . . , xn) = s/x̄, [s² = ∑(xi − x̄)²/n] is an estimator,
δ(x1, . . . , xn) = σ/x̄ is not an estimator. (11.4)

We often use the “hat” notation, so that if δ is an estimator of g(θ), we would write

δ(x) = ĝ(θ). (11.5)

Any function can be an estimator, but that does not mean it will be a particularly good estimator. There are basically two questions we must address: How does one find reasonable estimators? How do we decide which estimators are good? This chapter looks at plug-in methods and Bayesian estimation. Chapter 12 considers least squares and similar procedures as applied to linear regression. Chapter 13 presents maximum likelihood estimation, a widely applicable approach. Later chapters (19 and 20) deal with optimality of estimators.

11.2 Bias, standard errors, and confidence intervals

Making inferences about a parameter typically involves more than just a point estimate. One would also like to know how accurate the estimate is likely to be, or have a reasonable range of values. One basic measure is bias, which is how far off the estimator δ of g(θ) is on average:

Biasθ [δ] = Eθ [δ(X)]− g(θ). (11.6)

For example, if X1, . . . , Xn are iid with variance σ² < ∞, then S² = ∑(Xi − X̄)²/n has

E[S²] = ((n − 1)/n) σ²  ⟹  Bias_σ²[S²] = −(1/n) σ². (11.7)

Thus δ(X) = S² is a biased estimator of σ². Instead, if we divide by n − 1, we have the unbiased estimator S*². (We saw this result for the normal in (7.69).)

A little bias is not a big deal, but one would not like a huge amount of bias.

Another basic measure of accuracy is the standard error of an estimator:

seθ[δ] = √(Varθ[δ]) or √(V̂arθ[δ]), (11.8)

that is, it is the theoretical standard deviation of the estimator, or an estimator thereof. In Exercise 11.7.1 we will see that the mean square error, Eθ[(δ(X) − g(θ))²], combines the bias and standard error, being the bias squared plus the variance. Section 19.3 delves more formally into optimality considerations.

Confidence intervals (as in (10.11)) or probability intervals (as in (10.13)) will often be more informative than simple point estimates, even with the standard errors. A common approach to deriving confidence intervals uses what are called pivotal quantities, as introduced in (7.29). A pivotal quantity is a function of the data and the parameter, whose distribution does not depend on the parameter. That is, suppose T(X ; θ) has a distribution that is known or can be approximated. Then for given α, we can find constants A and B, free of θ, such that

P[A < T(X ; θ) < B] = (or ≈) 1− α. (11.9)

If the pivotal quantity works, then we can invert the event so that our estimand g(θ) is in the middle, and statistics (free of θ) define the interval:

A < T(x ; θ) < B ⇔ l(x) < g(θ) < u(x). (11.10)

Then (l(x), u(x)) is a (maybe approximate) 100(1 − α)% confidence interval for g(θ). The quintessential pivotal quantities are the z statistic in (7.29) and the t statistic in (7.77).
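As a concrete instance of inverting a pivotal quantity, here is a brief R sketch (mine, not the text's) using the t pivot for a normal mean; the data vector x is hypothetical.

x <- c(10.2, 12.5, 11.1, 13.4, 12.0, 10.8)     # hypothetical sample
n <- length(x)
B <- qt(0.975, df = n - 1)                     # P[-B < T < B] = 0.95 for the t pivot
se <- sd(x) / sqrt(n)
c(mean(x) - B*se, mean(x) + B*se)              # the inverted interval (l(x), u(x))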

In many situations, the exact distribution of an estimator is difficult to find, so that asymptotic considerations become useful. For a sequence of estimators δ1, δ2, . . ., an analog of unbiasedness is consistency, where δn is a consistent estimator of g(θ) if

δn −→P g(θ) for all θ ∈ T . (11.11)


Note that consistency and unbiasedness are distinct notions. Consistent estimators do not have to be unbiased: Sn² is a consistent estimator of σ² in the iid case, but is not unbiased. Also, an unbiased estimator need not be consistent (can you think of an example?), though an estimator that is unbiased and has variance going to zero is consistent, by Chebyshev’s inequality (Lemma 8.3).

Likewise, whereas the exact standard error may not be available, we may have a proxy, sen(δn), for which

(δn − g(θ))/sen(δn) →D N(0, 1). (11.12)

Then an approximate confidence interval for g(θ) is

δn ± 2 sen(δn). (11.13)

Here the ∆-method (Section 9.2) often comes in useful.

11.3 Plug-in methods: Parametric

Often the parameter of interest has an obvious sample analog, or is a function of some parameters that have obvious analogs. For example, if X1, . . . , Xn are iid, then it may be reasonable to estimate µ = E[Xi] by X̄, σ² = Var[Xi] by S², and the coefficient of variation by S/X̄ (see (11.2) and (11.4)).

An obvious estimator of P[Xi ≤ 10] is

δ(x) = #{xi ≤ 10}/n. (11.14)

A parametric model may suggest other options. For example, if the data are iid N(µ, σ²), then P[Xi ≤ 10] = Φ((10 − µ)/σ), so that we can plug in the mean and standard deviation estimates to obtain the alternative estimator

δ*(x) = Φ((10 − x̄)/s). (11.15)

Or suppose the Xi’s are iid Beta(α, β), with (α, β) ∈ (0, ∞) × (0, ∞). Then from Table 1.1 on page 7, the population mean and variance are

µ = α/(α + β) and σ² = αβ/((α + β)²(α + β + 1)). (11.16)

The sample quantities x̄ and s² are estimates of those functions of α and β, hence the estimates α̂ and β̂ of α and β would be the solutions to

x̄ = α̂/(α̂ + β̂) and s² = α̂β̂/((α̂ + β̂)²(α̂ + β̂ + 1)), (11.17)

or after some algebra,

α̂ = x̄ (x̄(1 − x̄)/s² − 1) and β̂ = (1 − x̄)(x̄(1 − x̄)/s² − 1). (11.18)


The estimators in (11.18) are special plug-in estimators, called method of moments estimators, because the estimates of the parameters are chosen to match the population moments with their sample versions. Method of moment estimators are not necessarily strictly defined. For example, in the Poisson(λ), both the mean and variance are λ, so that λ̂ could be x̄ or s². Also, one has to choose moments that work. For example, if the data are iid N(0, σ²), and we wish to estimate σ, the mean is useless because one cannot do anything to match 0 = x̄.
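A minimal R sketch (mine) of the beta calculation in (11.17) and (11.18); the data are simulated just for illustration.

x <- rbeta(100, 2, 5)                  # hypothetical Beta(2, 5) sample
xbar <- mean(x)
s2 <- mean((x - xbar)^2)               # sample variance, dividing by n as in the text
m <- xbar*(1 - xbar)/s2 - 1
alphahat <- xbar * m                   # alpha-hat in (11.18)
betahat <- (1 - xbar) * m              # beta-hat in (11.18)
c(alphahat, betahat)                   # should be near (2, 5)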

Finding standard errors for plug-in estimators often involves another plug-in. For example, we know that when estimating the mean with iid observations, Var[X̄] = σ²/n if Var[Xi] = σ² < ∞, hence we can use s/√n as the standard error. Or in the Poisson case, sen(X̄) = √(X̄/n), since the mean and variance are the same.

11.3.1 Coefficient of variation

Suppose we have a sample of iid N(µ, σ²)’s, and we wish to find a confidence interval for the coefficient of variation in (11.2), cv = g(µ, σ²) = σ/µ, assuming µ ≠ 0. Let δn = δ(x1, . . . , xn) = s/x̄ (or we could use s*/x̄). In (9.62) we used the ∆-method to show that

√n (δn − cv) →D N(0, cv²(cv² + 1/2)). (11.19)

Thus we can estimate the standard error of δn by plugging δn into the standard deviation:

sen(δn) = (1/√n) |δn| √(δn² + 1/2). (11.20)

We can then use Lemma 9.1 on page 139 to show that

√n (δn − cv) / (|δn| √(δn² + 1/2)) →D N(0, 1), (11.21)

hence δn ± 2 sen(δn) is an approximate 95% confidence interval for cv, as in (11.13).
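A short R sketch (mine) of this interval; the data vector x is hypothetical.

x <- rnorm(50, mean = 10, sd = 2)                      # hypothetical N(10, 2^2) sample
n <- length(x)
deltan <- sqrt(mean((x - mean(x))^2)) / mean(x)        # s/xbar, with s dividing by n
sen <- abs(deltan) * sqrt(deltan^2 + 1/2) / sqrt(n)    # plug-in standard error, (11.20)
c(deltan - 2*sen, deltan + 2*sen)                      # approximate 95% interval for cv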

11.4 Plug-in methods: Nonparametric

A nonparametric model is one that cannot be defined using a finite number of parameters. Here we will focus on the model where X1, . . . , Xn are iid, each with space X ⊂ Rp and distribution function F, where we assume that F ∈ F, for F being a large class of distribution functions. For example, it could contain all continuous distribution functions, or all with finite mean, or all with unimodal pdfs.

This F is the ultimate parameter, and we are interested in estimating functionals on F,

θ : F −→ Rq. (11.22)

That is, the argument for the function θ is F. The mean, median, variance, P[Xi ≤ 10] = F(10), etc., are all such functionals for appropriate F’s. Notice that the α and β of a beta distribution are not such functionals, because they are not defined for non-beta distributions. The θ(F) is then the population parameter θ.

The plug-in estimate is obtained by plugging in an estimate of the parameter F. Most generally for the iid case, this estimate F̂n is the empirical distribution function, defined by

F̂n(x) = #{xi | xi ≤ x}/n (11.23)

in the univariate case. If the data are p-variate, with xi = (xi1, . . . , xip), then

F̂n(x) = #{xi | xij ≤ xj for all j = 1, . . . , p}/n, (11.24)

where x = (x1, . . . , xp). This F̂n is the distribution function for the random vector X*, which has space

X* = {the distinct values among x1, . . . , xn} (11.25)

and probabilities

P*[X* = x*] = #{xi | xi = x*}/n, x* ∈ X*. (11.26)

Thus X* is generated by randomly choosing one of the observations xi. For example, if the sample is 3, 5, 2, 6, 2, 2, then X* = {2, 3, 5, 6} and

P[X* = 2] = 1/2, P[X* = 3] = P[X* = 5] = P[X* = 6] = 1/6. (11.27)

Back to estimating the θ(F) in (11.22), the plug-in estimator is then

θ̂ = θ̂(X1, . . . , Xn) = θ(F̂n). (11.28)

If θ is the mean, θ(F̂n) = x̄, and if θ is the variance, θ(F̂n) = s² = ∑(xi − x̄)²/n. As in (11.2), the coefficient of variation can be estimated by the sample version, s/x̄. The parameter P[Xi ≤ 10] = F(10) is estimated by F̂n(10), which is the same as (11.14).
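In R, these plug-in estimates come straight from the data; a brief sketch using the small sample above:

x <- c(3, 5, 2, 6, 2, 2)          # the small sample used in (11.27)
mean(x)                           # plug-in estimate of the mean
mean((x - mean(x))^2)             # plug-in estimate of the variance (dividing by n)
Fn <- ecdf(x)                     # the empirical distribution function (11.23)
Fn(10)                            # estimate of P[Xi <= 10] = F(10)
mean(x <= 10)                     # the same estimate, as in (11.14)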

In the next section we look at estimating standard errors and confidence intervals of such estimators.

11.5 Plug-in methods: Bootstrap

Continue with the setup in the previous section. We will introduce the bootstrap procedure to estimate the bias and standard error of θ(F̂n), and confidence intervals for θ(F). In general, the bootstrap can be used to estimate the distribution of a function T of the data and the distribution function:

T(X1, . . . , Xn ; F), where X1, . . . , Xn are iid ∼ F. (11.29)

We will focus on taking

T(X1, . . . , Xn ; F) = θ(F̂n) − θ(F) = θ̂ − θ(F). (11.30)

Note that the bias of our estimator is the expected value of T, and the standard error is the standard deviation of T. In addition, this statistic can be used like a pivotal quantity as in (11.10), so that if we can find A and B such that

P[A < T(X1, . . . , Xn ; F) < B] = P[A < θ̂ − θ(F) < B] = 1 − α, (11.31)


then a 100(1− α)% confidence interval for θ(F) is

(θ̂ − B, θ̂ − A). (11.32)

The bootstrap estimate of the distribution of T is also a plug-in estimator, where we plug the empirical distribution function in for the F, and the data in T are replaced by iid observations drawn from F̂n. That is, the bootstrap estimate of the distribution of T in (11.30) is the distribution of

T(X1*, . . . , Xn* ; F̂n), where X1*, . . . , Xn* are iid ∼ F̂n. (11.33)

Note that the Xi*’s distribution is equivalent to drawing n observations with replacement from the data x1, . . . , xn.

In principle it is easy to find the bootstrap distribution since there is only a finite number of possible such draws. In practice, though, the number of possibilities is combinatorially large (see Exercise 11.7.9), so we usually estimate the estimate of the distribution by taking a number, say K, of random samples of the Xi*’s. That is, the kth bootstrap sample x*k1, . . . , x*kn is obtained by randomly drawing n observations with replacement from the original data x1, . . . , xn. The kth realization of the T in (11.33) is then

t*k = T(x*k1, . . . , x*kn ; F̂n) = θ̂(x*k1, . . . , x*kn) − θ̂(x1, . . . , xn). (11.34)

We then estimate A and B in (11.31) and (11.32) with a and b, respectively, where a is the 0.025th quantile, and b is the 0.975th quantile, of the bootstrapped T’s, t*1, . . . , t*K.
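Here is a minimal R sketch (mine) of the resampling scheme just described for a generic plug-in estimator; the function name boot.interval and the argument thetahat are placeholders, and Section 11.5.2 gives the text's own code for the median.

boot.interval <- function(x, thetahat, K = 5000, alpha = 0.05) {
  tstar <- rep(NA, K)
  for(k in 1:K) {
    xstar <- sample(x, replace = TRUE)             # n draws with replacement from the data
    tstar[k] <- thetahat(xstar) - thetahat(x)      # a realization of T as in (11.34)
  }
  ab <- quantile(tstar, c(alpha/2, 1 - alpha/2))   # a and b
  list(bias = mean(tstar), se = sd(tstar),
       ci = thetahat(x) - ab[2:1])                 # (thetahat - b, thetahat - a), as in (11.32)
}

For example, boot.interval(x, mean) or boot.interval(x, median) would produce the kinds of quantities used in the next subsection.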

11.5.1 Sample mean and median

Take the univariate data X1, . . . , Xn to be iid F, and θ(F) = E[Xi], so that θ(F̂n) = X̄n, the usual sample mean. Then

T(X1, . . . , Xn ; F) = X̄n − θ(F). (11.35)

The bootstrap estimate of the distribution of T is the distribution of

T* = T(X1*, . . . , Xn* ; F̂n) = X̄n* − x̄n, (11.36)

where X̄n* is the sample mean of the bootstrapped sample.

It is easy to find the mean and variance of T*. Since X1*, . . . , Xn* are iid observations from a distribution with mean x̄n and variance sn² = ∑(xi − x̄n)²/n, we have

E[T*] = 0 and Var[T*] = sn²/n. (11.37)

Thus the bootstrap estimate of the bias is 0 (which we know is the actual bias), and the bootstrap estimate of the standard error of the sample mean is sn/√n, which is also reasonable. For a 95% confidence interval, we first need to find a and b so that

P[a < X̄n* − x̄n < b] ≈ 0.95, (11.38)


[Figure 11.1: Bootstrapping the shoes. Three histograms: the original shoe numbers, the bootstrapped means, and the bootstrapped medians.]

the approximation being needed because the distribution is discrete. To find the exact a and b, we would need to find the sample means for all possible bootstrap samples, then line them up in order to find the 2.5 and 97.5 percentiles. Unless n is very small, that approach is not feasible, so we resample.

To illustrate, each of a sample of 712 students were asked, “How many pairs of shoes do you own?” The first histogram in Figure 11.1 shows the original data, where “shoe number” means number of pairs owned. We look at both the mean and median, hoping to estimate a confidence interval for the population quantities. The sample mean is x̄n = 15.2079 and the Median{x1, . . . , xn} = 12. The two bootstrap quantities are

T*Mean = X̄n* − x̄n and T*Median = Median{X1*, . . . , Xn*} − Median{x1, . . . , xn}. (11.39)

The second and third histograms in Figure 11.1 are the histograms of K = 5000 bootstrapped values of the T*Mean’s and T*Median’s, respectively. That is, for the mean, we took 5000 bootstrap samples of 712 observations with replacement from the original data. For the kth bootstrap sample, we found the mean, then subtracted the original sample mean from the bootstrap mean to obtain t*k. Then t*1, . . . , t*K are used to estimate the mean, standard deviation, and quantiles a and b of T*Mean. A similar procedure was used for the median. The following table contains the results:

            Mean      Standard deviation    (a, b)
T*Mean      0.0013    0.4710                (−0.8919, 0.9425)
T*Median   −0.0769    0.4511                (−1.25, 1.25)

(11.40)

As before, a and b are the 0.025th and 0.975th sample quantiles of the bootstrapped T*’s, respectively. The intervals (a, b) contain about 95% of the bootstrapped samples (exactly 95% for the mean, but 98.1% for the median).

The means of the bootstrapped quantities estimate the biases of the estimators. In this case, we see both biases are quite small, as expected for the mean. The standard deviations are our estimates of the standard errors of the estimators. Both the sample mean and sample median have standard errors a little below 0.5 pairs of shoes (which is one shoe). For the sample mean, we could also use the usual standard error estimate of sn/√n, which in this case is 0.4692, about the same as the bootstrap estimate. For the median, we do not really have a ready alternative for estimating the standard error. In Section 9.2.1 we used the ∆-method to find an estimate, but that procedure required knowing the pdf. Here, we do not even have a pdf, since the data are discrete.

Turn to confidence intervals. The histogram of the bootstrapped means looks very close to a normal curve, suggesting that x̄n ± 2 se(x̄n) would be a reasonable confidence interval. The bootstrapped estimate of the confidence interval using the quantiles a and b would be (x̄n − b, x̄n − a) as in (11.32). The two intervals are very similar:

x̄n ± 2 se(x̄n) = (14.27, 16.15) and (x̄n − b, x̄n − a) = (14.27, 16.10). (11.41)

For the median, the histogram does not look particularly normal, though it does appear reasonably symmetric. In fact, it is a discrete distribution, with values only at integers and half-integers (since the data is all integers). Thus I took the a and b at quarter-integers, i.e., ±1.25. The bootstrap estimate of the confidence interval for the median is then

(Median{x1, . . . , xn} − b, Median{x1, . . . , xn} − a) = (10.75, 13.25). (11.42)

This analysis suggests that the population mean number of pairs is between about 14 and 16, while the median is between about 11 and 13. Not surprisingly, given the positive skewness of the data, the mean is larger than the median.

11.5.2 Using R

Bootstrapping directly is fairly easy in R, though there are also packages with enhanced capabilities. Here, we will illustrate bootstrapping the median. The observations are in the vector x. To find 5000 bootstrapped values of the median, we use

medstar <- NULL                          # Vector to collect the bootstrapped medians
for(k in 1:5000) {
  xstar <- sample(x, replace=T)          # Obtains one bootstrap sample
  medstar <- c(medstar, median(xstar))
}

The t*k’s are found by subtracting the sample median from the bootstrapped medians, from which we can estimate the bias, standard error, and confidence interval:

tstar <- medstar - median(x)
mean(tstar)                              # Estimates the bias
sd(tstar)                                # Estimates the standard error
ab <- quantile(tstar, c(0.025, 0.975))   # Obtains a and b
median(x) - ab[2:1]                      # Finds the confidence interval

Then if x contains the sample of 712 shoe numbers, the output of the above code is -0.0769 for the bias, 0.4511 for the standard error, and (11, 13) for the confidence interval, since a = −1 and b = 1. These numbers will be slightly different each time we run the code. Note that in (11.40), I made the continuity correction expanding the (a, b) interval by 0.25 on each side because of the discreteness of the distribution of the t*k’s.
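The same analysis can also be run with the boot package, one of the packages with enhanced capabilities mentioned above; a brief sketch, again assuming the data are in x (the "basic" interval should agree with (11.32) up to the package's quantile interpolation).

library(boot)
medboot <- boot(x, statistic = function(d, i) median(d[i]), R = 5000)
mean(medboot$t) - medboot$t0       # bootstrap estimate of the bias
sd(medboot$t)                      # bootstrap estimate of the standard error
boot.ci(medboot, type = "basic")   # interval of the form (11.32)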


11.6 Posterior distribution

In the Bayesian framework, the inference is the posterior, because the posterior has all the information about the parameter given the data. Estimation of θ then becomes a matter of summarizing the distribution. In Section 6.6.2, we have X | Θ = θ ∼ Binomial(n, θ) with prior Θ ∼ Beta(α, β), so that

Θ |X = x ∼ Beta(α + x, β + n− x). (11.43)

One estimate of the parameter is the posterior mean,

E[Θ | X = x] = (α + x)/(α + β + n). (11.44)

Another is the posterior mode, the value with the highest density, which turns out to be

Mode[Θ | X = x] = (α + x − 1)/(α + β + n − 2). (11.45)

One can think of α as the prior number of successes and α + β as the prior sample size; the larger α + β, the more weight the prior has in the posterior.

A 95% probability interval for the parameter given the data is (lx, ux), where the endpoints are chosen to satisfy

P[lx < Θ < ux |X = x] = 0.95. (11.46)

There are many choices. The simplest may be to take P[Θ < lx | X = x] = P[Θ > ux | X = x] = 0.025. Commonly they are chosen so that the interval contains the points with the highest posterior density, but still with 95% probability.
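For the beta posterior (11.43), these summaries are one line each in R; a small sketch with hypothetical data and prior (the names a0 and b0 are mine):

n <- 20; x <- 14                                  # hypothetical binomial data
a0 <- 1; b0 <- 1                                  # hypothetical Beta(1, 1) prior
(a0 + x) / (a0 + b0 + n)                          # posterior mean, (11.44)
qbeta(c(0.025, 0.975), a0 + x, b0 + n - x)        # equal-tailed 95% probability interval, (11.46)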

11.6.1 Normal mean

Exercise 7.8.14 treats the normal mean case when the variance is known, and the prior on the mean is normal. That is, if

Y1, . . . , Yn | M = µ are iid N(µ, σ²), M ∼ N(µ0, σ0²), (11.47)

where µ0, σ0², and σ² are known, then

M | Y = y ∼ N((ω0µ0 + nωȳ)/(ω0 + nω), 1/(ω0 + nω)), (11.48)

where ω0 = 1/σ0² and ω = 1/σ² are the precisions. Note that the posterior mean is a weighted average of the prior mean and the sample mean, with weights proportional to their precisions. In particular, if we take a very flat prior, meaning take σ0² to be very large or ω0 to be near 0, then the posterior distribution for M is very close to N(ȳ, σ²/n). Thus the usual frequentist confidence interval can be a good approximation of a probability interval. That is, consider the probability

P[ȳ − 1.96 σ/√n < M < ȳ + 1.96 σ/√n | Y = y] (11.49)


[Figure 11.2: The posterior probability that the mean is in the 95% confidence interval as a function of τ for values of z = 1, 2, 3, 4, 5. See (11.50).]

as a function of the prior parameters µ0 and σ0². Exercise 11.7.16 shows that this probability can be written

Φ((z/(1 + τ) + 1.96)/√(τ/(1 + τ))) − Φ((z/(1 + τ) − 1.96)/√(τ/(1 + τ))), where z = √n (ȳ − µ0)/σ and τ = nσ0²/σ². (11.50)

The z is a measure of how far the sample mean is from the prior mean, and τ is the ratio of the prior variance to the conditional variance of Ȳ given M = µ. Figure 11.2 graphs some of these probabilities. What we see is that even if the prior guess of the mean is off by five standard errors, the confidence interval settles to a probability close to 95% by the time the prior variance is 15 times that of the sample variance. In any case, as σ0² → ∞, the probability goes to 95%. Thus we are justified in taking as flat a prior as we wish.
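The curves in Figure 11.2 can be checked directly; here is a minimal R sketch (mine, not the text's) that evaluates (11.50) for given z and τ.

post.coverage <- function(z, tau) {
  hi <- (z/(1 + tau) + 1.96) / sqrt(tau/(1 + tau))
  lo <- (z/(1 + tau) - 1.96) / sqrt(tau/(1 + tau))
  pnorm(hi) - pnorm(lo)             # the probability in (11.50)
}
post.coverage(z = 5, tau = 15)      # about 0.95, even with the prior mean off by five standard errors
post.coverage(z = 5, tau = 1e6)     # essentially 0.95 for a very flat prior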

In both the binomial example in (11.43) and the normal one in (11.48), the prior and posterior distributions are of the same form (beta in the former case, normal in the latter), but the posterior has updated parameters depending on the observed data. Such priors are called conjugate priors. Exercise 11.7.14 discovers the conjugate prior for the exponential. Fink (1997) is a compendium of conjugate priors, and Diaconis and Ylvisaker (1979) take a more theoretical look at them.

Consider the normal case when σ² is not known. We will use the precisions in the notation, so that

Y1, . . . , Yn |M = µ, Ω = ω are iid N(µ, 1/ω). (11.51)

We need to place a joint prior on (µ, ω). Start by inspecting the conditional pdf:

f(y | µ, ω) = c ω^(n/2) e^(−(1/2)ω(n(ȳ − µ)² + ∑(yi − ȳ)²)), (11.52)


where we use

∑(yi − µ)² = n(ȳ − µ)² + ∑(yi − ȳ)². (11.53)

A conjugate prior would have the same form as the likelihood in terms of the parameters. If you look at the µ, you see what looks like a normal-type pdf. If you ignore the n(ȳ − µ)² term, the likelihood looks like a gamma pdf in ω. The problem is that the ω also appears in the µ term. It turns out that the form can be replicated using a two-stage prior:

M |Ω = ω ∼ N(µ0, 1/(k0ω)) and Ω ∼ Gamma(ν0/2, λ0/2), (11.54)

where µ0, k0, ν0, and λ0 are known prior parameters, the last three being positive. (Note that if ν0 is an integer, Ω ∼ (1/λ0) χ²ν0.) The joint prior pdf can be written

f(µ, ω) = c ω^((ν0+1)/2−1) e^(−(1/2)ω(k0(µ − µ0)² + λ0)). (11.55)

The form of the posterior is obtained by multiplying the likelihood and prior. It is easy to see what happens to the power of ω at the front. In the exponent, we again have −ω/2 times a quadratic in µ, but it takes little work to complete the square. The answer (Exercise 11.7.20) is

f(µ, ω | y) = c* ω^((ν0+n+1)/2−1) e^(−(1/2)ω((n + k0)(µ − µ*)² + λ*)), (11.56)

where

µ* = (k0µ0 + nȳ)/(k0 + n) and λ* = λ0 + ∑(yi − ȳ)² + (nk0/(n + k0))(ȳ − µ0)². (11.57)

Note that µ* is the analog of the posterior mean in (11.48). Matching the prior parameters in (11.55) to their analogs in (11.56), we have

ν0 → ν0 + n, k0 → k0 + n, µ0 → µ∗, and λ0 → λ∗. (11.58)

Formally we have the posterior distribution of (M, Ω). The following lemma helps to understand this prior. See Exercise 11.7.19 for the proof.

Lemma 11.1. Suppose (M, Ω) has distribution as given in (11.54). Then

E[M] = µ0, Var[M] = λ0/(k0(ν0 − 2)), and E[1/Ω] = λ0/(ν0 − 2), (11.59)

the last two if ν0 > 2. Also, marginally

√ν0 (M − µ0)/√(λ0/k0) ∼ tν0, (11.60)

Student’s t on ν0 degrees of freedom. See (6.115). (Note: The density is well-defined even for non-integer values of ν0 > 0.)

Plugging in the posterior parameters, we can see that E[M | Y = y] = µ* of (11.57), which is close to ȳ if k0 is near zero. For the variance,

E[1/Ω | Y = y] = (λ0 + ∑(yi − ȳ)² + (nk0/(n + k0))(ȳ − µ0)²)/(ν0 + n − 2) ≈ s². (11.61)


If λ0, k0, and ν0 are near zero, then the posterior mean of σ² is very close to the sample variance (dividing by n − 2). Likewise, the posterior variance of the mean is

Var[M | Y = y] = (λ0 + ∑(yi − ȳ)² + (nk0/(n + k0))(ȳ − µ0)²)/((k0 + n)(ν0 + n − 2)) ≈ s²/n. (11.62)

The marginal posterior distribution of M is given by

√(ν0 + n) (M − µ*)/√((λ0 + ∑(yi − ȳ)² + (nk0/(n + k0))(ȳ − µ0)²)/(n + k0)) ∼ tν0+n, (11.63)

where that variable is close to the usual t statistic √n (M − ȳ)/s.
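As a quick check of the updating formulas (11.57) and (11.58), here is a small R sketch (not from the text); the prior settings and the data vector y are hypothetical, chosen to be nearly flat.

y <- rnorm(30, mean = 5, sd = 2)                      # hypothetical data
n <- length(y); ybar <- mean(y)
mu0 <- 0; k0 <- 0.01; nu0 <- 0.01; lambda0 <- 0.01    # hypothetical, nearly flat prior
mustar <- (k0*mu0 + n*ybar) / (k0 + n)                # mu* in (11.57)
lambdastar <- lambda0 + sum((y - ybar)^2) +
  (n*k0/(n + k0)) * (ybar - mu0)^2                    # lambda* in (11.57)
c(nu0 + n, k0 + n, mustar, lambdastar)                # updated parameters, as in (11.58)
lambdastar / (nu0 + n - 2)                            # posterior mean of 1/Omega, roughly s^2 as in (11.61)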

11.6.2 Improper priors

As always with Bayesian inference, the choice of a prior is important. If there are substantive reasons to choose a particular prior or a range of priors, then there is justification for using those. Often, researchers wish to have priors that are chosen automatically. Such priors should have minimal influence on the inference (i.e., the posteriors), and be generally accepted by other researchers.

In the normal and binomial situations in the previous parts of this section, we saw that by taking prior parameter values close to the boundary of their spaces, the posterior quantities became close to the sample quantities we would use in frequentist evaluations. This happens because the priors are becoming less informative. E.g., if the prior on the mean is N(µ0, σ0²) and σ0² is very large, then the prior is expressing very little knowledge of the value of the parameter. It could be anything! If the prior parameter actually reaches the limit, e.g., taking σ0² = ∞, then the prior is no longer a probability distribution. In this case, it would be uniform on the real line, which has total mass +∞. Such priors are called improper. Jeffreys (1961) made an early effort to specify reasonable automatic priors, often called noninformative or reference priors, in a number of specific situations. If the parameter space is R, he suggests using the uniform prior, and if the parameter σ has space (0, ∞), he suggests the density dσ/σ, which is also improper. For the binomial p, Jeffreys likes Beta(1/2, 1/2).

Interestingly, improper priors often lead to proper posteriors, where the posterior pdf is defined formally as if the prior were proper. For example, if we use the uniform prior on µ ∈ R, the posterior density is

fM|Y(µ | y) = f(y | µ) / ∫ f(y | µ*) dµ*, the integral being over µ* ∈ (−∞, ∞). (11.64)

If this denominator does not converge, then this procedure is not useful. Even if it does, there may in general be problems with improper priors. See Kass and Wasserman (1996) for an overview of improper priors, including examples where they lead to difficulties. It is safest to use proper priors, but using improper priors in situations where they have been shown to work well is fine.


11.7 Exercises

Exercise 11.7.1. For any estimator, let MSEθ[δ] = Eθ[(δ(X) − g(θ))²], the mean square error. Show that MSEθ[δ] = Biasθ[δ]² + Varθ[δ] if the variance is finite. [Hint: Write δ(X) − g(θ) = δ(X) − Eθ[δ] + Eθ[δ] − g(θ) and expand the square.]

Exercise 11.7.2. True or false. (The convergences are all as n → ∞.) (a) If an estimator is consistent, then it is unbiased. (b) If an estimator is unbiased, then it is consistent. (c) If an estimator is consistent, then its bias goes to zero. (d) If an estimator’s bias goes to zero, then it is consistent. (e) If an estimator’s variance goes to zero, then it is consistent. (f) If an estimator is unbiased and its variance goes to zero, then it is consistent. (g) If an estimator’s variance and bias both go to zero, then it is consistent. (h) If an estimator is consistent, then its variance goes to zero. (i) If Wn is asymptotically N(0, σ²), then E[Wn] goes to 0. (j) If Wn is asymptotically N(0, σ²), then Var[Wn] goes to σ².

Exercise 11.7.3. Let X1, . . . , Xn be iid N(µ, 1). We are interested in estimating µ². Consider the two estimators

δn(1) = x̄n² − 1/n and δn(2) = (1/n) ∑_{i=1}^{n} xi² − 1. (11.65)

(a) Show that the two estimators are unbiased. (b) Show that the two estimators are consistent. (c) Find the τ1² for which √n(δn(1) − µ²) →D N(0, τ1²). (d) Find the τ2² for which √n(δn(2) − µ²) →D N(0, τ2²). (e) The asymptotic relative efficiency of δn(1) to δn(2) is τ2²/τ1². Find this ratio as a function of µ. What is it at µ = 0? (f) Is one of the estimators always better than the other?

Exercise 11.7.4. Suppose X1, . . . , Xn are iid Gamma(α, λ), where (α, λ) ∈ (0, ∞) ×(0, ∞). Find method-of-moment estimators for α and λ.

Exercise 11.7.5. Let X1, . . . , Xn be iid Beta(α, 1), α > 0. (a) Consider the method of moments estimator, α̂n, of α based on X̄n. What is it? Is it consistent? (b) What is Var[Xi]? (c) Let

Wn = √n (α̂n − α) →D N(0, σ²(α)). (11.66)

What is σ²(α)? (d) Does σ²(α̂n) →P σ²(α)? (e) Does

√n (α̂n − α)/σ(α̂n) →D N(0, 1)? (11.67)

(f) Using part (e), find an approximate 95% confidence interval for α when n = 100 and x̄n = 3/4. (g) Imagine instead the Xi’s are Beta(α, α). What is the method of moments estimator of α based on X̄n for this model?

Exercise 11.7.6. Suppose that the number of telephone calls coming in to a call center follows a Poisson process with a rate of λ calls per minute, where λ ∈ (0, ∞). That is, if X is the number of calls coming in over the course of t minutes, then X ∼ Poisson(tλ). (a) Assuming t is known, what is a reasonable estimate of λ based on X? (b) Assuming t is known, what is a reasonable estimate of the parameter θ = Pλ[no calls in the next two minutes] based on X? (First find θ in terms of λ.)


Exercise 11.7.7. Suppose (X1, M1), . . . , (Xn, Mn) are independent pairs, where for each i, Xi | Mi = µi ∼ N(µi, 1), and Mi ∼ N(0, σ²). The µi’s are unobserved and to be estimated. The parameter σ² ∈ (0, ∞) is unknown. (a) What is the marginal distribution of X = (X1, . . . , Xn)? [The Xi’s are marginally iid.] (b) Find an estimate of σ² based on X1, . . . , Xn. (c) Find E[Mi | Xi = xi] as xi times a function of σ². (d) Now use the estimate of σ² from part (b) to find an estimate of E[Mi | Xi = xi] as in part (c). The result is called an empirical Bayes estimator of µi. It is also a shrinkage estimator of µi, since it takes xi and shrinks it a bit (or a lot, depending on ∑ xi²).

Exercise 11.7.8. Suppose X1, . . . , Xn are iid with distribution function FX, Y1, . . . , Ym are iid with distribution function FY, and the Xi’s are independent of the Yi’s. Consider estimating the parameter θ = P[X > Y], where X and Y are independent, X ∼ FX and Y ∼ FY. The data has n = 4 and m = 5, where the Xi’s are the fastest speeds (in MPH) a sample of men have ever driven, and the Yi’s are the fastest speeds a sample of women have ever driven. Here are the values: Men: 135, 110, 110, 80; Women: 75, 100, 95, 90, 90. Give an estimate of θ. [Hint: The parameter is the probability that a randomly chosen man has driven faster than a randomly chosen woman. The estimate is the analog for the given samples.]

Exercise 11.7.9. Consider the sample x1, . . . , xn, where all the xi’s are distinct. How many different sets of bootstrap samples, x1*, . . . , xn*, are there? (If n = 3, then the sets {x1, x3, x1} and {x3, x1, x1} are the same.) Feller (1968) introduced the “stars and bars” technique for certain combinatorial calculations. Here, we can represent a bootstrap sample by specifying how many times each xi is in the sample. Thus if there are ki of the xi’s, then k1 + · · · + kn = n. Now write out a sequence of stars and bars as follows: Write down k1 stars, then one bar, then k2 stars, then one bar, . . ., then kn−1 stars, then one bar, and finally kn stars. You should end up with n stars and n − 1 bars. For example, if n = 7 and the bootstrap sample is {x1, x1, x3, x5, x5, x5, x6}, the picture will be

∗ ∗ | | ∗ | | ∗ ∗ ∗ | ∗ | (11.68)

Note that the possible arrangements of n stars and n − 1 bars are in one-to-one correspondence with possible bootstrap samples. (a) Argue that there are (2n−1 choose n) such arrangements. (b) How many bootstrap samples are there for n = 10, n = 100, and n = 1000? [You may wish to use the log(Γ(x)) function in R, lgamma.] (c) For which n is the number of bootstrap samples approximately one googol (10^100)?

Note: The data used in Exercises 11.7.10 through 11.7.12 can be found as R matrices in the file http://istics.net/r/data_chapter_11.txt.

Exercise 11.7.10. Henson, Rogers, and Reynolds (1996) performed an experiment to see if caffeine has a negative effect on short-term visual memory. High school students were randomly chosen: 9 from eighth grade, 10 from tenth grade, and 9 from twelfth grade. Each person was tested once after having caffeinated Coke, and once after having decaffeinated Coke. After each drink, the person was given ten seconds to try to memorize twenty small, common objects, then allowed a minute to write down as many as could be remembered. The main question of interest is whether people remembered more objects after the Coke without caffeine than after the Coke with caffeine. The data (in the R matrix caffeine) are

        Grade 8         Grade 10        Grade 12
   Without  With    Without  With    Without  With
      5      6         6      3         7      7
      9      8         9     11         8      6
      6      5         4      4         9      6
      8      9         7      6        11      7
      7      6         6      8         5      5
      6      6         7      6         9      4
      8      6         6      8         9      7
      6      8         9      8        11      8
      6      7        10      7        10      9
                      10      6

(11.69)

For each grade, consider the differences between the number of objects remembered without caffeine and the number remembered with caffeine, Xi = Without − With. The sample means of these three differences are 0, 0.70, and 2.22 for grades 8, 10, and 12, respectively. For each grade, find an approximate 95% confidence interval for the population mean using 5000 bootstrap samples. For which grades, if any, does caffeine seem to affect the number of objects remembered?

Exercise 11.7.11. Consider the data on the number of pairs of shoes people owned in Section 11.5.1, but now separate men and women. See the R matrix shoes. The means and medians are given in the table:

           n      Mean     Median
Women    485    19.408         15
Men      227     6.233          5

(11.70)

(a) Consider the parameter θ being the ratio of the population mean for the women to that for the men. Find the plug-in estimate of θ based on the data in (11.70). For the two-sample situation here, a bootstrap sample consists of one bootstrap sample from the women, and one from the men, then the bootstrap quantity of interest is the ratio of sample means of the bootstrap samples. In the following, use 5000 bootstrap samples. (b) Find the bootstrap estimate of the bias of the estimate in part (a). Is there much bias? (c) Find the bootstrap estimate of the 95% confidence interval for θ. (d) Repeat the process for the ratio of medians.

Exercise 11.7.12. Continue with the data from Exercises 11.7.11 and 9.5.11 (c). As in the latter, let x = shoe size and y = log(# of pairs of shoes owned), so that their sample correlation is r = −0.500. The approximate 95% confidence interval for ρ based on the ∆-method was found to be (−.5540, −.4417). Find the approximate 95% confidence interval using the bootstrap. Does it differ much from what we found earlier? (Here, we want to bootstrap (xi, yi) pairs. In R, we can bootstrap the indices with i <- sample(712, replace=T), which yields a vector of 712 indices between 1 and 712. Then the bootstrapped correlation is cor(x[i], y[i]).)

Exercise 11.7.13. Consider the bootstrap in the binomial case. We observe X ∼ Binomial(n, p), thinking of it as the sum of n iid Bernoulli(p)’s, Z1, . . . , Zn. Then a bootstrap sample is Z1*, . . . , Zn*. Let X* = Z1* + · · · + Zn*. (a) Given the data X = x, what is the exact distribution of X*? (b) What is the bootstrap estimate of the bias of the estimate p̂ = X/n of p? (c) What is the bootstrap estimate of the standard error of p̂?

Exercise 11.7.14. Suppose X1, . . . , Xn are iid Exponential(λ) for λ ∈ (0, ∞). (a) Show that the pdf can be written as f(x1, . . . , xn | λ) = λ^n e^(−λT) for some T, a function of the Xi’s. What is T? (b) Consider the pdf for λ ∈ (0, ∞) with parameters ν and τ such that for some c,

ρ(λ | ν, τ) = c(ν, τ) λ^ν e^(−λτ) (11.71)

is a legitimate density for ν > −1 and τ > 0. What is the distribution? (c) Now consider the Bayesian model where

X1, . . . , Xn |Λ = λ are iid Exponential(λ), and Λ ∼ ρ(λ | ν0, τ0) (11.72)

for some ν0 > −1 and τ0 > 0. Show the posterior density of Λ given the data is ρ(λ | ν*, τ*) for τ* a function of τ0 and T, and ν* a function of ν0 and n. What is the posterior distribution? Thus the conjugate prior for the exponential has density ρ. (d) What is the posterior mean of Λ given the data? What does this posterior mean approach as ν0 → −1 and τ0 → 0? Is the ρ(λ | −1, 0) density a proper one?

Exercise 11.7.15. Here the data are X1, . . . , Xn iid Poisson(λ). (a) √n(X̄n − λ) →D N(0, v) for what v? (b) Find the approximate 95% confidence intervals for λ when X̄ = 1 and n = 10, 100, and 1000 based on the result in part (a). (c) Using the relatively noninformative prior λ ∼ Gamma(1/2, 1/2), find the 95% probability intervals for λ when X̄ = 1 and n = 10, 100, and 1000. How large does n have to be for the frequency-based interval from part (b) to approximate the Bayesian interval well? (d) Using the relatively strong prior λ ∼ Gamma(50, 50), find the 95% probability intervals for λ when X̄ = 1 and n = 10, 100, and 1000. How large does n have to be for the frequency-based interval from part (b) to approximate this Bayesian interval well?

Exercise 11.7.16. Here we assume that Y1, . . . , Yn | M = µ are iid N(µ, σ²) and M ∼ N(µ0, σ0²), where σ² is known. Then from (11.48), we have M | Y = y ∼ N(µ*, σ*²), where with ω0 = 1/σ0² and ω = 1/σ²,

µ* = (σ²µ0 + nσ0²ȳ)/(σ² + nσ0²) and σ*² = σ²σ0²/(σ² + nσ0²). (11.73)

(See also (7.108).) (a) Show that

P[ȳ − 1.96 σ/√n < M < ȳ + 1.96 σ/√n | Y = y] = Φ((z/(1 + τ) + 1.96)/√(τ/(1 + τ))) − Φ((z/(1 + τ) − 1.96)/√(τ/(1 + τ))), (11.74)

where z = √n (ȳ − µ0)/σ, τ = nσ0²/σ², and Φ is the distribution function of a N(0, 1). (b) Show that for any fixed z, the probability in (11.74) goes to 95% as τ → ∞. (c) Show that if we use the improper prior that is uniform on R (as in (11.64)), the posterior distribution µ | Y = y is exactly N(ȳ, σ²/n), hence the confidence interval has a posterior probability of exactly 95%.


Exercise 11.7.17. Suppose X | Θ = θ ∼ Binomial(n, θ). Consider the prior with density 1/(θ(1 − θ)) for θ ∈ (0, 1). (It looks like a Beta(0, 0), if there were such a thing.) (a) Show that this prior is improper. (b) Find the posterior distribution as in (11.64), Θ | X = x, for this prior. For some values of x the posterior is valid. For others, it is not, since the denominator is infinite. For which values is the posterior valid? What is the posterior in these cases? (c) If the posterior is valid, what is the posterior mean of Θ? How does it compare to the usual estimator of θ?

Exercise 11.7.18. Consider the normal situation with known mean but unknown variance or precision. Suppose Y1, . . . , Yn | Ω = ω are iid N(µ, 1/ω) with µ known. Take the prior on Ω to be Gamma(ν0/2, λ0/2) as in (11.54). (a) Show that

Ω |Y = y ∼ Gamma((ν0 + n)/2, (λ0 + ∑(yi − µ)2)/2). (11.75)

(b) Find E[1/Ω | Y = y]. What is the value as λ0 and ν0 approach zero? (c) Ignoring the constant in the density, what is the density of the gamma with both parameters equalling zero? Is this an improper prior? Is the posterior using this prior valid? If so, what is it?

Exercise 11.7.19. This exercise is to prove Lemma 11.1. So let

M |Ω = ω ∼ N(µ0, 1/(k0ω)) and Ω ∼ Gamma(ν0/2, λ0/2), (11.76)

as in (11.54). (a) Let Z = √(k0Ω) (M − µ0) and U = λ0Ω. Show that Z ∼ N(0, 1), U ∼ Gamma(ν0/2, 1/2), and Z and U are independent. [What is the conditional distribution Z | U = u? Also, refer to Exercise 5.6.1.] (b) Argue that T = Z/√(U/ν0) ∼ tν0, which verifies (11.60). [See Definition 6.5 in Exercise 6.8.17.] (c) Derive the mean and variance of M based on the known mean and variance of Student’s t. [See Exercise 6.8.17(a).] (d) Show that E[1/Ω] = λ0/(ν0 − 2) if ν0 > 2.

Exercise 11.7.20. Show that

n(ȳ − µ)2 + k0(µ − µ0)2 = (n + k0)(µ − µ∗)2 + (nk0/(n + k0))(ȳ − µ0)2, (11.77)

which is necessary to show that (11.56) holds, where µ∗ = (k0µ0 + nȳ)/(k0 + n) as in (11.57).

Exercise 11.7.21. Take the Bayesian setup with a one-dimensional parameter, so that we are given the conditional distribution X | Θ = θ and the (proper) prior distribution of Θ with space T ⊂ R. Let δ(x) = E[Θ | X = x] be the Bayes estimate of θ. Suppose that δ is an unbiased estimator of θ, so that E[δ(X) | Θ = θ] = θ. Assume that the marginal and conditional variances of δ(X) and Θ are finite. (a) Using the formula for covariance based on conditioning on X (as in (6.50)), show that the unconditional covariance Cov[Θ, δ(X)] equals the unconditional Var[δ(X)]. (b) Using the same formula, but conditioning on Θ, show that Cov[Θ, δ(X)] = Var[Θ]. (c) Show that (a) and (b) imply that the correlation between δ(X) and Θ is 1. Use the result in (2.32) and (2.33) to help show that in fact Θ and δ(X) are the same (i.e., P[Θ = δ(X)] = 1). (d) The conclusion in (c) means that the only time the Bayes estimator is unbiased is when it is exactly equal to the parameter. Can you think of any situations where this phenomenon would occur?


Chapter 12

Linear Regression

12.1 Regression

How is height related to weight? How are sex and age related to heart disease? What factors influence crime rate? Questions such as these have one dependent variable of interest, and one or more explanatory or predictor variables. The goal is to assess the relationship of the explanatory variables to the dependent variable. Examples:

Dependent Variable (Y)    Explanatory Variables (X's)
Weight                    Height, gender
Cholesterol level         Fat intake, obesity, exercise
Heart function            Age, sex
Crime rate                Density, income, education
Bacterial count           Drug

We will generically denote an observation (Y, X), where the dependent variable is Y, and the vector of p explanatory variables is X. The overall goal is to find a function g(X) that is a good predictor of Y. The (mean) regression function uses the average Y for a particular vector of values X = x as the predictor. Median regression models the median Y for given X = x. That is,

g(x) = E[Y | X = x] in mean regression, or g(x) = Median[Y | X = x] in median regression. (12.1)

The median is less sensitive to large values, so may be a more robust measure. More generally, quantile regression (Koenker and Bassett, 1978) seeks to determine a particular quantile of Y given X = x. For example, Y may be a measure of water depth in a river, and one wishes to know the 90th percentile level given X = x to help warn of flooding. (Typically, "regression" refers to mean regression, so that median or quantile regression needs the adjective.)

The function g may or may not be a simple function of the x's, and in fact we might not even know the exact form. Linear regression tries to approximate the conditional expected value by a linear function of x:

g(x) = β0 + β1x1 + · · ·+ βKxK ≈ E[Y |X = x]. (12.2)


As we saw in Lemma 7.8 on page 116, if (Y, X) is (jointly) multivariate normal, then E[Y | X = x] is itself linear, in which case there is no need for an approximation in (12.2).

The rest of this chapter deals with estimation in linear regression. Rather than trying to model Y and X jointly, everything will be performed conditioning on X = x, so we won't even mention the distribution of X. The next section develops the matrix notation needed in order to formally present the model. Section 12.3 discusses least squares estimation, which is associated with mean regression. In Section 12.5, we present some regularization, which modifies the objective function to rein in the sizes of the estimated coefficients, possibly improving prediction. Section 12.6 looks more carefully at median regression, which uses least absolute deviations as its objective function.

12.2 Matrix notation

Here we write the linear model in a universal matrix notation. Simple linear regression has one explanatory x variable, such as trying to predict cholesterol level (Y) from fat intake (x). If there are n observations, then the linear model would be written

Yi = β0 + β1xi + Ei, i = 1, . . . , n. (12.3)

Imagine stacking these equations on top of each other. That is, we construct the vectors

Y = (Y1, Y2, . . . , Yn)′, E = (E1, E2, . . . , En)′, and β = (β0, β1)′. (12.4)

For x, we need a vector for the xi's, but also a vector of 1's, which are surreptitiously multiplying the β0:

x = ( 1   x1
      1   x2
      ⋮    ⋮
      1   xn ). (12.5)

Then the model in (12.3) can be written compactly as

Y = xβ + E. (12.6)

When there is more than one explanatory variable, we need an extra subscript for x, so that xi1 is the value for fat intake and xi2 is the exercise level, say, for person i:

Yi = β0 + β1xi1 + β2xi2 + Ei, i = 1, . . . , n. (12.7)

With K variables, the model would be

Yi = β0 + β1xi1 + · · ·+ βKxiK + Ei, i = 1, . . . , n. (12.8)


The general model (12.8) has the form (12.6) with a longer β and wider x:

Y = xβ + E = ( 1   x11   x12   · · ·   x1K
               1   x21   x22   · · ·   x2K
               ⋮    ⋮     ⋮             ⋮
               1   xn1   xn2   · · ·   xnK ) (β0, β1, β2, . . . , βK)′ + E. (12.9)

We will generally assume that the xij's are fixed constants, hence the Ei's and Yi's are the random quantities. It may be that the x-values are fixed by the experimenter (e.g., denoting dosages or treatment groups assigned to subjects), or (Y, X) has a joint distribution, but the analysis proceeds conditionally, on Y given X = x, and (12.9) describes this conditional distribution. Assumptions on the Ei's in mean regression, moving from least to most specific, include the following:

1. E[Ei] = 0, so that E[E] = 0 and E[Y] = xβ.

2. The Ei’s are uncorrelated, hence the Yi’s are uncorrelated.

3. The Ei’s are homoscedastic, i.e., they all have the same variance, hence so dothe Yi’s.

4. The Ei’s are iid.

5. The Ei’s are multivariate normal. If the previous assumptions also hold, wehave

E ∼ N(0, σ2In) which implies that Y ∼ N(xβ, σ2In). (12.10)

Median regression replaces #1 with Median[Ei] = 0, and generally dispenses with #2 and #3 (since moments are unnecessary). General quantile regression would set the desired quantile of Ei to 0.
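To make the notation concrete, the following R sketch builds a small design matrix of the form (12.9) with two hypothetical predictors and simulates data satisfying assumptions 1–5; the sample size, coefficients, and error standard deviation are arbitrary choices for illustration.

```r
# Build the design matrix in (12.9) and simulate Y = x beta + E under
# assumptions 1-5. All numerical values here are hypothetical.
set.seed(1)
n <- 100
z1 <- rnorm(n)                  # first predictor (e.g., fat intake)
z2 <- rnorm(n)                  # second predictor (e.g., exercise)
x <- cbind(1, z1, z2)           # n x p design matrix, p = K + 1 = 3
beta <- c(2, 0.5, -1)           # (beta0, beta1, beta2)'
sigma <- 1.5
E <- rnorm(n, 0, sigma)         # iid N(0, sigma^2) errors (assumption 5)
Y <- drop(x %*% beta) + E       # the model (12.9)
```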

12.3 Least squares

In regression, or any prediction situation, one approach to estimation chooses the estimate of the parameters so that the predictions are close to the Yi's. Least squares is a venerable and popular criterion. In the regression case, the least squares estimates of the βi's are the values bi that minimize the objective function

obj(b ; y) = ∑_{i=1}^n (yi − (b0 + b1xi1 + · · · + bKxiK))2 = ‖y − xb‖2. (12.11)

We will take x to be n × p, where if, as in (12.11), there are K predictors plus the intercept, p = K + 1. In general, x need not have an intercept term (i.e., a column of 1's). Least squares is tailored to mean regression because for a sample z1, . . . , zn, the sample mean is the value of m that minimizes ∑(zi − m)2 over m. (See Exercise 12.7.1. Also, Exercise 2.7.25 has the result for random variables.)

Ideally, we'd solve y = xb for b, so that if x is square and invertible, then β̂ = x−1y. It is more likely that x′x is invertible, at least when p < n, in which case we multiply both sides by x′:

x′y = x′xβ̂ =⇒ β̂ = (x′x)−1x′y if x′x is invertible. (12.12)


If p > n, i.e., there are more parameters to estimate than observations, x′x will not be invertible. Noninvertibility will occur for p ≤ n when there are linear redundancies in the variables. For example, predictors of a student's score on the final exam may include scores on each of three midterms, plus the average of the three midterms. Or redundancy may be random, such as when there are several categorical predictors, and by chance all the people in the sample that are from Asia are female. Such redundancies can be dealt with by eliminating one or more of the variables. Alternatively, we can use the Moore-Penrose inverse from (7.56), though if x′x is not invertible, the least squares estimate is not unique. See also Exercise 12.7.11, which uses the Moore-Penrose inverse of x itself.

Assume that x′x is invertible. We show that β̂ as in (12.12) does minimize the least squares criterion. Write

‖y − xb‖2 = ‖(y − xβ̂) + (xβ̂ − xb)‖2 = ‖y − xβ̂‖2 + ‖xβ̂ − xb‖2 + 2(y − xβ̂)′(xβ̂ − xb). (12.13)

The estimated fit is ŷ = xβ̂, and the estimated error or residual vector is ê = y − ŷ = y − xβ̂. By definition of β̂, we have that

ŷ = Pxy, Px = x(x′x)−1x′, and ê = Qxy, Qx = In − Px. (12.14)

(For those who know about projections: This Px is the projection matrix onto the space spanned by the columns of x, and Qx is the projection matrix onto the orthogonal complement of the space spanned by the columns of x.)

Exercise 12.7.4 shows that

Px and Qx are symmetric and idempotent, Pxx = x, and Qxx = 0. (12.15)

(Recall from (7.40) that idempotent means PxPx = Px.) Thus the cross-product term in (12.13) can be eliminated:

(y − xβ̂)′(xβ̂ − xb) = (Qxy)′x(β̂ − b) = y′(Qxx)(β̂ − b) = 0. (12.16)

Hence

obj(b ; y) = ‖y − xb‖2 = y′Qxy + (β̂ − b)′x′x(β̂ − b). (12.17)

Since x′x is nonnegative definite and invertible, it must be positive definite. Thus the second summand on the right-hand side of (12.17) must be positive unless b = β̂, proving that the least squares estimate of β is indeed β̂ in (12.12). The minimum of the least squares objective function is the sum of squared errors:

SSe ≡ ‖ê‖2 = ‖y − xβ̂‖2 = y′Qxy. (12.18)
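Continuing the sketch above, the least squares quantities in (12.12) through (12.18) can be computed directly with matrix operations; the last line is just a check against R's built-in lm fit.

```r
# Least squares via (12.12)-(12.18), continuing the simulated data above.
betahat <- solve(t(x) %*% x, t(x) %*% Y)    # (x'x)^{-1} x'y
Px <- x %*% solve(t(x) %*% x) %*% t(x)      # projection onto the columns of x
Qx <- diag(n) - Px
yhat <- Px %*% Y                            # fitted values
ehat <- Qx %*% Y                            # residuals
SSe <- sum(ehat^2)                          # = y'Qx y, as in (12.18)
coef(lm(Y ~ z1 + z2))                       # matches betahat
```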

Is β̂ a good estimator? It depends partially on which of the assumptions in (12.10) hold. If E[E] = 0, then β̂ is unbiased. If Σ is the covariance matrix of E, then

Cov[β̂] = (x′x)−1x′Σx(x′x)−1. (12.19)

If the Ei's are uncorrelated and homoscedastic, with common Var[Ei] = σ2, then Σ = σ2In, so that Cov[β̂] = σ2(x′x)−1. In this case, the least squares estimator is the


best linear unbiased estimator (BLUE) in that it has the lowest variance among the linear unbiased estimators, where a linear estimator is one of the form LY for a constant matrix L. See Exercise 12.7.8. The strictest assumption, which adds multivariate normality, bestows the estimator with multivariate normality:

E ∼ N(0, σ2In) =⇒ β̂ ∼ N(β, σ2(x′x)−1). (12.20)

If this normality assumption does hold, then the estimator is the best unbiased estimator of β, linear or not. See Exercise 19.8.16.

On the other hand, the least squares estimator can be notoriously non-robust. Just one or a few wild values among the yi's can ruin the estimate. See Figure 12.2.

12.3.1 Standard errors and confidence intervals

For this section we will make the normal assumption that E ∼ N(0, σ2In), though much of what we say works without normality. From (12.20), we see that

Var[β̂i] = σ2[(x′x)−1]ii, (12.21)

where the last term is the ith diagonal of (x′x)−1. To estimate σ2, note that

QxY ∼ N(Qxxβ, σ2QxQx′) = N(0, σ2Qx) (12.22)

by (12.14). Exercise 12.7.4 shows that trace(Qx) = n − p, hence we can use (7.65) to show that

SSe = Y′QxY ∼ σ2χ2n−p, which implies that σ̂2 = SSe/(n − p) (12.23)

is an unbiased estimate of σ2, leading to

se(β̂i) = σ̂ √([(x′x)−1]ii). (12.24)

We have the ingredients to use Student's t for confidence intervals, but first we need the independence of β̂ and σ̂2. Exercise 12.7.6 uses calculations similar to those in (7.43) to show that β̂ and QxY are in fact independent. To summarize,

Theorem 12.1. If Y = xβ + E, E ∼ N(0, σ2In), and x′x is invertible, then β̂ and SSe are independent, with

β̂ ∼ N(β, σ2(x′x)−1) and SSe ∼ σ2χ2n−p. (12.25)

From this theorem we can derive (Exercise 12.7.7) that

(β̂i − βi)/se(β̂i) ∼ tn−p =⇒ β̂i ± tn−p,α/2 se(β̂i) (12.26)

is a 100(1 − α)% confidence interval for βi.
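Continuing the same sketch, (12.23) through (12.26) translate into a few lines of R; the 95% level is an arbitrary choice.

```r
# Standard errors and 95% t intervals, as in (12.23)-(12.26).
p <- ncol(x)
sigma2hat <- SSe / (n - p)                       # unbiased estimate of sigma^2
se <- sqrt(sigma2hat * diag(solve(t(x) %*% x)))  # se(betahat_i) in (12.24)
tcrit <- qt(0.975, df = n - p)                   # t_{n-p, alpha/2}, alpha = 0.05
cbind(estimate = drop(betahat),
      lower = drop(betahat) - tcrit * se,
      upper = drop(betahat) + tcrit * se)
```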


12.4 Bayesian estimation

We start by assuming that σ2 is known in the normal model (12.10). The conjugate prior for β is also normal:

Y | β = b ∼ N(xb, σ2In) and β ∼ N(β0, Σ0), (12.27)

where β0 and Σ0 are known. We use Bayes theorem to find the posterior distribution of β | Y = y. We have that

β̂ | β = b ∼ N(b, σ2(x′x)−1) (12.28)

for β̂ in (12.12) and (12.25). Note that this setup is the same as that for the multivariate normal mean vector in Exercise 7.8.15, where β is the M and β̂ is the Y. The only difference is that here we are using column vectors, but the basic results remain the same. In this case, the prior precision is Ω0 ≡ Σ0−1, and the conditional precision is x′x/σ2. Thus we immediately have

β | β̂ = b̂ ∼ N(β∗, (Ω0 + x′x/σ2)−1), (12.29)

where

β∗ = (Ω0 + x′x/σ2)−1(Ω0β0 + (x′x/σ2)b̂). (12.30)

If the prior variance is very large, so that the precision Ω0 ≈ 0, the posterior mean and covariance are approximately the least squares estimate and its covariance:

β | β̂ = b̂ ≈ N(b̂, σ2(x′x)−1). (12.31)

For less vague priors, one may specialize to β0 = 0, with the precision proportional to Ip. For convenience take Ω0 = (κ/σ2)Ip for some κ, so that κ indicates the relative precision of the prior to that of one observation (Ei). The posterior then resolves to

β | β̂ = b̂ ∼ N((κIp + x′x)−1x′x b̂, σ2(κIp + x′x)−1). (12.32)

This posterior mean is the ridge regression estimator of β,

β̂κ = (x′x + κIp)−1x′y, (12.33)

which we will see in the next section.

If σ2 is not known, then we can use the prior used for the normal mean in Section 11.6.1. Using the precision ω = 1/σ2,

β̂ | β = b, Ω = ω ∼ N(b, (1/ω)(x′x)−1), (12.34)

where the prior is given by

β | Ω = ω ∼ N(β0, (1/ω)K0−1) and Ω ∼ Gamma(ν0/2, λ0/2). (12.35)

Here, K0 is an invertible symmetric p × p matrix, and ν0 and λ0 are positive. It is not too hard to see that

E[β] = β0, Cov[β] = (λ0/(ν0 − 2)) K0−1, and E[1/Ω] = λ0/(ν0 − 2), (12.36)


similar to (11.59). The last two equations need ν0 > 2.

Analogous to (11.57) and (11.58), the posterior has the same form, but updating the parameters as

β0 → β∗ ≡ (K0 + x′x)−1(K0β0 + (x′x)β̂), K0 → K0 + x′x, ν0 → ν0 + n, (12.37)

and

λ0 → λ∗ ≡ λ0 + SSe + (β̂ − β0)′(K0−1 + (x′x)−1)−1(β̂ − β0). (12.38)

See Exercise 12.7.12. If the prior parameters β0, λ0, ν0 and K0 are all close to zero, then the posterior mean and covariance matrix of β have approximations

E[β | Y = y] = β∗ ≈ β̂ (12.39)

and

Cov[β | Y = y] = (λ∗/(ν0 + n − 2)) (K0 + x′x)−1 ≈ (SSe/(n − 2)) (x′x)−1, (12.40)

close to the frequentist estimates.

The marginal distribution of β under the prior or posterior is a multivariate Student's t, which was introduced in Exercise 7.8.17. If Z ∼ Np(0, Ip) and U ∼ Gamma(ν/2, 1/2), then

T ≡ (1/√(U/ν)) Z ∼ tp,ν (12.41)

is a standard p-variate Student's t on ν degrees of freedom. With the parameters in (12.37) and (12.38), it can be shown that a posteriori

T = (1/√(λ∗/(ν0 + n))) (K0 + x′x)1/2 (β − β∗) ∼ tp,ν0+n. (12.42)
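As a small illustration of (12.29) and (12.30), the following R sketch computes the posterior mean and covariance for the simulated data used earlier, with a vague, arbitrarily chosen normal prior and σ2 treated as known.

```r
# Posterior of beta under the normal prior (12.27), sigma^2 known, as in (12.29)-(12.30).
# The prior parameters below are arbitrary choices for illustration.
beta0  <- rep(0, p)
Omega0 <- diag(p) / 100                 # prior precision (i.e., Sigma0 = 100 I_p)
prec   <- t(x) %*% x / sigma^2          # conditional precision x'x / sigma^2
post_cov  <- solve(Omega0 + prec)
post_mean <- post_cov %*% (Omega0 %*% beta0 + prec %*% betahat)
```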

12.5 Regularization

Often regression estimates are used for prediction. Instead of being primarily interested in estimating the values of β, one is interested in how the estimates can be used to predict new yi's from new x-vectors. For example, we may have data on the progress of diabetes in a number of patients, along with a variety of their health and demographic variables (age, sex, BMI, etc.). Based on these observations, we would then like to predict the progress of diabetes for a number of new patients for whom we know the predictors.

Suppose β̂ is an estimator based on observing Y = xβ + E, and a new set of observations is contemplated that follows the same model, i.e., YNew = zβ + ENew, where z contains the predictor variables for the new observations. We know the z. We do not observe the YNew, but would like to estimate it. The natural estimator would then be ŶNew = zβ̂.

12.5.1 Ridge regression

When there are many possible predictors, it may be that leaving some of them outof the equation can improve the prediction, since the variance in their estimationoverwhelms whatever predictive power they have. Or it may be that the prediction


can be improved by shrinking the estimates somewhat. A systematic approach to such shrinking is to add a regularization term to the objective function. The ridge regression term is a penalty based on the squared length of the parameter vector:

objκ(b ; y) = ‖y − xb‖2 + κ‖b‖2. (12.43)

The κ ≥ 0 is a tuning parameter, indicating how much weight to give to the penalty. As long as κ > 0, the minimizing b in (12.43) would tend to be closer to zero than the least squares estimate. The larger κ, the more the estimate would be shrunk.

There are two questions: How to find the optimal b given the κ, and how to choose the κ. For given κ, we can use a trick to find the estimator. For b being p × 1, write

objκ(b ; y) = ‖y − xb‖2 + ‖0p − (√κ Ip)b‖2 = ‖ (y′, 0p′)′ − (x′, √κ Ip)′ b ‖2. (12.44)

This objective function looks like the least squares criterion, where we have added p observations, all with y-value of zero, and the ith one has x-vector with all zeros except √κ for the ith predictor. Thus the minimizer is the ridge estimator, which can be shown (Exercise 12.7.13) to be

β̂κ = (x′x + κIp)−1x′Y. (12.45)

Notice that this estimator appeared as a posterior mean in (12.33).

This estimator was originally proposed by Hoerl and Kennard (1970) as a method to ameliorate the effects of multicollinearity in the x's. Recall that the covariance matrix of the least squares estimator is σ2(x′x)−1. If the x's are highly correlated among themselves, then some of the diagonals of (x′x)−1 are likely to be very large, hence adding a small positive number to the diagonals of x′x can drastically reduce the variances, without increasing bias too much. See Exercise 12.7.17.
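The augmented-data trick in (12.44) is easy to try numerically. The R sketch below, continuing the earlier simulated example with an arbitrary κ, fits least squares to the augmented data and checks that the result matches (12.45); note that, as in (12.43), every coefficient, including the intercept, is penalized here.

```r
# Ridge regression via the augmented-data trick (12.44), for one arbitrary kappa.
kappa <- 2
x_aug <- rbind(x, sqrt(kappa) * diag(p))     # p pseudo-observations
y_aug <- c(Y, rep(0, p))                     # each with y-value zero
ridge_aug    <- lm.fit(x_aug, y_aug)$coefficients
ridge_direct <- drop(solve(t(x) %*% x + kappa * diag(p), t(x) %*% Y))  # (12.45)
max(abs(ridge_aug - ridge_direct))           # essentially zero
```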

One way to choose the κ is to try to estimate the effectiveness of the prediction for various values of κ. Imagine n new observations that have the same predictor values as the data, but whose YNew is unobserved and independent of the data Y. That is, we assume

Y = xβ + E and YNew = xβ + ENew, (12.46)

where E and ENew are independent,

E[E] = E[ENew] = 0n, and Cov[E] = Cov[ENew] = σ2In. (12.47)

It is perfectly reasonable to take the predictors of the new variables to be different from those for the observed data, but the formulas are a bit simpler with the same x.

Our goal is to estimate how well the prediction based on Y predicts the YNew. We would like to look at the prediction error, but since we do not observe the new data, we will assess the expected value of the sum of squares of prediction errors:

ESSpred,κ = E[‖YNew − xβ̂κ‖2]. (12.48)

We do observe the data, so a first guess at estimating the prediction error is the observed error,

SSe,κ = ‖Y − xβ̂κ‖2. (12.49)


The observed error should be an underestimate of the prediction error, since we chose the estimate of β specifically to fit the observed Y. How much of an underestimate? The following lemma helps to find ESSpred,κ and ESSe,κ = E[SSe,κ]. Its proof is in Exercise 12.7.14.

Lemma 12.2. If W has finite mean and variance,

E[‖W‖2] = ‖E[W]‖2 + trace(Cov[W]). (12.50)

We apply the lemma with W = YNew − xβ̂κ, and with W = Y − xβ̂κ. Since E[Y] = E[YNew], the expected value parts of ESSpred,κ and ESSe,κ are equal, so we do not have to do anything further on them. The covariances are different. Write

xβ̂κ = PκY, where Pκ = x(x′x + κIp)−1x′. (12.51)

Then for the prediction error, since YNew and Y are independent,

Cov[YNew − xβ̂κ] = Cov[YNew − PκY] = Cov[YNew] + PκCov[Y]Pκ = σ2(In + Pκ2). (12.52)

For the observed error,

Cov[Y − PκY] = Cov[(In − Pκ)Y] = (In − Pκ)Cov[Y](In − Pκ) = σ2(In − Pκ)2 = σ2(In + Pκ2 − 2Pκ). (12.53)

Thus for the covariance parts, the observed error has that extra −2Pκ term, so that

ESSpred,κ − ESSe,κ = 2σ2 trace(Pκ). (12.54)

For given κ, trace(Pκ) can be calculated. We can use the usual unbiased estimator for σ2 in (12.23) to obtain an unbiased estimator of the prediction error:

ÊSSpred,κ = SSe,κ + 2σ̂2 trace(Pκ). (12.55)

Exercise 12.7.16 presents an efficient formula for this estimate. It is then reasonable to use the estimates based on the κ that minimizes the estimated prediction error in (12.55).
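For a small data set, (12.55) can be evaluated by brute force over a grid of κ values, as in the following R sketch based on the earlier simulated data; Exercise 12.7.16 gives a much faster formula based on the singular value decomposition.

```r
# Estimated prediction error (12.55) over a grid of kappa values (brute force).
kappas <- seq(0, 25, by = 0.1)
ess_pred <- sapply(kappas, function(k) {
  Pk <- x %*% solve(t(x) %*% x + k * diag(p)) %*% t(x)
  sum((Y - Pk %*% Y)^2) + 2 * sigma2hat * sum(diag(Pk))
})
kappas[which.min(ess_pred)]    # kappa minimizing the estimated prediction error
```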

12.5.2 Hurricanes

Jung, Shavitt, Viswanathan, and Hilbe (2014) collected data on the most dangerous hurricanes in the US since 1950. The data here are primarily taken from that article, but the maximum wind speed was added, and the cost of damage was updated to 2014 equivalencies (in millions of dollars). Also, we added two outliers, Katrina and Audrey, which had been left out. We are interested in predicting the number of deaths caused by the hurricane based on five variables: minimum air pressure, category, damage, wind speed, and gender of the hurricane's name. We took logs of the dependent variable (actually, log(deaths + 1)) and the damage variable.


Figure 12.1: Estimated prediction error as a function of κ for ridge regression in the hurricane data.

In ridge regression, the κ is added to the diagonals of the x′x matrix, which means that the effect of the ridge is stronger on predictors that have smaller sums of squares. In particular, the units in which the variables are measured have an effect on the results. To deal with this issue, we normalize all the predictors so that they have mean zero and variance 1. We also subtract the mean from the Y variable, so that we do not have to worry about an intercept.

Figure 12.1 graphs the estimated prediction error versus κ. Such graphs typically have a fairly sharp negative slope for small values of κ, then level off and begin to increase in κ. We searched over κ's at intervals of 0.1. The best we found was κ = 5.9. The estimated prediction error for that κ is 112.19. The least squares estimate (κ = 0) has an estimated prediction error of 113.83, so the best ridge estimate is a bit better than least squares.

             Slope              SE                 t
             LS       Ridge     LS       Ridge     LS        Ridge
Pressure   −0.662    −0.523    0.255    0.176    −2.597    −2.970
Category   −0.498    −0.199    0.466    0.174    −1.070    −1.143
Damage      0.868     0.806    0.166    0.140     5.214     5.771
Wind        0.084    −0.046    0.420    0.172     0.199    −0.268
Gender      0.035     0.034    0.113    0.106     0.313     0.324
                                                          (12.56)

Table (12.56) contains the estimated coefficients using least squares and ridge with the optimal κ. The first four predictors are related to the severity of the storm, so are highly intercorrelated. Gender is basically orthogonal to the others. Ridge regression tends to affect intercorrelated variables most, which we see here. The category and wind estimates are cut in half. Pressure and damage are reduced, but not as much. Gender is


hardly shrunk at all. The standard errors tend to be similarly reduced, leading to t statistics that have increased a bit. (See Exercise 12.7.15 for the standard errors.)

12.5.3 Subset selection: Mallows’ Cp

In ridge regression, it is certainly possible to use different κ's for different variables, so that the regularization term in the objective function in (12.43) would be ∑ κi bi2. An even more drastic proposal would be to have each such κi be either 0 or ∞, that is, each parameter would either be left alone or shrunk all the way to 0. That is a convoluted way of saying that we wish to use a subset of the predictors in the model. The main challenge is that if there are p predictors, then there are 2^p possible subsets. Fortunately, there are efficient algorithms to search through the subsets, such as the leaps algorithm in R. See Lumley (2009).

Denote the matrix of a given subset of p∗ of the predictors by x∗, so that the model for this subset is

Y = x∗β∗ + E, E[E] = 0, Cov[E] = σ2In, (12.57)

where β∗ is then p∗ × 1. We can find the usual least squares estimate of β∗ as in (12.12), but with x∗ in place of x. To decide which subset to choose, or at least which are reasonable subsets to consider, we can again estimate the prediction sum of squares as in (12.55) for ridge regression. Calculations similar to those in (12.52) to (12.55) show that

ÊSS∗pred = SS∗e + 2σ̂2 trace(Px∗), (12.58)

where

SS∗e = ‖Y − x∗β̂∗‖2 and Px∗ = x∗(x∗′x∗)−1x∗′. (12.59)

The σ̂2 is the estimate in (12.23) based on all the predictors. Exercise 12.7.4 shows that trace(Px∗) = p∗, the number of predictors. The resulting estimate of the prediction error is

ÊSS∗pred = SS∗e + 2p∗σ̂2, (12.60)

which is equivalent to Mallows' Cp (Mallows, 1973), given by

Cp(x∗) = ÊSS∗pred/σ̂2 − n = SS∗e/σ̂2 − n + 2p∗. (12.61)

Back to the hurricane example, (12.62) has the estimated prediction errors for the ten best subsets. Each row denotes a subset, where the column under the variable's name indicates whether the variable is in that subset, 1 = yes, 0 = no.

Pressure  Category  Damage  Wind  Gender    SS∗e    p∗   ÊSS∗pred
   1         1        1      0      0      102.37    3    109.34
   1         0        1      1      0      103.63    3    110.59
   1         0        1      0      0      106.24    2    110.88
   1         1        1      0      1      102.26    4    111.55
   1         1        1      1      0      102.33    4    111.62
   0         0        1      0      0      110.17    1    112.49
   1         0        1      1      1      103.54    4    112.83
   1         0        1      0      1      106.18    3    113.15
   1         1        1      1      1      102.21    5    113.83
   0         0        1      1      0      110.10    2    114.74
                                                         (12.62)


We can see that the damage variable is in all of the top 10 models, and pressure is in most of them. The other variables are each in 4 or 5 of them. The best model has pressure, category, and damage. The estimated prediction error for that model is 109.34, which is somewhat better than the best for ridge regression, 112.19. (It may not be a totally fair comparison, since the best ridge regression is found by a one-dimensional search over κ, while the subset regression is a discrete search.) See (12.69) for the estimated slopes in this model.

12.5.4 Lasso

Lasso is a technique similar to ridge regression, but features both shrinkage and subset selection all at once. It uses the sum of absolute values of the slopes as the regularization term. The objective function is

objλ(b ; y) = ‖y − xb‖2 + λ ∑ |bi| (12.63)

for some λ ≥ 0. There is no closed-form solution for the minimizer of that objective function (unless λ = 0), but convex programming techniques can be used. Efron, Hastie, Johnstone, and Tibshirani (2004) presents an efficient method to find the minimizers for all values of λ, implemented in the R package lars (Hastie and Efron, 2013). Hastie, Tibshirani, and Friedman (2009) contains an excellent treatment of lasso and other regularization procedures.

The solution in the simple p = 1 case gives some insight into what lasso is doing. See Exercise 12.7.18. The model is Yi = βxi + Ei. As in (12.17), but with just one predictor, we can write the objective function as

objλ(b ; y) = SSe + (b − β̂)2 ∑ xi2 + λ|b|, (12.64)

where β̂ is the least squares estimate. Thus the minimizing b also minimizes

h(b) = (b − β̂)2 + λ∗|b|, where λ∗ = λ/∑ xi2. (12.65)

The function h(b) is strictly convex, and goes to infinity if |b| does, hence there is a unique minimum. If there is a b for which h′(b) = 0, then by convexity that b must be the minimizer. On the other hand, if there is no solution to h′(b) = 0, then since the minimizer cannot be at a point with a nonzero derivative, it must be where the derivative doesn't exist, which is at b = 0.

Now h′(b) = 0 implies that

b = β̂ − (λ∗/2) Sign(b). (12.66)

(Sign(b) = −1, 0, or 1 as b < 0, b = 0, or b > 0.) Exercise 12.7.18 shows that there is such a solution if and only if |β̂| ≥ λ∗/2, in which case the solution has the same sign as β̂. Hence

b = 0 if |β̂| < λ∗/2, and b = β̂ − (λ∗/2) Sign(β̂) if |β̂| ≥ λ∗/2. (12.67)

Thus the lasso estimator starts with β̂, then shrinks it towards 0 by the amount λ∗/2, stopping at 0 if necessary. For p > 1, lasso generally shrinks all the least squares slopes, some of them (possibly) all the way to 0, but not in an obvious way.
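The soft-thresholding rule (12.67) is one line of code. The R sketch below applies it to a few made-up values of the least squares estimate with λ∗ = 1.

```r
# Soft-thresholding: the p = 1 lasso solution (12.67).
soft_threshold <- function(betahat_ls, lambda_star) {
  ifelse(abs(betahat_ls) < lambda_star / 2, 0,
         betahat_ls - (lambda_star / 2) * sign(betahat_ls))
}
soft_threshold(c(-2, -0.3, 0.1, 1.5), lambda_star = 1)  # -1.5  0.0  0.0  1.0
```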


As for ridge, we would like to use the best λ. Although there is not a simple analytic form for estimating the prediction error, Efron et al. (2004) suggests that the estimate (12.60) used in subset regression is a reasonable approximation:

ÊSSλ,pred = SSe,λ + 2pλσ̂2, (12.68)

where pλ is the number of non-zero slopes in the solution. For the hurricane data, the best pλ = 3, with SSe,λ = 103.88, which leads to ÊSSλ,pred = 110.85. (The corresponding λ = 0.3105.)

Table (12.69) exhibits the estimated coefficients. Notice that lasso leaves out the two variables that the best subset regression does, and shrinks the remaining three. The damage coefficient is not shrunk very much, the category coefficient is cut by 2/3, similar to ridge, and pressure is shrunk by 1/3, versus about 1/5 for ridge. So indeed, lasso here combines ridge and subset regression. If asked, I'd pick either the lasso or subset regression as the best of these.

             Least squares    Ridge      Subset     Lasso
Pressure        −0.6619      −0.5229    −0.6575    −0.4269
Category        −0.4983      −0.1989    −0.4143    −0.1651
Damage           0.8680       0.8060     0.8731     0.8481
Wind             0.0838      −0.0460     0          0
Gender           0.0353       0.0342     0          0
SSe             102.21       103.25     102.37     103.88
ÊSSpred         113.83       112.19     109.34     110.85
                                                  (12.69)

We note that there does not appear to be a standard approach to finding standard errors in lasso regression. Bootstrapping is a possibility, and Kyung, Gill, Ghosh, and Casella (2010) has a solution for Bayesian lasso, which closely approximates frequentist lasso.

12.6 Least absolute deviations

The least squares objective function minimizes the sum of squares of the residuals. As mentioned before, it is sensitive to values far from the center. M-estimators (Huber and Ronchetti (2011)) were developed as more robust alternatives, but ones that still provide reasonably efficient estimators. An M-estimator chooses m to minimize ∑ ρ(xi, m) for some function ρ measuring the distance between the xi's and m. Special cases of M-estimators include those using ρ(xi, m) = −log(f(xi − m)) for some pdf f, which leads to the maximum likelihood estimates (see Section 13.6) for the location family with density f, and those based on Lq objective functions. The latter choose b to minimize ∑ |yi − xib|q, where xi is the ith row of x. The least squares criterion is L2, and the least absolute deviations criterion is L1:

obj1(b ; y) = ∑ |yi − xib|. (12.70)

We saw above that for a sample y1, . . . , yn, the sample mean is the m that minimizes the sum of squares ∑(yi − m)2. Similarly, the sample median is the m that minimizes ∑ |yi − m|. (See Exercise 2.7.25 for the population version of this result.) Thus minimizing (12.70) is called median regression. There is no closed-form solution to finding the optimal b, but standard linear programming algorithms work efficiently.


Figure 12.2: Estimated regression lines for deaths versus damage in the hurricane data. The lines were calculated using least squares and least absolute deviations, with and without the outlier, Katrina.

We will use the R package quantreg by Koenker, Portnoy, Ng, Zeileis, Grosjean, and Ripley (2015).

To illustrate, we turn to the hurricane data, but take Y to be the number of deaths(not the log thereof), and the single x to be the damage in billions of dollars. Figure12.2 plots the data. Notice there is one large outlier in the upper right, Katrina. Theregular least squares line is the steepest, which is affected by the outlier. Redoing theleast squares fit without the outlier changes the slope substantially, going from about7.7 to 1.6. The least absolute deviations fit with all the data is very close to the leastsquares fit without the outlier. Removing the outlier changes this slope as well, butnot by as much, 2.1 to 0.8. Thus least absolute deviations is much less sensitive tooutliers than least squares.

The standard errors of the estimators of β are not obvious. Bassett and Koenker (1978) finds the asymptotic distribution under reasonable conditions. We won't prove, or use, the result, but present it because it has an interesting connection to the asymptotic distribution of the sample median found in (9.31). We assume we have a sequence of independent vectors (Yi, xi), where the xi's are fixed, that follows the model (12.8), Yi = xiβ + Ei. The Ei are assumed to be iid with continuous distribution F that has median 0 and a continuous and positive pdf f(y) at y = 0. From (9.31), we have that

√n Median(E1, . . . , En) −→D N(0, 1/(4 f(0)2)). (12.71)

We also need the sequence of xi's to behave. Specifically, let x(n) be the matrix with rows x1, . . . , xn, and assume that

(1/n) x(n)′x(n) −→ D, (12.72)


where D is an invertible p × p matrix. Then if β̂n is the unique minimizer of the objective function in (12.70) based on the first n observations,

√n(β̂n − β) −→D N(0, (1/(4 f(0)2)) D−1). (12.73)

Thus we can estimate the standard errors of the β̂i's as we did for least squares in (12.24), but using 1/(2 f(0)) in place of σ̂. But if we are not willing to assume we know the density f, then estimating this value can be difficult. There are other approaches, some conveniently available in the quantreg package. We'll use the bootstrap.

There are two popular methods for bootstrapping in regression. One considers the data to be iid (p + 1)-vectors (Yi, Xi), i = 1, . . . , n, which implies that the Yi's and Xi's have a joint distribution. A bootstrap sample involves choosing n of the vectors (yi, xi) with replacement, and finding the estimated coefficients for the bootstrap sample. This process is repeated a number of times, and the standard deviations of the resulting sets of coefficients become the bootstrap estimates of their standard errors. In this case, the estimated standard errors are estimating the unconditional standard errors, rather than the standard errors conditioning on X = x.

The other method is to fit the model to the data, then break each observation into the fit (ŷi = xiβ̂) and residual (êi = yi − ŷi). A bootstrap sample starts by first choosing n of the estimated residuals with replacement. Call these values e∗1, . . . , e∗n. Then the bootstrapped values of the dependent variable are y∗i = ŷi + e∗i, i = 1, . . . , n. That is, each bootstrapped observation has its own fit, but adds a randomly chosen residual. Then many such bootstrap samples are taken, and the standard errors are estimated as before. This process more closely mimics the conditional model we started with, but the estimated residuals that are bootstrapped are not quite iid as usually assumed in bootstrapping.
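The first (pairs) bootstrap method is sketched below in R for a median regression with a single predictor, using the quantreg function rq; the data frame dat, with columns y and x, is a stand-in for data such as the hurricane deaths and damage.

```r
# Pairs bootstrap for the slope in a median regression, using quantreg's rq.
# 'dat' is a hypothetical data frame with columns y and x.
library(quantreg)
set.seed(1)
B <- 1000
boot_slopes <- replicate(B, {
  idx <- sample(nrow(dat), replace = TRUE)     # resample (y_i, x_i) pairs
  coef(rq(y ~ x, tau = 0.5, data = dat[idx, ]))[2]
})
sd(boot_slopes)                                # bootstrap estimate of se(betahat_1)
```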

The table (12.74) uses the first bootstrap option to estimate the standard errors in the least absolute deviations regressions. Notice that as for the coefficients' estimates, the standard errors and t-statistics are much more affected by the outlier in least squares than in least absolute deviations.

                                          Estimate   Std. Error   t value
Least squares                              7.743       1.054       7.347
Least squares w/o outlier                  1.592       0.438       3.637
Least absolute deviations                  2.093       1.185       1.767
Least absolute deviations w/o outlier      0.803       0.852       0.943
                                                                  (12.74)

The other outlier, Audrey, does not have much of an effect on the estimates. Also, abetter analysis uses logs of the variables, as above in (12.69). In that case, the outliersdo not show up. Finally, we note that, not surprisingly, regularization is useful inleast absolute deviation regression, though the theory is not as well developed as forleast squares. Lasso is an option in quantreg.

There are many other methods for robustly estimating regression coefficients. Venables and Ripley (2002), pages 156–163, gives a practical introduction to some of them.

12.7 Exercises

Exercise 12.7.1. Show that for a sample z1, . . . , zn, the quantity ∑_{i=1}^n (zi − m)2 is minimized over m by m = z̄.


Exercise 12.7.2. Consider the simple linear regression model as in (12.3) through (12.5), where Y = xβ + E and E ∼ N(0, σ2In) as in (12.10). Assume n ≥ 3. (a) Show that

x′x = ( n       ∑ xi
        ∑ xi    ∑ xi2 ). (12.75)

(b) Show that |x′x| = n∑(xi − x̄)2. (c) Show that x′x is invertible if and only if the xi's are not all equal, and if it is invertible, that

(x′x)−1 = ( 1/n + x̄2/∑(xi − x̄)2     −x̄/∑(xi − x̄)2
            −x̄/∑(xi − x̄)2           1/∑(xi − x̄)2 ). (12.76)

(d) Consider the mean of Y for a given fixed value x0 of the explanatory variable. That is, let θ = β0 + β1x0. The estimate is θ̂ = β̂0 + β̂1x0, where β̂0 and β̂1 are the least squares estimates. Find the 2 × 1 vector c such that

θ̂ = c′(β̂0, β̂1)′. (12.77)

(e) Show that

Var[θ̂] = σ2 (1/n + (x0 − x̄)2/∑(xi − x̄)2). (12.78)

(f) A 95% confidence interval is θ̂ ± t se(θ̂), where the standard error uses the unbiased estimate of σ2. What is the constant t?

Exercise 12.7.3. Suppose (X1, Y1), . . . , (Xn, Yn) are iid pairs, with

(Xi, Yi)′ ∼ N( (µX, µY)′, ( σX2    σXY
                            σXY    σY2 ) ), (12.79)

where σX2 > 0 and σY2 > 0. Then the Yi's conditional on the Xi's have a simple linear regression model:

Y | X = x ∼ N(β01n + β1x, σ2In), (12.80)

where X = (X1, . . . , Xn)′ and Y = (Y1, . . . , Yn)′. Let ρ = σXY/(σXσY) be the population correlation coefficient. The Pearson sample correlation coefficient is defined by

r = ∑(xi − x̄)(yi − ȳ) / √(∑(xi − x̄)2 ∑(yi − ȳ)2) = sXY/(sXsY), (12.81)

where sXY is the sample covariance, ∑(xi − x̄)(yi − ȳ)/n, and sX2 and sY2 are the sample variances of the xi's and yi's, respectively. (a) Show that

β1 = ρ σY/σX, β̂1 = r sY/sX, and SSe = n sY2(1 − r2), (12.82)

for SSe as in (12.23). [Hint: For the SSe result, show that SSe = ∑((yi − ȳ) − β̂1(xi − x̄))2, then expand and simplify.] (b) Consider the Student's t-statistic for β1 in (12.26). Show that when ρ = 0, conditional on X = x, we have

T = (β̂1 − β1)/se(β̂1) = √(n − 2) r/√(1 − r2) ∼ tn−2. (12.83)


(c) Argue that the distribution of T in part (b) is unconditionally tn−2 when ρ = 0, so that we can easily perform a test that the correlation is 0 based directly on r.

Exercise 12.7.4. Assume that x′x is invertible, where x is n× p. Take Px = x(x′x)−1x′,and Qx = In − Px as in (12.14). (a) Show that Px is symmetric and idempotent. (b)Show that Qx is also symmetric and idempotent. (c) Show that Pxx = x and Qxx = 0.(d) Show that trace(Px) = p and trace(Qx) = n− p.

Exercise 12.7.5. Verify that (12.13) through (12.16) lead to (12.17).

Exercise 12.7.6. Suppose x′x is invertible and E ∼ N(0, σ2In). (a) Show that the fit Ŷ = PxY ∼ N(xβ, σ2Px). (See (12.14).) (b) Show that Ŷ and the residuals Ê = QxY are independent. [Hint: What is QxPx?] (c) Show that β̂ and Ê are independent. [Hint: Show that β̂ is a function of just Ŷ.] (d) We are assuming the Ei's are independent. Are the Êi's independent?

Exercise 12.7.7. Assume that E ∼ N(0, σ2In) and that x′x is invertible. Show that (12.23) through (12.25) imply that (β̂i − βi)/se(β̂i) is distributed tn−p. [Hint: See (7.78) through (7.80).]

Exercise 12.7.8. A linear estimator of β is one of the form β∗ = LY, where L is a p × n known matrix. Assume that x′x is invertible. Then the least squares estimator β̂ is linear, with L0 = (x′x)−1x′. (a) Show that β∗ is unbiased if and only if Lx = Ip. (Does L0x = Ip?)

Next we wish to prove the Gauss-Markov theorem, which states that if β∗ = LY is unbiased, then

Cov[β∗] − Cov[β̂] ≡ M is nonnegative definite. (12.84)

For the rest of this exercise, assume that β∗ is unbiased. (b) Write L = L0 + (L − L0), and show that

Cov[β∗] = Cov[(L0 + (L − L0))Y] = Cov[L0Y] + Cov[(L − L0)Y] + σ2L0(L − L0)′ + σ2(L − L0)L0′. (12.85)

(c) Use part (a) to show that L0(L − L0)′ = (L − L0)L0′ = 0. (d) Conclude that (12.84) holds with M = Cov[(L − L0)Y]. Why is M nonnegative definite? (e) The importance of this conclusion is that the least squares estimator is BLUE: best linear unbiased estimator. Show that (12.84) implies that for any p × 1 vector c, Var[c′β∗] ≥ Var[c′β̂], and in particular, Var[βi∗] ≥ Var[β̂i] for any i.

Exercise 12.7.9. (This exercise is used in subsequent ones.) Given the n × p matrix x with p ≤ n, let the spectral decomposition of x′x be ΨΛΨ′, so that Ψ is a p × p orthogonal matrix, and Λ is diagonal with diagonal elements λ1 ≥ λ2 ≥ · · · ≥ λp ≥ 0. (See Theorem 7.3 on page 105.) Let r be the number of positive λi's, so that λi > 0 if i ≤ r and λi = 0 if i > r, and let ∆ be the r × r diagonal matrix with diagonal elements √λ1, . . . , √λr. (a) Set z = xΨ, and partition z = (z1, z2), where z1 is n × r and z2 is n × (p − r). Show that

z′z = ( z1′z1   z1′z2 ) = ( ∆2   0 )
      ( z2′z1   z2′z2 )   ( 0    0 ),  (12.86)


hence z2 = 0. (b) Let Γ1 = z1∆−1. Show that Γ1′Γ1 = Ir, hence the columns of Γ1 are orthogonal. (c) Now with z = Γ1(∆, 0), show that

x = Γ1 ( ∆   0 ) Ψ′. (12.87)

(d) Since the columns of the n × r matrix Γ1 are orthogonal, we can find an n × (n − r) matrix Γ2 such that Γ = (Γ1, Γ2) is an n × n orthogonal matrix. (You don't have to prove that, but you are welcome to.) Show that

x = Γ ( ∆   0 ) Ψ′, (12.88)
      ( 0   0 )

where the middle matrix is n × p. (This formula is the singular value decomposition of x. It says that for any n × p matrix x, we can write (12.88), where Γ (n × n) and Ψ (p × p) are orthogonal, and ∆ (r × r) is diagonal with diagonal elements δ1 ≥ δ2 ≥ · · · ≥ δr > 0. This exercise assumed n ≥ p, but the n < p case follows by transposing the formula and switching Γ and Ψ.)

Exercise 12.7.10. Here we assume the matrix x has singular value decomposition (12.88). (a) Suppose x is n × n and invertible, so that the ∆ is n × n. Show that

x−1 = Ψ∆−1Γ′. (12.89)

(b) Now let x be n × p. When x is not invertible, we can use the Moore-Penrose inverse, which we saw in (7.56) for symmetric matrices. Here, it is defined to be the p × n matrix

x+ = Ψ ( ∆−1   0 ) Γ′. (12.90)
       ( 0     0 )

Show that xx+x = x. (c) If x′x is invertible, then ∆ is p × p. Show that in this case, x+ = (x′x)−1x′. (d) Let Px = xx+ and Qx = In − Px. Show that as in (12.15), Pxx = x and Qxx = 0.

Exercise 12.7.11. Consider the model Y = xβ + E where x′x may not be invertible. Let β̂+ = x+y, where x+ is given in Exercise 12.7.10. Follow steps similar to (12.13) through (12.17) to show that β̂+ is a least squares estimate of β.

Exercise 12.7.12. Let Y | β = b, Ω = ω ∼ N(xb, (1/ω)In), and consider the prior β | Ω = ω ∼ N(β0, K0−1/ω) and Ω ∼ Gamma(ν0/2, λ0/2). (a) Show that the conditional pdf can be written

f(y | b, ω) = c ωn/2 exp(−(ω/2)((β̂ − b)′x′x(β̂ − b) + SSe)). (12.91)

[See (12.13) through (12.18).] (b) Show that the prior density is

π(b, ω) = d ω(ν0+p)/2−1 exp(−(ω/2)((b − β0)′K0(b − β0) + λ0)). (12.92)

(c) Multiply the conditional and prior densities. Show that the power of ω is now (ν0 + n + p)/2 − 1. (d) Show that

(β̂ − b)′x′x(β̂ − b) + (b − β0)′K0(b − β0) = (b − β∗)′(K0 + x′x)(b − β∗) + (β̂ − β0)′(K0−1 + (x′x)−1)−1(β̂ − β0), (12.93)


for β∗ in (12.37). [Use (7.116).] (e) Show that the posterior density then has the same form as the prior density with respect to b and ω, where the prior parameters are updated as in (12.37) and (12.38).

Exercise 12.7.13. Apply the formula for the least squares estimate to the task of finding the b to minimize

‖ (y′, 0p′)′ − (x′, √κ Ip)′ b ‖2. (12.94)

Show that the minimizer is indeed the ridge estimator in (12.45), (x′x + κIp)−1x′y.

Exercise 12.7.14. Suppose W is p × 1, and each Wi has finite mean and variance. (a) Show that E[Wi2] = E[Wi]2 + Var[Wi]. (b) Show that summing the individual terms yields E[‖W‖2] = ‖E[W]‖2 + trace(Cov[W]), which is Lemma 12.2.

Exercise 12.7.15. For β̂κ = (x′x + κIp)−1x′Y as in (12.45), show that

Cov[β̂κ] = σ2(x′x + κIp)−1x′x(x′x + κIp)−1. (12.95)

(We can estimate σ2 using the σ̂2 as in (12.23) that we obtained from the least squares estimate.)

Exercise 12.7.16. This exercise provides a simple formula for the prediction error estimate in ridge regression for various values of κ. We will assume that x′x is invertible. (a) Show that the invertibility implies that the singular value decomposition (12.88) can be written

x = Γ ( ∆ ) Ψ′, (12.96)
      ( 0 )

i.e., the column of 0's in the middle matrix is gone. (b) Let êκ be the estimated errors for ridge regression with given κ. Show that we can write

êκ = (In − x(x′x + κIp)−1x′)y = Γ( In − (∆ 0)′(∆2 + κIp)−1(∆ 0) )Γ′y. (12.97)

(c) Let g = Γ′y. Show that

SSe,κ = ‖êκ‖2 = ∑_{i=1}^p gi2 (κ/(δi2 + κ))2 + ∑_{i=p+1}^n gi2. (12.98)

Also, note that since the least squares estimate takes κ = 0, its sum of squared errors is SSe,0 = ∑_{i=p+1}^n gi2. (d) Take Pκ = x(x′x + κIp)−1x′ as in (12.51). Show that

trace(Pκ) = ∑ δi2/(δi2 + κ). (12.99)

(e) Put parts (c) and (d) together to show that in (12.55), we have that

ÊSSpred,κ = SSe,κ + 2σ̂2 trace(Pκ) = SSe,0 + ∑_{i=1}^p gi2 (κ/(δi2 + κ))2 + 2σ̂2 ∑ δi2/(δi2 + κ). (12.100)

Hence once we find the gi's and δi's, it is easy to calculate (12.100) for many values of κ.


Exercise 12.7.17. This exercise looks at the bias and variance of the ridge regression estimator. The ridge estimator for tuning parameter κ is β̂κ = (x′x + κIp)−1x′Y as in (12.45). Assume that x′x is invertible, E[E] = 0, and Cov[E] = σ2In. The singular value decomposition of x in this case is given in (12.96). (a) Show that

E[β̂κ] = (x′x + κIp)−1x′xβ = Ψ(∆2 + κIp)−1∆2Ψ′β. (12.101)

(b) Show that the bias of the ridge estimator can be written

Biasκ = −Ψ κ(∆2 + κIp)−1γ, where γ = Ψ′β. (12.102)

Also, show that the squared norm of the bias is

‖Biasκ‖2 = ∑ γi2 κ2/(δi2 + κ)2. (12.103)

(c) Use (12.95) to show that the covariance matrix of β̂κ can be written

Cov[β̂κ] = σ2Ψ∆2(∆2 + κIp)−2Ψ′. (12.104)

Also, show that

trace(Cov[β̂κ]) = σ2 ∑ δi2/(δi2 + κ)2. (12.105)

(d) The total expected mean square error of the estimator β̂κ is defined to be MSEκ = E[‖β̂κ − β‖2]. Use Lemma 12.2 with W = β̂κ − β to show that

MSEκ = ‖Biasκ‖2 + trace(Cov[β̂κ]). (12.106)

(e) Show that when κ = 0, we have the least squares estimator, and that MSE0 = σ2 ∑ δi−2. Thus if any of the δi's are near 0, the MSE can be very large. (f) Show that ‖Biasκ‖2 is increasing in κ, and trace(Cov[β̂κ]) is decreasing in κ, for κ ≥ 0. Also, show that

(∂/∂κ) MSEκ |κ=0 = −2σ2 ∑ δi−4. (12.107)

Argue that for small enough κ, the ridge estimator has a lower MSE than least squares. Note that the smaller the δi's, the more advantage the ridge estimator has. (This result is due to Hoerl and Kennard (1970).)

Exercise 12.7.18. Consider the regression model with just one x and no intercept, so that yi = βxi + ei, i = 1, . . . , n. (a) Show that ‖y − xb‖2 = SSe + (b − β̂)2 ∑ xi2, where β̂ is the least squares estimator ∑ xiyi/∑ xi2. (b) Show that the b that minimizes the lasso objective function in (12.64) also minimizes h(b) = (b − β̂)2 + λ∗|b| as in (12.65), where λ∗ ≥ 0. (c) Show that for b ≠ 0 the derivative of h exists and

h′(b) = 2(b − β̂) + λ∗ Sign(b). (12.108)

(d) Show that if h′(b) = 0, then b = β̂ − (λ∗/2) Sign(b). Also, show that such a b exists if and only if |β̂| ≥ λ∗/2, in which case b and β̂ have the same sign. [Hint: Look at the signs of the two sides of the equation depending on whether β̂ is bigger or smaller than λ∗/2.]


Chapter 13

Likelihood, Sufficiency, and MLEs

13.1 Likelihood function

If we know θ, then the density tells us what X is likely to be. In statistics, we do not know θ, but we do observe X = x, and wish to know which values of θ are likely. The analog of the density for the statistical problem is the likelihood function, which is the same as the density, but considered as a function of θ for fixed x. The function is not itself a density (usually), because there is no particular reason to believe that its integral over θ for fixed x is 1. Rather, the likelihood function gives the relative likelihood of various values of θ. It is fundamental to Bayesian inference, and extremely useful in frequentist inference. It encapsulates the relationship between θ and the data.

Definition 13.1. Suppose X has density f(x | θ) for θ ∈ T. Then a likelihood function for observation X = x is

L(θ ; x) = cx f(x | θ), θ ∈ T, (13.1)

where cx is any positive constant.

Likelihoods are to be interpreted in only a relative fashion; that is, to say the likelihood of a particular θ1 is L(θ1 ; y) does not mean anything by itself. Rather, meaning is attributed to saying that the relative likelihood of θ1 to θ2 (in light of the data y) is L(θ1 ; y)/L(θ2 ; y). There is a great deal of controversy over what exactly the relative likelihood means. We do not have to worry about that particularly, since we are just using likelihood as a means to an end. The general idea, though, is that the data supports θ's with relatively high likelihood.

For example, suppose X ∼ Binomial(n, θ), θ ∈ (0, 1), n = 5. The pmf is

f(x | θ) = (n choose x) θx(1 − θ)n−x, x = 0, . . . , n. (13.2)

The likelihood is

L(θ ; x) = cx θx(1 − θ)n−x, 0 < θ < 1. (13.3)

See Figure 13.1 for graphs of the two functions.


Figure 13.1: Binomial pmf and likelihood, where X ∼ Binomial(5, θ).

Here, cx is a constant that may depend on x but not on θ. Likelihood is not probability, in particular because θ is not necessarily random. Even if θ is random, the likelihood is not the pdf, but rather the part of the pdf that depends on x. That is, suppose π is the prior pdf of θ. Then the posterior pdf can be written

f(θ | x) = f(x | θ)π(θ) / ∫T f(x | θ∗)π(θ∗)dθ∗ = L(θ ; x)π(θ). (13.4)

(Note: These densities should have subscripts, e.g., f(x | θ) should be fX|θ(x | θ), but I hope leaving them off is not too confusing.) Here, the cx is the inverse of the integral in the denominator of (13.4) (the marginal density, which does not depend on θ as it is integrated away). Thus though the likelihood is not a density, it does tell us how to update the prior to obtain the posterior.

13.2 Likelihood principle

As we will see in this and later chapters, likelihood functions are very useful in inference, whether from a Bayesian or frequentist point of view. Going beyond the utility of likelihood, the likelihood principle is a fairly strong rule that purports to judge inferences. It basically says that if two experiments yield the same likelihood, then they should yield the same inference. We first define what it means for two outcomes to have the same likelihood, then briefly illustrate the principle. Berger and Wolpert (1988) goes into much more depth.

Definition 13.2. Suppose we have two models, X with density f (x | θ) and Y with densityg(y | θ), that depend on the same parameter θ with space T . Then x and y have the samelikelihood if for some positive constants cx and cy in (13.1),

L(θ ; x) = L∗(θ ; y) for all θ ∈ T , (13.5)

where L and L∗ are their respective likelihoods.


The two models in Definition 13.2 could very well be the same, in which case x and y are two possible elements of X. As an example, suppose the model has X = (X1, X2), where the elements are iid N(µ, 1), and µ ∈ T = R. Then the pdf is

f(x | µ) = (1/(2π)) exp(−(1/2)∑(xi − µ)2) = (1/(2π)) exp(−(1/2)(x12 + x22) + (x1 + x2)µ − µ2). (13.6)

Consider two possible observed vectors, x = (1, 2) and y = (−3, 6). These observations have quite different pdf values. Their ratio is

f(1, 2 | µ)/f(−3, 6 | µ) = 485165195. (13.7)

This ratio is interesting in two ways: It shows the probability of being near x is almost half a billion times larger than the probability of being near y, and it shows the ratio does not depend on µ. That is, their likelihoods are the same:

L(µ ; x) = c(1,2) exp(3µ − µ2) and L(µ ; y) = c(−3,6) exp(3µ − µ2). (13.8)

It does not matter what the constants are. We could just take c(1,2) = c(−3,6) = 1, but the important aspect is that the sums of the two observations are both 3, and the sum is the only part of the data that hits µ.

13.2.1 Binomial and negative binomial

The models can be different, but they do need to share the same parameter. For example, suppose we have a coin with probability of heads being θ ∈ (0, 1), and we intend to flip it a number of times independently. Here are two possible experiments:

• Binomial. Flip the coin n = 10 times, and count X, the number of heads, so that X ∼ Binomial(10, θ).

• Negative binomial. Flip the coin until there are 4 heads, and count Y, the number of tails obtained. This Y is Negative Binomial(4, θ).

The space of the Negative Binomial(K, θ) is Y = {0, 1, 2, . . .}, and the pmf is given in Table 1.2 on page 9:

g(y | θ) = ((K − 1 + y) choose (K − 1)) θK(1 − θ)y. (13.9)

Next, suppose we perform the binomial experiment and obtain X = 4 heads out of 10 flips. The likelihood is the usual binomial one:

L(θ ; 4) = (10 choose 4) θ4(1 − θ)6. (13.10)

Also, suppose we perform the negative binomial experiment and happen to see Y = 6 tails before the K = 4th head. The likelihood here is

L∗(θ ; 6) = (9 choose 3) θ4(1 − θ)6. (13.11)


The likelihoods are the same. I left the constants there to illustrate that the pmfsare definitely different, but erasing the constants leaves the same θ4(1− θ)6. Thesetwo likelihoods are based on different random variables. The binomial has a fixednumber of flips but could have any number of heads (between 0 and 10), while thenegative binomial has a fixed number of heads but could have any number of flips(over 4). In particular, either experiment with the given outcome would yield thesame posterior for θ.

The likelihood principle says that if two outcomes (whether they are from the sameexperiment or not) have the same likelihood, then any inference made about θ basedon the outcomes must be the same. Any inference that is not the same under the twoscenarios are said to “violate the likelihood principle.” Bayesian inference does notviolate the likelihood principle, nor does maximum likelihood estimation, as long asyour inference is just “Here is the estimate ...”

Unbiased estimation does violate the likelihood principle. Keeping with the aboveexample, we know that X/n is an unbiased estimator of θ for the binomial. ForY ∼ Negative Binomial(K, θ), the unbiased estimator is found by ignoring the lastflip, because that we know is always heads, so would bias the estimate if used. Thatis,

E[θ∗U ] = θ, θ∗U =K− 1

Y + K− 1. (13.12)

See Exercise 4.4.15.Now we test out the inference: “The unbiased estimate of θ is ...”:

• Binomial(10, θ), with outcome x = 4. “The unbiased estimate of θ is θU =4/10 = 2/5.”

• Negative Binomial(4, θ), with outcome y = 6. “The unbiased estimate of θ isθ∗U = 3/9 = 1/3.”

Those two situations have the same likelihood, but different estimates, thus violatingthe likelihood principle!

The problem with unbiasedness is that it depends on the entire density, i.e., onoutcomes not observed, so different densities would give different expected values.For that reason, any inference that involves the operating characteristics of the proce-dure violates the likelihood principle.

Whether one decides to fully accept the likelihood principle or not, it provides animportant guide for any kind of inference, as we shall see.

13.3 Sufficiency

Consider again (13.6), or more generally, X1, . . . , Xn iid N(µ, 1). The likelihood is

L(µ ; x1, . . . , xn) = eµ ∑ xi− n2 µ2

, (13.13)

where the exp(−∑ x2i /2) part can be dropped as it does not depend on µ. Note that

this function depends on the xi’s only through their sum, that is, as in (13.8), if x andx∗ have the same sum, they have the same likelihood. Thus the likelihood principlesays that all we need to know is the sum of the xi’s to make an inference about µ.This sum is a sufficient statistic.

Page 215: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

13.3. Sufficiency 203

Definition 13.3. Consider the model with space X and parameter space T . A function

s : X −→ S (13.14)

is a sufficient statistic if for some function b,

b : S × T −→ [0, ∞), (13.15)

the constant in the likelihood can be chosen so that

L(θ ; x) = b(s(x), θ). (13.16)

Thus S = s(X) is a sufficient statistic (it may be a vector) if by knowing S, youknow the likelihood, i.e., it is sufficient for performing any inference. That is handy,because you can reduce your data set, which may be large, to possibly just a fewstatistics without losing any information. More importantly, it turns out that the bestinferences depend on just the sufficient statistics.

We next look at some examples. First, we note that for any model, the data x isitself sufficient, because the likelihood depends on x through x.

13.3.1 IID

If X1, . . . , Xn are iid with density f (xi | θ), then no matter what the model, the orderstatistics (see Section 5.5) are sufficient. To see this fact, write

L(θ ; x) = f (x1 | θ) · · · f (xn | θ) = f (x(1) | θ) · · · f (x(n) | θ) = b((x(1), . . . , x(n)), θ),(13.17)

because the order statistics are just the xi’s in a particular order.

13.3.2 Normal distribution

If X1, . . . , Xn are iid N(µ, 1), µ ∈ T = R, then we have several candidates for sufficientstatistic:

The data itself : s1(x) = x;The order statistics : s2(x) = (x(1), . . . , x(n));

The sum : s3(x) = ∑ xi;

The mean : s4(x) = x;Partial sums : s5(x) = (x1 + x2, x3 + x4 + x5, x6) (if n = 6). (13.18)

An important fact is that any one-to-one function of a sufficient statistic is alsosufficient, because knowing one means you know the other, hence you know the like-lihood. For example, the mean and sum are one-to-one. Also, note that the dimensionof the sufficient statistics in (13.18) are different (n, n, 1, 1, and 3, respectively). Gener-ally, one prefers the most compact one, in this case either the mean or sum. Each ofthose are functions of the others. In fact, they are minimal sufficient.

Definition 13.4. A statistic s(x) is minimal sufficient if it is sufficient, and given anyother sufficient statistic t(x), there is a function h such that s(x) = h(t(x)).

Page 216: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

204 Chapter 13. Likelihood, Sufficiency, and MLEs

This concept is important, but we will later focus on the more restrictive notion ofcompleteness.

Which statistics are sufficient depend crucially on the parameter space. In theabove, we assumed the variance known. But suppose X1, . . . , XN are iid N(µ, σ2),with (µ, σ2) ∈ T ≡ R× (0, ∞). Then the likelihood is

L(µ, σ2 ; x) =1

σn e−1

2σ2 ∑(xi−µ)2=

1σn e−

12σ2 ∑ x2

i +1

σ2 µ ∑ xi− n2σ2 µ2

. (13.19)

Now we cannot eliminate the ∑ x2i part, because it involves σ2. Here the sufficient

statistic is two-dimensional:

s(x) = (s1(x), s2(x)) = (∑ xi, ∑ x2i ), (13.20)

and the b function is

b(s1, s2) =1

σn e−1

2σ2 s2+1

σ2 µs1− n2σ2 µ2

. (13.21)

13.3.3 Uniform distribution

Suppose X1, . . . , Xn are iid Uniform(0, θ), θ ∈ (0, ∞). The likelihood is

L(θ ; x) = ∏1θ

I[0 < xi < θ] =

1θn if 0 < xi < θ for all xi0 if not

. (13.22)

All the xi’s are less than θ if and only if the largest one is, so that we can write

L(θ ; x) = 1

θn if 0 < x(n) < θ

0 if not. (13.23)

Thus the likelihood depends on x only through the maximum, hence x(n) is sufficient.

13.3.4 Laplace distribution

Now suppose X1, . . . , Xn are iid Laplace(θ), θ ∈ R, which has pdf exp(−|xi − θ|)/2,so that the likelihood is

L(θ ; x) = e−∑ |xi−θ|. (13.24)

Because the data are iid, the order statistics are sufficient, but unfortunately, thereis not another sufficient statistic with smaller dimension. The absolute value barscannot be removed. Similar models based on the Cauchy, logistic, and others havethe same problem.

13.3.5 Exponential families

In some cases, the sufficient statistic does reduce the dimensionality of the data signif-icantly, such as in the iid normal case, where no matter how large n is, the sufficientstatistic is two-dimensional. In other cases, such as the Laplace above, there is no di-mensionality reduction, so one must still carry around n values. Exponential familiesare special families in which there is substantial reduction for iid variables or vec-tors. Because the likelihood of an iid sample is the product of individual likelihoods,statistics “add up” when they are in an exponent.

Page 217: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

13.4. Conditioning on a sufficient statistic 205

The vector X has an exponential family distribution if its density (pdf or pmf)depends on a p× 1 vector θ and can be written

f (x | θ) = a(x)eθ1t1(x)+···+θptp(x)−ψ(θ) (13.25)

for some functions t1(x), . . . , tp(x), a(x), and ψ(θ). The θ is called the natural pa-rameter and the t(x) = (t1(x), . . . , tp(x))′ is the vector of natural sufficient statistics.Now if X1, . . . , Xn are iid vectors with density (13.25), then the joint density is

f (x1, . . . , xn | θ) = ∏ f (xi | θ) = [∏ a(xi)]eθ1Σit1(xi)+···+θpΣitp(xi)−nψ(θ), (13.26)

hence has sufficient statistic

s(x1, . . . , xn) =

∑i t1(xi)...

∑i tp(xi)

= ∑i

t(xi), (13.27)

which has dimension p no matter how large n.The natural parameters and sufficient statistics are not necessarily the most “nat-

ural” to us. For example, in the normal case of (13.19), the natural parameters can betaken to be

θ1 =µ

σ2 and θ2 = − 12σ2 . (13.28)

The corresponding statistics are

t(xi) =

(t1(xi)t2(xi)

)=

(xix2

i

). (13.29)

There are other choices, e.g., we could switch the negative sign from θ2 to t2.Other exponential families include the Poisson, binomial, gamma, beta, multivari-

ate normal, and multinomial.

13.4 Conditioning on a sufficient statistic

The intuitive meaning of a sufficient statistic is that once you know the statistic,nothing else about the data helps in inference about the parameter. For example, inthe iid case, once you know the values of the xi’s, it does not matter in what orderthey are listed. This notion can be formalized by finding the conditional distributionof the data given the sufficient statistic, and showing that it does not depend on theparameter.

First, a lemma that makes it easy to find the conditional density, if there is one.

Lemma 13.5. Suppose X has space X and density fX, and S = s(X) is a function of X withspace S and density fS. Then the conditional density of X given S, if it exists, is

fX|S(x | s) =fX(x)fS(s)

for x ∈ Xs ≡ x ∈ X | s(x) = s. (13.30)

The caveat “if it exists” in the lemma is unnecessary in the discrete case, becausethe conditional pmf always will exist. But in continuous or mixed cases, the resultingconditional distribution may not have a density with respect to Lebesgue measure. Itwill have a density with respect to some measure, though, which one would see in ameasure theory course.

Page 218: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

206 Chapter 13. Likelihood, Sufficiency, and MLEs

Proof. (Discrete case) Suppose X is discrete. Then by Bayes theorem,

fX|S(x | s) =fS|X(s | x) fX(x)

fS(s), x ∈ Xs. (13.31)

Because S is a function of X,

fS|X(s | x) = P[s(X) = s |X = x] =

1 if s(x) = s0 if s(x) 6= s

=

1 if x ∈ Xs0 if x 6∈ Xs

.

(13.32)Thus fS|X(s | x) = 1 in (13.31), because x ∈ Xs, so we can erase it, yielding (13.30).

Here is the main result.

Lemma 13.6. Suppose s(x) is a sufficient statistic for a model with data X and parameterspace T . Then the conditional distribution X | s(X) = s does not depend on θ.

Before giving a proof, consider the example with X1, . . . , Xn iid Poisson(θ), θ ∈(0, ∞). The likelihood is

L(θ ; x) = ∏ e−θθxi = e−nθθΣxi . (13.33)

We see that s(x) = ∑ xi is sufficient. We know that S = ∑ Xi is Poisson(nθ), hence

fS(s) = e−nθ (nθ)s

s!. (13.34)

Then by Lemma 13.5, the conditional pmf of X given S is

fX|S(x | s) =fX(x)fS(s)

=e−nθθΣxi / ∏ xi!

e−nθ(nθ)s/s!

=s!

∏ xi!1ns , x ∈ Xs = x | ∑ xi = s. (13.35)

That is a multinomial distribution:

X | ∑ Xi = s ∼ Multinomialn(s, ( 1n , . . . , 1

n )). (13.36)

But the main point of Lemma 13.6 is that this distribution is independent of θ. Thusknowing the sum means the exact values of the xi’s do not reveal anything extraabout θ.

The key to the result lies in (13.33) and (13.34), where we see that the likelihoodsof X and s(X) are the same, if ∑ xi = s. This fact is a general one.

Lemma 13.7. Suppose s(X) is a sufficient statistic for the model with data X, and considerthe model for S where S =D s(X). Then the likelihoods for X = x and S = s are the same ifs = s(x).

Page 219: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

13.4. Conditioning on a sufficient statistic 207

This result should make sense, because it is saying that the sufficient statistic con-tains the same information about θ that the full data do. One consequence is that in-stead of having to work with the full model, one can work with the sufficient statistic’smodel. For example, instead of working with X1, . . . , Xn iid N(µ, σ), you can workwith just the two independent random variables X ∼ N(µ, σ2/n) and S2 ∼ σ2χ2

n−1/nwithout losing any information.

We prove the lemma in the discrete case.

Proof. Let fX(x | θ) be the pmf of X, and fS(s | θ) be that of S. Because s(X) is suffi-cient, the likelihood for X can be written

L(θ ; x) = cx fX(x | θ) = b(s(x), θ) (13.37)

by Definition 13.3, for some cx and b(s, θ). The pmf of S is

fS(s | θ) = Pθ[s(X) = s] = ∑x∈Xs

fX(x | θ) (where Xs = x ∈ X | s(x) = s)

= ∑x∈Xs

b(s(x), θ)/cx

= b(s, θ) ∑x∈Xs

1/cx (because in the summation, s(x) = s)

= b(s, θ)ds, (13.38)

where ds is that sum of 1/cx’s, which does not depend on θ. Thus the likelihood ofS can be written

L∗(θ ; s) = b(s, θ), (13.39)the same as L in (13.37).

A formal proof for the continuous case proceeds by introducing appropriate extravariables Y so that X and (S, Y) are one-to-one, then using Jacobians, then integratingout the Y. We will not do that, but special cases can be done easily, e.g., if X1, . . . , Xnare iid N(µ, 1), one can directly show that X and s(X) = ∑ Xi ∼ N(nµ, n) have thesame likelihood.

Now for the proof of Lemma 13.6, again in the discrete case. Basically, we justrepeat the calculations for the Poisson.

Proof. As in the proof of Lemma 13.7, the pmfs of X and S can be written, respectively,

fX(x | θ) = L(θ ; x)/cx and fS(s | θ) = L∗(θ ; s)ds, (13.40)

where the likelihoods are equal in the sense that

L(θ ; x) = L∗(θ ; s) if x ∈ Xs. (13.41)

Then by Lemma 13.5,

fX|S(x | s, θ) =fX(x | θ)fS(s | θ)

for x ∈ Xs

=L(θ ; x)/cx

L∗(θ ; s)dsfor x ∈ Xs

=1

cx dsfor x ∈ Xs (13.42)

by (13.41), which does not depend on θ.

Page 220: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

208 Chapter 13. Likelihood, Sufficiency, and MLEs

Many statisticians would switch Definition 13.3 and Lemma 13.6 for sufficiency.That is, a statistic s(X) is defined to be sufficient if the conditional distributionX | s(X) = s does not depend on the parameter. Then one can show that the likelihooddepends on x only through s(x), that result being Fisher’s factorization theorem.

We end this section with a couple of additional examples that derive the condi-tional distribution of X given s(X).

13.4.1 IID

Suppose X1, . . . , Xn are iid continuous random variables with pdf f (xi). No matterwhat the model, we know that the order statistics are sufficient, so we will suppressthe θ. We can proceed as in the proof of Lemma 13.6. Letting s(x) = (x(1), . . . , x(n)),we know that the joint pdfs of X and S ≡ s(X) are, respectively,

fX(x) = ∏ f (xi) and fS(s) = n! ∏ f (x(i)). (13.43)

The products of the pdfs are the same, just written in different orders. Thus

P[X = x | s(X) = s] =1n!

for x ∈ Xs. (13.44)

The s is a particular set of ordered values, and Xs is the set of all x that have the samevalues as s, but in any order. To illustrate, suppose that n = 3 and s(X) = s = (1, 2, 7).Then X has a conditional chance of 1/6 of being any x with order statistics (1, 2, 7):

X(1,2,7) = (1, 2, 7), (1, 7, 2), (2, 1, 7), (2, 7, 1), (7, 1, 2), (7, 2, 1). (13.45)

The discrete case works as well, although the counting is a little more complicatedwhen there are ties. For example, if s(x) = (1, 3, 3, 4), there are 4!/2! = 12 differentorderings.

13.4.2 Normal mean

Suppose X1, . . . , Xn are iid N(µ, 1), so that X is sufficient (as is ∑ Xi). We wish tofind the conditional distribution of X given X = x. Because we have normality andlinear functions, everything is straightforward. First we need the joint distribution ofW = (X, X′)′, which is a linear transformation of X ∼ N(µ1n, In), hence multivariatenormal. We could figure out the matrix for the transformation, but all we really needare the mean and covariance matrix of W. The mean and covariance of the X part weknow, and the mean and variance of X are µ and 1/n, respectively. All that is left isthe covariance of the Xi’s with X, which are all the same, and can be found directly:

Cov[Xi, X] =1n

n

∑j=1

Cov[Xi, Xj] =1n

, (13.46)

because Xi is independent of all the Xj’s except for Xi itself, with which its covarianceis 1. Thus

W =

(XX

)∼ N

(µ1n+1,

( 1n

1n 1′n

1n 1n In

)). (13.47)

Page 221: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

13.5. Rao-Blackwell: Improving an estimator 209

For the conditional distribution, we use Lemma 7.8. The X there is X here, andthe Y there is the X here. Thus

ΣXX =1n

, ΣYX =1n

1n, ΣYY = In, µX = µ, and µY = µ1n, (13.48)

hence

β =1n

1n1

n−1 = 1n and α = µ1n − βµ = µ1n − 1nµ = 0n. (13.49)

Thus the conditional mean is

E[X |X = x] = α + β x = x 1n. (13.50)

That result should not be surprising. It says that if you know the sample mean is x,you expect on average the observations to be x. The conditional covariance is

ΣYY − ΣYXΣ−1XXΣXY = In −

1n

1n1

n−11n

1′n = In −1n

1n1′n = Hn, (13.51)

the centering matrix from (7.38). Putting it all together:

X |X = x ∼ N(x 1n, Hn). (13.52)

We can pause and reflect that this distribution is free of µ, as it had better be. Notealso that if we subtract the conditional mean, we have

X− X 1n |X = x ∼ N(0n, Hn). (13.53)

That conditional distribution is free of x, meaning the vector is independent of X.But this is the vector of deviations, and we already knew it is independent of X from(7.43).

13.4.3 Sufficiency in Bayesian analysis

Since the Bayesian posterior distribution depends on the data only through the like-lihood function for the data, and the likelihood for the sufficient statistic is the sameas that for the data (Lemma 13.7), one need deal with just the sufficient statistic whenfinding the posterior. See Exercise 13.8.7.

13.5 Rao-Blackwell: Improving an estimator

We see that sufficient statistics are nice in that we do not lose anything by restricting tothem. In fact, they are more than just convenient — if you base an estimate on moreof the data than the sufficient statistic, then you do lose something. For example,suppose X1, . . . , Xn are iid N(µ, 1), µ ∈ R, and we wish to estimate

g(µ) = Pµ[Xi ≤ 10] = Φ(10− µ), (13.54)

Φ being the distribution function of the standard normal. An unbiased estimator is

δ(x) =#xi | xi ≤ 10

n=

1n ∑ I[xi ≤ 10]. (13.55)

Page 222: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

210 Chapter 13. Likelihood, Sufficiency, and MLEs

Note that this estimator is not a function of just x, the sufficient statistic. The claim isthat there is another unbiased estimator that is a function of just x and is better, i.e.,has lower variance.

Fortunately, there is a way to find such an estimator besides guessing. We useconditioning as in the previous section, and the results on conditional means andvariances in Section 6.4.3. Start with an estimator δ(x) of g(θ), and a sufficient statisticS = s(X). Then consider the conditional expected value of δ:

δ∗(s) = E[δ(X) | s(X) = s]. (13.56)

First, we need to make sure δ∗ is an estimator, which means that it does not dependon the unknown θ. But by Lemma 13.6, we know that the conditional distributionof X given S does not depend on θ, and δ does not depend on θ because it is anestimator, hence δ∗ is an estimator. If we condition on something not sufficient, thenwe may not end up with an estimator.

Is δ∗ a good estimator? From (6.38), we know it has the same expected value as δ:

Eθ[δ∗(S)] = Eθ[δ(X)], (13.57)

so thatBiasθ[δ

∗] = Biasθ[δ]. (13.58)

Thus we haven’t done worse in terms of bias, and in particular if δ is unbiased, so isδ∗.

Turning to variance, we have the “variance-conditional variance” equation (6.43),which translated to our situation here is

Varθ[δ(X)] = Eθ[v(S)] + Varθ[δ∗(S)], (13.59)

wherev(s) = Varθ[δ(X) | s(X) = s]. (13.60)

Whatever v is, it is not negative, hence

Varθ[δ∗(S)] ≤ Varθ[δ(X)]. (13.61)

Thus variance-wise, the δ∗ is no worse than δ. In fact, δ∗ is strictly better unless v(S)is zero with probability one. But in that case, δ and δ∗ are the same, so that δ itself isa function of just S already.

Finally, if the bias is the same and the variance of δ∗ is lower, then the meansquared error of δ∗ is better. To summarize:

Theorem 13.8. Rao-Blackwell. If δ is an estimator of g(θ), and s(X) is sufficient, thenδ∗(s) given in (13.56) has the same bias as δ, and smaller variance and MSE, unless δ is afunction of just s(X), in which case δ and δ∗ are the same.

13.5.1 Normal probability

Consider the estimator δ(x) in (13.55) for g(µ) = Φ(10− µ) in (13.54) in the normalcase. With X being the sufficient statistic, we can find the conditional expected valueof δ. We start by finding the conditional expected value of just one of the I[xi ≤10]’s. It turns out that the conditional expectation is the same for each i, hence the

Page 223: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

13.5. Rao-Blackwell: Improving an estimator 211

conditional expected value of δ is the same as the condition expected value of justone. So we are interested in finding

δ∗(x) = E[I[Xi ≤ 10] |X = x] = P[Xi ≤ 10 |X = x]. (13.62)

From (13.52) we have that

Xi |X = x ∼ N(

x, 1− 1n

), (13.63)

hence

δ∗(x) = P[Xi ≤ 10 |X = x]= P[N(x, 1− 1/n) ≤ 10]

= Φ(

10− x√1− 1/n

). (13.64)

This estimator is then guaranteed to be unbiased, and have a lower variance that δ.It would have been difficult to come up with this estimator directly, or even showthat it is unbiased, but the original δ is quite straightforward, as is the conditionalcalculation.

13.5.2 IID

In Section 13.4.1, we saw that when the observations are iid, the conditional dis-tribution of X given the order statistics is uniform over all the permutations of theobservations. One consequence is that any estimator must be invariant under permu-tations, or else it can be improved. For a simple example, consider estimating µ, themean, with the simple estimator δ(X) = X1. Then with s(x) being the order statistics,

δ∗(s) = E[X1 | s(X) = s] =1n ∑ si = s (13.65)

because X1 is conditionally equally likely to be any of the order statistics. Of course,the mean of the order statistics is the same as the x, so we have that X is a betterestimate than X1, which we knew. The procedure applied to any weighted averagewill also end up with the mean, e.g.,

E[

12

X1 +13

X2 +16

X4

∣∣∣∣ s(X) = s]=

12

E[X1 | s(X) = s] +13

E[X2 | s(X) = s]

+16

E[X4 | s(X) = s]

=12

s +13

s +16

s

= s. (13.66)

Turning to the variance σ2, because X1 − X2 has mean 0 and variance 2σ2, δ(x) =(x1 − x2)

2/2 is an unbiased estimator of σ2. Conditioning on the order statistics, weobtain the mean of all the (xi − xj)

2/2’s with i 6= j:

δ∗(s) = E[(X1 − X2)

2

2

∣∣∣∣ s(X) = s]=

1n(n− 1) ∑ ∑i 6=j

(xi − xj)2

2. (13.67)

Page 224: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

212 Chapter 13. Likelihood, Sufficiency, and MLEs

After some algebra, we can write

δ∗(s(x)) = ∑(xi − x)2

n− 1, (13.68)

the usual unbiased estimate of σ2.The above estimators are special cases of U-statistics, for which there are many

nice asymptotic results. A U-statistic is based on a kernel h(x1, . . . , xd), a functionof a subset of the observations. The corresponding U-statistic is the symmetrizedversion of the kernel, i.e., the conditional expected value,

u(x) = E[h(X1, . . . , Xd) | s(X) = s(x)]

=1

n(n− 1) · · · (n− d + 1) ∑ · · ·∑i1,...,id , distincth(xi1 , . . . , xid). (13.69)

See Serfling (1980) for more on these statistics.

13.6 Maximum likelihood estimates

If L(θ ; x) reveals how likely θ is in light of the data x, it seems reasonable that themost likely θ would be a decent estimate of θ. In fact, it is reasonable, and theresulting estimator is quite popular.

Definition 13.9. Given the model with likelihood L(θ ; x) for θ ∈ T , the maximum likeli-hood estimate (MLE) at observation x is the unique value of θ that maximizes L(θ ; x) overθ ∈ T , if such unique value exists. Otherwise, the MLE does not exist at x.

By convention, the MLE of any function of the parameter is the function of theMLE:

g(θ) = g(θ), (13.70)

a plug-in estimator. See Exercises 13.8.10 through 13.8.12 for some justification of theconvention.

There are times when the likelihood does not technically have a maximum, butthere is an obvious limit of the θ’s that approach the supremum. For example, sup-pose X ∼ Uniform(0, θ). Then the likelihood at x > 0 is

L(θ ; x) =1θ

Ix<θ(θ) =

1θ if θ > x

0 if θ ≤ x. (13.71)

Figure 13.2 exhibits the likelihood function when x = 4. The highest point occursat θ = 4, almost. To be precise, there is no maximum, because the graph is notcontinuous at θ = 4. But we still take the MLE to be θ = 4 in this case, because wecould switch the filled-in dot from (4, 0) to (4, 1/4) by taking X ∼ Uniform(0, θ], sothat the pdf at x = θ is 1/θ, not 0.

Often, the MLE is found by differentiating the likelihood, or the log of the like-lihood (called the loglikelihood). Because the log function is strictly increasing, thesame value of θ maximizes the likelihood and the loglikelihood. For example, in thebinomial example, from (13.3) the loglikelihood is (dropping the cx)

l(θ ; x) = log(L(θ ; x)) = x log θ + (n− x) log(1− θ). (13.72)

Page 225: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

13.6. Maximum likelihood estimates 213

0 2 4 6 8 10

0.00

0.10

0.20

0.30

likel

ihoo

d

θ

Figure 13.2: Uniform(0, θ) likelihood when x = 4.

Thenl′(θ ; x) =

xθ− n− x

1− θ. (13.73)

Set that expression to 0 and solve for θ to obtain

θ =xn

. (13.74)

From Figure 13.1, one can see that the maximum is at x/n = 0.4.If θ is p× 1, then one must do a p-dimensional maximization. For example, sup-

pose X1, . . . , Xn are iid N(µ, σ2), (µ, σ2) ∈ R × (0, ∞). Then the loglikelihood is(dropping the

√2π’s),

log(L(µ, σ2 ; x1, . . . , xn)) = −1

2σ2 ∑(xi − µ)2 − n2

log(σ2). (13.75)

We could go ahead and differentiate with respect to µ and σ2, obtaining two equa-tions. But notice that µ appears only in the sum of squares part, and we have already,in Exercise 12.7.1, found that µ = x. (It is not hard to prove by differentiation.) Thenfor the variance, we need to maximize

log(L(x, σ2 ; x1, . . . , xn)) = −1

2σ2 ∑(xi − x)2 − n2

log(σ2). (13.76)

Differentiating with respect to σ2,

∂(σ2)log(L(x, σ2 ; x1, . . . , xn)) =

12σ4 ∑(xi − x)2 − n

21

σ2 , (13.77)

and setting to 0 leads to

σ2 =∑(xi − x)2

n= s2. (13.78)

So the MLE of (µ, σ2) is (x, s2). It is then easy to find the MLEs of functions of (µ, σ2),e.g., the MLE of the coefficient of variation σ/µ is s/x as in (13.70).

Page 226: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

214 Chapter 13. Likelihood, Sufficiency, and MLEs

Chapter 14 delves more deeply into likelihood estimation. In particular, underconditions, we can obtain an automatic estimate of the standard error based on theFisher information.

13.7 Functions of estimators

If θ is an estimator of θ, then g(θ) is an estimator of g(θ). Do the properties of θ

transfer to g(θ)? Maybe, maybe not. Although there are exceptions (such as when gis linear for the first two statements), generally

If θU is an unbiasedestimator of θ

then g(θU) is not an unbiasedestimator of g(θ);

If θB is the Bayes pos-terior mean of θ

then g(θB) is not the Bayes posteriormean of g(θ);

If θmle is the MLE of θ then g(θmle) is the MLE of g(θ).

The basic reason for the first two statements is the following:

Theorem 13.10. If Y is a random variable with finite mean, and g is a function of y, then

E[g(Y)] 6= g(E[Y]) (13.79)

unless

• g is linear: g(y) = a + b y for constants a and b;

• Y is essentially constant: P[Y = µ] = 1 for some µ;

• You are lucky.

A simple example is g(x) = x2. If E[X2] = E[X]2, then X has variance 0. As longas X is not a constant, E[X2] > E[X]2.

13.7.1 Poisson distribution

Suppose X ∼ Poisson(θ), θ ∈ (0, ∞). Then for estimating θ, θU = X is unbi-ased, and happens to be the MLE as well. For a Bayes estimate, take the priorΘ ∼ Exponential(1). The likelihood is L(θ ; x) = exp(−θ)θx, hence the posterioris

π(θ | x) = cL(θ ; x)π(θ) = ce−θθxe−θ = cθxe−2θ , (13.80)

which is Gamma(x + 1, 2). Thus from Table 1.1, the posterior mean with respect tothis prior is

θB = E[θ |X = x] =x + 1

2. (13.81)

Now consider estimating g(θ) = exp(−θ), which is the Pθ [X = 0]. (So if θ is theaverage number of telephone calls coming in an hour, exp(−θ) is the chance there

Page 227: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

13.8. Exercises 215

are 0 calls in the next hour.) Is g(θU) unbiased? No:

E[g(θU)] = E[e−X ] = e−θ∞

∑x=0

e−x θx

x!

= e−θ∞

∑x=0

(e−1θ)x

x!

= e−θee−1θ

= eθ(e−1−1) 6= e−θ . (13.82)

There is an unbiased estimator for g, namely I[X = 0].Turn to the posterior mean of g(θ), which is

g(θ)B = E[g(θ) |X = x] =∫ ∞

0g(θ)π(θ | x)dθ

=2x+1

Γ(x + 1)

∫ ∞

0e−θθx e−2θdθ

=2x+1

Γ(x + 1)

∫ ∞

0θx e−3θdθ

=2x+1

Γ(x + 1)Γ(x + 1)

3x+1

=

(23

)x+16= e−θB = e−

x+12 . (13.83)

See Exercise 13.8.10 for the MLE.

13.8 Exercises

Exercise 13.8.1. Imagine a particular phone in a call center, and consider the timebetween calls. Let X1 be the waiting time in minutes for the first call, X2 the wait-ing time for the second call after the first, etc. Assume that X1, X2, . . . , Xn are iidExponential(θ). There are two devices that may be used to record the waiting timesbetween the calls. The old mechanical one can measure each waiting time up to onlyan hour, while the new electronic device can measure the waiting time with no limit.Thus there are two possible experiments:

Old: Using the old device, one observes Y1, . . . , Yn, where Yi = minXi,60, that is, if the true waiting time is over 60 minutes, the device records60.

New: Using the new device, one observes the true X1, . . . , Xn.

(a) Using the old device, what is Pθ [Yi = 60]? (b) Using the old device, find Eθ [Yi].Is Yn an unbiased estimate of 1/θ? (c) Using the new device, find Eθ [Xi]. Is Xn anunbiased estimate of 1/θ?

In what follows, suppose the actual waiting times are 10, 12, 25, 35, 38 (so n = 5).(d) What is the likelihood for these data when using the old device? (e) What is thelikelihood for these data using the new device? Is it the same as for the old device?

Page 228: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

216 Chapter 13. Likelihood, Sufficiency, and MLEs

(f) Let µ = 1/θ. What is the MLE of µ using the old device for these data? The newdevice? Are they the same? (g) Let the prior on θ be Exponential(25). What is theposterior mean of µ for these data using the old device? The new device? Are theythe same? (h) Look at the answers to parts (b), (c), and (f). What is odd about thesituation?

Exercise 13.8.2. In this question X and Y are independent, with

X ∼ Poisson(2θ), Y ∼ Poisson(2(1− θ)), θ ∈ (0, 1). (13.84)

(a) Which of the following are sufficient statistics for this model? (i) (X, Y); (ii) X +Y;(iii) X − Y; (iv) X/(X + Y); (v) (X + Y, X − Y). (b) If X = Y = 0, which of thefollowing are versions of the likelihood? (i) exp(−2) exp(4θ); (ii) exp(−2); (iii) 1; (iv)exp(−2θ). (c) What value(s) of θ maximize the likelihood when X = Y = 0? (d) Whatis the MLE for θ when X + Y > 0? (e) Suppose θ has the Beta(a, b) prior. What isthe posterior mean of θ? (f) Which of the following estimators of θ are unbiased? (i)δ1(x, y) = x/2; (ii) δ2(x, y) = 1− y/2; (iii) δ3(x, y) = (x − y)/4 + 1/2. (g) Find theMSE for each of the three estimators in part (f). Also, find the maximum MSE foreach. Which has the lowest maximum? Which is best for θ near 0? Which is best forθ near 1?

Exercise 13.8.3. Suppose X and Y are independent, with X ∼ Binomial(n, θ) and Y ∼Binomial(m, θ), θ ∈ (0, 1), and let T = X + Y. (a) Does the conditional distributionof X | T = t depend on θ? (b) Find the conditional distribution from part (a) forn = 6, m = 3, t = 4. (c) What is E[X | T = t] for n = 6, m = 3, t = 4? (d) Now supposeX and Y are independent, but with X ∼ Binomial(n, θ1) and Y ∼ Binomial(m, θ2),θ = (θ1, θ2) ∈ (0, 1)× (0, 1). Does the conditional distribution X | T = t depend on θ?

Exercise 13.8.4. Suppose X1 and X2 are independent N(0, σ2)’s, σ2 ∈ (0, ∞), and let

R =√

X21 + X2

2 . (a) Find the conditional space of X given R = r, Xr. (b) Find the pdfof R. (c) Find the “density" of X | R = r. (It is not the density with respect to Lebesguemeasure on R2, but still is a density.) Does it depend on σ2? Does it depend on r?Does it depend on (x1, x2) other than through r? How does it relate to the conditionalspace? (d) What do you think the conditional distribution of X | R = r is?

Exercise 13.8.5. For each model, indicate which statistics are sufficient. (They do notneed to be minimal sufficient. There may be several correct answers for each model.)The models are each based on X1, . . . , Xn iid with some distribution. (Assume thatn > 3.) Here are the distributions and parameter spaces: (a) N(µ, 1), µ ∈ R. (b)N(0, σ2), σ2 > 0. (c) N(µ, σ2), (µ, σ2) ∈ R× (0, ∞). (d) Uniform(θ, 1 + θ), θ ∈ R.(e) Uniform(0, θ), θ > 0. (f) Cauchy(θ), θ ∈ R (so the pdf is 1/(π(1− (x − θ)2)).(g) Gamma(α, λ), (α, λ) ∈ (0, ∞) × (0, ∞). (h) Beta(α, β), (α, β) ∈ (0, ∞) × (0, ∞).(i) Logistic(θ), θ ∈ R (so the pdf is exp(x − θ)/(1 + exp(x − θ))2). (j) The “shiftedexponential (α, λ)”, which has pdf given in (13.87), where (α, λ) ∈ R× (0, ∞). (k) Themodel has the single distribution Uniform(0, 1).

The choices of sufficient statistics are below. For each model, decide which of the fol-lowing are sufficient for that model. (1) (X1, . . . , Xn), (2) (X(1), . . . , X(n)), (3) ∑n

i=1 Xi,(4) ∑n

i=1 X2i , (5) (∑n

i=1 Xi, ∑ni=1 X2

i ), (6) X(1), (7) X(n), (8) (X(1), X(n)) (9) (X(1), X), (10)(X, S2), (11) ∏n

i=1 Xi, (12) (∏ni=1 Xi, ∑n

i=1 Xi), (13) (∏ni=1 Xi, ∏n

i=1(1− Xi)), (14) 0.

Page 229: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

13.8. Exercises 217

Exercise 13.8.6. Suppose X1, X2, X3, . . . are iid Laplace(µ, σ), so that the pdf of xi isexp(−|xi − µ|/σ)/(2σ). (a) Find a minimal sufficient statistic if −∞ < µ < ∞ andσ > 0. (b) Find a minimal sufficient statistic if µ = 0 (known) and σ > 0.

Exercise 13.8.7. Suppose that X |Θ = θ has density f (x | θ), and S = s(x) is a suf-ficient statistic. Show that for any prior density π on Θ, the posterior density forΘ |X = x is the same as that for Θ | S = s if s = s(x). [Hint: Use Lemma 13.7 and(13.16) to show that both posterior densities equal

b(s, θ)π(θ)∫b(s, θ∗)π(θ∗)dθ∗

. ] (13.85)

Exercise 13.8.8. Show that each of the following models is an exponential fam-ily model. Give the natural parameters and statistics. In each case, the data areX1, . . . , Xn with the given distribution. (a) Poisson(λ) where λ > 0. (b) Exponential(λ)where λ > 0. (c) Gamma(α, λ), where α > 0 and λ > 0. (d) Beta(α, β), where α > 0and β > 0.

Exercise 13.8.9. Show that each of the following models is an exponential familymodel. Give the natural parameters and statistics. (a) X ∼ Binomial(n, p), wherep ∈ (0, 1). (b) X ∼ Multinomial(n, p), where p = (p1, . . . , pk), pk > 0 for all k,and ∑ pk = 1. Take the natural sufficient statistic to be X. (c) Take X as in part(b), but take the natural sufficient statistic to be (X1, . . . , XK−1). [Hint: Set XK =n− X1 − · · · − XK−1.]

Exercise 13.8.10. Suppose X ∼ Poisson(θ), θ ∈ (0, ∞), so that the MLE of θ is θ = X.Reparameterize to τ = g(θ) = exp(−θ), so that the parameter space of τ is (0,1). (a)Find the pmf for X in terms of τ, f ∗(x | τ). (b) Find the MLE of τ for the pmf in part(a). Does it equal exp(−θ)?

Exercise 13.8.11. Consider the statistical model with densities f (x | θ) for θ ∈ T .Suppose the function g : T → O is one-to-one and onto, so that a reparameterizationof the model has densities f ∗(x |ω) for ω ∈ Ω, where f ∗(x |ω) = f (x | g−1(ω)). (a)Show that θ uniquely maximizes f (x | θ) over θ if and only if ω ≡ g(θ) uniquelymaximizes f ∗(x |ω) over ω. [Hint: Show that f (x | θ) > f (x | θ) for all θ 6= θ impliesf ∗(x | ω) > f ∗(x |ω) for all ω 6= ω, and vice versa.] (b) Argue that if θ is the MLE ofθ, then g(θ) is the MLE of ω.

Exercise 13.8.12. Again consider the model in Exercise 13.8.11, but now supposeg : T → O is just onto, not one-to-one. Let g∗ be any function of θ such that thejoint function h(θ) = (g(θ), g∗(θ)), h : T → L, is one-to-one and onto, and set thereparameterized density as f ∗(x | λ) = f (x | h−1(λ)), λ ∈ L. Exercise 13.8.11 showsthat if θ uniquely maximizes f (x | θ) over T , then λ = h(θ) uniquely maximizesf ∗(x | λ) over L. Argue that if θ is the MLE of θ, that it is legitimate to define g(θ) tobe the MLE of ω = g(θ).

Exercise 13.8.13. Recall the fruit fly example in Section 6.4.4. Equation (6.114) hasthe pmf for one observation (Y1, Y2). The data consist of n iid such observations,(Yi1, Yi2), i = 1, . . . , n. Let nij be the number of pairs (Yi1, Yi2) that equal (i, j) for

Page 230: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

218 Chapter 13. Likelihood, Sufficiency, and MLEs

i = 0, 1 and j = 0, 1. (a) Show that the loglikelihood can be written

ln(θ) = (n00 + n01 + n10) log(1− θ) + n00 log(2− θ)

+ (n01 + n10 + n11) log(θ) + n11 log(1 + θ). (13.86)

(b) The data can be found in (6.54). Each Yij is the number of CUs in its genotype, sothat (TL, TL)⇒ 0 and (TL, CU)⇒ 1. Find n00, n01 + n10, and n11, and fill the valuesinto the loglikelihood. (c) Sketch the likelihood. Does there appear to be a uniquemaximum? If so, what is it approximately?

Exercise 13.8.14. Suppose X1, . . . , Xn are iid with shifted exponential pdf,

f (xi | α, λ) = λe−λ(xi−α) I[xi ≥ α], (13.87)

where (α, λ) ∈ R × (0, ∞). Find the MLE of (α, λ) when n = 4 and the data are10,7,12,15. [Hint: First find the MLE of α for fixed λ, and note that it does not dependλ.]

Exercise 13.8.15. Let X1, . . . , Xn be iid N(µ, σ2), −∞ < µ < ∞ and σ2 > 0. Considerestimates of σ2 of the form σ2

c = c ∑ni=1(Xi−X)2 for some c > 0. (a) Find the expected

value, bias, and variance of σ2c . (b) For which value of c is the estimator unbiased?

For which value is it the MLE? (c) Find the expected mean square error (MSE) of theestimator. For which value of c is the MSE mimimized? (d) Is the MLE unbiased?Does the MLE minimize the MSE? Does the unbiased estimator minimize the MSE?

Exercise 13.8.16. Suppose U1, . . . , Un are iid Uniform(µ − 1, µ + 1). The likelihooddoes not have a unique maximum. Let u(1) be the minimum and u(n) be the max-imum of the data. (a) The likelihood is maximized for any µ in what interval? (b)Recall the midrange umr = (u(1) + u(n))/2 from Exercise 5.6.15. Is umr one of themaxima of the likelihood?

Exercise 13.8.17. Let X1, . . . , Xn be a sample from the Cauchy(θ) distribution (whichhas pdf 1/(π(1 + (x − θ)2))), where θ ∈ R. (a) If n = 1, show that the MLE of θ isX1. (b) Suppose n = 7 and the observations are 10, 2, 4, 2, 5, 7, 1. Plot the loglikelihoodand likelihood equations. Is the MLE of θ unique? Does the likelihood equation havea unique root?

Exercise 13.8.18. Consider the simple linear regression model, where Y1, . . . , Yn areindependent, and Yi ∼ N(α + βxi, σ2) for i = 1, . . . , n. The xi’s are fixed, and assumethey are not all equal. (a) Find the likelihood of Y = (Y1, . . . , Yn)′. Write downthe log of the likelihood, l(α, β, σ2 ; y). (b) Fix σ2. Why is the MLE of (α, β) thesame as the least squares estimate of (α, β)? (c) Let α and β be the MLEs, and letSSe = ∑(yi − α − βxi)

2 be the residual sum of squares. Write l(α, β, σ2 ; y) as afunction of SSe and σ2 (and n). (d) Find the MLE of σ2. Is it unbiased?

Exercise 13.8.19. Now look at the multiple linear model, where Y ∼ N(xβ, σ2In) asin (12.10), and assume that x′x is invertible. (a) Show that the pdf of Y is

f (y | β, σ2) =1

(√

2πσ)ne−

12σ2 ‖y−xβ‖2

. (13.88)

Page 231: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

13.8. Exercises 219

[Hint: Use (7.28).] (b) Use (12.17) and (12.18) to write ‖y − xβ‖2 = SSe + (βLS −β)′x′x(βLS − β), where βLS is the least squares estimate of β, and SSe is the sum ofsquared errors. (c) Use parts (a) and (b) to write the likelihood as

L(β, σ2 ; y) =1

σ2 e−1

2σ2 (SSe+(βLS−β)′x′x(βLS−β)). (13.89)

(d) From (13.89), argue that the sufficient statistic is (βLS, SSe).

Exercise 13.8.20. Continue with the multiple regression model in Exercise 13.8.19.(a) Find β, the MLE of β. (You can keep σ2 fixed here.) Is it the same as the leastsquares estimate? (b) Find σ2, the MLE of σ2. Is it unbiased? (c) Find the value of theloglikelihood at the MLE, l(β, σ2 ; y).

Exercise 13.8.21. Suppose that for some power λ,

Yλ − 1λ

∼ N(µ, σ2). (13.90)

Generally, λ will be in the range from -2 to 2. We are assuming that the parametersare such that the chance of Y being non-positive is essentially 0, so that λ can be afraction (i.e., tranforms like

√Y are real). (a) What is the limit as λ→ 0 of (yλ − 1)/λ

for y > 0? (b) Find the pdf of Y. (Don’t forget the Jacobian. Note that if W ∼ N(µ, σ2),then y = g(w) = (λw + 1)1/λ. So g−1(y) is (yλ − 1)/λ.)

Now suppose Y1, . . . , Yn are independent, and x1, . . . , xn are fixed, with

Yλi − 1

λ∼ N(α + βxi, σ2). (13.91)

In regression, one often takes transformations of the variables to get a better fittingmodel. Taking logs or square roots of the yi’s (and maybe the xi’s) can often beeffective. The goal here is to find the best transformation by finding the MLE of λ,as well as of α and β and σ2. This λ represents a power transformation of the Yi’s,called the Box-Cox transformation.

The loglikelihood of the parameters based on the yi’s depends on α, β, σ2 and λ.For fixed λ, it can be maximized using the usual least-squares theory. So let

RSSλ = ∑(

yλi − 1

λ− αλ − βλxi

)2

(13.92)

be the residual sum of squares, considering λ fixed and the (yλi − 1)/λ’s as the de-

pendent variable’s observations. (c) Show that the loglikelihood can be written as thefunction of λ,

h(λ) ≡ ln(αλ, βλ, σ2λ, λ ; y′is) = −

n2

log(RSSλ) + λ ∑ log(yi). (13.93)

Then to maximize this over λ, one tries a number of values for λ, each time per-forming a new regression on the (yλ

i − 1)/λ’s, and takes the λ that maximizes theloglikelihood. (d) As discussed in Section 12.5.2, Jung et al. (2014) collected data onthe n = 94 most dangerous hurricanes in the US since 1950. Let Yi be the estimate

Page 232: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

220 Chapter 13. Likelihood, Sufficiency, and MLEs

of damage by hurricane i in millions of 2014 dollars (plus 1, to avoid taking log of0), and xi be the minimum atmospheric pressure in the storm. Lower pressure leadsto more severe storms. These two variables can be downloaded directly into R usingthe command

source("http://istics.net/r/hurricanes.R")

Apply the results on the Box-Cox transformation to these data. Find the h(λ) for agrid of values of λ from -2 to 2. What value of λ maximizes the loglikelihood, andwhat is that maximum value? (e) Usually one takes a more understandable powernear the MLE. Which of the following transformations has loglikelihood closest tothe MLE’s: 1/y2, 1/y, 1/

√y, log(y),

√y, y, or y2? (f) Plot x versus (yλ − 1)/λ, where

λ is the MLE. Does it look like the usual assumptions for linear regression in (12.10)are reasonable?

Exercise 13.8.22. Suppose X ∼ Multinomial(n, p). Let the parameter be (p1, . . .,pK−1), so that pK = 1− p1−· · ·− pK−1. The parameter space is then (p1, . . . , pK−1) |0 < pi for each i, and p1 + · · ·+ pK−1 < 1. Show that the MLE of p is X/n.

Exercise 13.8.23. Take (X1, X2, X3, X4) ∼ Multinomial(n, (p1, p2, p3, p4)). Put the pi’sin a table:

p1 p2 αp3 p4 1− αβ 1− β 1

(13.94)

Here, α = p1 + p2 and β = p1 + p3. Assume the model that the rows and columnsare independent, meaning

p1 = αβ, p2 = α(1− β), p3 = (1− α)β, p4 = (1− α)(1− β). (13.95)

(a) Write the loglikelihood as a function of α and β (not the pi’s). (b) Find the MLEsof α and β. (c) What are the MLEs of the pi’s?

Exercise 13.8.24. Suppose X1, . . . , Xn are iid N(µ, 1), µ ∈ R, so that with X =(X1, . . . , Xn)′, we have from (13.53) the conditional distribution

X |X = x ∼ N(x 1n, Hn), (13.96)

where Hn = In − (1/n) 1n1′n is the centering matrix. Assume n ≥ 2. (a) FindE[X1 |X = x]. (b) Find E[X2

1 |X = x]. (c) Find E[X1X2 |X = x]. (d) Considerestimating g(µ) = 0. (i) Is δ(X) = X1 − X2 an unbiased estimator of 0? (ii) What isthe variance of δ? (iii) Find δ∗(x) ≡ E[δ(X) |X = x]. Is it unbiased? (iv) What is thevariance of δ∗? (v) Which estimator has a lower variance? (e) Consider estimatingg(µ) = µ2. (i) The estimator δ(X) = X2

1 − a is unbiased for what a? (ii) What is thevariance of δ? (iii) Find δ∗(x) ≡ E[δ(X) |X = x]. Is it unbiased? (iv) What is thevariance of δ∗? (v) Which estimator has a lower variance? (f) Continue estimatingg(µ) = µ2. (i) The estimator δ(X) = X1X2 − a is unbiased for what a? (ii) What isthe variance of δ? (iii) Find δ∗(x) ≡ E[δ(X) |X = x]. Is it unbiased? (iv) What is thevariance of δ∗? (v) Which estimator has a lower variance? (g) Compare the estimatorsδ∗ in (e)(iii) and (f)(iii).

Page 233: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

Chapter 14

More on Maximum Likelihood Estimation

In the previous chapter we have seen some situations where maximum likelihoodhas yielded fairly reasonable estimators. In fact, under certain conditions, MLEs areconsistent, asymptotically normal as n → ∞, and have optimal asymptotic standarderrors. We will focus on the iid case, but the results have much wider applicability.

The first three sections of this chapter show how to use maximum likelihood tofind estimates and their asymptotic standard errors and confidence intervals. Sec-tions 14.4 on present the technical conditions and proofs for the results. Most of thepresentation presumes a one-dimensional parameter. Section 14.8 extends the resultsto multidimensional parameters.

14.1 Score function

Suppose X1, . . . , Xn are iid, each with space X and density f (x | θ), where θ ∈ T ⊂ R,so that θ is one-dimensional. There are a number of technical conditions that need tobe satisfied for what follows, which will be presented in Section 14.4. Here, we notethat we do need to have continuous first, second, and third derivatives with respectto the parameters, and

f (x | θ) > 0 for all x ∈ X , θ ∈ T . (14.1)

In particular, (14.1) rules out the Uniform(0, θ), since the sample space would dependon the parameters. Which is not to say that the MLE is bad in this case, but that theasymptotic normality, etc., does not hold.

By independence, the overall likelihood is ∏ f (xi | θ), hence the loglikelihood is

ln(θ ; x1, . . . , xn) =n

∑i=1

log( f (xi | θ)) =n

∑i=1

l1(θ ; xi), (14.2)

where l1(θ ; x) = log( f (x | θ)) is the loglikelihood for one observation. The MLEis found by differentiating the loglikelihood, which is the sum of the derivativesof the individual loglikelihoods, or score functions. For one observation, the scoreis l′1(θ ; xi). The score for the entire set of data is l′n(θ; x1, . . . , xn), the sum of the

221

Page 234: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

222 Chapter 14. More on Maximum Likelihood Estimation

individual scores. The MLE θn then satisfies

l′n(θn ; x1, . . . , xn) =n

∑i=1

l′1(θn ; xi) = 0. (14.3)

We are assuming that there is a unique solution, and that it does indeed maximizethe loglikelihood.

If one is lucky, there is a closed-form solution. Mostly there will not be, so thatsome iterative procedure will be necessary. The Newton-Raphson method is a pop-ular approach, which can be quite quick if it works. The idea is to expand l′n in aone-step Taylor series around an initial guess of the solution, θ(0), then solve for θ

to obtain the next guess θ(1). Given the jth guess θ(j), we have (dropping the xi’s forsimplicity)

l′n(θ) ≈ l′n(θ(j)) + (θ − θ(j))l′′n (θ

(j)). (14.4)

We know that l′n(θ) = 0, so we can solve approximately for θ:

θ ≈ θ(j) − l′n(θ(j))

l′′n (θ(j))≡ θ(j+1). (14.5)

This θ(j+1) is our next guess for the MLE. We iterate until the process converges, if itdoes, in which case we have our θ.

14.1.1 Fruit flies

Go back to the fruit fly example in Section 6.4.4. Equation (6.114) has the pmf of eachobservation, and (6.54) contains the data. Exercise 13.8.13 shows that the loglikeli-hood can be written

ln(θ) = 7 log(1− θ) + 5 log(2− θ) + 5 log(θ) + 3 log(1 + θ). (14.6)

Starting with the guess θ(0) = 0.5, the iterations for Newton-Raphson proceed asfollows:

j θ(j) l′n(θ(j)) l′′n (θ(j))0 0.500000 −5.333333 −51.555561 0.396552 0.038564 −54.501612 0.397259 0.000024 −54.433773 0.397260 0 −54.43372

(14.7)

The process has converged sufficiently, so we have θMLE = 0.3973. Note this estimateis very close to the Dobzhansky estimate of 0.4 found in (6.65) and therebelow.

14.2 Fisher information

The likelihood, or loglikelihood, is supposed to reflect the relative support for variousvalues of the parameter given by the data. The MLE is the value with the mostsupport, but we would also like to know which other values have almost as muchsupport. One way to assess the range of highly-supported values is to look at thelikelihood near the MLE. If it falls off quickly as we move from the MLE, then we

Page 235: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

14.2. Fisher information 223

4.0 4.5 5.0 5.5 6.0

−0.

10−

0.08

−0.

06−

0.04

−0.

020.

00

logl

ikel

ihoo

d

θ

n=1n=5n=100

Figure 14.1: Loglikelihood for the Poisson with x = 5 and n = 1, 5, and 100.

have more confidence that the true parameter is near the MLE than if the likelihoodfalls way slowly. For example, consider Figure 14.1. The data are iid Poisson(θ)’swith sample mean being 5, and the three curves are the loglikelihoods for n = 1, 5,and 100. In each case the maximum is at θ = 5. The flattest loglikelihood is that forn = 1, and the one with the narrowest curve is that for n = 100. Note that for n = 1,there are many values of θ that have about the same likelihood as the MLE. Thusthey are almost as likely. By contrast, for n = 100, there is a distinct drop off from themaximum as one moves away from the MLE. Thus there is more information aboutthe parameter. The n = 5 case is in between the other two. Of course, we expectmore information with larger n. One way to quantify the information is to look atthe second derivative of the loglikelihood at the MLE: The more negative, the moreinformative.

In general, the negative second derivative of the loglikelihood is called the ob-served Fisher information in the data. It can be written

In(θ ; x1, . . . , xn) = −l′′n (θ ; x1, . . . , xn) =n

∑i=1I1(θ ; xi), (14.8)

where I1(θ ; xi) = −l′′1 (θ ; xi) is the observed Fisher information in the single obser-vation xi. The idea is that the larger the information, the more we know about θ. Inthe Poisson example above, the observed information is ∑ xi/θ2, hence at the MLEθ = xn it is n/xn. Thus the information is directly proportional to n (for fixed xn).

The (expected) Fisher information in one observation, I1(θ), is the expected valueof the observed Fisher information:

I1(θ) = E[I1(θ ; Xi)]. (14.9)

Page 236: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

224 Chapter 14. More on Maximum Likelihood Estimation

The Fisher information in the entire iid sample is

In(θ) = E[In(θ ; X1, . . . , Xn)] = E

[n

∑i=1I1(θ ; Xi)

]= nI1(θ). (14.10)

In the Poisson case, I1(θ) = E[Xi/θ2] = 1/θ, hence In(θ) = n/θ.For multidimensional parameters, the Fisher information is a matrix. See (14.80).

14.3 Asymptotic normality

One of the more amazing properties of the MLE is that, under very general condi-tions, it is asymptotically normal, with variance the inverse of the Fisher information.In the one-parameter iid case,

√n(θn − θ) −→D N

(0,

1I1(θ)

). (14.11)

To eliminate the dependence on θ in the normal, we can use Slutsky to obtain√nI1(θn) (θn − θ) −→D N(0, 1). (14.12)

It turns out that we can also use the observed Fisher information in place of the nI1:√In(θn) (θn − θ) −→D N(0, 1). (14.13)

Consider the fruit fly example in Section 14.1.1. Since the score function is thefirst derivative of the loglikelihood, minus the first derivative of the score function isthe observed Fisher information. Thus the Newton-Raphson process in (14.7) auto-matically presents us with In(θn) = −l′′n (θn) = 54.4338, hence an approximate 95%confidence interval for θ is(

0.3973± 2√54.4338

)= (0.3973± 2× 0.1355) = (0.1263, 0.6683). (14.14)

A rather wide interval.

14.3.1 Sketch of the proof

The regularity conditions and statement and proof of the main asymptotic resultsrequire a substantial amount of careful analysis, which we present in Sections 14.4 to14.6. Here we give the basic idea behind the asymptotic normality in (14.11). Startingwith the Taylor series as in the Newton-Raphson algorithm, write

l′n(θn) ≈ l′n(θ) + (θn − θ)l′′n (θ), (14.15)

where θ is the true value of the parameter, and the dependence on the xi’s is sup-pressed. Rearranging and inserting n’s in the appropriate places, we obtain

√n(θn − θ) ≈

√n

l′n(θ)l′′n (θ)

=

√n 1

n ∑ l′1(θ ; Xi)1n ∑ l′′1 (θ ; Xi)

=

√n 1

n ∑ l′1(θ ; Xi)

− 1n ∑ I1(θ ; Xi)

, (14.16)

Page 237: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

14.4. Cramér’s conditions 225

since I1(θ ; Xi) = −l′′1 (θ ; Xi).We will see in Lemma 14.1 that

Eθ [l′1(θ ; Xi)] = 0 and Varθ [l′1(θ ; Xi)] = I1(θ). (14.17)

Thus the central limit theorem shows that√

n1n ∑ l′1(θ ; Xi) −→D N(0, I1(θ)). (14.18)

Since E[I1(θ ; Xi)] = I1(θ) by definition, the law of large numbers shows that

1n ∑ I1(θ ; Xi) −→P I1(θ). (14.19)

Finally, Slutsky shows that√

n 1n ∑ l′1(θ ; Xi)

− 1n ∑ I1(θ ; Xi)

−→D N(0, I1(θ))

−I1(θ)= N

(0,

1I1(θ)

), (14.20)

as desired. Theorem 14.6 below deals more carefully with the approximation in(14.16) to justify (14.11).

If the justification in this section is satisfactory, you may want to skip to Section14.7 on asymptotic efficiency, or Section 14.8 on the multiparameter case.

14.4 Cramér’s conditions

Cramér (1999) was instrumental in applying rigorous mathematics to the study ofstatistics. In particular, he provided technical conditions under which the likelihoodresults are valid. The conditions easily hold in exponential families, but for otherdensities they may or may not be easy to verify. We start with X1, . . . , Xn iid, eachhaving space X and pdf f (x | θ), where θ ∈ T = (a, b) for fixed −∞ ≤ a < b ≤ ∞.First, we need that the space of Xi is the same for each θ, which is satisfied if

f (x | θ) > 0 for all x ∈ X , θ ∈ T . (14.21)

We also need that

∂ f (x | θ)∂θ

,∂2 f (x | θ)

∂θ2 ,∂3 f (x | θ)

∂θ3 exist for all x ∈ X , θ ∈ T . (14.22)

In order for the score and information functions to exist and behave correctly, assumethat for any θ ∈ T , ∫

X

∂ f (x | θ)∂θ

dx =∂

∂θ

∫X

f (x | θ)dx (= 0)

and∫X

∂2 f (x | θ)∂θ2 dx =

∂2

∂θ2

∫X

f (x | θ)dx (= 0). (14.23)

(Replace the integrals with sums for the discrete case.)Recall the Fisher information in one observation from (14.8) and (14.9) is given by

I1(θ) = −Eθ [l′′1 (θ ; x)]. (14.24)

Assume that0 < I1(θ) < ∞ for all θ ∈ T . (14.25)

We have the following.

Page 238: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

226 Chapter 14. More on Maximum Likelihood Estimation

Lemma 14.1. If (14.21), (14.22), and (14.23) hold, then

Eθ [l′1(θ ; X)] = 0 and Varθ [l′1(θ ; X)] = I1(θ). (14.26)

Proof. First, since l1(θ ; x) = log( f (x | θ)),

Eθ [l′1(θ ; X)] = Eθ

[∂

∂θlog( f (x | θ))

]=∫X

∂ f (x | θ)/∂θ

f (x | θ) f (x | θ)dx

=∫X

∂ f (x | θ)∂θ

dx

= 0 (14.27)

by (14.23). Next, write

I1(θ) = −Eθ [l′′1 (θ ; X)] = −Eθ

[∂2

∂θ2 log( f (X|θ))]

= −Eθ

[∂2 f (X|θ)/∂θ2

f (X|θ) −(

∂ f (X|θ)/∂θ

f (X|θ)

)2]

= −∫X

∂2 f (x|θ)/∂θ2

f (x|θ) f (x|θ)dx + Eθ [l′1(θ ; X)2]

= − ∂2

∂θ2

∫X

f (x|θ)dx + Eθ [l′1(θ ; X)2]

= Eθ [l′1(θ ; X)2] (14.28)

again by (14.23), which with Eθ [l′1(θ ; X)] = 0 proves (14.26).

One more technical assumption we need is that for each θ ∈ T (which will takethe role of the “true” value of the parameter), there exists an ε > 0 and a functionM(x) such that

|l′′′1 (t ; x)| ≤ M(x) for θ − ε < t < θ + ε, and Eθ [M(X)] < ∞. (14.29)

14.5 Consistency

First we address the question of whether the MLE is a consistent estimator of θ. Theshort answer is “Yes,” although things can get sticky if there are multiple maxima.But before we get to the results, there are some mathematical prerequisites to dealwith.

14.5.1 Convexity and Jensen’s inequality

Definition 14.2. Convexity. A function g : X → R, X ⊂ R, is convex if for each x0 ∈ X ,there exist α0 and β0 such that

g(x0) = α0 + β0x0, (14.30)

Page 239: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

14.5. Consistency 227

andg(x) ≥ α0 + β0x for all x ∈ X . (14.31)

The function is strictly convex if (14.30) holds, and

g(x) > α0 + β0x for all x ∈ X , x 6= x0. (14.32)

The definition basically means that the tangent to g at any point lies below thefunction. If g′′(x) exists for all x, then g is convex if and only if g′′(x) ≥ 0 for all x,and it is strictly convex if and only if g′′(x) > 0 for all x. Notice that the line definedby a0 and b0 need not be unique. For example, g(x) = |x| is convex, but when x0 = 0,any line through (0, 0) with slope between ±1 will lie below g.

By the same token, any line segment connecting two points on the curve lies abovethe curve, as in the next lemma.

Lemma 14.3. If g is convex, x, y ∈ X , and 0 < ε < 1, then

εg(x) + (1− ε)g(y) ≥ g(εx + (1− ε)y). (14.33)

If g is strictly convex, then

εg(x) + (1− ε)g(y) > g(εx + (1− ε)y) for x 6= y. (14.34)

Rather than prove this lemma, we will prove the more general result for randomvariables.

Lemma 14.4. Jensen’s inequality. Suppose that X is a random variable with space X , andthat E[X] exists. If the function g is convex, then

E[g(X)] ≥ g(E[X]), (14.35)

where E[g(X)] may be +∞. Furthermore, if g is strictly convex and X is not constant,

E[g(X)] > g(E[X]). (14.36)

Proof. We’ll prove it just in the strictly convex case, when X is not constant. The othercase is easier. Apply Definition 14.2 with x0 = E[X], so that

g(E[X]) = α0 + β0E[X], and g(x) > α0 + β0x for all x 6= E[X]. (14.37)

But thenE[g(X)] > E[α0 + β0X] = α0 + β0E[X] = g(E[X]), (14.38)

because there is a positive probability X 6= E[X].

Why does Lemma 14.4 imply Lemma 14.3? [Take X to be the random variablewith P[X = x] = ε and P[X = y] = 1− ε.]

A mnemonic device for which way the inequality goes is to think of the convexfunction g(x) = x2. Jensen implies that

E[X2] ≥ E[X]2, (14.39)

but that is the same as saying that Var[X] ≥ 0. Also, Var[X] > 0 unless X is aconstant.

Convexity is also defined for x being a p × 1 vector, so that X ⊂ Rp, in whichcase the line α0 + β0x in Definition 14.2 becomes a hyperplane α0 + β′0x. Jensen’sinequality follows as well, where we just turn X into a vector.

Page 240: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

228 Chapter 14. More on Maximum Likelihood Estimation

14.5.2 A consistent sequence of roots

For now, we assume that the space does not depend on θ (14.21), and the first deriva-tive of the loglikelihood in (14.22) is continuous. We need identifiability, whichmeans that if θ1 6= θ2, then the distributions of Xi under θ1 and θ2 are different. Also,for each n and x1, . . . , xn, there exists a unique solution to l′n(θ ; x1, . . . , xn) = 0:

l′n(θn ; x1, . . . , xn) = 0, θn ∈ T . (14.40)

Note that this θn is a function of x1, . . . , xn. It is also generally the maximum like-lihood estimate, although it is possible it is a local minimum or an inflection pointrather than the maximum.

Now suppose θ is the true parameter, and take ε > 0. Look at the difference,divided by n, of the likelihoods at θ and θ + ε:

1n(ln(θ ; x1, . . . , xn)− ln(θ + ε ; x1, . . . , xn)) =

1n

n

∑i=1

log(

f (xi | θ)f (xi | θ + ε)

)=

1n

n

∑i=1− log

(f (xi | θ + ε)

f (xi | θ)

). (14.41)

The final expression is the mean of iid random variables, hence by the WLLN itconverges in probability to the expected value of the summand (and dropping thexi’s in the notation for convenience):

1n(ln(θ)− ln(θ + ε)) −→P Eθ

[− log

(f (X | θ + ε)

f (X | θ)

)]. (14.42)

Now apply Jensen’s inequality, Lemma 14.4, to that expected value, with g(x) =− log(x), and the random variable being f (X | θ + ε)/ f (X | θ). This g is strictly con-vex, and the random variable is not constant because the parameters are different(identifiability), hence

[− log

(f (X | θ + ε)

f (X | θ)

)]> − log

(Eθ

[f (X | θ + ε)

f (X | θ)

])= − log

(∫X

f (x | θ + ε)

f (x | θ) f (x | θ)dx)

= − log(∫X

f (x | θ + ε)dx)

= − log(1)= 0. (14.43)

The same result holds for θ − ε, hence

1n(ln(θ)− ln(θ + ε)) −→P c > 0 and

1n(ln(θ)− ln(θ − ε)) −→P d > 0. (14.44)

These equations mean that eventually, the likelihood at θ is higher than that at θ ± ε.Precisely,

Pθ [ln(θ) > ln(θ + ε) and ln(θ) > ln(θ − ε)] −→ 1. (14.45)

Page 241: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

14.6. Proof of asymptotic normality 229

Note that if ln(θ) > ln(θ + ε) and ln(θ) > ln(θ − ε), then between θ − ε and θ + ε,the likelihood goes up then comes down again. Because the derivative is continuous,somewhere between θ ± ε the derivative must be 0. By assumption, that point is theunique root θn. It is also the maximum. Which means that

ln(θ) > ln(θ + ε) and ln(θ) > ln(θ − ε) =⇒ θ − ε < θn < θ + ε. (14.46)

By (14.45), the left hand side of (14.46) has probability going to 1, hence

P[|θn − θ| < ε]→ 1 =⇒ θn −→P θ, (14.47)

and the MLE is consistent.The requirement that there is a unique root (14.40) for all n and set of xi’s is too

strong. The main problem is that sometimes the maximum of the likelihood does notexist over T = (a, b), but at a or b. For example, in the binomial case, if the numberof successes is 0, then the MLE of p would be 0, which is not in (0,1). Thus in the nexttheorem, we need only that probably there is a unique root.

Theorem 14.5. Suppose that

Pθ [l′n(t ; X1, . . . , Xn) has a unique root θn ∈ T ] −→ 1. (14.48)

Thenθn −→P θ. (14.49)

Technically, if there is not a unique root, you can choose θn to be whatever youwant, but typically it would be either one of a number of roots, or one of the limitingvalues a and b. Equation (14.48) does not always hold. For example, in the Cauchylocation-family case, the number of roots goes in distribution to 1 + Poisson(1/π)(Reeds, 1985), so there is always a good chance of two or more roots. But it willbe true that if you pick the right root, e.g., the one closest to the median, it will beconsistent.

14.6 Proof of asymptotic normality

To find the asymptotic distribution of the MLE, we first expand the derivative of thelikelihood around θ = θn:

l′n(θn) = l′n(θ) + (θn − θ) l′′n (θ) +12 (θn − θ)2 l′′′n (θ∗n), θ∗n between θ and θn. (14.50)

(Recall that these functions depend on the xi’s.) If θn is a root of l′n as in (14.40), then

0 = l′n(θ) + (θn − θ) l′′n (θ) +12 (θn − θ)2 l′′′n (θ∗n)

=⇒ (θn − θ)(l′′n (θ) +12 (θn − θ) l′′′n (θ∗n)) = −l′n(θ)

=⇒√

n (θn − θ) = −√

n 1n l′n(θ)

1n l′′n (θ) + (θn − θ) 1

2n l′′′n (θ∗n). (14.51)

The task is then to find the limits of the three terms on the right: the numerator andthe two summands in the denominator.

Page 242: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

230 Chapter 14. More on Maximum Likelihood Estimation

Theorem 14.6. Cramér. Suppose that the assumptions in Section 14.4 hold, i.e., (14.21),(14.22), (14.23), (14.25), and (14.29). Also, suppose that θn is a consistent sequence of rootsof (14.40), that is, l′n(θn) = 0 and θn →P θ, where θ is the true parameter. Then

√n (θn − θ) −→D N

(0,

1I1(θ)

). (14.52)

Proof. From the sketch of the proof in Section 14.3.1, (14.18) gives us

√n

1n

l′n(θ) −→D N(0, I1(θ)), (14.53)

and (14.19) gives us1n

l′′n (θ) −→P −I1(θ). (14.54)

Consider the M(xi) from assumption (14.29). By the WLLN,

1n

n

∑i=1

M(Xi) −→P Eθ [M(X)] < ∞, (14.55)

and we have assumed that θn →P θ, hence

(θn − θ)1n

n

∑i=1

M(Xi) −→P 0. (14.56)

Thus for any δ > 0,

P[|θn − θ| < δ and |(θn − θ)1n

n

∑i=1

M(Xi)| < δ] −→ 1. (14.57)

Now take the δ < ε, where ε is from the assumption (14.29). Then

|θn − θ| < δ =⇒ |θ∗n − θ| < δ

=⇒ |l′′′1 (θ∗n ; xi)| ≤ M(xi) by (14.29)

=⇒ | 1n

l′′′n (θ∗n)| ≤1n

n

∑i=1

M(xi). (14.58)

Thus

|θn − θ| < δ and |(θn − θ)1n

n

∑i=1

M(Xi)| < δ =⇒ |(θn − θ)1n

l′′′n (θ∗n)| < δ, (14.59)

and (14.57) shows that

P[|(θn − θ)1n

l′′′n (θ∗n)| < δ] −→ 1. (14.60)

That is,

(θn − θ)1n

l′′′n (θ∗n) −→P 0. (14.61)

Page 243: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

14.7. Asymptotic efficiency 231

Putting together (14.53), (14.54), and (14.61),

√n (θn − θ) = −

√n 1

n l′n(θ)1n l′′n (θ) + (θn − θ) 1

2n l′′′n (θ∗n)

−→D −N(0, I1(θ))

−I1(θ) + 0

= N(

0,1I1(θ)

), (14.62)

which proves the theorem (14.52).

Note. The assumption that we have a consistent sequence of roots can be relaxed tothe condition (14.48), that is, θn has to be a root of l′n only with high probability:

P[l′n(θn) = 0 and θn ∈ T ] −→ 1. (14.63)

If I1(θ) is continuous, In(θn)/n →P I1(θ), so that (14.12) holds here, too. Itmay be that I1(θ) is annoying to calculate. One can instead use the observed FisherInformation as in (14.13), In(θn). The advantage is that the second derivative itself isused, and the expected value of it does not need to be calculated. Using θn yields aconsistent estimate of I1(θ):

1nIn(θn) = −

1n

l′′n (θ)− (θn − θ)1n

l′′′n (θ∗n)

−→P I1(θ) + 0, (14.64)

by (14.54) and (14.61). It is thus legitimate to use, for large n, either of the followingas approximate 95% confidence intervals:

θn ± 21√

n I1(θn)(14.65)

orθn ± 2

1√In(θn)

or, equivalently, θn ± 21√

− l′′n (θn). (14.66)

14.7 Asymptotic efficiency

We do not expected the MLE to be unbiased. In fact, it may be that the mean orvariance of the MLE does not exist. For example, the MLE for 1/λ in the Poisson caseis 1/Xn, which does not have a finite mean because there is a positive probabilitythat Xn = 0. But under the given conditions, if n is large, the MLE is close indistribution to a random variable that is unbiased and has optimal (in a sense givenbelow) asymptotic variance.

A sequence δn is a consistent and asymptotically normal sequence of estimators ofg(θ) if

δn −→P g(θ) and√

n (δn − g(θ)) −→D N(0, σ2g(θ)) (14.67)

Page 244: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

232 Chapter 14. More on Maximum Likelihood Estimation

for some σ2g(θ). That is, it is consistent and asymptotically normal. The asymptotic

normality implies the consistency, because g(θ) is subtracted from the estimator inthe second convergence.

Theorem 14.7. Suppose the conditions in Section 14.4 hold. If the sequence δn is a consistentand asymptotically normal estimator of g(θ), and g′ is continuous, then

σ2g(θ) ≥

g′(θ)2

I1(θ)(14.68)

for all θ ∈ T except perhaps for a set of Lebesgue measure 0.

See Bahadur (1964) for a proof. That coda about “Lebesgue measure 0” is therebecause it is possible to trick up the estimator so that it is “superefficient” at a fewpoints. If σ2

g(θ) is continuous in θ, then you can ignore that bit. Also, the conditionsneed not be quite as strict as in Section 14.4 in that the part about the third deriva-tive in (14.22) can be dropped, and (14.29) can be changed to be about the secondderivative.

Definition 14.8. If the conditions above hold, then the asymptotic efficiency of the sequenceδn is

AEθ(δn) =g′(θ)2

I1(θ)σ2g(θ)

. (14.69)

If the asymptotic efficiency is 1, then the sequence is said to be asymptotically efficient.

A couple of immediate consequences follow, presuming the conditions hold.

1. The maximum likelihood estimator of θ is asymptotically efficient, becauseσ2(θ) = 1/I1(θ) and g′(θ) = 1.

2. If θn is an asymptotically efficient estimator of θ, then g(θn) is an asymptoticallyefficient estimator of g(θ) by the ∆-method.

Recall that in Section 9.2.1 we introduced the asymptotic relative efficiency oftwo estimators. Here, we see that the asymptotic efficiency of an estimator is itsasymptotic relative efficiency to the MLE.

14.7.1 Mean and median

Recall Section 9.2.1, where we compared the median and mean as estimators of thesample median θ in some location families. Here we look at the asymptotic efficien-cies. For given base pdf f , the densities we consider are f (x− µ) for µ ∈ R. In orderto satisfy condition (14.21), we need that f (x) > 0 for all x ∈ R, which rules out theuniform.

We first need to find the Fisher information. Since l1(µ ; xi) = log( f (xi − µ)),l′1(µ ; xi) = − f ′(xi − µ)/ f (xi − µ). Using (14.28), we have that

I1(µ) = E[l′1(µ ; Xi)2] =

∫ ∞

−∞

(f ′(x− µ)

f (x− µ)

)2f (x− µ)dx

=∫ ∞

−∞

f ′(x)2

f (x)dx. (14.70)

Page 245: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

14.7. Asymptotic efficiency 233

Note that the information does not depend on µ.For example, consider the logistic, which has

f (x) =ex

(1 + ex)2 . (14.71)

Then

f ′(x)f (x)

=∂

∂xlog( f (x)) =

∂x(x− 2 log(1 + ex)) = 1− 2

ex

1 + ex , (14.72)

and

I1(0) =∫ ∞

−∞

(1− 2

ex

1 + ex

)2 ex

(1 + ex)2 dx

=∫ 1

0(1− 2(1− u))2u(1− u)

duu(1− u)

= 4∫ 1

0(u− 1

2 )2du =

13

, (14.73)

where we make the change of variables u = 1/(1 + ex).Exercise 14.9.6 finds the Fisher information for the normal, Cauchy, and Laplace.

The Laplace does not satisfy the conditions, since its pdf f (x) is not differentiableat x = 0, but the results still hold as long as we take Var[l′1(θ ; Xi)] as I1(θ). Thenext table exhibits the Fisher information, and asymptotic efficiencies of the meanand median, for these distributions. The σ2 is the variance of Xi, and the τ2 is thevariance in the asymptotic distribution of

√n(Mediann −µ), found earlier in (9.32).

Base distribution σ2 τ2 I1(µ) AE(Mean) AE(Median)Normal(0, 1) 1 π/2 1 1 2/π ≈ 0.6366Cauchy ∞ π2/4 1/2 0 8/π2 ≈ 0.8106Laplace 2 1 1 1/2 1Logistic π2/3 4 1/3 9/π2 ≈ 0.9119 3/4

(14.74)For these cases, the MLE is asymptotically efficient; in the normal case the MLE

is the mean, and in the Laplace case the MLE is the median. If you had to choosebetween the mean and the median, but weren’t sure which of the distributions is ineffect, the median would be the safer choice. Its efficiency ranges from about 64% to100%, while the mean’s efficiency can be 50% or even 0.

Lehmann (1991) (an earlier edition of Lehmann and Casella (2003)) in Table 4.4has more calculations for the asymptotic efficiencies of some trimmed means. The αtrimmed mean for a sample of n observations is the mean of the remaining observa-tions after removing the smallest and largest floor(nα) observations, where floor(x)is the largest integer less than or equal to x. The regular mean has α = 0, and themedian has α = 1/2 (or slightly lower than 1/2 if n is even). Here are some of the

Page 246: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

234 Chapter 14. More on Maximum Likelihood Estimation

asymptotic efficiencies:

f ↓ ; α → 0 1/8 1/4 3/8 1/2Normal 1.00 0.94 0.84 0.74 0.64Cauchy 0.00 0.50 0.79 0.88 0.81

t3 0.50 0.96 0.98 0.92 0.81t5 0.80 0.99 0.96 0.88 0.77

Laplace 0.50 0.70 0.82 0.91 1.00Logistic 0.91 0.99 0.95 0.86 0.75

(14.75)

This table can help you choose what trimming amount you would want to use, de-pending on what you think your f might be. You can see that between the mean anda small amount of trimming (1/8), the efficiencies of most distributions go up sub-stantially, while the normal’s goes down only a small amount. With 25% trimming,all have at least a 79% efficiency.

14.8 Multivariate parameters

The work so far assumed that θ was one-dimensional (although the data couldbe multidimensional). Everything follows for multidimensional parameters θ, withsome extended definitions. Now assume that T ⊂ RK , and that T is open. The scorefunction, the derivative of the loglikelihood, is now K-dimensional:

∇ln(θ) = ∇ln(θ ; x1, . . . , xn) =n

∑i=1∇l1(θ ; xi), (14.76)

where

∇l1(θ ; xi) =

∂l1(θ ; xi)

∂θ1...

∂l1(θ ; xi)∂θK

. (14.77)

The MLE then satisfies the equations

∇ln(θn) = 0. (14.78)

As in Lemma 14.1,E[∇l1(θ ; Xi)] = 0. (14.79)

Also, the Fisher information in one observation is a K× K matrix,

I1(θ) = Covθ[∇l1(θ ; Xi)] = Eθ[I1(θ ; Xi)], (14.80)

where I1 is the observed Fisher information matrix in one observation defined by

I1(θ ; xi) =

∂2 l1(θ ; xi)∂θ2

1

∂2 l1(θ ; xi)∂θ1∂θ2

· · · ∂2 l1(θ ; xi)∂θ1∂θK

∂2 l1(θ ; xi)∂θ2∂θ1

∂2 l1(θ ; xi)∂θ2

2· · · ∂2 l1(θ ; xi)

∂θ2∂θK

......

. . ....

∂2 l1(θ ; xi)∂θK∂θ1

∂2 l1(θ ; xi)∂θK∂θ2

· · · ∂2 l1(θ ; xi)∂θ2

K

. (14.81)

Page 247: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

14.8. Multivariate parameters 235

I won’t detail all the assumptions, but they are basically the same as before, exceptthat they apply to all the partial and mixed partial derivatives. The equation (14.25),

0 < I1(θ) < ∞ for all θ ∈ T , (14.82)

means that I1(θ) is positive definite, and all its elements are finite. The two mainresults are next.

1. If θn is a consistent sequence of roots of the derivative of the loglikelihood, then

√n (θn − θ) −→D N(0,I−1

1 (θ)). (14.83)

2. If δn is a consistent and asymptotically normal sequence of estimators of g(θ),where the partial derivatives of g are continuous, and

√n (δn − g(θ)) −→D N(0, σ2

g(θ)), (14.84)

thenσ2

g(θ) ≥ Dg(θ)I−11 (θ)Dg(θ)

′ (14.85)

for all θ ∈ T (except possibly for a few), where Dg is the 1× K vector of partialderivatives ∂g(θ)/∂θi as in the multivariate ∆-method in (9.48).

If θn is the MLE of θ, then the lower bound in (14.85) is the variance in the asymp-totic distribution of

√n(g(θn)− g(θ)). Which is to say that the MLE of g(θ) is asymp-

totically efficient.

14.8.1 Non-IID models

Often the observations under consideration are not iid, as in the regression model(12.3) where the Yi’s are independent but have different means depending on theirxi’s. Under suitable conditions, the asymptotic results will still hold for the MLE. Insuch case, the asymptotic distributions would use the Fisher’s information (observedor expected) on the left-hand side:

I1/2n (θn)(θn − θ) −→D N(0, IK) or

I1/2n (θn)(θn − θ) −→D N(0, IK). (14.86)

Of course, these two convergences hold in the iid case as well.

14.8.2 Common mean

Suppose X1, . . . , Xn and Y1, . . . , Yn are all independent, and

X′i s are N(µ, θX), Y′i s are N(µ, θY). (14.87)

That is, the Xi’s and Yi’s have the same means, but possibly different variances.Such data may arise when two unbiased measuring devices with possibly differentprecisions are used. We can use the likelihood results we have seen so far by pairingup the Xi’s with the Yi’s, so that we have (X1, Y1), . . . , (Xn, Yn) as iid vectors. In fact,

Page 248: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

236 Chapter 14. More on Maximum Likelihood Estimation

similar results will still hold if n 6= m, as long as the ratio n/m has a limit strictlybetween 0 and 1.

Exercise 14.9.3 shows that the score function in one observation is

∇l1(µ, θX , θY ; xi, yi) =

xi−µθX

+yi−µ

θY

− 12

1θX

+ 12

(xi−µ)2

θ2X

− 12

1θY

+ 12

(yi−µ)2

θ2Y

, (14.88)

and the Fisher information in one observation is

I1(µ, θX , θY) =

1

θX+ 1

θY0 0

0 12θ2

X0

0 0 12θ2

Y

. (14.89)

A multivariate version of the Newton-Raphson algorithm in (14.5) replaces theobserved Fisher information with its expectation. Specifically, letting θ = (µ, θX , θY)

be the parameter vector, we obtain the jth guess from the (j− 1)st one via

θ(j) = θ(j−1) + I−1n (θ(j−1))∇ln(θ(j−1)). (14.90)

We could use the observed Fisher information, but it is not diagonal, so the expectedFisher information is easier to invert. A bit of algebra shows that the updating reducesto the following:

µ(j) =xnθ

(j−1)Y + ynθ

(j−1)X

θ(j−1)X + θ

(j−1)Y

,

θ(j)X = s2

X + (xn − µ(j−1))2, and

θ(j)Y = s2

Y + (yn − µ(j−1))2. (14.91)

Here, s2X = ∑(xi − xn)2/n, and similarly for s2

Y .Denoting the MLE of µ by µn, we have that

√n(µn − µ) −→D N

(0,

θXθYθX + θY

). (14.92)

14.8.3 Logistic regression

In Chapter 12, we looked at linear regression, where the mean of the Yi’s is assumedto be a linear function of some xi’s. If the Yi’s are Bernoulli(pi), so take on onlythe values 0 and 1, then a linear model on E[Yi] = pi may not be appropriate as theα+ βxi could easily fall outside of the [0,1] range. A common solution is to model thelogit of the pi’s, which we saw way back in Exercise 1.7.15. The logit of a probabilityp is the log odds of the probability:

logit(p) = log(

p1− p

). (14.93)

Page 249: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

14.8. Multivariate parameters 237

This transformation has range R. A simple logistic regression model is based on(x1, Y1), . . . , (xn, Yn) independent observations, where for each i, xi is fixed, and Yi ∼Bernoulli(pi) with

logit(pi) = β0 + β1xi, (14.94)

β0 and β1 being the parameters. Multiple logistic regression has several x-variables,so that the model is

logit(pi) = β0 + β1xi1 + · · ·+ βKxiK . (14.95)

Analogous to the notation in (12.9) for linear regression, we write

logit(p) = xβ =

1 x11 x12 · · · x1K1 x21 x22 · · · x2K...

...... · · ·

...1 xn1 xn2 · · · xnK

β0β1β2...

βK

, (14.96)

where p = (p1, . . . , pn)′ and by logit(p) we mean (logit(p1), . . . , logit(pn))′.To use maximum likelihood, we first have to find the likelihood as a function of β.

The inverse function of z = logit(p) is p = ez/(1 + ez), so that since the likelihood ofY ∼ Bernoulli(p) is py(1− p)1−y = (p/(1− p))y(1− p), we have that the likelihoodfor the data Y = (Y1, . . . , Yn)′ is

Ln(β ; y) =n

∏i=1

(pi

1− pi

)yi

(1− pi)

=n

∏i=1

(exi β)yi (1 + exi β)−1

= ey′xβn

∏i=1

(1 + exi β)−1, (14.97)

where xi is the ith row of x. Note that we have an exponential family.Since the observations do not have the same distribution (the distribution of Yi

depends on xi), we deal with the score and Fisher information of the entire sample.The score function can be written as

∇l(β ; y) = x′(y− p), (14.98)

keeping in mind that the pi’s are functions of xiβ. The Fisher information is the sameas the observed Fisher information, which can be written as

In(β) = Cov[∇l(β ; y)] = x′diag(p1(1− p1), . . . , pn(1− pn))x, (14.99)

where diag(a1, . . . , an) is the diagonal matrix with the ai’s along the diagonal. TheMLE then can be found much as in (14.90), though using software such as R is easier.

Fahrmeir and Kaufmann (1985) show that the asymptotic normality is valid here,even though we do not have iid observations, under some conditions: The minimumeigenvalue of In(β) goes to ∞, and x′nI−1

n (β)xn → 0, as n → ∞. The latter followsfrom the former if the xi’s are bounded.

Page 250: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

238 Chapter 14. More on Maximum Likelihood Estimation

0.0

0.2

0.4

0.6

0.8

1.0

1 2 3 4

0 10 20 30 40 50

0 2 4 6 8 10

GPA

Party

Politics

Est

imat

ed p

roba

bilit

y

GPAPartyPolitics

Figure 14.2: The estimated probabilities of being Greek as a function of the variablesGPA, party, and politics, each with the other two being held at their average value.

When there are multiple observations with the same xi value, the model can beequivalently but more compactly represented as binomials. That is, the data areY1, . . . , Yq, independent, where Yi ∼ Binomial(ni, pi), and logit(pi) = xiβ as in (14.95).Now n = ∑

qi=1 ni, p is q× 1, and x is q× (K + 1) in (14.96). The likelihood can be

written

Ln(β ; y) = ey′xβq

∏i=1

(1 + exi β)−ni , (14.100)

so the score is∇l(β ; y) = x′(y− (n1 p1, . . . , nq pq)

′) (14.101)

and the Fisher information is

In(β) = Cov[∇l(β ; y)] = x′diag(n1 p1(1− p1), . . . , nq pq(1− pq))x. (14.102)

Being Greek

Here we will use a survey of n = 788 students to look at some factors that are relatedto people being Greek in the sense of being a member of a fraternity or sorority. TheYi’s are then 1 if that person is Greek, and 0 if not. The x-variables we will considerare gender (0 = male, 1 = female), GPA (grade point average), political views (from 0(most liberal) to 10 (most conservative), and average number of hours per week spent

Page 251: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

14.8. Multivariate parameters 239

partying. Next are snippets of the 788× 1 vector y and the 788× 4 matrix x:

y =

0001...0

and x =

1 0 2.8 16 31 1 3.9 0 81 0 3.1 4 31 0 3.7 10 6...

......

......

1 1 3.8 4 5

. (14.103)

To use R for finding the MLE, let y be the y vector, and x be the x matrix withoutthe first column of 1’s. The following will calculate and display the results:

lreg <− glm(y∼x,family="binomial")summary(lreg)

The estimate of β is

β =

−2.2133

0.5023−0.0384

0.11660.0673

. (14.104)

The standard errors can be found by first estimating the logit(pi)’s, turning theminto pi’s, inserting these into the formula (14.99) for Fisher’s information matrix, theninverting the matrix, to obtain an estimate of the asymptotic covariance:

Cov(β) = I−1n (β) =

1100

38.1772 −1.8705 −9.5683 −0.1392 −0.7459−1.8705 3.5100 −0.3490 0.0336 0.0505−9.5683 −0.3490 2.9939 −0.0125 0.0032−0.1392 0.0336 −0.0125 0.0196 0.0022−0.7459 0.0505 0.0032 0.0022 0.1466

.

(14.105)The estimated standard errors of the coefficients are the square roots of the diagonals.The next table has the estimates and standard errors (ignoring the intercept), plus theapproximate 95% confidence intervals.

βi se(βi) Confidence intervalGender 0.5023 0.1874 (0.1276, 0.8770)GPA −0.0384 0.1730 (−0.3845, 0.3076)Party 0.1166 0.0140 (0.0886, 0.1446)Politics 0.0673 0.0383 (−0.0092, 0.1439)

(14.106)

Thus we see that gender and hours partying have strong positive association withbeing Greek, GPA does not have much association at all, and politics is maybe mildlypositive (the more conservative, the more likely Greek).

Interpreting the coefficients is a little tricky. The ith coefficient estimates the av-erage additional log odds of being Greek associated with a one unit increase in theith variable, holding the other variables constant. For example, one more hour par-tying is associated with 0.1166 increase in log odds. Translating to probabilities isnon-linear, so depends on the original level. Here we will set all variables except forthe ith at their average, and vary the ith, to see the effect on the estimate probabilities.

Page 252: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

240 Chapter 14. More on Maximum Likelihood Estimation

For gender, we set that variable at 0 and 1, yielding 22.27% and 32.13%, respectively.Thus women are about 10% more likely to be Greek, at the average of the other vari-ables. Figure 14.2 plots the probabilities versus the other three variables. It is clearthat partying has the most drastic relationship with being Greek. GPA and politicsare both fairly flat.

14.9 Exercises

Exercise 14.9.1. Find the loglikelihood, score function, and Fisher’s information inone observation for the following distributions: (a) Bernoulli(p), p ∈ (0, 1). (b)Exponential(λ), λ > 0. (c) N(0, λ), λ > 0. (d) N(µ, µ2), µ > 0 (so that the coeffi-cient of variation is always 1).

Exercise 14.9.2. Continue with Exercise 13.8.13 on the fruit fly data. From (6.114) wehave that the data (N00, N01, N10, N11) is Multinomial(n, p(θ)), where Nab = #(Yi1,Yi2) = (a, b), and

p(θ) = ( 12 (1− θ)(2− θ), 1

2 θ(1− θ), 12 θ(1− θ), 1

2 θ(1 + θ)). (14.107)

Thus as in (13.86), the loglikelihood is

ln(θ) = (n00 + n01 + n10) log(1− θ) + n00 log(2− θ)

+ (n01 + n10 + n11) log(θ) + n11 log(1 + θ). (14.108)

(a) The observed Fisher information for the n observations is then

In(θ ; n00, n01, n10, n11) =1

(1− θ)2 A +1

(2− θ)2 B +1θ2 C +

1(1 + θ)2 D. (14.109)

Find A, B, C, D as functions of the nij’s. (b) Show that the expected Fisher informationis

In(θ) =n2

(2 + θ

1− θ+

1− θ

2− θ+

3− θ

θ+

θ

1 + θ

)=

3n(1− θ)(2− θ)θ(1 + θ)

. (14.110)

(c) The Dobzhansky estimator of θ is given in (6.65) to be θD = ∑ ∑ yij/(2n). Exercise6.8.16 shows that its variance is 3θ(1− θ)/(4n). Find the asymptotic efficiency of θD.Graph the efficiency it as a function of θ. What is its minimum? Maximum? For whatvalues of θ, if any, is the Dobzhansky estimator fully efficient (AE=1)?

Exercise 14.9.3. Consider the common mean problem from Section 14.8.2, so thatX1, . . . , Xn are iid N(µ, θX), Y1, . . . , Yn are iid N(µ, θY), and the Xi’s are independentof the Yi’s. (a) Show that the score function in one observation is

∇l1(µ, θX , θY ; xi, yi) =

xi−µθX

+yi−µ

θY

− 12

1θX

+ 12

(xi−µ)2

θ2X

− 12

1θY

+ 12

(yi−µ)2

θ2Y

. (14.111)

Page 253: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

14.9. Exercises 241

What is the expected value of the score? (b) The observed Fisher information matrixin one observation is

I1(µ, θX , θY ; xi, yi) =

1

θX+ 1

θY

xi−µ

θ2X

∗xi−µ

θ2X

− 12θ2

X+

(xi−µ)2

θ3X

∗∗ ∗ ∗

. (14.112)

Find the missing elements. (c) Show that the Fisher information in one observation,I1(µ, θX , θY), is as in (14.89). (d) Verify the asymptotic variance in (14.92).

Exercise 14.9.4. Continue with the data in Exercise 14.9.3. Instead of the MLE,consider estimators of µ of the form µa = aXn + (1 − a)Yn for some a ∈ [0, 1].(a) Find E[µa] and Var[µa]. Is µa unbiased? (b) Show that the variance is mini-mized for a equalling α = θY/(θX + θY). Is µα an unbiased estimator of µ? (c) Letαn = S2

Y/(S2X + S2

Y). Does αn →P α? (d) Consider the estimator µαn . Show that

√n(µαn − µ) −→D N

(0,

θXθYθX + θY

). (14.113)

[Hint: First show that√

n(µα − µ) ∼ N(0, θXθY/(θX + θY)). Then show that√

n(µαn − µ)−√

n(µα − µ) =√

n(Xn −Yn)(αn − α) −→P 0.] (14.114)

(e) What is the asymptotic efficiency of µαn ?

Exercise 14.9.5. Suppose X is from an exponential family model with pdf f (x | θ) =a(x) exp(θx− ψ(θ)) and parameter space θ ∈ (b, d). (It could be that b = −∞ and/ord = ∞.) (a) Show that the cumulant generating function is c(t) = ψ(t + θ)− ψ(θ).For which values of t is it finite? Is it finite for t in a neighborhood of 0? (b) Letµ(θ) = Eθ [X] and σ2(θ) = Varθ [X]. Show that µ(θ) = ψ′(θ) and σ2(θ) = ψ′′(θ). (c)Show that the score function for one observation is l′1(θ ; x) = x− µ(θ). (d) Show thatthe observed Fisher information and expected Fisher information in one observationare both I1(θ) = σ2(θ).

Now suppose X1, . . . , Xn are iid from f (x | θ). (e) Show that the MLE based on the nobservations is θn = µ−1(xn). (f) Show that dµ−1(w)/dw = 1/σ2(µ−1(w)). (g) Usethe ∆-method to show that

√n(θn − θ) −→D N

(0,

1σ2(θ)

), (14.115)

which proves (14.11) directly for one-dimensional exponential families.

Exercise 14.9.6. This exercise is based on the location family model with pdfs f (x−µ)for µ ∈ R. For each part, verify the Fisher information in one observation for fbeing the pdf of the given distribution. (a) Normal(0,1), I1(0) = 1. (b) Laplace. Inthis case, the first derivative of log( f (x)) is not differentiable at x = 0, but becausethe distribution is continuous, you can ignore that point when calculating I1(µ) =Varµ[l′1(µ ; X)] = 1. It will not work to use the second derivative to find the Fisherinformation. (c) Cauchy, I1(0) = 1/2. [Hint: Start by showing that

I1(0) =4π

∫ ∞

−∞

x2

(1 + x2)3 dx =8π

∫ ∞

0

x2

(1 + x2)3 dx =4π

∫ 1

0

√u√

1− udu, (14.116)

Page 254: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

242 Chapter 14. More on Maximum Likelihood Estimation

using the change of variables u = 1/(1 + x2). Then note that the integral looks likepart of a beta density.]

Exercise 14.9.7. Agresti (2013), Table 4.2, summarizes data on the relationship be-tween snoring and heart disease for n = 2484 adults. Observation (Yi, xi) indicateswhether person i had heart disease (Yi = 1) or did not have heart disease (Yi = 0),and the amount the person snored, in four categories. The table summarizes the data:

Heart disease?→ Yes NoFrequency of snoring ↓ xiNever −3 24 1355Occasionally −1 35 603Nearly every night 1 21 192Every night 3 30 224

(14.117)

(So that there are 24 people who never snore and have Yi = 1, and 224 people whosnore every night and have Yi = 0.) The model is the linear logistic one, withlogit(pi) = α + βxi, i = 1, . . . , n. The xi’s are categorical, but in order of snoringfrequency, so we will code them xi = −3,−1, 1, 3, as in the table. The MLEs areα = −2.79558, β = 0.32726. (a) Find the Fisher information in the entire sampleevaluated at the MLE, In(α, β). (b) Find I−1

n (α, β). (c) Find the standard errors of α

and β. Does the slope appear significantly different than 0? (d) For an individual iwith xi = 3, find the MLE of logit(pi) and its standard error. (e) For the person inpart (d), find the MLE of pi and its standard error. [Hint: Use the ∆-method.]

Exercise 14.9.8. Consider a set of teams. The chance team A beats team B in a singlegame is pAB. If these two teams do not play often, or at all, one cannot get a very goodestimate of pAB by just looking at those games. But we often do have information ofhow they did against other teams, good and bad, which should help in estimatingpAB.

Suppose pA is the chance team A beats the “typical” team, and pB the chancethat team B beats the typical team. Then even if A and B have never played eachother, one can use the following idea to come up with a pAB: Both teams flip a coinindependently, where the chance of heads is pA for team A’s coin and pB for teamB’s. Then if both are heads or both are tails, they flip again. Otherwise, whoever gotthe heads wins. They keep flipping until someone wins. (a) What is the probabilityteam A beats team B, pAB, in this scenario? (As a function of pA, pB.) If pA = pB,what is pAB? If pB = 0.5 (so that team B is typical), what is pAB? If pA = 0.6 andpB = 0.4, what is pAB? If both are very good: pA = 0.9999 and pB = 0.999, what ispAB? (b) Now let oA, oB be their odds of beating a typical team, (odds = p/(1− p)).Find oAB, the odds of team A beating team B, as a function of the individual odds(so the answer is in terms of oA, oB). (c) Let γA and γB be the corresponding logits(log odds), so that γi = logit(pi). Find γAB = log(oAB) as a function of the γ’s. (d)Now suppose there are 4 teams, i = 1, 2, 3, 4, and γi is the logit for team i beating thetypical team. Then the logits for team i beating team j, the γij’s, can be written as a

Page 255: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

14.9. Exercises 243

linear transformation of the γi’s:γ12γ13γ14γ23γ24γ34

= x

γ1γ2γ3γ4

. (14.118)

What is the matrix x? This model for the logits is called the Bradley-Terry model(Bradley and Terry, 1952).

Exercise 14.9.9. Continue Exercise 14.9.8 on the Bradley-Terry model. We look at thenumbers of times each team in the National League (baseball) beat the other teams in2015. The original data can be found at http://espn.go.com/mlb/standings/grid/_/year/2015. There are 15 teams. Each row is a paired comparison of a pairof teams. The first two columns of row ij contain the yij’s and (nij − yij)’s, whereyij is the number of times team i beat team j, and nij is the number of games theyplayed. The rest of the matrix is the x matrix for the logistic regression. The modelis that Yij ∼ Binomial(nij, pij), where logit(pij) = γi − γj. So we have a logisticregression model, where the x matrix is that in the previous problem, expanded to15 teams. Because the sum of each row is 0, the matrix is not full rank, so we dropthe last column, which is equivalent to setting γ15 = 0, which means that team #15,the Nationals, are the “typical” team. That is ok, since the logits depend only onthe differences of the γi’s. The file http://istics.net/r/nl2015.R contains thedata in an R matrix.

Here are the data for just the three pairings among the Cubs, Cardinals, and Brew-ers:

W LCubs vs. Brewers 14 5Cubs vs. Cardinals 8 11Brewers vs. Cardinals 6 13

(14.119)

Thus the Cubs and Brewers played 19 times, the Cubs winning 14 and the Brewerswinning 5. We found the MLEs and Fisher information for this model using all theteams. The estimated coefficients for these three teams are

γCubs = 0.4525, γBrewers = −0.2477, γCardinals = 0.5052. (14.120)

The part of the inverse of the Fisher information at the MLE pertaining to these threeteams is

Cubs Brewers CardinalsCubs 0.05892 0.03171 0.03256Brewers 0.03171 0.05794 0.03190Cardinals 0.03256 0.03190 0.05971

. (14.121)

(a) For each of the three pairings of the three above teams, find the estimate oflogit(pij) and its standard error. For which pair, if any, does the logit appear signifi-cantly different from 0? (I.e., 0 is not in the approximate 95% confidence interval.) (b)For each pair, find the estimated pij and its standard error. (c) Find the estimated ex-pected number of wins and losses for each matchup. That is, find (nij pij, nij(1− pij))for each pair. Compare these to the actual results in (14.119). Are the estimates closeto the actual wins and losses?

Page 256: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

244 Chapter 14. More on Maximum Likelihood Estimation

Exercise 14.9.10. Suppose(X1Y1

), · · · ,

(XnYn

)are iid N

((00

),(

1 ρρ 1

))(14.122)

for ρ ∈ (−1, 1). This question will consider the following estimators of ρ:

R1n =1n

n

∑i=1

XiYi;

R2n =∑n

i=1 XiYi√∑n

i=1 X2i ∑n

i=1 Y2i

;

R3n = the MLE. (14.123)

(a) Find I1(ρ), Fisher’s information in one observation. (b) What is the asymptoticvariance for R1n, that is, what is the σ2

1 (ρ) in

√n (R1n − ρ) −→D N(0, σ2

1 (ρ))? (14.124)

(c) Data on n = 107 students, where the Xi’s are the scores on the midterms, and Yiare the scores on the final, has

∑ xiyi = 73.31, ∑ x2i = 108.34, ∑ y2

i = 142.80. (14.125)

(Imagine the scores are normalized so that the population means are 0 and popula-tion variances are 1.) Find the values of R1n, R2n, and R3n for these data. Calculatingthe MLE requires a numerical method like Newton-Raphson. Are these estimatesroughly the same? (d) Find the asymptotic efficiency of the three estimators. Whichone (among these three) has the best asymptotic efficiency? (See (9.71) for the asymp-totic variance of R2n.) Is there an obvious worst one?

Page 257: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

Chapter 15

Hypothesis Testing

Estimation addresses the question, “What is θ?” Hypothesis testing addresses ques-tions like, “Is θ = 0?” Confidence intervals do both. It will give a range of plausiblevalues, and if you wonder whether θ = 0 is plausible, you just check whether 0 isin the interval. But hypothesis testing also addresses broader questions in whichconfidence intervals may be clumsy. Some types of questions for hypothesis testing:

• Is a particular drug more effective than a placebo?

• Are cancer and smoking related?

• Is the relationship between amount of fertilizer and yield linear?

• Is the distribution of income the same among men and women?

• In a regression setting, are the errors independent? Normal? Homoscedastic?

The main feature of hypothesis testing problems is that there are two competingmodels under consideration, the null hypothesis model and the alternative hypothe-sis model. The random variable (vector) X and space X are the same in both models,but the sets of distributions are different, being denoted P0 and PA for the null andalternative, respectively, where P0 ∩ PA = ∅. If P is the probability distribution forX, then the hypotheses are written

H0 : P ∈ P0 versus HA : P ∈ PA. (15.1)

Often both models will be parametric:

P0 = Pθ | θ ∈ T0 and PA = Pθ | θ ∈ TA, with T0, TA ⊂ T , T0 ∩TA = ∅, (15.2)

for some overall parameter space T . It is not unusual, but also not required, thatTA = T − T0. In a parametric setting, the hypotheses are written

H0 : θ ∈ T0 versus HA : θ ∈ TA. (15.3)

Mathematically, there is no particular reason to designate one of the hypothesesnull and the other alternative. In practice, the null hypothesis tends to be the one thatrepresents the status quo, or that nothing unusual is happening, or that everything is

245

Page 258: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

246 Chapter 15. Hypothesis Testing

ok, or that the new isn’t any better than the old, or that the defendant is innocent. Forexample, in the Salk polio vaccine study (Exercise 6.8.9), the null hypothesis would bethat the vaccine has no effect. In simple linear regression, the typical null hypothesiswould be that the slope is 0, i.e., the distributions of the Yi’s do not depend on thexi’s. One may also wish to test the assumptions in regression: The null hypothesiswould be that the residuals are iid Normal(0,σ2

e )’s.Section 16.4 considers model selection, in which there are a number of models,

and we wish to choose the best in some sense. Hypothesis testing could be thoughtof as a special case of model selection, where there are just two models, but it ismore useful to keep the notions separate. In model selection, the models have thesame status, while in hypothesis testing the null hypothesis is special in representinga status quo. (Though hybrid model selection/hypothesis testing situations could beimagined.)

We will look at two primary approaches to hypothesis testing. The accept/rejector fixed α or Neyman-Pearson approach is frequentist and action-oriented: Basedon the data x, you either accept or reject the null hypothesis. The evaluation ofany procedure is based on the chance of making the wrong decision. The Bayesianapproach starts with a prior distribution (on the parameters, as well as on the truth ofthe two hypotheses), and produces the posterior probability that the null hypothesisis true. In the latter case, you can decide to accept or reject the null based on acutoff for its probability. We will also discuss p-values, which arise in the frequentistparadigm, and are often misinterpreted as the posterior probabilities of the null.

15.1 Accept/Reject

There is a great deal of terminology associated with the accept/reject paradigm, butthe basics are fairly simple. Start with a test statistic T(x), which is a function T :X → R that measures in some sense the difference between the data x and the nullhypothesis. To illustrate, let X1, . . . , Xn be iid N(µ, σ2

0 ), where σ20 is known, and test

the hypotheses

H0 : µ = µ0 versus HA : µ 6= µ0. (15.4)

The usual test statistic is based on the z statistic:

T(x1, . . . , xn) = |z|, where z =x− µ0

σ0/√

n. (15.5)

The larger T, the more one would doubt the null hypothesis. Next, choose a cutoffpoint c that represents how large the test statistic can be before rejecting the nullhypothesis. That is,

The test

Rejects the null if T(x) > cAccepts the null if T(x) ≤ c . (15.6)

Or it may reject when T(x) ≥ c and accept when T(x) < c.In choosing c, there are two types of error to balance called, rather colorlessly,

Page 259: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

15.1. Accept/Reject 247

−1.0 −0.5 0.0 0.5 1.0

0.0

0.2

0.4

0.6

0.8

1.0

µ

Pow

er

Two−sidedOne−sided

Figure 15.1: The power function for the z test when α = 0.10, µ0 = 0, n = 25, andσ2

0 = 1.

Type I and Type II errors:

Truth ↓ ActionAccept H0 Reject H0

H0 OK Type I error(false positive)

HA Type II error OK(false negative)

(15.7)

It would have been better if the terminology had been in line with medical and otherusage, where a false positive is rejecting the null when it is true, e.g., saying you havecancer when you don’t, and a false negative is accepting the null when it is false, e.g.,saying everything is ok when it is not. In any case, the larger c, the smaller chance ofa false positive, but the greater chance of a false negative.

Common practice is to set a fairly low limit (such as 5% or 1%), the level, on thechance of a Type I error:

Definition 15.1. A hypothesis test (15.7) has level α if

Pθ[T(X) > c] ≤ α for all θ ∈ T0. (15.8)

Note that a test with level 0.05 also has level 0.10. A related concept is the size ofa test, which is the smallest α for which it is level α:

Size = supθ∈T0

Pθ[T(X) > c]. (15.9)

Usually the size and level are the same, or close, and rarely is the distinction betweenthe two made.

Page 260: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

248 Chapter 15. Hypothesis Testing

Traditionally, more emphasis is on the power than the probability of Type II error,where Powerθ = 1− Pθ[Type II Error] when θ ∈ TA. Power is good. Designing agood study involves making sure that one has a large enough sample size that thepower is high enough.

Under the null hypothesis, the z statistic in (15.5) is distributed N(0,1). Thus thesize as a function of c is 2(1−Φ(c)), where Φ is the N(0,1) distribution function. Toobtain a size of α we would take c = zα/2, the (1− α/2)nd quantile of the normal.The power function is also straightforward to calculate:

Powerµ = 1−Φ(

c−√

σ0

)+ Φ

(−c−

√nµ

σ0

). (15.10)

Figure 15.1 plots the power function for α = 0.10 (so c = z0.10 = 1.645), µ0 = 0,n = 25, and σ2

0 = 1, denoting it “two-sided” since the alternative contains both sidesof µ0. Note that the power function is continuous and crosses the null hypothesisat the level 0.10. Thus we cannot decrease the size without decreasing the power, orincrease the power without increasing the size.

A one-sided version of the testing problem in (15.4) would have the alternativebeing just one side of the null, for example,

H0 : µ ≤ µ0 versus HA : µ > µ0. (15.11)

The test would reject when z > c′, where now c′ is the (1− α)th quantile of a standardnormal. With the same level α = 0.10, the c′ here is 1.282. The power of this test isthe “one-sided” curve in Figure 15.1. We can see that though it has the same size asthe two-sided test, its power is better for the µ’s in its alternative. Since the two-sidedtest has to guard against both sides, its power is somewhat lower.

There are a number of approaches to finding reasonable test statistics. Section15.2 leverages results we have for estimation. Section 15.3 and Chapter 16 developtests based on the likelihood. Section 15.4 presents Bayes tests. Chapter 17 looks atrandomization tests, and Chapter 18 applies randomization to nonparametric tests,many of which are based on ranks. Chapters 21 and 22 compare tests decision-theoretically.

15.1.1 Interpretation

In practice, one usually doesn’t want to reject the null unless there is substantialevidence against it. The situation is similar to the courts in a criminal trial. Thedefendant is “presumed innocent until proven guilty.” That is, the jury imagines thedefendant is innocent, then considers the evidence, and only if the evidence piles upso that the jury believes the defendant is “guilty beyond reasonable doubt” does itconvict. The accept/reject approach to hypothesis testing parallels the courts withthe following connections:

Courts TestingDefendant innocent Null hypothesis true

Evidence DataDefendant declared guilty Reject the null

Defendant declared not guilty Accept the null

(15.12)

Notice that the jury does not say that the defendant is innocent, but either guilty ornot guilty. “Not guilty” is a way to say that there is not enough evidence to convict,

Page 261: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

15.2. Tests based on estimators 249

not to say the jury is confident the defendant is innocent. Similarly, in hypothesistesting accepting the hypothesis does not mean one is confident it is true. Rather, itmay be true, or just that there is not enough evidence to reject it. In fact, one mayprefer to replace the choice “accept the null” with “fail to reject the null.” “Reasonabledoubt” is quantified by level.

15.2 Tests based on estimators

If the null hypothesis sets a parameter or set of parameters equal to a fixed constant,that is, H0 : θ = θ0 for known θ0, then the methods developed for estimating θ canbe applied to testing. In the one-parameter hypothesis, we could find a 100(1− α)%confidence interval for θ, then reject the null hypothesis if the θ0 is not in the interval.Such a test has level α. Bootstrapped confidence intervals can be used for approximatelevel α tests. Randomization tests (Chapter 17) provide another resampling-basedapproach.

In normal-based models, z tests and t tests are often available. In (15.4) through(15.6) we saw the z test for testing µ = µ0 when σ2 is known. If we have the same iidN(µ, σ2) situation, but do not know σ2, then the (two-sided) hypotheses become

H0 : µ = µ0, σ2 > 0 versus HA : µ 6= µ0, σ2 > 0. (15.13)

Here we use the t statistic:

Reject H0 when |T(x1, . . . , xn)| > tn−1,α/2, where T(x1, . . . , xn) =x− µ0

s∗/√

n, (15.14)

s2∗ = ∑(xi − x)2/(n− 1), and tn−1,α/2 is the (1− α/2)nd quantile of a Student’s tn−1.

In Exercise 7.8.12, we looked at a confidence interval for the difference in means ina two-sample model, where X1, . . . , Xn are iid N(µ, σ2), Y1, . . . , Ym are iid N(γ, σ2),and the Xi’s and Yi’s are independent. Here we test

H0 : µ = γ, σ2 > 0 versus HA : µ 6= γ, σ2 > 0. (15.15)

We can again use a t test, where we reject the null when |T| > tn+m−2,α/2, where

T =x− y

spooled

√1n + 1

m

, s2pooled =

∑(xi − x)2 + ∑(yi − y)2

n + m− 2. (15.16)

Or for normal linear regression, testing βi = 0 uses T = βi/se(βi) and rejects when|T| > tn−p,α/2.

More generally, we often have the asymptotic normal result that if θ = θ0,

Z =θ − θ0

se(θ)−→D N(0, 1), (15.17)

so that an approximate z test rejects the null when |Z| > zα/2. If θ is K × 1, and forsome C we have

C−1/2(θ− θ0) −→D N(0, IK) (15.18)as in (14.86) for MLEs, then an approximate χ2 test would reject H0 : θ = θ0 when

(θ− θ0)′C−1(θ− θ0) > χ2

K,α, (15.19)

χ2K,α being the (1− α)th quantile of a χ2

K .

Page 262: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

250 Chapter 15. Hypothesis Testing

15.2.1 Linear regression

Let Y ∼ N(xβ, σ2In), where β is p × 1, and x′x is invertible. We saw above thatwe can use a t test to test a single βi = 0. We can also test whether a set of βi’s iszero, which often arises in analysis of variance models, and any time one has a setof related x-variables. Partition the β and its least squares estimator into the first p1and last p2 components, p = p1 + p2:

β =

(β1β2

)and β =

(β1β2

). (15.20)

Then β1 and β1 are p1 × 1, and β2 and β2 are p2 × 1. We want to test

H0 : β2 = 0 versus HA : β2 6= 0. (15.21)

Theorem 12.1 shows that β ∼ N(β, σ2C) where C = (x′x)−1. If we partition C inaccordance with β, i.e.,

C =

(C11 C12C21 C22

), C11 is p1 × p1 and C22 is p2 × p2, (15.22)

then we have β2 ∼ N(β, σ2C22). Similar to (15.19), if β2 = 0,

U ≡ 1σ2 β

′2C−1

22 β2 ∼ χ2p2

. (15.23)

We cannot use U directly, since σ2 is unknown, but we can estimate it with σ2 =

SSe/(n− p) from (12.25), where SSe = ‖y− xβ‖2, the residual sum of squares fromthe original model. Theorem 12.1 also shows that SSe is independent of β, hence ofU, and V ≡ SSe/σ2 ∼ χ2

n−p. We thus have the ingredients for an F random variable,defined in Exercise 7.8.18. That is, under the null,

F ≡ U/p2V/(n− p)

=β′2C−1

22 β2p2σ2 ∼ Fp2,n−p. (15.24)

The F test rejects the null when F > Fp2,n−p,α, where Fp2,n−p,α is the (1− α)th quantileof an Fp2,n−p.

15.3 Likelihood ratio test

We will start simple, where each hypothesis has exactly one distribution. Let f be thedensity of the data X, and consider the hypotheses

H0 : f = f0 versus HA : f = fA, (15.25)

where f0 and fA are the null and alternative densities, respectively, under consider-ation. The densities could be from the same family with different parameter values,or densities from distinct families, e.g., f0 = N(0, 1) and fA = Cauchy.

Page 263: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

15.4. Bayesian testing 251

Recalling the meaning of likelihood, it would make sense to reject f0 in favor offA if fA is sufficiently more likely than f0. Consider basing the test on the likelihoodratio,

LR(x) =fA(x)f0(x)

. (15.26)

We will see in Section 21.3 that the Neyman-Pearson lemma guarantees that such atest is best in the sense that it has the highest power among tests with its size.

For example, suppose X ∼ Binomial(n, p), and the hypotheses are

H0 : p = 1/2 versus HA : p = 3/4. (15.27)

Then

LR(x) = (3/4)^x (1/4)^{n−x} / (1/2)^n = 3^x / 2^n. (15.28)

We then reject the null hypothesis if LR(x) > c, where we choose c to give us the desired level. But since LR(x) is strictly increasing in x, there exists a c′ such that LR(x) > c if and only if x > c′. For example, if n = 10, taking c′ = 7.5 yields a level α = 0.05468 (= P[X ∈ {8, 9, 10} | p = 1/2]). We could go back and figure out what c is, but there is no need. What we really want is the test, which is to reject H0 : p = 1/2 when x > 7.5. Its power is 0.5256, which is the best you can do with the given size.
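A quick check of the quoted level and power, assuming scipy.stats is available:

    from scipy import stats

    n, c = 10, 7.5
    size = stats.binom.sf(c, n, 0.5)    # P[X > 7.5 | p = 1/2] = P[X >= 8], about 0.0547
    power = stats.binom.sf(c, n, 0.75)  # P[X >= 8 | p = 3/4], about 0.5256
    print(size, power)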

When one or both of the hypotheses are not simple (composite is the terminology for not simple), it is not so obvious how to proceed, because the likelihood ratio f(x | θA)/f(x | θ0) will depend on which θ0 ∈ T0 and/or θA ∈ TA. Two possible solutions are to average or to maximize over the parameter spaces. That is, possible test statistics are

∫_{TA} f(x | θA) ρA(θA) dθA / ∫_{T0} f(x | θ0) ρ0(θ0) dθ0   and   sup_{θA∈TA} f(x | θA) / sup_{θ0∈T0} f(x | θ0) = f(x | θ̂A)/f(x | θ̂0), (15.29)

where ρ0 and ρA are prior probability densities over T0 and TA, respectively, and θ̂0 and θ̂A are the respective MLEs for θ over the two parameter spaces. The latter ratio is the (maximum) likelihood ratio statistic, which is discussed in Section 16.1. Score tests, which are often simpler than the likelihood ratio tests, are presented in Section 16.3. The former ratio in (15.29) is the statistic in what is called a Bayes test, which is key to the Bayesian testing presented next.

15.4 Bayesian testing

The Neyman-Pearson approach is all about action: Either accept the null or reject it. As in the courts, it does not suggest the degree to which the null is plausible or not. By contrast, the Bayes approach produces the probabilities the null and alternative are true, given the data and a prior distribution. Start with the simple versus simple case as in (15.25), where the null hypothesis is that the density is f0, and the alternative that the density is fA. The prior π is given by

P[H0](= P[H0 is true]) = π0, P[HA] = πA, (15.30)

where π0 + πA = 1. Where do these probabilities come from? Presumably, from a reasoned consideration of all that is known prior to seeing the data. Or, one may try to


be fair and take π0 = πA = 1/2. The densities are then the conditional distributions of X given the hypotheses are true:

X |H0 ∼ f0 and X |HA ∼ fA. (15.31)

Bayes theorem (Theorem 6.3 on page 94) gives the posterior probabilities:

P[H0 | X = x] = π0 f0(x) / ( π0 f0(x) + πA fA(x) ) = π0 / ( π0 + πA LR(x) ),

P[HA | X = x] = πA fA(x) / ( π0 f0(x) + πA fA(x) ) = πA LR(x) / ( π0 + πA LR(x) ), (15.32)

where LR(x) = fA(x)/f0(x) is the likelihood ratio from (15.26). In dividing numerator and denominator by f0(x), we are assuming it is not zero. Thus these posteriors depend on the data only through the likelihood ratio. That is, the posterior does not violate the likelihood principle. Hypothesis tests do violate the likelihood principle: Whether they reject depends on the c, which is calculated from f0, not the likelihood.

Odds are actually more convenient here, where the odds of an event B are

Odds[B] = P[B] / (1 − P[B]), hence P[B] = Odds[B] / (1 + Odds[B]). (15.33)

The prior odds in favor of HA are then πA/π0, and the posterior odds are

Odds[HA | X = x] = P[HA | X = x] / P[H0 | X = x] = (πA/π0) LR(x) = Odds[HA] × LR(x). (15.34)

That is,

Posterior odds = (Prior odds)× (Likelihood ratio), (15.35)

which neatly separates the contribution to the posterior of the prior and the data. If a decision is needed, one would choose a cutoff point k, say, and reject the null if the posterior odds exceed k. But this test is the same as the Neyman-Pearson test based on (15.26) with cutoff point c. The difference is that in the present case, the cutoff would not be chosen to achieve a certain level, but rather on an assessment of what probability of H0 is too low to accept the null.

Take the example in (15.27) with n = 10, so that X ∼ Binomial(10, p), and the hypotheses are

H0 : p = 1/2 versus HA : p = 3/4. (15.36)

If the prior odds are even, i.e., π0 = πA = 1/2, so that the prior odds are 1, then the


posterior odds are equal to the likelihood ratio, giving the following:

 x    Odds[HA | X = x] = LR(x)    100 × P[H0 | X = x]
 0         0.0010                      99.90
 1         0.0029                      99.71
 2         0.0088                      99.13
 3         0.0264                      97.43
 4         0.0791                      92.67
 5         0.2373                      80.82
 6         0.7119                      58.41
 7         2.1357                      31.89
 8         6.4072                      13.50
 9        19.2217                       4.95
10        57.6650                       1.70
                                                (15.37)

Thus if you see X = 2 heads, the posterior probability that p = 1/2 is about 99%, and if X = 9, it is about 5%. Note that using the accept/reject test as in Section 15.3, X = 8 would lead to rejecting the null with α ≈ 5.5%, whereas here the posterior probability of the null is 13.5%.
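The entries of (15.37) can be reproduced with a short sketch, assuming scipy.stats is available:

    from scipy import stats

    n = 10
    for x in range(n + 1):
        LR = stats.binom.pmf(x, n, 0.75) / stats.binom.pmf(x, n, 0.5)  # = 3^x / 2^n
        post_null = 1 / (1 + LR)        # posterior odds = LR when prior odds are 1
        print(x, round(LR, 4), round(100 * post_null, 2))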

In the composite situation, it is common to take a stagewise prior. That is, as in (15.29), we have distributions over the two parameter spaces, as well as the prior marginal probabilities of the hypotheses. Thus the prior π is specified by

Θ | H0 ∼ ρ0, Θ | HA ∼ ρA, P[H0] = π0, and P[HA] = πA, (15.38)

where ρ0 is a probability distribution on T0, and ρA is one on TA. Conditioning on one hypothesis, we can find the marginal distribution of the X by integrating its density (in the pdf case) with respect to the conditional density on θ. That is, for the null,

f(x | H0) = ∫_{T0} f(x, θ | H0) dθ = ∫_{T0} f(x | θ & H0) f(θ | H0) dθ = ∫_{T0} f(x | θ) ρ0(θ) dθ. (15.39)

The alternative is similar. Now we are in the same situation as (15.31), where the ratio has the integrated densities, i.e.,

B_{A0}(x) = f(x | HA) / f(x | H0) = ∫_{TA} f(x | θ) ρA(θ) dθ / ∫_{T0} f(x | θ) ρ0(θ) dθ = E_{ρA}[f(x | Θ)] / E_{ρ0}[f(x | Θ)]. (15.40)

(The final ratio is applicable when the ρ0 and ρA do not have pdfs.) This ratio is called the Bayes factor for HA versus H0, which is where the "B_{A0}" notation arises. It is often inverted, so that B_{0A} = 1/B_{A0} is the Bayes factor for H0 versus HA. See Jeffreys (1961).

For example, consider testing whether a normal mean is 0. That is, X1, . . . , Xn are iid N(µ, σ²), with σ² > 0 known, and we test

H0 : µ = 0 versus HA : µ ≠ 0. (15.41)

Take the prior probabilities π0 = πA = 1/2. Under the null, there is only µ = 0, so ρ0[M = 0] = 1. For the alternative, take a normal centered at 0 as the prior ρA:

M | HA ∼ N(0, σ0²), (15.42)


where σ0² is known. (Technically, we should remove the value 0 from the distribution, but it has 0 probability anyway.)

We know the sufficient statistic is X̄n, hence as in Section 13.4.3, we can base the analysis on X̄n ∼ N(µ, σ²/n). The denominator in the likelihood ratio in (15.40) is the N(0, σ²/n) density at x̄n. The numerator is the marginal density under the alternative, which using (7.106) can be shown to be N(0, σ0² + σ²/n). Thus the Bayes factor is

B_{A0}(x̄n) = φ(x̄n | 0, σ0² + σ²/n) / φ(x̄n | 0, σ²/n), (15.43)

where φ(z | µ, σ2) is the N(µ, σ2) pdf. Exercise 15.7.5 rewrites the Bayes factor as

B_{A0}(x̄n) = (1/√(1 + τ)) e^{½ z²τ/(1+τ)}, where z = √n x̄n/σ and τ = nσ0²/σ². (15.44)

The z is the usual z statistic, and τ is the ratio of the prior variance to Var[X̄n | M = µ], similar to the quantities in (11.50). Then with π0 = πA, we have that

P[H0 | X̄n = x̄n] = 1 / (1 + B_{A0}(x̄n)). (15.45)
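A small sketch of (15.44) and (15.45), assuming Python with numpy; the τ values below are chosen to match the columns of the table that follows:

    import numpy as np

    def bayes_factor_BA0(z, tau):
        # (15.44): BA0 = exp(z^2 tau / (2 (1 + tau))) / sqrt(1 + tau)
        return np.exp(0.5 * z**2 * tau / (1 + tau)) / np.sqrt(1 + tau)

    def posterior_prob_null(z, tau):
        # (15.45), with pi_0 = pi_A = 1/2
        return 1 / (1 + bayes_factor_BA0(z, tau))

    for z in (1.0, 2.0, 3.0):
        taus = (1, max(z**2 - 1, 0), 100, 1000)     # tau = n sigma_0^2 / sigma^2
        print(z, [round(100 * posterior_prob_null(z, t), 2) for t in taus])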

Even if one does not have a good idea from the context of the data what σ0² should be, a value (or values) needs to be chosen. Berger and Sellke (1987) consider this question extensively. Here we give some highlights. Figure 15.2 plots the posterior probability of the null hypothesis as a function of τ for values of the test statistic z = 1, 2, 3. A small value of τ indicates a tight prior around 0 under the alternative, which does not sufficiently distinguish the alternative from the null, and leads to a posterior probability of the null towards 1/2. At least for the larger values of z, as τ increases, the posterior probability of the null quickly decreases, then bottoms out and slowly increases. In fact, for any z ≠ 0, the posterior probability of the null approaches 1 as τ → ∞. Contrast this behavior with that for probability intervals (Section 11.6.1), which stabilize fairly quickly as the posterior variance increases to infinity.

A possibly reasonable choice of τ would be one where the posterior probability is fairly stable. For |z| > 1, the minimum is achieved at τ = z² − 1, which may be reasonable though most favorable to the alternative. Choosing τ = 1 means the prior variance and variance of the sample mean given µ are equal. Choosing τ = n equates the prior variance and variance of one Xi given µ. This is the choice Berger and Sellke (1987) use as one close to the Cauchy proposed by Jeffreys (1961), and deemed reasonable by Kass and Wasserman (1995). It also is approximated by the Bayes information criterion (BIC) presented in Section 16.5. Some value within that range is defensible. The table below shows the posterior probability of the null as a percentage for various values of z and relationship between the prior and sample variances.


Figure 15.2: The posterior probability of the null hypothesis as a function of τ for values of z = 1 (upper curve), z = 2 (middle curve), and z = 3 (lower curve). See (15.44).

                                      σ0² = σ²
  z     σ0² = σ²/n    Minimum     n = 100    n = 1000
  0        58.58       50.00        90.95      96.94
  0.5      57.05       50.00        89.88      96.54
  1.0      52.41       50.00        85.97      95.05
  1.5      44.62       44.53        76.74      91.14
  2.0      34.22       30.86        58.11      81.10
  2.5      22.87       15.33        31.29      58.24
  3.0      12.97        5.21        10.45      26.09
  3.5       6.20        1.25         2.28       6.51
  4.0       2.52        0.22         0.36       1.06
  4.5       0.89        0.03         0.04       0.13
  5.0       0.27        0            0          0.01
                                                        (15.46)

It is hard to make any sweeping recommendations, but one fact may surprise you. A z of 2 is usually considered substantial evidence against the null. Looking at the table, the prior that most favors the alternative still gives the null a posterior probability of over 30%, as high as 81% for the Jeffreys-type prior with n = 1000. When z = 3, the lowest posterior probability is 5.21%, but other reasonable priors give values twice that. It seems that z must be at least 3.5 or 4 to have real doubts about the null, at least according to this analysis.

Kass and Raftery (1995) discuss many aspects of Bayesian testing and computation of Bayes factors, and Lecture 2 of Berger and Bayarri (2012) has a good overview of current approaches to Bayesian testing.


15.5 P-values

In the accept/reject setup, the outcome of the test is simply an action: Accept the null or reject it. The size α is the (maximum) chance of rejecting the null given that it is true. But just knowing the action does not reveal how strongly the null is rejected or accepted. For example, when testing the null µ = 0 versus µ ≠ 0 based on X1, . . . , Xn iid N(µ, 1), an α = 0.05 level test rejects the null when Z = √n |X̄n| > 1.96. The test then would reject the null at the 5% level if z = 2. It would also reject the null at the 5% level if z = 10. But intuitively, z = 10 provides much stronger evidence against the null than does z = 2. The Bayesian posterior probability of the null is a very comprehensible assessment of evidence for or against the null, but does require a prior.

A frequentist measure of evidence often used is the p-value, which can be thought of as the smallest size such that the test of that size rejects the null. Equivalently, it is the (maximum) chance under the null of seeing something as or more extreme than the observed test statistic. That is, if T(X) is the test statistic,

p-value(x) = sup_{θ∈T0} P[T(X) ≥ T(x) | θ]. (15.47)

In the normal mean case with T(x1, . . . , xn) = |z| where z = √n x̄n, the p-value is

p-value(x1, . . . , xn) = P[|√n X̄n| ≥ |z| | µ = 0] = P[|N(0, 1)| ≥ |z|]. (15.48)

If z = 2, the p-value is 4.55%, if z = 3 the p-value is 0.27%, and if z = 10, the p-value is essentially 0 (≈ 10^{−21}%). Thus the lower the p-value, the more evidence against the null.

The next lemma shows that if one reports the p-value instead of the accept/reject decision, the reader can decide on the level to use, and easily perform the test.

Lemma 15.2. Consider the hypothesis testing problem based on X, where the null hypothesis is H0 : θ ∈ T0. For test statistic T(x), define the p-value as in (15.47). Then for α ∈ (0, 1),

sup_{θ∈T0} P[p-value(X) ≤ α | θ] ≤ α, (15.49)

hence the test that rejects the null when p-value(x) ≤ α has level α.

Proof. Take a θ0 ∈ T0. If we let F be the distribution function of −T(X) when θ = θ0, then the p-value is F(−t(x)). Equation (4.31) shows that P[F(−T(X)) ≤ α] ≤ α, i.e.,

P[p-value(X) ≤ α | θ0] ≤ α, (15.50)

and (15.49) follows by taking the supremum over θ0 on the left-hand side.

It is not necessary that p-values be defined through a test statistic. We could alternately define a p-value to be any function p-value(x) that satisfies (15.50) for all α ∈ (0, 1). Thus the p-value is the test statistic, where small values lead to rejection.

A major stumbling block to using the p-value as a measure of evidence is to know how to interpret the p-value. A common mistake is to conflate the p-value with the probability that the null is true given the data:

P[T(X) ≥ T(x) | µ = 0]  ?≈?  P[M = 0 | T(X) = T(x)]. (15.51)


There are a couple of obvious problems with removing the question marks in (15.51). First, the conditioning is reversed, i.e., in general P[A | B] ≠ P[B | A]. Second, the right-hand side is taking the test statistic exactly equal to the observed one, while the left-hand side is looking at being greater than or equal to the observed.

More convincing is to calculate the two quantities in (15.51). The right-hand side is found in (15.46). Even taking the minimum posterior probabilities, we have the following:

  z                     1      1.5     2      2.5     3      3.5     4       4.5      5
  p-value              31.73  13.36   4.55   1.24    0.27   0.047   0.006   0.0007   0.0001
  P[H0 | Z = z]        50.00  44.53  30.86  15.33    5.21   1.247   0.221   0.0297   0.0031
  Bound from (15.53)   49.75  42.23  27.65  12.90    4.16   0.961   0.166   0.0220   0.0022
                                                                                     (15.52)

The p-values far overstate the evidence against the null compared to the posterior probabilities. The ratio of posterior probability to p-value is 6.8 for z = 2, 19.3 for z = 3, and 53.6 for z = 5. In fact, as Berger and Sellke (1987) note, the p-value for Z = z is similar to the posterior probability for Z = z − 1. Sellke, Bayarri, and Berger (2001) provide a simple formula that adjusts the p-value to approximate a reasonable lower bound for the posterior probability of the null. Let p be the p-value, and suppose that p < 1/e. Then for a simple null and a general class of alternatives (those with decreasing failure rate, though we won't go into that),

P[H0 | T(X) = t(x)] ≥ B*_{0A} / (1 + B*_{0A}), where B*_{0A} = −e p log(p). (15.53)

The bound is 1/2 if p > 1/e. See Exercise 15.7.8 for an illustration. The third row in (15.52) contains these values, which are indeed very close to the actual probabilities.
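A sketch of the calibration (15.53), assuming Python with numpy and scipy.stats, which reproduces the third row of (15.52) from the two-sided p-values:

    import numpy as np
    from scipy import stats

    for z in (1.5, 2.0, 2.5, 3.0, 3.5, 4.0):
        p = 2 * stats.norm.sf(z)            # two-sided p-value, as in (15.48)
        B_star = -np.e * p * np.log(p)      # B*_0A in (15.53); valid for p < 1/e
        bound = B_star / (1 + B_star)       # lower bound on P[H0 | data]
        print(z, round(100 * p, 3), round(100 * bound, 2))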

15.6 Confidence intervals from tests

Hypothesis tests can often be "inverted" to find confidence intervals for parameters in certain models. At the beginning of this chapter, we noted that one could test the null hypothesis that η = η0 by first finding a confidence interval for η, and rejecting the null hypothesis if η0 is not in that interval. We can reverse this procedure, taking as a confidence interval all η0 that are not rejected by the hypothesis test.

Formally, we have X ∼ Pθ and a parameter of interest, say η = η(θ), with space H. (We will work with one-dimensional η, but it could be multidimensional.) Then for fixed η0 ∈ H, consider testing

H0 : η = η0 (i.e., θ ∈ T_{η0} = {θ ∈ T | η(θ) = η0}), (15.54)

where we can use whatever alternative fits the situation. For given α, we need a function T(x ; η0) such that for each η0 ∈ H, T(x ; η0) is a test statistic for the null in (15.54). That is, there is a cutoff point c_{η0} for each η0 such that

sup_{θ∈T_{η0}} Pθ[T(X ; η0) > c_{η0}] ≤ α. (15.55)

Now we use this T as a pivotal quantity as in (11.9). Consider the set defined for each x via

C(x) = {η0 ∈ H | T(x ; η0) ≤ c_{η0}}. (15.56)


This set is a 100(1− α)% confidence region for η, since for each θ,

Pθ[η(θ) ∈ C(X)] = 1− Pθ[T(X ; η(θ)) > cη(θ)] ≥ 1− α (15.57)

by (15.55). We call it a "confidence region" since in general it may not be an interval.

For a simple example, suppose X1, . . . , Xn are iid N(µ, σ²), and we wish a confidence interval for µ. To test H0 : µ = µ0 with level α, we can use Student's t, T(x ; µ0) = |x̄ − µ0|/(s*/√n). The cutoff point is t_{n−1,α/2} for any µ0. Thus

C(x) = { µ0 : |x̄ − µ0|/(s*/√n) ≤ t_{n−1,α/2} } = [ x̄ − t_{n−1,α/2} s*/√n , x̄ + t_{n−1,α/2} s*/√n ], (15.58)

as before, except the interval is closed instead of open.

A more interesting situation is X ∼ Binomial(n, p) for n small enough that the normal approximation may not work. (Generally, people are comfortable with the approximation if np ≥ 5 and n(1 − p) ≥ 5.) The Clopper-Pearson interval inverts the two-sided binomial test to obtain an exact confidence interval. It is exact in the sense that it is guaranteed to have level at least the nominal one. It tends to be very conservative, so may not be the best choice. See Brown, Cai, and DasGupta (2001). Consider the two-sided testing problem

H0 : p = p0 versus HA : p ≠ p0 (15.59)

for fixed p0 ∈ (0, 1). Given level α, the test we use has (approximately) equal tails. It rejects the null hypothesis if

x ≤ k(p0) or x ≥ l(p0), where k(p0) = max{integer k | P_{p0}[X ≤ k] ≤ α/2} and l(p0) = min{integer l | P_{p0}[X ≥ l] ≤ α/2}. (15.60)

For data x, the confidence interval consists of all p0 that are not rejected:

C(x) = {p0 | k(p0) + 1 ≤ x ≤ l(p0) − 1}. (15.61)

Let bx and ax be the values of p defined by

P_{bx}[X ≤ x] = α/2 = P_{ax}[X ≥ x]. (15.62)

Both k(p0) and l(p0) are nondecreasing in p0, hence as in Exercise 15.7.12,

k(p0) + 1 ≤ x  ⇔  p0 < bx if 0 ≤ x ≤ n − 1, and p0 < 1 if x = n. (15.63)

Similarly,

l(p0) − 1 ≥ x  ⇔  p0 > 0 if x = 0, and p0 > ax if 1 ≤ x ≤ n. (15.64)

So the confidence interval in (15.61) is given by

C(0) = (0, b0); C(x) = (ax, bx), 1 ≤ x ≤ n− 1; C(n) = (an, 1). (15.65)

We can use Exercise 5.6.17 to more easily solve for ax (x ≥ 1) and bx (x ≤ n − 1) in (15.62):

ax = q(α/2 ; x, n − x + 1) and bx = q(1 − α/2 ; x + 1, n − x), (15.66)

where q(γ ; a, b) is the γth quantile of a Beta(a, b).
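A minimal sketch of the resulting interval, assuming scipy.stats is available; the data values in the example call are hypothetical:

    from scipy import stats

    def clopper_pearson(x, n, alpha=0.05):
        # (15.65)-(15.66): endpoints are Beta quantiles; open at 0 or 1 in the edge cases
        lower = 0.0 if x == 0 else stats.beta.ppf(alpha / 2, x, n - x + 1)        # a_x
        upper = 1.0 if x == n else stats.beta.ppf(1 - alpha / 2, x + 1, n - x)    # b_x
        return lower, upper

    print(clopper_pearson(3, 12))    # hypothetical data: x = 3 successes in n = 12 trials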


15.7 Exercises

Exercise 15.7.1. Suppose X ∼ Exponential(λ), and we wish to test the hypotheses H0 : λ = 1 versus HA : λ ≠ 1. Consider the test that rejects the null hypothesis when |X − 1| > 3/4. (a) Find the size α of this test. (b) Find the power of this test as a function of λ. (c) For which λ is the power at its minimum? What is the power at this value? Is it less than the size? Is that a problem?

Exercise 15.7.2. Suppose X1, . . . , Xn are iid Uniform(0, θ), and we test H0 : θ ≤ 1/2 versus HA : θ > 1/2. Take the test statistic to be T = max{X1, . . . , Xn}, so we reject when T > cα. Find cα so that the size of the test is α = 0.05. Calculate cα for n = 10.

Exercise 15.7.3. Suppose X | P = p ∼ Binomial(5, p), and we wish to test H0 : p = 1/2 versus HA : p = 3/5. Consider the test that rejects the null when X = 5. (a) Find the size α and power of this test. Is the test very powerful? (b) Find LR(x) and P[H0 | X = x] as functions of x when the prior is that P[H0] = P[HA] = 1/2. When x = 5, is the posterior probability of the null hypothesis close to α?

Exercise 15.7.4. Take X | P = p ∼ Binomial(5, p), and now test H0 : p = 1/2 versus HA : p ≠ 1/2. Consider the test that rejects the null hypothesis when X ∈ {0, 5}. (a) Find the size α of this test. (b) Consider the prior distribution where P[H0] = P[HA] = 1/2, and P | HA ∼ Uniform(0, 1) (where p = 1/2 is removed from the uniform, with no ill effects). Find P[X = x | H0], P[X = x | HA], and the Bayes factor, B_{A0}. (c) Find P[H0 | X = x]. When x = 0 or 5, is the posterior probability of the null hypothesis close to α?

Exercise 15.7.5. Let φ(z | µ, σ²) denote the N(µ, σ²) pdf. (a) Show that the Bayes factor in (15.43) can be written as

B_{A0}(x̄n) = φ(x̄n | 0, σ0² + σ²/n) / φ(x̄n | 0, σ²/n) = (1/√(1 + τ)) e^{½ z²τ/(1+τ)}, (15.67)

where z = √n x̄n/σ and τ = nσ0²/σ², as in (15.44). (b) Show that as τ → ∞ with z fixed (so the prior variance goes to infinity), the Bayes factor goes to 0, hence the posterior probability of the null goes to 1. (c) For fixed z, show that the minimum of B_{A0}(x̄n) over τ is achieved when τ = z² − 1.

Exercise 15.7.6. Let X ∼ Binomial(n, p), and test H0 : p = 1/2 versus HA : p ≠ 1/2. (a) Suppose n = 25, and let the test statistic be T(x) = |x − 12.5|, so that the null is rejected if T(x) > c for some constant c. Find the sizes for the tests with (i) c = 5; (ii) c = 6; (iii) c = 7.

For the rest of the question, consider a beta prior for the alternative, i.e., P | HA ∼ Beta(γ, γ). (b) What is the prior mean given the alternative? Find the marginal pmf for X given the alternative. What distribution does it represent? [Hint: Recall Definition 6.2 on page 86.] (c) Show that for X = x, the Bayes factor is

B_{A0}(x) = 2^n [Γ(2γ)/Γ(γ)²] [Γ(x + γ)Γ(n − x + γ)/Γ(n + 2γ)]. (15.68)

(d) Suppose n = 25 and x = 7. Show that the p-value for the test statistic in part (a) is 0.0433. Find the Bayes factor in (15.68), as well as the posterior probability of the null assuming π0 = πA = 1/2, for γ equal to various values from 0.1 to 5. What is the minimum posterior probability (approximately)? Does it come close to the p-value? For what values of γ is the posterior probability relatively stable?


Exercise 15.7.7. Consider testing a null hypothesis using a test statistic T. Assume that the distribution of T under the null is the same for all parameters in the null hypothesis, and is continuous with distribution function FT(t). Show that under the null, the p-value has a Uniform(0,1) distribution. [Hint: First note that the p-value equals 1 − FT(t), then use (4.30).]

Exercise 15.7.8. In Exercise 15.7.7, we saw that in many situations, the p-value has a Uniform(0,1) distribution under the null. In addition, one would usually expect the p-value to have a distribution "smaller" than the uniform when the alternative holds. A simple parametric abstraction of this notion is to have X ∼ Beta(γ, 1), and test H0 : γ = 1 versus HA : 0 < γ < 1. Then X itself is the p-value. (a) Show that under the null, X ∼ Uniform(0, 1). (b) Show that for γ in the alternative, P[X < x | γ] > P[X < x | H0]. (What is the distribution function?) (c) Show that the Bayes factor for a prior density π(γ) on the alternative space is

B_{A0} = ∫₀¹ γ x^{γ−1} π(γ) dγ. (15.69)

(d) We wish to find an upper bound for B_{A0}. Argue that B_{A0} is less than or equal to the supremum of the integrand, γx^{γ−1}, over 0 < γ < 1. (e) Show that

sup_{0<γ<1} γx^{γ−1} = −1/(e x log(x)) if x ≤ 1/e, and 1 if x > 1/e. (15.70)

(f) We have that an upper bound on the Bayes factor B_{A0} is given in (15.70). Show that the lower bound on the posterior probability of the null is as given in (15.53), where the p there is the x here. [Recall that B_{0A} = 1/B_{A0}.]

Exercise 15.7.9. Meta-analysis is a general term for combining results from many different studies. It is very common in health studies. One aspect of meta-analysis is combining independent tests, where several studies test the same null hypothesis, and one wishes to combine the hypothesis tests into one overall test. The simplest form of the problem assumes p independent p-values, U1, . . . , Up, one from each study, where the parameter for the ith study is θi. If θi = 0 then Ui ∼ Uniform(0, 1), and if θi ≠ 0, Ui is smaller than a Uniform(0, 1) in some sense. Then we test the combined null,

H0 : θ = 0 versus HA : θ ≠ 0, (15.71)

where θ = (θ1, . . . , θp). There have been many omnibus methods proposed, where "omnibus" means they can be used no matter what the actual distributions of the original test statistics are. The most famous is Fisher's procedure, which is based on the product of the p-values. Other tests include those based on the sum, minimum, or maximum of the p-values. In any case, the resulting statistic is not itself a p-value. We need to find the null distribution of the combined statistic, and from that find the overall p-value. (a) The Fisher, or Fisher-Pearson, statistic is usually written as a function of the product, TP(U) = −2 log(∏ Ui) = −2 ∑ log(Ui). Show that under the null, the −2 log(Ui)'s are iid Exponential(1/2), which is the same as χ²_2. What is the null distribution of TP(U)? (b) Tippett's test uses the statistic TMin(U) = min{Ui}, the value of the most significant p-value. For given α, find the cα such that P[TMin(U) ≤ cα | H0] = α. (c) The maximum test is based on TMax(U) = max{Ui}, the least significant of the p-values. Find the cα such that P[TMax(U) ≤ cα | H0] = α.


(d) The Edgington test uses the sum of the p-values, TS(U) = ∑ Ui. Find cα such that P[TS(U) ≤ cα | H0] = α for p = 2 and α ∈ (0, .5). [See Exercise 1.7.6.] (e) The Liptak-Stouffer statistic transforms each Ui into a normal under the null. The statistic is TN(U) = −∑ Φ^{−1}(Ui), where Φ^{−1} is the inverse of the N(0, 1) distribution function. What is the null distribution of each −Φ^{−1}(Ui)? Of TN(U)?

Exercise 15.7.10. When the null and alternative hypotheses are the same dimensionality, it is possible for p-values and Bayesian posterior probabilities of the null to be similar. For example, suppose X ∼ N(µ, 1), and we test the one-sided hypotheses H0 : µ ≤ 0 versus HA : µ > 0. The test statistic is X itself. (a) Show that the p-value for X = x is 1 − Φ(x), where Φ is the N(0, 1) distribution function. (b) We do not have to treat the two hypotheses separately to develop the prior in this case. Take the overall prior to be M ∼ N(0, σ0²). Show that under this prior, P[H0] = P[HA] = 1/2. (c) Using the prior in part (b), find the posterior distribution of M | X = x, and then P[H0 | X = x]. (d) What is the limit of P[H0 | X = x] as σ0² → ∞? How does it compare to the p-value?

Exercise 15.7.11. Consider the polio vaccine example from Exercise 6.8.9, where XV is the number of subjects in the vaccine group that contracted polio, and XC is the number in the control group. We model XV and XC as independent, where

XV ∼ Poisson(cVθV) and XC ∼ Poisson(cCθC). (15.72)

Here θV and θC are the population rates of polio cases per 100,000 subjects for the two groups. The sample sizes for the two groups are nV = 200,745 and nC = 201,229, so that cV = 2.00745 and cC = 2.01229. We wish to test

H0 : θV = θC versus HA : θV ≠ θC. (15.73)

Here we will perform a Bayesian test. (a) Let X | Θ = θ ∼ Poisson(cθ) and Θ ∼ Gamma(α, λ) for given α > 0, λ > 0. Show that the marginal density of X is

f(x | α, λ) = [Γ(x + α)/(x! Γ(α))] · c^x λ^α / (c + λ)^{x+α}. (15.74)

(If α is a positive integer, then this f is the negative binomial pmf. The density is in fact a generalization of the negative binomial to real-valued α > 0.) (b) Take XV and XC in (15.72) and the hypotheses in (15.73). Suppose the prior given the alternative hypothesis has ΘC and ΘV independent, both Gamma(α, λ). Show that the marginal for (XC, XV) under the alternative is f(xV | α, λ) f(xC | α, λ). (c) Under the null, let ΘV = ΘC = Θ, their common value, and set the prior Θ ∼ Gamma(α, λ). Show that the marginal joint pmf of (XV, XC) is

f(xV, xC | H0) = [Γ(xV + xC + α)/(xV! xC! Γ(α))] · c_V^{xV} c_C^{xC} λ^α / (λ + cV + cC)^{xV + xC + α}. (15.75)

[This distribution is the negative multinomial.] (d) For the vaccine group, there were xV = 57 cases of polio, and for the control group, there were xC = 142 cases. Find (numerically) the Bayes factor B_{A0}(xV, xC), the ratio of integrated likelihoods as in (15.40), when α = 1, λ = 1/25. [Hint: You may wish to find the logs of the various quantities first.] (e) What is the posterior probability of the null? What do you conclude about the effectiveness of the polio vaccine based on this analysis? (f) [Extra credit: Try some other values of α and λ. Does it change the conclusion much?]


Exercise 15.7.12. This exercise verifies the results for the Clopper-Pearson confidence interval procedure in (15.65) and (15.66). Let X ∼ Binomial(n, p) for 0 < p < 1, and fix α ∈ (0, 1). As in (15.60), for p0 ∈ (0, 1), define k(p0) = max{integer k | P_{p0}[X ≤ k] ≤ α/2}, and for given x, suppose that k(p0) ≤ x − 1. (a) Argue that for any p ∈ (0, 1), k(p) ≤ n − 1. Thus if x = n, k(p0) ≤ x − 1 implies that p0 < 1. (b) Now suppose x ∈ {0, . . . , n − 1}, and as in (15.62) define bx to satisfy P_{bx}[X ≤ x] = α/2, so that k(bx) = x. Show that if p < bx, then Pp[X ≤ x] > α/2. Argue that therefore, k(p) ≤ x − 1 if and only if p < bx. (c) Conclude that (15.63) holds. (d) Exercise 5.6.17 shows that Pp[X ≤ x] = P[Beta(x + 1, n − x) > p]. Use this fact to prove (15.66).

Exercise 15.7.13. Continue with the setup in Exercise 15.7.12. (a) Suppose x = 0. Show that the confidence interval is C(0) = (0, 1 − (α/2)^{1/n}). (b) Show that C(n) = ((α/2)^{1/n}, 1).

Exercise 15.7.14. Suppose X1, . . . , Xn are iid N(µX, σ²_X) and Y1, . . . , Ym are iid N(µY, σ²_Y), and the Xi's are independent of the Yi's. The goal is a confidence interval for the ratio of means, µX/µY. We could use the ∆-method on x̄/ȳ, or bootstrap as in Exercise 11.7.11. Here we use an idea of Fieller (1932), who allowed correlation between Xi and Yi. Assume σ²_X and σ²_Y are known. Consider the null hypothesis H0 : µX/µY = γ0 for some fixed γ0. Write the null hypothesis as H0 : µX − γ0µY = 0. Let T(x, y, γ0) be the z-statistic (15.5) based on x̄ − γ0ȳ, so that the two-sided level α test rejects the null when T(x, y, γ0)² ≥ z²_{α/2}. We invert this test as in (15.56) to find a 100(1 − α)% confidence region for γ. (a) Show that the confidence region can be written

C(x, y) = {γ0 | a_y(γ0 − x̄ȳ/a_y)² < x̄²ȳ²/a_y − a_x}, (15.76)

where a_x = x̄² − z²_{α/2}σ²_X/n and a_y = ȳ² − z²_{α/2}σ²_Y/m. (b) Let c = x̄²ȳ²/a_y − a_x. Show that if a_y > 0 and c > 0, the confidence interval is x̄ȳ/a_y ± d. What is d? (c) What is the confidence interval if a_y > 0 but c < 0? (d) What is the confidence interval if a_y < 0 and c > 0? Is that reasonable? [Hint: Note that a_y < 0 means that µY is not significantly different than 0.] (e) Finally, suppose a_y < 0 and c < 0. Show that the confidence region is (−∞, u) ∪ (v, ∞). What are u and v?


Chapter 16

Likelihood Testing and Model Selection

16.1 Likelihood ratio test

The likelihood ratio test (LRT) (or maximum likelihood ratio test, though LRT is more common), uses the likelihood ratio statistic, but substitutes the MLE for the parameter value in each density as in the second ratio in (15.29). The statistic is

Λ(x) = sup_{θA∈TA} f(x | θA) / sup_{θ0∈T0} f(x | θ0) = f(x | θ̂A) / f(x | θ̂0), (16.1)

where θ̂0 is the MLE of θ under the null model, and θ̂A is the MLE of θ under the alternative model. Notice that in the simple versus simple situation, Λ(x) = LR(x). In many situations this statistic leads to a reasonable test, in the same way that the MLE is often a good estimator. Also, under appropriate conditions, it is easy to find the cutoff point to obtain the approximate level α:

Under H0, 2 log(Λ(X)) →D χ²_ν, ν = dim(TA) − dim(T0), (16.2)

the dims being the number of free parameters in the two parameter spaces, which may or may not be easy to determine.

First we will show some examples, then formalize the above result, at least for simple null hypotheses.

16.1.1 Normal mean

Suppose X1, . . . , Xn are iid N(µ, σ²), and we wish to test whether µ = 0, with σ² unknown:

H0 : µ = 0, σ² > 0 versus HA : µ ≠ 0, σ² > 0. (16.3)

Here, θ = (µ, σ²), so we need to find the MLE under the two models. Start with the null. Here, T0 = {(0, σ²) | σ² > 0}. Thus the MLE of µ is µ̂0 = 0, since it is the only possibility. For σ², we then maximize

(1/σ^n) e^{−∑x_i²/(2σ²)}, (16.4)


which by the usual calculations yields

σ̂0² = ∑ x_i²/n, hence θ̂0 = (0, ∑ x_i²/n). (16.5)

The alternative has space TA = {(µ, σ²) | µ ≠ 0, σ² > 0}, which from Section 13.6 yields MLEs

θ̂A = (x̄, s²), where s² = σ̂A² = ∑(x_i − x̄)²/n. (16.6)

Notice that not only is the MLE of µ different in the two models, but so is the MLE of σ². (Which should be reasonable, because if you know the mean is 0, you do not have to use x̄ in the variance.) Next, stick those estimates into the likelihood ratio, and see what happens (the √(2π)'s cancel):

Λ(x) = f(x | x̄, s²) / f(x | 0, ∑x_i²/n)
     = [ (1/s^n) e^{−∑(x_i−x̄)²/(2s²)} ] / [ (1/(∑x_i²/n)^{n/2}) e^{−∑x_i²/(2∑x_i²/n)} ]
     = ( (∑x_i²/n) / s² )^{n/2} e^{−n/2} / e^{−n/2}
     = ( ∑x_i² / ∑(x_i − x̄)² )^{n/2}. (16.7)

Using (16.2), we have that the LRT

Rejects H0 when 2 log(Λ(x)) > χ²_{ν,α}, (16.8)

which has a size of approximately α. To find the degrees of freedom, we count up the free parameters in the alternative space, which is two (µ and σ²), and the free parameters in the null space, which is just one (σ²). Thus the difference is ν = 1.

In this case, we can also find an exact test. We start by rewriting the LRT:

Λ(x) > c ⇐⇒ ( ∑(x_i − x̄)² + n x̄² ) / ∑(x_i − x̄)² > c* = c^{2/n}
         ⇐⇒ n x̄² / ∑(x_i − x̄)² > c** = c* − 1
         ⇐⇒ (√n x̄)² / ( ∑(x_i − x̄)²/(n − 1) ) > c*** = (n − 1)c**
         ⇐⇒ |Tn(x)| > c**** = √(c***), (16.9)

where Tn is the t statistic,

Tn(x) = √n x̄ / √( ∑(x_i − x̄)²/(n − 1) ). (16.10)

Thus the cutoff is c**** = t_{n−1,α/2}. Which is to say, the LRT in this case is the usual two-sided t-test. We could reverse the steps to find the original cutoff c in (16.8), but it is not necessary since we can base the test on the Tn.
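A quick numerical check, assuming Python with numpy and using simulated data, that 2 log(Λ(x)) in (16.7) is the monotone function n log(1 + Tn²/(n − 1)) of the t statistic, as the steps in (16.9) imply:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(loc=0.3, scale=1.0, size=20)
    n = len(x)

    two_log_lambda = n * np.log(np.sum(x**2) / np.sum((x - x.mean())**2))     # from (16.7)
    t = np.sqrt(n) * x.mean() / np.sqrt(np.sum((x - x.mean())**2) / (n - 1))  # (16.10)

    # The identity behind (16.9): 2 log Lambda = n log(1 + t^2/(n - 1))
    print(two_log_lambda, n * np.log(1 + t**2 / (n - 1)))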


16.1.2 Linear regression

Here we again look at the multiple regression testing problem from Section 15.2.1. The model (12.9) is

Y ∼ N(xβ, σ²I_n), (16.11)

where β is p × 1, x is n × p, and we will assume that x′x is invertible. In simple linear regression, it is common to test whether the slope is zero, leaving the intercept to be arbitrary. More generally, we test whether some components of β are zero. Partition β and x into two parts,

β = (β1′, β2′)′, x = (x1, x2), (16.12)

where β1 is p1 × 1, β2 is p2 × 1, x1 is n× p1, x2 is n× p2, and p = p1 + p2. We test

H0 : β2 = 0 versus HA : β2 ≠ 0, (16.13)

so that β1 is unspecified. Using (13.88), we can take the loglikelihood to be

l(β, σ² ; y) = −(1/(2σ²)) ‖y − xβ‖² − (n/2) log(σ²). (16.14)

Under the alternative, the MLE of β is the least squares estimate for the unrestricted model, and the MLE of σ² is the residual sum of squares over n. (See Exercise 13.8.20.) Using the projections as in (12.14), we have

σ̂A² = (1/n)‖y − xβ̂A‖² = (1/n) y′Q_x y, Q_x = I_n − P_x, P_x = x(x′x)^{−1}x′. (16.15)

Under the null, the model is Y ∼ N(x1β1, σ²I_n), hence

β̂0 = (β̂01′, 0′)′, and σ̂0² = (1/n)‖y − x1β̂01‖² = (1/n) y′Q_{x1} y, (16.16)

where β̂01 = (x1′x1)^{−1}x1′y. The maximal loglikelihoods are

l_i(β̂i, σ̂i² ; y) = −(n/2) log(σ̂i²) − n/2, i = 0, A, (16.17)

so that

2 log(Λ(y)) = n log( y′Q_{x1} y / y′Q_x y ). (16.18)

We can find the exact distribution of the statistic under the null, but it is easier to derive after rewriting the model a bit so that the two submatrices of x are orthogonal. Let

x* = xA and β* = A^{−1}β, where A = ( I_{p1}   −(x1′x1)^{−1}x1′x2
                                        0        I_{p2} ). (16.19)

Thus xβ = x*β*. Exercise 16.7.3 shows that

β* = (β*_1′, β2′)′ and x* = (x1, x*_2), where x1′x*_2 = 0. (16.20)


Now x1 has not changed, nor has β2, hence the hypotheses in (16.13) remain the same. The exercise also shows that, because x1′x*_2 = 0,

P_x = P_{x*} = P_{x1} + P_{x*_2} =⇒ Q_x = Q_{x*} and Q_{x1} = Q_x + P_{x*_2}. (16.21)

We can then write the ratio in the log in (16.18) as

y′Q_{x1} y / y′Q_x y = ( y′Q_x y + y′P_{x*_2} y ) / y′Q_x y = 1 + y′P_{x*_2} y / y′Q_x y. (16.22)

It can further be shown that

Y′P_{x*_2} Y ∼ σ²χ²_{p2} and is independent of Y′Q_x Y ∼ σ²χ²_{n−p}, (16.23)

and from Exercise 7.8.18 on the F distribution,

( y′P_{x*_2} y / p2 ) / ( y′Q_x y / (n − p) ) = y′P_{x*_2} y / (p2 σ̂²) ∼ F_{p2,n−p}, (16.24)

where σ̂² = y′Q_x y/(n − p), the unbiased estimator of σ². Thus the LRT is equivalent to the F test. What might not be obvious, but is true, is that this F statistic is the same as the one we found in (15.24). See Exercise 16.7.4.

16.1.3 Independence in a 2× 2 table

Consider the setup in Exercise 13.8.23, where X ∼ Multinomial(n, p), and p = (p1, p2, p3, p4), arranged as a 2 × 2 contingency table:

      p1      p2      α
      p3      p4      1 − α
      β       1 − β   1
                              (16.25)

Here α = p1 + p2 and β = p1 + p3. The null hypothesis we want to test is that rows and columns are independent, that is,

H0 : p1 = αβ, p2 = α(1− β), p3 = (1− α)β, p4 = (1− α)(1− β). (16.26)

The alternative is that the p is unrestricted:

HA : p ∈ {θ ∈ R⁴ | θi > 0 and θ1 + · · · + θ4 = 1}. (16.27)

Technically, we should exclude the null from the alternative space.

Now for the LRT. The likelihood is

L(p ; x) = p1^{x1} p2^{x2} p3^{x3} p4^{x4}. (16.28)

Denoting the MLE of pi under the null by p̂0i and under the alternative by p̂Ai for each i, we have that the LRT statistic can be written

2 log(Λ(x)) = 2 ∑_{i=1}^{4} x_i log( n p̂_{Ai} / (n p̂_{0i}) ). (16.29)


The n's in the logs are unnecessary, but it is convention in contingency tables to write this statistic in terms of the expected counts in the cells, E[Xi] = npi, so that

2 log(Λ(x)) = 2 ∑_{i=1}^{4} Obs_i log( Exp_{Ai} / Exp_{0i} ), (16.30)

where Obs_i is the observed count x_i, and the Exp's are the expected counts under the two hypotheses.

Exercise 13.8.23 shows that the MLEs under the null in (16.26) of α and β are α̂ = (x1 + x2)/n and β̂ = (x1 + x3)/n, hence

p̂01 = α̂β̂, p̂02 = α̂(1 − β̂), p̂03 = (1 − α̂)β̂, p̂04 = (1 − α̂)(1 − β̂). (16.31)

Under the alternative, from Exercise 13.8.22 we know that p̂Ai = xi/n for each i. In this case, Exp_{Ai} = Obs_i. The statistic is then

2 log(Λ(x)) = 2 ( x1 log( x1/(n α̂β̂) ) + x2 log( x2/(n α̂(1 − β̂)) ) + x3 log( x3/(n (1 − α̂)β̂) ) + x4 log( x4/(n (1 − α̂)(1 − β̂)) ) ). (16.32)

The alternative space (16.27) has only three free parameters, since one is free to choose three pi's (within bounds), but then the fourth is set since they must sum to 1. The null space is the set of p that satisfy the parametrization given in (16.26) for (α, β) ∈ (0, 1) × (0, 1), yielding two free parameters. Thus the difference in dimensions is 1, hence the 2 log(Λ(x)) is asymptotically χ²_1.
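A minimal sketch of the statistic (16.32), assuming Python with numpy and scipy.stats; the 2 × 2 table of counts is made up for illustration:

    import numpy as np
    from scipy import stats

    x = np.array([[30, 10],     # hypothetical 2x2 table of counts: (x1, x2; x3, x4)
                  [20, 25]])
    n = x.sum()
    alpha_hat = x[0, :].sum() / n     # MLE of alpha under the null
    beta_hat = x[:, 0].sum() / n      # MLE of beta under the null
    exp0 = n * np.outer([alpha_hat, 1 - alpha_hat], [beta_hat, 1 - beta_hat])

    two_log_lambda = 2 * np.sum(x * np.log(x / exp0))    # (16.30) with Exp_Ai = Obs_i
    p_value = stats.chi2.sf(two_log_lambda, df=1)
    print(two_log_lambda, p_value)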

16.1.4 Checking the dimension

For many hypotheses, it is straightforward to count the number of free parameters in the parameter space. In some more complicated models, it may not be so obvious. I don't know of a universal approach to counting the number, but if you do have what you think is a set of free parameters, you can check its validity by checking the Cramér conditions for the model when written in terms of those parameters. In particular, if there are K parameters, the parameter space must be open in R^K, and the Fisher information matrix must be finite and positive definite.

16.2 Asymptotic null distribution of the LRT statistic

Similar to the way that, under conditions, the MLE is asymptotically (multivariate) normal, the 2 log(Λ(X)) is asymptotically χ² under the null as given in (16.2). We will not present the proof of the general result, but sketch it when the null is simple. We have X1, . . . , Xn iid, each with density f(xi | θ), θ ∈ T ⊂ R^K, and make the assumptions in Section 14.4 used for likelihood estimation. Consider testing

H0 : θ = θ0 versus HA : θ ∈ T − {θ0} (16.33)

for fixed θ0 ∈ T. One of the assumptions is that T is an open set, so that θ0 is in the interior of T. In particular, this assumption rules out one-sided tests. The MLE


under the null is thus the fixed θ̂0 = θ0, and the MLE under the alternative, θ̂A, is the usual MLE.

From (16.1), since the xi's are iid,

l_n(θ ; x1, . . . , xn) = l_n(θ) = ∑_{i=1}^{n} l_1(θ ; x_i), (16.34)

where l_1(θ ; x_i) = log(f(x_i | θ)) is the loglikelihood in one observation, and we have dropped the x_i's from the notation of l_n for simplicity, as in Section 14.1. Thus the log of the likelihood ratio is

2 log(Λ(x)) = 2 (l_n(θ̂A) − l_n(θ0)). (16.35)

Expand l_n(θ0) around θ̂A in a Taylor series:

l_n(θ0) ≈ l_n(θ̂A) + (θ0 − θ̂A)′∇l_n(θ̂A) − ½ (θ0 − θ̂A)′I_n(θ̂A)(θ0 − θ̂A), (16.36)

where the score function ∇l_n is given in (14.77) and the observed Fisher information matrix I_n is given in (14.81).

As in (14.78), ∇l_n(θ̂A) = 0 because the score is zero at the MLE. Thus by (16.35),

2 log(Λ(x)) ≈ (θ0 − θ̂A)′I_n(θ̂A)(θ0 − θ̂A). (16.37)

Now suppose H0 is true, so that θ0 is the true value of the parameter. As in (14.86), we have

I_n^{1/2}(θ̂A)(θ̂A − θ0) →D Z ∼ N(0, I_K). (16.38)

Then using the mapping Lemma 9.1 on page 139,

2 log(Λ(x)) →D Z′Z ∼ χ²_K. (16.39)

16.2.1 Composite null

Again with T ⊂ R^K, where T is open, we consider the null hypothesis that sets part of θ to zero. That is, partition

θ = (θ^{(1)′}, θ^{(2)′})′, θ^{(1)} is K1 × 1, θ^{(2)} is K2 × 1, K1 + K2 = K. (16.40)

The problem is to test

H0 : θ^{(1)} = 0 versus HA : θ^{(1)} ≠ 0, (16.41)

with θ(2) unspecified. More precisely,

H0 : θ ∈ T0 = {θ ∈ T | θ^{(1)} = 0} versus HA : θ ∈ TA = T − T0. (16.42)

The parameters in θ^{(2)} are called nuisance parameters, because they are not of primary interest, but still need to be dealt with. Without them, we would have a nice simple null. The main result follows. See Theorem 7.7.4 in Lehmann (2004) for proof.


Theorem 16.1. If the Cramér assumptions in Section 14.4 hold, then under the null, the LRT statistic Λ for problem (16.42) has

2 log(Λ(X)) →D χ²_{K1}. (16.43)

This theorem takes care of the simple null case as well, where K2 = 0.

Setting some θi's to zero may seem to be an overly restrictive type of null, but many testing problems can be reparameterized into that form. For example, suppose X1, . . . , Xn are iid N(µX, σ²), and Y1, . . . , Yn are iid N(µY, σ²), and the Xi's and Yi's are independent. We wish to test µX = µY, with σ² unknown. Then the hypotheses are

H0 : µX = µY, σ² > 0 versus HA : µX ≠ µY, σ² > 0. (16.44)

To put these in the form (16.42), take the one-to-one reparameterization

θ = (µX − µY, µX + µY, σ²)′; θ^{(1)} = µX − µY and θ^{(2)} = (µX + µY, σ²)′. (16.45)

Then

T0 = {0} × R × (0, ∞) and T = R² × (0, ∞). (16.46)

Here, K1 = 1, and there are K2 = 2 nuisance parameters, µX + µY and σ². Thus the asymptotic χ² has 1 degree of freedom.

16.3 Score tests

When the null hypothesis is simple, tests based directly on the score function can often be simpler to implement than the LRT, since we do not need to find the MLE under the alternative. Start with the iid model having a one-dimensional parameter, so that X1, . . . , Xn are iid with density f(xi | θ), θ ∈ R. Consider the one-sided testing problem,

H0 : θ = θ0 versus HA : θ > θ0. (16.47)

The best test for a simple alternative θA > θ0 has test statistic ∏ f(xi | θA)/∏ f(xi | θ0). Here we take the log of that ratio, and expand it in a Taylor series in θA around θ0, so that

∑_{i=1}^{n} log( f(x_i | θA)/f(x_i | θ0) ) = l_n(θA) − l_n(θ0) ≈ (θA − θ0) l′_n(θ0), (16.48)

where l′_n(θ) is the score function in n observations as in (14.3). The test statistic in (16.48) is approximately the best statistic for alternative θA when θA is very close to θ0.

For fixed θA > θ0, the test that rejects the null when (θA − θ0)l′_n(θ0) > c is then the same as the test that rejects when l′_n(θ0) > c*. Since we are in the iid case, l′_n(θ) = ∑_{i=1}^{n} l′_1(θ ; x_i), where l′_1 is the score for one observation. Under the null hypothesis,

E_{θ0}[l′_1(θ0 ; X_i)] = 0 and Var_{θ0}[l′_1(θ0 ; X_i)] = I_1(θ0) (16.49)


as in Lemma 14.1 on page 226, hence by the central limit theorem,

T_n(X) ≡ l′_n(θ0 ; X) / √(n I_1(θ0)) →D N(0, 1) under the null hypothesis. (16.50)

Then the one-sided score test

Rejects the null hypothesis when Tn(x) > zα, (16.51)

which is approximately level α. Note that we did not need the MLE under the alternative.

For example, suppose the Xi’s are iid under the Cauchy location family, so that

f(x_i | θ) = (1/π) · 1/(1 + (x_i − θ)²). (16.52)

We wish to test

H0 : θ = 0 versus HA : θ > 0. (16.53)

The score at θ = θ0 = 0 and information in one observation are, respectively,

l′_1(0 ; x_i) = 2x_i/(1 + x_i²) and I_1(0) = 1/2. (16.54)

See (14.74) for I1. Then the test statistic is

T_n(x) = l′_n(0 ; x) / √(n I_1(0)) = [ ∑_{i=1}^{n} 2x_i/(1 + x_i²) ] / √(n/2) = (2√2/√n) ∑_{i=1}^{n} x_i/(1 + x_i²), (16.55)

and for an approximate 0.05 level, the cutoff point is z_{0.05} = 1.645. If the information is difficult to calculate, which it is not in this example, then you can use the observed information at θ0 instead. This test has relatively good power for small θ. However, note that as θ gets large, so do the x_i's, hence T_n(x) becomes small. Thus the power at large θ's is poor, even below the level α.
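A small sketch of the score test (16.55), assuming Python with numpy and scipy.stats and simulated Cauchy data:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    x = stats.cauchy.rvs(loc=0.5, size=50, random_state=rng)   # simulated Cauchy data

    Tn = (2 * np.sqrt(2) / np.sqrt(len(x))) * np.sum(x / (1 + x**2))   # (16.55)
    print(Tn, Tn > stats.norm.ppf(0.95))    # compare to z_0.05 = 1.645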

16.3.1 Many-sided

Now consider a more general problem, where θ ∈ T ⊂ R^K, and the null, θ0, is in the interior of T. The testing problem is

H0 : θ = θ0 versus HA : θ ∈ T − {θ0}. (16.56)

Here the score is the vector of partial derivatives of the loglikelihood, ∇l_1(θ ; X), as in (14.77). Equations (14.79) and (14.80) show that, under the null,

E[∇l_1(θ0 ; X_i)] = 0 and Cov[∇l_1(θ0 ; X_i)] = I_1(θ0). (16.57)

Thus, again in the iid case, the multivariate central limit theorem shows that

(1/√n) ∇l_n(θ0) →D N(0, I_1(θ0)). (16.58)


Then the mapping theorem and (7.53) on the χ² show that under the null

S_n² ≡ (1/n) ∇l_n(θ0)′ I_1^{−1}(θ0) ∇l_n(θ0) →D χ²_K. (16.59)

The S_n² is the statistic for the score test, and the approximate level α score test rejects the null hypothesis if

S_n² > χ²_{K,α}. (16.60)

Multinomial distribution

Suppose X ∼ Multinomial(n, (p1, p2, p3)), and we wish to test that the probabilities are equal:

H0 : p1 = p2 = 1/3 versus HA : (p1, p2) ≠ (1/3, 1/3). (16.61)

We've left out p3 because it is a function of the other two. Leaving it in will violate the openness of the parameter space in R³.

The loglikelihood is

ln(p1, p2 ; x) = x1 log(p1) + x2 log(p2) + x3 log(1− p1 − p2). (16.62)

Exercise 16.7.12 shows that the score at the null is

∇l_n(1/3, 1/3) = 3 ( x1 − x3, x2 − x3 )′, (16.63)

and the Fisher information matrix is

I_n(1/3, 1/3) = 3n ( 2  1
                     1  2 ). (16.64)

Thus since In = nI1, after some manipulation, the score statistic is

S_n² = ∇l_n(θ0)′ I_n^{−1}(θ0) ∇l_n(θ0)
     = (3/n) ( X1 − X3, X2 − X3 ) ( 2  1
                                    1  2 )^{−1} ( X1 − X3, X2 − X3 )′
     = (1/n) ( (X1 − X3)² + (X2 − X3)² + (X1 − X2)² ). (16.65)

The cutoff point is χ²_{2,α}, because there are K = 2 parameters. The statistic looks reasonable, because the Xi's would tend to be different if their pi's were. Also, it may not look like it, but this S_n² is the same as the Pearson χ² statistic for these hypotheses, which is

X² = ∑_{i=1}^{3} (X_i − n/3)²/(n/3) = ∑_{i=1}^{3} (Obs_i − Exp_i)²/Exp_i. (16.66)

Here, Obs_i is the observed count x_i as above, and Exp_i is the expected count under the null, which here is n/3 for each i.
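A quick check, assuming Python with numpy and scipy.stats and made-up counts, that the score statistic (16.65) and the Pearson statistic (16.66) agree:

    import numpy as np
    from scipy import stats

    x = np.array([18, 25, 32])    # hypothetical counts, n = 75
    n = x.sum()

    S2 = ((x[0] - x[2])**2 + (x[1] - x[2])**2 + (x[0] - x[1])**2) / n    # (16.65)
    X2 = np.sum((x - n / 3)**2 / (n / 3))                                # (16.66)
    print(S2, X2, stats.chi2.sf(S2, df=2))   # S2 and X2 agree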


16.4 Model selection: AIC and BIC

We often have a number of models we wish to consider, rather than just two as in hypothesis testing. (Note also that hypothesis testing may not be appropriate even when choosing between two models, e.g., when there is no obvious allocation to "null" and "alternative" models.) For example, in the regression or logistic regression model, each subset of explanatory variables defines a different model. Here, we assume there are K models under consideration, labeled M1, M2, . . . , MK. Each model is based on the same data, Y, but has its own density and parameter space:

Model Mk ⇒ Y ∼ fk(y | θk), θk ∈ Tk. (16.67)

The densities need not have anything to do with each other, i.e., one could be normal, another uniform, another logistic, etc., although often they will be of the same family. It is possible that the models will overlap, so that several models might be correct at once, e.g., when there are nested models.

Let

lk(θk ; y) = log(Lk(θk ; y)) = log( fk(y | θk)) + C(y), k = 1, . . . , K, (16.68)

be the loglikelihoods for the models. The constant C(y) is arbitrary, being the log of the constant multiplier in the likelihood from Definition 13.1 on page 199. As long as it is the same for each k, it will not affect the outcome of the following procedures. Define the deviance of the model Mk at parameter value θk by

Deviance(Mk(θk) ; y) = −2 lk(θk ; y). (16.69)

It is a measure of fit of the model to the data; the smaller the deviance, the better the fit. The MLE of θk for model Mk minimizes this deviance, giving us the observed deviance,

Deviance(M_k(θ̂_k) ; y) = −2 l_k(θ̂_k ; y) = −2 max_{θk∈Tk} l_k(θ_k ; y). (16.70)

Note that the likelihood ratio statistic in (16.2) is just the difference in observed deviance of the two hypothesized models:

2 log(Λ(y)) = Deviance(H0(θ̂0) ; y) − Deviance(HA(θ̂A) ; y). (16.71)

At first blush one might decide the best model is the one with the smallest observed deviance. The problem with that approach is that because the deviances are based on minus the maximum of the likelihoods, the model with the best observed deviance will be the largest model, i.e., one with highest dimension. Instead, we add a penalty depending on the dimension of the parameter space, as for Mallows' Cp in (12.61). The two most popular likelihood-based procedures are the Bayes information criterion (BIC) of Schwarz (1978) and the Akaike information criterion (AIC) of Akaike (1974) (who actually meant for the "A" to stand for "An"):

BIC(M_k ; y) = Deviance(M_k(θ̂_k) ; y) + log(n) d_k, and
AIC(M_k ; y) = Deviance(M_k(θ̂_k) ; y) + 2 d_k, (16.72)

where

d_k = dim(T_k). (16.73)


Whichever criterion is used, it is implemented by finding the value for each model, then choosing the model with the smallest value of the criterion, or looking at the models with the smallest values.

Note that the only difference between AIC and BIC is the factor multiplying the dimension in the penalty component. The BIC penalizes each dimension more heavily than does the AIC, at least if n > 7, so tends to choose more parsimonious models. In more complex situations than we deal with here, the deviance information criterion is useful, which uses more general definitions of the deviance. See Spiegelhalter, Best, Carlin, and van der Linde (2002).

The AIC and BIC have somewhat different motivations. The BIC, as hinted at by the "Bayes" in the name, is an attempt to estimate the Bayes posterior probability of the models. More specifically, if the prior probability that model Mk is the true one is πk, then the BIC-based estimate of the posterior probability is

P_BIC[M_k | y] = e^{−½ BIC(M_k ; y)} π_k / ( e^{−½ BIC(M_1 ; y)} π_1 + · · · + e^{−½ BIC(M_K ; y)} π_K ). (16.74)

If the prior probabilities are taken to be equal, then because each posterior probability has the same denominator, the model that has the highest posterior probability is indeed the model with the smallest value of BIC. The advantage of the posterior probability form is that it is easy to assess which models are nearly as good as the best, if there are any.
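A minimal sketch of (16.72) and (16.74), assuming Python with numpy and scipy.stats, comparing two normal-mean models on simulated data:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    y = rng.normal(loc=0.4, scale=1.0, size=40)
    n = len(y)

    def deviance(mu_hat):
        # -2 times the maximized normal loglikelihood with the MLE of sigma^2 plugged in
        sigma2_hat = np.mean((y - mu_hat)**2)
        return -2 * np.sum(stats.norm.logpdf(y, mu_hat, np.sqrt(sigma2_hat)))

    # Model 1: mu = 0 (d = 1 free parameter); Model 2: mu unrestricted (d = 2)
    dev = np.array([deviance(0.0), deviance(y.mean())])
    d = np.array([1, 2])
    AIC = dev + 2 * d
    BIC = dev + np.log(n) * d

    post = np.exp(-0.5 * (BIC - BIC.min()))   # equal prior probabilities pi_k
    post /= post.sum()                        # BIC-based posterior, as in (16.74)
    print(AIC, BIC, post)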

The next two sections present some further details on the two criteria.

16.5 BIC: Motivation

To see where the approximation in (16.74) arises, we first need a prior on the parameter space. As we did in Section 15.4 for hypothesis testing, we decompose the overall prior into conditional ones for each model. The marginal probability of each model is the prior probability:

πk = P[Mk]. (16.75)

For a model M, where the parameter is d-dimensional, let the prior be

θ |M ∼ Nd(θ0, Σ0). (16.76)

Then the density of Y in (16.67), conditioning on the model, is

g(y | M) = ∫_T f(y | θ) φ(θ | θ0, Σ0) dθ, (16.77)

where φ is the multivariate normal pdf.

We will use the Laplace approximation, as in Schwarz (1978), to approximate this density. The following requires a number of regularity assumptions, not all of which we will detail, including the Cramér conditions in Section 14.4. In particular, we assume Y consists of n iid observations, where n is large. Since l_n(θ) ≡ l_n(θ ; y) = log(f(y | θ)),

g(y | M) = ∫_T e^{l_n(θ)} φ(θ | θ0, Σ0) dθ. (16.78)


The Laplace approximation expands l_n(θ) around its maximum, the maximum occurring at the maximum likelihood estimator θ̂. Then as in (16.36),

l_n(θ) ≈ l_n(θ̂) + (θ − θ̂)′∇l_n(θ̂) − ½ (θ − θ̂)′I_n(θ̂)(θ − θ̂)
       = l_n(θ̂) − ½ (θ − θ̂)′I_n(θ̂)(θ − θ̂), (16.79)

where the score function at the MLE is zero: ∇l_n(θ̂) = 0, and I_n is the d × d observed Fisher information matrix. Then (16.78) and (16.79) combine to show that

g(y | M) ≈ e^{l_n(θ̂)} ∫_T e^{−½(θ−θ̂)′I_n(θ̂)(θ−θ̂)} φ(θ | θ0, Σ0) dθ. (16.80)

Kass and Wasserman (1995) give precise details on approximating this g. We willbe more heuristic. To whit, the first term in the integrand in (16.80) looks like a

N(θ, I−1n (θ)) pdf for θ without the constant, which constant would be

√|In(θ)| /

√2π

d. Putting in and taking out the constant yields

g(y |M) ≈ eln(θ)√

2πd√

|In(θ)|

∫Rd

φ(θ | θ, I−1n (θ))φ(θ | θ0, Σ0)dθ. (16.81)

Mathematically, this integral is the marginal pdf of θ̂ when its conditional distribution given θ is N(θ, I_n^{−1}(θ̂)) and the prior distribution of θ given the model is N(θ_0, Σ_0) as in (16.76). Exercise 7.8.15(e) shows that this marginal is then N(θ_0, Σ_0 + I_n^{−1}(θ̂)). Using this marginal pdf yields

g(y | M) ≈ e^{l_n(θ̂)} ((√(2π))^d / √|I_n(θ̂)|) (1 / ((√(2π))^d √|Σ_0 + I_n^{−1}(θ̂)|)) e^{−(1/2)(θ̂ − θ_0)′(Σ_0 + I_n^{−1}(θ̂))^{−1}(θ̂ − θ_0)}
         = e^{l_n(θ̂)} (1 / √|I_n(θ̂)Σ_0 + I_d|) e^{−(1/2)(θ̂ − θ_0)′(Σ_0 + I_n^{−1}(θ̂))^{−1}(θ̂ − θ_0)},  (16.82)

where the I_d is the d × d identity matrix.

In Section 15.4 on Bayesian testing, we saw some justification for taking a prior that has about as much information as does one observation. In this case, the information in the n observations is I_n(θ̂), so it would be reasonable to take Σ_0^{−1} to be I_n(θ̂)/n, giving us

g(y | M) ≈ e^{l_n(θ̂)} (1 / √|n I_d + I_d|) e^{−(1/2)(θ̂ − θ_0)′((n + 1) I_n^{−1}(θ̂))^{−1}(θ̂ − θ_0)}
         = e^{l_n(θ̂)} (1 / √((n + 1)^d)) e^{−(1/2)(θ̂ − θ_0)′((n + 1) I_n^{−1}(θ̂))^{−1}(θ̂ − θ_0)}.  (16.83)

The final approximation in the BIC works on the logs:

log(g(y | M)) ≈ l_n(θ̂) − (d/2) log(n + 1) − (1/2)(θ̂ − θ_0)′((n + 1) I_n^{−1}(θ̂))^{−1}(θ̂ − θ_0)
             ≈ l_n(θ̂) − (d/2) log(n).  (16.84)


The last step shows two further approximations. For large n, replacing n + 1 with n in the log is very minor. The justification for erasing the final quadratic term is that the first term on the right, l_n(θ̂), is of order n, and the second term is of order log(n), while the final term is of constant order since I_n(θ̂)/(n + 1) ≈ I_1(θ̂). Thus for large n it can be dropped. There are a number of approximations and heuristics in this derivation, and indeed the resulting approximation may not be especially good. See Berger, Ghosh, and Mukhopadhyay (2003), for example. A nice property is that under conditions, if one of the considered models is the correct one, then the BIC chooses the correct model as n → ∞.

The final expression in (16.84) is the BIC approximation to the log of the marginal density. The BIC statistic itself is based on the deviance, that is, for model M,

BIC(M ; y) = −2 l_n(θ̂) + d log(n) = Deviance(M(θ̂) ; y) + log(n) d ≈ −2 log(g(y | M)),  (16.85)

as in (16.72). Given a number of models, M_1, . . . , M_K, each with its own marginal prior probability π_k and conditional marginal density g_k(y | M_k), the posterior probability of the model is

P[M_k | y] = g_k(y | M_k) π_k / (g_1(y | M_1) π_1 + · · · + g_K(y | M_K) π_K).  (16.86)

Thus from (16.85), we have the BIC-based estimate of g_k,

ĝ_k(y | M_k) = e^{−(1/2) BIC(M_k ; y)},  (16.87)

hence replacing the g_k’s in (16.86) with their estimates yields the estimated posterior given in (16.74).

16.6 AIC: Motivation

The Akaike information criterion can be thought of as a generalization of Mallows’ C_p from Section 12.5.3, based on deviance rather than error sum of squares. To evaluate model M_k as in (16.67), we imagine fitting the model based on the data Y, then testing it out on a new (unobserved) variable, Y^New, which has the same distribution as and is independent of Y. The measure of discrepancy between the model and the new variable is the deviance in (16.69), where the parameter is estimated using Y. We then take the expected value, yielding the expected predictive deviance,

EPredDev_k = E[Deviance(M_k(θ̂_k) ; Y^New)].  (16.88)

The expected value is over θ̂_k, which depends on only Y, and Y^New.

As for Mallows’ C_p, we estimate the expected predictive deviance using the observed deviance, then add a term to ameliorate the bias. Akaike (1974) argues that for large n, if M_k is the true model,

δ = EPredDev_k − E[Deviance(M_k(θ̂_k) ; Y)] ≈ 2 d_k,  (16.89)

where d_k is the dimension of the model as in (16.73), from which the estimate AIC in (16.72) arises. A good model is then one with a small AIC.


Note also that by adjusting the priors π_k = P[M_k] in (16.74), one can work it so that the model with the lowest AIC has the highest posterior probability. See Exercise 16.7.18.

Akaike’s original motivation was information-theoretic, based on the Kullback-Leibler divergence from density f to density g. This divergence is defined as

KL( f || g) = −∫ g(w) log( f(w) / g(w) ) dw.  (16.90)

For fixed g, the Kullback-Leibler divergence is positive unless g = f, in which case it is zero. For the Akaike information criterion, g is the true density of Y and Y^New, and for model k, f is the density estimated using the maximum likelihood estimate of the parameter, f_k(w | θ̂_k), where θ̂_k is based on Y. Write

KL( f_k(w | θ̂_k) || g) = −∫ g(w) log( f_k(w | θ̂_k)) dw + ∫ g(w) log(g(w)) dw
                       = (1/2) E[Deviance(M_k(θ̂_k) ; Y^New) | Y = y] − Entropy(g).  (16.91)

(The w is representing the Y^New, and the dependence on the observed y is through only θ̂_k.) Here the g, the true density of Y, does not depend on the model M_k, hence neither does its entropy, defined by −∫ g(w) log(g(w)) dw. Thus EPredDev_k from (16.88) is equivalent to (16.91) upon taking the further expectation over Y.

One slight logical glitch in the development is that while the theoretical criterion (16.88) is defined assuming Y and Y^New have the true distribution, the approximation in (16.89) assumes the true distribution is contained in the model M_k. Thus it appears that the approximation is valid for all models under consideration only if the true distribution is contained in all the models. Even so, the AIC is a legitimate method for model selection. See the book Burnham and Anderson (2003) for more information.

Rather than justify the result in full generality, we will follow Hurvich and Tsai (1989) and derive the exact value for δ in multiple regression.

16.6.1 Multiple regression

The multiple regression model (12.9) is

Model M : Y ∼ N_n(xβ, σ² I_n), β ∈ R^p,  (16.92)

where x is n × p. Now from (16.14),

l(β, σ² ; y) = −(n/2) log(σ²) − ‖y − xβ‖² / (2σ²).  (16.93)

Exercise 13.8.20 shows that the MLEs are

β̂ = (x′x)^{−1} x′y and σ̂² = (1/n) ‖y − xβ̂‖².  (16.94)

Using (16.69), we see that the deviances evaluated at the data Y and the unobserved Y^New are, respectively,

Deviance(M(β̂, σ̂²) ; Y) = n log(σ̂²) + n, and
Deviance(M(β̂, σ̂²) ; Y^New) = n log(σ̂²) + ‖Y^New − xβ̂‖² / σ̂².  (16.95)


The first terms on the right-hand sides in (16.95) are the same, hence the difference in (16.89) is

δ = E[‖U‖² / σ̂²] − n, where U = Y^New − xβ̂ = Y^New − P_x Y.  (16.96)

From Theorem 12.1 on page 183, we know that β̂ and σ̂² are independent, and further both are independent of Y^New, hence we have

E[‖U‖² / σ̂²] = E[‖U‖²] E[1/σ̂²].  (16.97)

Exercise 16.7.16 shows that

δ = (n / (n − p − 2)) 2(p + 1),  (16.98)

where in the “(p + 1)” term, the “p” is the number of β_i’s and the “1” is for the σ². Then from (16.89), the estimate of EPredDev is

AICc(M ; y) = Deviance(M(β̂, σ̂²) ; y) + (n / (n − p − 2)) 2(p + 1).  (16.99)

The lower case “c” stands for “corrected.” For large n, δ ≈ 2(p + 1).
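To make the calculation concrete, here is a minimal R sketch of the AICc in (16.99) for a normal linear regression fit with lm(), using the deviance form in (16.95). The helper name aicc.lm and the simulated data are illustrative assumptions, not part of the text.

# AICc for a fitted lm(), following (16.95) and (16.99).
aicc.lm <- function(fit) {
  n <- length(residuals(fit))
  p <- length(coef(fit))                    # number of beta's (including intercept)
  sigma2.hat <- sum(residuals(fit)^2) / n   # MLE of sigma^2 as in (16.94)
  deviance <- n * log(sigma2.hat) + n       # Deviance(M(beta.hat, sigma2.hat); y)
  deviance + (n / (n - p - 2)) * 2 * (p + 1)
}

# Example with simulated data:
set.seed(1)
x <- rnorm(50); y <- 1 + 2 * x + rnorm(50)
aicc.lm(lm(y ~ x))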

16.7 Exercises

Exercise 16.7.1. Continue with the polio example from Exercise 15.7.11, where here we look at the LRT. Thus we have X_V and X_C independent,

X_V ∼ Poisson(c_V θ_V) and X_C ∼ Poisson(c_C θ_C),  (16.100)

where c_V and c_C are known constants, and θ_V > 0 and θ_C > 0. We wish to test

H_0 : θ_V = θ_C versus H_A : θ_V ≠ θ_C.  (16.101)

(a) Under the alternative, find the MLEs of θ_V and θ_C. (b) Under the null, find the common MLE of θ_V and θ_C. (c) Find the 2 log(Λ) version of the LRT statistic. What are the degrees of freedom in the asymptotic χ²? (d) Now look at the polio data presented in Exercise 15.7.11, where x_V = 57, x_C = 142, c_V = 2.00745, and c_C = 2.01229. What are the values of the MLEs for these data? What is the value of the 2 log(Λ)? Do you reject the null hypothesis?

Exercise 16.7.2. Suppose X_1, . . . , X_n are iid N(µ_X, σ²), and Y_1, . . . , Y_m are iid N(µ_Y, σ²), and the X_i’s and Y_i’s are independent. We wish to test µ_X = µ_Y, with σ² unknown. Then the hypotheses are

H_0 : µ_X = µ_Y, σ² > 0 versus H_A : µ_X ≠ µ_Y, σ² > 0.  (16.102)

(a) Find the MLEs of the parameters under the null hypothesis. (b) Find the MLEs of the parameters under the alternative hypothesis. (c) Find the LRT statistic. (d) Letting σ̂²_0 and σ̂²_A be the MLEs for σ² under the null and alternative, respectively, show that

(n + m)(σ̂²_0 − σ̂²_A) = k_{n,m} (x̄ − ȳ)²  (16.103)


for some constant k_{n,m} that depends on n and m. Find the k_{n,m}. (e) Let T be the two-sample t-statistic,

T = (x̄ − ȳ) / (s_pooled √(1/n + 1/m)),  (16.104)

where s²_pooled is the pooled variance estimate,

s²_pooled = (∑(x_i − x̄)² + ∑(y_i − ȳ)²) / (n + m − 2).  (16.105)

What is the distribution of T under the null hypothesis? (You don’t have to prove it; just say what the distribution is.) (f) Show that the LRT statistic is an increasing function of T².

Exercise 16.7.3. Consider the regression problem in Section 16.1.2, where Y ∼ N(xβ, σ² I_n), x′x is invertible, β = (β_1′, β_2′)′, x = (x_1, x_2), and we test

H_0 : β_2 = 0 versus H_A : β_2 ≠ 0.  (16.106)

(a) Let A be an invertible p × p matrix, and rewrite the model with xβ replaced by x*β*, where x* = xA and β* = A^{−1}β, so that xβ = x*β*. Show that the projection matrices for x and x* are the same, P_x = P_{x*} (hence Q_x = Q_{x*}). (b) Now take A as in (16.19):

A = ( I_{p_1}    −(x_1′x_1)^{−1} x_1′ x_2
       0          I_{p_2}               ).  (16.107)

Show that with this A, x* = (x_1, x*_2) and β* = (β*_1′, β_2′)′, where x*_2 = Q_{x_1} x_2. Give β*_1 explicitly. We use this x* and β* for the remainder of this exercise. (c) Show that x_1′ x*_2 = 0. (d) Show that

P_x = P_{x_1} + P_{x*_2}.  (16.108)

(e) Writing Y ∼ N(x*β*, σ² I_n), show that the joint distribution of Q_x Y and P_{x*_2} Y is

( Q_x Y      )        ( (    0      )      ( Q_x     0        ) )
( P_{x*_2} Y )  ∼  N  ( ( x*_2 β_2  ),  σ² (  0    P_{x*_2}   ) ).  (16.109)

(f) Finally, argue that Y′P_{x*_2}Y is independent of Y′Q_x Y, and when β_2 = 0, Y′P_{x*_2}Y ∼ σ²χ²_{p_2}. Thus the statistic in (16.24) is F_{p_2, n−p} under the null hypothesis.

Exercise 16.7.4. Continue with the model in Section 16.1.2 and Exercise 16.7.3. (a) With x* = xA for A invertible, show that β̂ is the least squares estimate in the original model if and only if β̂* is the least squares estimate in the model with Y ∼ N(x*β*, σ² I_n). [Hint: Start by noting that if β̂ minimizes ‖y − xβ‖² over β, then it must minimize ‖y − x*A^{−1}β‖² over β.] (b) Apply part (a) to show that with the A in (16.107), the least squares estimate of β_2 is the same whether using the model with xβ or x*β*. (c) We know that using the model with xβ, the least squares estimator β̂ ∼ N(β, σ²C) where C = (x′x)^{−1}, hence Cov[β̂_2] = σ² C_22, with C_22 the lower-right p_2 × p_2 block of C. Show that for β̂_2 using x*β*,

β̂_2 = (x*_2′ x*_2)^{−1} x*_2′ Y ∼ N(β_2, σ²(x*_2′ x*_2)^{−1}),  (16.110)


hence C_22 = (x*_2′ x*_2)^{−1}. (d) Show that

β̂_2′ C_22^{−1} β̂_2 = y′ P_{x*_2} y.  (16.111)

(e) Argue then that the F statistic in (16.24) is the same as that in (15.24).

Exercise 16.7.5. Refer back to the snoring and heart disease data in Exercise 14.9.7. The data consists of (Y_1, x_1), . . . , (Y_n, x_n) independent observations, where each Y_i ∼ Bernoulli(p_i) indicates whether person i has heart disease. The p_i’s follow a logistic regression model, logit(p_i) = α + βx_i, where x_i is the extent to which the person snores. Here are the data again:

Frequency of snoring ↓     x_i    Heart disease? →  Yes     No
Never                      −3                        24    1355
Occasionally               −1                        35     603
Nearly every night          1                        21     192
Every night                 3                        30     224
                                                                 (16.112)

The MLEs are α̂ = −2.79558, β̂ = 0.32726. Consider testing H_0 : β = 0 versus H_A : β ≠ 0. We could perform an approximate z-test, but here find the LRT. The loglikelihood is ∑_{i=1}^n (y_i log(p_i) + (1 − y_i) log(1 − p_i)). (a) Find the value of the loglikelihood under the alternative, l_n(α̂, β̂). (b) The null hypothesis implies that the p_i’s are all equal. What is their common MLE under H_0? What is the value of the loglikelihood? (c) Find the 2 log(Λ(y, x)) statistic. (d) What are the dimensions of the two parameter spaces? What are the degrees of freedom in the asymptotic χ²? (e) Test the null hypothesis with α = 0.05. What do you conclude?

Exercise 16.7.6. Continue with the snoring and heart disease data from Exercise 16.7.5. The model fit was a simple linear logistic regression. It is possible there is a more complicated relationship between snoring and heart disease. This exercise tests the “goodness-of-fit” of this model. Here, the linear logistic regression model is the null hypothesis. The alternative is the “saturated” model where p_i depends on x_i in an arbitrary way. That is, the alternative hypothesis is that there are four probabilities corresponding to the four possible x_i’s: q_{−3}, q_{−1}, q_1, and q_3. Then for person i, p_i = q_{x_i}. The hypotheses are

H_0 : logit(p_i) = α + βx_i, (α, β) ∈ R²  versus  H_A : p_i = q_{x_i}, (q_{−3}, q_{−1}, q_1, q_3) ∈ (0, 1)^4.  (16.113)

(a) Find the MLEs of the four q_j’s under the alternative, and the value of the loglikelihood. (b) Find the 2 log(Λ(y, x)) statistic. (The loglikelihood for the null here is the same as that for the alternative in the previous exercise.) (c) Find the dimensions of the two parameter spaces, and the degrees of freedom in the χ². (d) Do you reject the null for α = 0.05? What do you conclude?

Exercise 16.7.7. Lazarsfeld, Berelson, and Gaudet (1968) collected some data to determine the relationship between level of education and intention to vote in an election. The variables of interest were

• X = Education: 0 = Some high school, 1 = No high school;


• Y = Interest: 0 = Great political interest, 1 = Moderate political interest, 2 = No political interest;

• Z = Vote: 0 = Intends to vote, 1 = Does not intend to vote.

Here is the table of counts N_ijk:

                 Y = 0             Y = 1             Y = 2
           Z = 0    Z = 1    Z = 0    Z = 1    Z = 0    Z = 1
X = 0       490        5      917       69       74       58
X = 1       279        6      602       67      145      100
                                                               (16.114)

That is, N_000 = 490 people had X = Y = Z = 0, etc. You would expect X and Z to be dependent, that is, people with more education are more likely to vote. That’s not the question. The question is whether education and voting are conditionally independent given interest, that is, once you know someone’s level of political interest, knowing their educational level does not help you predict whether they vote.

The model is that the vector N of counts is Multinomial(n, p), with n = 2812 and K = 12 categories. Under the alternative hypothesis, there is no restriction on the parameter p, where

p_ijk = P[X = i, Y = j, Z = k],  (16.115)

and i = 0, 1; j = 0, 1, 2; k = 0, 1. The null hypothesis is that X and Z are conditionally independent given Y. Define the following parameters:

r_ij = P[X = i | Y = j], s_kj = P[Z = k | Y = j], and t_j = P[Y = j].  (16.116)

(a) Under the null, p_ijk is what function of the r_ij, s_kj, t_j? (b) Under the alternative hypothesis, what are the MLEs of the p_ijk’s? (Give the numerical answers.) (c) Under the null hypothesis, what are the MLEs of the r_ij’s, s_kj’s, and t_j’s? What are the MLEs of the p_ijk’s? (d) Find the loglikelihoods under the null and alternatives. What is the value of 2 log(Λ(n)) for testing the null vs. the alternative? (e) How many free parameters are there for the alternative hypothesis? How many free parameters are there for the null hypothesis among the r_ij’s? How many free parameters are there for the null hypothesis among the s_kj’s? How many free parameters are there for the null hypothesis among the t_j’s? How many free parameters are there for the null hypothesis total? (f) What are the degrees of freedom for the asymptotic χ² distribution under the null? What is the p-value? What do you conclude? (Use level 0.05.)

Exercise 16.7.8. Suppose

(X_1, Y_1)′, · · · , (X_n, Y_n)′  are iid  N( (0, 0)′, σ² ( 1  ρ
                                                           ρ  1 ) ),  (16.117)

for −1 < ρ < 1 and σ² > 0. The problem is to find the likelihood ratio test of

H_0 : ρ = 0, σ² > 0 versus H_A : ρ ≠ 0, σ² > 0.  (16.118)

Set T_1 = ∑(X_i² + Y_i²) and T_2 = ∑ X_i Y_i, the sufficient statistics. (a) Show that the MLE of σ² is T_1/(2n) under H_0. (b) Under H_A, let U_i = (X_i + Y_i)/√2 and V_i = (X_i − Y_i)/√2. Show that U_i and V_i are independent, U_i ∼ N(0, θ_1) and V_i ∼ N(0, θ_2),


where θ_1 = σ²(1 + ρ) and θ_2 = σ²(1 − ρ). (c) Find the MLEs of θ_1 and θ_2 in terms of the U_i’s and V_i’s. (d) Find the MLEs of σ² and ρ in terms of T_1 and T_2. (e) Use parts (a) and (d) to derive the form of the likelihood ratio test. Show that it is equivalent to rejecting H_0 when 2(T_2/T_1)² > c.

Exercise 16.7.9. Find the score test based on X_1, . . . , X_n, iid with the Laplace location family density (1/2) exp(−|x_i − µ|), for testing H_0 : µ = 0 versus H_A : µ > 0. Recall from (14.74) that the Fisher information here is I_1(µ) = 1, even though the assumptions don’t all hold. (This test is called the sign test. See Section 18.1.)

Exercise 16.7.10. In this question, X_1, . . . , X_n are iid with some location family distribution, density f(x − µ). The hypotheses to test are

H_0 : µ = 0 versus H_A : µ > 0.  (16.119)

For each situation, find the statistic for the score test expressed so that the statistic is asymptotically N(0, 1) under the null. In each case, the score statistic will be

c ∑_{i=1}^n h(x_i) / √n  (16.120)

for some function h and constant c. The f’s: (a) f ∼ N(0, 1). (b) f ∼ Laplace. (See Exercise 16.7.9.) (c) f ∼ Logistic.

Other questions: (d) For which (if any) of the above distributions is the score statistic exactly N(0, 1)? (e) Which distribution (if any) has a corresponding score statistic whose null distribution is the same under any of the above distributions?

Exercise 16.7.11. Suppose X_1, . . . , X_n are iid Poisson(λ). Find the approximate level α = 0.05 score test for testing H_0 : λ = 1 versus H_A : λ > 1.

Exercise 16.7.12. Consider the testing problem in (16.61), where X ∼ Multinomial(n, (p_1, p_2, p_3)) and we test the null that p_1 = p_2 = 1/3. (a) Show that the score function and Fisher information matrix at the null are as given in (16.63) and (16.64). (b) Verify the step from the second to third lines in (16.65) that shows that the score test function is ((X_1 − X_3)² + (X_2 − X_3)² + (X_1 − X_2)²)/n. (c) Find the 2 log(Λ(x)) version of the LRT statistic. Show that it can be written as 2 ∑_{i=1}^3 Obs_i log(Obs_i/Exp_i).

Exercise 16.7.13. Here we refer back to the snoring and heart disease data in Exercises 16.7.5 and 16.7.6. Consider four models:

• M_0: The p_i’s are all equal (the null in Exercise 16.7.5);

• M_1: The linear logistic model: logit(p_i) = α + βx_i (the alternative in Exercise 16.7.5 and the null in Exercise 16.7.6);

• M_2: The quadratic logistic model: logit(p_i) = α + βx_i + γz_i;

• M_3: The saturated model: There is no restriction on the p_i’s (the alternative in Exercise 16.7.6).

The quadratic model M_2 fits a parabola to the logits, rather than just a straight line as in M_1. We could take z_i = x_i², but an equivalent model uses the more numerically convenient “orthogonal polynomials” with z = (1, −1, −1, 1)′. The MLEs of the parameters in the quadratic model are α̂ = −2.7733, β̂ = 0.3352, γ̂ = −0.2484.


(a) For each model, find the numerical value of the maximum loglikelihood (the form ∑(y_i log(p̂_i) + (1 − y_i) log(1 − p̂_i))). (b) Find the dimensions and BICs for the four models. Which has the best (lowest) BIC? (c) Find the BIC-based estimates of the posterior probabilities P[M_k | Y = y]. What do you conclude? (d) Now focus on just models M_1 and M_3, the linear logistic model and saturated model. In Exercise 16.7.6, we (just) rejected M_1 in favor of M_3 at the α = 0.05 level. Find the posterior probability of M_1 among just these two models. Is it close to 5%? What do you conclude about the fit of the linear model?

Exercise 16.7.14. This question uses data on diabetes patients. The data can be found at http://www-stat.stanford.edu/~hastie/Papers/LARS/. There are n = 442 patients, and 10 baseline measurements, which are the predictors. The dependent variable is a measure of the progress of the disease one year after the baseline measurements were taken. The ten predictors include age, sex, BMI, blood pressure, and six blood measurements (hdl, ldl, glucose, etc.) denoted S1, . . . , S6. The prediction problem is to predict the progress of the disease for the next year based on these measurements. Here are the results for some selected subsets:

Name    Subset A                               q_A    RSS_A/n
A       1, 4                                    2     3890.457
B       1, 4, 10                                3     3205.190
C       1, 4, 5, 10                             4     3083.052
D       1, 4, 5, 6, 10                          5     3012.289
E       1, 3, 4, 5, 8, 10                       6     2913.759
F       1, 3, 4, 5, 6, 10                       6     2965.772
G       1, 3, 4, 5, 6, 7, 10                    7     2876.684
H       1, 3, 4, 5, 6, 9, 10                    7     2885.248
I       1, 3, 4, 5, 6, 7, 9, 10                 8     2868.344
J       1, 3, 4, 5, 6, 7, 9, 10, 11             9     2861.347
K       1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11      11     2859.697
                                                               (16.121)

The q_A is the number of predictors included in the model, and RSS_A is the residual sum of squares for the model. (a) Find the AIC, BIC, and Mallows’ C_p’s for these models. Find the BIC-based posterior probabilities of them. (b) Which are the top two models for each criterion? (c) What do you see?

Exercise 16.7.15. This exercise uses more of the hurricane data from Exercise 13.8.21, originally analyzed by Jung et al. (2014). The model is a normal linear regression model, where Y is the log of the number of deaths (plus one). The three explanatory variables are minimum atmospheric pressure, gender of the hurricane’s name (1 = female, 0 = male), and square root of damage costs (in millions of dollars). There are n = 94 observations. The next table has the residual sums of squared errors, SSe, for


each regression model obtained by using a subset of the explanatory variables:

MinPressure    Gender    √Damage    SSe
0              0         0          220.16
0              0         1          100.29
0              1         0          218.48
0              1         1           99.66
1              0         0          137.69
1              0         1           97.69
1              1         0          137.14
1              1         1           97.16
                                           (16.122)

For each model, the included variables are indicated with a “1.” For your information, we have the least squares estimates and their standard errors for the model with all three explanatory variables:

               Estimate    Std. Error
Intercept      12.8777     7.9033
MinPressure    −0.0123     0.0081
Gender          0.1611     0.2303
Damage          0.0151     0.0025
                                           (16.123)

(a) Find the dimensions and BICs for the models. Which one has the best BIC? (b) Find the BIC-based estimates of the posterior probabilities of the models. Which ones have essentially zero probability? Which ones have the highest probability? (c) For each of the three variables, find the probability it is in the true model. Is the gender variable very likely to be in the model?

Exercise 16.7.16. Consider the linear model in (16.92), Y ∼ N_n(xβ, σ² I_n), where β is p × 1 and x′x is invertible. Let σ̂² = ‖y − xβ̂‖²/n denote the MLE of σ², where β̂ is the MLE of β. (a) Show that E[1/σ̂²] = n/(σ²(n − p − 2)) if n > p + 2. (b) Suppose Y^New is independent of Y and has the same distribution as Y, and let U = Y^New − P_x Y as in (16.96), where P_x = x(x′x)^{−1}x′ is given in (12.14). Show that E[U] = 0 and Cov[U] = σ²(I_n + P_x). (c) Argue that E[‖U‖²] = σ² trace(I_n + P_x) = σ²(n + p). (d) Show that then, δ = E[‖U‖²/σ̂²] − n = 2n(p + 1)/(n − p − 2).

Exercise 16.7.17. Consider the subset selection model in regression, but where σ² is known. In this case, Mallows’ C_p can be given as

C_p(A) = RSS_A / σ² + 2 q_A − n,  (16.124)

where, as in Exercise 16.7.14, A is the set of predictors included in the model, q_A is the number of predictors, and RSS_A is the residual sum of squares. Show that the AIC(A) is a monotone function of C_p(A). (Here, n and σ² are known constants.)

Exercise 16.7.18. Show that in (16.74), if we take the prior probabilities as

π_k ∝ (√n / e)^{d_k},  (16.125)

where d_k is the dimension of Model k, then the model that maximizes the estimated posterior probability is the model with the lowest AIC. Note that except for very small n, this prior places relatively more weight on higher-dimensional models.


Chapter 17

Randomization Testing

Up to now, we have been using the sampling model for inference. That is, we have assumed that the data arose by sampling from a (usually infinite) population. E.g., in the two-sample means problem, the model assumes independent random samples from the two populations, and the goal is to infer something about the difference in population means. By contrast, many good studies, especially in agriculture, psychology, and medicine, proceed by first obtaining a group of subjects (farm plots, rats, people), then randomly assigning some of the subjects to one treatment and the rest to another treatment, often a placebo. For example, in the polio study (Exercise 6.8.9), in selected school districts, second graders whose parents volunteered became the subjects of the experiment. About half were randomly assigned to receive the polio vaccine and the rest were assigned a placebo. Such a design yields a randomization model: the set of subjects is the population, and the statistical randomness arises from the randomization within this small population, rather than the sampling from one or two larger populations. Inference is then on the subjects at hand, where we may want to estimate what the means would be if every subject had one treatment, or everyone had the placebo. The distribution of a test statistic depends not on sampling new subjects, but on how the subjects are randomly allocated to treatment.

A key aspect of the randomization model is that under an appropriate null, the distribution of the test statistic can often be found exactly by calculating it under all possible randomizations, of which there are (usually) only a finite number. If the number is too large, sampling a number of randomizations, or asymptotic approximations, are used. Sections 17.1 and 17.2 illustrate the two-treatment randomization model, for numerical and categorical data. Section 17.3 considers a randomization model using the sample correlation as test statistic.

Interestingly, the tests developed for randomization models can also be used in many sampling models. By conditioning on an appropriate statistic, under the null, the randomization distribution of the test statistic is the same as it would be under the randomization model. The resulting tests, or p-values, are again exact, and importantly have desired properties under the unconditional sampling distribution. See Section 17.4.

In Chapter 18 we look at some traditional nonparametric testing procedures. These are based on using the signs and/or ranks of the original data, and have null sampling distributions that follow directly from the randomization distributions


found in the previous sections. They are again exact, and often very robust.

17.1 Randomization model: Two treatments

To illustrate a randomization model, we look at a study by Zelazo, Zelazo, and Kolb (1972) on whether walking exercises helped infants learn to walk. The researchers took 24 one-week-old male infants, and randomly assigned them to one of four treatment groups, so that there were six in each group. We will focus on just two groups: The walking exercise group, who were given exercise specifically developed to teach walking, and the regular exercise group, who were given the same amount of exercise, but without the specific walking exercises. The outcome measured was the age in months when the infant first walked. The question is whether the walking exercise helped the infant walk sooner. The data are

Walking group:   9     9.5    9.75    10      13      9.5
Regular group:  11    10     10       11.75   10.5   15
                                                          (17.1)

The randomization model takes the N = 12 infants as the population, wherein each observation has two values attached to it: x_i is the age first walked if given the walking exercises, and y_i is the age first walked if given the regular exercises. Thus the population is

P = {(x_1, y_1), . . . , (x_N, y_N)},  (17.2)

but we observe x_i for only n = 6 of the infants, and observe y_i only for the other m = 6. Conceptually, the randomization could be accomplished by randomly permuting the twelve observations, then assigning the first six to the walking group, and the rest to the regular group.

There are two popular null hypotheses to consider. The exact null says that the walking treatment has no effect, which means x_i = y_i for all i. The average null states that the averages over the twelve infants of the walking and regular outcomes are equal. We will deal with the exact null here. The alternative could be general, i.e., x_i ≠ y_i, but we will take the specific one-sided alternative that the walking exercise is superior on average for these twelve:

H_0 : x_i = y_i, i = 1, . . . , N versus H_A : (1/N) ∑_{i=1}^N x_i < (1/N) ∑_{i=1}^N y_i.  (17.3)

A reasonable test statistic is the difference of means for the two observed groups:

t_obs = (1/n) ∑_{i∈W} x_i − (1/m) ∑_{i∈R} y_i,  (17.4)

where “W” indicates those assigned to the walking group, and “R” to the regular group. From (17.1) we can calculate the observed T = −1.25.

We will find the p-value, noting that small values of T favor the alternative. The x_i’s and y_i’s are not random here. What is random according to the design of the experiment is which observations are assigned to which treatment. Under the null, we actually do observe all the (x_i, y_i), because x_i = y_i. Thus the null distribution of the statistic is based on randomly assigning n of the values to the walking group, and m to the regular group. One way to represent this randomization is to use


permutations of the vector of values z = (9, 9.5, 9.75, . . . , 10.5, 15)′ from (17.1). An N × N permutation matrix p has exactly one 1 in each row and one 1 in each column, and the rest of the elements are 0, so that pz just permutes the elements of z. For example, if N = 4,

p = ( 0 1 0 0
      0 0 0 1
      0 0 1 0
      1 0 0 0 )  (17.5)

is a 4 × 4 permutation matrix, and

pz = ( 0 1 0 0 ) ( z_1 )   ( z_2 )
     ( 0 0 0 1 ) ( z_2 )   ( z_4 )
     ( 0 0 1 0 ) ( z_3 ) = ( z_3 )
     ( 1 0 0 0 ) ( z_4 )   ( z_1 ).  (17.6)

Let S_N be the set of all N × N permutation matrices. It is called the symmetric group. The difference in means in (17.4) can be represented as a linear function of the data: t_obs = a′z, where

a′ = (1/n, . . . , 1/n, −1/m, . . . , −1/m) = ((1/n) 1_n′, −(1/m) 1_m′),  (17.7)

i.e., there are n of the 1/n’s and m of the −1/m’s. The randomization distribution of the statistic is then given by

T(Pz) = a′Pz, P ∼ Uniform(S_N).  (17.8)

The p-value is the chance of being no larger than t_obs = T(z):

p-value(t_obs) = P[T(Pz) ≤ t_obs] = (1/#S_N) #{p ∈ S_N | T(pz) ≤ t_obs}.  (17.9)

There are 12! ≈ 480 million such permutations, but since we are looking at just the averages of batches of six, the order of the observations within each group is irrelevant. Thus there are really only C(12, 6) = 924 allocations to the two groups we need to find. It is not hard (using R, especially the function combn) to calculate all the possibilities. The next table exhibits some of the permutations:

Walking group                      Regular group                      T(pz)
9  9.5  9.75  10  13  9.5          11  10  10  11.75  10.5  15       −1.250
9  9.5  10  13  10  10.5           9.75  9.5  11  10  11.75  15      −0.833
9  9.5  11  10  10.5  15           9.75  10  13  9.5  10  11.75       0.167
...                                ...                                 ...
9.5  9.5  11  10  11.75  15        9  9.75  10  13  10  10.5          0.750
9.75  13  9.5  10  11.75  10.5     9  9.5  10  11  10  15             0.000
                                                                            (17.10)

Figure 17.1 graphs the distribution of these T(pz)’s. The p-value then is the proportion of them less than or equal to the observed −1.25, which is 123/924 = 0.133. Thus we do not reject the null hypothesis that there is no treatment effect.
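Here is a minimal R sketch of the enumeration just described, using combn as suggested in the text; the data are those in (17.1).

# Enumerate all C(12, 6) = 924 allocations and compute the difference of means for each.
z <- c(9, 9.5, 9.75, 10, 13, 9.5,      # walking group, from (17.1)
       11, 10, 10, 11.75, 10.5, 15)    # regular group
n <- 6; N <- length(z)
tobs <- mean(z[1:n]) - mean(z[(n + 1):N])          # observed difference, -1.25

alloc <- combn(N, n)                               # each column is one allocation
tvals <- apply(alloc, 2, function(w) mean(z[w]) - mean(z[-w]))
mean(tvals <= tobs)    # proportion <= -1.25; the text reports 123/924 = 0.133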


Figure 17.1: The number of allocations corresponding to each value of T(pz). [Histogram with T(pz) on the horizontal axis and the number of allocations on the vertical axis.]

Note that any statistic could be equally easily used. For example, using the difference of medians, we calculate the p-value to be 0.067, smaller but still not significant. In Section 18.2.2 we will see that the Mann-Whitney/Wilcoxon statistic yields a significant p-value of 0.028. When the sample sizes are larger, it becomes impossible to enumerate all the possibilities. In such cases, we can either simulate a number of randomizations, or use asymptotic considerations as in Section 17.5.

Other randomization models lead to corresponding randomization p-values. For example, it may be that the observations are paired up, and in each pair one observation is randomly given the treatment and the other a placebo. Then the randomization p-value would look at the statistic for all possible interchanges within pairs.

17.2 Fisher’s exact test

The idea in Section 17.1 can be extended to cases where the outcomes are binary. For example, Li, Harmer, Fisher, McAuley, Chaumeton, Eckstrom, and Wilson (2005) report on a study to assess whether tai chi, a Chinese martial art, can improve balance in the elderly. A group of 188 people over 70 years old were randomly assigned to two groups, each of which had the same number and length of exercise sessions over a six-month period, but one group practiced tai chi and the other stretching. (There were actually 256 people to begin, but some dropped out before the treatments started, and others did not fully report on their outcomes.) One outcome reported was the number of falls during the six-month duration of the study. Here is the observed table, where the two outcomes are 1 = “no falls” and 0 = “one or more falls.”

Group         No falls    One or more falls    Total
Tai chi        68          27                   95
Stretching     50          43                   93
Total         118          70                  188
                                                      (17.11)

The tai chi group did have fewer falls, with about 72% of the people experiencing no falls, while about 54% of the control group had no falls. To test whether this difference is statistically significant, we proceed as for the walking exercises example.


Take the population as in (17.2) to consist of (x_i, y_i), i = 1, . . . , N = 188, where x_i indicates whether person i would have had no falls if in the tai chi group, and y_i indicates whether the person would have had no falls if in the stretching group. The random element is the subset of people assigned to tai chi. Similar to (17.3), we take the null hypothesis that the specific exercise has no effect, and the alternative that tai chi is better:

H_0 : x_i = y_i, i = 1, . . . , N versus H_A : #{i | x_i = 1} > #{i | y_i = 1}.  (17.12)

The statistic we’ll use is the number of people in the tai chi group who had no falls:

t_obs = #{i ∈ tai chi group | x_i = 1},  (17.13)

and now large values of the statistic support the alternative. Here, t_obs = 68. Since the total numbers in the margins of (17.11) are known, any one of the other entries in the table could also be used.

As in the walking example, we take z to be the observed vector under the null (so that x_i = y_i), where the observations in the tai chi group are listed first, and here a sums up the tai chi values. Then the randomization distribution of the test statistic is given by

T(Pz) = a′Pz, where a = (1_95′, 0_93′)′ and z = (1_68′, 0_27′, 1_50′, 0_43′)′,  (17.14)

with P ∼ Uniform(S_N) again.

We can find the exact distribution of T(Pz). The probability that T(Pz) = t is the probability that the first 95 elements of Pz have t ones and 95 − t zeroes. There are 118 ones in the z vector, so that there are C(118, t) ways to choose the t ones, and C(70, 95 − t) ways to choose the zeroes. Since there are C(188, 95) ways to choose the first 95 without regard to outcome, we have

P[T(Pz) = t] = C(118, t) C(70, 95 − t) / C(188, 95), 25 ≤ t ≤ 95.  (17.15)

This distribution is the Hypergeometric(118, 70, 95) distribution, where the pmf of the Hypergeometric(k, l, n) is

f(t | k, l, n) = C(k, t) C(l, n − t) / C(N, n), max{0, n − l} ≤ t ≤ min{k, n},  (17.16)

and k, l, n are nonnegative integers, N = k + l. (There are several common parameterizations of the hypergeometric. We use the one in R.) A generic 2 × 2 table corresponding to (17.11) is

               Success    Failure        Total
Treatment 1    t          n − t          n
Treatment 2    k − t      l − n + t      m
Total          k          l              N
                                               (17.17)


The p-value for our data is then

P[T(Pz) ≥ t_obs] = P[Hypergeometric(118, 70, 95) ≥ 68] = 0.00863.  (17.18)

Thus we would reject the null hypothesis that the type of exercise has no effect, i.e., the observed superiority of tai chi is statistically significant.

The test here is called Fisher’s exact test. It yields an exact p-value when testing independence in 2 × 2 tables, and is especially useful when the sample size is small enough that the asymptotic χ² tests are not very accurate. See the next subsection for another example.
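A quick R sketch of the calculation in (17.18): the tail probability can be obtained from the hypergeometric distribution (R uses the parameterization in (17.16)), and it should also agree with the one-sided p-value from fisher.test applied to the table in (17.11). The object name falls is just an illustration.

# P[Hypergeometric(118, 70, 95) >= 68], the p-value in (17.18):
phyper(67, 118, 70, 95, lower.tail = FALSE)

# The same conditional test via fisher.test on the 2x2 table (17.11):
falls <- matrix(c(68, 27, 50, 43), nrow = 2, byrow = TRUE,
                dimnames = list(c("Tai chi", "Stretching"),
                                c("No falls", "One or more falls")))
fisher.test(falls, alternative = "greater")$p.value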

17.2.1 Tasting tea

Joan Fisher Box (1978) relates a story about Sir Ronald Fisher, her father, while he was at the Rothamsted Experimental Station in England in the 1920s. When the first woman researcher joined the staff, “No one in those days knew what to do with a woman worker in a laboratory; it was felt, however, that she must have tea, and so from the day of her arrival a tray of tea and a tin of Bath Oliver biscuits appeared each afternoon at four o’clock precisely.” One afternoon, as Fisher and colleagues assembled for tea, he drew a cup of tea for one of the scientists, Dr. Muriel Bristol. “She declined it, saying she preferred a cup into which the milk had been poured first.” This pronouncement created quite a stir. What difference should it make whether you put the milk in before or after the tea? They came up with an experiment (described in Fisher (1935)) to test whether she could tell the difference. They prepared eight cups of tea with milk. A random four had the milk put in the cup first, and the other four had the tea put in first. Dr. Bristol wasn’t watching the randomization.

Once the cups were prepared, Dr. Bristol sampled each one, and for each tried to guess whether milk or tea had been put in first. She knew there were four cups of each type, which could have helped her in guessing. In any case, she got them all correct. Could she have just had lucky guesses? Let x_i indicate whether milk (x_i = 0) or tea (x_i = 1) was put into cup i first, and let z_i indicate her guess for cup i. Thus each of x and z consists of four ones and four zeroes. The null hypothesis is that she is just guessing, that is, she would have made the same guesses no matter which cups had the milk first. We will take the test statistic to be the number of correct guesses:

T(x, z) = #{i | x_i = z_i}.  (17.19)

The randomization permutes the x, Px, where P ∼ Uniform(S_8). We could try to model Dr. Bristol’s thought process, but it will be enough to condition on her responses z. Since she guessed all correctly (T(x, z) = 8), there is only one value px could be in order to do as well or better (and no one could do better):

p-value(z) = P[T(Px, z) ≥ 8] = P[Px = z] = 1/C(8, 4) = 1/70 ≈ 0.0143.  (17.20)

(Note that here, T(Px, z)/2 has a Hypergeometric(4, 4, 4) distribution.) This p-value is fairly small, so we conclude that it is unlikely she was guessing: she could detect which went into the cup first.
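In R, the p-value in (17.20) is a one-line check; the second line uses the hypergeometric representation just noted.

1 / choose(8, 4)        # probability of getting all eight cups right by guessing, 1/70
dhyper(4, 4, 4, 4)      # same value via T(Px, z)/2 ~ Hypergeometric(4, 4, 4)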


Figure 17.2: The 1969 draft lottery results. The first plot has day of the year on the X-axis and lottery number on the Y-axis. The second plots the average lottery number for each month.

17.3 Testing randomness

In 1969, a lottery was held in the United States that assigned a random draft number from 1 to 366 to each day of the year (including February 29). Young men who were at least 19 years old, but not yet 26 years old, were to be drafted into the military in the order of their draft numbers. The randomization used a box with 366 capsules, each capsule containing a slip of paper with one of the days of the year written on it. So there was one January 1, one January 2, etc. There were also slots numbered from 1 to 366 representing the draft numbers. Capsules were randomly chosen from the box, without replacement. The first one chosen (September 14) was assigned to draft slot 1, the second (April 24) was assigned to draft slot 2, ..., and the last one (June 8) was assigned draft slot 366. Some of the results are next:

Draft #    Day of the year
1          Sep. 14 (258)
2          Apr. 24 (115)
3          Dec. 30 (365)
...        ...
364        May 5 (126)
365        Feb. 26 (57)
366        Jun. 8 (16)
                              (17.21)

The numbers in the parentheses are the day numbers, e.g., September 14 is the 258th day of the year.

There were questions about whether this method produced a completely random assignment. That is, when drawing capsules from the box, did each capsule have the same chance of being chosen as the others still left? The left-hand plot in Figure 17.2 shows the day of the year (from 1 to 366) on the X-axis, and the draft number on the


Figure 17.3: The histogram of correlations arising from 10,000 random permutations of the days.

Y-axis. It looks pretty random. But the correlation (12.81) is r = −0.226. If things are completely random, this correlation should be about 0. Is this −0.226 too far from 0? If we look at the average draft number for each month as in the right-hand plot of Figure 17.2, we see a pattern. There is a strong negative correlation, −0.807. Draft numbers earlier in the year tend to be higher than those later in the year. It looks like it might not be completely random.

In the walking exercise experiment, the null hypothesis together with the randomization design implied that every allocation of the data values to the two groups was equally likely. Here, the null hypothesis is that the randomization in the lottery was totally random, specifically, that each possible assignment of days to lottery numbers was equally likely. Thus the two null hypotheses have very similar implications.

To test the randomness of the lottery, we will use the absolute value of the correlation coefficient between the days of the year and the lottery numbers. Let e_N = (1, 2, . . . , N)′, where here N = 366. The lottery numbers are then represented by e_N, and the days of the year assigned to the lottery numbers are a permutation of the elements of e_N, pe_N for p ∈ S_N. Let p_0 be the observed permutation of the days, so that p_0 e_N is the vector of day numbers as in the second column of (17.21).

Letting r denote the sample correlation coefficient, our test statistic is the absolute correlation coefficient between the lottery numbers and the day numbers:

T(p) = |r(e_N, pe_N)|.  (17.22)

The observed value is T(p_0) = 0.226. The p-value is the proportion of permutation matrices that yield absolute correlations that size or larger:

p-value(p_0) = P[T(P) ≥ T(p_0)], P ∼ Uniform(S_N).  (17.23)

Since there are 366! ≈ ∞ such permutations, it is impossible to calculate the p-value exactly. Instead, we generated 10,000 random permutations of the days of the year, each time calculating the correlation with the lottery numbers. Figure 17.3 is the histogram of those correlations. The maximum correlation is 0.210 and the minimum


is −0.193, hence none have absolute value even very close to the observed 0.226. Thus we estimate the p-value to be 0, leading us to conclude that the lottery was not totally random.
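A minimal R sketch of this simulation follows. The actual 1969 draft numbers are not reproduced here; the vector lottery below is only a placeholder random assignment, so the real length-366 vector of draft numbers (in day-of-year order) would need to be substituted to reproduce the text's numbers.

days <- 1:366
lottery <- sample(366)        # placeholder; replace with the actual 1969 draft numbers
r.obs <- cor(days, lottery)   # with the real data this is -0.226

# 10,000 random permutations of the days, as described above:
set.seed(1969)
r.sim <- replicate(10000, cor(days, sample(lottery)))
mean(abs(r.sim) >= abs(r.obs))   # estimated randomization p-value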

What went wrong? Fienberg (1971) describes the actual randomization process. Briefly, the capsules with the January dates were first prepared, placed in a large box, mixed, and shoved to one end of the box. Then the February capsules were prepared, put in the box and shoved to the end with the January capsules, mixing them all together. This process continued until all the capsules were mixed into the box. The box was shut, further shaken, and carried up three flights of stairs, to await the day of the drawing. Before the televised drawing, the box was brought down the three flights, and the capsules were poured from one end of the box into a bowl and further mixed. The actual drawing consisted of drawing capsules one-by-one from the bowl, assigning them sequentially to the lottery numbers. The draws tended to be from near the top of the bowl. The p-value shows that the capsules were not mixed as thoroughly as possible. The fact that there was a significant negative correlation suggests that the box was emptied into the bowl from which end?

17.4 Randomization tests for sampling models

The procedures developed for randomization models can also be used to find exact tests in sampling models. For an example, consider the two-sample problem where X_1, . . . , X_n are iid N(µ_X, σ²), Y_1, . . . , Y_m are iid N(µ_Y, σ²), and the X_i’s are independent of the Y_i’s, and we test

H_0 : µ_X = µ_Y, σ² > 0 versus H_A : µ_X ≠ µ_Y, σ² > 0.  (17.24)

Under the null, the entire set of N = n + m observations constitutes an iid sample. We know that the vector of order statistics for an iid sample is sufficient. Conditioning on the order statistics, we have a mathematically identical situation as that for the randomization model in Section 17.1. That is, under the null, every arrangement of the n + m observations with n observations being x_i’s and the rest being y_i’s has the same probability. Thus we can find a conditional p-value using a calculation similar to that in (17.9). If we consider rejecting the null if the conditional p-value is less than or equal to a given α, then the conditional probability of rejecting the null is less than or equal to α, hence so is the unconditional probability.

Next we formalize that idea. In the two-sample model above, under the null, the distribution of the observations is invariant under permutations. That is, for iid observations, any ordering of them has the same distribution. Generalizing, we will look at groups of matrices under whose multiplication the observations’ distributions do not change under the null. Suppose the data is an N × 1 vector Z, and G is an algebraic group of N × N matrices. (A set of matrices G is a group if g ∈ G then g^{−1} ∈ G, and if g_1, g_2 ∈ G then g_1 g_2 ∈ G. The symmetric group S_N of N × N permutation matrices is indeed such a group. See (22.61) for a general definition of groups.) Then the distribution of Z is said to be invariant under G if

gZ =D Z for all g ∈ G.  (17.25)

Consider testing H_0 : θ_0 ∈ T_0 based on Z. Suppose we have a finite group G such that for any θ_0 ∈ T_0, the distribution of Z is invariant under G. Then for a given test


statistic T(z), we obtain the randomization p-value in a manner analogous to (17.9) and (17.23):

p-value(z) = (1/#G) #{g ∈ G | T(gz) ≥ T(z)} = P[T(Gz) ≥ T(z)], G ∼ Uniform(G).  (17.26)

That is, G is a random matrix distributed uniformly over G, which is independent of Z. To see that this p-value acts like it should, we first use Lemma 15.2 on page 256, emphasizing that G is the random element:

P[p-value(Gz) ≤ α] ≤ α  (17.27)

for given α. Next, we magically transfer the randomness in G over to Z. Write the probability conditionally,

P[p-value(Gz) ≤ α] = P[p-value(GZ) ≤ α | Z = z],  (17.28)

then use (17.27) to show that, unconditionally,

P[p-value(GZ) ≤ α | θ0] ≤ α. (17.29)

By (17.25), [GZ |G = g] =D Z for any g, hence GZ =D Z, giving us

P[p-value(Z) ≤ α | θ0] ≤ α. (17.30)

Finally, we can take the supremum over θ_0 ∈ T_0 to show that the randomization p-value does yield a level α test as in (15.49).

Returning to the two-sample testing problem (17.24), we let Z be the N × 1 vector of all the observations, with the first sample listed first: Z = (X_1, . . . , X_n, Y_1, . . . , Y_m)′. Then under the null, the Z_i’s are iid, hence Z is invariant (17.25) under the group of N × N permutation matrices S_N. If our statistic is the difference in means, we have T(z) = a′z for a as in (17.7), and the randomization p-value is as in (17.9) but for a two-sided test, i.e.,

p-value(a′z) = P[|a′Pz| ≥ |a′z|], P ∼ Uniform(S_N).  (17.31)

This p-value depends only on the permutation invariance of the combined sample, so works for any distribution, not just normal. See Section 18.2.2 for a more general statement of the problem. Also, it is easily extended to any two-sample statistic. It will not work, though, if the variances of the two samples are not equal.
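Here is a minimal R sketch of the two-sided randomization p-value in (17.31), sampling random permutations rather than enumerating all of S_N. The function name perm.pvalue and the simulated samples are hypothetical illustrations.

perm.pvalue <- function(x, y, nperm = 10000) {
  z <- c(x, y); n <- length(x); N <- length(z)
  tobs <- mean(x) - mean(y)
  tperm <- replicate(nperm, {
    w <- sample(N, n)                  # random allocation of n of the N observations
    mean(z[w]) - mean(z[-w])
  })
  mean(abs(tperm) >= abs(tobs))        # two-sided, as in (17.31)
}

set.seed(1)
perm.pvalue(rnorm(10), rnorm(12, mean = 1))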

17.4.1 Paired comparisons

Stichler, Richey, and Mandel (1953) describes a study wherein each of n = 16 tires was subject to measurement of tread wear by two methods, one based on weight loss and one on groove wear. Thus the data are (X_1, Y_1), . . ., (X_n, Y_n), with the X_i and Y_i’s representing the measurements based on weight loss and groove wear, respectively. We assume the tires (i.e., (X_i, Y_i)’s) are iid under both hypotheses. We wish to test whether the two measurement methods are equivalent in some sense. We could take the null hypothesis to be that X_i and Y_i have the same distribution, but for our purposes need to take the stronger null hypothesis that the two measurement


methods are exchangeable, which here means that (X_i, Y_i) and (Y_i, X_i) have the same distribution:

H_0 : (X_i, Y_i) =D (Y_i, X_i), i = 1, . . . , n.  (17.32)

Exercise 17.6.7 illustrates the difference between having the same distribution and exchangeability. See Gibbons and Chakraborti (2011) for another example. Note that we are not assuming X_i is independent of Y_i. In fact, the design is taking advantage of the likely high correlation between X_i and Y_i.

The test statistic we will use is the absolute value of the median of the differences:

T(z) = |Median(z_1, . . . , z_n)|,  (17.33)

where z_i = x_i − y_i. Here are the data:

i     x_i     y_i     z_i
1     45.9    35.7    10.2
2     41.9    39.2     2.7
3     37.5    31.1     6.4
4     33.4    28.1     5.3
5     31.0    24.0     7.0
6     30.5    28.7     1.8
7     30.9    25.9     5.0
8     31.9    23.3     8.6
9     30.4    23.1     7.3
10    27.3    23.7     3.6
11    20.4    20.9    −0.5
12    24.5    16.1     8.4
13    20.9    19.9     1.0
14    18.9    15.2     3.7
15    13.7    11.5     2.2
16    11.4    11.2     0.2
                            (17.34)

Just scanning the differences, we see that in only one case is y_i larger than x_i, hence the evidence is strong that the null is not true. But we will illustrate with our statistic, which is observed to be T(z) = 4.35.

To find the randomization p-value, we need a group. Exchangeability in the null implies that X_i − Y_i has the same distribution as Y_i − X_i, that is, Z_i =D −Z_i. Since the Z_i’s are iid, we can change the signs of any subset of them without changing the null distribution of the vector Z. Thus the invariance group G_± consists of all N × N diagonal matrices with ±1’s on the diagonal:

g = ( ±1   0   · · ·   0
       0  ±1   · · ·   0
       ⋮    ⋮    ⋱      ⋮
       0   0   · · ·  ±1 ).  (17.35)

There are 2^16 matrices in G_±, though due to the symmetry in the null and the statistic, we can hold one of the diagonal elements at +1. The exact randomization p-value as in (17.26) is 0.0039 (= 128/2^15). It is less than 0.05 by quite a bit; in fact, we can easily reject the null hypothesis for α = 0.01.
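A minimal R sketch of this sign-change randomization for the data in (17.34) follows; it enumerates all 2^16 sign patterns rather than exploiting the symmetry mentioned above.

z <- c(10.2, 2.7, 6.4, 5.3, 7.0, 1.8, 5.0, 8.6,
       7.3, 3.6, -0.5, 8.4, 1.0, 3.7, 2.2, 0.2)    # differences from (17.34)
tobs <- abs(median(z))                              # 4.35

# Each row of 'signs' is one diagonal of a matrix g in (17.35).
signs <- as.matrix(expand.grid(rep(list(c(-1, 1)), length(z))))
tvals <- apply(signs, 1, function(g) abs(median(g * z)))
mean(tvals >= tobs)                                 # the text reports 0.0039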

17.4.2 Regression

In Section 12.6 we looked at the regression with X = damage and Y = deaths for the hurricane data. There was a notable outlier in Katrina. Here we test independence of X and Y using randomization. The model is that (X_1, Y_1), . . . , (X_N, Y_N), N = 94, are iid. The null hypothesis is that the X_i’s are independent of the Y_i’s, which means the distribution of the data set is invariant under permutation of the X_i’s, or of the Y_i’s, or both. For test statistics, we will try the four slopes we used in Section 12.6,


which are given in (12.74). If β̂(x, y) is the estimator of the slope, then the one-sided randomization p-value is given by

p-value(x, y) = P[β̂(x, Py) ≥ β̂(x, y)], P ∼ Uniform(S_N).  (17.36)

Note the similarity to (17.22) and (17.23) for the draft lottery data, for which the statistic was absolute correlation. We use 10,000 random permutations of the x_i’s to estimate the randomization p-values. Table (17.37) contains the results. The second column contains one-sided p-values estimated using Student’s t.

                                         Estimate    p-value
                                                     Student’s t    Randomization
Least squares                            7.7435      0.0000         0.0002
Least squares w/o outlier                1.5920      0.0002         0.0039
Least absolute deviations                2.0930      0.0213         0.0000
Least absolute deviations w/o outlier    0.8032      0.1593         0.0005
                                                                    (17.37)

We see that the randomization p-values are consistently very small whether using least squares or least absolute deviations, with or without the outlier. The p-values using the Student’s t estimate are larger for least absolute deviations, especially with no outlier.

17.5 Large sample approximations

When the number of randomizations is too large to perform them all, we have so far generated a number of random randomizations. Another approach is to use a normal approximation that uses an extension of the central limit theorem. We will treat statistics that are linear functions of z = (z_1, . . . , z_N)′, and look at the distribution under the group of permutation matrices. See Section 17.5.2 for the sign change group.

Let a_N = (a_1, . . . , a_N)′ be the vector of constants defining the linear test statistic:

T_N(z_N) = a_N′ z_N,  (17.38)

so that the randomization distribution of T is given by

T(P_N z_N) = a_N′ P_N z_N, where P_N ∼ Uniform(S_N).  (17.39)

The a in (17.8) illustrates the a_N when comparing two treatments. For the draft lottery example in Section 17.3, a_N = e_N = (1, 2, . . . , N)′. This idea can also be used in the least squares case as in Section 17.4.2, where z_N = x and a_N = y, or vice versa.

The first step is to find the mean and variance of T in (17.39). Consider the random vector U_N = P_N z_N, which is just a random permutation of the elements of z_N. Since each permutation has the same probability, each U_i is equally likely to be any one of the z_k’s. Thus

E[U_i] = (1/N) ∑ z_k = z̄_N and Var[U_i] = (1/N) ∑ (z_k − z̄_N)² = s²_{z_N}.  (17.40)

Also, each pair (U_i, U_j) for i ≠ j is equally likely to be equal to any pair (z_k, z_l), k ≠ l. Thus Cov[U_i, U_j] = c_N is the same for any i ≠ j. To figure out c_N, note that


∑ U_i = ∑ z_k no matter what P_N is. Since the z_k's are constant, Var[∑ U_i] = 0. Thus

0 = Var\Big[\sum_{i=1}^N U_i\Big] = \sum_{i=1}^N Var[U_i] + \sum\sum_{i \ne j} Cov[U_i, U_j] = N s^2_{z_N} + N(N-1)\, c_N \implies c_N = -\frac{1}{N-1}\, s^2_{z_N}.    (17.41)

Exercise 17.6.1 shows the following:

Lemma 17.1. Let U_N = P_N z_N, where P_N ∼ Uniform(S_N). Then

E[U_N] = \bar{z}_N 1_N \quad and \quad Cov[U_N] = \frac{N}{N-1}\, s^2_{z_N} H_N,    (17.42)

where H_N = I_N − (1/N) 1_N 1_N′ is the centering matrix from (7.38).

Since T in (17.39) equals a_N′ U_N, the lemma shows that

E[T(P_N z_N)] = a_N′ \bar{z}_N 1_N = N \bar{a}_N \bar{z}_N    (17.43)

and

Var[T(P_N z_N)] = \frac{N}{N-1}\, s^2_{z_N}\, a_N′ H_N a_N = \frac{N^2}{N-1}\, s^2_{a_N} s^2_{z_N},    (17.44)

where s^2_{a_N} is the variance of the elements of a_N. We will standardize the statistic to have mean 0 and variance 1 (see Exercise 17.6.2):

V_N = \frac{a_N′ P_N z_N - N \bar{a}_N \bar{z}_N}{\frac{N}{\sqrt{N-1}}\, s_{a_N} s_{z_N}} = \sqrt{N-1}\; r(a_N, P_N z_N),    (17.45)

where r(x, y) is the usual Pearson correlation coefficient between x and y as in (12.81). Under certain conditions discussed in Section 17.5.1, this V_N → N(0, 1) as N → ∞, so that we can estimate the randomization p-value using the normal approximation.

For example, consider the two-treatment situation in Section 17.1, where a_N = (1/n, . . . , 1/n, −1/m, . . . , −1/m)′ as in (17.8). Since there are n of 1/n and m of −1/m,

\bar{a}_N = 0 \quad and \quad s^2_{a_N} = \frac{1}{nm}.    (17.46)

A little manipulation (see Exercise 17.6.3) shows that the observed V_N is

v_N = \frac{\bar{x} - \bar{y}}{\frac{N}{\sqrt{N-1}} \frac{1}{\sqrt{nm}}\, s_{z_N}} = \frac{\bar{x} - \bar{y}}{s^* \sqrt{\frac{1}{n} + \frac{1}{m}}},    (17.47)

where s^* = \sqrt{\sum (z_i - \bar{z}_N)^2/(N-1)} from (7.34). The second expression is interesting because it is very close to the t-statistic in (15.16) used for the normal case, where the only difference is that there a pooled standard deviation was used instead of the s^*.


Fisher considered that this similarity helps justify the use of the statistic in sampling situations (for large N) even when the data are not normal.

For the data in (17.1) on walking exercises for infants, we have n = m = 6, \bar{x} − \bar{y} = −1.25, and s^2_{z_N} = 2.7604. Thus v_N = −1.2476, which yields an approximate one-sided p-value of P[N(0, 1) ≤ v_N] = 0.1061. The exact p-value was 0.133, so the approximation is fairly good even for this small sample.

In the draft lottery example of Section 17.3, we based the test on the Pearson correlation coefficient between the days and lottery numbers. We could have also used r(x, y) in the regression model in Section 17.4.2. In either case, (17.45) immediately gives the normalized statistic as v_N = \sqrt{N-1}\, r(x, y). For the draft lottery, N = 366 and r = −0.226, so that v_N = \sqrt{365}\,(−0.226) = −4.318. This statistic yields a two-sided p-value of 0.000016. The p-value we found earlier by sampling 10,000 random permutations was 0, hence the results do not conflict: reject the null hypothesis that the lottery was totally random.
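
For instance, assuming vectors day and lottery hold the 366 days and their lottery numbers (not reproduced here), this normal approximation is a couple of lines of R:

    vN <- sqrt(length(day) - 1) * cor(day, lottery)   # V_N = sqrt(N - 1) r(x, y), from (17.45)
    2 * pnorm(-abs(vN))                               # approximate two-sided p-value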

17.5.1 Technical conditions

Here we consider the conditions for the asymptotic normality of V_N in (17.45), where P_N ∼ Uniform(S_N). We assume that we have sequences a_N and z_N for N = 1, 2, . . . , where a_N = (a_{N1}, . . . , a_{NN})′ and z_N are both N × 1. Fraser (1957), Chapter 6, summarizes various conditions that imply asymptotic normality. We will look at one specific condition introduced by Hoeffding (1952):

\frac{1}{N}\, h(z_N)\, h(a_N) \to 0 \quad as \quad N \to \infty,    (17.48)

where for any N × 1 vector c_N,

h(c_N) = \frac{\max_{i=1,...,N} (c_{Ni} - \bar{c}_N)^2}{s^2_{c_N}}.    (17.49)

Fraser’s Theorem 6.5 implies the following.

Theorem 17.2. If (17.48) holds for the sequences aN and zN , then VN → N(0, 1) as N →∞, where VN is given in (17.45).

We look at some special cases. In the two-treatment case where a_N is as in (17.8), assume that the proportion of observations in each treatment is roughly constant as N → ∞, that is, n/N → p ∈ (0, 1) as N → ∞. Then Exercise 17.6.5 shows that

h(a_N) = \frac{\max\{1/n^2, 1/m^2\}}{1/(nm)} \to \max\Big\{\frac{1-p}{p},\, \frac{p}{1-p}\Big\} \in (0, \infty).    (17.50)

Thus the condition (17.48) holds if

\frac{1}{N}\, h(z_N) \to 0.    (17.51)

It may be a bit problematic to decide whether it is reasonable to assume this condition. According to Theorem 6.7 of Fraser (1957), it will hold if the z_i's are the observations from an iid sample with positive variance and finite E|Z_i|^3. The data (17.1) in the walking exercise example look consistent with this assumption.


In the draft lottery example, both a_N and z_N are e_N = (1, 2, . . . , N)′, or some permutation thereof. Thus we can find the means and variances from those of the Discrete Uniform(1, N) in Table 1.2 (page 9):

\bar{e}_N = \frac{N+1}{2} \quad and \quad s^2_{e_N} = \frac{N^2 - 1}{12}.    (17.52)

The \max_{i=1,...,N} (i - (N+1)/2)^2 = (N-1)^2/4, hence

h(e_N) = \frac{3(N-1)^2}{N^2 - 1} \to 3,    (17.53)

and it is easy to see that (17.48) holds.

If we are in the general correlation situation, V_N = \sqrt{N-1}\, r(x_N, P_N y_N), a sufficient condition for asymptotic normality of V_N is that

\frac{1}{\sqrt{N}}\, h(x_N) \to 0 \quad and \quad \frac{1}{\sqrt{N}}\, h(y_N) \to 0.    (17.54)

These conditions will hold if the x_i's and y_i's are from iid sequences with tails that are "exponential" or less. The right tail of a distribution is exponential if as x → ∞, 1 − F(x) ≤ a exp(−bx) for some a and b. Examples include the normal, exponential, and logistic. The raw hurricane data analyzed in Section 17.4.2 may not conform to these assumptions due to the presence of extreme outliers. See Figure 12.2 on page 192. The assumptions do look reasonable if we take logs of the y-variable as we did in Section 12.5.2.

17.5.2 Sign changes

Similar asymptotics hold if the group used is G±, the group of sign-change matrices (17.35). Take the basic statistic to be T(z_N) = a_N′ z_N as in (17.38) for some set of constants a_N. The randomization distribution is then of a_N′ G z_N, where G ∼ Uniform(G±). Since the diagonals of G all have distribution P[G_{ii} = −1] = P[G_{ii} = +1] = 1/2, E[G_{ii}] = 0 and Var[G_{ii}] = 1. Thus the normalized statistic here is

V_N = \frac{a_N′ G z_N}{\sqrt{\sum a_i^2 z_i^2}}.    (17.55)

The Lyapunov condition for asymptotic normality is useful for sums of independent but not necessarily identically distributed random variables. See Serfling (1980), or any textbook on probability.

Theorem 17.3. Suppose X_1, X_2, . . . are independent with E[X_i] = μ_i and Var[X_i] = σ_i^2 < ∞ for each i. Then

\frac{\sum_{i=1}^N (X_i - μ_i)}{\sqrt{\sum_{i=1}^N σ_i^2}} \longrightarrow^D N(0, 1) \quad if \quad \frac{\sum_{i=1}^N E[|X_i - μ_i|^ν]}{\big(\sqrt{\sum_{i=1}^N σ_i^2}\big)^ν} \longrightarrow 0 for some ν > 2.    (17.56)

For (17.55), Lyapunov says that

V_N \longrightarrow^D N(0, 1) \quad if \quad \frac{\sum_{i=1}^N E[|a_i z_i|^ν]}{\big(\sqrt{\sum a_i^2 z_i^2}\big)^ν} \longrightarrow 0 for some ν > 2.    (17.57)


17.6 Exercises

Exercise 17.6.1. (a) Show that (N/(N − 1)) H_N has 1's on the diagonal and −1/(N − 1)'s on the off-diagonals. (b) Prove Lemma 17.1.

Exercise 17.6.2. (a) For N × 1 vectors x and y, show that the Pearson correlation coefficient can be written r(x, y) = (x′y − N\bar{x}\,\bar{y})/(N s_x s_y), where s_x and s_y are the standard deviations of the elements of x and y, respectively. (b) Verify (17.45).

Exercise 17.6.3. Let a_N = (1/n, . . . , 1/n, −1/m, . . . , −1/m)′ as in (17.8), where N = n + m. (a) Show that the mean of the elements is \bar{a}_N = 0 and the variance is

s^2_{a_N} = \frac{1}{N}\Big(\frac{1}{n} + \frac{1}{m}\Big) = \frac{1}{nm}.    (17.58)

(b) Verify (17.47).

Exercise 17.6.4. The Affordable Care Act (ACA) is informally called "Obamacare." Even though they are the same thing, do some people feel more positive toward the Affordable Care Act than Obamacare? Each student in a group of 841 students was given a survey with fifteen questions. All the questions were the same, except for one: students were randomly asked one of the two questions:

• What are your feelings toward The Affordable Care Act?

• What are your feelings toward Obamacare?

Whichever question was asked, the response is a number from 1 to 5, where 1 means one's feelings are very negative, and a 5 means very positive. Consider the randomization model in Section 17.1. Here, x_i would be person i's response to the question referring to the ACA, and y_i the response to the question referring to Obamacare. Take the exact null (17.3) that x_i = y_i for all i, and use the difference in means of the two groups as the test statistic. There were n = 416 people assigned to the ACA group, with ∑ x_i = 1349 and ∑ x_i^2 = 4797 for those people. The Obamacare group had m = 425, ∑ y_i = 1285, and ∑ y_i^2 = 4443. (a) Find the difference in means for the observed groups, \bar{x} − \bar{y}, and the normalized version v_N as in (17.47). (b) Argue that the condition (17.51) is reasonable here (if it is). Find the p-value based on the normal approximation for the statistic in part (a). What do you conclude?

Exercise 17.6.5. Let a_N′ = ((1/n) 1_n′, −(1/m) 1_m′). (a) Show that \max (a_{Ni} − \bar{a}_N)^2 = \max\{1/n^2, 1/m^2\}. (b) Suppose n/N → p ∈ (0, 1) as N → ∞. Show that

h(a_N) \equiv \frac{\max (a_{Ni} - \bar{a}_N)^2}{s^2_{a_N}} \to \max\Big\{\frac{1-p}{p},\, \frac{p}{1-p}\Big\} \in (0, \infty)    (17.59)

as in (17.50). [Recall s^2_{a_N} in (17.46).]

Exercise 17.6.6. The next table has the data from a study (Mendenhall, Million, Sharkey, and Cassisi, 1984) comparing surgery and radiation therapy for treating cancer of the larynx.

                     Cancer controlled   Cancer not controlled   Total
Surgery              X11 = 21            X12 = 2                 X1+ = 23
Radiation therapy    X21 = 15            X22 = 3                 X2+ = 18
Total                X+1 = 36            X+2 = 5                 X++ = 41
                                                                 (17.60)


Figure 17.4: Histogram of 10,000 randomizations of the slope of Tukey's resistant line for the grades data. (Horizontal axis: slope, roughly from −0.6 to 0.6.)

The question is whether surgery is better than radiation therapy for controlling cancer. Use X11, the upper left variable in the table, as the test statistic. (a) Conditional on the marginals X1+ = 23 and X+1 = 36, what is the range of X11? (b) Find the one-sided p-value using Fisher's exact test. What do you conclude?

Exercise 17.6.7. In the section on paired comparisons, Section 17.4.1, we noted that we needed exchangeability rather than just equal distributions. To see why, consider (X, Y), with joint pmf and space

f(x, y) = 1/5,  (x, y) ∈ W = {(1, 2), (1, 3), (2, 1), (3, 4), (4, 1)}.    (17.61)

(a) Show that marginally, X and Y have the same distribution. (b) Show that X and Y are not exchangeable. (c) Find the pmf of Z = X − Y. Show that Z is not symmetric about zero, i.e., Z and −Z have different distributions. (d) What is the median of Z? Is it zero?

Exercise 17.6.8. For the grades in a statistics class of 107 students, let X = score on hourly exams, Y = score on final exam. We wish to test whether these two variables are independent. (We would not expect them to be.) (a) Use the test statistic ∑(X_i − \bar{X})(Y_i − \bar{Y}). The data yield the following: ∑(x_i − \bar{x})(y_i − \bar{y}) = 6016.373, ∑(x_i − \bar{x})^2 = 9051.411, ∑(y_i − \bar{y})^2 = 11283.514. Find the normalized version of the test statistic, normalized according to the randomization distribution. Do you reject the null hypothesis? (b) Tukey (1977) proposed a resistant-line estimate of the fit in a simple linear regression. The data are rearranged so that the x_i's are in increasing order, then the data are split into three approximately equal-sized groups based on the values of x_i: the lower third, middle third, and upper third. With 107 observations, the group sizes are 36, 35, and 36. Then for each group, the median of the x_i's and median of the y_i's are calculated. The resistant slope is the slope between the two extreme points: β(x, y) = (y_3^* − y_1^*)/(x_3^* − x_1^*), where x_j^* is the median of the x_i's in the jth group, and similarly for the y_j^*'s. The R routine line calculates this


slope, as well as an intercept. For the data here, β(x, y) = 0.8391. In order to use this slope as a test statistic, we simulated 10,000 randomizations of β(x, Py). Figure 17.4 contains the histogram of these values. What do you estimate the randomization p-value is?

Exercise 17.6.9. Suppose T_N ∼ Hypergeometric(k, l, n) as in (17.16), where N = k + l. Set m = N − n. As we saw in Section 17.2, we can represent this distribution with a randomization distribution of 0-1 vectors. Let a_N′ = (1_n′, 0_m′) and z_N′ = (1_k′, 0_l′), so that T_N =^D a_N′ P_N z_N, where P_N ∼ Uniform(S_N). (a) Show that

E[T_N] = \frac{kn}{N} \quad and \quad Var[T_N] = \frac{klmn}{N^2(N-1)}.    (17.62)

[Hint: Use (17.44). What are s^2_{a_N} and s^2_{z_N}?] (b) Suppose k/N → κ ∈ (0, 1) and n/N → p ∈ (0, 1). Show that Theorem 17.2 can be used to prove that

\sqrt{N-1}\; \frac{N T_N - nk}{\sqrt{klmn}} \longrightarrow^D N(0, 1).    (17.63)

[Hint: Show that a result similar to that in (17.50) holds for the a_N and z_N here, which helps verify (17.48).]


Chapter 18

Nonparametric Tests Based on Signs and Ranks

18.1 Sign test

There are a number of useful and robust testing procedures based on signs and ranks of the data that traditionally go under the umbrella of nonparametric tests. They can be used as analogs to the usual normal-based testing situations when one does not wish to depend too heavily on the normal assumption. These test statistics have the property that their randomization distribution is the same as their sampling distribution, so the work we have done so far will immediately apply here.

One nonparametric analog of hypotheses on the mean is hypotheses on the median. Assume that Z_1, . . . , Z_N are iid, Z_i ∼ F ∈ F, where F is the set of continuous distribution functions that are symmetric about their median η. We test

H_0 : F ∈ F with η = 0 versus H_A : F ∈ F with η ≠ 0.    (18.1)

(What follows is easy to extend to tests of η = η_0; just subtract η_0 from all the Z_i's.)

If the median is zero, then one would expect about half of the observations to be positive and half negative. The sign test uses the signs of the data, Sign(Z_1), . . . , Sign(Z_N), where for any z ∈ R,

Sign(z) = \begin{cases} +1 & \text{if } z > 0 \\ 0 & \text{if } z = 0 \\ -1 & \text{if } z < 0 \end{cases}.    (18.2)

The basic sign statistic is

S(z) = \sum_{i=1}^N Sign(z_i).    (18.3)

The corresponding two-sided test statistic for (18.1) is |S(z)|. If F is continuous at 0, i.e., P[Z_i = 0] = 0, then the exact null distribution is the same as that of |2W − N|, where W ∼ Binomial(N, 1/2). Thus it is easy to perform the test, either exactly or asymptotically.

The test is often used in paired comparisons, as for the tread wear data in Section 17.4.1. There the data are iid (X_i, Y_i)'s, and the null hypothesis (17.32) is that X_i and Y_i are exchangeable ((X_i, Y_i) =^D (Y_i, X_i)), hence Z_i = X_i − Y_i is symmetric about the



median of zero. From (17.34) we see that of the N = 16 z_i's, one is negative and 15 are positive, so that w = 15 and S(z) = 14. The p-value is

P[|S(Z)| ≥ 14] = P[W ∈ {0, 1, 15, 16}] = 0.0005,    (18.4)

leading to rejection of the null as before.
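
A minimal sketch of the sign test in R, using the Binomial(N, 1/2) null of W = #{z_i > 0}; the helper name is ours, not the book's.

    sign_test <- function(z) {
      z <- z[z != 0]                            # in practice, drop exact zeroes
      w <- sum(z > 0)
      N <- length(z)
      S <- 2 * w - N                            # the sign statistic S(z) of (18.3)
      p <- binom.test(w, N, p = 0.5)$p.value    # exact two-sided p-value
      c(S = S, p.value = p)
    }

For the tread wear differences, w = 15 of N = 16 are positive, and the exact two-sided p-value is the 0.0005 of (18.4).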

18.2 Rank transform tests

A popular approach to nonparametric tests is to take a familiar test statistic and recreate it by replacing the observed values with their ranks. Such tests, called rank transform tests, are often robust, with null distributions that are easy to apply, either exactly or asymptotically. First, we need to define rank.

We saw rankings in Section 3.2.1, where people ranked from 1 to 3 where they like to live. For an N × 1 vector z with no ties (i.e., no two z_i's are equal), the corresponding rank vector is r, where r_i = 1 if z_i is the smallest observation, r_i = 2 if z_i is the second smallest, ..., r_i = N if z_i is the largest. When there are ties, we will define ranks using midranks, where the midrank of z_i is the regular rank if no other observation equals z_i, and equals the average of what would be the ranks if there are multiple observations equalling z_i. An illustration makes this easier to understand. Suppose z = (12, 15, 14, 13, 16, 12, 14, 14). We order the z_i's, assign "ranks" from 1 to N to the order statistics, then average the ranks over tied observations:

z(i)’s 12 12 13 14 14 14 15 16“ranks” 1 2 3 4 5 6 7 8

midranks 1.5 1.5 3 5 5 5 7 8(18.5)

That is, the two 12’s would be ranked 1 and 2, so their midranks are both (1 + 2)/2.Likewise the three 14’s share the average of the ranks 4, 5, and 6. The rank vectorthen arranges the ranks in the order of the original z:

z           12    15    14    13    16    12    14    14
Rank(z)    1.5     7     5     3     8   1.5     5     5
                                                    (18.6)

To be precise,

Rank(z)_i = 1 + \sum_{j \ne i} \Big( I[z_i > z_j] + \tfrac{1}{2}\, I[z_i = z_j] \Big) = \tfrac{1}{2} \Big( \sum_{j \ne i} Sign(z_i - z_j) + N + 1 \Big).    (18.7)

Below we show how to use ranks in the testing situations previously covered in Chapter 17.
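
As a quick check, R's rank function produces exactly these midranks by default (ties.method = "average"):

    rank(c(12, 15, 14, 13, 16, 12, 14, 14))
    # 1.5 7.0 5.0 3.0 8.0 1.5 5.0 5.0   -- agrees with (18.6)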

18.2.1 Signed-rank test

The sign test for testing that the median is zero as in (18.1) is fairly crude in that it looks only at which side of zero the observations are, not how far from zero. If, under the null, the distribution of the Z_i's is symmetric about zero, we can generalize the


test. The usual mean difference can be written as a sign statistic weighted by the magnitudes of the observations:

\bar{z} = \frac{1}{N} \sum_{i=1}^N |z_i|\, Sign(z_i).    (18.8)

A modification of the sign test introduced by Wilcoxon (1945) is the signed-rank test, which uses ranks of the magnitudes instead of the magnitudes themselves. Letting

R = Rank(|Z_1|, . . . , |Z_N|) and S = (Sign(Z_1), . . . , Sign(Z_N))′,    (18.9)

the statistic is

T(Z) = \sum_{i=1}^N R_i S_i = R′S.    (18.10)

We will look at the randomization distribution of T under the null. As in Section 17.4.1, the distribution of Z is invariant under the group G± of sign-change matrices as in (17.35). For fixed z_i ≠ 0, if P[G_i = −1] = P[G_i = +1] = 1/2, then P[Sign(G_i z_i) = −1] = P[Sign(G_i z_i) = +1] = 1/2. If z_i = 0, then Sign(G_i z_i) = 0. Thus the randomization distribution of T is given by

T(Gz) =^D \sum_{i \,|\, z_i \ne 0} r_i G_i,  G ∼ Uniform(G±).    (18.11)

(In practice one can just ignore the zeroes, and proceed with a smaller sample size.) Thus we are in the same situation as Section 17.4.1, and can find the exact distribution if N is small, or simulate if not. Exercise 18.5.1 finds the mean and variance:

E[T(Gz)] = 0 \quad and \quad Var[T(Gz)] = \sum_{i \,|\, z_i \ne 0} r_i^2.    (18.12)

When there are no zero zi’s and no ties among the zi’s, there are efficient algo-rithms (e.g., wilcox.test in R) to calculate the exact distribution for larger N, up to 50.To use the asymptotic normal approximation, note that under these conditions, r issome permutation of 1, . . . , N. Exercise 18.5.1 shows that

VN =T(Gz)√

N(N + 1)(2N + 1)/6−→ N(0, 1). (18.13)

If the distribution of the Zi’s is continuous (so there are no zeroes and no ties, withprobability one), then the randomization distribution and sampling distribution of Tare the same.

18.2.2 Mann-Whitney/Wilcoxon two-sample test

The normal-based two-sample testing situation (17.24) tests the null that the two means are equal versus either a one-sided or two-sided alternative. There are various nonparametric analogs of this problem. The one we will deal with has

X_1, . . . , X_n ∼ iid F_X and Y_1, . . . , Y_m ∼ iid F_Y,    (18.14)

where the X_i's and Y_i's are independent. The null hypothesis is that the distribution functions are equal, and the alternative is that F_X is stochastically larger than F_Y:


Definition 18.1. The distribution function F is stochastically larger than the distribution function G, written

F >_{st} G,    (18.15)

if

F(c) ≤ G(c) for all c ∈ R, and F(c) < G(c) for some c ∈ R.    (18.16)

It looks like the inequality is going the wrong way, but the idea is that if X ∼ F and Y ∼ G, then F being stochastically larger than G means that X tends to be larger than Y, or

P[X > c] > P[Y > c], which implies that 1 − F(c) > 1 − G(c) ⟹ F(c) < G(c).    (18.17)

One can also say that "X is stochastically larger than Y." For example, if X and Y are both from the same location family, but X's parameter is larger than Y's, then X is stochastically larger than Y.

Back to the testing problem. With the data in (18.14), the hypotheses are

H0 : FX = FY , versus HA : FX >st FY . (18.18)

We reject the null if the xi’s are too much larger than the yi’s in some sense. Wilcoxon(1945) proposed, and Mann and Whitney (1947) further studied, the rank transformstatistic that replaces the difference in averages of the two groups, x − y, with thedifference in averages of their ranks. That is, let

r = Rank(x1, . . . , xn, y1, . . . , ym), (18.19)

the ranks of the data combining the two samples. Then the statistic is

W_N = \frac{1}{n} \sum_{i=1}^n r_i - \frac{1}{m} \sum_{i=n+1}^{n+m} r_i = a′r, \quad a′ = \Big(\frac{1}{n}, . . . , \frac{1}{n}, -\frac{1}{m}, . . . , -\frac{1}{m}\Big),    (18.20)

the difference in the average of the ranks assigned to the x_i's and those assigned to the y_i's. For example, return to the walking exercises study from Section 17.1. If we take ranks of the data in (17.1), then find the difference in average ranks, we obtain W_N = −4. Calculating this statistic for all possible allocations as in (17.8), we find 26 of the 924 values are less than or equal to −4. (The alternative here is F_Y >_{st} F_X.) Thus the randomization p-value for the one-sided test is 0.028. This may be marginally statistically significant, though the two-sided p-value is a bit over 5%.

The statistic in (18.20) can be equivalently represented, at least if there are no ties, by the number of x_i's larger than y_j's:

W_N^* = \sum_{i=1}^n \sum_{j=1}^m I[x_i > y_j].    (18.21)

Exercise 18.5.4 shows that W_N = N(W_N^*/(nm) − 1/2).

We are back in the two-treatment situation of Section 17.1, but using the ranks in place of the z_i's. The randomization distribution involves permuting the combined


data vector, which similarly permutes the ranks. Thus the normalized statistic is as in (17.47):

V_N = \frac{W_N}{\frac{N}{\sqrt{N-1}} \frac{1}{\sqrt{nm}}\, s_{r_N}},    (18.22)

where s^2_{r_N} is the variance of the ranks. Exercise 18.5.5 shows that if there are no ties among the observations,

V_N = \frac{W_N}{N \sqrt{\frac{N+1}{12\,nm}}},    (18.23)

which approaches N(0, 1) under the randomization distribution.

As for the signed-rank test, if there are no ties, the R routine wilcox.test can calculate the exact distribution of the statistic for N ≤ 50. Also, if F_X and F_Y are continuous, the sampling distribution of W_N under the null is the same as the randomization distribution.
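
A sketch in R of W_N and its normal approximation; x and y are the two samples, assumed untied, and the function name is ours.

    mww <- function(x, y) {
      n <- length(x); m <- length(y); N <- n + m
      r <- rank(c(x, y))
      WN <- mean(r[1:n]) - mean(r[(n + 1):N])          # (18.20)
      VN <- WN / (N * sqrt((N + 1) / (12 * n * m)))    # (18.23)
      c(WN = WN, VN = VN, p.two.sided = 2 * pnorm(-abs(VN)))
    }

The built-in wilcox.test(x, y) works with the equivalent count W*_N of (18.21) and can give the exact p-value.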

18.2.3 Spearman’s ρ independence test

One nonparametric analog of testing for zero correlation between two variables is testing independence versus positive (or negative) association. The data are the iid pairs (X_1, Y_1), . . . , (X_N, Y_N). The null hypothesis is that the X_i's are independent of the Y_i's. There are a number of ways to define positive association between the variables. A regression-oriented approach looks at the conditional distribution of Y_i given X_i = x, say F_x(y). Independence implies that F_x(y) does not depend on x. A positive association could be defined by having F_x(y) be stochastically larger than F_{x*}(y) if x > x*.

Here we look at Spearman’s ρ, which is the rank transform of the usual Pearsoncorrelation coefficient. That is, letting rx = Rank(x) and ry = Rank(y),

ρ(x, y) = r(rx, ry), (18.24)

where r is the usual Pearson correlation coefficient from (12.81). The ρ measures ageneral monotone relationship between X and Y, rather than the linear relation thePearson coefficient measures.

The randomization distribution here is the same as for the draft lottery data in Section 17.3. Thus again we can use the large-sample approximation to the randomization distribution for the statistic in (17.45): \sqrt{N-1}\, r(r_x, P_N r_y) ≈ N(0, 1), P_N ∼ Uniform(S_N). If the distributions of X_i and Y_i are continuous, then again the randomization distribution and sampling distribution under the null coincide. Also, in the no-tie case, the R routine cor.test will find an exact p-value for N ≤ 10.

Continuing from Section 17.4.2 with the hurricane data, we look at the correlation between damage and deaths with and without Katrina:

                     Pearson (raw)   Spearman   Pearson (transformed)
All data                    0.6081     0.7605                  0.7379
Without Katrina             0.3563     0.7527                  0.6962
                                                               (18.25)

Comparing the Pearson coefficient on the raw data and the Spearman coefficient on the ranks, we can see how much more robust Spearman is. The one outlier adds 0.25 to the Pearson coefficient. Spearman is hardly affected at all by the outlier. We


Figure 18.1: Connecting the points with line segments in a scatter plot (height on the horizontal axis, BMI on the vertical axis). Kendall's τ equals the number of positive slopes minus the number of negative slopes, divided by the total number of segments. See (18.27).

also include the Pearson coefficient where we take the square root of the damage and the log of the deaths. These coefficients are similar to the Spearman coefficients, though slightly less robust. These numbers suggest that Spearman's ρ gives a good, simple, and robust measure of association.
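
Assuming vectors damage and deaths for the N = 94 hurricanes (not listed here), Spearman's ρ and an approximate randomization p-value are, for example:

    rho <- cor(damage, deaths, method = "spearman")    # rho(x, y) = r(rank(x), rank(y)) of (18.24)
    vN <- sqrt(length(damage) - 1) * rho               # normal approximation via (17.45)
    2 * pnorm(-abs(vN))
    # cor.test(damage, deaths, method = "spearman") reports a p-value directly.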

18.3 Kendall’s τ independence test

Consider the same setup as for Spearman's ρ in Section 18.2.3, that is, (X_1, Y_1), . . . , (X_N, Y_N) are iid, and we test the null hypothesis that the X_i's are independent of the Y_i's. The alternative here is based on concordance. Given two of the pairs (X_i, Y_i) and (X_j, Y_j) (i ≠ j), they are concordant if the line connecting the points in R^2 has positive slope, and discordant if the line has negative slope. For example, Figure 18.1 plots data for five students, with x_i being height in inches and y_i being body mass index (BMI). Each pair of points is connected by a line segment. Eight of these segments have positive slope, and two have negative slope.

A measure of concordance is

τ = P[(X_i − X_j)(Y_i − Y_j) > 0] − P[(X_i − X_j)(Y_i − Y_j) < 0] = E[Sign(X_i − X_j)\, Sign(Y_i − Y_j)].    (18.26)

If τ > 0, we tend to see larger x_i's going with larger y_i's, and smaller x_i's going with smaller y_i's. If τ < 0, the x_i's and y_i's are more likely to go in different directions. If the X_i's are independent of the Y_i's, then τ = 0. Kendall's τ, which we saw briefly in Exercise 4.4.2, is a statistic tailored to testing τ = 0 versus τ > 0 or τ < 0.

Kendall’s τ test statistic is an unbiased estimator of τ:

τ(x, y) =∑ ∑1≤i<j≤N Sign(xi − xj) Sign(yi − yj)

(N2 )

. (18.27)


This numerator is the number of positive slopes minus the number of negative slopes, so for the data in Figure 18.1, we have τ = (8 − 2)/10 = 0.6. As for Spearman's ρ, this statistic measures any kind of positive association, rather than just linear. To find the p-value based on the randomization distribution of τ(x, Py), P ∼ Uniform(S_N), we can enumerate the values if N is small, or simulate if N is larger, or use asymptotic considerations. The R function cor.test also handles Kendall's τ, exactly for N ≤ 50.

To use the asymptotic normality approximation, we need the mean and variance under the randomization distribution. We will start by assuming that there are no ties among the x_i's, and no ties among the y_i's. Recall Kendall's distance in Exercises 4.4.1 and 4.4.2, defined by

d(x, y) = \sum\sum_{1 \le i < j \le N} I[(x_i - x_j)(y_i - y_j) < 0].    (18.28)

With no ties,

τ(x, y) = 1 - \frac{4\, d(x, y)}{N(N-1)}.    (18.29)

Arrange the observations so that the x_i's are in increasing order, x_1 < x_2 < · · · < x_N. Then we can write

d(x, y) = \sum_{i=1}^{N-1} U_i, \quad where \quad U_i = \sum_{j=i+1}^N I[y_i > y_j].    (18.30)

Extending the result in Exercise 4.4.3 for N = 3, it can be shown that under the randomization distribution, the U_i's are independent with U_i ∼ Discrete Uniform(0, N − i). Exercise 18.5.11 shows that

E[d(x, Py)] = \frac{N(N-1)}{4} \quad and \quad Var[d(x, Py)] = \frac{N(N-1)(2N+5)}{72},    (18.31)

hence

E[τ(x, Py)] = 0 \quad and \quad Var[τ(x, Py)] = \frac{2}{9}\, \frac{2N+5}{N(N-1)},    (18.32)

then uses Lyapunov's condition to show that

V_N = \frac{τ(x, Py)}{\sqrt{\frac{2}{9}\, \frac{2N+5}{N(N-1)}}} \longrightarrow^D N(0, 1).    (18.33)
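
A sketch of τ and the normal approximation (18.33) in R, assuming no ties in x or y; the function name is ours.

    kendall_tau <- function(x, y) {
      N <- length(x)
      s <- sign(outer(x, x, "-")) * sign(outer(y, y, "-"))
      tau <- sum(s[upper.tri(s)]) / choose(N, 2)                 # (18.27)
      VN <- tau / sqrt((2 / 9) * (2 * N + 5) / (N * (N - 1)))    # (18.33)
      c(tau = tau, VN = VN, p.two.sided = 2 * pnorm(-abs(VN)))
    }
    # cor.test(x, y, method = "kendall") handles exact and tied cases directly.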

18.3.1 Ties

When X and Y are continuous, the measure of concordance τ in (18.26) acts like a correlation coefficient, going from −1 if X and Y are perfectly negatively related, Y = g(X) for a monotone decreasing function g, to +1 if they are perfectly positively related. The same goes for the data-based Kendall's τ in (18.27) if there are no ties. If the random variables are not continuous, or the observations of either or both variables contain ties, then the two measures are unable to achieve either −1 or +1. In practice, if ties are fairly scarce, then there is no need to make any modifications. For example, in data on 165 people's height and weight, heights are given to the nearest inch and weights to the nearest pound. There are 19 different heights and


48 different weights, hence quite a few ties. Yet the randomization values of τ range from −0.971 to +0.973, which is close enough to the ±1 range.

If there are extensive ties, which would occur if either variable is categorical, then modifications would be in order. The traditional approach is to note that τ is the covariance of Sign(X_i − X_j) and Sign(Y_i − Y_j), of which the correlation is a natural normalization. That is, in general we define

τ = Corr[Sign(X_i − X_j), Sign(Y_i − Y_j)] = \frac{E[Sign(X_i - X_j)\, Sign(Y_i - Y_j)]}{\sqrt{E[Sign(X_i - X_j)^2]\, E[Sign(Y_i - Y_j)^2]}},    (18.34)

noting that E[Sign(X_i − X_j)] = E[Sign(Y_i − Y_j)] = 0. If the distribution of X is continuous, P[Sign(X_i − X_j)^2 = 1] = 1, and similarly for Y. Thus if both distributions are continuous, the denominator is 1, so that τ is as in (18.26). Suppose X is not continuous. Let a_1, . . . , a_K be the points at which X has positive probability, and p_k = P[X = a_k]. If Y is not continuous, let b_1, . . . , b_L be the corresponding points, and q_l = P[Y = b_l]. Either or both of K and L could be ∞. Exercise 18.5.8 shows that

E[Sign(X_i - X_j)^2] = 1 - \sum p_k^2 \quad and \quad E[Sign(Y_i - Y_j)^2] = 1 - \sum q_l^2.    (18.35)

The τ in (18.27) is similarly modified. Let u_x be the N(N − 1) × 1 vector with all the Sign(x_i − x_j) for i ≠ j,

u_x′ = (Sign(x_1 − x_2), . . . , Sign(x_1 − x_N), Sign(x_2 − x_1), Sign(x_2 − x_3), . . . , Sign(x_2 − x_N), . . . , Sign(x_N − x_1), . . . , Sign(x_N − x_{N-1})),    (18.36)

and u_y be the corresponding vector of the Sign(y_i − y_j)'s. Then u_x′ u_y equals twice the numerator of τ in (18.27) since it counts i < j and i > j. Kendall's τ modified for ties is then Pearson's correlation coefficient of these signed difference vectors:

τ(x, y) = r(u_x, u_y) = \frac{u_x′ u_y}{\|u_x\|\, \|u_y\|},    (18.37)

since the means of the elements of u_x and u_y are 0.

Now Exercise 18.5.8 shows that

\|u_x\|^2 = \sum\sum_{i \ne j} Sign(x_i - x_j)^2 = N(N-1) - \sum_{k=1}^K c_k(c_k - 1) \quad and
\|u_y\|^2 = \sum\sum_{i \ne j} Sign(y_i - y_j)^2 = N(N-1) - \sum_{l=1}^L d_l(d_l - 1),    (18.38)

where (c_1, . . . , c_K) is the pattern of ties for the x_i's and (d_1, . . . , d_L) is the pattern of ties for y. That is, letting a_1, . . . , a_K be the values that appear at least twice in the vector x, set c_k = #{i | x_i = a_k}. Similarly for y. Finally,

τ(x, y) = \frac{2 \sum\sum_{1 \le i < j \le N} Sign(x_i - x_j)\, Sign(y_i - y_j)}{\sqrt{N(N-1) - \sum_{k=1}^K c_k(c_k - 1)}\; \sqrt{N(N-1) - \sum_{l=1}^L d_l(d_l - 1)}}.    (18.39)


If there are no ties in one of the vectors, there is nothing to subtract from its N(N − 1), hence if neither vector has ties, we are back to the original τ in (18.27). The statistic in (18.39) is often called Kendall's τ_B, the original one in (18.27) being then referred to as Kendall's τ_A. This modified statistic will generally not range all the way from −1 to +1, though it can get closer to those limits than without the modification. See Exercise 18.5.9.

For testing independence, the randomization distribution of τ(x, Py) still has mean 0, but the variance is a bit trickier if ties are present. In Section 18.3.2, we deal with ties in just one of the vectors. Here we give the answer for the general case without proof. Let

S(x, y) = \sum\sum_{1 \le i < j \le N} Sign(x_i - x_j)\, Sign(y_i - y_j).    (18.40)

The expectation under the randomization distribution is E[S(x, Py)] = 0 whatever the ties situation is. Since S is a sum of signed differences, the variance is a sum of the variances plus the covariances of the signed differences, each of which can be calculated, though the process is tedious. Rather than go through the details, we will present the answer, and refer the interested reader to Chapter 5 of Kendall and Gibbons (1990). The variance in general is given by

Var[S(x, Py)] = \frac{N(N-1)(2N+5) - \sum c_k(c_k-1)(2c_k+5) - \sum d_l(d_l-1)(2d_l+5)}{18}
  + \frac{[\sum c_k(c_k-1)(c_k-2)][\sum d_l(d_l-1)(d_l-2)]}{9N(N-1)(N-2)}
  + \frac{[\sum c_k(c_k-1)][\sum d_l(d_l-1)]}{2N(N-1)}.    (18.41)

Notice that if there are ties in only one of the variables, the variance simplifies substantially, as we will see in (18.48). Also, if the ties are relatively sparse, the last two terms are negligible relative to the first term for large N. The variance of τ in (18.39) can then be obtained from (18.41). Dropping those last two terms and rearranging to make easy comparison to (18.32), we have

Var[τ(x, Py)] ≈ \frac{2}{9}\, \frac{2N+5}{N(N-1)}\, \frac{1 - c^{**} - d^{**}}{(1 - c^*)(1 - d^*)},    (18.42)

where

c^* = \frac{\sum c_k(c_k - 1)}{N(N-1)}, \qquad c^{**} = \frac{\sum c_k(c_k - 1)(2c_k + 5)}{N(N-1)(2N+5)},    (18.43)

and similarly for d^* and d^{**}.

18.3.2 Jonckheere-Terpstra test for trend among groups

Consider the two-sample situation in Section 18.2.2. We will switch to the notation in that section, where we have x_1, . . . , x_n for group 1 and y_1, . . . , y_m for group 2, so that z = (x′, y′)′ and N = n + m. Let a = (1, . . . , 1, 2, . . . , 2)′, where there are n 1's and m 2's. Then since a_i = a_j if i and j are both between 1 and n, or both between n + 1 and


n + m,

d(a, z) = \sum\sum_{1 \le i < j \le n} I[(a_i - a_j)(x_i - x_j) < 0] + \sum_{i=1}^n \sum_{j=1}^m I[(a_i - a_{n+j})(x_i - y_j) < 0] + \sum\sum_{1 \le i < j \le m} I[(a_{n+i} - a_{n+j})(y_i - y_j) < 0]
       = \sum_{i=1}^n \sum_{j=1}^m I[(a_i - a_{n+j})(x_i - y_j) < 0]
       = \sum_{i=1}^n \sum_{j=1}^m I[x_i > y_j],    (18.44)

which equals W_N^*, the representation in (18.21) of the Mann-Whitney/Wilcoxon statistic.

If there are several groups, ordered in such a way that we are looking for a trend across groups, then we can again use d(a, z), where a indicates the group number. For example, if there are K groups, and n_k observations in group k, we would have

a    1 · · · 1            2 · · · 2            · · ·    K · · · K
z    z_{11} · · · z_{1n_1}    z_{21} · · · z_{2n_2}    · · ·    z_{K1} · · · z_{Kn_K}.    (18.45)

The d(a, z) is called the Jonckheere-Terpstra statistic (Terpstra, 1952; Jonckheere, 1954), and is an extension of the two-sample statistic, summing the W_N^* for the pairs of groups:

d(a, z) = \sum\sum_{1 \le k < l \le K} \sum_{i=1}^{n_k} \sum_{j=1}^{n_l} I[z_{ki} > z_{lj}].    (18.46)

To find the mean and variance of d, it is convenient to first imagine the sum of all pairwise comparisons d(e_N, z), where e_N = (1, 2, . . . , N)′. We can then decompose this sum into the parts comparing observations between groups and those comparing observations within groups:

d(e_N, z) = d(a, z) + \sum_{k=1}^K d(e_{n_k}, z_k), \quad where \quad z_k = (z_{k1}, . . . , z_{kn_k})′.    (18.47)

Still assuming no ties in z, it can be shown that using the randomization distribution on z, the K + 1 random variables d(a, Pz), d(e_{n_1}, (Pz)_1), . . . , d(e_{n_K}, (Pz)_K) are mutually independent. See Terpstra (1952), Lemma I and Theorem I. The idea is that the relative rankings of the z_{ki}'s within one group are independent of those in any other group, and that the rankings within groups are independent of the relative sizes of elements in one group to another.

The independence of the d’s on the right-hand side of (18.47) implies that theirvariances sum, hence we can find the mean and variance of d(a, Pz) by subtraction:

E[d(a, Pz)] =N(N − 1)

4−

K

∑k=1

nk(nk − 1)4

≡ µ(n), and

Var[d(a, Pz)] =N(N − 1)(2N + 5)

72−

K

∑k=1

nk(nk − 1)(2nk + 5)72

≡ σ2(n), (18.48)


n = (n_1, . . . , n_K), using (18.31) on d(e_N, Pz) and the d(e_{n_k}, (Pz)_k)'s. For asymptotics, as long as n_k/N → λ_k ∈ (0, 1) for each k, we have that

JT_N \equiv \frac{d(a, Pz) - μ(n)}{σ(n)} \longrightarrow^D N(0, 1).    (18.49)

See Exercise 18.5.12. Since we based this statistic on d rather than τ, testing for positive association means rejecting for small values of JT_N, and testing for negative association means rejecting for large values of JT_N. We could also use τ (A or B), noting that

\sum\sum_{1 \le i < j \le N} Sign(a_i - a_j)\, Sign(z_i - z_j) = \binom{N}{2} - \sum_{k=1}^K \binom{n_k}{2} - 2\, d(a, z).    (18.50)

As long as there are no ties in z, the test here works for any a, where in (18.48) the (n_1, . . . , n_K) is replaced by the pattern of ties for a.
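
A sketch of the Jonckheere-Terpstra calculation in R, assuming no ties in z; z is the data vector, a the ordered group labels, and the function name is ours.

    jt_stat <- function(z, a) {
      d <- sum(outer(a, a, "<") & outer(z, z, ">"))    # d(a, z) of (18.46)
      n <- as.numeric(table(a)); N <- length(z)
      mu <- (N * (N - 1) - sum(n * (n - 1))) / 4                                  # mu(n) of (18.48)
      sig2 <- (N * (N - 1) * (2 * N + 5) - sum(n * (n - 1) * (2 * n + 5))) / 72   # sigma^2(n)
      JT <- (d - mu) / sqrt(sig2)                      # (18.49)
      c(d = d, JT = JT, p.two.sided = 2 * pnorm(-abs(JT)))
    }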

18.4 Confidence intervals

As in Section 15.6, we can invert these nonparametric tests to obtain nonparametric confidence intervals for certain parameters. To illustrate, consider the sign test based on Z_1, . . . , Z_N iid. We assume the distribution is continuous, and find a confidence interval for the median η. It is based on the order statistics z_{(1)}, . . . , z_{(N)}. The idea stems from looking at (z_{(1)}, z_{(N)}) as a confidence interval for η. Note that the median is between the minimum and maximum unless all observations are greater than the median, or all are less than the median. By continuity, P[Z_i > η] = P[Z_i < η] = 1/2, hence the chance η is not in the interval is 2/2^N. That is, (z_{(1)}, z_{(N)}) is a 100(1 − 2^{-(N-1)})% confidence interval for the median. By using other order statistics as limits, other percentages are obtained.

For the general interval, we test the null hypothesis H_0 : η = η_0 using the sign statistic in (18.3): S(z ; η_0) = \sum_{i=1}^N Sign(z_i − η_0). An equivalent statistic is the number of z_i's larger than η_0:

S(z ; η_0) = 2 \sum_{i=1}^N I[z_i > η_0] - N.    (18.51)

Under the null, \sum_{i=1}^N I[Z_i > η_0] ∼ Binomial(N, 1/2), so that for level α, the exact test rejects when

\sum_{i=1}^N I[z_i > η_0] \le A \quad or \quad \sum_{i=1}^N I[z_i > η_0] \ge B,    (18.52)

where

A = \max\{\text{integer } k \mid P[\text{Binomial}(N, \tfrac{1}{2}) \le k] \le \tfrac{1}{2}α\} \quad and
B = \min\{\text{integer } l \mid P[\text{Binomial}(N, \tfrac{1}{2}) \ge l] \le \tfrac{1}{2}α\}.    (18.53)


Exercise 18.5.17 shows that the confidence interval then consists of the η_0's for which (18.52) fails:

C(z) = \{η_0 \mid A + 1 \le \sum_{i=1}^N I[z_i > η_0] \le B - 1\} = [z_{(N-B+1)}, z_{(N-A)}).    (18.54)

Since the Zi’s are assumed continuous, it does not matter whether the endpoints areclosed or open, so typically one would use (z(N−B+1), z(N−A)).

For large N, we can use the normal approximation to estimate A and B. These are virtually never exact integers, so a good idea is to choose the closest integers that give the widest interval. That is, use

A = \mathrm{floor}\Big(\frac{N}{2} - z_{α/2}\sqrt{\frac{N}{4}}\Big) \quad and \quad B = \mathrm{ceiling}\Big(\frac{N}{2} + z_{α/2}\sqrt{\frac{N}{4}}\Big),    (18.55)

where floor(x) is the largest integer less than or equal to x, and ceiling(x) is the smallest integer greater than or equal to x.
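
A sketch of this interval in R, using the large-sample A and B of (18.55); the function name is ours.

    median_ci <- function(z, alpha = 0.05) {
      N <- length(z)
      zc <- qnorm(1 - alpha / 2)
      A <- floor(N / 2 - zc * sqrt(N / 4))
      B <- ceiling(N / 2 + zc * sqrt(N / 4))
      zs <- sort(z)
      c(lower = zs[N - B + 1], upper = zs[N - A])    # (z_(N-B+1), z_(N-A)) as in (18.54)
    }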

18.4.1 Kendall’s τ and the slope

A similar approach yields a nonparametric confidence interval for the slope in a regression. We will assume fixed x and random Y, so that the model is

Y_i = α + β x_i + E_i, i = 1, . . . , N, E_i are iid with continuous distribution F.    (18.56)

We allow ties in x, but with continuous F are assuming there are no ties in the Y. Kendall's τ and Spearman's ρ can be used to test that β = 0, and can also be repurposed to test H_0 : β = β_0 for any β_0. Writing Y_i − β_0 x_i = α + (β − β_0) x_i + E_i, we can test that null by replacing Y_i with Y_i − β_0 x_i in the statistic. For Kendall, it is easiest to use the distance d, where

d(x, y ; β_0) = \sum\sum_{1 \le i < j \le N} I[(x_i - x_j)((y_i - β_0 x_i) - (y_j - β_0 x_j)) > 0].    (18.57)

Exercise 18.5.18 shows that we can write

d(x, y ; β_0) = \sum\sum_{1 \le i < j \le N,\; x_i \ne x_j} I[b_{ij} > β_0],    (18.58)

where b_{ij} is the slope of the line segment connecting points i and j, b_{ij} = (y_i − y_j)/(x_i − x_j). Figure 18.1 illustrates these segments. Under the null, this d has the distribution of the Jonckheere-Terpstra statistic in Section 18.3.2. The analog to the confidence interval defined in (18.54) is (b_{(N−B+1)}, b_{(N−A)}), where we use the null distribution of d(x, Y ; β_0) in place of the Binomial(N, 1/2)'s in (18.53). Here, b_{(k)} is the kth order statistic of the b_{ij}'s for i, j in the summation in (18.58).

Using the asymptotic approximation in (18.49), we have

A = floor(µ(c)− zα/2σ(c)) and B = ceiling(µ(c) + zα/2σ(c)). (18.59)


The mean μ(c) and variance σ^2(c) are given in (18.48), where c is the pattern of ties for x.

An estimate of β can be obtained by shrinking the confidence interval down to a point. Equivalently, it is the β_0 for which the normalized test statistic JT_N of (18.49) is 0, or as close to 0 as possible. This value can be seen to be the median of the b_{ij}'s, and is known as the Sen-Theil estimator of the slope (Theil, 1950; Sen, 1968). See Exercise 18.5.18.
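
A sketch of the Sen-Theil estimate in R, as the median of the pairwise slopes b_{ij} over pairs with distinct x's; the function name is ours.

    sen_theil <- function(x, y) {
      ij <- which(upper.tri(diag(length(x))), arr.ind = TRUE)     # all pairs i < j
      b <- (y[ij[, 1]] - y[ij[, 2]]) / (x[ij[, 1]] - x[ij[, 2]])  # pairwise slopes b_ij
      median(b[is.finite(b)])                                     # drop pairs with x_i = x_j
    }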

Ties in the y are technically not addressed in the above work, since the distributions are assumed continuous. The problem of ties arises only when there are ties in the (y_i − β_0 x_i)'s. Except for a finite set of β_0's, such ties occur only for i and j with (x_i, y_i) = (x_j, y_j). Thus if such tied pairs are nonexistent or rare, we can proceed as if there are no ties.

To illustrate, consider the hurricane data with x = damage and y = deaths as in Figure 12.2 (page 192) and the table in (12.74). The pattern of ties for x is c = (2, 2, 2, 2), and for y is (2, 2, 2, 2, 2, 2, 2, 3, 4, 4, 5, 8, 10, 12, 13), indicating quite a few ties. But there is only one tie among the (x_i, y_i) pairs, so we use the calculations for no ties in y. Here, N = 94, so that μ(c) = 2183.5 and σ^2(c) = 23432.42, and from (18.59), A = 1883 and B = 2484. There are 4367 b_{ij}'s, so as in (18.54) but with N replaced by 4367, we have confidence interval and estimate

C(x, y) = (b_{(1884)}, b_{(2484)}) = (0.931, 2.706) and β = Median{b_{ij}} = 1.826.    (18.60)

In (12.74) we have estimates and standard errors of the slope using least squares or median regression, with and without the outlier, Katrina. Below we add the current Sen-Theil estimator as above, and that without the outlier. The confidence intervals for the original estimates use ±1.96 se for consistency.

                                        Estimate   Confidence interval
Least squares                              7.743       (4.725, 10.761)
Least squares w/o outlier                  1.592       (0.734, 2.450)
Least absolute deviations                  2.093       (−0.230, 4.416)
Least absolute deviations w/o outlier      0.803       (−0.867, 2.473)
Sen-Theil                                  1.826       (0.931, 2.706)
Sen-Theil w/o outlier                      1.690       (0.884, 2.605)
                                                              (18.61)

Of the three estimating techniques, Sen-Theil is least affected by the outlier. It is also quite close to least squares when the outlier is removed.

18.5 Exercises

Exercise 18.5.1. Consider the randomization distribution of the signed-rank statistic given in (18.11), where there are no ties in the z_i's. That is, r is a permutation of the integers from 1 to N, and

T(Gz) =^D \sum_{i=1}^N i\, G_i,    (18.62)

where the G_i's are independent with P[G_i = −1] = P[G_i = +1] = 1/2. (a) Show that E[T(Gz)] = 0 and Var[T(Gz)] = N(N + 1)(2N + 1)/6. [Hint: Use the formula for \sum_{i=1}^K i^2.] (b) Apply Theorem 17.3 on page 299 with ν = 3 to show that V_N −→^D N(0, 1) in (18.13), where V_N = T(Gz)/\sqrt{Var[T(Gz)]}. [Hint: In this case, X_i = i G_i.


Use the bound E[|X_i|^3] ≤ i^3 and the formula for \sum_{i=1}^K i^3 to show (17.56).] (c) Find T(z) and V_N for the tread-wear data in (17.34). What is the two-sided p-value based on the normal approximation? What do you conclude?

Exercise 18.5.2. In the paper Student (1908), W. S. Gosset presented the Student's t distribution, using the alias Student because his employer, Guinness, did not normally allow employees to publish (fearing trade secrets would be revealed). One example compared the yields of barley depending on whether the seeds were kiln-dried or not. Eleven varieties of barley were used, each having one batch sown with regular seeds, and one with kiln-dried seeds. Here are the results, in pounds per acre:

Regular    Kiln-dried
   1903          2009
   1935          1915
   1910          2011
   2496          2463
   2108          2180
   1961          1925
   2060          2122
   1444          1482
   1612          1542
   1316          1443
   1511          1535
                        (18.63)

Consider the differences, X_i = Regular_i − Kiln-dried_i. Under the null hypothesis that the two methods are exchangeable as in (17.32), these X_i's are distributed symmetrically about the median 0. Calculate the test statistics for the sign test, signed-rank test, and regular t-test. (For the first two, normalize by subtracting the mean and dividing by the standard deviation of the statistic, both calculated under the null.) Find p-values of these statistics for the two-sided alternative, where for the first two statistics you can approximate the null distribution with the standard normal. What do you see?

Exercise 18.5.3. Suppose z, N × 1, has no ties and no zeroes. This exercise shows that the signed-rank statistic in (18.10) can be equivalently represented by

T^*(z) = \sum\sum_{1 \le i \le j \le N} I[z_i + z_j > 0].    (18.64)

Note that the summation includes i = j. (a) Letting r = Rank(|z_1|, . . . , |z_N|), show that

T(z) \equiv \sum_{i=1}^N Sign(z_i)\, r_i = 2 \sum_{i=1}^N I[z_i > 0]\, r_i - \frac{N(N+1)}{2}.    (18.65)

[Hint: Note that with no zeroes, Sign(z_i) = 2 I[z_i > 0] − 1.] (b) Use the definition of rank given in (18.7) to show that

\sum_{i=1}^N I[z_i > 0]\, r_i = \#\{z_i > 0\} + \sum\sum_{j \ne i} I[z_i > 0]\, I[|z_i| > |z_j|].    (18.66)

(c) Show that I[z_i > 0] I[|z_i| > |z_j|] = I[z_i > |z_j|], and

\sum\sum_{j \ne i} I[z_i > 0]\, I[|z_i| > |z_j|] = \sum\sum_{i < j} (I[z_i > |z_j|] + I[z_j > |z_i|]).    (18.67)


(d) Show that I[z_i > |z_j|] + I[z_j > |z_i|] = I[z_i + z_j > 0]. [Hint: Write out all the possibilities, depending on the signs of z_i and z_j and their relative absolute values.] (e) Verify that T(z) = 2T^*(z) − N(N + 1)/2. (f) Use (18.12) and Exercise 18.5.1 to show that under the null, E[T^*(Gz)] = N(N + 1)/4 and Var[T^*(Gz)] = N(N + 1)(2N + 1)/24.

Exercise 18.5.4. This exercise shows the equivalence of the two Mann-Whitney/Wilcoxon statistics in (18.20) and (18.21). We have r = Rank(x_1, . . . , x_n, y_1, . . . , y_m), where N = n + m. We assume there are no ties among the N observations. (a) Show that

W_N = \frac{1}{n} \sum_{i=1}^n r_i - \frac{1}{m} \sum_{i=n+1}^N r_i = \Big(\frac{1}{n} + \frac{1}{m}\Big) \sum_{i=1}^n r_i - \frac{N(N+1)}{2m}.    (18.68)

[Hint: Since there are no ties, \sum_{i=n+1}^N r_i = c_N - \sum_{i=1}^n r_i for a known constant c_N.] (b) Using the definition of rank in (18.7), show that

\sum_{i=1}^n r_i = \sum_{i=1}^n \Big(1 + \sum_{1 \le j \le n,\, j \ne i} I[x_i > x_j]\Big) + W_N^*, \quad where \quad W_N^* = \sum_{i=1}^n \sum_{j=1}^m I[x_i > y_j].    (18.69)

Then note that the summation on the right-hand side of the first equation in (18.69) equals n(n + 1)/2. (c) Conclude that W_N = N(W_N^*/(nm) − 1/2).

Exercise 18.5.5. Continue with consideration of the Mann-Whitney/Wilcoxon statistic as in (18.20), and again assume there are no ties. (a) Show that Var[W_N] = N^2(N + 1)/(12nm), as used in (18.23). (b) Assume that n/N → p ∈ (0, 1) as N → ∞. Apply Theorem 17.2 (page 298) to show that V_N −→^D N(0, 1) for V_N in (18.23). [Hint: The condition (17.50) holds here, hence only (17.51) needs to be verified for z being r.]

Exercise 18.5.6. The BMI (Body Mass Index) was collected on 165 people, n = 62 men (the x_i's) and m = 103 women (the y_i's). The W_N from (18.20) is 43.38. (You can assume there were no ties.) (a) Under the null hypothesis in (18.18), what are the mean and variance of W_N? (b) Use the statistic V_N from (18.23) to test the null hypothesis in part (a), using a two-sided alternative. What do you conclude?

Exercise 18.5.7. Let d(x, y) = \sum\sum_{1 \le i < j \le N} I[(x_i − x_j)(y_i − y_j) < 0] as in (18.28). Assume that there are no ties among the x_i's nor among the y_i's. (a) Show that

\sum\sum_{1 \le i < j \le N} I[(x_i - x_j)(y_i - y_j) > 0] = \binom{N}{2} - d(x, y).    (18.70)

(b) Show that

\sum\sum_{1 \le i < j \le N} Sign(x_i - x_j)\, Sign(y_i - y_j) = \binom{N}{2} - 2\, d(x, y).    (18.71)

(c) Conclude that τ(x, y) in (18.27) equals 1 − 4 d(x, y)/(N(N − 1)) as in (18.29).

Exercise 18.5.8. (a) Suppose X is a random variable, and a_1, . . . , a_K (K may be ∞) are the points at which X has positive probability. Let X_1 and X_2 be iid with the


same distribution as X. Show that E[Sign(X_1 − X_2)^2] = 1 − \sum_{k=1}^K P[X = a_k]^2, proving (18.35). (b) Let x be N × 1 with pattern of ties (c_1, . . . , c_K) as below (18.38). Show that \sum\sum_{i \ne j} Sign(x_i − x_j)^2 = N(N − 1) − \sum_{k=1}^K c_k(c_k − 1). [Hint: There are N(N − 1) terms in the summation, and they are 1 unless x_i = x_j. So you need to subtract the number of such tied pairs.]

Exercise 18.5.9. Let x = (1, 2, 2, 3)′ and y = (1, 1, 2, 3)′, and let τ(x, y) be Kendall's τ_B from (18.39). Note x and y have the same pattern of ties. (a) What is the value of τ(x, y)? (b) What are the minimum and maximum values of τ(x, py) as p ranges over the 4 × 4 permutation matrices? Are they ±1? (c) Let w = (3, 3, 4, 7)′. What is τ(w, y)? What are the minimum and maximum values of τ(w, py)? Are they ±1?

Exercise 18.5.10. Show that (18.42) holds for τ in (18.39) when we drop the final two summands in the expression for Var[S(x, Py)] in (18.41).

Exercise 18.5.11. Let U_1, . . . , U_{N-1} be independent with U_i ∼ Discrete Uniform(0, N − i), as in (18.30), and set U = U_1 + · · · + U_{N-1}. (a) Show that E[U] = N(N − 1)/4. (b) Show that Var[U] = N(N − 1)(2N + 5)/72. [Hint: Table 1.2 on page 9 has the mean and variance of the discrete uniform. Then use the formulas for \sum_{i=1}^K i for the mean and \sum_{i=1}^K i^2 for the variance.] (c) Show that \sum_i E[|U_i − E[U_i]|^3]/\sqrt{Var[U]}^3 → 0 as N → ∞. [Hint: First show that E[|U_i − E[U_i]|^3] ≤ |N − i|^3/8. Then use the formula for \sum_{i=1}^K i^3 to show that \sum_i E[|U_i − E[U_i]|^3] ≤ N^4/32. Finally, use part (b) to show the desired limit.] (d) Let K_N = (U − E[U])/\sqrt{Var[U]}. Use Theorem 17.3 on page 299 and part (c) to show that K_N −→^D N(0, 1) as N → ∞. Argue that therefore the asymptotic normality of d(x, Py) as in (18.33) holds.

Exercise 18.5.12. This exercise proves the asymptotic normality of the Jonckheere-Terpstra statistic. Assume there are no ties in the z_i's, and n_k/N → λ_k ∈ (0, 1) for each k. (a) Start with the representation given in (18.47). Find the constants c_N, d_{1N}, . . . , d_{KN} such that

W_N = c_N JT_N + W_N^* \quad where \quad W_N^* = \sum_{k=1}^K d_{kN} W_{kN},    (18.72)

JT_N is the normalized Jonckheere-Terpstra statistic in (18.49), and the W's are the normalized Kendall distances,

W_N = \frac{d(e_N, Pz_N) - E[d(e_N, Pz_N)]}{\sqrt{Var[d(e_N, Pz_N)]}} \quad and \quad W_{kN} = \frac{d(e_{n_k}, (Pz)_k) - E[d(e_{n_k}, (Pz)_k)]}{\sqrt{Var[d(e_{n_k}, (Pz)_k)]}},    (18.73)

k = 1, . . . , K. (b) Show that d_{kN}^2 → λ_k^3 and c_N^2 → 1 − \sum_{k=1}^K λ_k^3. (c) We know that the W's are asymptotically N(0, 1). Why is W_N^* −→^D N(0, \sum_{k=1}^K λ_k^3)? (d) Show that c_N JT_N −→^D N(0, 1 − \sum_{k=1}^K λ_k^3). [Hint: Use moment generating functions on both sides of the expression for W_N in (18.72), noting that JT_N and W_N^* are independent. Then we know the mgfs of the asymptotic limits of W_N and W_N^*, hence can find that of c_N JT_N.] (e) Finally, use parts (b) and (d) to show that JT_N −→^D N(0, 1).


Figure 18.2: The 1970 draft lottery results. The first plot has day of the year on the X-axis and lottery number on the Y-axis. The second plots the average lottery number for each month.

Exercise 18.5.13. Consider the draft lottery data of Section 17.3. Letting x be the days of the year and y be the corresponding lottery numbers, we calculate d(x, y) = 38369 for d being Kendall's distance (18.28). (a) Find Kendall's τ. (Here, N = 366.) (b) Find the value of the standard deviation of Kendall's τ under the null hypothesis that the lottery was totally random, and the normalized statistic τ/\sqrt{Var[τ]}. What is the approximate p-value based on Kendall's τ? Is the conclusion different than that found below (17.23)? (c) In 1970, they tried to do a better randomization. There were two bowls of capsules, one for the days of the year, and one for the lottery numbers. They were both thoroughly mixed, and lottery numbers were assigned to dates by choosing one capsule from each bowl. See Figure 18.2. This time Kendall's distance between the days and the lottery numbers was 32883. Now N = 365, since the lottery was only for people born in 1951. What is Kendall's τ and its standard deviation for this year? What is the approximate two-sided p-value for testing complete randomness? What do you conclude? (d) Which year had a better randomization? Or were they about equally good?

Exercise 18.5.14. Continue with the draft lottery example from Exercise 18.5.13. The description of the randomization process at the end of Section 17.3 suggests that there may be a trend over months, but leaves open the possibility that there is no trend within months. To assess this idea, decompose the overall distance d(x, y) = 38369 from above as in (18.47), where a as in (18.45) indicates month, so that K = 12 and n_k is the number of days in the kth month. The between-month Kendall distance is d(a, z) = 35787, and the within-month distances are given in the table:

Month              1    2    3    4    5    6    7    8    9   10   11   12
n_k               31   29   31   30   31   30   31   31   30   31   30   31
d(e_{n_k}, z_k)  260  172  202  195  212  215  237  195  186  278  173  257
                                                                     (18.74)

(a) Normalize the Jonckheere-Terpstra statistic d(a, z) to obtain JT_N as in (18.49). Find the two-sided p-value based on the normal approximation. What do you conclude?


(b) Test for trend within each month. Do there appear to be any months where there is a significant trend? (c) It may be that the monthly data can be combined for a more powerful test. There are various ways to combine the twelve months into one overall test for trend. If looking for the same direction within all months, then summing the d(e_{n_k}, z_k)'s is reasonable. Find the normalized statistic based on this sum. Is it statistically significant? (d) Another way to combine the individual months is to find the sum of squares of the individual statistics, i.e., \sum_{k=1}^{12} W_{kN}^2, for the W_{kN}'s as in (18.73). What is the asymptotic distribution of this statistic under the null? What is its value for these data? Is it statistically significant? (e) What do you conclude concerning the issue of within-month and between-month trends?

Exercise 18.5.15. Suppose Z1, . . . , ZN are iid with distribution symmetric about themedian η, which means Zi − η =D η − Zi. (If the median is not unique, let η be themiddle of the interval of medians.) We will use the signed-rank statistic as expressedin (18.64) to find a confidence interval and estimate for η. (a) Show that for testingH0 : η = η0, we can use the statistic

T∗(z ; η0) = ∑∑_{1≤i≤j≤N} I[(zi + zj)/2 > η0].     (18.75)

(b) Using the normal approximation, find a and b so that C(z) = {η0 | a < T∗(z ; η0) < b} is an approximate 100(1 − α)% confidence interval for η. [Hint: See Exercise 18.5.3(f).] (c) With A = floor(a) and B = ceiling(b), the confidence interval becomes (w(N−B+1), w(N−A)), where the w(i)'s are the order statistics of which quantities? (d) What is the corresponding estimate of η?

Exercise 18.5.16. Here we look at a special case of the two-sample situation as in (18.14). Let F be a continuous distribution function, and suppose that for "shift" parameter δ, X1, . . . , Xn are iid with distribution function F(xi − δ), and Y1, . . . , Ym are iid with distribution function F(yi). Also, the Xi's are independent of the Yi's. The goal is to find a nonparametric confidence interval and estimate for δ. (a) Show that Xi is stochastically larger than Yj if δ > 0, and Yj is stochastically larger than Xi if δ < 0. (b) Show that Median(Xi) = Median(Yj) + δ. (c) Consider testing the hypotheses H0 : δ = δ0 versus HA : δ ≠ δ0. Show that we can base the test on the Wilcoxon/Mann-Whitney statistic as in (18.21) given by

W∗N(z ; δ0) = ∑_{i=1}^{n} ∑_{j=1}^{m} I[xi − yj > δ0],     (18.76)

where z = (x1, . . . , xn, y1, . . . , ym)′. (d) Find E[W∗N(Pz ; δ0)] and Var[W∗N(Pz ; δ0)] under the null. [Hint: See (18.21) and the text below it.] (e) Using the normal approximation, find A and B so that (w(N−B+1), w(N−A)) is an approximate 100(1 − α)% confidence interval for δ. What are the w(i)'s? (f) What is the corresponding estimate of δ?

Exercise 18.5.17. Let z(1), . . . , z(N) be the order statistics from the sample z1, . . . , zN ,and A and B integers between 0 and N inclusive. (a) Show that for integer K,

η0 < z(K) if and only if #{i | η0 < zi} ≥ N − K + 1,     (18.77)

hence ∑_{i=1}^{N} I[η0 < zi] ≥ A + 1 if and only if η0 < z(N−A). (b) Show that (18.77) is equivalent to
η0 ≥ z(K) if and only if #{i | η0 < zi} ≤ N − K.     (18.78)


Conclude that ∑_{i=1}^{N} I[η0 < zi] ≤ B − 1 if and only if η0 ≥ z(N−B+1). (c) Argue that parts (a) and (b) prove the confidence interval formula in (18.54).

Exercise 18.5.18. Consider the linear model Yi = α + βxi + Ei, where the Ei are iid F, and F is continuous. Suppose that the xi's are distinct (no ties). The least squares estimates are those for which the estimated residuals have zero correlation with the xi's. An alternative (and more robust) estimator of the slope β finds β so that the estimated residuals have zero Kendall's τ with the xi's. It is called the Sen-Theil estimator, as mentioned in Section 18.4.1. The residuals are the Yi − α − βxi, so the numerator of Kendall's τ between the residuals and the xi's (which depends on β but not α) is

U(β) = ∑_{i=1}^{n−1} ∑_{j=i+1}^{n} Sign((yi − βxi) − (yj − βxj)) Sign(xi − xj).     (18.79)

Then β satisfies U(β) = 0. (Or at least as close to 0 as possible.) (a) Show that

Sign((yi − βxi)− (yj − βxj)) Sign(xi − xj) = Sign(bij − β), (18.80)

where bij = (yi − yj)/(xi − xj) is the slope of the line segment connecting points i and j. (Note that Sign(a) Sign(b) = Sign(ab).) (b) Now β is what familiar statistic of the bij's?
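A minimal Python sketch of this idea (illustrative only; the function names are not from the text): compute the pairwise slopes bij and pick the slope at which U(β) in (18.79) is as close to zero as possible.

import numpy as np

def U(beta, x, y):
    # numerator of Kendall's tau between residuals and x's, as in (18.79)
    n = len(x)
    return sum(np.sign((y[i] - beta * x[i]) - (y[j] - beta * x[j])) * np.sign(x[i] - x[j])
               for i in range(n - 1) for j in range(i + 1, n))

def sen_theil(x, y):
    n = len(x)
    slopes = [(y[i] - y[j]) / (x[i] - x[j]) for i in range(n - 1) for j in range(i + 1, n)]
    # by (18.80), U is a decreasing step function of beta, stepping at the b_ij's,
    # so search the candidate slopes for the one bringing U closest to zero
    return min(slopes, key=lambda b: abs(U(b, x, y)))

rng = np.random.default_rng(0)
x = np.arange(20.0)
y = 2.0 + 0.5 * x + rng.laplace(size=20)   # heavy-tailed errors, where robustness helps
print(sen_theil(x, y))                     # close to the true slope 0.5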


Part III

Optimality


Chapter 19

Optimal Estimators

So far, our objectives have been mostly methodological, that is, we have a particular model and wish to find and implement procedures to infer something about the parameters. A more meta goal of mathematical statistics is to consider the procedures as the data, and try to find the best procedure(s), or at least evaluate how effective procedures are relative to each other or absolutely. Statistical decision theory is an offshoot of game theory that attempts to do such comparison of procedures. We have seen a little of this idea when comparing the asymptotic efficiencies of the median and the mean in (9.32), (14.74), and (14.75).

Chapter 20 presents the general decision-theoretic approach. Before then we will look at some special optimality results for estimation: best unbiased estimators and best shift-equivariant estimators.

Our first goal is to find the best unbiased estimator, by which we mean the unbiased estimator that has the lowest possible variance for all values of the parameter among all unbiased estimators. Such an estimator is called a uniformly minimum variance unbiased estimator (UMVUE). In Section 19.2 we introduce the concept of completeness of a model, which together with sufficiency is key to finding UMVUEs. Section 19.5 gives a lower bound on the variance of an unbiased estimator, which can be used as a benchmark even if there is no UMVUE.

Throughout this chapter we assume we have the basic statistical model:

Random vector X, space X , and set of distributions P = {Pθ | θ ∈ T },     (19.1)

where T is the parameter space. We wish to estimate a function of the parameter,g(θ). Repeating (11.3), an estimator is a function δ of x:

δ : X −→ A, (19.2)

where here A is generally the space of g(θ). It may be that A is somewhat largerthan that space, e.g., in the binomial case where θ ∈ (0, 1), one may wish to allowestimators in the range [0,1]. As in (10.10), an estimator is unbiased if

Eθ[δ(X)] = g(θ) for all θ ∈ T . (19.3)

Next is the formal definition of UMVUE.


Definition 19.1. The procedure δ in (19.2) is a uniformly minimum variance unbiased estimator (UMVUE) of the function g(θ) if it is unbiased, has finite variance, and for any other unbiased estimator δ′,

Varθ[δ(X)] ≤ Varθ[δ′(X)] for all θ ∈ T . (19.4)

19.1 Unbiased estimators

The first step in finding UMVUEs is to find the unbiased estimators. There is no general automatic method for finding unbiased estimators, in contrast to maximum likelihood estimators or Bayes estimators. The latter two types may be difficult to calculate, but it is a mathematical or computational problem, not a statistical one.

One method that occasionally works for finding unbiased estimators is to find a power series based on the definition of unbiasedness. For example, suppose X ∼ Binomial(n, θ), θ ∈ (0, 1). We know X/n is an unbiased estimator for θ, but what about estimating the variance of X/n, g(θ) = θ(1 − θ)/n? An unbiased δ will satisfy

∑_{x=0}^{n} δ(x) \binom{n}{x} θ^x (1 − θ)^{n−x} = (1/n) θ(1 − θ) for all θ ∈ (0, 1).     (19.5)

Both sides of equation (19.5) are polynomials in θ, hence for equality to hold for every θ, the coefficients of each power θ^k must match.

First, suppose n = 1, so that we need

δ(0)(1− θ) + δ(1)θ = θ(1− θ) = θ − θ2. (19.6)

There is no δ that will work, because the coefficient of θ2 on the right-hand side is −1, and on the left-hand side is 0. That is, with n = 1, there is no unbiased estimator of θ(1 − θ). With n = 2, we have

δ(0)(1 − θ)2 + 2δ(1)θ(1 − θ) + δ(2)θ2 = θ(1 − θ)/2, i.e.,
δ(0) + (−2δ(0) + 2δ(1))θ + (δ(0) − 2δ(1) + δ(2))θ2 ↔ (θ − θ2)/2.     (19.7)

Matching coefficients of θ^k:

k    Left-hand side               Right-hand side
0    δ(0)                         0
1    −2δ(0) + 2δ(1)               1/2
2    δ(0) − 2δ(1) + δ(2)          −1/2     (19.8)

we see that the only solution is δ0(0) = δ0(2) = 0 and δ0(1) = 1/4, which actually is easy to see directly from the first line of (19.7). Thus the δ0 must be the best unbiased estimator, being the only one. In fact, because for any function δ(x), Eθ[δ(X)] is an nth-degree polynomial in θ, the only functions g(θ) that have unbiased estimators are those that are themselves polynomials of degree n or less (see Exercise 19.8.2), and each one is a UMVUE by the results in Section 19.2. For example, 1/θ and eθ do not have unbiased estimators.
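A quick symbolic check of the n = 2 solution (a minimal Python/sympy sketch, not part of the text):

import sympy as sp

theta = sp.symbols('theta')
d0, d1, d2 = 0, sp.Rational(1, 4), 0                             # the solution delta_0 above
lhs = d0*(1 - theta)**2 + 2*d1*theta*(1 - theta) + d2*theta**2   # E_theta[delta_0(X)] for n = 2
rhs = theta*(1 - theta)/2                                        # the target g(theta)
print(sp.simplify(lhs - rhs))                                    # prints 0: delta_0 is unbiased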

Another approach to finding an unbiased estimator is to take a biased one andsee if it can be modified. For example, suppose X1, . . . , Xn are iid Poisson(θ), and


consider estimating g(θ) = Pθ[X1 = 0] = exp(−θ). Thus if Xi is the number of phone calls in an hour, g(θ) is the chance that there are no calls in the next hour. (We saw something similar in Exercise 11.7.6.) The MLE of g(θ) is exp(−X), which is biased. But one can find a c such that exp(−cX) is unbiased. See Exercise 19.8.3.

We have seen that there may or may not be an unbiased estimator. Often, there are many. For example, suppose X1, . . . , Xn are iid Poisson(θ). Then X and S2∗ (the sample variance with n − 1 in the denominator) are both unbiased, being unbiased estimators of the mean and variance, respectively. Weighted averages of the Xi's are also unbiased, e.g., X1, (X1 + X2)/2, and X1/2 + X2/6 + X3/3 are all unbiased. The rest of this chapter uses sufficiency and the concept of completeness to find UMVUEs in many situations.

19.2 Completeness and sufficiency

We have already answered the question of finding the best unbiased estimator in certain situations without realizing it. We know from the Rao-Blackwell theorem (Theorem 13.8 on page 210) that any estimator that is not a function of just the sufficient statistic can be improved. That is, if δ(X) is an unbiased estimator of g(θ), and S = s(X) is a sufficient statistic, then

δ∗(s) = E[δ(X) | s(X) = s] (19.9)

is also unbiased, and has no larger variance than δ.

Also, if there is only one unbiased estimator that is a function of the sufficient statistic, then it must be the best one that depends on only the sufficient statistic. Furthermore, it must be the best overall, because it is better than any estimator that is not a function of just the sufficient statistic.

The concept of "there being only one unbiased estimator" is called completeness. It is a property attached to a model. Consider the model with random vector Y and parameter space T . Suppose there are at least two unbiased estimators of some function g(θ) in this model, say δg(y) and δ∗g(y). That is,

Pθ[δg(Y) 6= δ∗g(Y)] > 0 for some θ ∈ T , (19.10)

and
Eθ[δg(Y)] = g(θ) = Eθ[δ∗g(Y)] for all θ ∈ T .     (19.11)

Then
Eθ[δg(Y) − δ∗g(Y)] = 0 for all θ ∈ T .     (19.12)

This δg(y)− δ∗g(y) is an unbiased estimator of 0. Now suppose δh(y) is an unbiasedestimator of the function h(θ). Then so is

δ∗h (y) = δh(y) + δg(y)− δ∗g(y), (19.13)

because
Eθ[δ∗h(Y)] = Eθ[δh(Y)] + Eθ[δg(Y) − δ∗g(Y)] = h(θ) + 0.     (19.14)

That is, if there is more than one unbiased estimator of one function, then there is more than one unbiased estimator of any other function (that has at least one unbiased estimator). Logically, it follows that if there is only one unbiased estimator of some function, then there is only one (or zero) unbiased estimator of any function. That one function may as well be the zero function.


Definition 19.2. Suppose for the model on Y with parameter space T , the only unbiased estimator of 0 is 0 itself. That is, suppose

Eθ[δ(Y)] = 0 for all θ ∈ T (19.15)

implies that
Pθ[δ(Y) = 0] = 1 for all θ ∈ T .     (19.16)

Then the model is complete.

The important implication follows.

Lemma 19.3. Suppose the model is complete. Then there exists at most one unbiased estimator of any function g(θ).

Illustrating with the binomial again, suppose X ∼ Binomial(n, θ) with θ ∈ (0, 1),and δ(x) is an unbiased estimator of 0:

Eθ [δ(X)] = 0 for all θ ∈ (0, 1). (19.17)

We know the left-hand side is a polynomial in θ, as is the right-hand side. All thecoefficients of θi are zero on the right, hence on the left. Write

Eθ[δ(X)] = δ(0)\binom{n}{0}(1 − θ)^n + δ(1)\binom{n}{1}θ(1 − θ)^{n−1} + · · · + δ(n − 1)\binom{n}{n−1}θ^{n−1}(1 − θ) + δ(n)\binom{n}{n}θ^n.     (19.18)

The coefficient of θ^0, i.e., the constant, arises from just the first term, so is δ(0)\binom{n}{0}. For that to be 0, we have δ(0) = 0. Erasing that first term, we see that the coefficient of θ is δ(1)\binom{n}{1}, hence δ(1) = 0. Continuing, we see that δ(2) = · · · = δ(n) = 0, which means that δ(x) = 0, which means that the only unbiased estimator of 0 is 0 itself. Hence this model is complete, verifying (with Exercise 19.8.2) the fact mentioned below (19.8) that g(θ) has a UMVUE if and only if it is a polynomial in θ of degree less than or equal to n.

19.2.1 Poisson distribution

Suppose X1, . . . , Xn are iid Poisson(θ), θ ∈ (0, ∞), with n > 1. Is this model complete? No. Consider δ(X) = x1 − x2:

Eθ [δ(X)] = Eθ [X1]− Eθ [X2] = θ − θ = 0, (19.19)

but
Pθ[δ(X) = 0] = Pθ[X1 = X2] < 1.     (19.20)

Thus “0” is not the only unbiased estimator of 0; X1 − X2 is another. You can comeup with an infinite number, in fact. Note that no iid model is complete when n > 1,unless the distribution is just a constant.


Now let S = X1 + · · ·+ Xn, which is a sufficient statistic. Then S ∼ Poisson(nθ)for θ ∈ (0, ∞). Is the model for S complete? Suppose δ∗(s) is an unbiased estimator of0. Then

Eθ[δ∗(S)] = 0 for all θ ∈ (0, ∞) ⇒ ∑_{s=0}^{∞} δ∗(s) e^{−nθ} (nθ)^s / s! = 0 for all θ ∈ (0, ∞)
⇒ ∑_{s=0}^{∞} δ∗(s) (n^s/s!) θ^s = 0 for all θ ∈ (0, ∞)
⇒ δ∗(s) n^s/s! = 0 for all s = 0, 1, 2, . . .
⇒ δ∗(s) = 0 for all s = 0, 1, 2, . . . .     (19.21)

Thus the only unbiased estimator of 0 that is a function of S is 0, meaning the modelfor S is complete.

19.2.2 Uniform distribution

Suppose X1, . . . , Xn are iid Uniform(0, θ), θ ∈ (0, ∞), with n > 1. This model again is not complete. Consider the sufficient statistic S = max{X1, . . . , Xn}. The model for S has space (0, ∞) and pdf

fθ(s) = n s^{n−1}/θ^n if 0 < s < θ, and 0 if not.     (19.22)

To see if the model for S is complete, suppose that δ∗ is an unbiased estimator of 0.Then

Eθ[δ∗(S)] = 0 for all θ ∈ (0, ∞) ⇒ ∫_0^θ δ∗(s) s^{n−1} ds / θ^n = 0 for all θ ∈ (0, ∞)
⇒ ∫_0^θ δ∗(s) s^{n−1} ds = 0 for all θ ∈ (0, ∞)
(taking d/dθ) ⇒ δ∗(θ) θ^{n−1} = 0 for (almost) all θ ∈ (0, ∞)
⇒ δ∗(θ) = 0 for (almost) all θ ∈ (0, ∞).     (19.23)

That is, δ∗ must be 0, so that the model for S is complete. [The "(almost)" means that one can deviate from zero for a few values (with total Lebesgue measure 0) without changing the fact that Pθ[δ∗(S) = 0] = 1 for all θ.]

19.3 Uniformly minimum variance estimators

This section contains the key result of this chapter:

Theorem 19.4. Suppose S = s(X) is a sufficient statistic for the model (19.1) on X, and the model for S is complete. If δ∗(s) is an unbiased estimator (depending on s) of the function g(θ), then δ0(X) = δ∗(s(X)) is the UMVUE of g(θ).

Proof. Let δ be any unbiased estimator of g(θ), and consider

eδ(s) = E[δ(X) | S = s]. (19.24)


Because S is sufficient, eδ does not depend on θ, so it is an estimator. Furthermore,since

Eθ[eδ(S)] = Eθ[δ(X)] = g(θ) for all θ ∈ T , (19.25)

it is unbiased, and because it is a conditional expectation, as in (13.61),

Varθ[eδ(S)] ≤ Varθ[δ(X)] for all θ ∈ T . (19.26)

But by completeness of the model for S, there is only one unbiased estimator that isa function of just S, δ∗, hence δ∗(s) = eδ(s) (with probability 1), and

Varθ[δ0(X)] = Varθ[δ∗(S)] ≤ Varθ[δ(X)] for all θ ∈ T . (19.27)

This equation holds for any unbiased δ, hence δ0 is best.

This proof is actually constructive in a sense. If you do not know δ∗, but have any unbiased δ, then you can find the UMVUE by using the Rao-Blackwell theorem (Theorem 13.8 on page 210), conditioning on the sufficient statistic. Or, if you can by any means find an unbiased estimator that is a function of S, then it is UMVUE.

19.3.1 Poisson distribution

Consider again the Poisson case in Section 19.2.1. We have that S = X1 + · · ·+ Xn issufficient, and the model for S is complete. Because

Eθ [S/n] = θ, (19.28)

we know that S/n is the UMVUE of θ.

Now let g(θ) = e^{−θ} = Pθ[X1 = 0]. We have from Exercise 19.8.3 an unbiased estimator of g(θ) that is a function of S, hence it is UMVUE. But finding the estimator took a bit of work, and luck. Instead, we could start with a very simple estimator,

δ(X) = I[X1 = 0], (19.29)

which indicates whether there were 0 calls in the first minute. That estimator is unbiased, but obviously not using all the data. From (13.36) we have that X given S = s is multinomial, hence X1 is binomial:

X1 | S = s ∼ Binomial(s, 1/n).     (19.30)

Then the UMVUE is

E[δ(X) | s(X) = s] = P[X1 = 0 | S = s] = P[Binomial(s, 1/n) = 0] = (1 − 1/n)^s.     (19.31)

As must be, it is the same as the estimator in Exercise 19.8.3.
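A quick Monte Carlo check of (19.31) (a minimal Python sketch; the particular n and θ are arbitrary choices):

import numpy as np

rng = np.random.default_rng(1)
n, theta, reps = 5, 2.0, 200_000
S = rng.poisson(n * theta, size=reps)            # S = X_1 + ... + X_n ~ Poisson(n*theta)
print(((1 - 1/n) ** S).mean(), np.exp(-theta))   # both close to e^{-theta} ~ 0.135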

19.4 Completeness for exponential families

Showing completeness of a model in general is not an easy task. Fortunately, exponential families as in (13.25) usually do provide completeness for the natural sufficient statistic. To present the result, we start by assuming the model for the p × 1 vector S has exponential family density with θ ∈ T ⊂ Rp as the natural parameter, and S itself as the natural sufficient statistic:

fθ(s) = a(s) e^{θ1 s1 + · · · + θp sp − ψ(θ)}.     (19.32)


Lemma 19.5. In the above model for S, suppose that the parameter space contains a nonempty open p-dimensional rectangle. Then the model is complete.

If S is a lattice, e.g., the set of vectors containing all nonnegative integers, then thelemma can be proven by looking at the power series in exp(θi)’s. See Theorem 4.3.1of Lehmann and Romano (2005) for the general case.

The requirement on the parameter space guards against exact constraints among the parameters. For example, suppose X1, . . . , Xn are iid N(µ, σ2), where µ is known to be positive and the coefficient of variation, σ/µ, is known to be 10%. The exponent in the pdf, with µ = 10σ, is

(µ/σ2) ∑ xi − (1/(2σ2)) ∑ xi2 = (10/σ) ∑ xi − (1/(2σ2)) ∑ xi2.     (19.33)

The exponential family terms are then

θ = (10/σ, −1/(2σ2)) and S = (∑ Xi, ∑ Xi2).     (19.34)

The parameter space, with p = 2, is

T = {(10/σ, −1/(2σ2)) | σ > 0},     (19.35)

which does not contain a two-dimensional open rectangle, because θ2 = −θ1^2/200. Thus we cannot use the lemma to show completeness.

Note. It is important to read the lemma carefully. It does not say that if the requirement is violated, then the model is not complete. (Although realistic counterexamples are hard to come by.) To prove a model is not complete, you must produce a nontrivial unbiased estimator of 0. For example, in (19.34),

Eθ[(S1/n)^2] = Eθ[X̄^2] = µ2 + σ2/n = (10σ)2 + σ2/n = (100 + 1/n) σ2     (19.36)

and

Eθ[S2/n] = Eθ[Xi^2] = µ2 + σ2 = (10σ)2 + σ2 = 101σ2.     (19.37)

Then
δ(s1, s2) = (s1/n)^2/(100 + 1/n) − (s2/n)/101     (19.38)

has expected value 0, but is not zero itself. Thus the model is not complete.

19.4.1 Examples

Suppose p = 1, so that the natural sufficient statistic and parameter are both scalars. Then all that is needed is that T contains an interval (a, b), a < b. The table below has some examples, where X1, . . . , Xn are iid with the given distribution (the parameter space is assumed to be the most general):

Distribution       Sufficient statistic T     Natural parameter θ     T
N(µ, 1)            ∑ Xi                       µ                       R
N(0, σ2)           ∑ Xi2                      −1/(2σ2)                (−∞, 0)
Poisson(λ)         ∑ Xi                       log(λ)                  R
Exponential(λ)     ∑ Xi                       −λ                      (−∞, 0)     (19.39)


The parameter spaces all contain open intervals; in fact they are open intervals. Thusthe models for the T’s are all complete.

For X1, . . . , Xn iid N(µ, σ2), with (µ, σ2) ∈ R× (0, ∞), the exponential family termsare

θ = (µ/σ2, −1/(2σ2)) and T = (∑ Xi, ∑ Xi2).     (19.40)

Here, without any extra constraints on µ and σ2, we have

T = R× (−∞, 0), (19.41)

because for any (a, b) with a ∈ R and b < 0, µ = −a/(2b) ∈ R and σ2 = −1/(2b) ∈ (0, ∞) are valid parameter values. Thus the model is complete for (T1, T2). From Theorem 19.4, we then have that any function of (T1, T2) is the UMVUE for its expected value. For example, X̄ is the UMVUE for µ, and S2∗ = ∑(Xi − X̄)2/(n − 1) is the UMVUE for σ2. Also, since Eθ[X̄2] = µ2 + σ2/n, the UMVUE for µ2 is X̄2 − S2∗/n.
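A small simulation check of that last claim (a minimal Python sketch; µ, σ, and n are arbitrary):

import numpy as np

rng = np.random.default_rng(2)
n, mu, sigma, reps = 10, 3.0, 2.0, 100_000
x = rng.normal(mu, sigma, size=(reps, n))
est = x.mean(axis=1)**2 - x.var(axis=1, ddof=1) / n   # Xbar^2 - S*^2/n
print(est.mean(), mu**2)                              # both close to 9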

19.5 Cramér-Rao lower bound

If you find a UMVUE, then you know you have the best unbiased estimator. Oftentimes, there is no UMVUE, or there is one but it is too difficult to calculate. Then what? At least it would be informative to have an idea of whether a given estimator is very poor, or just not quite optimal. One approach is to find a lower bound for the variance. The closer an estimator's variance is to the lower bound, the better.

The model for this section has random vector X, space X , densities fθ(x), and parameter space T = (a, b) ⊂ R. We need the likelihood assumptions from Section 14.4 to hold. In particular, the pdf should always be positive (so that Uniform(0, θ)'s are not allowed) and have several derivatives with respect to θ, and certain expected values must be finite. Suppose δ is an unbiased estimator of g(θ), so that

Eθ[δ(X)] = ∫_X δ(x) fθ(x) dx = g(θ) for all θ ∈ T .     (19.42)

(Use a summation if the density is a pmf.) Now take the derivative with respect to θof both sides. Assuming interchanging the derivative and integral below is valid, wehave

g′(θ) = (∂/∂θ) ∫_X δ(x) fθ(x) dx
      = ∫_X δ(x) (∂/∂θ) fθ(x) dx
      = ∫_X δ(x) [(∂/∂θ) fθ(x) / fθ(x)] fθ(x) dx
      = ∫_X δ(x) (∂/∂θ) log(fθ(x)) fθ(x) dx
      = Eθ[δ(X) (∂/∂θ) log(fθ(X))]
      = Covθ[δ(X), (∂/∂θ) log(fθ(X))].     (19.43)


The last step follows from Lemma 14.1 on page 226, which shows that

Eθ[(∂/∂θ) log(fθ(X))] = 0,     (19.44)

recalling that the derivative of the log of the density is the score function.

Now we can use the correlation inequality (2.31), Cov[U, V]^2 ≤ Var[U] Var[V]:

g′(θ)^2 ≤ Varθ[δ(X)] Varθ[(∂/∂θ) log(fθ(X))] = Varθ[δ(X)] I(θ),     (19.45)

where I(θ) is the Fisher information, which is the variance of the score function (seeLemma 14.1 again). We leave off the “1” in the subscript. Thus

Varθ[δ(X)] ≥ g′(θ)^2 / I(θ) ≡ CRLBg(θ),     (19.46)

the Cramér-Rao lower bound (CRLB) for unbiased estimators of g(θ). Note that The-orem 14.7 on page 232 uses the same bound for the asymptotic variance for consistentand asymptotically normal estimators.

If an unbiased estimator achieves the CRLB, that is,

Varθ[δ(X)] = g′(θ)^2 / I(θ) for all θ ∈ T ,     (19.47)

then it is the UMVUE. The converse is not necessarily true. The UMVUE may not achieve the CRLB, which of course means that no unbiased estimator does. There are other more accurate bounds, called Bhattacharya bounds, that the UMVUE may achieve in such cases. See Lehmann and Casella (2003), page 128.

In fact, basically the only time an estimator achieves the CRLB is when we can write the density of X as an exponential family with natural sufficient statistic δ(x), in which case we already know it is UMVUE by completeness. Wijsman (1973) has a precise statement and proof. Here we give a sketch of the idea under the current assumptions. The correlation inequality shows that equality in (19.43) for each θ implies that there is a linear relationship between δ and the score function as functions of x:

(∂/∂θ) log(fθ(x)) = a(θ) + b(θ) δ(x) for almost all x ∈ X .     (19.48)

Now looking at (19.48) as a function of θ for fixed x, we solve the simple differentialequation, remembering the constant (which may depend on x):

log( fθ(x)) = A(θ) + B(θ)δ(x) + C(x), (19.49)

so that fθ(x) has a one-dimensional exponential family form (19.32) with naturalparameter B(θ) and natural sufficient statistic δ(x). (Some regard should be given tothe space of B(θ) to make sure it is an interval.)

19.5.1 Laplace distribution

Suppose X1, . . . , Xn are iid Laplace(θ), θ ∈ R, so that

fθ(x) = (1/2^n) e^{−Σ|xi−θ|}.     (19.50)


This density is not an exponential family one, and if n > 1, the sufficient statistic is the vector of order statistics, which does not have a complete model. Thus the UMVUE approach does not bear fruit. Because Eθ[Xi] = θ, X̄ is an unbiased estimator of θ, with
Varθ[X̄] = Varθ[Xi]/n = 2/n.     (19.51)

Is this variance reasonable? We will compare it to the CRLB:

(∂/∂θ) log(fθ(x)) = (∂/∂θ) (−∑ |xi − θ|) = ∑ Sign(xi − θ),     (19.52)

because the derivative of |x| is +1 if x > 0 and −1 if x < 0. (We are ignoring the possibility that xi = θ, where the derivative does not exist.) By symmetry of the density around θ, each Sign(Xi − θ) has probability 1/2 of being either −1 or +1, so it has mean 0 and variance 1. Thus

I(θ) = Varθ [∑ Sign(Xi − θ)] = n. (19.53)

Here, g(θ) = θ, hence

CRLBg(θ) = g′(θ)^2 / I(θ) = 1/n.     (19.54)

Compare this bound to the variance in (19.51). The variance of X̄ is twice the CRLB, which is not very good. It appears that there should be a better estimator. In fact, there is, which we will see later in Section 19.7, although even that estimator does not achieve the CRLB. We also know that the median is better asymptotically.
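A simulation illustrating the comparison (a minimal Python sketch; n is arbitrary, and θ = 0 without loss of generality):

import numpy as np

rng = np.random.default_rng(3)
n, reps = 25, 100_000
x = rng.laplace(0.0, 1.0, size=(reps, n))
print("Var of mean:  ", x.mean(axis=1).var())        # about 2/n = 0.08
print("Var of median:", np.median(x, axis=1).var())  # about 1/n for large n
print("CRLB:         ", 1.0 / n)                     # 0.04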

19.5.2 Normal µ2

Suppose X1, . . . , Xn are iid N(µ, 1), µ ∈ R. We know that X̄ is sufficient, and the model for it is complete, hence any unbiased estimator based on the sample mean is UMVUE. In this case, g(µ) = µ2, and Eµ[X̄2] = µ2 + 1/n, hence
δ(x) = x̄2 − 1/n     (19.55)

is the UMVUE. To find the variance of the estimator, start by noting that √n X̄ ∼ N(√n µ, 1), so that its square is noncentral χ2 (Definition 7.7 on page 113),
n X̄2 ∼ χ2_1(nµ2).     (19.56)

Thus from (7.75),
Varµ[n X̄2] = 2 + 4nµ2 (= 2ν + 4∆).     (19.57)

Finally,

Varµ[δ(X)] = Varµ[X̄2] = 2/n2 + 4µ2/n.     (19.58)

For the CRLB, we first need Fisher’s information. Start with

(∂/∂µ) log(fµ(x)) = (∂/∂µ) (−(1/2) ∑(xi − µ)2) = ∑(xi − µ).     (19.59)


The Xi's are independent with variance 1, so that In(µ) = n. Thus with g(µ) = µ2,
CRLBg(µ) = (2µ)^2/n = 4µ2/n.     (19.60)

Comparing (19.58) to (19.60), we see that the UMVUE does not achieve the CRLB, which implies that no unbiased estimator will. But note that the variance of the UMVUE is only off by 2/n2, so that for large n, the ratio of the variance to the CRLB is close to 1.
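A quick numerical check of (19.58) and (19.60) (a minimal Python sketch with arbitrary µ and n):

import numpy as np

rng = np.random.default_rng(4)
n, mu, reps = 10, 1.5, 200_000
xbar = rng.normal(mu, 1.0, size=(reps, n)).mean(axis=1)
delta = xbar**2 - 1/n                           # the UMVUE (19.55)
print(delta.var(), 2/n**2 + 4*mu**2/n)          # simulated variance vs. (19.58)
print((4*mu**2/n) / (2/n**2 + 4*mu**2/n))       # CRLB/Var[UMVUE], near 1 for large n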

19.6 Shift-equivariant estimators

As we have seen, pursuing UMVUEs is generally most successful when the modelis an exponential family. Location family models (see Section 4.2.4) provide anotheropportunity for optimal estimation, but instead of unbiasedness we look at shift-equivariance.

Recall that a location family of distributions is one for which the only parameter is the center; the shape of the density stays the same. Start with the random variable Z. We will assume it has density f and take the space to be R, where it may be that f(z) = 0 for some values of z. (E.g., Uniform(0, 1) is allowed.) The model on X has parameter θ ∈ T ≡ R, where X = θ + Z, hence has density f(x − θ).

The data will be n iid copies of X, X = (X1, . . . , Xn), so that the density of X is

fθ(x) = ∏_{i=1}^{n} f(xi − θ).     (19.61)

This θ may or may not be the mean or median. Examples include the Xi's being N(θ, 1), or Uniform(θ, θ + 1), or Laplace(θ) as in (19.50). By contrast, Uniform(0, θ) and Exponential(θ) are not location families since the spread as well as the center is affected by the θ.

A location family model is shift-invariant in that if we add the same constant a to all the Xi's, the resulting random vector has the same model as X. That is, suppose a ∈ R is fixed, and we look at the transformation

X∗ = (X∗1 , . . . , X∗n) = (X1 + a, . . . , Xn + a). (19.62)

The Jacobian of the transformation is 1, so the density of X∗ is

f∗θ(x∗) = ∏_{i=1}^{n} f(x∗i − a − θ) = ∏_{i=1}^{n} f(x∗i − θ∗), where θ∗ = θ + a.     (19.63)

Thus adding a to everything just shifts everything by a. Note that the space of X∗ isthe same as the space of X, Rn, and the space for the parameter θ∗ is the same as thatfor θ, R:

                   Model                      Model∗
Data               X                          X∗
Sample space       Rn                         Rn
Parameter space    R                          R
Density            ∏_{i=1}^n f(xi − θ)        ∏_{i=1}^n f(x∗i − θ∗)     (19.64)

Thus the two models are the same. Not that X and X∗ are equal, but that the sets of distributions considered for them are the same. This is the sense in which the model is shift-invariant. (If the model includes a prior on θ, then the two models are not the same, because the prior distribution on θ∗ would be different from that on θ.)

If δ(x1, . . . , xn) is an estimate of θ in the original model, then δ(x∗1, . . . , x∗n) = δ(x1 + c, . . . , xn + c) is an estimate of θ∗ = θ + c in Model∗. Thus it may seem reasonable that δ(x∗) = δ(x) + c. This idea leads to shift-equivariant estimators, where δ(x) is shift-equivariant if for any x and c,

δ(x1 + c, x2 + c, . . . , xn + c) = δ(x1, x2, . . . , xn) + c. (19.65)

The mean and the median are both shift-equivariant. For example, suppose we have iid measurements in degrees Celsius and wish to estimate the population mean. If someone else decides to redo the data in kelvins, which entails adding 273.15 to each observation, then you would expect the new estimate of the mean to be the old one plus 273.15.

Our goal is to find the best shift-equivariant estimator, where we evaluate estimators based on the mean square error:

MSEθ [δ] = Eθ [(δ(X)− θ)2]. (19.66)

If δ is shift-equivariant, and has a finite variance, then Exercise 19.8.19 shows that

Eθ [δ(X)] = θ + E0[δ(X)] and Varθ [δ(X)] = Var0[δ(X)], (19.67)

hence (using Exercise 11.7.1),

Biasθ [δ] = E0[δ(X)] and MSEθ [δ] = Var0[δ(X)] + E0[δ(X)]2 = E0[δ(X)2], (19.68)

which do not depend on θ.

19.7 The Pitman estimator

For any biased equivariant estimator, it is fairly easy to find one that is unbiased but has the same variance by just shifting it a bit. Suppose δ is biased, and b = E0[δ(X)] ≠ 0 is the bias. Then δ∗ = δ − b is also shift-equivariant, with

Varθ[δ∗(X)] = Var0[δ(X)] and Biasθ[δ∗(X)] = 0,     (19.69)

hence

MSEθ [δ∗] = Var0[δ(X)] < Var0[δ(X)] + E0[δ(X)]2 = MSEθ [δ]. (19.70)

Thus if a shift-equivariant estimator is biased, it can be improved, ergo ...

Lemma 19.6. If δ is the best shift-equivariant estimator, and it has a finite expected value, then it is unbiased.

In the normal case, X̄ is the best shift-equivariant estimator, because it is the best unbiased estimator, and it is shift-equivariant. But, of course, in the normal case, X̄ is always the best at everything. Now consider the general location family case. We will assume that Var0[Xi] < ∞. To find the best estimator, we first characterize the shift-equivariant estimators. We can use a similar trick as we did when finding the UMVUE when we took a simple unbiased estimator, then found its expected value given the sufficient statistic.

Page 349: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

19.7. The Pitman estimator 337

Start with an arbitrary shift-equivariant estimator δ. Let “Xn" be another shift-equivariant estimator, and look at the difference:

δ(x1, . . . , xn)− xn = δ(x1 − xn, x2 − xn, . . . , xn−1 − xn, 0). (19.71)

Define the function v on the differences yi = xi − xn,

v : Rn−1 −→ R,y −→ v(y) = δ(y1, . . . , yn−1, 0). (19.72)

Then what we have is that for any equivariant δ, there is a v such that

δ(x) = xn + v(x1 − xn, . . . , xn−1 − xn). (19.73)

Now instead of trying to find the best δ, we will look for the best v, then use (19.73)to get the δ. From (19.68), the best v is found by minimizing

E[δ(Z)2] = E0[(Zn + v(Z1 − Zn, . . . , Zn−1 − Zn))2]. (19.74)

(The Zi’s are iid with pdf f , i.e., θ = 0.) The trick is to condition on the differencesZ1 − Zn, . . . , Zn−1 − Zn:

E[(Zn + v(Z1 − Zn, . . . , Zn−1 − Zn))2] = E[e(Z1 − Zn, . . . , Zn−1 − Zn)], (19.75)

where

e(y1, . . . , yn−1) = E[(Zn + v(Z1 − Zn, . . . , Zn−1 − Zn))2 |

Z1 − Zn = y1, . . . , Zn−1 − Zn = yn−1]

= E[(Zn + v(y1, . . . , yn−1))2 |

Z1 − Zn = y1, . . . , Zn−1 − Zn = yn−1]. (19.76)

It is now possible to minimize e for each fixed set of yi's, e.g., by differentiating with respect to v(y1, . . . , yn−1). But we know the minimizing value is minus the (conditional) mean of Zn, that is, the best v is

v(y1, . . . , yn−1) = −E[Zn | Z1 − Zn = y1, . . . , Zn−1 − Zn = yn−1]. (19.77)

To find that conditional expectation, we need the joint pdf divided by the marginalto get the conditional pdf. For the joint, let

Y1 = Z1 − Zn, . . . , Yn−1 = Zn−1 − Zn, (19.78)

and find the pdf of (Y1, . . . , Yn−1, Zn). We use the Jacobian approach, so need theinverse function:

z1 = y1 + zn, . . . , zn−1 = yn−1 + zn, zn = zn. (19.79)

Exercise 19.8.20 verifies that the Jacobian of this transformation is 1, so that the pdfof (Y1, . . . , Yn−1, Zn) is

f∗(y1, . . . , yn−1, zn) = f(zn) ∏_{i=1}^{n−1} f(yi + zn),     (19.80)


and the marginal of the Yi’s is

f∗Y(y1, . . . , yn−1) = ∫_{−∞}^{∞} f(zn) ∏_{i=1}^{n−1} f(yi + zn) dzn.     (19.81)

The conditional pdf is then

f∗(zn | y1, . . . , yn−1) = f(zn) ∏_{i=1}^{n−1} f(yi + zn) / ∫_{−∞}^{∞} f(u) ∏_{i=1}^{n−1} f(yi + u) du,     (19.82)

hence the conditional mean is
E[Zn | Y1 = y1, . . . , Yn−1 = yn−1] = ∫_{−∞}^{∞} zn f(zn) ∏_{i=1}^{n−1} f(yi + zn) dzn / ∫_{−∞}^{∞} f(zn) ∏_{i=1}^{n−1} f(yi + zn) dzn = −v(y1, . . . , yn−1).     (19.83)

The best δ uses that v, so that from (19.73), with yi = xi − xn,

δ(x) = xn − ∫_{−∞}^{∞} zn f(zn) ∏_{i=1}^{n−1} f(xi − xn + zn) dzn / ∫_{−∞}^{∞} f(zn) ∏_{i=1}^{n−1} f(xi − xn + zn) dzn
     = ∫_{−∞}^{∞} (xn − zn) f(zn) ∏_{i=1}^{n−1} f(xi − xn + zn) dzn / ∫_{−∞}^{∞} f(zn) ∏_{i=1}^{n−1} f(xi − xn + zn) dzn.     (19.84)

Exercise 19.8.20 derives a more pleasing expression:

δ(x) = ∫_{−∞}^{∞} θ ∏_{i=1}^{n} f(xi − θ) dθ / ∫_{−∞}^{∞} ∏_{i=1}^{n} f(xi − θ) dθ.     (19.85)

This best estimator is called the Pitman estimator (Pitman, 1939). Although we assumed finite variance for Xi, the following theorem holds more generally. See Theorem 3.1.20 of Lehmann and Casella (2003).

Theorem 19.7. In the location-family model (19.61), if there exists any equivariant estimator with finite MSE, then the equivariant estimator with lowest MSE is given by the Pitman estimator (19.85).

The final expression in (19.85) is very close to a Bayes posterior mean. With prior π, the posterior mean is

E[θ | X = x] = ∫_{−∞}^{∞} θ fθ(x) π(θ) dθ / ∫_{−∞}^{∞} fθ(x) π(θ) dθ.     (19.86)

Thus the Pitman estimator can be thought of as the posterior mean for improper priorπ(θ) = 1. Note that we do not need iid observations, but only that the pdf is of theform fθ(x) = f (x1 − θ, . . . , xn − θ).

One thing special about this theorem is that it is constructive, that is, there is a formula telling exactly how to find the best. By contrast, when finding the UMVUE, you have to do some guessing. If you find an unbiased estimator that is a function of a complete sufficient statistic, then you have it, but it may not be easy to find. You might be able to use the power-series approach, or find an easy one and then use Rao-Blackwell, but maybe not. A drawback to the formula for the Pitman estimator is that the integrals may not be easy to perform analytically, although in practice it would be straightforward to use numerical integration. The exercises have a couple of examples, the normal and uniform, in which the integrals are doable. Additional examples follow.
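For instance, a numerical-integration version of (19.85) might look like the following minimal Python sketch (the Laplace density and the data values are arbitrary illustrations, and scipy's quad is just one possible quadrature routine):

import numpy as np
from scipy.integrate import quad

def pitman(x, f):
    # ratio of integrals in (19.85), computed numerically
    x = np.asarray(x, dtype=float)
    lik = lambda t: np.prod(f(x - t))
    num, _ = quad(lambda t: t * lik(t), -np.inf, np.inf)
    den, _ = quad(lik, -np.inf, np.inf)
    return num / den

laplace = lambda z: 0.5 * np.exp(-np.abs(z))    # f for the Laplace location family
x = [1.3, -0.2, 2.1]
print(pitman(x, laplace), np.median(x))         # compare with the sample median

With two observations this routine returns their average, matching the Laplace calculation in Section 19.7.2 below.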

19.7.1 Shifted exponential distribution

Here, f (x) = e−x I[x > 0], i.e., the Exponential(1) pdf. The location family is notExponential(θ), but a shifted Exponential(1) that starts at θ rather than at 0:

f(x − θ) = e^{−(x−θ)} I[x − θ > 0] = e^{−(x−θ)} if x > θ, and 0 if x ≤ θ.     (19.87)

Then the pdf of X is

f(x | θ) = ∏_{i=1}^{n} e^{−(xi−θ)} I[xi > θ] = e^{nθ} e^{−Σxi} I[x(1) > θ],     (19.88)

because all xi’s are greater than θ if and only if the minimum x(1) is. Then the Pitmanestimator is, from (19.85),

δ(x) = ∫_{−∞}^{∞} θ e^{nθ} e^{−Σxi} I[x(1) > θ] dθ / ∫_{−∞}^{∞} e^{nθ} e^{−Σxi} I[x(1) > θ] dθ = ∫_{−∞}^{x(1)} θ e^{nθ} dθ / ∫_{−∞}^{x(1)} e^{nθ} dθ.     (19.89)

Use integration by parts in the numerator, so that

∫_{−∞}^{x(1)} θ e^{nθ} dθ = (θ/n) e^{nθ} |_{−∞}^{x(1)} − (1/n) ∫_{−∞}^{x(1)} e^{nθ} dθ = (x(1)/n) e^{n x(1)} − (1/n^2) e^{n x(1)},     (19.90)

and the denominator is (1/n) e^{n x(1)}, hence
δ(x) = [(x(1)/n) e^{n x(1)} − (1/n^2) e^{n x(1)}] / [(1/n) e^{n x(1)}] = x(1) − 1/n.     (19.91)

Is that estimator unbiased? Yes, because it is the best equivariant estimator. Actually, it is the UMVUE, because x(1) is a complete sufficient statistic.
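A quick simulation check (a minimal Python sketch; θ and n are arbitrary):

import numpy as np

rng = np.random.default_rng(5)
n, theta, reps = 8, 4.0, 200_000
x = theta + rng.exponential(1.0, size=(reps, n))   # shifted Exponential(1) samples
print((x.min(axis=1) - 1/n).mean(), theta)         # the estimator (19.91) is unbiased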

19.7.2 Laplace distribution

Now

fθ(x) = ∏_{i=1}^{n} (1/2) e^{−|xi−θ|} = (1/2^n) e^{−Σ|xi−θ|},     (19.92)


and the Pitman estimator is

δ(x) = ∫_{−∞}^{∞} θ e^{−Σ|xi−θ|} dθ / ∫_{−∞}^{∞} e^{−Σ|xi−θ|} dθ = ∫_{−∞}^{∞} θ e^{−Σ|x(i)−θ|} dθ / ∫_{−∞}^{∞} e^{−Σ|x(i)−θ|} dθ,     (19.93)

where in the last expression we just substituted the order statistics. These integralsneed to be broken up, depending on which (x(i) − θ)’s are positive and which nega-tive. We’ll do the n = 2 case in detail, where we have three regions of integration:

θ < x(1) ⇒ ∑ |x(i) − θ| = x(1) + x(2) − 2θ;
x(1) < θ < x(2) ⇒ ∑ |x(i) − θ| = −x(1) + x(2);
x(2) < θ ⇒ ∑ |x(i) − θ| = −x(1) − x(2) + 2θ.     (19.94)

The numerator in (19.93) is

e^{−x(1)−x(2)} ∫_{−∞}^{x(1)} θ e^{2θ} dθ + e^{x(1)−x(2)} ∫_{x(1)}^{x(2)} θ dθ + e^{x(1)+x(2)} ∫_{x(2)}^{∞} θ e^{−2θ} dθ
= e^{−x(1)−x(2)} (x(1)/2 − 1/4) e^{2x(1)} + e^{x(1)−x(2)} (x(2)^2 − x(1)^2)/2 + e^{x(1)+x(2)} (x(2)/2 + 1/4) e^{−2x(2)}
= (1/2) e^{x(1)−x(2)} (x(1) + x(2)^2 − x(1)^2 + x(2)).     (19.95)

The denominator is

e^{−x(1)−x(2)} ∫_{−∞}^{x(1)} e^{2θ} dθ + e^{x(1)−x(2)} ∫_{x(1)}^{x(2)} dθ + e^{x(1)+x(2)} ∫_{x(2)}^{∞} e^{−2θ} dθ
= e^{−x(1)−x(2)} (1/2) e^{2x(1)} + e^{x(1)−x(2)} (x(2) − x(1)) + e^{x(1)+x(2)} (1/2) e^{−2x(2)}
= e^{x(1)−x(2)} (1 + x(2) − x(1)).     (19.96)

Finally,

δ(x) = (1/2) (x(1) + x(2)^2 − x(1)^2 + x(2)) / (1 + x(2) − x(1)) = (x(1) + x(2))/2,     (19.97)

which is a rather long way to calculate the mean! The answer for n = 3 is not the mean. Is it the median?

19.8 Exercises

Exercise 19.8.1. Let X ∼ Geometric(θ), θ ∈ (0, 1), so that the space of X is {0, 1, 2, . . .}, and the pmf is fθ(x) = (1 − θ)θ^x. For each g(θ) below, find an unbiased estimator of g(θ) based on X, if it exists. (Notice there is only one X.) (a) g(θ) = θ. (b) g(θ) = 0. (c) g(θ) = 1/θ. (d) g(θ) = θ2. (e) g(θ) = θ3. (f) g(θ) = 1/(1 − θ).

Exercise 19.8.2. At the start of Section 19.1, we noted that if X ∼ Binomial(n, θ) withθ ∈ (0, 1), the only functions g(θ) for which there is an unbiased estimator are thepolynomials in θ of degree less than or equal to n. This exercise finds these estimators.To that end, suppose Z1, . . . , Zn are iid Bernoulli(θ) and X = Z1 + · · ·+ Zn. (a) Forinteger 1 ≤ k ≤ n, show that δk(z) = z1 · · · zk is an unbiased estimator of θk. (b) What


is the distribution of Z given X = x? Why is this distribution independent of θ? (c) Show that
E[δk(Z) | X = x] = (x/n) · ((x − 1)/(n − 1)) · · · ((x − k + 1)/(n − k + 1)).     (19.98)

[Hint: Note that δk(z) is either 0 or 1, and it equals 1 if and only if the first k zi's are all 1. This probability is the same as that of drawing without replacement k 1's from a box with x 1's and n − x 0's.] (d) Find the UMVUE of g(θ) = ∑_{i=0}^{n} ai θ^i. Why is it UMVUE? (e) Specialize to g(θ) = Varθ[X] = nθ(1 − θ). Find the UMVUE of g(θ) if n ≥ 2.

Exercise 19.8.3. Suppose X1, . . . , Xn are iid Poisson(θ), θ ∈ (0, ∞). We wish to findan unbiased estimator of g(θ) = exp(−θ). Let δc(x) = exp(−cx). (a) Show that

E[δc(X)] = e^{−nθ} e^{nθ e^{−c/n}}.     (19.99)

[Hint: We know that S = X1 + · · · + Xn ∼ Poisson(nθ), so the expectation of δc(X) can be found using an exponential summation.] (b) Show that the MLE, exp(−x̄), is biased. What is its bias as n → ∞? (c) For n ≥ 2, find c∗ so that δc∗(x) is unbiased. Show that δc∗(x) = (1 − 1/n)^{nx̄}. (d) What happens when you try to find such a c for n = 1? What is the unbiased estimator when n = 1?

Exercise 19.8.4. Suppose X ∼ Binomial(n, θ), θ ∈ (0, 1). Find the Cramér-Rao lowerbound for estimating θ. Does the UMVUE achieve the CRLB?

Exercise 19.8.5. Suppose X1, . . . , Xn are iid Poisson(λ), λ ∈ (0, ∞). (a) Find theCRLB for estimating λ. Does the UMVUE achieve the CRLB? (b) Find the CRLB forestimating e−λ. Does the UMVUE achieve the CRLB?

Exercise 19.8.6. Suppose X1, . . . , Xn are iid Exponential(λ), λ ∈ (0, ∞). (a) Findthe CRLB for estimating λ. (b) Find the UMVUE of λ. (c) Find the variance ofthe UMVUE from part (b). Does it achieve the CRLB? (d) What is the limit ofCRLB/Var[UMVUE] as n→ ∞?

Exercise 19.8.7. Suppose the data consist of just one X ∼ Poisson(θ), and the goal is to estimate g(θ) = exp(−2θ). Thus you want to estimate the probability that no calls will come in the next two hours, based on just one hour's data. (a) Find the MLE of g(θ). Find its expected value. Is it unbiased? (b) Find the UMVUE of g(θ). Does it make sense? [Hint: Show that if the estimator δ(x) is unbiased, exp(−θ) = ∑_{x=0}^{∞} δ(x) θ^x/x! for all θ. Then write the left-hand side as a power series in θ, and match coefficients of θ^x on both sides.]

Exercise 19.8.8. For each situation, decide how many different unbiased estimators of the given g(θ) there are (zero, one, or lots). (a) X ∼ Binomial(n, θ), θ ∈ (0, 1), g(θ) = θ^{n+1}. (b) X ∼ Poisson(θ1) and Y ∼ Poisson(θ2), and X and Y are independent, θ = (θ1, θ2) ∈ (0, ∞) × (0, ∞). (i) g(θ) = θ1; (ii) g(θ) = θ1 + θ2; (iii) g(θ) = 0. (c) (X(1), X(n)) derived from a sample of iid Uniform(θ, θ + 1)'s, θ ∈ R. (i) g(θ) = θ; (ii) g(θ) = 0.

Exercise 19.8.9. Let X1, . . . , Xn be iid Exponential(λ), λ ∈ (0, ∞), and

g(λ) = Pλ[Xi > 1] = e−λ. (19.100)


Then δ(X) = I[X1 > 1] is an unbiased estimator of g(λ). Also, T = X1 + · · · + Xn is sufficient. (a) What is the distribution of U = X1/T? (b) Is T independent of the U in part (a)? (c) Find P[U > u] for u ∈ (0, ∞). (d) P[X1 > 1 | T = t] = P[U > u | T = t] = P[U > u] for what u (which is a function of t)? (e) Find E[δ(X) | T = t]. Is that the UMVUE of g(λ)?

Exercise 19.8.10. Suppose X and Y are independent, with X ∼ Poisson(2θ) and Y ∼ Poisson(2(1 − θ)), θ ∈ (0, 1). (This is the same model as in Exercise 13.8.2.) (a) How many unbiased estimators are there of g(θ) when (i) g(θ) = θ? (ii) g(θ) = 0? (b) Is (X, Y) sufficient? Is (X, Y) minimal sufficient? Is the model complete for (X, Y)? (c) Find Fisher's information. (d) Consider the unbiased estimators δ1(x, y) = x/2, δ2(x, y) = 1 − y/2, and δ3(x, y) = (x − y)/4 + 1/2. None of these achieve the CRLB. But for each estimator, consider CRLBθ/Varθ[δi]. For each estimator, this ratio equals 1 for some θ ∈ [0, 1]. Which θ for which estimator?

Exercise 19.8.11. For each of the models below, say whether it is an exponential familymodel. If it is, give the natural parameter and sufficient statistic, and say whetherthe model for the sufficient statistic is complete or not, justifying your assertion.(a) X1, . . . , Xn are iid N(µ, µ2), where µ ∈ R. (b) X1, . . . , Xn are iid N(θ, θ), whereθ ∈ (0, ∞). (c) X1, . . . , Xn are iid Laplace(µ), where µ ∈ R. (d) X1, . . . , Xn are iidBeta(α, β), where (α, β) ∈ (0, ∞)2.

Exercise 19.8.12. Suppose X ∼ Multinomial(n, p), where n is fixed and

p ∈ {p ∈ RK | 0 < pi for all i = 1, . . . , K, and p1 + · · · + pK = 1}.     (19.101)

(a) Find a sufficient statistic for which the model is complete, and show that it iscomplete. (b) Is the model complete for X? Why or why not? (c) Find the UMVUE,if it exists, for p1. (d) Find the UMVUE, if it exists, for p1 p2.

Exercise 19.8.13. Go back to the fruit fly example, where in Exercise 14.9.2 we havethat (N00, N01, N10, N11) ∼ Multinomial(n, p(θ)), with

p(θ) = ((1/2)(1 − θ)(2 − θ), (1/2)θ(1 − θ), (1/2)θ(1 − θ), (1/2)θ(1 + θ)).     (19.102)

(a) Show that for n > 2, the model is a two-dimensional exponential family with sufficient statistic (N00, N11). Give the natural parameter (η1(θ), η2(θ)). (b) Sketch the parameter space of the natural parameter. Does it contain a two-dimensional open rectangle? (c) The Dobzhansky estimator for θ is given in (6.65). It is unbiased with variance 3θ(1 − θ)/(4n). The Fisher information is given in (14.110). Does the Dobzhansky estimator achieve the CRLB? What is the ratio of the CRLB to the variance? Does the ratio approach 1 as n → ∞? [Hint: The ratio is the same as the asymptotic efficiency found in Exercise 14.9.2(c).]

Exercise 19.8.14. Suppose (X1, Y1)′, · · · , (Xn, Yn)′ are iid N2((0, 0)′, σ2 (1 ρ; ρ 1)),     (19.103)

where (σ2, ρ) ∈ (0, ∞) × (−1, 1), and n > 2. (a) Write the model as a two-dimensional exponential family. Give the natural parameter and sufficient statistic. (b) Sketch the parameter space for the natural parameter. (First, show that θ2 = −2θ1ρ.) Does it contain an open nonempty two-dimensional rectangle? (c) Is the model for the sufficient statistic complete? Why or why not? (d) Find the expected value of the sufficient statistic (there are two components). (e) Is there a UMVUE for σ2? If so, what is it?

Exercise 19.8.15. Take the same model as in Exercise 19.8.14, but with σ2 = 1, sothat the parameter is just ρ ∈ (−1, 1). (a) Write the model as a two-dimensionalexponential family. Give the natural parameter and sufficient statistic. (b) Sketch theparameter space for the natural parameter. Does it contain an open nonempty two-dimensional rectangle? (c) Find the expected value of the sufficient statistic. (d) Is themodel for the sufficient statistic complete? Why or why not? (e) Find an unbiasedestimator of ρ that is a function of the sufficient statistic. Either show this estimatoris UMVUE, or find another unbiased estimator of ρ that is a function of the sufficientstatistic.

Exercise 19.8.16. Suppose Y ∼ N(xβ, σ2In) as in (12.10), where x′x is invertible, so that we have the normal linear model. Exercise 13.8.19 shows that (β̂, SSe) is the sufficient statistic, where β̂ is the least squares estimate of β, and SSe = ‖y − xβ̂‖2, the sum of squared residuals. (a) Show that the pdf of Y is an exponential family density with natural statistic (β̂, SSe). What is the natural parameter? [Hint: See the likelihood in (13.89).] (b) Show that the model is complete for this sufficient statistic. (c) Argue that β̂i is the UMVUE of βi for each i. (d) What is the UMVUE of σ2? Why?

Exercise 19.8.17. Suppose X1 and X2 are iid Uniform(0, θ), with θ ∈ (0, ∞). (a) Let (X(1), X(2)) be the order statistics. Find the space and pdf of the order statistics. Are the order statistics sufficient? (b) Find the space and pdf of W = X(1)/X(2). [Find P[W ≤ w | θ] in terms of (X(1), X(2)), then differentiate. The pdf should not depend on θ.] What is E[W]? (c) Is the model for (X(1), X(2)) complete? Why or why not? (d) We know that T = X(2) is a sufficient statistic. Find the space and pdf of T. (e) Find Eθ[T], and an unbiased estimator of θ. (f) Show that the model for T is complete. (Note that if ∫_0^θ h(t) dt = 0 for all θ, then (∂/∂θ) ∫_0^θ h(t) dt = 0 for almost all θ. See Section 19.2.2 for general n.) (g) Is the estimator in part (e) the UMVUE?

Exercise 19.8.18. Suppose Xijk's are independent, i = 1, 2; j = 1, 2; k = 1, 2. Let Xijk ∼ N(µ + αi + βj, 1), where (µ, α1, α2, β1, β2) ∈ R5. These data can be thought of as observations from a two-way analysis of variance with two observations per cell:

         Column 1        Column 2
Row 1    X111, X112      X121, X122
Row 2    X211, X212      X221, X222     (19.104)

Let

Xi++ = Xi11 + Xi12 + Xi21 + Xi22, i = 1, 2 (row sums),
X+j+ = X1j1 + X1j2 + X2j1 + X2j2, j = 1, 2 (column sums),     (19.105)

and X+++ be the sum of all eight observations. (a) Write the pdf of the data as a K = 5 dimensional exponential family, where the natural parameter is (µ, α1, α2, β1, β2). What is the natural sufficient statistic? (b) Rewrite the model as a K = 3 dimensional exponential family. (Note that X2++ = X+++ − X1++, and similarly for the columns.) What are the natural parameter and natural sufficient statistic? (c) Is the model in part (b) complete? (What is the space of the natural parameter?) (d) Find the expected values of the natural sufficient statistics from part (b) as functions of (µ, α1, α2, β1, β2). (e) For each of the following, find an unbiased estimator if possible, and if you can, say whether it is UMVUE or not. (Two are possible, two not.) (i) µ; (ii) µ + α1 + β1; (iii) β1; (iv) β1 − β2.

Exercise 19.8.19. Suppose X1, . . . , Xn are iid with the location family model given by density f, and δ(x) is a shift-equivariant estimator of the location parameter θ. Let Z1, . . . , Zn be the iid random variables with pdf f(zi), so that Xi =D Zi + θ. (a) Show that δ(X) =D δ(Z) + θ. (b) Assuming the mean and variance exist, show that Eθ[δ(X)] = E[δ(Z)] + θ and Varθ[δ(X)] = Var[δ(Z)]. (c) Show that therefore Biasθ[δ(X)] = E[δ(Z)] and MSEθ[δ(X)] = E[δ(Z)2], which imply (19.68).

Exercise 19.8.20. Suppose Z1, . . . , Zn are iid with density f(zi), and let Y1 = Z1 − Zn, . . . , Yn−1 = Zn−1 − Zn as in (19.78). (a) Consider the one-to-one transformation (z1, . . . , zn) ↔ (y1, . . . , yn−1, zn). Show that the inverse transformation is given as in (19.79), its Jacobian is 1, and its pdf is f(zn) ∏_{i=1}^{n−1} f(yi + zn) as in (19.80). (b) Show that
∫_{−∞}^{∞} (xn − zn) f(zn) ∏_{i=1}^{n−1} f(xi − xn + zn) dzn / ∫_{−∞}^{∞} f(zn) ∏_{i=1}^{n−1} f(xi − xn + zn) dzn = ∫_{−∞}^{∞} θ ∏_{i=1}^{n} f(xi − θ) dθ / ∫_{−∞}^{∞} ∏_{i=1}^{n} f(xi − θ) dθ,     (19.106)

verifying the expression of the Pitman estimator in (19.85). [Hint: Use the substitutionθ = xn − zn in the integrals.]

Exercise 19.8.21. Consider the location family model with just one X. Show that thePitman estimator is X− E0[X] if the expectation exists.

Exercise 19.8.22. Show that X̄ is the Pitman estimator in the N(θ, 1) location family model. [Hint: In the pdf of the normal, write ∑(xi − θ)2 = n(x̄ − θ)2 + ∑(xi − x̄)2.]

Exercise 19.8.23. Find the Pitman estimator in the Uniform(θ, θ + 1) location familymodel, so that f (x) = I[0 < x < 1]. [Hint: Note that the density ∏ I[θ < xi <θ + 1] = 1 if x(n) − 1 < θ < x(1), and 0 otherwise.]


Chapter 20

The Decision-Theoretic Approach

20.1 Binomial estimators

In the previous chapter, we found the best estimators when looking at a restricted set of estimators (unbiased or shift-equivariant) in some fairly simple models. More generally, one may not wish to restrict choices to just unbiased estimators, say, or there may be no obvious equivariance or other structural limitations to impose. Decision theory is one approach to this larger problem, where the key feature is that some procedures do better for some values of the parameter, and other procedures do better for other values of the parameter.

For example, consider the simple situation where Z1 and Z2 are iid Bernoulli(θ),θ ∈ (0, 1), and we wish to estimate θ using the mean square error as the criterion.Here are five possible estimators:

δ1(z) = (z1 + z2)/2                      (The MLE, UMVUE);
δ2(z) = z1/3 + 2z2/3                     (Unbiased);
δ3(z) = (z1 + z2 + 1)/6                  (Bayes wrt Beta(1, 3) prior);
δ4(z) = (z1 + z2 + 1/√2)/(2 + √2)        (Bayes wrt Beta(1/√2, 1/√2) prior);
δ5(z) = 1/2                              (Constant at 1/2).     (20.1)

Figure 20.1 graphs the MSEs, here called the "risks." Notice the risk functions cross, that is, for the most part when comparing two estimators, sometimes one is better, sometimes the other is. The one exception is that δ1 is always better than δ2. We know this because both are unbiased, and the first one is the UMVUE. Decision-theoretically, we say that δ2 is inadmissible among these five estimators. The other four are admissible among these five, since none of them is dominated by any of the others. Even δ5(z) = 1/2 is admissible, since its MSE at θ = 1/2 is zero, and no other estimator can claim such.

Rather than evaluate the entire curve, one may wish to know what is the worst risk each estimator has. An estimator with the lowest worst risk is called minimax. The table contains the maximum risk for each of the estimators:

                 δ1        δ2        δ3        δ4        δ5
Maximum risk     0.1250    0.1389    0.1600    0.0429    0.2500     (20.2)



Figure 20.1: MSEs for the estimators given in (20.1).

As can be seen from either the table or the graph, the minimax procedure is δ4, with a maximum risk of 0.0429 (= 1/(12 + 8√2)). In fact, the risk for this estimator is constant, which the √2's in the estimator were chosen to achieve.

This example exhibits the three main concepts of statistical decision theory: Bayes, admissibility, and minimaxity. The next section presents the formal setup.
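The risk curves in Figure 20.1 can be computed directly (a minimal Python sketch that evaluates each estimator's MSE by summing over the four possible samples; the θ grid is an arbitrary choice):

import numpy as np

estimators = {
    "delta1": lambda z1, z2: (z1 + z2) / 2,
    "delta2": lambda z1, z2: z1 / 3 + 2 * z2 / 3,
    "delta3": lambda z1, z2: (z1 + z2 + 1) / 6,
    "delta4": lambda z1, z2: (z1 + z2 + 1 / np.sqrt(2)) / (2 + np.sqrt(2)),
    "delta5": lambda z1, z2: 1 / 2,
}

def risk(delta, theta):
    # R(theta; delta) = E_theta[(delta(Z1, Z2) - theta)^2], summing over the four outcomes
    return sum(theta**(z1 + z2) * (1 - theta)**(2 - z1 - z2) * (delta(z1, z2) - theta)**2
               for z1 in (0, 1) for z2 in (0, 1))

for name, delta in estimators.items():
    print(name, [round(risk(delta, t), 4) for t in (0.1, 0.25, 0.5, 0.75, 0.9)])
# delta4's risk is constant in theta, equal to 1/(12 + 8*sqrt(2)) = 0.0429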

20.2 Basic setup

We assume the usual statistical model: random vector X, space X , and set of distributions P = {Pθ | θ ∈ T }, where T is the parameter space. The decision-theoretic approach supposes an action space A that specifies the possible "actions" we might take, which represent the possible outputs of the inference. For example, in estimation the action is the estimate, in testing the action is accept or reject the null, and in model selection the action is the model selected.

A decision procedure specifies which action to take for each possible value of the data. Formally, a decision procedure is a function δ(x),

δ : X −→ A. (20.3)


The above is a nonrandomized decision procedure. A randomized procedure would depend on not just the data x, but also some outside randomization element. See Section 20.7.

A good procedure is one that takes good actions. To measure how good, we need a loss function that specifies a penalty for taking a particular action when a particular distribution obtains. Formally, the loss function L is a function

L : A × T −→ [0, ∞).   (20.4)

When estimating a function g(θ), common loss functions are squared-error loss,

L(a, θ) = (a − g(θ))²,   (20.5)

and absolute-error loss,

L(a, θ) = |a − g(θ)|.   (20.6)

In hypothesis testing or model selection, a "0 − 1" loss is common, where you lose 0 by making the correct decision, and lose 1 if you make a mistake.

A frequentist evaluates procedures by their behavior over repeated experiments. In this decision-theoretic framework, the risk function for a particular decision procedure δ is key. The risk is the expected loss, where δ(X) takes the place of the action a. It is a function of θ:

R(θ ; δ) = Eθ[L(δ(X), θ)] = E[L(δ(X), Θ) | Θ = θ].   (20.7)

The two expectations are the same, but the first is written for frequentists, and the second for Bayesians. In estimation problems with L being squared-error loss (20.5), the risk is the mean square error:

R(θ ; δ) = Eθ[(δ(X) − g(θ))²] = MSEθ[δ].   (20.8)

The idea is to choose a δ with small risk. The challenge is that usually there is no one procedure that is best for every θ, as we saw in Section 20.1. One way to choose is to restrict consideration to a subset of procedures, e.g., unbiased estimators as in Section 19.1, or shift-equivariant ones as in Section 19.6, or in hypothesis testing, tests with set level α. Often, one either does not wish to use such restrictions, or cannot. In the absence of a uniquely defined best procedure, frequentists have several possible strategies, among them determining admissible procedures, minimax procedures, or Bayes procedures.

20.3 Bayes procedures

One method for selecting among various δ's is to find one that minimizes the average of the risk, where the average is taken over θ. This averaging needs a probability measure on T. From a Bayesian perspective, this distribution is the prior. From a frequentist perspective, it may or may not reflect prior belief, but it should be "reasonable." The procedure that minimizes this average is the Bayes procedure corresponding to the distribution. We first extend the definition of risk to a function of distributions π on T, where the Bayes risk at π is the expectation of the risk over θ:

R(π ; δ) = Eπ [R(θ ; δ)]. (20.9)


Definition 20.1. For given risk function R(θ ; δ), set of procedures D, and distribution π on T, a Bayes procedure with respect to (wrt) D and π is a procedure δπ ∈ D with R(π ; δπ) < ∞ that minimizes the Bayes risk over δ, i.e.,

R(π ; δπ) ≤ R(π ; δ) for any δ ∈ D. (20.10)

It might look daunting to minimize over an entire function δ, but we can reduce the problem to minimizing over a single value by using an iterated expectation. With both X and Θ random, the Bayes risk is the expected value of the loss over the joint distribution of (X, Θ), hence can be written as the expected value of the conditional expected value of L given X:

R(π ; δ) = E[L(δ(X), Θ)] = E[L̃(X)],   (20.11)

where

L̃(x) = E[L(δ(x), Θ) | X = x].   (20.12)

In that final expectation, Θ is random, having the posterior distribution given X = x. In (20.12), because x is fixed, δ(x) is just a constant, hence it may not be too difficult to minimize L̃(x) over δ(x). If we find the δ(x) to minimize L̃(x) for each x, then we have also minimized the overall Bayes risk (20.11). Thus a Bayes procedure is δπ such that

δπ(x) minimizes E[L(δ(x), Θ) | X = x] over δ(x) for each x ∈ X.   (20.13)

In estimation with squared-error loss, (20.12) becomes

L̃(x) = E[(δ(x) − g(Θ))² | X = x].   (20.14)

That expression is minimized with δ(x) being the mean, in this case the conditional (posterior) mean of g:

δπ(x) = E[g(Θ) |X = x]. (20.15)

If the loss is absolute error (20.6), the Bayes procedure is the posterior median.

A Bayesian does not care about x's not observed, hence would immediately go to the conditional equation (20.13), and use the resulting δπ(x). It is interesting that the decision-theoretic approach appears to bring Bayesians and frequentists together. They do end up with the same procedure, but from different perspectives. The Bayesian is trying to limit expected losses given the data, while the frequentist is trying to limit average expected losses, taking expected values as the experiment is repeated.
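As a quick numerical illustration of (20.13)–(20.15) (a sketch, not part of the text), take the Beta–binomial setup of Section 20.1 with the Beta(1/√2, 1/√2) prior and, for each observed x, minimize the posterior expected squared-error loss over a grid of actions a; the minimizer agrees with the posterior mean (x + 1/√2)/(2 + √2), i.e., with δ4.

```python
import numpy as np
from scipy import stats

n = 2
alpha = beta = 1 / np.sqrt(2)    # the Beta(1/sqrt(2), 1/sqrt(2)) prior of Section 20.1

for x in range(n + 1):
    # Posterior is Beta(x + alpha, n - x + beta).  Approximate the posterior expected
    # loss E[(a - theta)^2 | X = x] by Monte Carlo, and minimize over a grid of actions a.
    theta = stats.beta.rvs(x + alpha, n - x + beta, size=100_000, random_state=0)
    actions = np.linspace(0, 1, 501)
    post_loss = np.array([np.mean((a - theta) ** 2) for a in actions])
    best_action = actions[np.argmin(post_loss)]
    posterior_mean = (x + alpha) / (n + alpha + beta)    # this is delta_4(x)
    print(x, round(best_action, 3), round(posterior_mean, 3))   # agree up to grid/MC error
```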

20.4 Admissibility

Recall Figure 20.1 in Section 20.1. The important message in the graph is that none of the five estimators is obviously "best" in terms of MSE. In addition, only one of them is discardable in the sense that another estimator is better. A fairly weak criterion is this lack of discardability. Formally, the estimator δ′ is said to dominate the estimator δ if

R(θ ; δ′) ≤ R(θ ; δ) for all θ ∈ T , and

R(θ ; δ′) < R(θ ; δ) for at least one θ ∈ T . (20.16)


In Figure 20.1, δ1 dominates δ2, but there are no other such dominations. The concept of admissibility is based on lack of domination.

Definition 20.2. Let D be a set of decision procedures. A δ ∈ D is inadmissible among procedures in D if there is another δ′ ∈ D that dominates δ. If there is no such δ′, then δ is admissible among procedures in D.

A handy corollary of the definition is that δ is admissible if

R(θ ; δ′) ≤ R(θ ; δ) for all θ ∈ T ⇒ R(θ ; δ′) = R(θ ; δ) for all θ ∈ T . (20.17)

If D is the set of unbiased estimators, then the UMVUE is admissible in D, and any other unbiased estimator is inadmissible. Similarly, if D is the set of shift-equivariant procedures (assuming that restriction makes sense for the model), then the best shift-equivariant estimator is the only admissible estimator in D. In the example of Section 20.1 above, if D consists of the five given estimators, then all but δ2 are admissible in D.

The presumption is that one would not want to use an inadmissible procedure, at least if risk is the only consideration. Other considerations, such as intuitive appeal or computational ease, may lead one to use an inadmissible procedure, provided it cannot be dominated by much. Conversely, any admissible procedure is presumed to be at least plausible, although there are some strange ones.

It is generally not easy to decide whether a procedure is admissible or not. For the most part, Bayes procedures are admissible. A Bayes procedure with respect to a prior π has good behavior averaging over the θ. Any procedure that dominated that procedure would also have to have at least as good Bayes risk, hence would also be Bayes. The next lemma collects some sufficient conditions for a Bayes estimator to be admissible. The lemma also holds if everything is stated relative to a restricted set of procedures D.

Lemma 20.3. Suppose δπ is Bayes wrt the prior π. Then it is admissible if any of the following hold: (a) It is admissible among the set of estimators that are Bayes wrt π. (b) It is the unique Bayes procedure, up to equivalence, wrt π. That is, if δ′π is also Bayes wrt π, then R(θ ; δπ) = R(θ ; δ′π) for all θ ∈ T. (c) The parameter space is finite or countable, T = {θ1, . . . , θK}, and the pmf π is positive, π(θk) > 0 for each k = 1, . . . , K. (K may be +∞.) (d) The parameter space is T = (a, b) (−∞ ≤ a < b ≤ ∞), the risk function for any procedure δ is continuous in θ, and for any nonempty interval (c, d) ⊂ T, π(c, d) > 0. (e) The parameter space T is an open subset of Rp, the risk function for any procedure δ is continuous in θ, and for any nonempty open set B ⊂ T, π(B) > 0.

The proofs of parts (b) and (c) are found in Exercises 20.9.1 and 20.9.2. Note that the condition on π in part (d) holds if π has a pdf that is positive for all θ. Part (e) is a multivariate analog of part (d), left to the reader.

Proof. (a) Suppose the procedure δ′ satisfies R(θ ; δ′) ≤ R(θ ; δπ) for all θ ∈ T. Then by taking expected values over θ wrt π,

R(π ; δ′) = Eπ [R(Θ ; δ′)] ≤ Eπ [R(Θ ; δπ)] = R(π ; δπ). (20.18)

Thus δ′ is also Bayes wrt π. But by assumption, δπ is admissible among Bayes procedures, hence δ′ must have the same risk as δπ for all θ, which by (20.17) proves δπ is admissible.


Figure 20.2: Illustration of domination with a continuous risk function. Here, δ′ dominates δ. The function graphed is the difference in risks, d(θ) = R(θ ; δ′) − R(θ ; δ). The top panel shows the big view, where the dashed line represents zero. The θ′ is a point at which δ′ is strictly better than δ. The bottom panel zooms in on the area near θ′, showing d(θ) ≤ −ε for θ′ − α < θ < θ′ + α.

(d) Again suppose δ′ has risk no larger than δπ's, and for some θ′ ∈ T, R(θ′ ; δ′) < R(θ′ ; δπ). By the continuity of the risk functions, the inequality must hold for an interval around θ′. That is, there exist α > 0 and ε > 0 such that

R(θ ; δ′)− R(θ ; δπ) ≤ −ε for θ′ − α < θ < θ′ + α. (20.19)

See Figure 20.2. (Since T is open, α can be taken small enough so that θ′ ± α are in T.) Then integrating over θ,

R(π ; δ′) − R(π ; δπ) ≤ −ε π(θ′ − α < θ < θ′ + α) < 0,   (20.20)

meaning δ′ has better Bayes risk than δπ. This is a contradiction, hence there is no such θ′, i.e., δπ is admissible.

In Section 20.8, we will see that when the parameter space is finite (and other conditions hold), all admissible procedures are Bayes. In general, not all admissible procedures are Bayes, but they are at least limits of Bayes procedures in some sense. Exactly which limits are admissible is a bit delicate, though. In any case, at least approximately, one can think of a procedure being admissible if there is some Bayesian


somewhere, or a sequence of Bayesians, who would use it. See Section 22.3 for the hypothesis testing case. Ferguson (1967) is an accessible introduction to the relationship between Bayes procedures and admissibility; Berger (1993) and Lehmann and Casella (2003) contain additional results and pointers to more recent work.

20.5 Estimating a normal mean

Consider estimating a normal mean based on an iid sample with known variance using squared-error loss, so that the risk is the mean square error. The sample mean is the obvious estimator, which is the UMVUE, MLE, best shift-equivariant estimator, etc. It is not a Bayes estimator because it is unbiased, as seen in Exercise 11.7.21. It is the posterior mean for the improper prior Uniform(−∞, ∞) on µ, as in Exercise 11.7.16. A posterior mean using an improper prior is in this context called a generalized Bayes procedure. It may or may not be admissible, but any admissible procedure here has to be Bayes or generalized Bayes wrt some prior, proper or improper as the case may be. See Sacks (1963) or Brown (1971).

We first simplify the model somewhat by noting that the sample mean is a sufficient statistic. From the Rao-Blackwell theorem (Theorem 13.8 on page 210), any estimator that is not a function of just the sufficient statistic can be improved upon by a function of the sufficient statistic that has the same bias but smaller variance, hence smaller mean square error. Thus we may as well limit ourselves to functions of the mean, which can be represented by the model

X ∼ N(µ, 1), µ ∈ R. (20.21)

We wish to show that δ0(x) = x is admissible. Since it is unbiased and has variance 1, its risk is R(µ ; δ0) = 1.

We will use the method of Blyth (1951) to show admissibility. The first step is to find a sequence of proper priors that approximates the uniform prior. We will take πn to be the N(0, n) prior on µ. Exercise 7.8.14 shows that the Bayes estimator wrt πn is

δn(x) = Eπn[M | X = x] = (n/(n + 1)) x,   (20.22)

which is admissible. Exercise 20.9.5 shows that the Bayes risks are

R(πn ; δ0) = 1 and R(πn ; δn) = n/(n + 1).   (20.23)

Thus the Bayes risk of δn is very close to that of δ0 for large n, suggesting that δ0 must be close to admissible.
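A small Monte Carlo check of (20.23) (a sketch, not part of the text): draw µ from the N(0, n) prior, then X | µ ∼ N(µ, 1), and average the squared errors of δ0(x) = x and δn(x) = nx/(n + 1).

```python
import numpy as np

rng = np.random.default_rng(0)
for n in [1, 10, 100]:
    mu = rng.normal(0.0, np.sqrt(n), size=1_000_000)       # mu ~ N(0, n), the prior pi_n
    x = rng.normal(mu, 1.0)                                 # X | mu ~ N(mu, 1)
    bayes_risk_d0 = np.mean((x - mu) ** 2)                  # should be near 1
    bayes_risk_dn = np.mean((n * x / (n + 1) - mu) ** 2)    # should be near n/(n + 1)
    print(n, round(bayes_risk_d0, 3), round(bayes_risk_dn, 3), round(n / (n + 1), 3))
```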

As in the proof of Lemma 20.3(d), assume that the estimator δ′ dominates δ0. Since R(µ ; δ0) = 1, it must be that

R(µ ; δ′) ≤ 1 for all µ, and R(µ′ ; δ′) < 1 for some µ′. (20.24)

We look at the difference in Bayes risks:

R(πn ; δ0) − R(πn ; δ′) = (R(πn ; δ0) − R(πn ; δn)) + (R(πn ; δn) − R(πn ; δ′))
                        ≤ R(πn ; δ0) − R(πn ; δn)
                        = 1/(n + 1),   (20.25)


the inequality holding since δn is Bayes wrt πn. Using πn explicitly, we have

∫_{−∞}^{∞} (1 − R(µ ; δ′)) (1/√(2πn)) e^{−µ²/(2n)} dµ ≤ 1/(n + 1).   (20.26)

Both sides look like they would go to zero, but multiplying by √n may produce a nonzero limit on the left:

∫_{−∞}^{∞} (1 − R(µ ; δ′)) (1/√(2π)) e^{−µ²/(2n)} dµ ≤ √n/(n + 1).   (20.27)

The risk function R(µ ; δ′) is continuous in µ (see Ferguson (1967), Section 3.7), which together with (20.24) shows that there exist α > 0 and ε > 0 such that

1− R(µ ; δ′) ≥ ε for µ′ − α < µ < µ′ + α. (20.28)

See Figure 20.2. Hence from (20.27) we have that

(ε/√(2π)) ∫_{µ′−α}^{µ′+α} e^{−µ²/(2n)} dµ ≤ √n/(n + 1).   (20.29)

Letting n → ∞, we obtain

2αε/√(2π) ≤ 0,   (20.30)

which is a contradiction. Thus there is no δ′ that dominates δ0, hence δ0 is admissible.

To recap, the basic idea in using Blyth's method to show δ0 is admissible is to find a sequence of priors πn and constants cn (which are √n in our example) so that if δ′ dominates δ0,

cn(R(πn ; δ0) − R(πn ; δ′)) → C > 0 and cn(R(πn ; δ0) − R(πn ; δn)) → 0.   (20.31)

If the risk function is continuous and T = (a, b), then the first condition in (20.31) holds if

cn πn(c, d) → C′ > 0   (20.32)

for any c < d such that (c, d) ⊂ T.

It may not always be possible to use Blyth's method, as we will see in the next section.

20.5.1 Stein’s surprising result

Admissibility is generally considered a fairly weak criterion. An admissible procedure does not have to be very good everywhere, but just have something going for it. Thus the statistical community was rocked when Charles Stein (Stein, 1956b) showed that in the multivariate normal case, the usual estimator of the mean could be inadmissible.

The model has random vector

X ∼ N(µ, Ip), (20.33)


with µ ∈ Rp. The objective is to estimate µ with squared-error loss, which in this case is multivariate squared error:

L(a, µ) = ‖a − µ‖² = ∑_{i=1}^{p} (ai − µi)².   (20.34)

The risk is then the sum of the mean square errors for the individual estimators of the µi's. Again the obvious estimator is δ0(x) = x, which has risk

R(µ ; δ0) = Eµ[‖X − µ‖²] = p   (20.35)

because the Xi’s all have variance 1. As we saw above, when p = 1 δ0 is admissible.When p = 2, we could try to use Blyth’s method with prior on the bivariate µ

being N(0, nI2), but in the step analogous to (20.27), we would multiply by n insteadof√

n, hence the limit on the right-hand side of (20.30) would be 1 instead of 0,thus not necessarily be a contradiction. Brown and Hwang (1982) present a morecomplicated prior for which Blyth’s method does prove admissibility of δ0(x) = x.

The surprise is that when p ≥ 3, δ0 is inadmissible. The most famous estimator that dominates it is the James-Stein estimator (James and Stein, 1961),

δJS(x) = (1 − (p − 2)/‖x‖²) x.   (20.36)

It is a shrinkage estimator, because it takes the usual estimator, and shrinks it (towards 0 in this case), at least when (p − 2)/‖x‖² < 1. Throughout the 1960s and 1970s, there was a frenzy of work on various shrinkage estimators. They are still quite popular. The domination result is not restricted to normality; it is quite broad. The general notion of shrinkage is very important in machine learning, where better predictions are found by restraining estimators from becoming too large using regularization (Section 12.5).
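Before deriving the risk formula, here is a minimal Monte Carlo sketch (not part of the text) comparing the estimated risks of δ0(x) = x and the James-Stein estimator (20.36) for p = 5 at several values of ‖µ‖. The James-Stein risk stays below p and climbs back toward p as ‖µ‖ grows.

```python
import numpy as np

# Monte Carlo risk comparison for p = 5: usual estimator x vs James-Stein (20.36).
rng = np.random.default_rng(1)
p, reps = 5, 100_000
for norm_mu in [0.0, 2.0, 5.0, 20.0]:
    mu = np.zeros(p)
    mu[0] = norm_mu                  # the risk depends on mu only through its norm
    x = rng.normal(mu, 1.0, size=(reps, p))
    shrink = 1 - (p - 2) / np.sum(x ** 2, axis=1)        # shrinkage factor in (20.36)
    js = shrink[:, None] * x                             # James-Stein estimates
    risk_usual = np.mean(np.sum((x - mu) ** 2, axis=1))  # approximately p = 5
    risk_js = np.mean(np.sum((js - mu) ** 2, axis=1))    # below 5, approaching 5 as ||mu|| grows
    print(norm_mu, round(risk_usual, 2), round(risk_js, 2))
```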

To find the risk function for the James-Stein estimator when p ≥ 3, start by writing

R(µ ; δJS) = Eµ[‖δJS(X) − µ‖²]
           = Eµ[‖X − µ − ((p − 2)/‖X‖²) X‖²]
           = Eµ[‖X − µ‖²] + Eµ[(p − 2)²/‖X‖²] − 2(p − 2) Eµ[X′(X − µ)/‖X‖²].   (20.37)

The first term we recognize from (20.35) to be p. Consider the third term, where

Eµ[X′(X − µ)/‖X‖²] = ∑_{i=1}^{p} Eµ[Xi(Xi − µi)/‖X‖²].   (20.38)

We take each term in the summation separately. The first one can be written

Eµ[X1(X1 − µ1)/‖X‖²] = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} (x1(x1 − µ1)/‖x‖²) φµ1(x1)dx1 φµ2(x2)dx2 ··· φµp(xp)dxp,   (20.39)


where φµi is the N(µi, 1) pdf,

φµi(xi) = (1/√(2π)) e^{−(xi − µi)²/2}.   (20.40)

Exercise 20.9.9 looks at the innermost integral, showing that

∫_{−∞}^{∞} (x1(x1 − µ1)/‖x‖²) φµ1(x1)dx1 = ∫_{−∞}^{∞} (1/‖x‖² − 2x1²/‖x‖⁴) φµ1(x1)dx1.   (20.41)

Replacing the innermost integral in (20.39) with (20.41) yields

Eµ[X1(X1 − µ1)/‖X‖²] = Eµ[1/‖X‖² − 2X1²/‖X‖⁴].   (20.42)

The same calculation works for i = 2, . . . , p, so that from (20.38),

Eµ[X′(X − µ)/‖X‖²] = ∑_{i=1}^{p} Eµ[1/‖X‖² − 2Xi²/‖X‖⁴]
                   = Eµ[p/‖X‖²] − Eµ[2 ∑ Xi²/‖X‖⁴]
                   = Eµ[(p − 2)/‖X‖²].   (20.43)

Exercise 20.9.10 verifies that from (20.37),

R(µ ; δJS) = p − Eµ[(p − 2)²/‖X‖²].   (20.44)

That’s it! The expected value at the end is positive, so that the risk is less than p. Thatis

R(µ ; δJS) < p = R(µ ; δ) for all µ ∈ Rp, (20.45)

meaning δ(x) = x is inadmissible.How much does the James-Stein estimator dominate δ? It shrinks towards zero, so

if the true mean is zero, one would expect the James-Stein estimator to be quite good.In fact, Exercise 20.9.10 shows that R(0; δJS) = 2. Especially when p is large, this riskis much less than that of δ, which is always p. Even for p = 3, the James-Stein risk is2/3 of δ’s. The farther from 0 the µ is, the less advantage the James-Stein estimatorhas. As ‖µ‖ → ∞, with ‖X‖ ∼ χ2

p(‖µ‖2), the Eµ[1/‖X‖2]→ 0, so

lim‖µ‖→∞

R(µ ; δJS) −→ p = R(µ ; δ). (20.46)

If rather than having good risk at zero, one has a "prior" idea that the mean is near some fixed µ0, one can instead shrink towards that vector:

δ∗JS(x) = (1 − (p − 2)/‖x − µ0‖²)(x − µ0) + µ0.   (20.47)


This estimator has the same risk as the regular James-Stein estimator, but with shifted parameter:

R(µ ; δ∗JS) = p − Eµ[(p − 2)²/‖X − µ0‖²] = p − Eµ−µ0[(p − 2)²/‖X‖²],   (20.48)

and has risk of 2 when µ = µ0.

The James-Stein estimator itself is not admissible. There are many other similar estimators in the literature, some that dominate δJS but are not admissible (such as the "positive part" estimator that does not allow the shrinking factor to be negative), and many admissible estimators that dominate δ0. See, e.g., Strawderman and Cohen (1971) and Brown (1971) for overviews.

20.6 Minimax procedures

Using a Bayes procedure involves choosing a prior π. When using an admissible estimator, one is implicitly choosing a Bayes, or close to a Bayes, procedure. One attempt at making the choice more objective is, for each procedure, to see what its worst risk is. Then you choose the procedure that has the best worst risk, i.e., the minimax procedure. Next is the formal definition.

Definition 20.4. Let D be a set of decision procedures. A δ ∈ D is minimax among procedures in D if for any other δ′ ∈ D,

sup_{θ∈T} R(θ ; δ) ≤ sup_{θ∈T} R(θ ; δ′).   (20.49)

For the binomial example with n = 2 in Section 20.1, Figure 20.1 graphs the risk functions of five estimators. Their maximum risks are given in (20.2), repeated here:

                 δ1      δ2      δ3      δ4      δ5
Maximum risk   0.1250  0.1389  0.1600  0.0429  0.2500     (20.50)

Of these, δ4 (the Bayes procedure wrt Beta(1/√2, 1/√2)) has the lowest maximum, hence is minimax among these five procedures.

Again looking at Figure 20.1, note that the minimax procedure has the flattest risk function. In fact, it is also maximin in that it has the worst best risk. It looks as if when trying to limit bad risk everywhere, you give up very good risk somewhere. This idea leads to one method for finding a minimax procedure: a Bayes procedure with flat risk is minimax. The next lemma records this result and some related ones.

Lemma 20.5. Suppose δ0 has a finite and constant risk,

R(θ ; δ0) = c < ∞ for all θ ∈ T . (20.51)

Then δ0 is minimax if any of the following conditions hold: (a) δ0 is Bayes wrt a proper prior π. (b) δ0 is admissible. (c) There exists a sequence of Bayes procedures δn wrt priors πn such that their Bayes risks approach c, i.e.,

R(πn ; δn) −→ c. (20.52)


Proof. (a) Suppose δ0 is not minimax, so that there is a δ′ such that

sup_{θ∈T} R(θ ; δ′) < sup_{θ∈T} R(θ ; δ0) = c.   (20.53)

But then

Eπ[R(Θ ; δ′)] < c = Eπ[R(Θ ; δ0)],   (20.54)

meaning that δ0 is not Bayes wrt π. Hence we have a contradiction, so that δ0 is minimax.

Exercise 20.9.14 verifies parts (b) and (c).

Continuing with the binomial, let X ∼ Binomial(n, θ), θ ∈ (0, 1), so that the Bayes estimator using the Beta(α, β) prior is δα,β(x) = (α + x)/(α + β + n). See (11.43) and (11.44). Exercise 20.9.15 shows that the mean square error is

R(θ ; δα,β) = nθ(1 − θ)/(n + α + β)² + (((α + β)θ − α)/(n + α + β))².   (20.55)

If we can find an (α, β) so that this risk is constant, then the corresponding estimator is minimax. As in the exercise, the risk is constant if α = β = √n/2, hence (x + √n/2)/(n + √n) is minimax.

Note that based on the results from Section 20.5, the usual estimator δ(x) = x for estimating µ based on X ∼ N(µ, Ip) is minimax for p = 1 or 2, since it is admissible in those cases. It is also minimax for p ≥ 3. See Exercise 20.9.19.
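As a numerical check of the binomial claim (a sketch, not part of the text), the code below evaluates the risk (20.55) over a grid of θ with α = β = √n/2 for n = 10 and confirms it is constant at 1/(4(√n + 1)²), so the estimator is minimax by Lemma 20.5(a).

```python
import numpy as np

def risk(theta, n, a, b):
    # Mean square error (20.55) of the Beta(a, b) Bayes estimator (a + x)/(a + b + n).
    return (n * theta * (1 - theta) + ((a + b) * theta - a) ** 2) / (n + a + b) ** 2

n = 10
a = b = np.sqrt(n) / 2
thetas = np.linspace(0.001, 0.999, 999)
r = risk(thetas, n, a, b)
print(round(r.min(), 6), round(r.max(), 6))          # identical: the risk is flat in theta
print(round(1 / (4 * (np.sqrt(n) + 1) ** 2), 6))     # 1/(4(sqrt(n) + 1)^2), the same value
```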

20.7 Game theory and randomized procedures

We take a brief look at simple two-person zero-sum games. The two players we will call "the house" and "you." Each has a set of possible actions to take: you can choose from the set A, and the house can choose from the set T. Each player chooses an action without knowledge of the other's choice. There is a loss function, L(a, θ) (as in (20.4), but negative losses are allowed), where if you choose a and the house chooses θ, you lose L(a, θ) and the house wins L(a, θ). ("Zero-sum" refers to the fact that whatever you lose, the house gains, and vice versa.) Your aim is to minimize L, while the house wants to maximize L.

Consider the game with A = {a1, a2} and T = {θ1, θ2}, and loss function

                  You
  House ↓        a1    a2
    θ1            2     0
    θ2            0     1          (20.56)

If you play this game once, deciding which action to take involves trying to psych out your opponent. E.g., you might think that a2 is your best choice, since at worst you lose only 1. But then you realize the house may be thinking that's what you are thinking, so you figure the house will pick θ2 so you will lose. Which leads you to choose a1. But then you wonder if the house is thinking two steps ahead as well. And so on.

To avoid such circuitous thinking, the mathematical analysis of such games presumes the game is played repeatedly, and each player can see what the other's overall


strategy is. Thus if you always play a2, the house will catch on and always play θ2, and you would always lose 1. Similarly, if you always play a1, the house would always play θ1, and you'd lose 2. An alternative is to not take the same action each time, nor to have any regular repeated pattern, but to randomly choose an action each time. The house does the same.

Let pi = P[You choose ai] and πi = P[House chooses θi]. Then if both players use these probabilities each time, independently, your long-run average loss would be

R(π ; p) = ∑_{i=1}^{2} ∑_{j=1}^{2} pi πj L(ai, θj) = 2π1p1 + π2p2.   (20.57)

If the house knows your p, which it would after playing the game enough, it can adjust its π, to π1 = 1 if 2p1 > p2 and π2 = 1 if 2p1 < p2, yielding the average loss of max{2p1, p2}. You realize that the house will take that strategy, so choose p to minimize that maximum, i.e., take 2p1 = p2, so p = (1/3, 2/3). Then no matter what the house does, R(π ; p) = 2/3, which is better than 1 or 2. Similarly, if you know the house's π, you can choose the p to minimize the expected loss, hence the house will choose π to maximize that minimum. We end up with π = p = (1/3, 2/3).
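A brute-force check of the p = (1/3, 2/3) solution (a sketch, not part of the text): for each mixed strategy p = (p1, 1 − p1) on a grid, compute the worst-case expected loss max{2p1, p2} from (20.57) and pick the p1 minimizing it.

```python
import numpy as np

loss = np.array([[2.0, 0.0],    # rows: house's theta1, theta2; columns: your a1, a2, as in (20.56)
                 [0.0, 1.0]])

p1_grid = np.linspace(0.0, 1.0, 10001)
worst = []
for p1 in p1_grid:
    p = np.array([p1, 1.0 - p1])
    expected = loss @ p          # expected loss against each pure house strategy: (2*p1, p2)
    worst.append(expected.max()) # the house picks whichever is worse for you
worst = np.array(worst)
print(p1_grid[np.argmin(worst)], worst.min())   # approximately 1/3 and 2/3: minimax p1 and value
```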

The fundamental theorem analyzing such games (two-person, zero-sum, finite A and T) by John von Neumann (von Neumann and Morgenstern, 1944) states that there always exists a minimax strategy p0 for you, and a maximin strategy π0 for the house, such that

V ≡ R(π0 ; p0) = min_p max_π R(π ; p) = max_π min_p R(π ; p).   (20.58)

This V is called the value of the game, and the distribution π0 is called the least favorable distribution. If either player deviates from their optimum strategy, the other player will benefit, hence the theorem guarantees that the game with rational players will always have you losing V on average.

The statistical decision theory we have seen so far in this chapter is based on game theory, but has notable differences. The first is that in statistics, we have data that give us some information about which θ the "house" has chosen. Also, the action spaces are often infinite. Either of these modifications easily fits into the vast amount of research done in game theory since 1944.

The most important difference is the lack of a house trying to subvert you, the statistician. You may be cautious or pessimistic, and wish to minimize your maximum expected loss, but it is perfectly rational to use non-minimax procedures. Another difference is that, for us so far, actions have not been randomized. Once we have the data, δ(x) gives us the estimate of θ, say. We don't randomize to decide between several possible estimates. In fact, a client would be quite upset if after setting up a carefully designed experiment, the statistician flipped a coin to decide whether to accept or reject the null hypothesis. But theoretically, randomized procedures have some utility in statistics, which we will see in Chapter 21 on hypothesis testing. It is possible for a non-randomized test to be dominated by a randomized test, especially in discrete models where the actual size of a nonrandomized test is lower than the desired level.

A more general formulation of statistical decision theory does allow randomization. A decision procedure is defined to be a function from the sample space X to the space of probability measures on the action space A. For our current purposes, we can instead explicitly incorporate randomization into the function. The idea is


that in addition to the data x, we can make a decision based on spinning a spinner as often as we want. Formally, suppose U = {Uk | k = 1, 2, . . .} is an infinite sequence of independent Uniform(0, 1)'s, all independent of X. Then a randomized procedure is a function of X and possibly a finite number of the Uk's:

δ(x, u1, . . . , uK) ∈ A, (20.59)

for some K ≥ 0.

In estimation with squared-error loss, the Rao-Blackwell theorem (Theorem 13.8 on page 210) shows that any such estimator that non-trivially depends on u is inadmissible. Suppose δ(x, u1, . . . , uK) has finite mean square error, and let

δ∗(x) = E[δ(X, U1, . . . , UK) | X = x] = E[δ(x, U1, . . . , UK)],   (20.60)

where the expectation is over U. Then δ∗ has the same bias as and lower variance than δ, hence lower MSE. This result holds whenever L(a, θ) is strictly convex in a for each θ. If it is convex but not strictly so, then at least δ∗ is no worse than δ.
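As a toy illustration of (20.60) (a sketch, not part of the text): with X ∼ N(µ, 1), take the artificially randomized estimator δ(x, u) = x + (u − 1/2), and compare its Monte Carlo MSE with that of δ∗(x) = E[δ(x, U)] = x. Averaging out the spinner removes the extra Var(U − 1/2) = 1/12.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, reps = 3.0, 1_000_000
x = rng.normal(mu, 1.0, size=reps)
u = rng.uniform(0.0, 1.0, size=reps)

randomized = x + (u - 0.5)        # delta(x, u): depends non-trivially on the spinner u
rao_blackwellized = x             # delta*(x) = E[delta(x, U) | X = x]

print(np.mean((randomized - mu) ** 2))          # approximately 1 + 1/12
print(np.mean((rao_blackwellized - mu) ** 2))   # approximately 1
```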

20.8 Minimaxity and admissibility when T is finite

Here we present a statistical analog of the minimax theorem for game theory in (20.58), which assumes a finite parameter space, and from that show that all admissible procedures are Bayes. Let D be the set of procedures under consideration.

Theorem 20.6. Suppose the parameter space is finite

T = {θ1, . . . , θK}.   (20.61)

Define the risk set to be the set of all achievable vectors of risks for the procedures:

R = {(R(θ1 ; δ), . . . , R(θK ; δ)) | δ ∈ D}.   (20.62)

Suppose the risk set R is closed, convex, and bounded from below. Then there exists a minimax procedure δ0 and a least favorable distribution (prior) π0 on T such that δ0 is Bayes wrt π0.

The assumption of a finite parameter space is too restrictive to have much practical use in typical statistical models, but the theorem does serve as a basis for more general situations, as for hypothesis testing in Section 22.3. The convexity of the risk set often needs the use of randomized procedures. For example, if D is closed under randomization (see Exercise 20.9.25), then the risk set is convex. Most loss functions we use in statistics are nonnegative, so that the risks are automatically bounded from below. The closedness of the risk set depends on how limits of procedures behave. Again, for testing see Section 22.3.

We will sketch the proof, which is a brief summary of the thorough but accessible proof found in Ferguson (1967) for his Theorem 2.9.1. A set C is convex if every line segment connecting two points in C is in C. That is,

b, c ∈ C =⇒ αb + (1− α)c ∈ C for all 0 ≤ α ≤ 1. (20.63)

It is closed if any limit of points in C is also in C:

cn ∈ C for n = 1, 2, . . . and cn −→ c =⇒ c ∈ C. (20.64)


Figure 20.3: This plot illustrates the separating hyperplane theorem, Theorem 20.7. Two convex sets A and B have empty intersection, so there is a hyperplane, in this case a line, that separates them.

Bounded from below means there is a finite κ such that c = (c1, . . . , cK) ∈ C implies that ci ≥ κ for all i.

For real number s, consider the points in the risk set whose maximum risk is no larger than s. We can express this set as an intersection of R with the set Ls defined below:

{r ∈ R | ri ≤ s, i = 1, . . . , K} = Ls ∩ R, where Ls = {x ∈ RK | xi ≤ s, i = 1, . . . , K}.   (20.65)

We want to find the minimax s, i.e., the smallest s obtainable:

s0 = inf{s | Ls ∩ R ≠ ∅}.   (20.66)

It does exist, since R being bounded from below implies that the set {s | Ls ∩ R ≠ ∅} is bounded from below. Also, there exists an r0 ∈ R with s0 = max{r0i, i = 1, . . . , K} because we have assumed R is closed. Let δ0 be a procedure that achieves this risk, so that it is minimax:

max_{i=1,...,K} R(θi ; δ0) = s0 ≤ max_{i=1,...,K} R(θi ; δ), δ ∈ D.   (20.67)

Next we argue that this procedure is Bayes. Let int(Ls0) be the interior of Ls0: {x ∈ RK | xi < s0, i = 1, . . . , K}. It can be shown that int(Ls0) is convex, and

int(Ls0) ∩ R = ∅   (20.68)

by definition of s0. Now we need to bring out a famous result, the separating hyperplane theorem:

Theorem 20.7. Suppose A and B are two nonempty convex sets in RK such that A ∩ B = ∅. Then there exists a nonzero vector γ ∈ RK such that

γ · x ≤ γ · y for all x ∈ A and y ∈ B. (20.69)

See Figure 20.3 for an illustration, and Exercises 20.9.28 through 20.9.30 for a proof. The idea is that if two convex sets do not intersect, then there is a hyperplane separating them. In the theorem, such a hyperplane is the set {x | γ · x = a}, where a is


any constant satisfying

aL ≡ sup{γ · x | x ∈ A} ≤ a ≤ inf{γ · y | y ∈ B} ≡ aU.   (20.70)

Neither γ nor a is necessarily unique.

Apply the theorem with A = int(Ls0) and B = R. In this case, the elements of γ must be nonnegative: Suppose γj < 0, and take x ∈ int(Ls0). Note that we can let xj → −∞, and x will still be in int(Ls0). Thus γ · x → +∞, which contradicts the bound aL in (20.70). So γj ≥ 0 for all j. Since γ ≠ 0, we can define π0 = γ/∑ γi and have (20.69) hold for π0 in place of γ. Note that π0 is a legitimate pmf on θ with P[θ = θi] = π0i.

By the definition in (20.65), the points (s, s, . . . , s) ∈ int(Ls0) for all s < s0. Now ∑ π0i s = s, hence that sum can get arbitrarily close to s0, meaning aL = s0 in (20.70). Translating back,

s0 ≤ π0 · r for all r ∈ R =⇒ s0 ≤ ∑ π0i R(θi ; δ) (= R(π0 ; δ)) for all δ ∈ D,   (20.71)

hence from (20.67),

max_{i=1,...,K} R(θi ; δ0) ≤ R(π0 ; δ) for all δ ∈ D,   (20.72)

which implies that

R(π0 ; δ0) ≤ R(π0 ; δ) for all δ ∈ D.   (20.73)

That is, δ0 is Bayes wrt π0. To complete the proof of Theorem 20.6, we need only that π0 is the least favorable distribution, which is shown in Exercise 20.9.26.

The next result shows that under the same conditions as Theorem 20.6, any admissible procedure is Bayes.

Theorem 20.8. Suppose the parameter space is finite

T = {θ1, . . . , θK},   (20.74)

and define the risk set R as in (20.62). Suppose the risk set R is closed, convex, and bounded from below. If δ0 is admissible, then it is Bayes wrt some prior π0.

Proof. Assume δ0 is admissible. Consider the same setup, but with risk function

R∗(θ ; δ) = R(θ ; δ)− R(θ ; δ0). (20.75)

Then the new risk set, R∗ = {R∗(θ ; δ) | δ ∈ D}, is also closed, convex, and bounded from below. (See Exercise 20.9.27.) Since R∗(θ ; δ0) = 0 for all θ ∈ T, the maximum risk of δ0 is zero. Suppose δ is another procedure with smaller maximum risk:

max_{θ∈T} R∗(θ ; δ) < 0.   (20.76)

But then we would have

R(θ ; δ) < R(θ ; δ0) for all θ ∈ T , (20.77)

which contradicts the assumption that δ0 is admissible. Thus (20.76) cannot hold, which means that δ0 is minimax for R∗. Then by Theorem 20.6, there exists a π0 wrt which δ0 is Bayes under risk R∗, so that R∗(π0 ; δ0) ≤ R∗(π0 ; δ) for any δ ∈ D. But since R∗(π0 ; δ0) = 0, we have R(π0 ; δ0) ≤ R(π0 ; δ) for all δ ∈ D, hence δ0 is Bayes wrt π0 under the original risk R.


20.9 Exercises

Exercise 20.9.1. (Lemma 20.3(b).) Suppose that δπ is a Bayes procedure wrt π, and if δ′π is also Bayes wrt π, then R(θ ; δπ) = R(θ ; δ′π) for all θ ∈ T. Argue that δπ is admissible.

Exercise 20.9.2. (Lemma 20.3(c).) Suppose that the parameter space is finite or countable, T = {θ1, . . . , θK} (K possibly infinite), and π is a prior on T such that πk = P[Θ = θk] > 0 for k = 1, . . . , K. Show that δπ, the Bayes procedure wrt π, is admissible.

Exercise 20.9.3. Consider estimating θ with squared-error loss based on X ∼ Discrete Uniform(0, θ), where T = {0, 1, 2}. Let π be the prior with π(0) = π(1) = 1/2 (so that it places no probability on θ = 2). (a) Show that any Bayes estimator wrt π satisfies δc(0) = 1/3, δc(1) = 1, and δc(2) = c for any c. (b) Find the risk function for δc, and show that its Bayes risk R(π ; δc) = 1/6. (c) Let D be the set of estimators δc for 0 ≤ c ≤ 2. For which c is δc the only estimator admissible among those in D? Is it admissible among all estimators?

Exercise 20.9.4. Here we have X with density f(x | θ), θ ∈ (b, c), and wish to estimate θ with a weighted squared-error loss:

L(a, θ) = g(θ)(a − θ)²,   (20.78)

where g(θ) ≥ 0 is the weight function. (a) Show that if the prior π has pdf π(θ) and the integrals below exist, then the Bayes estimator δπ is given by

δπ(x) = ∫_b^c θ f(x | θ)g(θ)π(θ)dθ / ∫_b^c f(x | θ)g(θ)π(θ)dθ.   (20.79)

(b) Suppose g(θ) > 0 for all θ ∈ T. Show that δ is admissible for squared-error loss if and only if it is admissible for the weighted loss in (20.78).

Exercise 20.9.5. Suppose X ∼ N(µ, 1), µ ∈ R, and we wish to estimate µ using squared-error loss. Let the prior πn on µ be N(0, n). (a) Show that the risk, hence Bayes risk, of δ0(x) = x is constant at 1. (b) The Bayes estimator wrt πn is given in (20.22) to be δn(x) = nx/(n + 1). Show that R(µ ; δn) = (µ² + n)/(n + 1)², and R(πn ; δn) = n/(n + 1).

Exercise 20.9.6. This exercise shows that an admissible estimator can sometimes be surprising. Let X1 and X2 be independent, Xi ∼ N(µi, σi²), i = 1, 2, with unrestricted parameter space θ = (µ1, µ2, σ1², σ2²) ∈ R² × (0, ∞)². We are interested in estimating just µ1 under squared-error loss, so that L(a, θ) = (a − µ1)². Let D be the set of linear estimators δa,b,c(x) = ax1 + bx2 + c for constants a, b, c ∈ R. Let δ1(x) = x1 and δ2(x) = x2. (a) Show that the risk of δa,b,c is ((a − 1)µ1 + bµ2 + c)² + a²σ1² + b²σ2². (b) Find the risks of δ1 and δ2. Show that neither one of these two estimators dominates the other. (c) Show that δ1 is admissible among the estimators in D. Is this result surprising? [Hint: Suppose R(θ ; δa,b,c) ≤ R(θ ; δ1) for all θ ∈ T. Let µ2 → ∞ to show b must be 0; let µ1 → ∞ to show that a must be 1, then argue further that c = 0. Thus δa,b,c must be δ1.] (d) Show that δ2 is admissible among the estimators in D, even though the distribution of X2 does not depend on µ1. [Hint: Proceed as in the hint to part (c), but let σ1² → ∞, then µ1 = µ2 = µ → ∞.]


Exercise 20.9.7. Continue with the setup in Exercise 20.9.6, but without the restriction to linear estimators. Again let δ2(x) = x2. The goal here is to show δ2 is a limit of Bayes estimators. (a) For fixed σ0² > 0, let πσ0² be the prior on θ where µ1 = µ2 = µ, σ1² = σ0², σ2² = 1, and µ ∼ N(0, σ0²). Show that the Bayes estimator wrt πσ0² is

δσ0²(x1, x2) = (x1/σ0² + x2)/(2/σ0² + 1).   (20.80)

(b) Find the risk of δσ0² as a function of θ. (c) Find the Bayes risk of δσ0² wrt πσ0². (d) What is the limit of δσ0² as σ0² → ∞?

Exercise 20.9.8. Let X1, . . . , Xn be iid N(µ, 1), with µ ∈ R. The analog to the regularized least squares in (12.43) for this simple situation defines the estimator δκ(x) to be the value of m that minimizes

objκ(m ; x1, . . . , xn) = ∑_{i=1}^{n} (xi − m)² + κm²,   (20.81)

where κ ≥ 0 is some fixed constant. (a) What is δκ(x)? (b) For which value of κ is δκ the MLE? (c) For κ > 0, δκ is the Bayes posterior mean using the N(µ0, σ0²) prior for which µ0 and σ0²? (d) For which κ ≥ 0 is δκ admissible among all estimators?

Exercise 20.9.9. Let x be p × 1, and φµ1(x1) be the N(µ1, 1) pdf. Fixing x2, . . . , xp, show that (20.41) holds, i.e.,

∫_{−∞}^{∞} (x1(x1 − µ1)/‖x‖²) φµ1(x1)dx1 = ∫_{−∞}^{∞} (1/‖x‖² − 2x1²/‖x‖⁴) φµ1(x1)dx1.   (20.82)

[Hint: Use integration by parts, where u = x1/‖x‖² and dv = (x1 − µ1)φµ1(x1)dx1.]

Exercise 20.9.10. (a) Use (20.37) and (20.43) to show that the risk of the James-Stein estimator is

R(µ ; δJS) = p − Eµ[(p − 2)²/‖X‖²],   (20.83)

as in (20.44). (b) Show that R(0 ; δJS) = 2. [Hint: When µ = 0, ‖X‖² ∼ χ²_p. What is E[1/χ²_p]?]

Exercise 20.9.11. In Exercise 11.7.17 we found that the usual estimator of the binomial parameter θ is a Bayes estimator wrt the improper prior 1/(θ(1 − θ)), at least when x ≠ 0 or n. Here we look at a truncated version of the binomial, where the usual estimator is proper Bayes. The truncated binomial is given by the usual binomial conditioned to be between 1 and n − 1. That is, take the pmf of X to be

f∗(x | θ) = f(x | θ)/α(θ), x = 1, . . . , n − 1,   (20.84)

for some α(θ), where f(x | θ) is the usual Binomial(n, θ) pmf. (Assume n ≥ 2.) The goal is to estimate θ ∈ (0, 1) using squared-error loss. For estimator δ, the risk is denoted

R∗(θ ; δ) = ∑_{x=1}^{n−1} (δ(x) − θ)² f∗(x | θ).   (20.85)


(a) Find α(θ). (b) Let π∗(θ) = cα(θ)/(θ(1 − θ)). Find the constant c so that π∗ is a proper pdf on θ ∈ (0, 1). [Hint: Note that α(θ) is θ(1 − θ) times a polynomial in θ.] (c) Show that the Bayes estimator wrt π∗ for the risk R∗ is δ0(x) = x/n. Argue that therefore δ0 is admissible for the truncated binomial, so that for estimator δ′,

R∗(θ ; δ′) ≤ R∗(θ ; δ0) for all θ ∈ (0, 1) ⇒ R∗(θ ; δ′) = R∗(θ ; δ0) for all θ ∈ (0, 1).   (20.86)

Exercise 20.9.12. This exercise proves the admissibility of δ0(x) = x/n for the usual binomial using two stages. Here we have X ∼ Binomial(n, θ), θ ∈ (0, 1), and estimate θ with squared-error loss. Suppose δ′ satisfies

R(θ ; δ′) ≤ R(θ ; δ0) for all θ ∈ (0, 1). (20.87)

(a) Show that for any estimator,

lim_{θ→0} R(θ ; δ) = δ(0)² and lim_{θ→1} R(θ ; δ) = (1 − δ(n))²,   (20.88)

hence (20.87) implies that δ′(0) = 0 and δ′(n) = 1. Thus δ0 and δ′ agree at x = 0 and n. [Hint: What are the limits in (20.88) for δ0?] (b) Show that for any estimator δ,

R(θ ; δ) = (δ(0) − θ)²(1 − θ)^n + α(θ)R∗(θ ; δ) + (δ(n) − θ)²θ^n,   (20.89)

where R∗ and α are given in Exercise 20.9.11. (c) Use the conclusion in part (a) to show that (20.87) implies

R∗(θ ; δ′) ≤ R∗(θ ; δ0) for all θ ∈ (0, 1). (20.90)

(d) Use (20.86) to show that (20.90) implies R(θ ; δ′) = R(θ ; δ0) for all θ ∈ (0, 1), hence δ0 is admissible in the regular binomial case. (See Johnson (1971) for this two-stage idea in the binomial, and Brown (1981) for a generalization to problems with finite sample space.)

Exercise 20.9.13. Suppose X ∼ Poisson(2θ), Y ∼ Poisson(2(1 − θ)), where X and Y are independent, and θ ∈ (0, 1). The MLE of θ is x/(x + y) if x + y > 0, but not unique if x + y = 0. For fixed c, define the estimator δc by

δc(x, y) = c if x + y = 0, and δc(x, y) = x/(x + y) if x + y > 0.   (20.91)

This question looks at the decision-theoretic properties of these estimators under squared-error loss. (a) Let T = X + Y. What is the distribution of T? Note that it does not depend on θ. (b) Find the conditional distribution of X | T = t. (c) Find Eθ[δc(X, Y)] and Varθ[δc(X, Y)], and show that

MSEθ[δc(X, Y)] = θ(1 − θ)r + (c − θ)²p, where r = ∑_{t=1}^{∞} (1/t) fT(t) and p = fT(0),   (20.92)

where fT is the pmf of T. [Hint: First find the conditional mean and variance of δc given T = t.] (d) For which value(s) of c is δc unbiased, if any? (e) Sketch the MSEs for δc when c = 0, .5, .75, 1, and 2. Among these five estimators, which are admissible and which inadmissible? Which is minimax? (f) Now consider the set


of estimators δc for c ∈ R. Show that δc is admissible among these if and only if 0 ≤ c ≤ 1. (g) Which δc, if any, is minimax among the set of all δc's? What is its maximum risk? [Hint: You can restrict attention to 0 ≤ c ≤ 1. First show that the maximum risk of δc over θ ∈ (0, 1) is (r − 2cp)²/(4(r − p)) + c²p, then find the c for which the maximum is minimized.] (h) Show that δ0(x, y) = (x − y)/4 + 1/2 is unbiased, and find its MSE. Is it admissible among all estimators? [Hint: Compare it to those in part (e).] (i) The Bayes estimator with respect to the prior Θ ∼ Beta(α, β) is δα,β(x, y) = (x + α)/(x + y + α + β). (See Exercise 13.8.2.) None of the δc's equals a δα,β. However, some δc's are limits of δα,β's for some sequences of (α, β)'s. For which c's can one find such a sequence? (Be sure that the α's and β's are positive.)

Exercise 20.9.14. (Lemma 20.5(b) and (c).) Suppose δ has constant risk, R(θ ; δ) = c. (a) Show that if δ is admissible, it is minimax. (b) Suppose there exists a sequence of Bayes procedures δn wrt πn such that R(πn ; δn) → c as n → ∞. Show that δ is minimax. [Hint: Suppose δ′ has better maximum risk than δ, R(θ ; δ′) < c for all θ ∈ T. Show that for large enough n, R(θ ; δ′) < R(πn ; δn), which can be used to show that δ′ has better Bayes risk than δn wrt πn.]

Exercise 20.9.15. Let X ∼ Binomial(n, θ), θ ∈ (0, 1). The Bayes estimator using the Beta(α, β) prior is δα,β(x) = (α + x)/(α + β + n). (a) Show that the risk R(θ ; δα,β) is as in (20.55). (b) Show that if α = β = √n/2, the risk has constant value 1/(4(√n + 1)²).

Exercise 20.9.16. Consider a location-family model, where the object is to estimate the location parameter θ with squared-error loss. Suppose the Pitman estimator has finite variance. (a) Is the Pitman estimator admissible among shift-equivariant estimators? (b) Is the Pitman estimator minimax among shift-equivariant estimators? (c) Is the Pitman estimator Bayes among shift-equivariant estimators?

Exercise 20.9.17. Consider the normal linear model as in (12.9), where Y ∼ N(xβ, σ²In), Y is n × 1, x is a fixed known n × p matrix, β is the p × 1 vector of coefficients, and σ² > 0. Assume that x′x is invertible. The objective is to estimate β using squared-error loss,

L(a, (β, σ²)) = ∑_{j=1}^{p} (aj − βj)² = ‖a − β‖².   (20.93)

Argue that the ridge regression estimator of β in (12.45), βκ = (x′x + κIp)^{−1}x′Y, is admissible for κ > 0. (You can assume the risk function is continuous in β.) [Hint: Choose the appropriate β0 and K0 in (12.37).]

Exercise 20.9.18. Let X ∼ Poisson(λ), λ > 0. We wish to estimate g(λ) = exp(−2λ) with squared-error loss. Recall from Exercise 19.8.7 that the UMVUE is δU(x) = (−1)^x. (a) Find the variance and risk of δU. (They are the same.) (b) For prior density π(λ) on λ, write down the expression for the Bayes estimator of g(λ). Is it possible to find a prior so that the Bayes estimate equals δU? Why or why not? (c) Find an estimator δ∗ that dominates δU. [Hint: Which value of δU is way outside of the range of g(λ)? What other value are you sure must be closer to g(λ) than that one?] (d) Is the estimator δ∗ in part (c) unbiased? Is δU admissible?

Exercise 20.9.19. Let X ∼ N(µ, Ip) with parameter space µ ∈ Rp, and consider estimating µ using squared-error loss. Then δ0(x) = x has the constant risk of p. (a) Find the


Bayes estimator for prior πn being N(0p, nIp). (b) Show that the Bayes risk of πn is pn/(n + 1). (c) Show that δ0 is minimax. [Hint: What is the limit of the Bayes risk in part (b) as n → ∞?]

Exercise 20.9.20. Suppose X1, . . . , Xp are independent, Xi | µi ∼ N(µi, 1), so that X | µ ∼ Np(µ, Ip). The parameter space for µ is Rp. Consider the prior on µ where the µi are iid N(0, V), so that µ ∼ Np(0p, VIp). The goal is to estimate µ using squared-error loss as in (20.34), L(a, µ) = ‖a − µ‖². (a) For known prior variance V ∈ (0, ∞), show that the Bayes estimator is

δV(x) = (1 − cV)x, where cV = 1/(V + 1).   (20.94)

(b) Now suppose that V is not known, and you wish to estimate cV based on the marginal distribution of X. The marginal distribution (i.e., not conditional on the µ) is X ∼ N(0p, dVIp) for what dV? (c) Using the marginal distribution of X, find the ap and bV so that

EV[1/‖X‖²] = (1/ap)(1/bV).   (20.95)

(d) From part (c), find an unbiased estimator of cV: ĉV = fp/‖X‖² for what fp? (e) Now put that estimator in for cV in δV. Is the result an estimator for µ? It is called an empirical Bayes estimator, because it is similar to a Bayes estimator, but uses the data to estimate the parameter V in the prior. What other name is there for this estimator?

Exercise 20.9.21. Let X ∼ N(µ, Ip), µ ∈ Rp. This problem will use a different risk function than in Exercise 20.9.20, one based on prediction. The data are X, but imagine predicting a new vector XNew that is independent of X but has the same distribution as X. This XNew is not observed, so it cannot be used in the estimator. An estimator δ(x) of µ can be thought of as a predictor of the new vector XNew. The loss is how far off the prediction is,

PredSS ≡ ‖XNew − δ(X)‖²,   (20.96)

which is unobservable, and the risk is the expected value over both the data and the new vector,

R(µ ; δ) = E[PredSS | µ] = E[‖XNew − δ(X)‖² | µ].   (20.97)

(a) Suppose δ(x) = x itself. What is R(µ ; δ)? (b) Suppose δ(x) = 0p. What is R(µ ; δ)? (c) For a subset A ⊂ {1, 2, . . . , p}, define the estimator δA(x) by setting δi(x) = xi for i ∈ A and δi(x) = 0 for i ∉ A. That is, the estimator starts with x, then sets the components with indices not in A to zero. For example, if p = 4, then

δ{1,4}(x) = (x1, 0, 0, x4)′ and δ{2}(x) = (0, x2, 0, 0)′.   (20.98)

In particular, δ∅(x) = 0p and δ{1,2,...,p}(x) = x. Let q = #A, that is, q is the number of µi's being estimated rather than being set to 0. For general p, find R(µ ; δA) as a function of p, q, and the µi's. (d) Let D be the set of estimators δA as in part (c). Which (if any) are admissible among those in D? Which (if any) are minimax among those in D?


Exercise 20.9.22. Continue with the setup in Exercise 20.9.21. One approach to deciding which estimator to use is to try to estimate the risk for each δA, then choose the estimator with the smallest estimated risk. A naive estimator of the PredSS just uses the observed x in place of the XNew, which gives the observed error:

ObsSS = ‖x − δ(x)‖².   (20.99)

(a) What is ObsSS for δA as in Exercise 20.9.21(c)? For which such estimator is ObsSS minimized? (b) Because we want to use ObsSS as an estimator of E[PredSS | µ], it would be helpful to know whether it is a good estimator. What is E[ObsSS | µ] for a given δA? Is ObsSS an unbiased estimator of E[PredSS | µ]? What is E[PredSS | µ] − E[ObsSS | µ]? (c) Find a constant CA (depending on the subset A) so that ObsSS + CA is an unbiased estimator of E[PredSS | µ]. (The quantity ObsSS + CA is a special case of Mallows' Cp statistic from Section 12.5.3.) (d) Let δ∗(x) be δA(x)(x), where A(x) is the subset that minimizes ObsSS + CA for given x. Give δ∗(x) explicitly as a function of x.

Exercise 20.9.23. The generic two-person zero-sum game has loss function given by the following table:

                  You
  House ↓        a1    a2
    θ1            a     c
    θ2            b     d          (20.100)

(a) If a ≥ c and b > d, then what should your strategy be? (b) Suppose a > c and d > b, so neither of your actions is always better than the other. Find your minimax strategy p0, the least favorable distribution π0, and show that the value of the game is V = (ad − bc)/(a − b − c + d).

Exercise 20.9.24. In the two-person game rock-paper-scissors, each player chooses one of the three options (rock, paper, or scissors). If they both choose the same option, then the game is a tie. Otherwise, rock beats scissors (by crushing them); scissors beats paper (by cutting it); and paper beats rock (by wrapping it). If you are playing the house, your loss is 1 if you lose, 0 if you tie, and −1 if you win. (a) Write out the loss function as a 3 × 3 table. (b) Show that your minimax strategy (and the least favorable distribution) is to choose each option with probability 1/3. (c) Find the value of the game.

Exercise 20.9.25. The set of randomized procedures D is closed under randomization if given any two procedures in D, the procedure that randomly chooses between the two is also in D. For this exercise, suppose the randomized procedures can be represented as in (20.59) by δ(X, U1, . . . , UK), where U1, U2, . . . are iid Uniform(0, 1) and independent of X. Suppose that if δ1(x, u1, . . . , uK) and δ2(x, u1, . . . , uL) are both in D, then for any α ∈ [0, 1], so is δ defined by

δ(x, u1, . . . , uM+1) = δ1(x, u1, . . . , uK) if uM+1 < α, and δ(x, u1, . . . , uM+1) = δ2(x, u1, . . . , uL) if uM+1 ≥ α,   (20.101)

where M = max{K, L}. (a) Show that if δ1 and δ2 both have finite risk at θ, then

R(θ ; δ) = αR(θ ; δ1) + (1 − α)R(θ ; δ2).   (20.102)

(b) Show that the risk set is convex.


Figure 20.4: This plot illustrates the result in Exercise 20.9.28. The x0 is the closest point in C to z. The solid line is the set {x | γ · x = γ · x0}, where γ = (z − x0)/‖z − x0‖.

Exercise 20.9.26. Consider the setup in Theorem 20.6. Let δ0 and π0 be as in (20.71) to (20.73), so that δ0 is minimax and Bayes wrt π0, with R(π0 ; δ0) = s0 = maxi R(θi ; δ0). (a) Show that R(π ; δ0) ≤ s0 for any prior π. (b) Argue that

inf_{δ∈D} R(π ; δ) ≤ inf_{δ∈D} R(π0 ; δ),   (20.103)

so that π0 is a least favorable prior.

Exercise 20.9.27. Suppose the set R ⊂ RK is closed, convex, and bounded from below. For constant vector a ∈ RK, set R∗ = {r − a | r ∈ R}. Show that R∗ is also closed, convex, and bounded from below.

The next three exercises prove the separating hyperplane theorem, Theorem 20.7. Exercise 20.9.28 proves the theorem when one of the sets contains a single point separated from the other set. Exercise 20.9.29 extends the proof to the case that the single point is on the border of the other set. Exercise 20.9.30 then completes the proof.

Exercise 20.9.28. Suppose C ⊂ RK is convex, and z ∉ closure(C). The goal is to show that there exists a vector γ with ‖γ‖ = 1 such that

γ · x < γ · z for all x ∈ C.   (20.104)

See Figure 20.4 for an illustration. Let s0 = inf{‖x − z‖ | x ∈ C}, the shortest distance from z to C. Then there exists a sequence xn ∈ C and point x0 such that xn → x0 and ‖x0 − z‖ = s0. [Extra credit: Prove that fact.] (Note: This x0 is unique, and called the projection of z onto C, analogous to the projection of y in (12.14) for linear regression.) (a) Show that x0 ≠ z, hence s0 > 0. [Hint: Note that x0 ∈ closure(C).] (b) Take any x ∈ C. Argue that for any α ∈ [0, 1], αx + (1 − α)xn ∈ C, hence ‖αx + (1 − α)xn − z‖² ≥ s0². Then by letting n → ∞, we have that ‖αx + (1 − α)x0 − z‖² ≥ ‖x0 − z‖². (c) Show that the last inequality in part (b) can be written α²‖x − x0‖² − 2α(z − x0) · (x − x0) ≥ 0. For α ∈ (0, 1), divide by α and let α → 0 to show that (z − x0) · (x − x0) ≤ 0. (d) Take γ = (z − x0)/‖z − x0‖, so that ‖γ‖ = 1. Part (c) shows that γ · (x − x0) ≤ 0 for x ∈ C. Show that γ · (z − x0) > 0. (e) Argue that therefore (20.104) holds.


Exercise 20.9.29. Now suppose C is convex and z /∈ C, but z ∈ closure(C). It isa nontrivial fact that the interior of a convex set is the same as the interior of itsclosure. (You don’t need to prove this. See Lemma 2.7.2 of Ferguson (1967).) Thusz /∈ interior(closure(C)), which means that there exists a sequence zn → z withzn /∈ closure(C). (Thus z is on the boundary of C.) (a) Show that for each n thereexists a vector γn, ‖γn‖ = 1, such that

γn · x < γn · zn for all x ∈ C. (20.105)

[Hint: Use Exercise 20.9.28.] (b) Since the γn’s exist in a compact space, there is asubsequence of them and a vector γ such that γni → γ. Show that by taking the limitalong this subsequence in (20.105), we have that

γ · x ≤ γ · z for all x ∈ C. (20.106)

The set {x | γ · x = c} where c = γ · z is called a supporting hyperplane of C through z.

Exercise 20.9.30. Let A and B be convex sets with A ∩ B = ∅. Define C to be their difference:

C = {x − y | x ∈ A and y ∈ B}. (20.107)

(a) Show that C is convex and 0 ∉ C. (b) Use (20.106) to show that there exists a γ, ‖γ‖ = 1, such that

γ · x ≤ γ · y for all x ∈ A and y ∈ B. (20.108)


Chapter 21

Optimal Hypothesis Tests

21.1 Randomized tests

Chapter 15 discusses hypothesis testing, where we choose between the null and alternative hypotheses,

H0 : θ ∈ T0 versus HA : θ ∈ TA, (21.1)

T0 and TA being disjoint subsets of the overall parameter space T. The goal is to make a good choice, so we desire a procedure that has small probability of rejecting the null when it is true (the size), and large probability of rejecting the null when it is false (the power). This approach to hypothesis testing usually fixes a level α, and considers tests whose size is less than or equal to α. Among the level α tests, the ones with good power are preferable. In certain cases a best level α test exists, in that it has the best power for any parameter value θ in the alternative space TA. More commonly, different tests are better for different values of the parameter, hence decision-theoretic concepts such as admissibility and minimaxity are relevant; these will be covered in Chapter 22.

In Chapter 15, hypothesis tests are defined using a test statistic and cutoff point, rejecting the null when the statistic is larger (or smaller) than the cutoff point. These are non-randomized tests, since once we have the data we know the outcome. Randomized tests, as in the randomized strategies for game theory presented in Section 20.7, are useful in the decision-theoretic analysis of testing. As we noted earlier, actual decisions in practice should not be randomized.

To understand the utility of randomized tests, let X ∼ Binomial(4, θ), and test H0 : θ = 1/2 versus HA : θ = 3/5 at level α = 0.35. The table below gives the pmfs under the null and alternative:

x    f1/2(x)   f3/5(x)
0    0.0625    0.0256
1    0.2500    0.1536
2    0.3750    0.3456
3    0.2500    0.3456
4    0.0625    0.1296
                          (21.2)

If we reject the null when X ≥ 3, the size is 0.3125, and if we reject when X ≥ 2, the size is 0.6875. Thus to keep the size from exceeding α, we use X ≥ 3. The power at θ = 3/5 is 0.4752. Consider a randomized test that rejects the null when X ≥ 3, and if X = 2, rejects the null with probability 0.1. Then the size is

P[X ≥ 3 | θ = 1/2] + 0.1 · P[X = 2 | θ = 1/2] = 0.3125 + 0.1 · 0.3750 = 0.35, (21.3)

hence its level is α. But it has larger power than the original test since it rejects more often:

P[X ≥ 3 | θ = 3/5] + 0.1 · P[X = 2 | θ = 3/5] = 0.50976 > 0.4752 = P[X ≥ 3 | θ = 3/5]. (21.4)

In order to accommodate randomized tests, rather than using test statistics and cutoff points, we define a testing procedure as a function

φ : X −→ [0, 1], (21.5)

where φ(x) is the probability of rejecting the null given X = x:

φ(x) = P[Reject |X = x]. (21.6)

For a nonrandomized test, φ(x) = I[T(x) > c] as in (15.6). The test in (21.3) and (21.4) is given by

φ′(x) = {1 if x ≥ 3; 0.1 if x = 2; 0 if x ≤ 1}. (21.7)

Now the level and power are easy to represent, since Eθ[φ(X)] = Pθ[φ rejects].
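As a quick numerical check (a sketch in R, which is assumed here since the text does not tie itself to any software), the size and power in (21.3) and (21.4) are just Eθ[φ′(X)] computed against the two pmfs in (21.2):

    # phi'(x) from (21.7): reject w.p. 1 if x >= 3, w.p. 0.1 if x = 2, else accept
    phi <- c(0, 0, 0.1, 1, 1)            # indexed by x = 0, 1, 2, 3, 4
    x   <- 0:4
    sum(phi * dbinom(x, 4, 1/2))         # size  = 0.35,    as in (21.3)
    sum(phi * dbinom(x, 4, 3/5))         # power = 0.50976, as in (21.4)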

21.2 Simple versus simple

We will start simple, where each hypothesis has exactly one distribution as in (15.25). We observe X with density f(x), and test

H0 : f = f0 versus HA : f = fA. (21.8)

(The parameter space consists of just two points, T = {0, A}.) In Section 15.3 we mentioned that the test based on the likelihood ratio,

LR(x) = fA(x)/f0(x), (21.9)

is optimal. Here we formalize this result.

Fix α ∈ [0, 1]. We wish to find a level α test that maximizes the power among level α tests. For example, suppose X ∼ Binomial(4, θ), and the hypotheses are H0 : θ = 1/2 versus HA : θ = 3/5, so that the pmfs are given in (21.2). With α = 0.35, the objective is to find a test φ that

Maximizes E3/5[φ(X)] subject to E1/2[φ(X)] ≤ α = 0.35, (21.10)

that is, maximizes the power subject to being of level α. What should we look for? First, an analogy.

Bang for the buck. Imagine you have some bookshelves you wish to fill up as cheaply as possible, e.g., to use as props in a play. You do not care about the quality of the books, just their widths (in inches) and prices (in dollars). You have $3.50, and five books to choose from:

Book   Cost    Width   Inches/Dollar
0      0.625   0.256   0.4096
1      2.50    1.536   0.6144
2      3.75    3.456   0.9216
3      2.50    3.456   1.3824
4      0.625   1.296   2.0736
                                      (21.11)

You are allowed to split the books lengthwise, and pay proportionately. You want to maximize the total number of inches for your $3.50. Then, for example, book 4 is a better deal than book 0, because they cost the same but book 4 is wider. Also, book 3 is more attractive than book 2, because they are the same width but book 3 is cheaper. Which is better between books 3 and 4? Book 4 is cheaper by the inch: it costs about 48¢ per inch, while book 3 is about 72¢ per inch. This suggests the strategy should be to buy the books that give you the most inches per dollar.

Let us definitely buy book 4, and book 3. That costs us $3.125, and gives us 1.296 + 3.456 = 4.752 inches. We still have 37.5¢ left, with which we can buy a tenth of book 2, giving us another 0.3456 inches, totaling 5.0976 inches.

Returning to the hypothesis testing problem, we can think of having α to spend, and we wish to spend where we get the most bang for the buck. Here, bang is power. The key is to look at the likelihood ratio of the densities:

LR(x) = f3/5(x)/f1/2(x), (21.12)

which turns out to be the same as the inches per dollar in table (21.11). (The cost is ten times the null pmf, and the width is ten times the alternative pmf.) If LR(x) is large, then the alternative is much more likely than the null is. If LR(x) is small, the null is more likely. One uses the likelihood ratio as the statistic, and finds the right cutoff point, randomizing at the cutoff point if necessary. The likelihood ratio test is then

φLR(x) = {1 if LR(x) > c; γ if LR(x) = c; 0 if LR(x) < c}. (21.13)

Looking at the table (21.11), we see that taking c = LR(2) works, because we reject outright if x = 3 or 4, and use up only 0.3125 of our α. The rest we put on x = 2, with γ = 0.1 since we have 0.35 − 0.3125 = 0.0375 left to spend on a point of null probability 0.375. Thus the test is

φLR(x) = {1 if LR(x) > 0.9216; 0.1 if LR(x) = 0.9216; 0 if LR(x) < 0.9216} = {1 if x ≥ 3; 0.1 if x = 2; 0 if x ≤ 1}, (21.14)

which is φ′ in (21.7). The last expression is easier to deal with, and valid since LR(x) is a strictly increasing function of x. Then the level and power are 0.35 and 0.50976, as in (21.3) and (21.4). This is the same as for the books: the power is identified with the number of inches. Is this the best test? Yes, as we will see from the Neyman-Pearson lemma in the next section.
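The greedy "bang for the buck" construction can also be sketched in R; the code below only illustrates the recipe (sort by LR, spend the α budget, randomize where it runs out) and is not part of the text's development:

    alpha <- 0.35
    x  <- 0:4
    f0 <- dbinom(x, 4, 1/2);  fA <- dbinom(x, 4, 3/5)
    LR <- fA / f0                        # the inches-per-dollar column of (21.11)
    o  <- order(LR, decreasing = TRUE)   # buy the best deals first
    spent <- cumsum(f0[o])
    phi <- numeric(5)
    phi[o[spent <= alpha]] <- 1          # reject outright while under budget
    k <- o[which(spent > alpha)[1]]      # first point that overshoots the budget
    phi[k] <- (alpha - sum(f0[phi == 1])) / f0[k]   # randomize there: gamma = 0.1
    phi                                  # 0, 0, 0.1, 1, 1 -- the test in (21.14)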


21.3 Neyman-Pearson lemma

Let X be the random variable or vector with density f, and f0 and fA be two possible densities for X. We are interested in testing f0 versus fA as in (21.8). For given α ∈ [0, 1], we wish to find a test function φ that

Maximizes EA[φ(X)] subject to E0[φ(X)] ≤ α. (21.15)

A test function ψ has Neyman-Pearson form if for some constant c ∈ [0, ∞] and function γ(x) ∈ [0, 1],

ψ(x) = {1 if fA(x) > c f0(x); γ(x) if fA(x) = c f0(x); 0 if fA(x) < c f0(x)} = {1 if LR(x) > c; γ(x) if LR(x) = c; 0 if LR(x) < c}, (21.16)

with the caveat that

if c = ∞ then γ(x) = 1 for all x. (21.17)

Note that this form is the same as φLR in (21.13), but allows γ to depend on x. Here,

LR(x) = fA(x)/f0(x) ∈ [0, ∞] (21.18)

is defined unless fA(x) = f0(x) = 0, in which case ψ(x) = γ(x). Notice that LR and c are allowed to take on the value ∞.

Lemma 21.1. Neyman-Pearson. Any test ψ of Neyman-Pearson form (21.16, 21.17) for which E0[ψ(X)] = α satisfies (21.15).

So basically, the likelihood ratio test is best. One can take the γ(x) to be a constant, but sometimes it is convenient to have it depend on x. Before getting to the proof, consider some special cases, of mainly theoretical interest.

• α = 0. If there is no chance of rejecting when the null is true, then one must always accept if f0(x) > 0, and it always makes sense to reject when f0(x) = 0. Such actions invoke the caveat (21.17), that is, when fA(x) > 0,

ψ(x) = {1 if LR(x) = ∞; 0 if LR(x) < ∞} = {1 if f0(x) = 0; 0 if f0(x) > 0}. (21.19)

• α = 1. This one is silly from a practical point of view, but if you do not care about rejecting when the null is true, then you should always reject, i.e., take φ(x) = 1.

• Power = 1. If you want to be sure to reject if the alternative is true, then φ(x) = 1 when fA(x) > 0, so take the test (21.16) with c = 0. Of course, you may not be able to achieve your desired α.

Proof. (Lemma 21.1) If α = 0, then the above discussion shows that taking c = ∞, γ(x) = 1 as in (21.17) is best. For α ∈ (0, 1], suppose ψ satisfies (21.16) for some c and γ(x) with E0[ψ(X)] = α, and φ is any other test function with E0[φ(X)] ≤ α. Look at

EA[ψ(X) − φ(X)] − c E0[ψ(X) − φ(X)] = ∫_X (ψ(x) − φ(x)) fA(x) dx − c ∫_X (ψ(x) − φ(x)) f0(x) dx
= ∫_X (ψ(x) − φ(x))(fA(x) − c f0(x)) dx
≥ 0. (21.20)

The final inequality holds because ψ = 1 if fA(x) − c f0(x) > 0, and ψ = 0 if fA(x) − c f0(x) < 0, so that the final integrand is always nonnegative. Thus

EA[ψ(X) − φ(X)] ≥ c E0[ψ(X) − φ(X)] ≥ 0, (21.21)

because E0[ψ(X)] = α ≥ E0[φ(X)]. Hence EA[ψ(X)] ≥ EA[φ(X)], i.e., any other level α test has lower or equal power.

There are a couple of addenda to the lemma that we will not prove here, but Lehmann and Romano (2005) does in their Theorem 3.2.1. First, for any α, there is a test of Neyman-Pearson form. Second, if the φ in the proof is not essentially of Neyman-Pearson form, then the power of ψ is strictly better than that of φ. That is,

P0[φ(X) ≠ ψ(X) & LR(X) ≠ c] > 0 ⟹ EA[ψ(X)] > EA[φ(X)]. (21.22)

21.3.1 Examples

If f0(x) > 0 and fA(x) > 0 for all x ∈ X, then it is straightforward (though maybe not easy) to find the Neyman-Pearson test. It can get tricky if one or the other density is 0 at times.

Normal means

Suppose µ0 and µA are fixed, µA > µ0, and X ∼ N(µ, 1). We wish to test

H0 : µ = µ0 versus HA : µ = µA (21.23)

with α = 0.05. Here,

LR(x) = exp(−(x − µA)²/2) / exp(−(x − µ0)²/2) = exp(x(µA − µ0) − (µA² − µ0²)/2). (21.24)

Because µA > µ0, LR(x) is strictly increasing in x, so LR(x) > c is equivalent to x > c∗ for some c∗. For level 0.05, we know that c∗ = 1.645 + µ0, so the test must reject when LR(x) > LR(c∗), i.e.,

ψ(x) = {1 if exp(x(µA − µ0) − (µA² − µ0²)/2) > c; 0 if exp(x(µA − µ0) − (µA² − µ0²)/2) ≤ c}, c = exp((1.645 + µ0)(µA − µ0) − (µA² − µ0²)/2). (21.25)


We have taken γ = 0; the probability that LR(X) = c is 0, so it doesn't matter what happens then. Expression (21.25) is unnecessarily complicated. In fact, to find c we already simplified the test, that is,

LR(x) > c ⇐⇒ x− µ0 > 1.645, (21.26)

hence

ψ(x) = {1 if x − µ0 > 1.645; 0 if x − µ0 ≤ 1.645}. (21.27)

That is, we really do not care about c, as long as we have the ψ.
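As a quick numerical check of (21.27) (an R sketch; the values µ0 = 0 and µA = 1 are chosen here only for illustration):

    mu0 <- 0; muA <- 1                                   # illustrative values
    pnorm(mu0 + 1.645, mean = mu0, lower.tail = FALSE)   # size  = 0.05
    pnorm(mu0 + 1.645, mean = muA, lower.tail = FALSE)   # power at muA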

Laplace versus normal

Suppose f0 is the Laplace pdf and fA is the N(0, 1) pdf, and α = 0.1. Then

LR(x) = [(1/√(2π)) e^(−x²/2)] / [(1/2) e^(−|x|)] = √(2/π) e^(|x| − x²/2). (21.28)

Now LR(x) > c if and only if

|x| − x²/2 > c∗ = log(c) + (1/2) log(π/2) (21.29)

if and only if (completing the square)

(|x| − 1)² < c∗∗ = 1 − 2c∗ ⟺ ||x| − 1| < c∗∗∗ = √(c∗∗). (21.30)

We need to find the constant c∗∗∗ so that

P0[||X| − 1| < c∗∗∗] = 0.10, X ∼ Laplace . (21.31)

For a smallish c∗∗∗, using the Laplace pdf,

P0[||X| − 1| < c∗∗∗] = P0[−1 − c∗∗∗ < X < −1 + c∗∗∗ or 1 − c∗∗∗ < X < 1 + c∗∗∗]
= 2 P0[−1 − c∗∗∗ < X < −1 + c∗∗∗]
= e^(−(1−c∗∗∗)) − e^(−(1+c∗∗∗)). (21.32)

Setting that probability equal to 0.10, we find c∗∗∗ = 0.1355. Figure 21.1 shows a horizontal line at 0.1355. The rejection region consists of the x's for which the graph of ||x| − 1| is below the line.

The power substitutes the normal for the Laplace in (21.32):

PA[||N(0, 1)| − 1| < 0.1355] = 2(Φ(1.1355)−Φ(0.8645)) = 0.1311.

Not very powerful, but at least it is larger than α! Of course, it is not surprising that it is hard to distinguish the normal from the Laplace with just one observation.
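The cutoff and power can be reproduced numerically; a sketch in R, solving (21.32) with uniroot:

    g <- function(cc) exp(-(1 - cc)) - exp(-(1 + cc)) - 0.10
    cstar <- uniroot(g, c(0, 1))$root          # about 0.1355
    2 * (pnorm(1 + cstar) - pnorm(1 - cstar))  # power under N(0,1), about 0.1311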


Figure 21.1: The rejection region for testing Laplace versus normal is {x | ||x| − 1| < 0.1355}. The horizontal line is at 0.1355.

Uniform versus uniform I

Suppose X ∼ Uniform(0, θ), and the question is whether θ = 1 or 2. We could then test

H0 : θ = 1 versus HA : θ = 2. (21.33)

The likelihood ratio is

LR(x) = fA(x)/f0(x) = (1/2) I[0 < x < 2] / I[0 < x < 1] = {1/2 if 0 < x < 1; ∞ if 1 ≤ x < 2}. (21.34)

No matter what, you would reject the null if 1 ≤ x < 2, because it is impossible to observe an x in that region under the Uniform(0, 1).

First try α = 0. Usually, that would mean never reject, so the power would be 0 as well, but here it is not that bad. We invoke (21.17), that is, take c = ∞ and γ(x) = 1:

ψ(x) = {1 if LR(x) = ∞; 0 if LR(x) < ∞} = {1 if 1 ≤ x < 2; 0 if 0 < x < 1}. (21.35)

Then

α = P[1 ≤ U(0, 1) < 2] = 0 and Power = P[1 ≤ U(0, 2) < 2] = 1/2. (21.36)

What if α = 0.1? Then the Neyman-Pearson test would take c = 1/2:

ψ(x) = {1 if LR(x) > 1/2; γ(x) if LR(x) = 1/2; 0 if LR(x) < 1/2} = {1 if 1 ≤ x < 2; γ(x) if 0 < x < 1}, (21.37)

because LR cannot be less than 1/2. Notice that

E0[ψ(X)] = E0[γ(X)] = ∫_0^1 γ(x) dx, (21.38)

so that any γ that integrates to α works. Some examples:

γ(x) = 0.1, γ(x) = I[0 < x < 0.1], γ(x) = I[0.9 < x < 1], γ(x) = 0.2 x. (21.39)

No matter which you choose, the power is the same:

Power = EA[ψ(X)] = (1/2) ∫_0^1 γ(x) dx + (1/2) ∫_1^2 dx = α/2 + 1/2 = 0.55. (21.40)


Uniform versus uniform II

Now switch the null and alternative in (21.33), keeping X ∼ Uniform(0, θ):

H0 : θ = 2 versus HA : θ = 1. (21.41)

Then the likelihood ratio is

LR(x) = fA(x)/f0(x) = I[0 < x < 1] / ((1/2) I[0 < x < 2]) = {2 if 0 < x < 1; 0 if 1 ≤ x < 2}. (21.42)

For level α = 0.1, we have to take c = 2:

ψ(x) = {γ(x) if LR(x) = 2; 0 if LR(x) < 2} = {γ(x) if 0 < x < 1; 0 if 1 ≤ x < 2}. (21.43)

Then any γ(x) with

(1/2) ∫_0^1 γ(x) dx = α ⟹ ∫_0^1 γ(x) dx = 0.2 (21.44)

will work. And that is the power, 0.2.

21.4 Uniformly most powerful tests

Simple versus simple is too simple. Some testing problems are not so simple, and yet do have a best test. Here is the formal definition.

Definition 21.2. The test function ψ is a uniformly most powerful (UMP) level α test for testing H0 : θ ∈ T0 versus HA : θ ∈ TA if it is level α and

Eθ[ψ(X)] ≥ Eθ[φ(X)] for all θ ∈ TA (21.45)

for any other level α test φ.

Often, one-sided tests do have a UMP test, while two-sided tests do not. For example, suppose X ∼ N(µ, 1), and we test whether µ = 0. In a one-sided testing problem, the alternative is one of µ > 0 or µ < 0, say

H0 : µ = 0 versus HA^(1) : µ > 0. (21.46)

The corresponding two-sided testing problem is

H0 : µ = 0 versus HA^(2) : µ ≠ 0. (21.47)

The usual level α = 0.05 tests for these are, respectively,

φ(1)(x) = {1 if x > 1.645; 0 if x ≤ 1.645} and φ(2)(x) = {1 if |x| > 1.96; 0 if |x| ≤ 1.96}. (21.48)

Their powers are

Eµ[φ(1)(X)] = P[N(µ, 1) > 1.645] = Φ(µ − 1.645) and
Eµ[φ(2)(X)] = P[|N(µ, 1)| > 1.96] = Φ(µ − 1.96) + Φ(−µ − 1.96), (21.49)


Figure 21.2: The probability of rejecting the null for tests of whether a normal mean µ is zero. For the alternative µ > 0, φ(1) is the best. For the alternative µ < 0, φ(3) is the best. For the alternative µ ≠ 0, the two-sided test is φ(2).

where Φ is the N(0,1) distribution function. See Figure 21.2, or Figure 15.1.

For the one-sided problem, the power is good for µ > 0, but bad (below α) for µ < 0. But the alternative is just µ > 0, so it does not matter what φ(1) does when µ < 0. For the two-sided test, the power is fairly good on both sides of µ = 0, but it is not quite as good as the one-sided test when µ > 0. The other line in the graph is the one-sided test φ(3) for the alternative µ < 0, which mirrors φ(1), rejecting when x < −1.645.
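The three power curves in Figure 21.2 are easy to recompute from (21.49); a sketch in R:

    mu <- seq(-3, 3, by = 0.01)
    pow1 <- pnorm(mu - 1.645)                     # one-sided test phi^(1)
    pow2 <- pnorm(mu - 1.96) + pnorm(-mu - 1.96)  # two-sided test phi^(2)
    pow3 <- pnorm(-mu - 1.645)                    # one-sided test phi^(3)
    matplot(mu, cbind(pow1, pow2, pow3), type = "l", lty = 1,
            xlab = expression(mu), ylab = "Probability of rejecting")
    abline(h = 0.05, lty = 2)                     # the level alpha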

The following are true, and to be proved:

• For the one-sided problem (21.46), the test φ(1) is the UMP level α = 0.05 test.

• For the two-sided problem (21.47), there is no UMP level α test. Test φ(1) is better on one side (µ > 0), and test φ(3) is better on the other side. None of the three tests is always best. In Section 21.6 we will see that φ(2) is the UMP unbiased test.

We start with the null being simple and the alternative being composite (i.e., not simple). The way to prove a test is UMP level α is to show that it is level α, and that it is of Neyman-Pearson form for each simple versus simple subproblem derived from the big problem. That is, suppose we are testing

H0 : θ = θ0 versus HA : θ ∈ TA. (21.50)

A simple versus simple subproblem takes a specific value from the alternative, so that for a given θA ∈ TA, we consider

H0 : θ = θ0 versus HA^(θA) : θ = θA. (21.51)


Theorem 21.3. Suppose that for testing problem (21.50), ψ satisfies

Eθ0[ψ(X)] = α, (21.52)

and that for each θA ∈ TA,

ψ(x) = {1 if LR(x ; θA) > c(θA); γ(x) if LR(x ; θA) = c(θA); 0 if LR(x ; θA) < c(θA)}, for some constant c(θA), (21.53)

where

LR(x ; θA) = fθA(x)/fθ0(x). (21.54)

Then ψ is a UMP level α test for (21.50).

Proof. Suppose φ is another level α test. Then for the subproblem (21.51), ψ has at least as high power, i.e.,

EθA[ψ(X)] ≥ EθA[φ(X)]. (21.55)

But that inequality is true for any θA ∈ TA, hence ψ is UMP level α.

The difficulty is to find a test ψ which is Neyman-Pearson for all θA. Consider the example with X ∼ N(µ, 1) and hypotheses

H0 : µ = 0 versus HA^(1) : µ > 0, (21.56)

and take α = 0.05. For fixed µA > 0, the Neyman-Pearson test is found as in (21.25) to (21.27) with µ0 = 0:

φ(1)(x) = {1 if LR(x ; µA) > c(µA); 0 if LR(x ; µA) ≤ c(µA)}
= {1 if exp(−((x − µA)² − x²)/2) > c(µA); 0 if exp(−((x − µA)² − x²)/2) ≤ c(µA)}
= {1 if x > (log(c(µA)) + µA²/2)/µA; 0 if x ≤ (log(c(µA)) + µA²/2)/µA}. (21.57)

The last step is valid because we know µA > 0. That messy constant is chosen so that the level is 0.05, which we know must be

(log(c(µA)) + µA²/2)/µA = 1.645. (21.58)

The key point is that (21.58) is true for any µA > 0, that is,

φ(1)(x) = {1 if x > 1.645; 0 if x ≤ 1.645} (21.59)

is true for any µA. Thus φ(1) is indeed UMP. Note that the constant c(µA) is different for each µA, but the test φ(1) is the same. (See Figure 21.2 again for its power function.)


Why is there no UMP test for the two-sided problem,

H0 : µ = 0 versus HA^(2) : µ ≠ 0? (21.60)

The best test at alternative µA > 0 is (21.59), but the best test at alternative µA < 0 is found as in (21.57), except that the inequalities on x reverse in the last step, yielding

φ(3)(x) = {1 if x < −1.645; 0 if x ≥ −1.645}, (21.61)

which is different from φ(1) in (21.59). That is, there is no test that is best at both positive and negative values of µA, so there is no UMP test.

21.4.1 One-sided exponential family testing problems

The normal example above can be extended to general exponential families. The key to the existence of a UMP test is that LR(x ; θA) is increasing in the same function of x no matter what the alternative. That is, suppose X1, . . . , Xn are iid with a one-dimensional exponential family density

f(x | θ) = a(x) exp(θ ∑ t(xi) − nρ(θ)). (21.62)

A one-sided testing problem is

H0 : θ = θ0 versus HA : θ > θ0. (21.63)

Then for fixed alternative θA > θ0,

LR(x ; θA) = f(x | θA)/f(x | θ0) = exp((θA − θ0) ∑ t(xi) − n(ρ(θA) − ρ(θ0))). (21.64)

Similar calculations as in (21.57) show that the best test at the alternative θA is

ψ(x) = {1 if LR(x ; θA) > c(θA); γ if LR(x ; θA) = c(θA); 0 if LR(x ; θA) < c(θA)} = {1 if ∑ t(xi) > c; γ if ∑ t(xi) = c; 0 if ∑ t(xi) < c}. (21.65)

Then c and γ are chosen to give the right level, but they are the same for any alternative θA > θ0. Thus the test (21.65) is UMP level α.

If the alternative were θ < θ0, then the same reasoning would work, but the inequalities would switch. For a two-sided alternative, there would not be a UMP test.

21.4.2 Monotone likelihood ratio

A generalization of exponential families that guarantees UMP tests is the class of families with monotone likelihood ratio, which is a stronger condition than the stochastic increasing property we saw in Definition 18.1 on page 306. Non-exponential-family examples include the noncentral χ² and F distributions.


Definition 21.4. A family of densities f(x | θ), θ ∈ T ⊂ R, has monotone likelihood ratio (MLR) with respect to parameter θ and statistic s(x) if for any θ′ < θ,

f(x | θ)/f(x | θ′) (21.66)

is a function of just s(x), and is nondecreasing in s(x). If the ratio is strictly increasing in s(x), then the family has strict monotone likelihood ratio.

Note in particular that this s(x) is a sufficient statistic. It is fairly easy to see that one-dimensional exponential families have MLR. The general idea of MLR is that in some sense, as θ gets bigger, s(X) gets bigger. The next lemma formalizes such a sense.

Lemma 21.5. If the family f(x | θ) has MLR with respect to θ and s(x), then for any nondecreasing function g(w),

Eθ [g(s(X))] is nondecreasing in θ. (21.67)

If the family has strict MLR, and g is strictly increasing, then the expected value in (21.67) is strictly increasing in θ.

Proof. We present the proof using pdfs. Suppose g(w) is nondecreasing, and θ′ < θ. Then

Eθ[g(s(X))] − Eθ′[g(s(X))] = ∫ g(s(x))(fθ(x) − fθ′(x)) dx
= ∫ g(s(x))(r(s(x)) − 1) fθ′(x) dx, (21.68)

where r(s(x)) = fθ(x)/fθ′(x), the ratio guaranteed to be a function of just s(x) by the MLR definition. (It does depend on θ and θ′.) Since both f's are pdfs, neither one can always be larger than the other, hence the ratio r(s) is either always 1, or sometimes less than 1 and sometimes greater. Thus there must be a constant s0 such that

r(s) ≤ 1 if s ≤ s0 and r(s) ≥ 1 if s ≥ s0. (21.69)

Note that if r is defined at s0, then r(s0) = 1. From (21.68),

Eθ[g(s(X))] − Eθ′[g(s(X))] = ∫_{s(x)<s0} g(s(x))(r(s(x)) − 1) fθ′(x) dx + ∫_{s(x)>s0} g(s(x))(r(s(x)) − 1) fθ′(x) dx
≥ ∫_{s(x)<s0} g(s0)(r(s(x)) − 1) fθ′(x) dx + ∫_{s(x)>s0} g(s0)(r(s(x)) − 1) fθ′(x) dx
= g(s0) ∫ (r(s(x)) − 1) fθ′(x) dx = 0. (21.70)

The last equality holds because the integral is ∫(fθ(x) − fθ′(x)) dx = 0. Thus Eθ[g(s(X))] is nondecreasing in θ. The proof of the result for strict MLR and strictly increasing g is left to the reader, but basically replaces the "≥" in (21.70) with a ">."


The key implication for hypothesis testing is the following, proved in Exercise 21.8.6.

Lemma 21.6. Suppose the family f(x | θ) has MLR with respect to θ and s(x), and we are testing

H0 : θ = θ0 versus HA : θ > θ0 (21.71)

for some level α. Then the test

ψ(x) = {1 if s(x) > c; γ if s(x) = c; 0 if s(x) < c}, (21.72)

where c and γ are chosen to achieve level α, is UMP level α.

In the situation in Lemma 21.6, the power function Eθ[ψ(X)] is nondecreasing in θ by Lemma 21.5, since ψ is a nondecreasing function of s(x). In fact, MLR can also be used to show that the test (21.72) is UMP level α for testing

H0 : θ ≤ θ0 versus HA : θ > θ0. (21.73)

21.5 Locally most powerful tests

We now look at tests that have the best power for alternatives very close to the null. Consider the one-sided testing problem

H0 : θ = θ0 versus HA : θ > θ0. (21.74)

Suppose the test ψ has level α, and for any other level α test φ, there exists an εφ > 0 such that

Eθ [ψ] ≥ Eθ [φ] for all θ ∈ (θ0, θ0 + εφ). (21.75)

Then ψ is a locally most powerful (LMP) level α test. Note that the ε depends on φ, so there may not be an ε that works for all φ. A UMP test will be locally most powerful.

Suppose φ and ψ both have size α: Eθ0[φ] = Eθ0[ψ] = α. Then (21.75) implies that for any θ an arbitrarily small amount above θ0,

(Eθ[ψ] − Eθ0[ψ])/(θ − θ0) ≥ (Eθ[φ] − Eθ0[φ])/(θ − θ0). (21.76)

If the power function Eθ[φ] is differentiable in θ for any φ, we can let θ → θ0 in (21.76), so that

(∂/∂θ) Eθ[ψ] |_{θ=θ0} ≥ (∂/∂θ) Eθ[φ] |_{θ=θ0}, (21.77)

i.e., an LMP test will maximize the derivative of the power at θ = θ0. Often the score tests of Section 16.3 are LMP.

To find the LMP test, we need to assume that the pdf fθ(x) is positive and differentiable in θ for all x, and that for any test φ, we can move the derivative under the integral:

(∂/∂θ) Eθ[φ] |_{θ=θ0} = ∫_X φ(x) [(∂/∂θ) fθ(x)]_{θ=θ0} dx. (21.78)


Consider the analog of (21.20), where fA is replaced by fθ, and the first summand on the left has a derivative. That is,

(∂/∂θ) Eθ[ψ(X) − φ(X)] |_{θ=θ0} − c Eθ0[ψ(X) − φ(X)]
= ∫_X (ψ(x) − φ(x)) ([(∂/∂θ) fθ(x)]_{θ=θ0} − c fθ0(x)) dx
= ∫_X (ψ(x) − φ(x))(l′(θ0 ; x) − c) fθ0(x) dx, (21.79)

where

l′(θ ; x) = (∂/∂θ) log(fθ(x)), (21.80)

the score function from Section 14.1. Now the final expression in (21.79) will be nonnegative if ψ is 1 or 0 depending on the sign of l′(θ0 ; x) − c, which leads us to define the Neyman-Pearson-like test

ψ(x) = {1 if l′(θ0 ; x) > c; γ(x) if l′(θ0 ; x) = c; 0 if l′(θ0 ; x) < c}, (21.81)

where c and γ(x) are chosen so that Eθ0[ψ] = α, the desired level. Then using calculations as in the proof of the Neyman-Pearson lemma (Lemma 21.1), we have (21.77) for any other level α test φ.

Also, similar to (21.22),

Pθ0[φ(X) ≠ ψ(X) & l′(θ0 ; X) ≠ c] > 0 ⟹ (∂/∂θ) Eθ[ψ] |_{θ=θ0} > (∂/∂θ) Eθ[φ] |_{θ=θ0}. (21.82)

Satisfying (21.81) is necessary for ψ to be LMP level α, but it is not sufficient. For example, it could be that several tests have the same best first derivative, but not all have the highest second derivative. See Exercises 21.8.16 and 21.8.17. One sufficient condition is that if φ has the same derivative as ψ, then it has the same risk for all θ. That is, ψ is LMP level α if for any other level α test φ∗ of the form (21.81) but with γ∗ in place of γ,

Eθ[γ∗(X) | l′(θ0 ; X) = c] P[l′(θ0 ; X) = c] = Eθ[γ(X) | l′(θ0 ; X) = c] P[l′(θ0 ; X) = c] for all θ > θ0. (21.83)

This condition holds immediately if the distribution of l′(θ0 ; X) is continuous, or if there is at most one x with l′(θ0 ; x) = c.

As an example, if X1, . . . , Xn are iid Cauchy(θ), so that the pdf of Xi is 1/(π(1 + (xi − θ)²)), then the LMP level α test rejects when

l′n(θ0 ; x) ≡ ∑_{i=1}^{n} 2(xi − θ0)/(1 + (xi − θ0)²) > c, (21.84)

where c is chosen to achieve size α. See (16.54). As mentioned there, this test has poor power if θ is much larger than θ0.
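A sketch in R of the LMP test (21.84) with θ0 = 0; the cutoff c is calibrated here by Monte Carlo under the null, and the sample size n = 10 and the alternative θ = 0.5 are only illustrative:

    set.seed(1)
    n <- 10; alpha <- 0.05; theta0 <- 0
    score <- function(x) sum(2 * (x - theta0) / (1 + (x - theta0)^2))
    null.scores <- replicate(1e5, score(rcauchy(n, location = theta0)))
    cc <- quantile(null.scores, 1 - alpha)        # approximate cutoff c
    alt.scores <- replicate(1e5, score(rcauchy(n, location = 0.5)))
    mean(alt.scores > cc)                         # approximate power at theta = 0.5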


21.6 Unbiased tests

Moving on a bit from situations with UMP tests, we look at restricting consideration to tests that have power of at least α for all parameter values in the alternative, such as φ(2) in Figure 21.2 for the alternative µ ≠ 0. That is, you are more likely to reject when you should than when you shouldn't. Such tests are unbiased, as in the next definition.

Definition 21.7. Consider the general hypotheses

H0 : θ ∈ T0 versus HA : θ ∈ TA (21.85)

and fixed level α. The test ψ is unbiased level α if

EθA[ψ(X)] ≥ α ≥ Eθ0[ψ(X)] for any θ0 ∈ T0 and θA ∈ TA. (21.86)

In some two-sided testing problems, most prominently one-dimensional exponential families, there exists a uniformly most powerful unbiased level α test. Here we assume a one-dimensional parameter θ with parameter space T an open interval containing θ0, and test

H0 : θ = θ0 versus HA : θ ≠ θ0. (21.87)

We also assume that for any test φ, Eθ[φ] is differentiable (and continuous) in θ. This last assumption holds in the exponential family case by Theorem 2.7.1 in Lehmann and Romano (2005). If φ is unbiased level α, then Eθ[φ] ≥ α for θ ≠ θ0, hence by continuity Eθ0[φ] = α. Furthermore, the power must have a relative minimum at θ = θ0. Thus differentiability implies that the derivative is zero at θ = θ0. That is, any unbiased level α test φ satisfies

Eθ0[φ] = α and (∂/∂θ) Eθ[φ] |_{θ=θ0} = 0. (21.88)

Again, test φ(2) in Figure 21.2 exemplifies these conditions.

Another assumption we need is that the derivative and integral in the latter equation can be switched (which holds in the exponential family case, or more generally under the Cramér conditions in Section 14.4):

(∂/∂θ) Eθ[φ] |_{θ=θ0} = Eθ0[φ(X) l(X | θ0)], where l(x | θ0) = (∂/∂θ) [f(x | θ)/f(x | θ0)] |_{θ=θ0}. (21.89)

A generalization of the Neyman-Pearson lemma (Lemma 21.1) gives conditions for the unbiased level α test with the highest power at a specific alternative θA. Letting

LR(x | θA) = f(x | θA)/f(x | θ0), (21.90)

the test has the form

ψ(x) = {1 if LR(x | θA) > c1 + c2 l(x | θ0); γ(x) if LR(x | θA) = c1 + c2 l(x | θ0); 0 if LR(x | θA) < c1 + c2 l(x | θ0)} (21.91)

Page 396: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

384 Chapter 21. Optimal Hypothesis Tests

for some constants c1 and c2. Suppose we can choose c1 and c2 so that ψ is unbiased level α, i.e., it satisfies (21.88). If φ is also unbiased level α, then as in the proof of the Neyman-Pearson lemma,

EθA[ψ] − EθA[φ] = EθA[ψ] − EθA[φ] − c1(Eθ0[ψ] − Eθ0[φ]) − c2(Eθ0[ψ(X)l(X | θ0)] − Eθ0[φ(X)l(X | θ0)])
= ∫_X (ψ(x) − φ(x))(LR(x | θA) − c1 − c2 l(x | θ0)) f(x | θ0) dx
≥ 0. (21.92)

Thus ψ has at least as good power at θA as φ. If we can show that the same ψ is of the form (21.91) for any θA ≠ θ0, then it must be a UMP unbiased level α test.

Now suppose X has a one-dimensional exponential family distribution with natural parameter θ and natural sufficient statistic s(x):

f(x | θ) = a(x) exp(θ s(x) − ρ(θ)). (21.93)

Then since ρ′(θ) = µ(θ) = Eθ[s(X)] (see Exercise 14.9.5), l(x | θ0) = s(x) − µ(θ0), hence

LR(x | θ) − c1 − c2 l(x | θ0) = e^(ρ(θ0)−ρ(θ)) e^((θ−θ0)s(x)) − c1 − c2(s(x) − µ(θ0)). (21.94)

If θ ≠ θ0, then the function in (21.94) is strictly convex in s(x) (see Definition 14.2 on page 226). Thus the set of x for which it is less than 0 is either empty or an interval (possibly half-infinite or infinite) based on s(x). In the latter case, ψ in (21.91) can be written

ψ(x) = {1 if s(x) < a or s(x) > b; γ(x) if s(x) = a or s(x) = b; 0 if a < s(x) < b} (21.95)

for some −∞ ≤ a < b ≤ ∞. In fact, for any a and b, and any θA ≠ θ0, we can find c1 and c2 so that (21.91) equals (21.95). The implication is that if for some a, b, and γ(x), the ψ in (21.95) satisfies the conditions in (21.88), then it is a UMP unbiased level α test.

To check the second condition in (21.88) for the exponential family case, (21.89) shows that for any level α test φ,

(∂/∂θ) Eθ[φ] |_{θ=θ0} = Eθ0[φ(X) l(X | θ0)] = Eθ0[φ(X)(s(X) − µ(θ0))]
= Eθ0[φ(X)s(X)] − αµ(θ0). (21.96)

For any α ∈ (0, 1), we can find a test ψ of the form (21.95) such that (21.88) holds. See Exercise 21.8.18 for the continuous case.

21.6.1 Examples

In the normal mean case, where X ∼ N(µ, 1) and we test H0 : µ = 0 versus HA : µ ≠ 0, the test that rejects when |x| > zα/2 is indeed UMP unbiased level α, since it is level α, unbiased, and of the form (21.95) with s(x) = x.

For testing a normal variance, suppose U ∼ σ²χ²ν. It may be that U = ∑(Xi − X̄)² for an iid normal sample. We test H0 : σ² = 1 versus HA : σ² ≠ 1. A reasonable test is the equal-tailed test, where we reject the null when U < a or U > b, with a and b chosen so that P[χ²ν < a] = P[χ²ν > b] = α/2. Unfortunately, that test is not unbiased. The density is an exponential family type with natural statistic U and natural parameter θ = −1/(2σ²), so that technically we are testing θ = −1/2 versus θ ≠ −1/2. Because the distribution of U is continuous, we do not have to worry about the γ. Letting fν(u) be the χ²ν pdf, we wish to find a and b so that

∫_a^b fν(u) du = 1 − α and ∫_a^b u fν(u) du = ν(1 − α). (21.97)

These equations follow from (21.88) and (21.96). They cannot be solved in closed form. Exercise 21.8.23 suggests an iterative approach for finding the constants. Here are a few values:

ν              1        2        5       10       50       100
a         0.0032   0.0847   0.9892   3.5162  32.8242   74.7436
b         7.8168   9.5303  14.3686  21.7289  72.3230  130.3910
P[χ²ν < a] 0.0448   0.0415   0.0366   0.0335   0.0289    0.0277
P[χ²ν > b] 0.0052   0.0085   0.0134   0.0165   0.0211    0.0223
                                                          (21.98)

Note that as ν increases, the two tails become more equal.

Now let X ∼ Poisson(λ). We wish to find the UMP unbiased level α = 0.05 test of H0 : λ = 1 versus HA : λ ≠ 1. Here the natural sufficient statistic is X, and the natural parameter is θ = log(λ), so we are testing θ = 0 versus θ ≠ 0. We need to find the a and b, as well as the randomization values γ(a) and γ(b), in (21.95) so that (since E1[X] = 1)

1 − α = (1 − γ(a))p(a) + ∑_{i=a+1}^{b−1} p(i) + (1 − γ(b))p(b)
= a(1 − γ(a))p(a) + ∑_{i=a+1}^{b−1} i p(i) + b(1 − γ(b))p(b), (21.99)

where p(x) is the Poisson(1) pmf, p(x) = e⁻¹/x!. For given a and b, (21.99) is a linear system of two equations in γ(a) and γ(b), hence

(γ(a)/a!, γ(b)/b!)′ = [1 1; a b]⁻¹ (∑_{i=a}^{b} 1/i! − e(1 − α), ∑_{i=a}^{b} i/i! − e(1 − α))′. (21.100)

We can try pairs (a, b) until we find one for which the γ(a) and γ(b) in (21.100) are between 0 and 1. It turns out that (0, 4) works for α = 0.05, yielding the UMP unbiased level 0.05 test

φ(x) = {1 if x ≥ 5; 0.5058 if x = 4; 0 if 1 ≤ x ≤ 3; 0.1049 if x = 0}. (21.101)


21.7 Nuisance parameters

The optimal tests so far in this chapter applied to just one-parameter models. Usually, even if we are testing only one parameter, there are other parameters needed to describe the distribution. For example, testing problems on a normal mean usually need to deal with the unknown variance. Such extra parameters are called nuisance parameters. Often their presence prevents there from being UMP or UMP unbiased tests. Exceptions can be found in certain exponential family models in which there are UMP unbiased tests.

We will illustrate with Fisher's exact test from Section 17.2. We have X1 and X2 independent, with Xi ∼ Binomial(ni, pi), i = 1, 2, and test

H0 : p1 = p2 versus HA : p1 > p2, (21.102)

where otherwise the only restriction on the pi's is that they are in (0, 1). Fisher's exact test arises by conditioning on T = X1 + X2. First, we find the conditional distribution of X1 given T = t. The joint pmf of (X1, T) can be written as a two-dimensional exponential family, where the first parameter θ1 is the log odds ratio (similar to that in Exercise 13.8.22),

θ1 = log((p1/(1 − p1)) · ((1 − p2)/p2)). (21.103)

The pmf is

f(θ1,θ2)(x1, t) = (n1 choose x1)(n2 choose t − x1) p1^x1 (1 − p1)^(n1−x1) p2^(t−x1) (1 − p2)^(n2−t+x1)
= (n1 choose x1)(n2 choose t − x1) [(p1/(1 − p1))((1 − p2)/p2)]^x1 [p2/(1 − p2)]^t (1 − p1)^n1 (1 − p2)^n2
= (n1 choose x1)(n2 choose t − x1) e^(θ1 x1 + θ2 t − ρ(θ1,θ2)), (21.104)

where θ2 = log(p2/(1− p2)). Hence the conditional pmf is

f(θ1,θ2)(x1 | t) = f(θ1,θ2)(x1, t) / ∑_{y1∈Xt} f(θ1,θ2)(y1, t)
= (n1 choose x1)(n2 choose t − x1) e^(θ1 x1) / ∑_{y1∈Xt} (n1 choose y1)(n2 choose t − y1) e^(θ1 y1),
Xt = {max{0, t − n2}, . . . , min{t, n1}}. (21.105)

Thus conditional on T = t, X1 has a one-dimensional exponential family distribution with natural parameter θ1, and the hypotheses in (21.102) become H0 : θ1 = 0 versus HA : θ1 > 0. The distribution for X1 | T = t in (21.105) is called the noncentral hypergeometric distribution. When θ1 = 0, it is the Hypergeometric(n1, n2, t) from (17.16). There are three main steps to showing the test is UMP unbiased level α.

Step 1: Show the test ψ is the UMP conditional test. The Neyman-Pearson test for the problem conditioning on T = t is as in (21.65) for the exponential family case:

ψ(x1, t) = {1 if x1 > c(t); γ(t) if x1 = c(t); 0 if x1 < c(t)}, (21.106)


where the constants c(t) and γ(t) are chosen so that

E(0,θ2)[ψ(X1, t) | T = t] = α. (21.107)

(Note that the conditional distribution does not depend on θ2.) Thus by the Neyman-Pearson lemma (Lemma 21.1), for given t, ψ(x1, t) is the conditional UMP level α test given T = t. That is, if φ(x1, t) is another test with conditional level α, it cannot have better conditional power:

E(0,θ2)[φ(X1, t) | T = t] = α =⇒ E(θ1,θ2)[ψ(X1, t) | T = t] ≥ E(θ1,θ2)[φ(X1, t) | T = t]

for all θ1 > 0, θ2 ∈ R. (21.108)

Step 2: Show that any unbiased level α test has conditional level α for each t. Now let φ be any unbiased level α test for the unconditional problem. Since the power function is continuous in θ, φ must have size α:

E(0,θ2)[φ(X1, T)] = α for all θ2 ∈ R. (21.109)

Look at the conditional expected value of φ under the null, which is a function of just t:

eφ(t) = E(0,θ2)[φ(X1, t) | T = t]. (21.110)

Thus from (21.109), if θ1 = 0,

E(0,θ2)[eφ(T)] = α for all θ2 ∈ R. (21.111)

The null θ1 = 0 is the same as p1 = p2, hence marginally, T ∼ Binomial(n1 + n2, p2). Since this model is a one-dimensional exponential family model with parameter θ2 ∈ R, we know from Lemma 19.5 on page 331 that the model is complete. That is, there is only one unbiased estimator of α, which is the constant α itself. Thus eφ(t) = α for all t, or by (21.110),

E(0,θ2)[φ(X1, t) | T = t] = α for all t ∈ {0, . . . , n1 + n2}. (21.112)

Step 3: Argue that conditionally best implies unconditionally best. Suppose φ is unbiased level α, so that (21.112) holds. Then by (21.108), for each t,

E(θ1,θ2)[ψ(X1, t) | T = t] ≥ E(θ1,θ2)[φ(X1, t) | T = t] for all θ1 > 0, θ2 ∈ R. (21.113)

Taking expectations over T yields

E(θ1,θ2)[ψ(X1, T)] ≥ E(θ1,θ2)[φ(X1, T)] for all θ1 > 0, θ2 ∈ R. (21.114)

Thus ψ is indeed UMP unbiased level α.
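For a given t, the conditional test (21.106) is easy to construct from the hypergeometric pmf; a sketch in R, where the values n1 = 5, n2 = 6, t = 4 are purely hypothetical:

    n1 <- 5; n2 <- 6; t <- 4; alpha <- 0.05   # hypothetical values
    xs <- max(0, t - n2):min(t, n1)           # the support X_t in (21.105)
    p  <- dhyper(xs, n1, n2, t)               # null conditional pmf of X1 given T = t
    upper <- sapply(xs, function(k) sum(p[xs > k]))   # P[X1 > k | T = t]
    ct  <- min(xs[upper <= alpha])            # c(t): reject outright when x1 > c(t)
    gam <- (alpha - sum(p[xs > ct])) / p[xs == ct]    # gamma(t) making the conditional size alpha
    c(ct, gam)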

If the alternative hypothesis in (21.102) is two-sided, p1 ≠ p2, the same idea will work, but where the ψ is conditionally the best unbiased level α test, so has the form (21.95) for each t. This approach works for exponential families in general. We need to be able to write the exponential family so that with natural parameter θ = (θ1, . . . , θp) and natural statistic (t1(x), . . . , tp(x)), the null hypothesis is θ1 = 0 and the alternative is either one-sided or two-sided. Then we condition on (t2(X), . . . , tp(X)) to find the best conditional test. To prove that the test is UMP unbiased level α, the marginal model for (t2(X), . . . , tp(X)) under the null needs to be complete, which will often be the case. Section 4.4 of Lehmann and Romano (2005) details and extends these ideas. Also, see Exercises 21.8.20 through 21.8.23.


21.8 Exercises

Exercise 21.8.1. Suppose X ∼ Exponential(λ), and consider testing H0 : λ = 2 versus HA : λ = 5. Find the best level α = 0.05 test and its power.

Exercise 21.8.2. Suppose X1, X2, X3 are iid Poisson(λ), and consider testing H0 : λ = 2 versus HA : λ = 3. Find the best level α = 0.05 test and its power.

Exercise 21.8.3. Suppose X ∼ N(θ, θ) (just one observation). Find explicitly the best level α = 0.05 test of H0 : θ = 1 versus HA : θ > 1.

Exercise 21.8.4. Suppose X ∼ Cauchy(θ), i.e., has pdf 1/(π(1 + (x − θ)²)). (a) Find the best level α = 0.05 test of H0 : θ = 0 versus HA : θ = 1. Find its power. (b) Consider using the test from part (a) for testing H0 : θ = 0 versus HA : θ > 0. What is its power as θ → ∞? Is there a UMP level α = 0.05 test for this situation?

Exercise 21.8.5. The table below describes the horses in a race. You have $35 to bet, which you can distribute among the horses in any way you please as long as you do not bet more than the maximum bet for any horse. In the "φ" column, put down a number in the range [0, 1] that indicates the proportion of the maximum bet you wish to bet on each horse. (Any money left over goes to me.) So if you want to bet the maximum bet on a particular horse, put "1," and if you want to bet nothing, put "0," or put something in between. If that horse wins, then you get $100 × φ. Your objective is to fill in the φ's to maximize your expected winnings,

$100 × ∑_{i=1}^{5} φi P[Horse i wins] (21.115)

subject to the constraint that

∑_{i=1}^{5} φi × (Maximum bet)i = $35. (21.116)

(a) Fill in the φ's and the amount bet on the five horses to maximize the expected winnings subject to the constraints.

Horse         Maximum Bet   Probability of winning   φ   Amount Bet
Trigger       $6.25         0.0256
Man-o-War     $25.00        0.1536
Mr. Ed        $37.50        0.3456
Silver        $25.00        0.3456
Sea Biscuit   $6.25         0.1296
                                                             (21.117)

(b) What are the expected winnings for the best strategy?

Exercise 21.8.6. Prove Lemma 21.6.

Exercise 21.8.7. Suppose X1, . . . , Xn are iid Beta(β, β) for β > 0. (a) Show that this family has monotone likelihood ratio with respect to β and a statistic T, and give the statistic T. (b) Find the form of the UMP level α test of H0 : β = 1 versus HA : β < 1. (c) For n = 1 and α = 0.05, find the UMP level α test explicitly. Find and sketch the power function.


Exercise 21.8.8. Suppose (Xi, Yi), i = 1, . . . , n, are iid

N2((0, 0)′, (1 ρ; ρ 1)), ρ ∈ (−1, 1). (21.118)

(a) Show that a sufficient statistic is (T1, T2), where T1 = ∑(Xi² + Yi²) and T2 = ∑ XiYi. (b) Find the form of the best level α test for testing H0 : ρ = 0 versus HA : ρ = 0.5. (The test statistic is a linear combination of T1 and T2.) (c) Find the form of the best level α test for testing H0 : ρ = 0 versus HA : ρ = 0.7. (d) Does there exist a UMP level α test of H0 : ρ = 0 versus HA : ρ > 0? If so, find it. If not, why not? (e) Find the form of the LMP level α test for testing H0 : ρ = 0 versus HA : ρ > 0.

Exercise 21.8.9. Consider the null hypothesis to be that X is Discrete Uniform(0, 4), so that it has pmf

f0(x) = 1/5, x = 0, 1, 2, 3, 4, (21.119)

and 0 otherwise. The alternative is that X ∼ Geometric(1/2), so that

fA(x) = 1/2^(x+1), x = 0, 1, 2, . . . . (21.120)

(a) Give the best level α = 0 test φ. What is the power of this test? (b) Give the best level α = 0.30 test φ. What is the power of this test?

Exercise 21.8.10. Now reverse the hypotheses from Exercise 21.8.9, so that the null hypothesis is that X ∼ Geometric(1/2), and the alternative is that X ∼ Discrete Uniform(0, 4). (a) Give the best level α = 0 test φ. What is the power of this test? (b) Give the best level α = 0.30 test φ. What is the power of this test? (c) Among tests with power = 1, find the one with the smallest level. What is the size of this test?

Exercise 21.8.11. Suppose X ∼ N(µ, µ²), so that the absolute value of the mean and the standard deviation are the same. (There is only one observation.) Consider testing H0 : µ = 1 versus HA : µ > 1. (a) Find the level α = 0.10 test with the highest power at µ = 2. (b) Find the level α = 0.10 test with the highest power at µ = 3. (c) Find the powers of the two tests in parts (a) and (b) at µ = 2 and 3. (d) Is there a UMP level 0.10 test for this hypothesis testing problem?

Exercise 21.8.12. For each testing problem, say whether there is a UMP level 0.05 test or not. (a) X ∼ Uniform(0, θ), H0 : θ = 1 versus HA : θ < 1. (b) X ∼ Poisson(λ), H0 : λ = 1 versus HA : λ > 1. (c) X ∼ Poisson(λ), H0 : λ = 1 versus HA : λ ≠ 1. (d) X ∼ N(µ, σ²), H0 : µ = 0, σ² = 1 versus HA : µ > 0, σ² > 0. (e) X ∼ N(µ, σ²), H0 : µ = 0, σ² = 1 versus HA : µ > 0, σ² = 1. (f) X ∼ N(µ, σ²), H0 : µ = 0, σ² = 1 versus HA : µ = 1, σ² = 10.

Exercise 21.8.13. This exercise shows that the noncentral chi-square and noncentral F distributions have monotone likelihood ratio. Assume the degrees of freedom are fixed, so that the noncentrality parameter ∆ ≥ 0 is the only parameter. From (7.130) and (7.134) we have that the pdfs, with w > 0 as the variable, can be written as

f(w | ∆) = f(w | 0) e^(−∆/2) ∑_{k=0}^{∞} ck ∆^k w^k, (21.121)


where ck > 0 for each k. For given ∆ > ∆′, write

f(w | ∆)/f(w | ∆′) = e^((∆′−∆)/2) R(w, ∆, ∆′), where R(w, ∆, ∆′) = ∑_{k=0}^{∞} ck ∆^k w^k / ∑_{k=0}^{∞} ck ∆′^k w^k. (21.122)

For fixed ∆′, consider the random variable K with space the nonnegative integers, parameter w, and pmf

g(k | w) = ck ∆′^k w^k / ∑_{l=0}^{∞} cl ∆′^l w^l. (21.123)

(a) Is g(k | w) a legitimate pmf? Show that it has strict monotone likelihood ratio with respect to k and w. (b) Show that R(w, ∆, ∆′) = Ew[(∆/∆′)^K] where K has pmf g. (c) Use part (a) and Lemma 21.5 to show that for fixed ∆ > ∆′, Ew[(∆/∆′)^K] is increasing in w. (d) Argue that f(w | ∆)/f(w | ∆′) is increasing in w, hence f has strict monotone likelihood ratio wrt w and ∆.

Exercise 21.8.14. Find the form of the LMP level α test for testing H0 : θ = 0 versus HA : θ > 0 based on X1, . . . , Xn iid Logistic(θ). (So the pdf of Xi is exp(xi − θ)/(1 + exp(xi − θ))².)

Exercise 21.8.15. Recall the fruit fly example in Exercise 14.9.2 (and points earlier). Here, (N00, N01, N10, N11) is Multinomial(n, p(θ)) with

p(θ) = ((1/2)(1 − θ)(2 − θ), (1/2)θ(1 − θ), (1/2)θ(1 − θ), (1/2)θ(1 + θ)). (21.124)

Test the hypotheses H0 : θ = 1/2 versus HA : θ > 1/2. (a) Show that there is no UMP level α test for α ∈ (0, 1). (b) Show that any level α test that maximizes the derivative of Eθ[φ] at θ = 1/2 can be written as

φ(n) = {1 if n11 − n00 > c; γ(n) if n11 − n00 = c; 0 if n11 − n00 < c} (21.125)

for some constant c and function γ(n). (c) Do you think φ in (21.125) is guaranteed to be the LMP level α test? Or does it depend on what γ is?

Exercise 21.8.16. Suppose X ∼ N(θ², 1) and we wish to test H0 : θ = 0 versus HA : θ > 0 at level α = 0.05. (a) Show that [∂Eθ[φ]/∂θ]|θ=0 = 0 for any test φ. (b) Argue that the test φ∗(x) = α has level α and maximizes the derivative of Eθ[φ] at θ = 0. Is it LMP level α? (c) Find the UMP level α test ψ. Is it LMP level α? (d) Find [∂²Eθ[φ]/∂θ²]|θ=0 for φ = φ∗ and φ = ψ. Which is larger?

Exercise 21.8.17. This exercise provides an example of finding an LMP test when the condition (21.83) fails. Suppose X1 and X2 are independent, with X1 ∼ Binomial(4, θ) and X2 ∼ Binomial(3, θ²). We test H0 : θ = 1/2 versus HA : θ > 1/2. (a) Show that any level α test that maximizes [∂Eθ[φ]/∂θ]|θ=1/2 has the form

φ(x1, x2) = {1 if 3x1 + 4x2 > c; γ(x1, x2) if 3x1 + 4x2 = c; 0 if 3x1 + 4x2 < c}. (21.126)


(b) Show that for tests of the form (21.126),

Eθ[φ] = Pθ[3X1 + 4X2 > c] + Eθ[γ(X1, X2) | 3X1 + 4X2 = c] Pθ[3X1 + 4X2 = c]. (21.127)

(c) For level α = 0.25, the cutoff is c = 12. Then P1/2[3X1 + 4X2 > 12] = 249/2^10 ≈ 0.2432 and P1/2[3X1 + 4X2 = 12] = 28/2^10 ≈ 0.02734. Show that in order for the test (21.126) to have level 0.25, we need

E1/2[γ(X1, X2) | 3X1 + 4X2 = 12] = 1/4. (21.128)

(d) Show that {(x1, x2) | 3x1 + 4x2 = 12} consists of just (4, 0) and (0, 3), and

Pθ[(X1, X2) = (4, 0) | 3X1 + 4X2 = 12] = (1 + θ)³/((1 + θ)³ + θ³), (21.129)

hence

Eθ[γ(X1, X2) | 3X1 + 4X2 = 12] = (γ(4, 0)(1 + θ)³ + γ(0, 3)θ³)/((1 + θ)³ + θ³). (21.130)

(e) Using (21.128) and (21.130), to obtain size 0.25 we need 27γ(4, 0) + γ(0, 3) = 7. Find the range of such possible γ(0, 3)'s. [Don't forget that 0 ≤ φ(x) ≤ 1.] (f) Show that among the level 0.25 tests of the form (21.126), the one with γ(0, 3) = 1 maximizes (21.130) for all θ ∈ (0.5, 1). Call this test ψ. (g) Argue that ψ from part (f) is the LMP level 0.25 test.

Exercise 21.8.18. Suppose X has an exponential family pdf, where X itself is the natural sufficient statistic, so that f(x | θ) = a(x) exp(xθ − ρ(θ)). We test H0 : θ = θ0 versus HA : θ ≠ θ0. Assume that X = (k, l), where k or l could be infinite, and a(x) > 0 for x ∈ X. Consider tests ψ of the form (21.95) for some a, b, where by continuity we can set γ(x) = 0. Fix α ∈ (0, 1). (a) Show that

1 − Eθ0[ψ] = ∫_a^b f(x | θ0) dx = Fθ0(b) − Fθ0(a) and
(∂/∂θ) Eθ[ψ] |_{θ=θ0} = −∫_a^b (x − µ(θ0)) f(x | θ0) dx. (21.131)

(b) Let a∗ be the lower α cutoff point, i.e., Fθ0(a∗) = α. For a ≤ a∗, define b(a) = Fθ0⁻¹(Fθ0(a) + 1 − α). Show that b(a) is well-defined and continuous in a ∈ (k, a∗), and that Fθ0(b(a)) − Fθ0(a) = 1 − α. (c) Show that lim_{a→k} b(a) = b∗, where 1 − Fθ0(b∗) = α, and lim_{a→a∗} b(a) = l. (d) Consider the function of a,

d(a) = ∫_a^{b(a)} (x − µ(θ0)) f(x | θ0) dx. (21.132)

Show that

lim_{a→k} d(a) = ∫_k^{b∗} (x − µ(θ0)) f(x | θ0) dx < 0 and
lim_{a→a∗} d(a) = ∫_{a∗}^{l} (x − µ(θ0)) f(x | θ0) dx > 0. (21.133)


[Hint: Note that the integral from k to l is 0.] Argue that by continuity of d(a), there must be an a0 such that d(a0) = 0. (e) Using ψ with a = a0 from part (d) and b = b(a0), show that (21.88) holds, proving that ψ is the UMP unbiased level α test.

Exercise 21.8.19. Continue the setup from Exercise 21.8.18, where now θ0 = 0, so that we test H0 : θ = 0 versus HA : θ ≠ 0. Also, suppose the distribution under the null is symmetric about 0, i.e., f(x | 0) = f(−x | 0), so that µ(0) = 0. Let a be the upper α/2 cutoff point for the null distribution of X, so that P0[|X| > a] = α. Show that the UMP unbiased level α test rejects the null when |X| > a.

Exercise 21.8.20. Suppose X1, . . . , Xn are iid N(µ, σ2), and we test

H0 : µ = 0, σ2 > 0 versus HA : µ > 0, σ2 > 0. (21.134)

The goal is to find the UMP unbiased level α test. (a) Write the density of X as a two-parameter exponential family, where the natural parameter is (θ1, θ2) with θ1 = nµ/σ² and θ2 = −1/(2σ²), and the natural sufficient statistic is (x̄, w) with w = ∑ xi². Thus we are testing θ1 = 0 versus θ1 ≠ 0, with θ2 as a nuisance parameter. (b) Show that the conditional distribution of X̄ given W = w has space (−√(w/n), √(w/n)) and pdf

fθ1(x̄ | w) = (w − nx̄²)^((n−3)/2) e^(θ1 x̄) / ∫_{−√(w/n)}^{√(w/n)} (w − nz²)^((n−3)/2) e^(θ1 z) dz. (21.135)

[Hint: First write down the joint pdf of (X̄, V) where V = ∑(Xi − X̄)², then use the transformation w = v + nx̄².] (c) Argue that, conditioning on W = w, the conditional UMP level α test of the null is

φ(x̄, w) = {1 if x̄ ≥ c(w); 0 if x̄ < c(w)}, (21.136)

where c(w) is the constant such that P0[X̄ > c(w) | W = w] = α. (d) Show that under the null, W has a one-dimensional exponential family distribution, and the model is complete. Thus the test φ(x̄, w) is the UMP unbiased level α test.

Exercise 21.8.21. Continue with the testing problem in Exercise 21.8.20. Here we show that the test φ in (21.136) is the t test. We take θ1 = 0 throughout this exercise. (a) First, let u = √n x̄/√w, and show that the conditional distribution of U given W = w is

g(u | w) = d (1 − u²)^((n−3)/2), −1 < u < 1, (21.137)

where d is a constant not depending on w. Note that this conditional distribution does not depend on w, hence U is independent of W. (b) Show that

T ≡ √(n − 1) U/√(1 − U²) = √n X̄ / √(∑(Xi − X̄)²/(n − 1)), (21.138)

the usual t statistic. Why is T independent of W? (c) Show that T is a function of (X̄, W), and argue that

φ(x̄, w) = {1 if t(x̄, w) ≥ tn−1,α; 0 if t(x̄, w) < tn−1,α}, (21.139)

where tn−1,α is the upper α cutoff point of a tn−1 distribution. Thus the one-sided t test is the UMP unbiased level α test.


Exercise 21.8.22. Suppose X1, . . . , Xn are iid N(µ, σ²) as in Exercise 21.8.20, but here we test the two-sided hypotheses

H0 : µ = 0, σ² > 0 versus HA : µ ≠ 0, σ² > 0. (21.140)

Show that the two-sided t test, which rejects the null when |T| > tn−1,α/2, is the UMP unbiased level α test for (21.140). [Hint: Follow Exercises 21.8.20 and 21.8.21, but use Exercise 21.8.19 as well.]

Exercise 21.8.23. Let U ∼ σ²χ²ν. We wish to find the UMP unbiased level α test for testing H0 : σ² = 1 versus HA : σ² ≠ 1. The test is to reject the null when u < a0 or u > b0, where a0 and b0 satisfy the conditions in (21.97). (a) With fν being the χ²ν pdf, show that

∫_a^b u fν(u) du = ν ∫_a^b fν+2(u) du. (21.141)

(b) Letting Fν be the χ²ν distribution function, show that the conditions in (21.97) can be written

Fν(b) − Fν(a) = 1 − α = Fν+2(b) − Fν+2(a). (21.142)

Thus with b(a) = Fν⁻¹(Fν(a) + 1 − α), we wish to find a0 so that

g(a0) = 0 where g(a) = Fν+2(b(a))− Fν+2(a)− (1− α). (21.143)

Based on an initial guess a1 for a0, the Newton-Raphson iteration for obtaining a new guess ai+1 from guess ai is ai+1 = ai − g(ai)/g′(ai). (c) Show that g′(a) = fν(a)(b(a) − a)/ν. [Hint: Note that dFν⁻¹(x)/dx = 1/fν(Fν⁻¹(x)).] Thus the iterations are

ai+1 = ai − ν [Fν+2(b(ai)) − Fν+2(ai) − (1 − α)] / [fν(ai)(b(ai) − ai)], (21.144)

which can be implemented using just the χ2 pdf and distribution function.
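A sketch of the iteration (21.144) in R, using pchisq, dchisq, and qchisq; the starting value and convergence tolerance are choices made here, not part of the exercise:

    umpu.chisq <- function(nu, alpha = 0.05, tol = 1e-10) {
      b.of.a <- function(a) qchisq(pchisq(a, nu) + 1 - alpha, nu)
      a <- qchisq(alpha / 2, nu)             # start at the equal-tailed cutoff
      repeat {
        b <- b.of.a(a)
        g <- pchisq(b, nu + 2) - pchisq(a, nu + 2) - (1 - alpha)
        step <- nu * g / (dchisq(a, nu) * (b - a))
        a <- a - step
        if (abs(step) < tol) break
      }
      c(a = a, b = b.of.a(a))
    }
    umpu.chisq(5)    # approximately a = 0.9892, b = 14.3686, matching (21.98)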

Exercise 21.8.24. In testing hypotheses of the form H0 : θ = 0 versus HA : θ > 0, an asymptotically most powerful level α test is a level α test ψ such that for any other level α test φ, there exists an Nφ such that

Eθ [ψ] ≥ Eθ [φ] for all θ > Nφ. (21.145)

Let X1, . . . , Xn be iid Laplace(θ), so that the pdf of Xi is (1/2) exp(−|xi − θ|). Consider the test

ψ(x) = {1 if ∑ max{0, xi} > c; 0 if ∑ max{0, xi} ≤ c}, (21.146)

where c is chosen to obtain size α. (a) Sketch the acceptance region for ψ when n = 2 and c = 1. (b) Show that for each x,

lim_{θ→∞} e^(nθ−2c) f(x | θ)/f(x | 0) = e^(2(∑ max{0, xi} − c)). (21.147)

(c) Let φ be level α. Show that

\[
e^{n\theta - 2c}\, E_\theta[\psi(X) - \phi(X)] - E_0[\psi(X) - \phi(X)]
= \int_{\mathcal X} (\psi(x) - \phi(x)) \left( e^{n\theta - 2c}\,\frac{f(x\mid\theta)}{f(x\mid 0)} - 1 \right) f(x\mid 0)\, dx, \qquad (21.148)
\]


and since ψ has size α and φ has level α,

\[
e^{n\theta - 2c}\, E_\theta[\psi(X) - \phi(X)] \ge \int_{\mathcal X} (\psi(x) - \phi(x)) \left( e^{n\theta - 2c}\,\frac{f(x\mid\theta)}{f(x\mid 0)} - 1 \right) f(x\mid 0)\, dx. \qquad (21.149)
\]

(d) Let θ → ∞ on the right-hand side of (21.149). Argue that the limit is nonnegative, and unless P0[φ(X) = ψ(X)] = 1, the limit is positive. (e) Explain why part (d) shows that ψ is asymptotically most powerful level α.


Chapter 22

Decision Theory in Hypothesis Testing

22.1 A decision-theoretic framework

Again consider the general hypothesis testing problem

H0 : θ ∈ T0 versus HA : θ ∈ TA. (22.1)

The previous chapter exhibits a number of best-test scenarios, all where the essential part of the null hypothesis was based on a single parameter. This chapter deals with multiparametric hypotheses, where there typically is no UMP or UMP unbiased test. Admissibility and minimaxity are then relevant concepts.

The typical decision-theoretic framework used for testing has action space A = {Accept, Reject}, denoting accepting or rejecting the null hypothesis. The usual loss function used for hypothesis testing is called 0/1 loss, where we lose 1 if we make a wrong decision, and lose nothing if we are correct. The loss function combines elements of the tables on testing in (15.7) and on game theory in (20.56):

\[
\begin{array}{l|cc}
L(a,\theta) & \multicolumn{2}{c}{\text{Action}} \\
 & \text{Accept} & \text{Reject} \\ \hline
\theta \in \mathcal{T}_0 & 0 & 1 \\
\theta \in \mathcal{T}_A & 1 & 0
\end{array} \qquad (22.2)
\]

The risk is thus the probability of making an error given θ. Using test functions φ : X → [0, 1] as in (21.5), where φ(x) is the probability of rejecting the null when x is observed, the risk function is

\[
R(\theta;\phi) = \begin{cases} E_\theta[\phi(X)] & \text{if } \theta \in \mathcal{T}_0 \\ 1 - E_\theta[\phi(X)] & \text{if } \theta \in \mathcal{T}_A \end{cases}. \qquad (22.3)
\]

Note that if θ ∈ TA, the risk is one minus the power.

There are a few different approaches to evaluating tests decision-theoretically, depending on how one deals with the level. The generic approach does not place any restrictions on level, evaluating tests on their power as well as their size function. A more common approach to hypothesis testing is to fix α, and consider only tests φ of level α. The question then becomes whether to look at the risk for parameter values in the null, or just worry about the power.


For example, suppose X ∼ N(µ, 1) and we test H0 : µ ≤ 0 versus HA : µ > 0, restricting to tests with level α = 0.05. If we take risk at the null and alternative into account, then any test that rejects the null when X ≥ c for some c ≥ 1.645 is admissible, since it is the uniformly most powerful test of its size. That is, the test I[X ≥ 1.645] is admissible, but so is the test I[X ≥ 1.96], which has smaller power but smaller size. If we evaluate only on power, so ignore size except for making sure it is no larger than 0.05, I[X ≥ 1.96] is dominated by I[X ≥ 1.645]; in fact, the latter is the only admissible test. If we restrict to tests with size function exactly equal to α, i.e., R(θ ; φ) = α for all θ ∈ T0, then the power is the only relevant decider.
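A quick numerical check of these size and power claims (my own sketch, not from the text), using SciPy's normal distribution; the alternative value µ = 1 is chosen purely for illustration.

from scipy.stats import norm

for c in (1.645, 1.96):
    size = norm.sf(c)                # P_0[X >= c], the size (risk at mu = 0)
    power_at_1 = norm.sf(c - 1.0)    # P_{mu=1}[X >= c]
    print(f"cutoff {c}: size = {size:.3f}, power at mu = 1 is {power_at_1:.3f}")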

The Rao-Blackwell theorem (Theorem 13.8 on page 210) showed that when estimating with squared-error loss, any estimator that is not essentially a function of just the sufficient statistic is inadmissible. For testing, the result is not quite as strong. Suppose φ(x) is any test, and T = t(X) is a sufficient statistic. Then let $\tilde\phi(t) = E_\theta[\phi(X) \mid t(X) = t]$. (The conditional expected value does not depend on the parameter by sufficiency.) Note that $0 \le \tilde\phi \le 1$, hence it is also a test function. For each θ, we have
\[
E_\theta[\tilde\phi(T)] = E_\theta[\phi(X)] \implies R(\theta;\phi) = R(\theta;\tilde\phi). \qquad (22.4)
\]

That is, any test's risk function can be exactly matched by a test depending on just the sufficient statistic. Thus when analyzing a testing problem, we lose nothing by reducing by sufficiency: We look at the same hypotheses, but base the tests on the sufficient statistic T.

The UMP, UMP unbiased, and LMP level α tests we saw in Chapter 21 will be admissible under certain reasonable conditions. See Exercise 22.8.1. In the next section we look at Bayes tests, and conditions under which they are admissible. Section 22.3 looks at necessary conditions for a test to be admissible. Basically, it must be a Bayes test or a certain type of limit of Bayes tests. Section 22.4 considers the special case of compact parameter spaces for the hypotheses, and Section 22.5 contains some cases where tests with convex acceptance regions are admissible. Section 22.6 introduces invariance, which is a method for exploiting symmetries in the model to simplify analysis of test statistics. It is especially useful in multivariate analysis.

We will not say much about minimaxity in hypothesis testing, though it can be useful. Direct minimaxity for typical testing problems is not very interesting since the maximal risk for a level α test is 1 − α (if α < 0.5, the null and alternative are not separated, and the power function is continuous in θ). See the second graph in Figure 22.1. If we restrict to level α tests, then they all have the same maximal risk, and if we allow all levels, then the minimax tests are the ones with level 0.5. If the alternative is separated from the null, e.g., testing θ = 0 versus θ > 1, then the minimax test will generally be one that is most powerful at the closest point in the alternative, or Bayes wrt a prior concentrated on the set of closest points if there are more than one (as in the hypotheses in (22.30)). More informative is maximal regret, where we restrict to level α tests, and define the risk to be the distance between the actual power and the best possible power at each alternative:

\[
R(\theta;\phi) = \sup_{\text{level } \alpha \text{ tests } \psi} E_\theta[\psi] - E_\theta[\phi]. \qquad (22.5)
\]

See van Zwet and Oosterhoff (1967) for some applications.


22.2 Bayes tests

Section 15.4 introduced Bayes tests. Here we give their formal decision-theoretic justification. The prior distribution π over T0 ∪ TA is given in two stages. The marginal probabilities of the hypotheses are π0 = P[θ ∈ T0] and πA = P[θ ∈ TA], π0 + πA = 1. Then conditionally, θ given H0 is true has conditional density ρ0(θ), and given HA is true has conditional density ρA(θ). If we look at all possible tests, and take into account size and power for the risk, a Bayes test wrt π minimizes

\[
R(\pi;\phi) = \pi_A \int_{\mathcal{T}_A} (1 - E_\theta[\phi(X)])\,\rho_A(\theta)\,d\theta + \pi_0 \int_{\mathcal{T}_0} E_\theta[\phi(X)]\,\rho_0(\theta)\,d\theta \qquad (22.6)
\]

over φ. If X has pdf f (x | θ), then

\[
\begin{aligned}
R(\pi;\phi) &= \pi_A \int_{\mathcal{T}_A} \Big(1 - \int_{\mathcal X} \phi(x)\, f(x\mid\theta)\,dx\Big)\, \rho_A(\theta)\,d\theta + \pi_0 \int_{\mathcal{T}_0} \int_{\mathcal X} \phi(x)\, f(x\mid\theta)\,dx\,\rho_0(\theta)\,d\theta \\
&= \int_{\mathcal X} \phi(x) \Big( \pi_0 \int_{\mathcal{T}_0} f(x\mid\theta)\,\rho_0(\theta)\,d\theta - \pi_A \int_{\mathcal{T}_A} f(x\mid\theta)\,\rho_A(\theta)\,d\theta \Big)\, dx + \pi_A. \qquad (22.7)
\end{aligned}
\]

To minimize this Bayes risk, we take φ(x) to minimize the integrand in the last line. Since φ must be in [0,1], we take φ(x) = 0 if the quantity in the large parentheses is positive, and φ(x) = 1 if it is negative, yielding a test of the form

\[
\phi_\pi(x) = \begin{cases} 1 & \text{if } B_{A0}(x)\,\pi_A/\pi_0 > 1 \\ \gamma(x) & \text{if } B_{A0}(x)\,\pi_A/\pi_0 = 1 \\ 0 & \text{if } B_{A0}(x)\,\pi_A/\pi_0 < 1 \end{cases}, \qquad (22.8)
\]

where BA0 is the Bayes factor as in (15.40),

\[
B_{A0}(x) = \frac{\int_{\mathcal{T}_A} f(x\mid\theta)\,\rho_A(\theta)\,d\theta}{\int_{\mathcal{T}_0} f(x\mid\theta)\,\rho_0(\theta)\,d\theta}. \qquad (22.9)
\]

(If BA0(x)πA/π0 is 0/0, take it to be 1.) Thus the Bayes test rejects the null if, under the posterior, the null is probably false, and accepts the null if it is probably true.
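As an illustration of (22.8) and (22.9), here is a small Python sketch. It assumes, purely for concreteness, a N(0, 1) null and a point-mass alternative prior at a hypothetical θA = 2, so the Bayes factor reduces to a likelihood ratio; the choice γ(x) = 1/2 on the boundary is arbitrary, and none of these specifics come from the text.

from scipy.stats import norm

def bayes_test(x, theta_A=2.0, pi0=0.5, piA=0.5):
    # Bayes factor B_A0(x): marginal under the alternative over the null density.
    B = norm.pdf(x, loc=theta_A) / norm.pdf(x, loc=0.0)
    ratio = B * piA / pi0
    return 1.0 if ratio > 1 else (0.5 if ratio == 1 else 0.0)  # gamma(x) = 1/2 on the boundary

print([bayes_test(x) for x in (0.0, 1.0, 1.5)])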

22.2.1 Admissibility of Bayes tests

Lemma 20.3 (page 349) gives some sufficient conditions for a Bayes procedure φπ to be admissible, which apply here: (a) φπ is admissible among the tests that are Bayes wrt π; (b) φπ is the unique Bayes test (up to equivalence) wrt π; (c) the parameter space is finite or countable, and π places positive probability on each parameter value. Parts (d) and (e) require the risk to be continuous in θ, which is usually not true in hypothesis testing. For example, suppose X ∼ N(µ, 1) and we test H0 : µ ≤ 0 versus HA : µ > 0 using the test that rejects when X > zα. Then the risk at µ = 0 is exactly α, but for µ just a bit larger than zero, the risk is almost 1 − α. Thus the risk is discontinuous at µ = 0 (unless α = 1/2). See Figure 22.1.


[Figure 22.1 here: two panels plotting Eµ[φ] and R(µ ; φ) against µ ∈ [−3, 3].]

Figure 22.1: Testing H0 : µ ≤ 0 versus HA : µ > 0 based on X ∼ N(µ, 1). The test φ rejects when X > 1.28, and has level α = 0.10. The top graph is the power function, which is continuous. The bottom graph is the risk function, which inverts the power function when µ > 0. It is not continuous at µ = 0.

These parts of the lemma can be extended to hypothesis testing if Eθ[φ] is continuous in θ for any test φ. We decompose the parameter space into three pieces. Let T∗ be the border between the null and alternative spaces, formally, T∗ = closure(TA) ∩ closure(T0). It is the set of points θ∗ for which there are points in both the null and alternative spaces arbitrarily close to θ∗. We assume that T0 − T∗ and TA − T∗ are both open. Then if prior π has π(B) > 0 for any open set B ⊂ T0 − T∗ or B ⊂ TA − T∗, the Bayes test φπ wrt π is admissible. The proof is basically the same as in Lemma 20.3, but we also need to note that if a test φ is at least as good as φπ, then the two tests have the same risk on the border (at all θ ∈ T∗). For example, the test based on the Bayes factor in (15.43) is admissible, as are the tests in Exercises 15.7.6 and 15.7.11.

Return to condition (a), and let Dπ be the set of Bayes tests wrt π, so that they all satisfy (22.8) (with probability 1) for some γ(x). As in (21.127) of Exercise 21.8.17, the power of any test φ ∈ Dπ can be written
\[
E_\theta[\phi] = P_\theta[B(X) > 1] + E_\theta[\gamma(X)\mid B(X)=1]\,P_\theta[B(X)=1], \qquad B(x) = B_{A0}(x)\,\frac{\pi_A}{\pi_0}. \qquad (22.10)
\]
If Pθ[B(X) = 1] = 0 for all θ ∈ T0 ∪ TA, then all tests in Dπ have the same risk function. They are thus all admissible among the Bayes tests, hence admissible among all tests.

If Pθ[B(X) = 1] > 0 for some θ, then any differences in power are due to the γ(x) when B(x) = 1.


Consequently, a test is admissible in Dπ if and only if it is admissible in the conditional testing problem where γ(x) is the test, and the distribution under consideration is the conditional distribution of X | B(X) = 1. For example, if {x | B(x) = 1} consists of just the one point x0, the γ(x0) for the conditional problem is a constant, in which case any value 0 ≤ γ(x0) ≤ 1 yields an admissible test. See Exercise 22.8.2.

For another example, consider testing H0 : θ = 1 versus HA : θ > 1 based on X ∼ Uniform(0, θ). Let the prior put half the probability on θ = 1 and half on θ = 2. Then
\[
B(x) = \begin{cases} \infty & \text{if } x \in [1,2) \\ 1\ (= \tfrac{0}{0}) & \text{if } x \ge 2 \\ \tfrac12 & \text{if } x \in (0,1) \end{cases}. \qquad (22.11)
\]

Thus any Bayes test φ has φ(x) = 1 if 1 ≤ x < 2 and φ(x) = 0 if 0 < x < 1. The γ goes into effect if x ≥ 2. If θ = 1 then P1[B(X) = 1] = 0, so only power is relevant in comparing Bayes tests on their γ. But γ(x) ≡ 1 will maximize the conditional power, so the only admissible Bayes test is φ(x) = I[x ≥ 1].
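A small Python sketch (my own helper, not from the text) that evaluates B(x) in (22.11), including the 0/0 convention:

import numpy as np

def B(x):
    f1 = 1.0 if 0 < x < 1 else 0.0   # Uniform(0, 1) density at x
    f2 = 0.5 if 0 < x < 2 else 0.0   # Uniform(0, 2) density at x
    if f1 == 0.0 and f2 == 0.0:
        return 1.0                    # the 0/0 convention
    return np.inf if f1 == 0.0 else f2 / f1

print([B(x) for x in (0.5, 1.5, 2.5)])   # 0.5, inf, 1.0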

22.2.2 Level α Bayes tests

The Bayes test in (22.8) minimizes the Bayes risk among all tests, with no guarantee about level. An expression for the test that minimizes the Bayes risk among level α tests is not in general easy to find. But if the null is simple and we evaluate the risk only on θ ∈ TA, then the Neyman-Pearson lemma can again be utilized. The hypotheses are now

H0 : θ = θ0 versus HA : θ ∈ TA (22.12)

for some θ0 ∉ TA, and for given α ∈ (0, 1), we consider just the set of level α tests

Dα = {φ | Eθ0[φ] ≤ α}. (22.13)

The prior π is the same as above, but since the null has only one point, ρ0(θ0) = 1, the Bayes factor is

\[
B_{A0}(x) = \frac{\int_{\mathcal{T}_A} f(x\mid\theta_A)\,\rho_A(\theta_A)\,d\theta_A}{f(x\mid\theta_0)}. \qquad (22.14)
\]

If the Bayes test wrt π in (22.8) for some γ(x) has level α, then since it has the best Bayes risk among all tests, it must have the best Bayes risk among level α tests. Suppose its size is larger than α. Consider the test φα given by

\[
\phi_\alpha(x) = \begin{cases} 1 & \text{if } B_{A0}(x) > c_\alpha \\ \gamma_\alpha & \text{if } B_{A0}(x) = c_\alpha \\ 0 & \text{if } B_{A0}(x) < c_\alpha \end{cases}, \qquad (22.15)
\]

where cα and γα are chosen so that Eθ0[φα(X)] = α. Suppose φ is another level α test.

It must be that cα > π0/πA, because otherwise the Bayes test would be level α.


Using (22.7) and (22.14), we can show that the difference in Bayes risks between φ and φα is

\[
\begin{aligned}
R(\pi;\phi) - R(\pi;\phi_\alpha) &= \int_{\mathcal X} (\phi_\alpha(x) - \phi(x))\,(\pi_A B_{A0}(x) - \pi_0)\, f(x\mid\theta_0)\,dx \\
&= \int_{\mathcal X} (\phi_\alpha(x) - \phi(x))\,\big(\pi_A (B_{A0}(x) - c_\alpha) + \pi_A c_\alpha - \pi_0\big)\, f(x\mid\theta_0)\,dx \\
&= \pi_A \int_{\mathcal X} (\phi_\alpha(x) - \phi(x))\,(B_{A0}(x) - c_\alpha)\, f(x\mid\theta_0)\,dx + (\pi_A c_\alpha - \pi_0)\,(E_{\theta_0}[\phi_\alpha] - E_{\theta_0}[\phi]). \qquad (22.16)
\end{aligned}
\]

In the last line, the integral term is nonnegative by the definition (22.15), and the second term is nonnegative because πAcα − π0 > 0 and Eθ0[φα] ≥ Eθ0[φ] (φα has size α and φ has level α). Thus R(π ; φ) ≥ R(π ; φα), proving that φα is Bayes wrt π among level α tests. Exercise 22.8.3 shows that the Bayes test (or any test) is admissible among level α tests if and only if it is level α and admissible among all tests.
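To make the construction in (22.15) concrete, here is a Monte Carlo sketch that chooses cα as the upper-α quantile of BA0(X) under θ0. It assumes a N(0, 1) null and a point alternative at a hypothetical θA = 2, where BA0(X) is continuous, so no randomization on the boundary is needed; none of these choices come from the text.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
alpha, theta_A = 0.05, 2.0
x0 = rng.normal(size=200_000)                          # draws of X under theta_0
B = norm.pdf(x0, loc=theta_A) / norm.pdf(x0, loc=0.0)  # Bayes factor at each draw
c_alpha = np.quantile(B, 1 - alpha)                    # upper-alpha cutoff
print(c_alpha, np.mean(B > c_alpha))                   # attained size is close to alpha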

22.3 Necessary conditions for admissibility

As in estimation, under reasonable conditions, admissible tests are either Bayes tests or limits of Bayes tests, though not all Bayes tests are admissible. Here we extend the results in Section 20.8 to parameter spaces in (22.1) that are contained in R^K.

We first need to define what we mean by limits of tests, in this case weak limits.

Definition 22.1. Suppose φ1, φ2, . . . is a sequence of test functions, and φ is another test function. Then φn converges weakly to φ, written φn →w φ, if
\[
\int_{\mathcal X} \phi_n(x)\, f(x)\,dx \longrightarrow \int_{\mathcal X} \phi(x)\, f(x)\,dx \quad \text{for any } f \text{ such that } \int_{\mathcal X} |f(x)|\,dx < \infty. \qquad (22.17)
\]

This definition is apropos for models with pdfs. This convergence is weaker than pointwise, since φn(x) → φ(x) for all x implies that φn →w φ, but we can change φ at isolated points without affecting weak convergence. If X is countable, then the definition replaces the integral with summation, and weak convergence is equivalent to pointwise convergence.

Note that if we do have pdfs, then we can use fθ(x) for f to show

\[
\phi_n \to_w \phi \implies E_\theta[\phi_n(X)] \to E_\theta[\phi(X)] \text{ for all } \theta \in \mathcal{T} \implies R(\theta;\phi_n) \to R(\theta;\phi) \text{ for all } \theta \in \mathcal{T}. \qquad (22.18)
\]

As for Theorems 20.6 and 20.8 (pages 358 and 360), we need the risk set for any finite collection of θ's to be closed, convex, and bounded from below. The last requirement is automatic, since all risks are in [0,1]. The first two conditions will hold if the corresponding conditions hold for D:

1. If φ1, φ2 ∈ D then βφ1 + (1 − β)φ2 ∈ D for all β ∈ [0, 1].
2. If φ1, φ2, . . . ∈ D and φn →w φ then φ ∈ D. (22.19)

These conditions hold, e.g., if D consists of all tests, or of all level α tests. Here is the main result, a special case of the seminal results in Wald (1950), Section 3.6.


Theorem 22.2. Suppose D satisfies the conditions in (22.19), and Eθ[φ] is continuous in θ for any test φ. Then if φ0 ∈ D is admissible among the tests in D, there exists a sequence of Bayes tests φ1, φ2, . . . ∈ D and a test φ ∈ D such that

φn →w φ and R(θ ; φ) = R(θ ; φ0) for all θ ∈ T . (22.20)

Note that the theorem doesn't necessarily guarantee that a particular admissible test is a limit of Bayes tests, but rather that there is a limit of Bayes tests that has the exact same risk. So you will not lose anything if you consider only Bayes tests and their weak limits. We can also require that each πn in the theorem is concentrated on a finite set of points.

The proof relies on a couple of mathematical results that we won't prove here. First, we need that T0 and TA both have countable dense subsets. For a given set C, a countable set C∗ ⊂ C is dense in C if for any x ∈ C, there exists a sequence x1, x2, . . . ∈ C∗ such that xn → x. For example, the rational numbers are dense in the reals. As long as the parameter space is contained in R^K, this condition is satisfied.

Second, we need that the set of tests φ is compact under weak convergence. This condition means that for any sequence φ1, φ2, . . . of tests, there exists a subsequence φn1, φn2, . . . and test φ such that

φni −→w φ as i→ ∞. (22.21)

See Theorem A.5.1 in the Appendix of Lehmann and Romano (2005).

Proof. (Theorem 22.2) Suppose φ0 is admissible, and consider the new risk function

R∗(θ ; φ) = R(θ ; φ)− R(θ ; φ0). (22.22)

Let T∗0 = {θ01, θ02, . . .} and T∗A = {θA1, θA2, . . .} be countable dense subsets of T0 and TA, respectively, and set
\[
\mathcal{T}_{0n} = \{\theta_{01}, \ldots, \theta_{0n}\} \quad\text{and}\quad \mathcal{T}_{An} = \{\theta_{A1}, \ldots, \theta_{An}\}. \qquad (22.23)
\]

Consider the testing problem

H0 : θ ∈ T0n versus HA : θ ∈ TAn. (22.24)

The parameter set here is finite, hence we can use Theorem 20.6 on page 358 to show that there exists a test φn and prior πn on T0n ∪ TAn such that φn is Bayes wrt πn and minimax for R∗. Since φ0 has maximum risk 0 under R∗, φn can be no worse:

R∗(θ ; φn) ≤ 0 for all θ ∈ T0n ∪ TAn. (22.25)

Now by the compactness of the set of tests under weak convergence, there exist a subsequence φni and test φ such that (22.21) holds. Then (22.18) implies that

R∗(θ ; φni ) −→ R∗(θ ; φ) for all θ ∈ T . (22.26)

Take any θ ∈ T∗0 ∪ T∗A. Since it is a member of one of the sequences, there is some K such that θ ∈ T0n ∪ TAn for all n ≥ K. Thus by (22.25),

\[
R^*(\theta;\phi_{n_i}) \le 0 \text{ for all } n_i \ge K \;\Rightarrow\; R^*(\theta;\phi_{n_i}) \to R^*(\theta;\phi) \le 0 \text{ for all } \theta \in \mathcal{T}_0^* \cup \mathcal{T}_A^*. \qquad (22.27)
\]


Exercise 22.8.11 shows that since Eθ[φ] is continuous in θ and T∗0 ∪ T∗A is dense in T0 ∪ TA, we have

R∗(θ ; φ) ≤ 0 for all θ ∈ T0 ∪ TA, (22.28)

i.e., R(θ ; φ) ≤ R(θ ; φ0) for all θ ∈ T0 ∪ TA. (22.29)

Thus by the assumed admissibility of φ0, (22.20) holds.

If the model is complete as in Definition 19.2 (page 328), then R(θ ; φ) = R(θ ; φ0) for all θ means that Pθ[φ(X) = φ0(X)] = 1, so that any admissible test is a weak limit of Bayes tests.

22.4 Compact parameter spaces

The previous section showed that in many cases, all admissible tests must be Bayes or limits of Bayes, but that fact is not easy to apply directly. Here and in the next section, we look at some more explicit characterizations of admissibility.

We first look at testing with compact null and alternatives. That is, both T0 and TA are closed and bounded. This requirement is somewhat artificial, since it means there is a gap between the two spaces. For example, suppose we have a bivariate normal, X ∼ N(µ, I2). The hypotheses

H0 : µ = 0 versus HA : 1 ≤ ‖µ‖ ≤ 2 (22.30)

would fit into the framework. Replacing the alternative with 0 < ‖µ‖ ≤ 2 or 1 ≤ ‖µ‖ < ∞ or µ ≠ 0 would not fit. Then if the conditions of Theorem 22.2 hold, all admissible tests are Bayes (so we do not have to worry about limits).

Theorem 22.3. Suppose the conditions in (22.19) hold for D, and Eθ[φ] is continuous in θ. Then if φ0 is admissible among the tests in D, it is Bayes.

The proof uses the following lemmas.

Lemma 22.4. Let φ1, φ2, . . . be a sequence of tests, where

\[
\phi_n(x) = \begin{cases} 1 & \text{if } g_n(x) > 0 \\ \gamma_n(x) & \text{if } g_n(x) = 0 \\ 0 & \text{if } g_n(x) < 0 \end{cases}
\]

for some functions gn(x). Suppose there exists a function g(x) such that gn(x) → g(x) for each x ∈ X, and a test φ such that φn →w φ. Then (with probability one),

\[
\phi(x) = \begin{cases} 1 & \text{if } g(x) > 0 \\ \gamma(x) & \text{if } g(x) = 0 \\ 0 & \text{if } g(x) < 0 \end{cases}
\]

for some function γ(x).

In the lemma, γ is unspecified, so the lemma tells us nothing about φ when g(x) = 0.


Proof. Let f be any function with finite integral ($\int |f(x)|\,dx < \infty$ as in (22.17)). Then the function f(x)I[g(x) > 0] also has finite integral. Hence by Definition 22.1,
\[
\int_{\mathcal X} f(x)\,I[g(x) > 0]\,\phi_n(x)\,dx \longrightarrow \int_{\mathcal X} f(x)\,I[g(x) > 0]\,\phi(x)\,dx. \qquad (22.31)
\]

If g(x) > 0, then gn(x) > 0 for all sufficiently large n, hence φn(x) = 1 for all sufficiently large n. Thus φn(x) → 1 if g(x) > 0, and by dominated convergence,
\[
\int_{\mathcal X} f(x)\,I[g(x) > 0]\,\phi_n(x)\,dx \longrightarrow \int_{\mathcal X} f(x)\,I[g(x) > 0]\,(1)\,dx. \qquad (22.32)
\]

Thus the two limits in (22.31) and (22.32) must be equal for any such f, which means φ(x) = 1 if g(x) > 0 with probability one (i.e., Pθ[φ(X) = 1 | g(X) > 0] = 1 for all θ). Similarly, φ(x) = 0 for g(x) < 0 with probability one, which completes the proof.

Weak convergence for probability distributions, πn →w π, is the same as convergence in distribution of the corresponding random variables. The next result from measure theory is analogous to the weak compactness we saw for test functions in (22.21). See Section 5 on Prohorov's Theorem in Billingsley (1999) for a proof.

Lemma 22.5. Suppose π1, π2, . . . is a sequence of probability measures on the compact space T. Then there exists a subsequence πn1, πn2, . . ., and probability measure π on T, such that πni →w π.

Proof. (Theorem 22.3) Suppose φ0 is admissible. Theorem 22.2 shows that there is a sequence of Bayes tests such that φn →w φ, where φ and φ0 have the same risk function. Let πn be the prior for which φn is Bayes. Decompose πn into its components (ρn0, ρnA, πn0, πnA) as in the beginning of Section 22.2, so that from (22.8),

\[
\phi_{\pi_n}(x) = \begin{cases} 1 & \text{if } B_n(x) > 1 \\ \gamma_n(x) & \text{if } B_n(x) = 1 \\ 0 & \text{if } B_n(x) < 1 \end{cases}, \quad\text{where } B_n(x) = \frac{E_{\rho_{nA}}[f(x\mid\theta)]\,\pi_{nA}}{E_{\rho_{n0}}[f(x\mid\theta)]\,\pi_{n0}}. \qquad (22.33)
\]

By Lemma 22.5, there exists a subsequence and prior π such that πni →w π, where the components also converge,

\[
\rho_{n_i 0} \to_w \rho_0, \quad \rho_{n_i A} \to_w \rho_A, \quad \pi_{n_i 0} \to \pi_0, \quad \pi_{n_i A} \to \pi_A. \qquad (22.34)
\]

If for each x, f(x | θ) is bounded and continuous in θ, then the two expected values in (22.33) will converge to the corresponding ones with π, hence the entire ratio converges:

\[
B_n(x) \to \frac{E_{\rho_A}[f(x\mid\theta)]\,\pi_A}{E_{\rho_0}[f(x\mid\theta)]\,\pi_0} \equiv B(x). \qquad (22.35)
\]

Then Lemma 22.4 can be applied to show that

\[
\phi_{n_i}(x) \to_w \phi_\pi(x) = \begin{cases} 1 & \text{if } B(x) > 1 \\ \gamma(x) & \text{if } B(x) = 1 \\ 0 & \text{if } B(x) < 1 \end{cases} \qquad (22.36)
\]

for some γ(x). This φπ is the correct form to be Bayes wrt π. Above we have φn →w φ, hence φ and φπ must have the same risk function, which is also the same as that of the original test φ0. That is, φ0 is Bayes wrt π.


[Figure 22.2 here: the (x1, x2) plane showing the boundaries of the acceptance regions of φ1, φ2, and φ3.]

Figure 22.2: Testing H0 : µ = 0 versus HA : 1 ≤ ‖µ‖ ≤ 2 based on X ∼ N(µ, I2), with level α = 0.05. Test φ1 rejects the null when $\|x\|^2 > \chi^2_{2,\alpha}$, and φ2 rejects when $|2x_1 + x_2| > \sqrt{5}\,z_{\alpha/2}$. These two tests are Bayes and admissible. Test φ3 rejects the null when $\max\{|x_1|, |x_2|\} > z_{(1-\sqrt{1-\alpha})/2}$. It is not Bayes, hence not admissible.

The theorem will not necessarily work if the parameter spaces are not compact, since there may not be a limit of the πn's. For example, suppose TA = (0, ∞). Then the sequence of Uniform(0, n)'s will not have a limit, nor will the sequence of πn where πn[Θ = n] = 1. The parameter spaces also need to be separated. For example, if the null is {0} and the alternative is (0, 1], consider the sequence of priors with πn0 = πnA = 1/2, ρ0[Θ = 0] = 1 and ρn[Θ = 1/n] = 1. The limit ρn →w ρ is ρ[Θ = 0] = 1, same as ρ0, and not a probability on TA. Now B(x) = 1, and (22.36) has no information about the limit. But see Exercise 22.8.5. Also, Brown and Marden (1989) contains general results on admissibility when the null is simple.

Going back to the bivariate normal problem in (22.30), any admissible test is Bayes. Exercise 22.8.7 shows that the test that rejects the null when $\|x\|^2 > \chi^2_{2,\alpha}$ is Bayes and admissible, as are any tests that reject the null when |aX1 + bX2| > c for some constants a, b, c. However, the test that rejects when max{|x1|, |x2|} > c has a square as an acceptance region. It is not admissible, because it can be shown that any Bayes test here has to have "smooth" boundaries, not sharp corners.
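A simulation sketch (mine, not from the text) comparing the powers of the three tests in Figure 22.2 at level α = 0.05 and at the alternative µ = (1, 0), which lies in the compact alternative of (22.30):

import numpy as np
from scipy.stats import norm, chi2

rng = np.random.default_rng(1)
alpha, mu = 0.05, np.array([1.0, 0.0])
x = rng.normal(size=(200_000, 2)) + mu

c1 = chi2.ppf(1 - alpha, 2)                        # phi_1: ||x||^2 > chi^2_{2,alpha}
c2 = np.sqrt(5) * norm.ppf(1 - alpha / 2)          # phi_2: |2 x1 + x2| > sqrt(5) z_{alpha/2}
c3 = norm.ppf(1 - (1 - np.sqrt(1 - alpha)) / 2)    # phi_3: max(|x1|, |x2|) > z_{(1-sqrt(1-alpha))/2}

print("phi1 power:", np.mean((x ** 2).sum(axis=1) > c1))
print("phi2 power:", np.mean(np.abs(2 * x[:, 0] + x[:, 1]) > c2))
print("phi3 power:", np.mean(np.maximum(np.abs(x[:, 0]), np.abs(x[:, 1])) > c3))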


22.5 Convex acceptance regions

If all Bayes tests satisfy a certain property, and that property persists through limits, then the property is a necessary condition for a test to be admissible. For example, suppose the set of distributions under consideration is a one-dimensional exponential family distribution. By the discussion around (22.4), we do not lose anything by looking at just tests that are functions of the sufficient statistic. Thus we will assume x itself is the natural sufficient statistic, so that its density is f(x | θ) = a(x) exp(θx − ψ(θ)). We test the two-sided hypotheses H0 : θ = θ0 versus HA : θ ≠ θ0. A Bayes test wrt π is

\[
\phi_\pi(x) = \begin{cases} 1 & \text{if } B_\pi(x) > 1 \\ \gamma_\pi(x) & \text{if } B_\pi(x) = 1 \\ 0 & \text{if } B_\pi(x) < 1 \end{cases}, \qquad (22.37)
\]

where we can write

\[
B_\pi(x) = \frac{\pi_A}{\pi_0} \sum_{i=1}^K e^{x(\theta_i - \theta_0) - (\psi(\theta_i) - \psi(\theta_0))}\, \rho_A(\theta_i), \qquad (22.38)
\]

at least if π has pmf ρA on a finite set of θ's in TA. Similar to (21.94), Bπ(x) is convex as a function of x, hence the acceptance region of φπ is an interval (possibly infinite or empty):

\[
\phi_\pi(x) = \begin{cases} 1 & \text{if } x < a_\pi \text{ or } x > b_\pi \\ \gamma_\pi(x) & \text{if } x = a_\pi \text{ or } x = b_\pi \\ 0 & \text{if } a_\pi < x < b_\pi \end{cases} \qquad (22.39)
\]

for some aπ and bπ. If φn is a sequence of such Bayes tests with φn →w φ, then Exercise 22.8.10 shows that φ must have the same form (22.39) for some −∞ ≤ a ≤ b ≤ ∞, hence any admissible test must have that form.

Now suppose we have a p-dimensional exponential family for X with parameter space T, and for θ0 ∈ T test

H0 : θ = θ0 versus HA : θ ∈ T − {θ0}. (22.40)

Then any Bayes test wrt a π whose probability is on a finite set of θi's has the form (22.37) but with

\[
B(x) = \frac{\pi_A}{\pi_0} \sum_{i=1}^K e^{x\cdot(\theta_i - \theta_0) - (\psi(\theta_i) - \psi(\theta_0))}\, \rho_A(\theta_i). \qquad (22.41)
\]

Consider the set C = {x ∈ Rp | B(x) < 1}. (22.42)

It is convex by the convexity of B(x). Also, it can be shown that the analog of φπ in (22.39) (now with the vector x in place of the scalar x) is

\[
\phi_\pi(x) = \begin{cases} 1 & \text{if } x \notin \text{closure}(C) \\ \gamma_\pi(x) & \text{if } x \in \text{boundary}(C) \\ 0 & \text{if } x \in C \end{cases}. \qquad (22.43)
\]

(If C is empty, then let it equal the set with B(x) ≤ 1.) Note that if p = 1, C is an interval as in (22.39).


[Figure 22.3 here: the (x1, x2) plane with the cross-shaped acceptance region shaded.]

Figure 22.3: The test that rejects the null when min{|x1|, |x2|} > cα. The acceptance region is the shaded cross. The test is not admissible.

Suppose φ is admissible, so that there exists a sequence of Bayes tests whose weak limit is a test with the same risk function as φ. Each Bayes test has the form (22.43), i.e., its acceptance region is a convex set. Birnbaum (1955), with correction and extensions by Matthes and Truax (1967), has shown that a weak limit of such tests also is of that form. If we have completeness, then any admissible test has to have that form. Here is the formal statement of the result.

Theorem 22.6. Suppose X has a p-dimensional exponential family distribution, where X is the natural sufficient statistic and θ is the natural parameter. Suppose further that the model is complete for X. Then a necessary condition for a test φ to be admissible is that there exists a convex set C and function γ(x) such that

\[
\phi(x) = \begin{cases} 1 & \text{if } x \notin \text{closure}(C) \\ \gamma(x) & \text{if } x \in \text{boundary}(C) \\ 0 & \text{if } x \in \text{interior}(C) \end{cases}, \qquad (22.44)
\]

or at least equals that form with probability 1.

If the distributions have a pdf, then the probability of the boundary of a convex set is zero, hence we can drop the randomization part in (22.44), as well as for Bayes tests in (22.43). The latter fact means that all Bayes tests are essentially unique for their priors, hence admissible.

The three tests in Figure 22.2 all satisfy (22.44), but only φ1 and φ2 are admissible for the particular bivariate normal testing problem in (22.30). Another potentially reasonable test in this case is the one based on the minimum of the absolute xi's:

\[
\phi(x_1, x_2) = \begin{cases} 1 & \text{if } \min\{|x_1|, |x_2|\} > c_\alpha \\ 0 & \text{if } \min\{|x_1|, |x_2|\} \le c_\alpha \end{cases}. \qquad (22.45)
\]

See Figure 22.3. The test accepts within the shaded cross, which is not a convex set. Thus the test is inadmissible.
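The non-convexity is easy to probe numerically. The sketch below (my own, with α = 0.05) computes cα from the size requirement P0[min(|X1|, |X2|) > c] = (2Φ(−c))² = α (my calculation, not from the text) and exhibits two accepted points whose midpoint is rejected.

import numpy as np
from scipy.stats import norm

alpha = 0.05
c = -norm.ppf(np.sqrt(alpha) / 2)            # solves (2 Phi(-c))^2 = alpha
accept = lambda x: min(abs(x[0]), abs(x[1])) <= c
p, q = np.array([3.0, 0.0]), np.array([0.0, 3.0])
print(c, accept(p), accept(q), accept((p + q) / 2))   # True, True, False: region is not convex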


22.5.1 Admissible tests

Not all tests with convex acceptance regions are admissible in general. However, if the parameter space is big enough, then they usually are. We continue with the p-dimensional exponential family distribution, with X as the natural sufficient statistic and θ the natural parameter. Now suppose that T = Rp, and we test the hypotheses in (22.40), which are now H0 : θ = θ0 versus HA : θ ≠ θ0.

For a vector a ∈ Rp and constant c, let Ha,c be the closed half-space defined by

Ha,c = {x ∈ Rp | a · x ≤ c}. (22.46)

In the plane, Ha,c is the set of points on one side of a line, including the line. Any closed half-space is a convex set. It is not hard to show that the test that rejects the null when $x \in H^C_{a,c}$ is Bayes (see Exercise 22.8.6), hence likely admissible. (Recall the complement of a set A is Rp − A, denoted by $A^C$.) We can show a stronger result: A test that always rejects outside of the half-space has better power for some parameter values than one that does not so reject. Formally:

Lemma 22.7. For given a and c, suppose for test φ0 that φ0(x) = 1 for all $x \in H^C_{a,c}$, and for test φ that $P_\theta[\phi(X) < 1 \text{ and } X \in H^C_{a,c}] > 0$. Then

\[
e^{\psi(\theta_0 + \eta a) - \psi(\theta_0) - c\eta}\, E_{\theta_0 + \eta a}[\phi_0 - \phi] \longrightarrow \infty \quad\text{as } \eta \to \infty. \qquad (22.47)
\]

Thus for some θ′ ≠ θ0,
\[
E_{\theta'}[\phi_0] > E_{\theta'}[\phi]. \qquad (22.48)
\]

The test that rejects the null if and only if $x \in H^C_{a,c}$ is then admissible, since it has the smallest size of any test whose rejection region contains $H^C_{a,c}$.

Proof. Write

\[
\begin{aligned}
e^{\psi(\theta_0+\eta a) - \psi(\theta_0) - c\eta}\, E_{\theta_0+\eta a}[\phi_0 - \phi]
&= e^{\psi(\theta_0+\eta a) - \psi(\theta_0) - c\eta} \int_{\mathcal X} (\phi_0(x) - \phi(x))\, e^{(\theta_0+\eta a)\cdot x - \psi(\theta_0+\eta a)}\, a(x)\,dx \\
&= \int_{\mathcal X} (\phi_0(x) - \phi(x))\, e^{\eta(a\cdot x - c)}\, f(x\mid\theta_0)\,dx \\
&= \int_{H_{a,c}} (\phi_0(x) - \phi(x))\, e^{\eta(a\cdot x - c)}\, f(x\mid\theta_0)\,dx + \int_{H^C_{a,c}} (1 - \phi(x))\, e^{\eta(a\cdot x - c)}\, f(x\mid\theta_0)\,dx, \qquad (22.49)
\end{aligned}
\]

since φ0(x) = 1 for $x \in H^C_{a,c}$. For η > 0, the first integral in the last equality of (22.49) is bounded by ±1 since a · x ≤ c there. The exponential in the second integral goes to infinity as η → ∞, and since by assumption 1 − φ(X) > 0 on $H^C_{a,c}$ with positive probability (and is never negative), the integral goes to infinity, proving (22.47). Because the constant in front of the expectation in the first expression of (22.49) is positive, we have (22.48). In fact, there exists an η0 such that

Eθ0+ηa[φ0] > Eθ0+ηa[φ] for all η > η0. (22.50)

(Compare the result here to Exercise 21.8.24.)


If a test rejects the null hypothesis whenever x falls outside one of a given collection of half-spaces, then it has better power for some parameter values than any test that does not always reject at such x. The connection to tests with convex acceptance regions is based on the fact that a set is closed and convex (other than Rp) if and only if it is an intersection of closed half-spaces. That is the content of the next lemma, shown in Exercise 22.8.12.

Lemma 22.8. Suppose the set C ⊂ Rp, C ≠ Rp, is closed. Then it is convex if and only if there is a set of vectors a ∈ Rp and constants c such that
\[
C = \bigcap_{a,c} H_{a,c}. \qquad (22.51)
\]

Next is the main result of this section, due to Stein (1956a).

Theorem 22.9. Suppose C is a closed convex set. Then the test

\[
\phi_0(x) = \begin{cases} 1 & \text{if } x \notin C \\ 0 & \text{if } x \in C \end{cases} \qquad (22.52)
\]

is admissible.

Proof. Let φ be any test at least as good as φ0, i.e., R(θ ; φ) ≤ R(θ ; φ0) for all θ. By Lemma 22.8, there exists a set of half-spaces Ha,c such that (22.51) holds. Thus φ0(x) = 1 whenever x ∉ Ha,c for any of those half-spaces. Lemma 22.7 then implies that φ(x) must also be 1 (with probability one) whenever x ∉ Ha,c, or else φ0 would have better power somewhere. Thus φ(x) = 1 whenever φ0(x) = 1 (with probability one). Also, Pθ[φ(X) > 0 & φ0(X) = 0] = 0, since otherwise Eθ0[φ] > Eθ0[φ0]. Thus Pθ[φ(X) = φ0(X)] = 1, hence they have the same risk function, proving that φ0 is admissible.

This theorem implies that the three tests in Figure 22.2, which reject the null when ‖x‖2 > c, |ax1 + bx2| > c, and max{|x1|, |x2|} > c, respectively, are admissible for the current hypotheses. Recall that the third one is not admissible for the compact hypotheses in (22.30).

If the distributions of X have pdfs, then the boundary of any convex set has probability zero. Thus Theorems 22.6 and 22.9 combine to show that a test is admissible if and only if it is of the form (22.52) with probability one. In the discrete case, it can be tricky, since a test of the form (22.44) may not be admissible if the boundary of C has positive probability.

22.5.2 Monotone acceptance regions

If instead of a general alternative hypothesis, the alternative is one-sided for all θi, then it would seem reasonable that a good test would tend to reject for larger values of the components of X, but not smaller values. More precisely, suppose the hypotheses are

H0 : θ = 0 versus HA : θ ∈ TA = {θ ∈ T | θi ≥ 0 for each i} − {0}. (22.53)

Then the Bayes ratio Bπ(x) as in (22.38) is nondecreasing in each xi, since in the exponent we have ∑ xiθi with all θi ≥ 0. Consequently, if Bπ(x) < 1, so that the test accepts the null, then Bπ(y) < 1 for all y with yi ≤ xi, i = 1, . . . , p.


We can reverse the inequalities as well, so that if we reject at x, we reject at any y whose components are at least as large as x's. Assuming continuous random variables, so that the randomized parts of Bayes tests can be ignored, any Bayes test for (22.53) has the form

\[
\phi(x) = \begin{cases} 1 & \text{if } x \notin A \\ 0 & \text{if } x \in A \end{cases} \qquad (22.54)
\]

for some nonincreasing convex set A ⊂ Rp, where by nonincreasing we mean

x ∈ A =⇒ Lx ⊂ A, where Lx = {y | yi ≤ xi, i = 1, . . . , p}. (22.55)

(Compare Lx here to Ls in (20.65).)

Now suppose we have a sequence of Bayes tests φn, and another test φ such that φn →w φ. Eaton (1970) has shown that φ has the form (22.54), hence any admissible test must have that form, or be equal to a test of that form with probability one. Furthermore, if the overall parameter space T is unbounded in such a way that TA = {θ ∈ Rp | θi ≥ 0 for each i} − {0}, then an argument similar to that in the proof of Theorem 22.9 shows that all tests of the form (22.54) are admissible.

In the case X ∼ N(θ, Ip), the tests in Figure 22.2 are inadmissible for the hypotheses in (22.53) because the acceptance regions are not nonincreasing. Admissible tests include that with rejection region a1x1 + a2x2 > c where a1 > 0 and a2 > 0, and that with rejection region min{x1, x2} > c. See Exercise 22.8.13 for the likelihood ratio test.

22.6 Invariance

We have seen that in many testing problems, especially multiparameter ones, there is no uniquely best test. Admissibility can help, but there may be a large number of admissible tests, and it can be difficult to decide whether any particular test is admissible. We saw shift equivariance in Sections 19.6 and 19.7, where by restricting consideration to shift equivariant estimators, we could find an optimal estimator in certain models. A similar idea applies in hypothesis testing.

For example, in the Student's t test situation, X1, . . . , Xn are iid N(µ, σ2), and we test µ = 0 with σ2 unknown. Then the two parameter spaces are

T0 = {(0, σ2) | σ2 > 0} and TA = {(µ, σ2) | µ ≠ 0 and σ2 > 0}. (22.56)

Changing units of the data shouldn't affect the test. That is, if we reject the null when the data is measured in feet, we should also reject when the data is measured in centimeters. This problem is invariant under multiplication by a constant. That is, let G = {a ∈ R | a ≠ 0}, the nonzero reals. This is a group under multiplication. The action of a group element on the data is to multiply each xi by the element, which is written
\[
a \circ x = a x \quad \text{for } x \in \mathcal{X} \text{ and } a \in G. \qquad (22.57)
\]

For given a ≠ 0, set X∗i = aXi. Then the transformed problem has X∗1, . . . , X∗n iid N(µ∗, σ∗2), where µ∗ = aµ and σ∗2 = a2σ2. The transformed parameter spaces are

T∗0 = {(0, σ∗2) | σ∗2 > 0} and T∗A = {(µ∗, σ∗2) | µ∗ ≠ 0 and σ∗2 > 0}. (22.58)

Those are the exact same spaces as in (22.56), and the data has the exact same distribution except for asterisks in the notation. That is, these two testing problems are equivalent.


The thinking is that therefore, any test based on X should have the same outcome as a test based on X∗. Such a test function φ is called invariant under G, meaning

φ(x) = φ(ax) for all x ∈ X and a ∈ G. (22.59)

The test which rejects the null when $\bar X > c$ is not invariant, nor is the one-sided t test, which rejects when $T = \sqrt{n}\,\bar x/s_* > t_{n-1,\alpha}$. The two-sided t test is invariant:

\[
|T^*| = \frac{\sqrt{n}\,|\bar x^*|}{s^*_*} = \frac{\sqrt{n}\,|a|\,|\bar x|}{|a|\,s_*} = \frac{\sqrt{n}\,|\bar x|}{s_*} = |T|. \qquad (22.60)
\]

We will later see that the two-sided t test is the uniformly most powerful invariant level α test.
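A quick numerical check (my own sketch) of the invariance in (22.60): rescaling the data by any nonzero a leaves |T| unchanged.

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=0.3, scale=2.0, size=15)

def abs_t(x):
    n = len(x)
    return np.sqrt(n) * abs(x.mean()) / x.std(ddof=1)   # |T| = sqrt(n) |xbar| / s_*

print(abs_t(x), abs_t(-4.2 * x))    # the two values agree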

22.6.1 Formal definition

In Section 17.4 we saw algebraic groups of matrices. Here we generalize slightly to affine groups. Recall affine transformations from Sections 2.2.2 and 2.3, where an affine transformation of a vector x is Ax + b for some matrix A and vector b. A set G of affine transformations of dimension p is a subset of Ap × Rp, where Ap is the set of p × p invertible matrices. For the set G to be a group, it has to have an operation ◦ that combines two elements such that the following properties hold:

1. Closure: g1, g2 ∈ G ⇒ g1 ◦ g2 ∈ G;
2. Associativity: g1, g2, g3 ∈ G ⇒ g1 ◦ (g2 ◦ g3) = (g1 ◦ g2) ◦ g3;
3. Identity: There exists e ∈ G such that g ∈ G ⇒ g ◦ e = e ◦ g = g;
4. Inverse: For each g ∈ G there exists a g−1 ∈ G such that g ◦ g−1 = g−1 ◦ g = e. (22.61)

Note that we are using the symbol "◦" to represent the action of the group on the sample space as well as the group composition. Which is meant should be clear from the context. For affine transformations, we want to define the composition so that it will conform to taking an affine transformation of an affine transformation. That is, if (A1, b1) and (A2, b2) are two affine transformations, then we want the combined transformation (A, b) = (A1, b1) ◦ (A2, b2) to satisfy

\[
Ax + b = A_1(A_2 x + b_2) + b_1 \;\Rightarrow\; A = A_1 A_2 \text{ and } b = A_1 b_2 + b_1 \;\Rightarrow\; (A_1, b_1) \circ (A_2, b_2) = (A_1 A_2,\, A_1 b_2 + b_1). \qquad (22.62)
\]

The groups we consider here then have the composition (22.62). For each, we must also check that the conditions in (22.61) hold.

Let G be the invariance group, and g ◦ x be the action on x for g ∈ G. An action has to conform to the group's structure: If g1, g2 ∈ G, then (g1 ◦ g2) ◦ x = g1 ◦ (g2 ◦ x), and if e ∈ G is the identity element of the group, then e ◦ x = x. You might try checking these conditions on the action defined in (22.57).

A model is invariant under G if for each θ ∈ T and g ∈ G, there exists a parameter value θ∗ ∈ T such that

X ∼ Pθ =⇒ (g ◦ X) ∼ Pθ∗. (22.63)

We will denote θ∗ by g ◦ θ, the action of the group on the parameter space, though technically the ◦'s for the sample space and parameter space may not be the same.


The action in the t test example of (22.56) is a ◦ (µ, σ2) = (aµ, a2σ2). The testing problem is invariant if both hypotheses' parameter spaces are invariant, so that for any g ∈ G,

θ ∈ T0 ⇒ g ◦ θ ∈ T0 and θ ∈ TA ⇒ g ◦ θ ∈ TA. (22.64)

22.6.2 Reducing by invariance

Just below (22.4), we introduced the notion of "reducing by sufficiency," which simplifies the problem by letting us focus on just the sufficient statistics. Similarly, reducing by invariance simplifies the problem by letting us focus on just invariant tests. The key is to find the maximal invariant statistic, which is an invariant statistic W = w(X) such that any invariant function of x is a function of just w(x).

The standard method for showing that a candidate function w is indeed maximal invariant involves two steps:

1. Show that w is invariant: w(g ◦ x) = w(x) for all x ∈ X, g ∈ G;

2. Show that for each x ∈ X, there exists a gx ∈ G such that gx ◦ x is a function of just w(x).

To illustrate, return to the t test example, where X1, . . . , Xn are iid N(µ, σ2), and we test µ = 0 versus µ ≠ 0, so that the parameter spaces are as in (22.56). We could try to find a maximal invariant for X, but instead we first reduce by sufficiency, to $(\bar X, S^2_*)$, the sample mean and variance (with n − 1 in the denominator). The action of the group G = {a ∈ R | a ≠ 0} from (22.57) on the sufficient statistic is

\[
a \circ (\bar x, s^2_*) = (a\bar x,\, a^2 s^2_*), \qquad (22.65)
\]

the same as the action on (µ, σ2). There are a number of equivalent ways to express the maximal invariant (any one-to-one function of a maximal invariant is also maximal invariant). Here we try $w(\bar x, s^2_*) = \bar x^2/s^2_*$. The two steps:

1. $w(a \circ (\bar x, s^2_*)) = w(a\bar x, a^2 s^2_*) = (a\bar x)^2/(a^2 s^2_*) = \bar x^2/s^2_* = w(\bar x, s^2_*)$. ✓
2. Let $a_{(\bar x, s^2_*)} = \mathrm{Sign}(\bar x)/s_*$. Then $a_{(\bar x, s^2_*)} \circ (\bar x, s^2_*) = (|\bar x|/s_*,\, 1) = (\sqrt{w(\bar x, s^2_*)},\, 1)$, a function of just $w(\bar x, s^2_*)$. ✓

(If $\bar x = 0$, take the sign to be 1, not 0.) Thus $\bar x^2/s^2_*$ is a maximal invariant statistic, as is the absolute value of the t statistic in (22.60).

The invariance-reduced problem is based on the random variable W = w(X), and still tests the same hypotheses. But the parameter can also be simplified (usually), by finding the maximal invariant parameter ∆. It is defined the same as for the statistic, but with X replaced by θ. In the t test example, the maximal invariant parameter can be taken to be ∆ = µ2/σ2. The distribution of the maximal invariant statistic depends on θ only through the maximal invariant parameter. The two parameter spaces for the hypotheses can be expressed through the latter, hence we have

still tests the same hypotheses. But the parameter can also be simplified (usually), byfinding the maximal invariant parameter ∆. It is defined the same as for the statistic,but with X replaced by θ. In the t test example, the maximal invariant parameter canbe taken to be ∆ = µ2/σ2. The distribution of the maximal invariant statistic dependson θ only through the maximal invariant parameter. The two parameter spaces forthe hypotheses can be expressed through the latter, hence we have

H0 : ∆ ∈ D0 versus HA : ∆ ∈ DA, based on W ∼ P∗∆, (22.66)

where P∗∆ is the distribution of W. For the t test, µ = 0 if and only if ∆ = 0, hence thehypotheses are simply H0 : ∆ = 0 versus HA : ∆ > 0, with no nuisance parameters.


22.7 UMP invariant tests

Since an invariant test is a function of just the maximal invariant statistic, the risk function of an invariant test in the original testing problem is exactly matched by a test function in the invariance-reduced problem, and vice versa. The implication is that when decision-theoretically evaluating invariant tests, we need to look at only the invariance-reduced problem. Thus a test is uniformly most powerful invariant level α in the original problem if and only if its corresponding test in the reduced problem is UMP level α in the reduced problem. Similarly for admissibility. In this section we provide some examples where there are UMP invariant tests. Often, especially in multivariate analysis, there is no UMP test even after reducing by invariance.

22.7.1 Multivariate normal mean

Take X ∼ N(µ, Ip), and test H0 : µ = 0 versus HA : µ ∈ Rp − {0}. This problem is invariant under the group Op of p × p orthogonal matrices. Here the "b" part of the affine transformation is always 0, hence we omit it. The action is multiplication, and is the same on X as on µ: For Γ ∈ Op,

Γ ◦ X = ΓX and Γ ◦ µ = Γµ. (22.67)

The maximal invariant statistic is w(x) = ‖x‖2, since ‖Γx‖ = ‖x‖ for orthogonal Γ, and as in (7.72), if x ≠ 0, we can find an orthogonal matrix Γx whose first row is x′/‖x‖, so that $\Gamma_x x = (\|x\|, \mathbf{0}_{p-1}')'$, a function of just w(x). Also, the maximal invariant parameter is ∆ = ‖µ‖2. The hypotheses become H0 : ∆ = 0 versus HA : ∆ > 0 again.

From Definition 7.7 on page 113 we see that W = ‖X‖2 has a noncentral chi-squared distribution, $W \sim \chi^2_p(\Delta)$. Exercise 21.8.13 shows that the noncentral chi-square has strict monotone likelihood ratio wrt W and ∆. Thus the UMP level α test for this reduced problem rejects the null when $W > \chi^2_{p,\alpha}$, which is then the UMP invariant level α test for the original problem.
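The power of this UMP invariant test is available directly from the noncentral chi-square distribution. The sketch below uses SciPy, whose noncentrality parameter matches ∆ = ‖µ‖²; the values of p and ∆ are illustrative only.

from scipy.stats import chi2, ncx2

p, alpha = 3, 0.05
cutoff = chi2.ppf(1 - alpha, p)            # chi^2_{p, alpha}
for Delta in (0.0, 1.0, 4.0, 9.0):
    power = chi2.sf(cutoff, p) if Delta == 0 else ncx2.sf(cutoff, p, Delta)
    print(Delta, round(power, 3))          # power is alpha at Delta = 0, increasing in Delta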

Now for p1 and p2 positive integers, p1 + p2 = p, partition X and µ as

\[
X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \quad\text{and}\quad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad (22.68)
\]

where X1 and µ1 are p1 × 1, and X2 and µ2 are p2 × 1. Consider testing

H0 : µ2 = 0, µ1 ∈ Rp1 versus HA : µ2 ∈ Rp2 − {0}, µ1 ∈ Rp1. (22.69)

That is, we test just µ2 = 0 versus µ2 ≠ 0, and µ1 is a nuisance parameter. The problem is not invariant under Op as in (22.67), but we can multiply X2 by the smaller group Op2. Also, adding a constant to an element of X1 adds the constant to an element of µ1, which respects the two hypotheses. Thus we can take the group to be

\[
G = \left\{ \left( \begin{pmatrix} I_{p_1} & 0 \\ 0 & \Gamma_2 \end{pmatrix}, \begin{pmatrix} b_1 \\ 0 \end{pmatrix} \right) \;\middle|\; \Gamma_2 \in \mathcal{O}_{p_2},\; b_1 \in \mathbb{R}^{p_1} \right\}. \qquad (22.70)
\]

Writing group elements more compactly as g = (Γ2, b1), the action is

\[
(b_1, \Gamma_2) \circ X = \begin{pmatrix} X_1 + b_1 \\ \Gamma_2 X_2 \end{pmatrix}. \qquad (22.71)
\]


Exercise 22.8.18 shows that the maximal invariant statistic and parameter are, respectively, W2 = ‖X2‖2 and ∆2 = ‖µ2‖2. Now $W_2 \sim \chi^2_{p_2}(\Delta_2)$, hence as above the UMP level α invariant test rejects the null when $W_2 > \chi^2_{p_2,\alpha}$.

22.7.2 Two-sided t test

Let X1, . . . , Xn be iid N(µ, σ2), and test µ = 0 versus µ ≠ 0 with σ2 > 0 as a nuisance parameter, i.e., the parameter spaces are as in (22.56). We saw in Section 22.6.2 that the maximal invariant statistic is $\bar x^2/s^2_*$ and the maximal invariant parameter is ∆ = µ2/σ2. Exercise 7.8.23 defined the noncentral F statistic in (7.132) as (U/ν)/(V/µ), where U and V are independent, $U \sim \chi^2_\nu(\Delta)$ (noncentral), and $V \sim \chi^2_\mu$ (central).

Here, we know that $\bar X$ and $S^2_*$ are independent,
\[
\bar X \sim N(\mu, \sigma^2/n) \;\Rightarrow\; \sqrt{n}\,\bar X/\sigma \sim N(\sqrt{n}\,\mu/\sigma, 1) \;\Rightarrow\; n\bar X^2/\sigma^2 \sim \chi^2_1(n\Delta), \qquad (22.72)
\]
and $(n-1)S^2_*/\sigma^2 \sim \chi^2_{n-1}$. Thus

\[
\frac{n\bar X^2}{S^2_*} = T^2 \sim F_{1,n-1}(n\Delta), \qquad (22.73)
\]
where T is the usual Student's t from (22.60).

The noncentral F has monotone likelihood ratio (see Exercise 21.8.13), hence the UMP invariant level α test rejects the null when $T^2 > F_{1,n-1,\alpha}$. Equivalently, since $t^2_\nu = F_{1,\nu}$, it rejects when $|T| > t_{n-1,\alpha/2}$, so the two-sided t test is the UMP invariant test.
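The power again comes from a noncentral distribution, this time the noncentral F in (22.73). A short sketch, assuming SciPy's parametrization in which the noncentrality of T² is n∆; the values of n and ∆ are just for illustration.

from scipy.stats import f, ncf

n, alpha = 20, 0.05
cutoff = f.ppf(1 - alpha, 1, n - 1)           # equals t_{n-1, alpha/2}^2
for Delta in (0.0, 0.1, 0.5, 1.0):            # Delta = mu^2 / sigma^2
    power = alpha if Delta == 0 else ncf.sf(cutoff, 1, n - 1, n * Delta)
    print(Delta, round(power, 3))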

22.7.3 Linear regression

Section 15.2.1 presents the F test in linear regression. Here, Y ∼ N(xβ, σ2In), and we partition β as in (22.68), so that β′ = (β′1, β′2), where β1 is p1 × 1, β2 is p2 × 1, and p = p1 + p2. We test whether β2 = 0:
\[
H_0: \beta_2 = 0,\ \beta_1 \in \mathbb{R}^{p_1},\ \sigma^2 > 0 \quad\text{versus}\quad H_A: \beta_2 \in \mathbb{R}^{p_2} - \{0\},\ \beta_1 \in \mathbb{R}^{p_1},\ \sigma^2 > 0. \qquad (22.74)
\]

We assume that x′x is invertible.

The invariance is not obvious, so we take a few preliminary steps. First, we can assume that x1 and x2 are orthogonal, i.e., x′1x2 = 0. If not, we can rewrite the model as we did in (16.19). (We'll leave off the asterisks.) Next, we reduce the problem by sufficiency. By Exercise 13.8.19, $(\hat\beta, SS_e)$ is the sufficient statistic, where $\hat\beta = (x'x)^{-1}x'Y$ and $SS_e = \|Y - x\hat\beta\|^2$. These two elements are independent, with $SS_e \sim \sigma^2\chi^2_{n-p}$ and
\[
\begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \end{pmatrix} \sim N\left( \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix},\; \sigma^2 \begin{pmatrix} (x_1'x_1)^{-1} & 0 \\ 0 & (x_2'x_2)^{-1} \end{pmatrix} \right). \qquad (22.75)
\]

See Theorem 12.1 on page 183, where the zeros in the covariance matrix are due to x′1x2 = 0.


Now (22.74) is invariant under adding a constant vector to β1, and multiplying Y by a scalar. But there is also orthogonal invariance that is hidden. Let B be the symmetric square root of x′2x2, and set
\[
\hat\beta_2^* = B\hat\beta_2 \sim N(\beta_2^*,\, \sigma^2 I_{p_2}), \qquad \beta_2^* = B\beta_2. \qquad (22.76)
\]

Now we have reexpressed the data, but have not lost anything since $(\hat\beta_1, \hat\beta_2, SS_e)$ is in one-to-one correspondence with $(\hat\beta_1, \hat\beta_2^*, SS_e)$. The hypotheses are the same as in (22.74) with β∗2 in place of β2. Because the covariance of $\hat\beta_2^*$ is σ2Ip2, we can multiply it by an orthogonal matrix without changing the covariance. Thus the invariance group is similar to that in (22.70), but includes the multiplier a:

\[
G = \left\{ \left( a\begin{pmatrix} I_{p_1} & 0 \\ 0 & \Gamma_2 \end{pmatrix}, \begin{pmatrix} b_1 \\ 0 \end{pmatrix} \right) \;\middle|\; \Gamma_2 \in \mathcal{O}_{p_2},\; a \in (0,\infty),\; b_1 \in \mathbb{R}^{p_1} \right\}. \qquad (22.77)
\]

For (a, Γ2, b1) ∈ G, the action is

\[
(a, \Gamma_2, b_1) \circ (\hat\beta_1, \hat\beta_2^*, SS_e) = (a\hat\beta_1 + b_1,\; a\Gamma_2\hat\beta_2^*,\; a^2 SS_e). \qquad (22.78)
\]

To find the maximal invariant statistic, we basically combine the ideas in Sections 22.7.1 and 22.7.2: Take $a = 1/\sqrt{SS_e}$, $b_1 = -\hat\beta_1/\sqrt{SS_e}$, and $\Gamma_2$ so that $\Gamma_2\hat\beta_2^* = (\|\hat\beta_2^*\|, \mathbf{0}_{p_2-1}')'$. Exercise 22.8.19 shows that the maximal invariant statistic is $W = \|\hat\beta_2^*\|^2/SS_e$, or, equivalently, the $F = (n-p)W/p_2$ in (15.24).

The maximal invariant parameter is $\Delta = \|\beta_2^*\|^2/\sigma^2$, and similar to (22.73), $F \sim F_{p_2,\,n-p}(\Delta)$. Since we are testing ∆ = 0 versus ∆ > 0, monotone likelihood ratio again proves that the F test is the UMP invariant level α test.
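A small simulated example (my own sketch, not from the text) of the maximal invariant W and the resulting F statistic, with x1 and x2 made orthogonal as assumed above; the dimensions and the generated data are arbitrary.

import numpy as np

rng = np.random.default_rng(3)
n, p1, p2 = 50, 2, 3
p = p1 + p2
x1 = rng.normal(size=(n, p1))
x2 = rng.normal(size=(n, p2))
x2 = x2 - x1 @ np.linalg.lstsq(x1, x2, rcond=None)[0]   # enforce x1'x2 = 0 as assumed in the text
x = np.hstack([x1, x2])
y = x1 @ np.ones(p1) + rng.normal(size=n)               # beta_2 = 0, so the null is true here

betahat = np.linalg.lstsq(x, y, rcond=None)[0]
sse = np.sum((y - x @ betahat) ** 2)                    # SS_e
# ||B betahat_2||^2 = betahat_2' (x2'x2) betahat_2 since B^2 = x2'x2.
w = betahat[p1:] @ (x2.T @ x2) @ betahat[p1:] / sse     # maximal invariant W
F = (n - p) * w / p2
print(F)   # compare to an F_{p2, n-p} upper quantile, e.g. scipy.stats.f.ppf(0.95, p2, n - p)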

22.8 Exercises

Exercise 22.8.1. Here the null hypothesis is H0 : θ = θ0. Assume that Eθ[φ] is continuous in θ for every test φ. (a) Take the one-sided alternative HA : θ > θ0. Show that if there is a UMP level α test, then it is admissible. (b) For the one-sided alternative, show that if there is a unique (up to equivalence) LMP level α test, it is admissible. (By "unique up to equivalence" is meant that if φ and ψ are both LMP level α, then their risks are equal for all θ.) (c) Now take the two-sided alternative HA : θ ≠ θ0. Show that if there is a UMP unbiased level α test, then it is admissible.

Exercise 22.8.2. Suppose we test H0 : θ = 0 versus HA : θ ≠ 0 based on X, which is a constant x0 no matter what the value of θ. Show that any test φ(x0) (with 0 ≤ φ(x0) ≤ 1) is admissible.

Exercise 22.8.3. Let Dα be the set of all level α tests, and D be the set of all tests. (a) Argue that if φ is admissible among all tests, and φ ∈ Dα, then φ is admissible among tests in Dα. (b) Suppose φ ∈ Dα and is admissible among tests in Dα. Show that it is admissible among all tests D. [Hint: Note that if a test dominates φ, it must also be level α.]

Exercise 22.8.4. This exercise is to show the two-sided t test is Bayes and admissible. We have X1, . . . , Xn iid N(µ, σ2), n ≥ 2, and test H0 : µ = 0, σ2 > 0 versus HA : µ ≠ 0, σ2 > 0. The prior has been cleverly chosen by Kiefer and Schwartz (1965).


We parametrize using τ ∈ R, setting σ2 = 1/(1 + τ2). Under the null, µ = 0, but under the alternative set µ = τ/(1 + τ2). Define the pdfs ρ0(τ) and ρA(τ) by

\[
\rho_0(\tau) = c_0\, \frac{1}{(1+\tau^2)^{n/2}} \quad\text{and}\quad \rho_A(\tau) = c_A\, \frac{1}{(1+\tau^2)^{n/2}}\, e^{\frac{1}{2}\frac{n\tau^2}{1+\tau^2}}, \qquad (22.79)
\]

where c0 and cA are constants so that the pdfs integrate to 1. (a) Show that if τ has the null pdf ρ0, then $\sqrt{n-1}\,\tau \sim t_{n-1}$, Student's t. Thus $c_0 = \Gamma(n/2)/(\Gamma((n-1)/2)\sqrt{\pi})$. (b) Show that $\int_{-\infty}^{\infty} (\rho_A(\tau)/c_A)\,d\tau < \infty$, so that ρA is a legitimate pdf. [Extra credit: Find cA explicitly.] [Hint: Make the transformation u = 1/(1 + τ2). For ρA, expand the exponent and find cA in terms of a confluent hypergeometric function, 1F1, from Exercise 7.8.21.] (c) Let f0(x | τ) be the pdf of X under the null written in terms of τ. Show that

\[
\int_{-\infty}^{\infty} f_0(x\mid\tau)\,\rho_0(\tau)\,d\tau = c_0^*\, e^{-\frac12\sum x_i^2} \int_{-\infty}^{\infty} e^{-\frac12 \tau^2 \sum x_i^2}\,d\tau = c_0^{**}\, e^{-\frac12\sum x_i^2}\, \frac{1}{\sqrt{\sum x_i^2}} \qquad (22.80)
\]

for some constants $c_0^*$ and $c_0^{**}$. [Hint: The integrand in the second expression looks like a $N(0, 1/\sum x_i^2)$ pdf for τ, without the constant.] (d) Now let fA(x | τ) be the pdf of X under the alternative, and show that

\[
\begin{aligned}
\int_{-\infty}^{\infty} f_A(x\mid\tau)\,\rho_A(\tau)\,d\tau &= c_A^*\, e^{-\frac12\sum x_i^2} \int_{-\infty}^{\infty} e^{-\frac12\tau^2\sum x_i^2 + \tau\sum x_i}\,d\tau \\
&= c_A^{**}\, e^{-\frac12\sum x_i^2}\, e^{\frac12(\sum x_i)^2/\sum x_i^2}\, \frac{1}{\sqrt{\sum x_i^2}} \qquad (22.81)
\end{aligned}
\]

for some $c_A^*$ and $c_A^{**}$. [Hint: Complete the square in the exponent with respect to τ, then note that the integral looks like a $N(\sum x_i/\sum x_i^2,\, 1/\sum x_i^2)$ pdf.] (e) Show that the Bayes factor BA0(x) is a strictly increasing function of $(\sum x_i)^2/(\sum x_i^2)$, which is a strictly increasing function of T2, where $T = \sqrt{n}\,\bar X/S_*$, $S^2_* = \sum(X_i - \bar X)^2/(n-1)$. Thus the Bayes test is the two-sided t test. (f) Show that this test is admissible.

Exercise 22.8.5. Suppose X has density f(x | θ), and consider testing H0 : θ = 0 versus HA : θ > 0. Assume that the density is differentiable at θ = 0. Define prior πn to have
\[
\pi_n[\Theta = 0] = \frac{n+c}{2n+c}, \qquad \pi_n[\Theta = \tfrac{1}{n}] = \frac{n}{2n+c} \qquad (22.82)
\]

φπn (x) =

1 if f (x | 1/n)

f (x | 0) > 1 + cn

γn(x) if f (x | 1/n)f (x | 0) = 1 + c

n

0 if f (x | 1/n)f (x | 0) < 1 + c

n

. (22.83)

(b) Let φ be the limit of the Bayes test, φπn →w φ. (By (22.21) we know there is such a limit, at least on a subsequence. Assume we are on that subsequence.) Apply Lemma 22.4 with gn(x) = f(x | 1/n)/f(x | 0) − 1 − c/n. What can you say about φ?


(c) Now rewrite the equations in (22.83) so that

\[
g_n(x) = n\left( \frac{f(x\mid 1/n)}{f(x\mid 0)} - 1 \right) - c. \qquad (22.84)
\]

Show that as n→ ∞,

\[
g_n(x) \longrightarrow l'(0\,;x) - c = \frac{\partial}{\partial\theta}\log(f_\theta(x))\bigg|_{\theta=0} - c. \qquad (22.85)
\]

What can you say about φ now?

Exercise 22.8.6. Let X have a p-dimensional exponential family distribution, where X itself is the natural sufficient statistic, and θ is the natural parameter. We test H0 : θ = 0 versus HA : θ ∈ TA. (a) For fixed a ∈ TA and c ∈ R, show that the test
\[
\phi(x) = \begin{cases} 1 & \text{if } x\cdot a > c \\ \gamma(x) & \text{if } x\cdot a = c \\ 0 & \text{if } x\cdot a < c \end{cases} \qquad (22.86)
\]

is Bayes wrt some prior π. [Hint: Take π[θ = 0] = π0 and π[θ = a] = 1 − π0.] (b) Show that the test φ is Bayes for any a such that a = kb for some b ∈ TA and k > 0.

Exercise 22.8.7. Suppose X ∼ N(µ, I2) and we test (22.30), i.e., H0 : µ = 0 versus HA : 1 ≤ ‖µ‖ ≤ 2. Let a and b be constants so that 1 ≤ ‖(a, b)‖ ≤ 2. Define the prior π by

\[
\pi[\mu = 0] = \pi_0, \qquad \pi[\mu = (a,b)] = \pi[\mu = -(a,b)] = \tfrac12(1 - \pi_0). \qquad (22.87)
\]

(a) Show that the Bayes test wrt π can be written

\[
\phi_\pi(x) = \begin{cases} 1 & \text{if } g(x) > d \\ \gamma(x) & \text{if } g(x) = d \\ 0 & \text{if } g(x) < d \end{cases}, \qquad g(x) = e^{ax_1 + bx_2} + e^{-(ax_1+bx_2)}, \qquad (22.88)
\]

and d is a constant. (b) Why is φπ admissible? (c) Show that there exists a c such that φπ rejects the null when |ax1 + bx2| > c. [Hint: Letting u = ax1 + bx2, show that exp(u) + exp(−u) is strictly convex and symmetric in u, hence the set where g(x) > d has the form |u| > c.] (d) Now suppose a and b are any constants, not both zero, and c is any positive constant. Show that the test that rejects the null when |ax1 + bx2| > c is still Bayes and admissible.

Exercise 22.8.8. Continue with the testing problem in Exercise 22.8.7. For constant ρ ∈ [1, 2], let π be the prior with π[µ = 0] = π0 (so π[µ ≠ 0] = 1 − π0), and conditionally on HA being true, µ has a uniform distribution on the circle {µ | ‖µ‖ = ρ}. (a) Similar to what we saw in (1.23), we can represent the prior under the alternative by setting µ = ρ(cos(Θ), sin(Θ)), where Θ ∼ Uniform[0, 2π). Show that the Bayes factor can then be written

\[
B_{A0}(x) = e^{-\frac12\rho^2}\, \frac{1}{2\pi} \int_0^{2\pi} e^{\rho(x_1\cos(\theta) + x_2\sin(\theta))}\,d\theta. \qquad (22.89)
\]

22.8. Exercises 417

(b) Show that with r = ‖x‖,

12π

∫ 2π

0eρ(x1 cos(θ)+x2 sin(θ))dθ =

12π

∫ 2π

0eρr cos(θ)dθ

=∞

∑k=0

ckr2kρ2k

(2k)!, where ck =

12π

∫ 2π

0cos(θ)2kdθ.

(22.90)

[Hint: For the first equality, let θx be the angle such that x = r(cos(θx), sin(θx)).Then use the double angle formulas to show that we can replace the exponent withρr cos(θ − θx). The second expression then follows by changing variables θ to θ + θx.The third expression arises by expanding the e in a Taylor series, and noting that forodd powers l, the the integral of cos(θ)l is zero.] (c) Noting that ck > 0 in (22.90),show that the Bayes factor is strictly increasing in r, and that the Bayes test wrt π hasrejection region r > c for some constant c. (d) Is this Bayes test admissible?
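
As an aside (not needed for the exercise), the circular average in (22.89)-(22.90) equals the modified Bessel function $I_0(\rho r)$, a standard identity not invoked in the text; it gives a quick numerical check that the Bayes factor depends on $x$ only through $r$ and increases with it. The value $\rho = 1.5$ below is arbitrary:

rho <- 1.5
avg_exp <- function(r)       # (1/(2*pi)) * integral of exp(rho * r * cos(theta))
  integrate(function(th) exp(rho * r * cos(th)), 0, 2 * pi)$value / (2 * pi)
r <- c(0.5, 1, 2, 3)
cbind(r, numerical = sapply(r, avg_exp), bessel = besselI(rho * r, nu = 0))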

Exercise 22.8.9. Let $\phi_n$, $n = 1, 2, \ldots$, be a sequence of test functions such that for each $n$, there exists a constant $c_n$ and function $t(x)$ such that
\[
\phi_n(x) =
\begin{cases}
1 & \text{if } t(x) > c_n\\
\gamma_n(x) & \text{if } t(x) = c_n\\
0 & \text{if } t(x) < c_n
\end{cases}.
\qquad (22.91)
\]
Suppose $\phi_n \to_w \phi$ for test $\phi$. We want to show that $\phi$ has the same form. There exists a subsequence of the $c_n$'s and constant $c \in [-\infty, \infty]$ such that $c_n \to c$ on that subsequence. Assume we are on that subsequence. (a) Argue that if $c_n \to \infty$ then $\phi_n(x) \to 0$ pointwise, hence $\phi_n \to_w 0$. What is the weak limit if $c_n \to -\infty$? (b) Now suppose $c_n \to c \in (-\infty, \infty)$. Use Lemma 22.4 to show that (with probability one) $\phi(x) = 1$ if $t(x) > c$ and $\phi(x) = 0$ if $t(x) < c$.

Exercise 22.8.10. Let $\phi_n$, $n = 1, 2, \ldots$, be a sequence of test functions such that for each $n$, there exist constants $a_n$ and $b_n$, $a_n \leq b_n$, such that
\[
\phi_n(x) =
\begin{cases}
1 & \text{if } t(x) < a_n \text{ or } t(x) > b_n\\
\gamma_n(x) & \text{if } t(x) = a_n \text{ or } t(x) = b_n\\
0 & \text{if } a_n < t(x) < b_n
\end{cases}.
\qquad (22.92)
\]
Suppose $\phi$ is a test with $\phi_n \to_w \phi$. We want to show that $\phi$ has the form (22.92) or (22.39) for some constants $a$ and $b$. (a) Let $\phi_{1n}$ be as in (22.91) with $c_n = b_n$, and $\phi_{2n}$ similar but with $c_n = -a_n$ and $t(x) = -t(x)$, so that $\phi_n = \phi_{1n} + \phi_{2n}$. Show that if $\phi_{1n} \to_w \phi_1$ and $\phi_{2n} \to_w \phi_2$, then $\phi_n \to_w \phi_1 + \phi_2$. (b) Find the forms of the tests $\phi_1$ and $\phi_2$ in part (a), and show that $\phi = \phi_1 + \phi_2$ (with probability one) and
\[
\phi(x) =
\begin{cases}
1 & \text{if } t(x) < a \text{ or } t(x) > b\\
\gamma(x) & \text{if } t(x) = a \text{ or } t(x) = b\\
0 & \text{if } a < t(x) < b
\end{cases}
\qquad (22.93)
\]
for some $a$ and $b$ (one or both possibly infinite).

Exercise 22.8.11. This exercise verifies (22.27) and (22.28). Let $\theta_1, \theta_2, \ldots$ be a countable dense subset of $\mathcal{T}$, and $g(\theta)$ be a continuous function. (a) Show that if $g(\theta_i) = 0$ for all $i = 1, 2, \ldots$, then $g(\theta) = 0$ for all $\theta \in \mathcal{T}$. (b) Show that if $g(\theta_i) \leq 0$ for all $i = 1, 2, \ldots$, then $g(\theta) \leq 0$ for all $\theta \in \mathcal{T}$.


Exercise 22.8.12. This exercise is to prove Lemma 22.8. Here, $C \subset \mathbb{R}^p$ is a closed set, and the goal is to show that $C$ is convex if and only if it can be written as $C = \cap_{a,c} H_{a,c}$ for some set of vectors $a$ and constants $c$, where $H_{a,c} = \{x \in \mathbb{R}^p \mid a \cdot x \leq c\}$ as in (22.46). (a) Show that the intersection of convex sets is convex. Thus $\cap_{a,c} H_{a,c}$ is convex, since each halfspace is convex, which proves the "if" part of the lemma. (b) Now suppose $C$ is convex. Let $C^*$ be the intersection of all halfspaces that contain $C$:
\[
C^* = \bigcap_{\{(a,c)\,\mid\,C \subset H_{a,c}\}} H_{a,c}.
\qquad (22.94)
\]
(i) Show that $z \in C$ implies that $z \in C^*$. (ii) Suppose $z \notin C$. Then by (20.104) in Exercise 20.9.28, there exists a non-zero vector $\gamma$ such that $\gamma \cdot x < \gamma \cdot z$ for all $x \in C$. Show that there exists a $c$ such that $\gamma \cdot x \leq c < \gamma \cdot z$ for all $x \in C$, hence $z \notin H_{\gamma,c}$ but $C \subset H_{\gamma,c}$. Argue that consequently $z \notin C^*$. (iii) Does $C^* = C$?

Exercise 22.8.13. Suppose $X \sim N(\mu, I_p)$, and we test $H_0: \mu = 0$ versus the multivariate one-sided alternative $H_A: \mu \in \{\mu \in \mathbb{R}^p \mid \mu_i \geq 0 \text{ for each } i\} - \{0\}$. (a) Show that under the alternative, the MLE of $\mu$ is given by $\hat\mu_{Ai} = \max\{0, x_i\}$ for each $i$, and that the likelihood ratio statistic is $-2\log(LR) = \sum \max\{0, x_i\}^2$. (b) For $p = 2$, sketch the acceptance region of the likelihood ratio test, $\{x \mid -2\log(LR) \leq c\}$, for fixed $c > 0$. Is the acceptance region convex? Is it nonincreasing? Is the test admissible? [Extra credit: Find the $c$ so that the level of the test in part (b) is 0.05.]
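
For the extra-credit part with $p = 2$, the null cutoff can be approximated by simulation. The Monte Carlo below, and the chi-bar-squared mixture $\tfrac14\delta_0 + \tfrac12\chi^2_1 + \tfrac14\chi^2_2$ used as a cross-check, are our own illustration (the mixture form is a standard fact for this cone alternative, not the book's stated solution):

set.seed(1)
x <- matrix(rnorm(2 * 1e6), ncol = 2)          # X ~ N(0, I_2) under the null
stat <- rowSums(pmax(x, 0)^2)                  # -2 log(LR)
quantile(stat, 0.95)                           # Monte Carlo estimate of c

# Cross-check: solve 0.5 P[chi^2_1 > c] + 0.25 P[chi^2_2 > c] = 0.05
f <- function(c) 0.5 * pchisq(c, 1, lower.tail = FALSE) +
                 0.25 * pchisq(c, 2, lower.tail = FALSE) - 0.05
uniroot(f, c(1, 10))$root                      # roughly 4.23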

Exercise 22.8.14. Continue with the multivariate one-sided normal testing problem in Exercise 22.8.13. Exercise 15.7.9 presented several methods for combining independent p-values. This exercise determines their admissibility in the normal situation. Here, the $i$th p-value is $U_i = 1 - \Phi(X_i)$, where $\Phi$ is the $N(0, 1)$ distribution function. (a) Fisher's test rejects when $T_P(U) = -2\sum\log(U_i) > c$. Show that $T_P$ as a function of $x$ is convex and increasing in each component, hence the test is admissible. [Hint: It is enough to show that $-\log(1 - \Phi(x_i))$ is convex and increasing in $x_i$. Show that
\[
-\frac{d}{dx_i}\log(1 - \Phi(x_i))
  = \left(\int_{x_i}^{\infty} e^{-\frac{1}{2}(y^2 - x_i^2)}\,dy\right)^{-1}
  = \left(\int_0^{\infty} e^{-\frac{1}{2}u(u + 2x_i)}\,du\right)^{-1},
\qquad (22.95)
\]
where $u = y - x_i$. Argue that the final expression is positive and increasing in $x_i$.] (b) Tippett's test rejects the null when $\min\{U_i\} < c$. Sketch the acceptance region in the $x$-space for this test when $p = 2$. Argue that the test is admissible. (c) The maximum test rejects when $\max\{U_i\} < c$. Sketch the acceptance region in the $x$-space for this test when $p = 2$. Argue that the test is inadmissible. (d) The Edgington test rejects the null when $\sum U_i < c$. Take $p = 2$ and $0 < c < 0.5$. Sketch the acceptance region in the $x$-space. Show that the boundary of the acceptance region is asymptotic to the lines $x_1 = \Phi^{-1}(1 - c)$ and $x_2 = \Phi^{-1}(1 - c)$. Is the acceptance region convex in $x$? Is the test admissible? (e) The Liptak-Stouffer test rejects the null when $-\sum\Phi^{-1}(U_i) > c$. Show that in this case, the test is equivalent to rejecting when $\sum X_i > c$. Is it admissible?
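
The acceptance regions in parts (a) through (e) are easy to draw numerically. The sketch below is our own illustration (grid, level, and plotting choices are ours); it computes the combination statistics for $p = 2$ and contours Fisher's statistic at its null $\chi^2_4$ cutoff, and the other statistics can be contoured the same way:

p_vals    <- function(x) 1 - pnorm(x)              # U_i = 1 - Phi(X_i)
fisher    <- function(x) -2 * sum(log(p_vals(x)))
tippett   <- function(x) min(p_vals(x))
maxtest   <- function(x) max(p_vals(x))
edgington <- function(x) sum(p_vals(x))
stouffer  <- function(x) -sum(qnorm(p_vals(x)))    # equals sum(x), as in (e)
grid <- seq(-3, 3, length.out = 101)
z <- outer(grid, grid, Vectorize(function(a, b) fisher(c(a, b))))
contour(grid, grid, z, levels = qchisq(0.95, df = 4), xlab = "x1", ylab = "x2")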

Exercise 22.8.15. This exercise finds the UMP invariant test for the two-sample normal mean test. That is, suppose $X_1, \ldots, X_n$ are iid $N(\mu_x, \sigma^2)$ and $Y_1, \ldots, Y_m$ are iid $N(\mu_y, \sigma^2)$, and the $X_i$'s are independent of the $Y_i$'s. Note that the variances are equal. We test
\[
H_0: \mu_x = \mu_y,\ \sigma^2 > 0 \quad\text{versus}\quad H_A: \mu_x \neq \mu_y,\ \sigma^2 > 0.
\qquad (22.96)
\]
Consider the affine invariance group $G = \{(a, b) \mid a \in \mathbb{R} - \{0\},\ b \in \mathbb{R}\}$ with action
\[
(a, b) \circ (X_1, \ldots, X_n, Y_1, \ldots, Y_m) = a(X_1, \ldots, X_n, Y_1, \ldots, Y_m) + (b, b, \ldots, b).
\qquad (22.97)
\]
(a) Show that the testing problem is invariant under $G$. What is the action of $G$ on the parameter $(\mu_x, \mu_y, \sigma^2)$? (b) Show that the sufficient statistic is $(\overline{X}, \overline{Y}, U)$, where $U = \sum(X_i - \overline{X})^2 + \sum(Y_i - \overline{Y})^2$. (c) Now reduce the problem by sufficiency to the statistics in part (b). What is the action of the group on the sufficient statistic? (d) Show that the maximal invariant statistic can be taken to be $|\overline{X} - \overline{Y}|/\sqrt{U}$, or, equivalently, the square of the two-sample t statistic:
\[
T^2 = \frac{(\overline{X} - \overline{Y})^2}{\left(\frac{1}{n} + \frac{1}{m}\right)S_P^2},
\quad\text{where } S_P^2 = \frac{U}{n + m - 2}.
\qquad (22.98)
\]
(e) Show that $T^2 \sim F_{1,n+m-2}(\Delta)$, the noncentral F. What is the noncentrality parameter $\Delta$? Is it the maximal invariant parameter? (f) Is the test that rejects the null when $T^2 > F_{1,n+m-2,\alpha}$ the UMP invariant level $\alpha$ test? Why or why not?
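
A quick simulation under the null (with arbitrary choices $n = 8$, $m = 12$, $\sigma = 2$, our own illustration) supports part (e): the squared pooled-variance t statistic behaves like a central $F_{1,n+m-2}$:

set.seed(1)
n <- 8; m <- 12
T2 <- replicate(1e4, {
  x <- rnorm(n, 0, 2); y <- rnorm(m, 0, 2)
  t.test(x, y, var.equal = TRUE)$statistic^2     # pooled two-sample t, squared
})
c(simulated = quantile(T2, 0.95), theory = qf(0.95, 1, n + m - 2))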

Exercise 22.8.16. This exercise follows on Exercise 22.8.15, but does not assume equal variances. Thus $X_1, \ldots, X_n$ are iid $N(\mu_x, \sigma_x^2)$ and $Y_1, \ldots, Y_m$ are iid $N(\mu_y, \sigma_y^2)$, the $X_i$'s are independent of the $Y_i$'s, and we test
\[
H_0: \mu_x = \mu_y,\ \sigma_x^2 > 0,\ \sigma_y^2 > 0 \quad\text{versus}\quad H_A: \mu_x \neq \mu_y,\ \sigma_x^2 > 0,\ \sigma_y^2 > 0.
\qquad (22.99)
\]
Use the same affine invariance group $G = \{(a, b) \mid a \in \mathbb{R} - \{0\},\ b \in \mathbb{R}\}$ and action (22.97). (a) Show that the testing problem is invariant under $G$. What is the action of $G$ on the parameter $(\mu_x, \mu_y, \sigma_x^2, \sigma_y^2)$? (b) Show that the sufficient statistic is $(\overline{X}, \overline{Y}, S_x^2, S_y^2)$, where $S_x^2 = \sum(X_i - \overline{X})^2/(n - 1)$ and $S_y^2 = \sum(Y_i - \overline{Y})^2/(m - 1)$. (c) What is the action of the group on the sufficient statistic? (d) Find a two-dimensional maximal invariant statistic and maximal invariant parameter. Does it seem reasonable that there is no UMP invariant test?

Exercise 22.8.17. Consider the testing problem at the beginning of Section 22.7.1, but with unknown variance. Thus $X \sim N(\mu, \sigma^2 I_p)$, and we test
\[
H_0: \mu = 0,\ \sigma^2 > 0 \quad\text{versus}\quad H_A: \mu \in \mathbb{R}^p - \{0\},\ \sigma^2 > 0.
\qquad (22.100)
\]
Take the invariance group to be $G = \{a\Gamma \mid a \in (0, \infty),\ \Gamma \in \mathcal{O}_p\}$. The action is $a\Gamma \circ X = a\Gamma X$. (a) Show that the testing problem is invariant under the group. (b) Show that the maximal invariant statistic can be taken to be the constant "1" (or any constant). [Hint: Take $\Gamma_x$ so that $\Gamma_x x = (\|x\|, 0, \ldots, 0)'$ to start, then choose an $a$.] (c) Show that the UMP invariant level $\alpha$ test is just the constant $\alpha$. Is that a very useful test? (d) Show that the usual level $\alpha$ two-sided t test (which is not invariant) has better power than the UMP invariant level $\alpha$ test.

Exercise 22.8.18. Here $X \sim N(\mu, I_p)$, where we partition $X = (X_1', X_2')'$ and $\mu = (\mu_1', \mu_2')'$ with $X_1$ and $\mu_1$ being $p_1 \times 1$ and $X_2$ and $\mu_2$ being $p_2 \times 1$, as in (22.68). We test $\mu_2 = 0$ versus $\mu_2 \neq 0$ as in (22.69). Use the invariance group $G$ in (22.70), so that $(b_1, \Gamma_2) \circ X = ((X_1 + b_1)', (\Gamma_2 X_2)')'$. (a) Find the action on the parameter space, $(b_1, \Gamma_2) \circ \mu$. (b) Let $b_{1x} = -x_1$. Find $\Gamma_{2x}$ so that $(b_{1x}, \Gamma_{2x}) \circ x = (0_{p_1}', \|x_2\|, 0_{p_2-1}')'$. (c) Show that $\|X_2\|^2$ is the maximal invariant statistic and $\|\mu_2\|^2$ is the maximal invariant parameter.


Exercise 22.8.19. This exercise uses the linear regression testing problem in Section 22.7.3. The action is $(a, \Gamma_2, b_1) \circ (\beta_1, \beta_2^*, SS_e) = (a\beta_1 + b_1,\ a\Gamma_2\beta_2^*,\ a^2 SS_e)$ as in (22.78). (a) Using $a = 1/\sqrt{SS_e}$, $b_1 = -\beta_1/a$, and $\Gamma_2$ so that $\Gamma_2\beta_2^* = (\|\beta_2^*\|, 0_{p-1}')'$, show that the maximal invariant statistic is $\|\beta_2^*\|^2/SS_e$. (b) Show that the maximal invariant statistic is a one-to-one function of the F statistic in (15.24). [Hint: From (22.76), $\beta_2^* = B\beta_2$, where $BB = (x_2'x_2)^{-1}$, and note that in the current model, $C_{22} = (x_2'x_2)^{-1}$.]


Bibliography

Agresti, A. (2013). Categorical Data Analysis. Wiley, third edition.

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19:716–723.

Anscombe, F. J. (1948). The transformation of Poisson, binomial, and negative-binomial data. Biometrika, 35(3/4):246–254.

Appleton, D. R., French, J. M., and Vanderpump, M. P. J. (1996). Ignoring a covariate: An example of Simpson's paradox. The American Statistician, 50(4):340–341.

Arnold, J. (1981). Statistics of natural populations. I: Estimating an allele probability in cryptic fathers with a fixed number of offspring. Biometrics, 37(3):495–504.

Bahadur, R. R. (1964). On Fisher's bound for asymptotic variances. The Annals of Mathematical Statistics, 35(4):1545–1552.

Bassett, G. and Koenker, R. W. (1978). Asymptotic theory of least absolute error regression. Journal of the American Statistical Association, 73(363):618–622.

Berger, J. O. (1993). Statistical Decision Theory and Bayesian Analysis. Springer, New York, second edition.

Berger, J. O. and Bayarri, M. J. (2012). Lectures on model uncertainty and multiplicity. CBMS Regional Conference in the Mathematical Sciences. https://cbms-mum.soe.ucsc.edu/Material.html.

Berger, J. O., Ghosh, J. K., and Mukhopadhyay, N. (2003). Approximations and consistency of Bayes factors as model dimension grows. Journal of Statistical Planning and Inference, 112(1-2):241–258.

Berger, J. O. and Sellke, T. (1987). Testing a point null hypothesis: The irreconcilability of P values and evidence. Journal of the American Statistical Association, 82:112–122. With discussion.

Berger, J. O. and Wolpert, R. L. (1988). The Likelihood Principle. Institute of Mathematical Statistics, Hayward, CA, second edition.

Bickel, P. J. and Doksum, K. A. (2007). Mathematical Statistics: Basic Ideas and Selected Topics, Volume I. Pearson, second edition.


Billingsley, P. (1995). Probability and Measure. Wiley, New York, third edition.

Billingsley, P. (1999). Convergence of Probability Measures. Wiley, New York, second edition.

Birnbaum, A. (1955). Characterizations of complete classes of tests of some multiparametric hypotheses, with applications to likelihood ratio tests. The Annals of Mathematical Statistics, 26(1):21–36.

Blyth, C. R. (1951). On minimax statistical decision procedures and their admissibility. The Annals of Mathematical Statistics, 22(1):22–42.

Box, G. E. P. and Muller, M. E. (1958). A note on the generation of random normal deviates. Annals of Mathematical Statistics, 29(2):610–611.

Box, J. F. (1978). R. A. Fisher: The Life of a Scientist. Wiley, New York.

Bradley, R. A. and Terry, M. E. (1952). Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345.

Brown, L. D. (1971). Admissible estimators, recurrent diffusions, and insoluble boundary value problems. The Annals of Mathematical Statistics, 42(3):855–903.

Brown, L. D. (1981). A complete class theorem for statistical problems with finite sample spaces. The Annals of Statistics, 9(6):1289–1300.

Brown, L. D., Cai, T. T., and DasGupta, A. (2001). Interval estimation for a binomial proportion. Statistical Science, 16(2):101–133. With discussion.

Brown, L. D. and Hwang, J. T. (1982). A unified admissibility proof. In Gupta, S. S. and Berger, J. O., editors, Statistical Decision Theory and Related Topics III, volume 1, pages 205–230. Academic Press, New York.

Brown, L. D. and Marden, J. I. (1989). Complete class results for hypothesis testing problems with simple null hypotheses. The Annals of Statistics, 17:209–235.

Burnham, K. P. and Anderson, D. R. (2003). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer-Verlag, New York, second edition.

Casella, G. and Berger, R. (2002). Statistical Inference. Thomson Learning, second edition.

Cramér, H. (1999). Mathematical Methods of Statistics. Princeton University Press. Originally published in 1946.

Diaconis, P. and Ylvisaker, D. (1979). Conjugate priors for exponential families. The Annals of Statistics, 7(2):269–281.

Duncan, O. D. and Brody, C. (1982). Analyzing n rankings of three items. In Hauser, R. M., Mechanic, D., Haller, A. O., and Hauser, T. S., editors, Social Structure and Behavior, pages 269–310. Academic Press, New York.

Durrett, R. (2010). Probability: Theory and Examples. Cambridge University Press, fourth edition.

Eaton, M. L. (1970). A complete class theorem for multidimensional one-sided alternatives. The Annals of Mathematical Statistics, 41(6):1884–1888.


Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32(2):407–499.

Fahrmeir, L. and Kaufmann, H. (1985). Consistency and asymptotic normality of the maximum likelihood estimator in generalized linear models. The Annals of Statistics, 13(1):342–368.

Feller, W. (1968). An Introduction to Probability Theory and its Applications, Volume I. Wiley, New York, third edition.

Ferguson, T. S. (1967). Mathematical Statistics: A Decision Theoretic Approach. Academic Press, New York.

Fieller, E. C. (1932). The distribution of the index in a normal bivariate population. Biometrika, 24(3/4):428–440.

Fienberg, S. (1971). Randomization and social affairs: The 1970 draft lottery. Science, 171:255–261.

Fink, D. (1997). A compendium of conjugate priors. Technical report, Montana State University, http://www.johndcook.com/CompendiumOfConjugatePriors.pdf.

Fisher, R. A. (1935). Design of Experiments. Oliver and Boyd, London. There are many editions. This is the first.

Fraser, D. A. S. (1957). Nonparametric Methods in Statistics. Wiley, New York.

Gibbons, J. D. and Chakraborti, S. (2011). Nonparametric Statistical Inference. CRC Press, Boca Raton, Florida, fifth edition.

Hastie, T. and Efron, B. (2013). lars: Least angle regression, lasso and forward stagewise. https://cran.r-project.org/package=lars.

Hastie, T., Tibshirani, R., and Friedman, J. H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, second edition.

Henson, C., Rogers, C., and Reynolds, N. (1996). Always Coca-Cola. Technical report, University Laboratory High School, Urbana, Illinois.

Hoeffding, W. (1952). The large-sample power of tests based on permutations of observations. The Annals of Mathematical Statistics, 23(2):169–192.

Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67.

Hogg, R. V., McKean, J. W., and Craig, A. T. (2013). Introduction to Mathematical Statistics. Pearson, seventh edition.

Huber, P. J. and Ronchetti, E. M. (2011). Robust Statistics. Wiley, New York, second edition.

Hurvich, C. M. and Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika, 76(2):297–307.

James, W. and Stein, C. (1961). Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, pages 361–379. University of California Press, Berkeley.


Jeffreys, H. (1961). Theory of Probability. Oxford University Press, Oxford, third edition.

Johnson, B. M. (1971). On the admissible estimators for certain fixed sample binomial problems. The Annals of Mathematical Statistics, 42(5):1579–1587.

Jonckheere, A. R. (1954). A distribution-free k-sample test against ordered alternatives. Biometrika, 41(1/2):133–145.

Jung, K., Shavitt, S., Viswanathan, M., and Hilbe, J. M. (2014). Female hurricanes are deadlier than male hurricanes. Proceedings of the National Academy of Sciences, 111(24):8782–8787.

Kass, R. E. and Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430):773–795.

Kass, R. E. and Wasserman, L. (1995). A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90(431):928–934.

Kass, R. E. and Wasserman, L. (1996). The selection of prior distributions by formal rules. Journal of the American Statistical Association, 91(435):1343–1370.

Kendall, M. G. and Gibbons, J. D. (1990). Rank Correlation Methods. E. Arnold, London, fifth edition.

Kiefer, J. and Schwartz, R. (1965). Admissible Bayes character of T²-, R²-, and other fully invariant tests for classical multivariate normal problems (corr: V43 p1742). The Annals of Mathematical Statistics, 36(3):747–770.

Knight, K. (1999). Mathematical Statistics. CRC Press, Boca Raton, Florida.

Koenker, R. W. and Bassett, G. (1978). Regression quantiles. Econometrica, 46(1):33–50.

Koenker, R. W., Portnoy, S., Ng, P. T., Zeileis, A., Grosjean, P., and Ripley, B. D. (2015). quantreg: Quantile regression. https://cran.r-project.org/package=quantreg.

Kyung, M., Gill, J., Ghosh, M., and Casella, G. (2010). Penalized regression, standard errors, and Bayesian lassos. Bayesian Analysis, 5(2):369–411.

Lamport, L. (1994). LaTeX: A Document Preparation System. Addison-Wesley, second edition.

Lazarsfeld, P. F., Berelson, B., and Gaudet, H. (1968). The People's Choice: How the Voter Makes up his Mind in a Presidential Campaign. Columbia University Press, New York, third edition.

Lehmann, E. L. (1991). Theory of Point Estimation. Springer, New York, second edition.

Lehmann, E. L. (2004). Elements of Large-Sample Theory. Springer, New York.

Lehmann, E. L. and Casella, G. (2003). Theory of Point Estimation. Springer, New York, second edition.

Lehmann, E. L. and Romano, J. P. (2005). Testing Statistical Hypotheses. Springer, New York, third edition.


Li, F., Harmer, P., Fisher, K. J., McAuley, E., Chaumeton, N., Eckstrom, E., and Wilson, N. L. (2005). Tai chi and fall reductions in older adults: A randomized controlled trial. The Journals of Gerontology: Series A, 60(2):187–194.

Lumley, T. (2009). leaps: Regression subset selection. Uses Fortran code by Alan Miller. https://cran.r-project.org/package=leaps.

Madsen, L. and Wilson, P. R. (2015). memoir — Typeset fiction, nonfiction and mathematical books. https://www.ctan.org/pkg/memoir.

Mallows, C. L. (1973). Some comments on Cp. Technometrics, 15(4):661–675.

Mann, H. B. and Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, 18(1):50–60.

Matthes, T. K. and Truax, D. R. (1967). Tests of composite hypotheses for the multivariate exponential family. The Annals of Mathematical Statistics, 38(3):681–697.

Mendenhall, W. M., Million, R. R., Sharkey, D. E., and Cassisi, N. J. (1984). Stage T3 squamous cell carcinoma of the glottic larynx treated with surgery and/or radiation therapy. International Journal of Radiation Oncology·Biology·Physics, 10(3):357–363.

Pitman, E. J. G. (1939). The estimation of the location and scale parameters of a continuous population of any given form. Biometrika, 30(3/4):391–421.

Reeds, J. A. (1985). Asymptotic number of roots of Cauchy location likelihood equations. The Annals of Statistics, 13(2):775–784.

Sacks, J. (1963). Generalized Bayes solutions in estimation problems. The Annals of Mathematical Statistics, 34(3):751–768.

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464.

Sellke, T., Bayarri, M. J., and Berger, J. O. (2001). Calibration of p values for testing precise null hypotheses. The American Statistician, 55(1):62–71.

Sen, P. K. (1968). Estimates of the regression coefficient based on Kendall's tau. Journal of the American Statistical Association, 63(324):1379–1389.

Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. Wiley, New York.

Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(4):583–616.

Stein, C. (1956a). The admissibility of Hotelling's T²-test. The Annals of Mathematical Statistics, 27:616–623.

Stein, C. (1956b). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, pages 197–206. University of California Press, Berkeley.


Stichler, R. D., Richey, G. G., and Mandel, J. (1953). Measurement of treadwear of commercial tires. Rubber Age, 73(2).

Strawderman, W. E. and Cohen, A. (1971). Admissibility of estimators of the mean vector of a multivariate normal distribution with quadratic loss. The Annals of Mathematical Statistics, 42(1):270–296.

Student (1908). The probable error of a mean. Biometrika, 6(1):1–25.

Terpstra, T. J. (1952). The asymptotic normality and consistency of Kendall's test against trend, when ties are present in one ranking. Indagationes Mathematicae (Proceedings), 55:327–333.

Theil, H. (1950). A rank-invariant method of linear and polynomial regression analysis I. Indagationes Mathematicae (Proceedings), 53:386–392.

Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley, Reading, Massachusetts.

van Zwet, W. R. and Oosterhoff, J. (1967). On the combination of independent test statistics. The Annals of Mathematical Statistics, 38(3):659–680.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. Springer, New York, fourth edition.

von Neumann, J. and Morgenstern, O. (1944). Theory of Games and Economic Behavior. Princeton University Press, Princeton, New Jersey.

Wald, A. (1950). Statistical Decision Functions. Wiley, New York.

Wijsman, R. A. (1973). On the attainment of the Cramér-Rao lower bound. The Annals of Statistics, 1(3):538–542.

Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83.

Zelazo, P. R., Zelazo, N. A., and Kolb, S. (1972). "Walking" in the newborn. Science, 176(4032):314–315.


Author Index

Agresti, A. 98, 242Akaike, H. 272, 275Anderson, D. R. 276Anscombe, F. J. 145Appleton, D. R. 96Arnold, J. 89

Bahadur, R. R. 232Bassett, G. 179, 192Bayarri, M. J. 255, 257Berelson, B. 279Berger, J. O. iv, 200, 254, 255, 257, 275,

351Berger, R. ivBest, N. G. 273Bickel, P. J. ivBillingsley, P. iv, 27, 32, 132, 134, 140, 403Birnbaum, A. 406Blyth, C. R. 351Box, G. E. P. 74Box, J. F. 290Bradley, R. A. 243Brody, C. 41Brown, L. D. 145, 258, 351, 353, 355, 363,

404Burnham, K. P. 276

Cai, T. T. 145, 258Carlin, B. P. 273Casella, G. iv, 191, 233, 333, 338, 351Cassisi, N. J. 300Chakraborti, S. 295Chaumeton, N. 288

Cohen, A. 355Craig, A. T. iii, ivCramér, H. 225

DasGupta, A. 145, 258Diaconis, P. 170Doksum, K. A. ivDuncan, O. D. 41Durrett, R. 37, 127

Eaton, M. L. 409Eckstrom, E. 288Efron, B. 190, 191

Fahrmeir, L. 237Feller, W. 174Ferguson, T. S. iv, 351, 352, 358, 368Fieller, E. C. 262Fienberg, S. 293Fink, D. 170Fisher, K. J. 288Fisher, R. A. 290Fraser, D. A. S. 298French, J. M. 96Friedman, J. H. 190

Gaudet, H. 279Ghosh, J. K. 275Ghosh, M. 191Gibbons, J. D. 295, 311Gill, J. 191Grosjean, P. 192

Harmer, P. 288




Hastie, T. 190, 191Henson, C. 174Hilbe, J. M. 187, 219, 282Hoeffding, W. 298Hoerl, A. E. 186, 198Hogg, R. V. iii, ivHuber, P. J. 191Hurvich, C. M. 276Hwang, J. T. 353

James, W. 353Jeffreys, H. 172, 253, 254Johnson, B. M. 363Johnstone, I. 190, 191Jonckheere, A. R. 312Jung, K. 187, 219, 282

Kass, R. E. 172, 254, 255, 274Kaufmann, H. 237Kendall, M. G. 311Kennard, R. W. 186, 198Kiefer, J. 414Knight, K. ivKoenker, R. W. 179, 192Kolb, S. 286Kyung, M. 191

Lamport, L. iiLazarsfeld, P. F. 279Lehmann, E. L. iv, 233, 268, 331, 333, 338,

351, 373, 383, 387, 401Li, F. 288Lumley, T. 189

Madsen, L. iiMallows, C. L. 189Mandel, J. 294Mann, H. B. 306Marden, J. I. 404Matthes, T. K. 406McAuley, E. 288McKean, J. W. iii, ivMendenhall, W. M. 300Million, R. R. 300Morgenstern, O. 357Mukhopadhyay, N. 275Muller, M. E. 74

Ng, P. T. 192

Oosterhoff, J. 396

Pitman, E. J. G. 338Portnoy, S. 192

Raftery, A. E. 255Reeds, J. A. 229Reynolds, N. 174Richey, G. G. 294Ripley, B. D. 192, 193Rogers, C. 174Romano, J. P. iv, 331, 373, 383, 387, 401Ronchetti, E. M. 191

Sacks, J. 351Schwartz, R. 414Schwarz, G. 272, 273Sellke, T. 254, 257Sen, P. K. 315Serfling, R. J. 212, 299Sharkey, D. E. 300Shavitt, S. 187, 219, 282Spiegelhalter, D. J. 273Stein, C. 352, 353, 408Stichler, R. D. 294Strawderman, W. E. 355Student 316

Terpstra, T. J. 312Terry, M. E. 243Theil, H. 315Tibshirani, R. 190, 191Truax, D. R. 406Tsai, C.-L. 276Tukey, J. W. 301

van der Linde, A. 273van Zwet, W. R. 396Vanderpump, M. P. J. 96Venables, W. N. 193Viswanathan, M. 187, 219, 282von Neumann, J. 357

Wald, A. 400Wasserman, L. 172, 254, 274Whitney, D. R. 306Wijsman, R. A. 333Wilcoxon, F. 305, 306Wilson, N. L. 288Wilson, P. R. ii



Wolpert, R. L. 200

Ylvisaker, D. 170

Zeileis, A. 192

Zelazo, N. A. 286

Zelazo, P. R. 286


Subject Index

The italic page numbers are references to Exercises.

ACA examplerandomized testing (two

treatments), 300action

in decision theory, 346in inference, 158of group, 410

affine transformation, 22asymptotic distribution, 148covariance, 23covariance matrix, 24expected value, 23Jacobian, 70mean, 23variance, 22

AIC, see model selection: Akaikeinformation criterion

asymptotic efficiency, 232median vs. mean, 233

asymptotic relative efficiencymedian vs. mean, 143

asymptotics, see convergence indistribution, convergence inprobability

bang for the buck example, 370–371barley seed example

t test, 316sign test, 316signed-rank test, 316

baseball exampleBradley-Terry model, 243

Bayes theorem, 93–95Bayesian inference, 158

estimationbias, 177empirical Bayes, 174, 365

hypothesis testingBayes factor, 253odds, 252prior distribution, 251,

253–255prior probabilities, 252

polio example, 97–98posterior distribution, 81, 158prior distribution, 81, 158

conjugate, 170improper, 172noninformative, 172reference, 172

sufficient statistic, 209, 217Bayesian model, 83, 156belief in afterlife example

two-by-two table, 98–99Bernoulli distribution, 9, 29, 59

conditioning on sum, 99Fisher’s information, 240mean, 60moment generating function, 60score function, 240

beta distribution, 7as exponential family, 217, 342as special case of Dirichlet, 69asymptotic distribution, 149

431



estimationBayes, 169method of moments, 163, 173

hypothesis testingmonotone likelihood ratio, 388uniformly most powerful test,

388mode, 97moment generating function,

122probability interval, 169relationship to binomial, 80relationship to gamma, 67–68sample maximum

convergence in distribution,138

sample minimumconvergence in distribution,

130sufficient statistic, 216

beta-binomial distribution, 86BIC, see model selection: Bayes

information criterionbinomial distribution, 9, 29, 60

as exponential family, 217as sum of Bernoullis, 60Bayesian inference

Bayes factor, 259beta posterior, 95conjugate prior, 170estimation, 364hypothesis testing, 252improper prior, 177

bootstrap estimator, 175completeness, 328confidence interval, 144

Clopper-Pearson interval, 258,262

convergence in probability, 136convergence to Poisson, 131, 133cumulant generating function,

30estimation

admissibility, 345, 363Bayes, 364constant risk, 364Cramér-Rao lower bound, 341mean square error, 345, 346minimaxity, 345, 356unbiased, 326, 340, 341

uniformly minimum varianceunbiased estimator, 328, 341

hypothesis testing, 259Bayes factor, 259Bayesian, 252likelihood ratio test, 251, 371locally most powerful test,

390–391randomized test, 369–370

likelihood, 201mean, 60moment generating function, 30,

60quantile, 34, 37relationship to beta, 80sufficient statistic, 216sum of binomials, 63truncated, 362variance stabilizing

transformation, 144binomial theorem, 30bivariate normal distribution, 70–71

as exponential family, 342conditional distribution, 92–93

pdf, 97correlation coefficient

estimation, 244Fisher’s information, 244hypothesis testing, 280–281uniformly minimum variance

unbiased estimator, 343estimation of variance

uniformly minimum varianceunbiased estimator, 343

hypothesis testingadmissibility, 416–417Bayes test, 416–417compact hypothesis spaces,

402, 404locally most powerful test, 389on correlation coefficient,

280–281independence, 95–96moment generating function, 71pdf, 71, 78, 97polar coordinates, 74

BLUE, see linear regression: bestlinear unbiased estimator

BMI example

Page 445: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

Subject Index 433

Mann-Whitney/Wilcoxon test,317

bootstrap estimation, 165–168confidence interval, 165estimating bias, 165estimating standard error, 165number of possible samples, 174sampling, 166using R, 168

Box-Cox transformation, 219–220Box-Muller transformation, 74Bradley-Terry model, 243

baseball example, 243

caffeine examplebootstrap confidence interval,

174–175cancer of larynx example

two-by-two table, 300–301Cartesian product, 45Cauchy distribution, 7

as cotangent of a uniform, 53–54as special case of Student’s t, 15estimation

maximum likelihood estimate,218

median vs. mean, 143, 233trimmed mean, 234

Fisher’s information, 241hypothesis testing

locally most powerful test, 382Neyman-Pearson test, 388score test, 270

kurtosis, 36skewness, 36sufficient statistic, 216

Cauchy-Schwarz lemma, 20–21centering matrix, 109

and deviations, 118trace, 118

central chi-square distribution, seechi-square distribution

central limit theorem, 133–134Hoeffding conditions, 298Lyapunov, 299multivariate, 135

Chebyshev’s inequality, 127, 136chi-square distribution, 7, see also

noncentral chi-squaredistribution

as special case of gamma, 14, 15,111

as sum of squares of standardnormals, 110

hypothesis testinguniformly most powerful

unbiased test, 393mean, 110moment generating function, 35sum of chi-squares, 111variance, 110

Clopper-Pearson interval, 258, 262coefficient of variation, 146coin example, 81, 84

beta-binomial distribution, 86conditional distribution, 91–92conditional expectation, 84unconditional expectation, 84

complement of a set, 4completeness, 327–328

exponential family, 330–331concordant pair, 308conditional distribution, 81–96

Bayesian model, 83conditional expectation, 84from joint distribution, 91–93hierarchical model, 83independence, 95–96, 96mean from conditional mean, 87mixture model, 83simple linear model, 82variance from conditional mean

and variance, 87–88conditional independence

intent to vote example, 279–280conditional probability, 84

HIV example, 97conditional space, 40confidence interval, 158, 159, 160, 162

bootstrapcaffeine example, 174–175shoes example, 167–168, 175

compared to probabilityinterval, 159, 160

for difference of two mediansinverting

Mann-Whitney/Wilcoxontest, 320

for mean, 140for median

Page 446: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

434 Subject Index

inverting sign test, 313–314,320–321

inverting signed-rank test, 320for slope

inverting Kendall’s τ test,314–315

from hypothesis test, 257–258,313–315

confluent hypergeometric function,122

convergence in distribution, 129–131,see also central limittheorem; weak law of largenumbers

definition, 129mapping lemma, 139–140moment generating functions,

132, 134multivariate, 134points of discontinuity, 131to a constant, 132

convergence in probability, 125–129as convergence in distribution,

132definition, 126mapping lemma, 139–140

convexity, 226, 227intersection of half-spaces, 408,

418Jensen’s inequality, 227

convolution, 50–53, 61correlation coefficient, 20, see also

sample correlationbootstrap estimator, 175hypothesis testing

draft lottery example,291–293, 319–320

inequality, 21covariance, 19, see also sample

covarianceaffine transformation, 23

covariance matrix, 24of affine transformation, 24of two vectors, 35, 107

Cramér’s theorem, 139CRLB (Cramér-Rao lower bound), 333cumulant, 27

kurtosis, 28skewness, 28

cumulant generating function, 27–28

decision theory, see also game theory;under estimation; underhypothesis testing

action, 346action space, 346admissibility, 349

finite parameter space, 360weighted squared error, 361

admissibility and Bayes,349–350, 360, 361, 364

admissibility and minimaxity,355, 364

Bayes and minimaxity, 355, 367Bayes procedure, 348

weighted squared error, 361decision procedure, 346

nonrandomized, 346randomized, 358, 366

domination, 348loss function, 347

absolute error, 347squared error, 347weighted squared error, 361

minimaxity, 355–356finite parameter space,

358–360least favorable distribution,

358, 367risk function, 347

Bayes risk, 347mean square error, 347

risk set, 366, 367∆-method, 141–142

multivariate, 145dense subset, 401determinant of matrix

as product of eigenvalues, 105,116

diabetes examplesubset selection, 282

digamma function, 36Dirichlet distribution, 68–69

beta as special case, 69covariance matrix, 79mean, 79odds ratio, 98–99pdf, 78relationship to gamma, 68–69

Dirichlet eta function, 37discordant pair, 308

Page 447: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

Subject Index 435

discrete uniform distribution, see alsouniform distribution

estimationadmissibility, 361Bayes, 361

hypothesis testingvs. geometric, 389

moment generating function, 61sum of discrete uniforms, 50, 61

distribution, see also Bernoulli; beta;beta-binomial; binomial;bivariate normal; Cauchy;chi-square; Dirichlet;exponential; F; gamma;geometric; Gumbel;Laplace; logistic;lognormal; multinomial;multivariate normal;multivariate Student’s t;negative binomial; negativemultinomial; noncentralchi-square; noncentral F;noncentral hypergeometric;normal; Poisson; shiftedexponential; slash;Student’s t; tent; trinomial;uniform

commoncontinuous, 7discrete, 9

independence, see separate entryinvariance under a group, 293joint, 82

space, 84marginal, see marginal

distributionspherically symmetric, 73–74weak compactness, 403

distribution function, 5empirical, 165properties, 5stochastically larger, 305

Dobzhansky estimator, see under fruitfly example: estimation

dot product, 26double exponential distribution, see

Laplace distributiondraft lottery example, 291, 319

Jonckheere-Terpstra test, 319–320Kendall’s τ, 319

testing randomness, 291–293asymptotic normality, 298

dt (R routine), 149

Edgington’s test, 261, 418eigenvalues & eigenvectors, 116empirical distribution function, 165entropy and negentropy, 276estimation, 155, see also maximum

likelihood estimate;bootstrap estimation

asymptotic efficiency, 232Bayes, 214bias, 162, 173

Bayes estimator, 177consistency, 128, 162, 173Cramér-Rao lower bound

(CRLB), 333decision theory

best shift-equivariantestimator, 349

Blyth’s method, 351–352James-Stein estimator,

352–355, 362uniformly minimum variance

unbiased estimator, 349definition of estimator, 161, 325mean square error, 173method of moments, 163Pitman estimator, 344plug-in, 163–168

nonparametric, 165shift-equivariance, 336, 344

Pitman estimator, 338unbiased, 336

standard error, 162unbiased, 214, 326–327uniformly minimum variance

unbiased estimator(UMVUE), 325, 329–330

uniformly minimum varianceunbiased estimator and theCramér-Rao lower bound,333

exam score exampleTukey’s resistant-line estimate,

301–302exchangeable, 295

vs. equal distributions, 301expected value

Page 448: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

436 Subject Index

coherence of functions, 18definition, 17linear functions, 18

exponential distribution, 7, see alsoshifted exponentialdistribution

as exponential family, 217as log of a uniform, 56as special case of gamma, 14Bayesian inference

conjugate prior, 176completeness, 331estimation

Cramér-Rao lower bound, 341unbiased, 341uniformly minimum variance

unbiased estimator, 341, 342Fisher’s information, 240hypothesis testing, 259

Neyman-Pearson test, 388kurtosis, 36, 59order statistics, 79quantiles, 33sample maximum

convergence in distribution,138

sample meanasymptotic distribution of

reciprocal, 150sample minimum

convergence in distribution,137–138

convergence in probability,137–138

score function, 240skewness, 36, 59sum of exponentials, 53variance stabilizing distribution,

150exponential family, 204

completeness, 330–331estimation

Cramér-Rao lower bound, 333maximum likelihood estimate,

241Fisher’s information, 241hypothesis testing

Bayes test, 416convex acceptance regions

and admissibility, 405–408

likelihood ratio, 379monotone acceptance regions

and admissibility, 408–409Neyman-Pearson test, 379uniformly most powerful test,

379uniformly most powerful

unbiased test, 384, 391–392natural parameter, 205natural sufficient statistics, 205score function, 241

F distribution, see also noncentral Fdistribution

as ratio of chi-squares, 120–121pdf, 121ratio of sample normal

variances, 121relationship to Student’s t, 121

F test, see under linear regression:hypothesis testing

Fisher’s exact test, see undertwo-by-two table

Fisher’s factorization theorem, 208Fisher’s information, 222–224, 226

expected, 223multivariate, 234–235observed, 223

Fisher’s test for combining p-values,260, 418

Fisher’s z, 151fruit fly example, 89–91

as exponential family, 342data, 89estimation

Cramér-Rao lower bound, 342Dobzhansky estimator, 91,

240, 342maximum likelihood estimate,

222, 224Fisher’s information, 240hypothesis testing

locally most powerful test, 390uniformly most powerful test,

lack of, 390loglikelihood, 217marginal distribution, 99–100

functional, 164

game theory, 356–357, see alsodecision theory

Page 449: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

Subject Index 437

least favorable distribution, 357,366

minimaxity, 357, 366value, 357

gamma distribution, 7, 28as exponential family, 217cumulant generating function,

29estimation

method of moments, 173kurtosis, 29mean of inverse, 97moment generating function, 29relationship to beta, 67–68relationship to Dirichlet, 68–69scale transformation, 77skewness, 29sufficient statistic, 216sum of gammas, 57, 63, 67–69

gamma function, 6, 14, 15Gauss-Markov theorem, 195Gaussian distribution, see Normal

distributiongeneralized hypergeometric function,

122geometric distribution, 9

convergence in distribution, 137estimation

unbiased, 340hypothesis testing

vs. discrete uniform, 389moment generating function, 34sum of geometrics, 61

Greek examplelogistic regression, 238–240

group, 293, see also hypothesis testing:invariance

action, 410properties, 410

Gumbel distribution, 79sample maximum, 79

convergence in distribution,138

Hardy-Weinberg law, 90hierarchical model, 83HIV example

conditional probability, 97Hoeffding conditions, 298homoscedastic, 182

horse race example, 388hurricane example, 187

linear regression, 192Box-Cox transformation, 220lasso, 191least absolute deviations,

192–193randomization test, 296ridge regression, 188Sen-Theil estimator, 315subset selection, 189, 282–283

sample correlationPearson and Spearman,

307–308hypergeometric distribution, 289, see

also noncentralhypergeometric

asymptotic normality, 302mean, 302variance, 302

hypothesis testing, 155, 245–249, seealso Bayesian inference:hypothesis testing;likelihood ratio test;nonparametric test;randomization testing

accept/reject, 246criminal trial analogy, 248

admissibility, 405–409and limits of Bayes tests,

400–402and Bayes tests, 397–399compact hypothesis spaces,

402–404convex acceptance region,

405–408, 417monotone acceptance region,

408–409of level α tests, 414of locally most powerful test,

414of uniformly most powerful

test, 414of uniformly most powerful

unbiased test, 414alternative hypothesis, 245asymptotically most powerful

test, 393Bayesian

Bayes factor, 260

Page 450: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

438 Subject Index

polio example, 261chi-square test, 249combining independent tests,

260–261Edgington’s test, 261, 418Fisher’s test, 260, 418Liptak-Stouffer’s test, 261, 418maximum test, 260, 418Tippett’s test, 260, 418

composite hypothesis, 251confidence interval from,

257–258decision theory, 395–396, see also

hypothesis testing:admissibility

Bayes test, 397loss function, 395maximal regret, 396minimaxity, 396risk function, 395sufficiency, 396

false positive and false negative,247

invariance, 410action, 410maximal invariant, 411reduced problem, 411

level, 247likelihood ratio test

polio example, 277locally most powerful test

(LMP), 381–382and limit of Bayes tests,

415–416Neyman-Pearson form, 372Neyman-Pearson lemma, 372

proof, 372–373nuisance parameter, 386null hypothesis, 245p-value, 256–257

as test statistic, 256uniform distribution of, 260

power, 247randomized test, 369–370rank transform test, 304simple hypothesis, 250, 370size, 247test based on estimator, 249test statistic, 246type I and type II errors, 247

unbiased, 383uniformly most powerful

unbiased test, 383–384,386–387

uniformly most powerful test(UMP), 376, 378

weak compactness, 401weak convergence, 400

idempotent matrix, 109Moore-Penrose inverse, 112–113

identifiability, 228iid, see independent and identically

distributedindependence, 42–46

conditional distributions, 95–96definition, 42densities, 44, 46

factorization, 46distribution functions, 43expected values of products of

functions, 43implies covariance is zero, 43moment generating functions, 43spaces, 44–46

independent and identicallydistributed (iid), 46

sufficient statistic, 203, 208indicator function, 19inference, 155, see also confidence

interval; estimation;hypothesis testing; modelselection; prediction

Bayesian approach, 158frequentist approach, 158

intent to vote exampleconditional independence,

279–280likelihood ratio test, 279–280

interquartile range, 38invariance, see under hypothesis

testing

Jacobian, see under transformationsJames-Stein estimator, 352–355, 362Jensen’s inequality, 227joint distribution

from conditional and marginaldistributions, 84

densities, 85

Page 451: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

Subject Index 439

Jonckheere-Terpstra test, see undernonparametric testing

Kendall’s τ, see under Kendall’sdistance; nonparametrictesting; sample correlationcoefficient

Kendall’s distance, 60, 61Kendall’s τ, 309, 317

Kullback-Leibler divergence, 276kurtosis, 25

cumulant, 28

Laplace approximation, 274Laplace distribution, 7

as exponential family, 342estimation

Cramér-Rao lower bound, 334median vs. mean, 143, 233Pitman estimator, 339–340sample mean, 334trimmed mean, 234

Fisher’s information, 241hypothesis testing

asymptotically most powerfultest, 393–394

score test, 281versus normal, 374–375

kurtosis, 36, 59moment generating function, 36moments, 36sample mean

convergence in distribution,138

skewness, 36, 59sufficient statistic, 204, 217sum of Laplace random

variables, 63late start example, 9

quantile, 37leaps (R routine), 189least absolute deviations, 191

hurricane example, 192–193standard errors, 193

least squares estimation, 181lgamma (R function), 174likelihood, see also Fisher’s

information; likelihoodprinciple; likelihood ratio

test; maximum likelihoodestimate

deviance, 272multivariate regression, 276observed, 272

function, 199–200loglikelihood, 212, 221score function, 221, 226

multivariate, 234likelihood principle, 200, 215

binomial and negative binomial,201–202

hypothesis testing, 252unbiasedness, 202

likelihood ratio test, 251, 263asymptotic distribution, 263

composite null, 268–269simple null, 267–268

Bayes test statistic, 251deviance, 272dimensions, 267intent to vote example, 279–280likelihood ratio, 251, 370Neyman-Pearson lemma, 372score test, 269–270

many-sided, 270–271linear combination, 22, see also affine

transformationlinear model, 93, 115, see also linear

regressionlinear regression, 179–193

as exponential family, 343assumptions, 181Bayesian inference, 118, 184–185

conjugate prior, 196posterior distribution, 184, 185ridge regression estimator, 184

Box-Cox transformation,219–220

hurricane example, 220confidence interval, 183

inverting Kendall’s τ test,314–315

estimation, 181best linear unbiased estimator

(BLUE), 182, 195covariance of estimator, 182,

194maximum likelihood estimate,

218, 219

Page 452: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

440 Subject Index

noninvertible x′x, 196Sen-Theil estimator, 315, 321standard error, 183Student’s t distribution, 183,

195Tukey’s resistant-line

estimate, 301uniformly minimum variance

unbiased estimator, 343fit, 195Gauss-Markov theorem, 195hypothesis testing

exam score example, 301–302F test, 250, 278, 279maximal invariant, 420partitioned slope vector, 250,

265–266, 278–279randomization test, 295–296uniformly most powerful

invariant test, 414lasso, 190–191

estimating tuning parameter,191

hurricane example, 191lars (R package), 190objective function, 190regression through origin, 198

least absolute deviations, 191asymptotic distribution, 193hurricane example, 192–193standard errors, 193

least squares estimation, 181linear estimation, 195matrix form, 180mean, 179median, 179prediction, 186, 194projection matrix, 182, 195quantile regression, 179

quantreg (R package), 192regularization, 185–191residuals, 195ridge regression, 185–189

admissibility, 364Bayes estimator, 184bias of estimator, 198covariance of estimator, 197,

198estimating tuning parameter,

187

estimator, 186hurricane example, 188mean square error, 198objective function, 186prediction error, 197

simple, 82matrix form, 180moment generating function,

86–87subset selection, 189–190

Akaike information criterion,276–277, 283

diabetes example, 282hurricane example, 189Mallows’ Cp, 189, 283

sufficient statistic, 218sum of squared errors, 182

distribution, 183through the origin

convergence of slope, 129, 141Liptak-Stouffer’s test, 261, 418LMP, see hypothesis testing: locally

most powerful testlocation family, 335

Fisher’s information, 232Pitman estimator, 338

admissibility, 364Bayes, 364minimaxity, 364

shift-invariance, 335–336location-scale family, 55–56

distribution function, 62kurtosis, 62moment generating function, 62pdf, 62skewness, 62

log odds ratio, 151logistic distribution, 7

as logit of a uniform, 15estimation

median vs. mean, 143, 233trimmed mean, 234

Fisher’s information, 233hypothesis testing

locally most powerful test, 390score test, 281

moment generating function, 36quantiles, 36sample maximum

Page 453: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

Subject Index 441

convergence in distribution,138

sufficient statistic, 216logistic regression, 237–238

glm (R routine), 239Greek example, 238–240likelihood ratio test, 279maximum likelihood estimate,

237logit, 236loglikelihood, 212, 221lognormal distribution, 37LRT, see likelihood ratio testLyapunov condition, 299

M-estimator, 191Mallows’ Cp, 189, 283Mann-Whitney test, see under

nonparametric testingmapping

lemma, 139–140weak law of large numbers, 128

marginal distribution, 39–42, 82covariance from conditional

mean and covariance, 88density

discrete, 40–42pdf, 42

distribution function, 40moment generating function, 40space, 39variance from conditional mean

and variance, 87–88Markov’s inequality, 127matrix

centering, 109eigenvalues & eigenvectors, 105,

116expected value, 23idempotent, 109inverse

block formula, 123mean, 23Moore-Penrose inverse, 111–112nonnegative definite, 104, 116orthogonal, 105permutation, 292positive definite, 104, 116projection, 195pseudoinverse, 111

sign-change, 295spectral decomposition theorem,

105square root, 105, 117

symmetric, 106maximum likelihood estimate (MLE),

212, 222asymptotic efficiency, 232

multivariate, 235asymptotic normality, 224, 229

Cramér’s conditions, 225–226multivariate, 235proof, 230–231sketch of proof, 224–225

consistency, 229function of, 212, 214, 217

maximum likelihood ratio test, seelikelihood ratio test

mean, 19affine transformation, 23matrix, 23minimizes mean squared

deviation, 38vector, 23

median, 33minimizes mean absolute

deviation, 38meta-analysis, see hypothesis testing:

combining independencetests

midrange, 80minimal sufficient statistic, 203mixed-type density, 12mixture models, 83MLE, see maximum likelihood

estimateMLR, see monotone likelihood ratiomodel selection, 155, 272

Akaike information criterion(AIC), 272–273, 275–276

as posterior probability, 283Bayes information criterion

(BIC), 272–275as posterior probability, 273snoring example, 281–282

Mallows’ Cp, 189, 283penalty, 272

moment, 25–26kurtosis, 25mixed, 26

Page 454: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

442 Subject Index

moment generating function,27

skewness, 25moment generating function, 26–27

convergence in distribution, 132mixed moment, 27uniqueness theorem, 26

monotone likelihood ratio (MLR), 379expectation lemma, 380–381uniformly most powerful test,

381, 388Moore-Penrose inverse, 111–112, 196

of idempotent matrix, 112–113multinomial distribution, 30–31, see

also binomial distribution;trinomial distribution;two-by-two table

as exponential family, 217asymptotic distribution, 151completeness, 342covariance, 31

matrix, 35estimation

maximum likelihood estimate,220

uniformly minimum varianceunbiased estimator, 342

hypothesis testinglikelihood ratio test, 280–281score test, 271

log odds ratioasymptotic distribution, 151confidence interval, 152

marginal distributions, 40mean, 31moment generating function, 31variance, 31

multinomial theorem, 31multivariate normal distribution,

103–116affine transformation, 106as affine transformation of

standard normals, 104Bayesian inference, 119–120conditional distribution, 116confidence region for mean, 111covariance matrix, 103, 104estimation of mean

empirical Bayes estimator,174, 365

James-Stein estimator,352–355, 362, 365

shrinkage estimator, 174hypothesis testing

admissibility, 418likelihood ratio test, 418maximal invariant, 411, 419uniformly most powerful

invariant test, 412–413independence, 107marginal distributions, 106mean, 103, 104moment generating function,

104pdf, 108prediction, 365, 366properties, 103quadratic form as chi-square,

111–113subset selection, 365, 366

multivariate Student’s t distribution,120, 185

negative binomial distribution, 9as sum of geometrics, 62likelihood, 201mean, 62moment generating function, 62

negative multinomial distribution,261

Newton-Raphson method, 222Neyman-Pearson, see under

hypothesis testingnoncentral chi-square distribution, 15

as Poisson mixture of centralchi-squares, 121

as sum of squares of normals,113

mean, 101, 114moment generating function,

101, 121monotone likelihood ratio, 389pdf, 122sum of noncentral chi-squares,

114variance, 101, 114

noncentral F distributionas ratio of chi-squares, 122monotone likelihood ratio, 389pdf, 122

Page 455: MathematicalStatistics Old Schoolistics.net/pdfs/mathstat.pdf · Preface My idea of mathematical statistics encompasses three main areas: The mathemat-ics needed as a basis for work

Subject Index 443

nonnegative definite matrix, 104, 116
nonparametric testing, 303
    confidence interval, 313–315
    Jonckheere-Terpstra test, 311–313
        asymptotic normality, 313, 318
    Kendall's τ, 308–309
        asymptotic normality, 309, 318
        cor.test (R routine), 309
        Kendall's distance, 309, 317
        τA and τB, 311
        ties, 309–311, 318
    Mann-Whitney/Wilcoxon test, 305–307
        asymptotic normality, 307, 317
        equivalence of two statistics, 317
        wilcox.test (R routine), 307
    rank-transform test, 304
    sign test, 303–304
        tread wear example, 303
    signed-rank test, 304–305, 316–317
        asymptotic normality, 305, 315–316
        mean and variance, 317
        tread wear example, 316
        wilcox.test (R routine), 305
    Spearman's ρ, 307–308
        asymptotic normality, 307
        cor.test (R routine), 307
normal distribution, 7, see also bivariate normal; multivariate normal
    as exponential family, 205, 342, 343
    as location-scale family, 55
    Bayesian inference, 176
        Bayes risk, 361
        conjugate prior, 170, 171
        for mean, 119
        posterior distribution, 171, 177
        probability interval for mean, 119
    Box-Muller transformation, 74
    coefficient of variation
        asymptotic distribution, 147
        standard error, 164
    completeness, 331–332
    confidence interval
        Fieller's method for ratio of two means, 262
        for coefficient of variation, 147
        for correlation coefficient, 151
        for difference of means, 118
        for mean, 108, 115, 258
        for mean, as probability interval, 169, 176
    cumulant generating function, 28
    estimation
        admissibility, 361–362
        admissibility (Blyth's method), 351–352
        Bayes estimator, 364
        Cramér-Rao lower bound, 334
        maximum likelihood estimate, 213, 218
        median vs. mean, 143, 233
        minimaxity, 364
        of a probability, 211
        of common mean, 235–236, 240, 241
        Pitman estimator, 344
        regularization, 362
        shift-invariance, 336
        trimmed mean, 234
        uniformly minimum variance unbiased estimator, 332, 334, 344
    Fisher's information, 240, 241
    hypothesis testing
        Bayes factor, 259
        Bayesian, 254
        for equality of two means, 249, 277–278, 414–415
        invariance, 409–410
        locally most powerful test, 390
        Neyman-Pearson test, 373–374, 388
        on mean, 249, 261, 263–264
        one- vs. two-sided, 376–377
        power, 248
        randomization test, 293
        score test, 281
        uniformly most powerful invariant test, 413, 418–419
        uniformly most powerful test, 378–379, 389, 390
        uniformly most powerful unbiased test, 384–385, 392–393
        versus Laplace, 374–375
    interquartile range, 38
    kurtosis, 28, 36, 59
    linear combination, 57
    mean of normals, 62
    moment generating function, 28, 62
    ratio of sample variances, 121
    sample correlation coefficient
        asymptotic distribution, 148, 150
        variance stabilizing transformation (Fisher's z), 151
    sample mean, 106
    sample mean and deviations
        joint distribution, 110
    sample mean and variance
        asymptotic distribution, 146
        independence, 110, 113
    sample variance
        distribution, 113
        expected value, 113
    score function, 240
    skewness, 28, 36, 59
    standard normal, 28, 103
    sufficient statistic, 203, 204, 208, 216
    sum of normals, 62
normalized means, 58–59

odds, 252
odds ratio
    Dirichlet distribution, 98–99
order statistics, 75–77
    as sufficient statistic, 203
    as transform of uniforms, 77
    pdf, 76, 77
orthogonal matrix, 71, 105
    Jacobian, 72
    reflection, 72
    rotation, 72
    two dimensions, 72
        polar coordinates, 72

p-value, see under hypothesis testing
paired comparison
    barley seed example, 316
    randomization test, 295
    sign test, 303
    tread wear example, 294–295
pbinom (R routine), 80
pdf (probability density function), 6–8
    derivative of distribution function, 6
Pitman estimator, see under location family
pivotal quantity, 108, 162
pmf (probability mass function), 8
Poisson distribution, 9
    as exponential family, 217
    as limit of binomials, 131, 133
    Bayesian inference, 176, 216
        gamma prior, 97–98
        hypothesis testing, 261
    completeness, 329, 331
    conditioning on sum, 99
    cumulants, 36
    estimation, 173
        admissibility, 364
        Bayes estimate, 215, 364
        Cramér-Rao lower bound, 341, 342
        maximum likelihood estimate, 214, 216, 217
        minimaxity, 364
        unbiased, 326, 341, 342
        uniformly minimum variance unbiased estimator, 330, 341
    hypothesis testing
        Bayesian, 261
        likelihood ratio test, 277
        Neyman-Pearson test, 388
        score test, 281
        uniformly most powerful test, 389
        uniformly most powerful unbiased test, 385
    kurtosis, 36, 59
    loglikelihood, 223
    moment generating function, 36
    sample mean
        asymptotic distribution, 150, 176
    skewness, 36, 59
    sufficient statistic, 206, 216
    sum of Poissons, 51–52, 63
    variance stabilizing transformation, 150
polar coordinates, 72
polio example
    Bayesian inference, 97–98
    hypothesis testing
        Bayesian, 261
        likelihood ratio test, 277
positive definite matrix, 104, 116
prediction, 155
probability, 3–4
    axioms, 3
    frequency interpretation, 157
    of complement, 4
    of empty set, 4
    of union, 3, 4
    subjective interpretation, 157–158
probability density function, see pdf
probability distribution, see distribution
probability distribution function, see distribution function
probability interval
    compared to confidence interval, 159, 160
probability mass function, see pmf
pseudoinverse, 111

qbeta (R routine), 92
quadratic form, 111
quantile, 33–34
    late start example, 37
quantile regression, 179, see also least absolute deviations

random variable, 4
    coefficient of variation, 146
    collection, 4
    correlation coefficient, 20
    covariance, 19
    cumulant, 27
    cumulant generating function, 27–28
    distribution function, 5
    kurtosis, 25
    mean, 19
    mixed moment, 26
    mixture, 137
    moment, 25–26
    moment generating function, 26–27
    pdf, 6
    pmf, 8
    precision, 119
    quantile, 33–34
    skewness, 25
    standard deviation, 19
    variance, 19
    vector, 4
randomization model, 285
    two treatments, 286–288
randomization testing
    p-value, 294
    randomization distribution
        asymptotic normality, 298–299
        Hoeffding conditions, 298
        mean and variance of test statistic, 297
    sign changes
        asymptotic normality, 299
        Lyapunov condition, 299
    testing randomness
        asymptotic normality, 298
        draft lottery example, 298
        randomization distribution, 292
    two treatments
        asymptotic normality, 297–298
        average null, 286
        exact null, 286
        p-value, 286–288
        randomization distribution, 287, 288
    two-by-two table
        randomization distribution, 289
rank, 304
    midrank, 304
Rao-Blackwell theorem, 210, 330
rectangle (Cartesian product), 45
regression, see linear regression; logistic regression
residence preference example
    rank data, 41, 46
ridge regression, see under linear regression
Rothamstead Experimental Station, 290

sample correlation coefficient
    Kendall's τ, 61
    Pearson, 194, 300
        convergence to a constant, 136
        Fisher's z, 151
        hurricane example, 307
        Student's t distribution, 194
    Spearman's ρ, 307
        hurricane example, 307
sample covariance
    convergence to a constant, 136
sample maximum
    distribution, 63
sample mean
    asymptotic efficiency, 233
    asymptotic joint distribution with sample variance, 146
    asymptotic relative efficiency vs. the median, 143
    bootstrap estimation, 166
    confidence interval
        bootstrap, 168
sample median
    asymptotic distribution, 143
    asymptotic efficiency, 233
    asymptotic relative efficiency vs. the mean, 143
    bootstrap estimation, 167
    confidence interval
        bootstrap, 168
sample variance, 108
    asymptotic joint distribution with sample mean, 146
    bias, 162
    consistency, 128
sampling model, 285
score function, see under likelihood
score test, see under likelihood ratio test
Sen-Theil estimator, 315, 321
separating hyperplane theorem, 359, 367
    projection, 367
    proof, 367–368
shifted exponential distribution, 218
    estimation
        maximum likelihood estimate, 218
        Pitman estimator, 339
        uniformly minimum variance unbiased estimator, 339
    sufficient statistic, 216
shoes example
    confidence interval
        for correlation coefficient, 151
        for mean, 167–168
        for median, 167–168
        for ratio, 175
shrinkage estimator, 174
sign test, see under nonparametric testing
Simpson's paradox, 96
singular value decomposition, 196
skewness, 25
    cumulant, 28
slash distribution, 78
    pdf, 78
Slutsky's theorem, 139
smoking example
    conditional probability, 96
snoring example
    hypothesis testing, 279
    logistic regression, 242
    model selection
        Bayes information criterion, 281–282
Spearman's ρ, 307–308, see also under nonparametric testing
spectral decomposition theorem, 105
spherically symmetric distribution, 73–74
    pdf, 73, 79
    polar coordinates, 73–74
spinner example, 9–12
    mean, 18
standard deviation, 19
stars and bars, 174
statistical model, 155–156
    Bayesian, 156
Student's t distribution, 7
    as ratio of standard normal to scaled square root of a chi-square, 100, 114
    estimation
        median vs. mean, 149
        trimmed mean, 234
    mean, 100
    pdf, 100
    relationship to F, 121
    variance, 100
Student's t statistic
    convergence in distribution, 140
sufficient statistic
    Bayesian inference, 209, 217
    conditioning on, 206
    likelihood definition, 202
    likelihood function, 206
    minimal, 203
    one-to-one function, 203
supporting hyperplane, 368
symmetry group, 292

tai chi example
    Fisher's exact test, 288–290
tasting tea example
    Fisher's exact test, 290
tent distribution
    as sum of two independent uniforms, 14, 53, 78
Tippett's test, 260, 418
trace of matrix, 105
    as sum of eigenvalues, 105, 116
transformation, 49–80
    convolution, 50–53
    discrete, 49–52
        one-to-one function, 65
    distribution functions, 52
    Jacobian
        affine transformation, 70
        multiple dimensions, 66
        one dimension, 66
    moment generating functions, 56–60
    orthogonal, 71
    pdfs, 66
    probability transform, 54–55
tread wear example
    paired comparison, 294–295
    sign test, 303
    signed-rank test, 316
trigamma function, 36
trimmed mean
    asymptotic efficiency, 234
trinomial distribution, 61
Tukey's resistant-line estimate
    exam score example, 301–302
two-by-two table
    cancer of larynx example, 300–301
    Fisher's exact test, 288–290, 386–387
        tai chi example, 288–290
        tasting tea example, 290
    hypothesis testing, 266–267
    maximum likelihood estimate, 220
    uniformly most powerful unbiased test, 387
two-sample testing
    ACA example, 300
    Bayesian inference
        polio example, 261
    BMI example, 317
    likelihood ratio test
        polio example, 277
    Mann-Whitney/Wilcoxon test, 305–307
    randomization testing, 286–288, 294
        asymptotic normality, 297–298
        walking exercise example, 287–288, 298
    Student's t test, 249, 277–278
    uniformly most powerful invariant test, 418–419

U-statistic, 212
UMP, see hypothesis testing: uniformly most powerful test
UMVUE, see estimation: uniformly minimum variance unbiased estimator
uniform distribution, 7, see also discrete uniform distribution
    as special case of beta, 14
    completeness, 329, 343
    conditioning on the maximum, 101
    estimation
        median vs. mean, 143, 233
        Pitman estimator, 344
        unbiased, 341
        uniformly minimum variance unbiased estimator, 343
    hypothesis testing, 259
        admissibility and Bayes, 399–400
        Neyman-Pearson test, 374–376
        uniformly most powerful test, 389
    kurtosis, 36
    likelihood function, 218
    order statistics, 76–77
        Beta distribution of, 76
        Beta distribution of median, 77
        covariance matrix, 79
        gaps, 76
        joint pdf of minimum and maximum, 80
    sample maximum
        convergence in distribution, 138
        distribution, 63
    sample median
        asymptotic distribution, 149
    sample minimum
        convergence in distribution, 137
        convergence in probability, 137
    skewness, 36
    sufficient statistic, 204, 216
    sum of uniforms, 53, 78

variance, 19
    affine transformation, 22
variance stabilizing transformation, 143–144
vector
    covariance matrix, 24
    covariance matrix of two, 35
    expected value, 23
    mean, 23

W. S. Gosset, 316
walking exercise example
    randomization model, 286
    two treatments
        p-value, 287–288, 298
weak law of large numbers (WLLN), 126, 136
    mapping, 128
    proof, 127
Wilcoxon test, see under nonparametric testing
WLLN, see weak law of large numbers